Methods for Large-Scale Mining of Networks of Human Genes
Proceedings of the 2001 SIAM International Conference on Data Mining
In molecular biology there is much interest in various types of relationships between genes. Due to the complexity and rapid development of this field, much of this knowledge exists only in free-text form. A database of relationships between genes may allow background knowledge to be used in computerised analyses. As far as we know, no comprehensive manually curated database of this kind exists, and constructing and maintaining such a database manually would be very labour-intensive. Efficient automated methods for extracting and structuring relationships between genes from free text would therefore be valuable. A database named PubGene has previously been created; it contains a comprehensive network of human genes built by automated extraction of co-occurrences of gene terms in over 10 million MEDLINE records. Co-occurring genes were linked together under the hypothesis that two genes will co-occur only if they have some biological relationship. In this paper, we show that for the subset of human genes encoding enzymes, pairs of co-occurring enzyme genes are significantly more closely related biologically than randomly paired enzyme genes. Manual inspection, however, shows that some of the links in PubGene are not correct, and it also indicates how the noise can be reduced. We propose a complementary method for automated extraction of relationships between genes using information from the Science Citation Index (SCI) database. We relate two genes if they have been co-referred, that is, if their reference articles are co-cited in a third article. The alternative approach confirms relationships found in PubGene, and it also finds other relevant relationships. Although further experiments are required for the SCI approach, the results are encouraging. Furthermore, the two methods combined can be used to generate networks that have either high specificity or high sensitivity, by requiring that a relationship be found by both methods or by at least one of them, respectively.
† These authors contributed equally.
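The two linking rules described above, and the way they can be combined, may be illustrated with a minimal Python sketch. It is not the PubGene or SCI implementation: the function names, the input layout (one set of gene symbols per record, one set of cited article identifiers per citing article) and the toy identifiers such as `GENE_A` and `ref1` are all assumptions made for illustration, and the counts carry no biological meaning.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(records):
    """Count, for each unordered gene pair, the records mentioning both genes.

    records: iterable of sets of gene symbols (hypothetical input layout).
    """
    counts = Counter()
    for genes in records:
        # Sorting gives every pair one canonical (alphabetical) orientation.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return dict(counts)


def coreference_edges(gene_refs, citing_articles):
    """Count, for each gene pair, the citing articles that co-cite their
    reference articles.

    gene_refs: gene symbol -> set of reference article ids (assumed mapping).
    citing_articles: iterable of sets of cited article ids.
    """
    article_to_genes = {}
    for gene, refs in gene_refs.items():
        for ref in refs:
            article_to_genes.setdefault(ref, set()).add(gene)

    counts = Counter()
    for cited in citing_articles:
        relevant = sorted(ref for ref in cited if ref in article_to_genes)
        pairs = set()
        # Two distinct reference articles must appear in the same citing article.
        for ref_a, ref_b in combinations(relevant, 2):
            for gene_a in article_to_genes[ref_a]:
                for gene_b in article_to_genes[ref_b]:
                    if gene_a != gene_b:
                        pairs.add(tuple(sorted((gene_a, gene_b))))
        counts.update(pairs)  # each citing article counts once per pair
    return dict(counts)


# Toy data with placeholder identifiers (not real genes or articles).
records = [
    {"GENE_A", "GENE_B"},
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_C", "GENE_D"},
    {"GENE_E"},
]
gene_refs = {
    "GENE_A": {"ref1"},
    "GENE_B": {"ref2"},
    "GENE_C": {"ref3"},
    "GENE_E": {"ref4"},
}
citing_articles = [{"ref1", "ref2"}, {"ref1", "ref4", "other"}, {"ref3"}]

cooc = cooccurrence_edges(records)
coref = coreference_edges(gene_refs, citing_articles)

# Requiring both methods favours specificity; accepting either favours sensitivity.
high_specificity = set(cooc) & set(coref)
high_sensitivity = set(cooc) | set(coref)
```

In this sketch the intersection keeps only the pairs supported by both kinds of evidence, while the union keeps every pair supported by at least one. A count threshold on either dictionary would be a natural further filter, although the text above does not specify one.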