Enhanced functional information from predicted protein networks

Jason McDermott, Ram Samudrala
2004 Trends in Biotechnology  
Experimentally derived genome-wide protein interaction networks have been useful in elucidating functional information that is not evident from examining individual proteins, but determining these networks is complex and time-consuming. To address this problem, several computational methods for predicting protein networks in novel genomes have been developed. A recent publication by Date and Marcotte describes the use of phylogenetic profiling for elucidating novel pathways in ... es that have not been experimentally characterized. This method, in combination with other computational methods for generating protein-interaction networks, might help identify novel functional pathways and enhance functional annotation of individual proteins.

The advent of the 'genomic age' in biology has brought about several new challenges, particularly in the area of computational biology. The vast amount of information already present and becoming available daily is driving the need for new techniques to derive useful hypotheses from genomic sequence data, even in the absence of experimental data from the particular organism. Protein networks, the representation of the functional, contextual or physical linkages between all proteins in an organism, have been useful in the prediction of function for proteins that cannot be annotated (i.e. assigned a function) by conventional means [1-3]. Date and Marcotte [4] used a phylogenetic profile method to predict functional linkage networks for several organisms and then used the networks to find and describe previously uncharacterized cellular pathways. This approach is one of several new network-based techniques for improving the functional annotation of novel genomes [5, 6] and highlights some of the challenges facing the field. Here, we provide an outline of this and similar methods and compare the results of this method to networks predicted by the Bioverse [7] computational framework (http://bioverse.compbio.washington.edu).

Utility of protein-interaction networks

Several experimental techniques have been used to derive protein-interaction networks for yeast and Helicobacter pylori [8-10], and these networks exhibit a specific topology and functional modularity [2, 11]. The interactions between complexes in specific pathways are highlighted and many previously uncharacterized proteins can be associated with known pathways.
Other features of the networks are interesting for biologists, including the observation that highly connected proteins in the yeast network correlate with essential proteins [12].

Prediction of protein networks

A number of methods based on evolutionary and/or contextual sequence information have been developed to predict protein-protein interaction and functional relationship networks in novel genomes [5, 6, 13-16]. Contextual methods include examining patterns of domain fusion across genomes, operon association and gene-order analysis [5, 6]. Evolutionary methods include experimental similarity methods (i.e. the identification of pairs of proteins encoded by a target genome that are similar to pairs of proteins experimentally determined to interact [13, 14]) and the phylogenetic profiling methods used by Date and Marcotte [15, 16]. Phylogenetic profiling involves the construction of a homolog profile, which records, for a particular protein, the occurrence of homologous proteins across a number of genomes. A score describing the co-occurrence of pairs of genes across multiple genomes (the mutual information score) is used to predict functional linkages, on the assumption that proteins in the same pathway or complex are more likely to be inherited together in the course of evolution. Whereas sequence-similarity methods (and, to a certain extent, contextual methods) provide predictions of physical protein interactions, phylogenetic profiling provides functional linkages between proteins.

Functional annotation using protein-context networks

Several methods have been described for providing functional annotation for uncharacterized proteins using protein networks [2, 6, 11, 17]. Function prediction based on protein-interaction networks assumes that interacting proteins are likely to share similar functions.
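The mutual information score used in phylogenetic profiling, described above, can be illustrated with a minimal sketch. It assumes each profile is a binary presence/absence vector over the same ordered set of genomes. The function name, the binary encoding and the toy profiles are illustrative assumptions; they are not taken from Date and Marcotte's implementation, which may discretize homology scores differently.

```python
import math
from collections import Counter


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two equal-length profiles.

    Each profile lists, genome by genome, whether a homolog of the protein
    is present (1) or absent (0). Binary encoding is an assumption here.
    """
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")

    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of (a, b) pairs
    count_a = Counter(profile_a)                # marginal counts for A
    count_b = Counter(profile_b)                # marginal counts for B

    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # p(a,b) / (p(a) * p(b)) simplifies to n_ab * n / (n_a * n_b)
        mi += p_ab * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical profiles over four genomes (toy data, not from the paper).
co_inherited = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])
unrelated = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
print(co_inherited, unrelated)
```

In this toy case the identical profiles score 1 bit and the statistically independent profiles score 0. Perfectly anti-correlated profiles also score highly under mutual information, so a real pipeline would need to decide how to treat that case.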
The 'majority rule' method annotates a protein by surveying the functions of all the proteins predicted to interact with it and choosing the most frequently occurring function [2]. A more sophisticated method designed for use on predicted protein interaction networks provides a confidence score for each function based on the scores of the functional annotations and the score of the predicted interaction (McDermott and Samudrala, unpublished). Other methods use global network properties [17] or probability-based models [18] to provide accurate functional annotations.

The phylogenetic profiling prediction method clusters proteins with similar functions in the same area of the network. Date and Marcotte used a predominantly manual method to derive functions for the unknown proteins in their networks. Clusters of proteins in the network with no clear function are identified and extended to include proteins with linkages below the selected threshold or proteins found in the same operon. The function of unknown proteins is then predicted from their location in the network.

Network comparison

Figure 1a shows the largest predicted E. coli network generated using Date and Marcotte's phylogenetic profile linkages [Date-Marcotte (DM) network], consisting of 1751 proteins and 12 874 linkages [4, supplementary information]. Figure 1b is the E. coli network predicted by the Bioverse (511 proteins; 4075 interactions), based on similarity to experimentally derived interactions. Proteins are colored by broad gene ontology (GO) [19] categories and the 220 proteins shared by both networks are shown with a blue outline. The Date-Marcotte network has a significantly higher average number of connections per protein (15.6 with DM versus 3.8 with Bioverse), similar to that observed in predicted eukaryotic networks (e.g. C. elegans network; http://bioverse.compbio.washington.edu). The Bioverse-generated network was more accurate for more specific functional categories but provided fewer annotations (Figure 2).

Implications of network-based functional annotation

Date and Marcotte described a technique for predicting genomic-scale protein networks based on evolutionary information and they have used it to elucidate novel, uncharacterized pathways from genomes. In prokaryotes, these networks provide more coverage than networks predicted by similarity to experimentally determined interactions, but the similarity-derived network contains 291 proteins not included in the DM network. In addition, the functional resolution of the DM networks is less specific than that in the similarity-derived networks. These factors suggest that the two methods could be combined both to improve the quality of the networks and annotations and to expand their coverage [20].

Identification of uncharacterized conserved pathways is important for providing new insights into cellular function. The method described by Date and Marcotte provides results independent of experimental data, suggesting the utility of integrating experimental similarity data with the functional linkages to provide a more complete picture of proteomic-scale networks. The functional linkages provide groupings that can be further resolved by examining predicted protein-interaction networks generated via other methods. Improved prediction of protein networks will allow rapid and accurate functional annotation of newly sequenced genomes and provide a convenient

Figure 1. Comparison of predicted protein networks for E. coli. (a) Protein pairs and their mutual information scores based on phylogenetic profiling were used to generate a network for E. coli. Figure generated using data from [4, supplementary information]. (b) Protein interactions were predicted using Bioverse [7] based on finding pairs of proteins similar in sequence to proteins from a database of experimentally determined interactions. Figure generated using data from Bioverse (http://bioverse.compbio.washington.edu). For both networks, nodes representing proteins are colored based on their gene ontology (GO) [19] category and the 220 proteins present in both networks are outlined in blue. Edges represent the predicted relationships between proteins [functional linkages in (a) and protein interactions in (b)] and are colored by mutual information score (a) or confidence (b).
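As an illustration of the 'majority rule' method described under functional annotation, the following minimal sketch assigns an unannotated protein the function that occurs most often among its predicted interaction partners. The function name, the data layout (an edge list plus a dictionary of annotations), the alphabetical tie-break and the placeholder identifiers are all assumptions made for this example. They do not come from reference [2] or from Bioverse.

```python
from collections import Counter


def majority_rule(protein, interactions, annotations):
    """Return the most frequent function among a protein's neighbours.

    interactions: iterable of (protein_a, protein_b) pairs (undirected).
    annotations:  dict mapping protein id -> list of function labels.
    Returns None when no neighbour carries an annotation.
    """
    # Collect the distinct partners of the query protein, ignoring self-loops.
    neighbours = set()
    for a, b in interactions:
        if a == protein and b != protein:
            neighbours.add(b)
        elif b == protein and a != protein:
            neighbours.add(a)

    # Each annotated neighbour votes once for each of its functions.
    votes = Counter()
    for neighbour in neighbours:
        votes.update(annotations.get(neighbour, ()))

    if not votes:
        return None

    # Assumption: ties are broken alphabetically so the result is deterministic.
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


# Placeholder network: "query" is unannotated; p1..p4 are hypothetical proteins.
toy_interactions = [("query", "p1"), ("query", "p2"), ("p3", "query"), ("query", "p4")]
toy_annotations = {
    "p1": ["DNA repair"],
    "p2": ["DNA repair"],
    "p3": ["DNA repair"],
    "p4": ["cell division"],
}
print(majority_rule("query", toy_interactions, toy_annotations))
```

This unweighted vote ignores how reliable each predicted interaction is. The confidence-scored approach mentioned in the text would additionally weight each vote by interaction and annotation scores, which this sketch does not attempt.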
doi:10.1016/j.tibtech.2003.11.010 pmid:14757037 fatcat:erckyvyis5euhny3jaadb5crce