Detecting evolutionary relationships across existing fold space, using sequence order-independent profile-profile alignments

L. Xie, P. E. Bourne
2008 Proceedings of the National Academy of Sciences of the United States of America  
Here, a scalable, accurate, reliable, and robust protein functional site comparison algorithm is presented. The key components of the algorithm consist of a reduced representation of the protein structure and a sequence order-independent profile-profile alignment (SOIPPA). We show that SOIPPA is able to detect distant evolutionary relationships in cases where both a global sequence and structure relationship remains obscure. Results suggest evolutionary relationships across several previously
... olutionary distinct protein structure superfamilies. SOIPPA, along with an increased coverage of protein fold space afforded by the structural genomics initiative, can be used to further test the notion that fold space is continuous rather than discrete.

functional site | structure

The evolutionary relationship between protein sequences, protein structures, and their associated function(s) remains a central topic of molecular biology, and one that has led to the development of many computational methods (1-3). A central question is: What were the early protein folds, and how did these folds change over long evolutionary time scales (4-7)? Comparative genomics studies and structural and phylogenetic analyses (8-10) have established that a subset of proteins, dominated by the Structural Classification of Proteins (SCOP) (11) α/β class, were likely present in the last universal common ancestor (12, 13). Concurrently, growing evidence suggests that recurring substructures, that is, 3D fragments of noncontiguous sequence shared between different folds, may be clues that protein fold space is more continuous than discrete (14, 15). The sequence/structure similarity of such substructures correlates well with the similarity of function found between the different folds containing these substructures (16). The notion that protein fold space is a continuum is further supported by recent studies showing that protein domains can adopt different topologies through combination, swapping, deletion (4, 17, 18), and cyclic permutation (19, 20) of subdomains. Likewise, new folds can emerge from accretion (21) or embellishment (22) of substructures around a core of conserved secondary structures.
Given these findings concerning the dynamic nature of protein structure and the possible continuous nature of protein fold space, it is important to distinguish proteins that share a common ancestor (divergent evolution) from those that have adopted common structural constraints (convergent evolution). Typically, evolutionary relationships between protein sequence, structure, and function are deduced from the respective comparisons among known genes and their products. These comparisons are made at various levels, from genome sequences to protein domains and motifs to biochemical pathways. Such comparisons may miss important relationships, either because the sequence similarity is too weak to detect or because they fail to identify complex evolutionary events such as domain swapping and cyclic permutation. Likewise, differences in global protein structure may disguise a true evolutionary relationship that exists between substructures. One approach, which involves the comparative analysis of substructures, including functional sites, between proteins (1, 23-26), has been successful in detecting evolutionary relationships between different fold superfamilies and has been applied mostly to enzyme families. One study of 31 diverse enzyme superfamilies revealed that functional diversity during evolution is achieved by local sequence variation and domain shuffling (24). Such functional diversity can also be observed within a single SCOP superfamily. For example, within the protein kinase-like superfamily, it has been suggested that atypical kinases diverged early in evolution from protein kinases (26). Despite this divergence, the overall catalytic mechanism is retained through a high level of conservation associated with the ATP binding cassette, thus preserving phosphorylation, yet the substrate binding motif exhibits significant diversity.
In the case of mechanistically diverse enzymes, whose members catalyze different overall reactions but share a partial reaction, it has been found that these enzymes use a similar active site to generate a common intermediate, then direct the intermediate to different products in different active sites (25). Beyond these case studies, the global evolutionary relationship of functional sites across fold space has not been systematically studied and remains elusive. Global functional site comparison has been thwarted by the lack of efficient and accurate computational tools to undertake such a large-scale comparison and by the lack of rigorous statistics to test the similarity of sites. The work described herein is a step toward accurate and efficient functional site comparison and analysis, and it is subsequently applied to seek out new evolutionary relationships. Although the concept of functional site matching is not new, and a variety of approaches have been attempted (27-47), it has not proven an easy task to design and implement a practical software solution with performance close to that of routine sequence comparison. These site comparison algorithms usually consist of three interrelated components: the representation of the functional site, an algorithm to superimpose two sites, and a method to score their similarity. The functional site is usually represented either by a coordinate set with certain physicochemical or evolutionary properties, or by 3D shape descriptors that define pockets within the protein (44). The coordinate set can consist of atoms (28), chemical groups (34), or surface points (33, 41). The optimum superimposition between two sites is achieved with geometric hashing (33, 42),
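To make the representation and scoring components described above more concrete, the following is a minimal Python sketch. It is not SOIPPA, and it is not any of the cited methods. As a simplifying assumption it replaces the explicit superimposition step with a sorted pairwise-distance signature, which is unchanged by rotation, translation, and the order in which site points are listed. The function names and the coordinates are hypothetical and chosen only for illustration.

```python
import math


def distance_signature(coords):
    """Return the sorted list of all pairwise distances in a site.

    The signature does not change when the site is rotated, translated,
    or when its points are listed in a different order.
    """
    return sorted(
        math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    )


def signature_rmsd(site_a, site_b):
    """Root-mean-square difference between two distance signatures.

    Both sites must contain the same number of points. A value near 0.0
    means the two sites have the same set of internal distances.
    """
    sig_a = distance_signature(site_a)
    sig_b = distance_signature(site_b)
    if len(sig_a) != len(sig_b):
        raise ValueError("sites must have the same number of points")
    if not sig_a:
        return 0.0
    squared = sum((a - b) ** 2 for a, b in zip(sig_a, sig_b))
    return math.sqrt(squared / len(sig_a))


# Hypothetical site: four points (for example, C-alpha positions), in angstroms.
site = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 0.0, 5.0)]

# The same site listed in reverse order and translated in space.
reordered = [(x + 10.0, y - 2.0, z + 1.0) for (x, y, z) in reversed(site)]

# A distorted site: the last point is moved 3 angstroms along z.
distorted = site[:3] + [(0.0, 0.0, 8.0)]

same_score = signature_rmsd(site, reordered)
diff_score = signature_rmsd(site, distorted)
print(f"reordered copy: {same_score:.6f}")
print(f"distorted site: {diff_score:.6f}")
```

The reordered, translated copy scores essentially zero, whereas the distorted site does not. A distance signature is a coarse descriptor: mirror images, and some distinct point sets, share the same signature, and no physicochemical or evolutionary properties are used here. Practical methods therefore rely on explicit superimposition and richer scoring.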
doi:10.1073/pnas.0704422105 pmid:18385384 pmcid:PMC2291117 fatcat:ncjgup3igzafrkyfc3xbdlscge