EVIDENCE FOR SEQUENCE-INDEPENDENT EVOLUTIONARY TRACES IN GENOMICS DATA
Sequence conservation during evolution is the foundation for the functional classification of the enormous number of new protein sequences being discovered in the current era of genome sequencing. Conventional methods to detect homologous proteins are not always able to distinguish between true homologs and false positive hits in the twilight zone of sequence similarity. Several different approaches have been proposed to improve the sensitivity of these methods. Among the most successful are
sequence profiles, multi-linked alignment, and threading. However, evolution might offer up other clues about a protein's ancestry that are sequence independent. Here we report the discovery of two such traces of evolution that could potentially be used to help infer the fold of a protein and hence improve the ability to predict its biochemical function. The first such evolutionary trace is a conservation of fold along the genome, i.e. nearby genes tend to share a fold more often than expected by chance alone. This is not an unexpected observation, but it holds true even when no pair of genes being examined shares appreciable homology. The second such evolutionary trace is, surprisingly, present in expression data: genes that are correlated in expression are more apt to share a fold than two randomly chosen genes. This result is surprising because correlations in expression have previously been considered useful only for determining biological function (e.g. what pathway a particular gene fits into), yet the observed fold enrichment in the expression data permits us to say something about biochemical function, since fold corresponds strongly with biochemical function. Again, the fold enrichment observed in the expression data is apparent even when no pair of genes being examined shares appreciable homology.
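As an illustration of what "more often than expected by chance" could mean for the first trace, the following is a minimal sketch, not the procedure used in this work. It assumes a hypothetical ordered list of genes along a chromosome (gene_order) and a hypothetical gene-to-fold mapping (fold_of), and it compares the same-fold frequency among adjacent genes with the same-fold frequency among all gene pairs. It does not remove homologous pairs, which a real analysis would have to do before drawing any conclusion.

```python
from itertools import combinations


def same_fold_fraction(pairs, fold_of):
    """Fraction of gene pairs whose two members are annotated with the same fold."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if fold_of[a] == fold_of[b])
    return shared / len(pairs)


def adjacent_fold_enrichment(gene_order, fold_of):
    """Same-fold frequency among adjacent genes divided by that among all gene pairs.

    gene_order and fold_of are hypothetical inputs chosen for this sketch.
    A ratio above 1 means neighbours share a fold more often than random pairs.
    """
    # Adjacent pairs along the chromosome; keep only pairs where both genes have a fold.
    adjacent = [
        (a, b)
        for a, b in zip(gene_order, gene_order[1:])
        if a in fold_of and b in fold_of
    ]
    # Background: every unordered pair of fold-annotated genes.
    annotated = [g for g in gene_order if g in fold_of]
    background = combinations(annotated, 2)

    observed = same_fold_fraction(adjacent, fold_of)
    expected = same_fold_fraction(background, fold_of)
    return observed / expected if expected > 0 else float("nan")


# Toy data, invented purely for illustration.
gene_order = ["g1", "g2", "g3", "g4", "g5", "g6"]
fold_of = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C", "g6": "A"}
enrichment = adjacent_fold_enrichment(gene_order, fold_of)
```

The same ratio could be applied to the second trace by replacing the adjacent pairs with pairs of genes whose expression profiles are correlated above some chosen threshold; the threshold and the correlation measure would be further assumptions.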