Homology and phylogeny and their automated inference

Georg Fuellen
2008 Die Naturwissenschaften  
The analysis of the ever-increasing amount of biological and biomedical data can be pushed forward by comparing the data within and among species. For example, an integrative analysis of data from the genome sequencing projects for various species traces the evolution of the genomes and identifies conserved and innovative parts. Here I review the foundations and advantages of this "historical" approach and evaluate recent attempts at automating such analyses. Biological data is comparable if
a common origin exists (homology), as is the case for members of a gene family originating via duplication of an ancestral gene. If the family has relatives in other species, we can assume that the ancestral gene was present in the ancestral species from which all the other species evolved. In particular, describing the relationships among the duplicated biological sequences found in [...] examples in this review demonstrate that homology and phylogeny analyses, done on a large (and automated) scale, can give insights into function in biology and biomedicine.

Introduction and terminology

Homology is the relation of biological sequences by way of their common evolutionary origin (Fitch 1970). That is, there once was a piece of DNA, a gene, an interaction between proteins, etc. It was duplicated, and the duplicates evolved separately, gaining, for example, substitutions in sequence. The duplicates are called homologs, no matter how similar they are. Nevertheless, since usually we cannot look back in time, homology is an inference based on similarity. It is a pragmatic yes/no decision that can have an estimate of significance, or probability, attached to it. This estimate is usually based on the quantity of similarity. Thus, two genes can be said to have a high chance of being homologous, or some part of the sequence of one gene can be homologous to another gene. However, two genes should not be called "highly homologous". Terminology does not allow such a statement; sequences have a common origin, or they do not have one. More importantly, though, it must be recognized that homology is always something we know with limited certainty: certainty cannot be established since the similarity that we can measure may be due to convergence, where two sequences of different origin become similar because they fulfill a common function. Or, similarity due to common ancestry may get lost in time and no longer be recognizable.
Thus, with a few exceptions (using fossil data or evolution in the lab), homology is a concept that must be handled with pragmatism: We cannot be certain about homology, but estimates of homology can nevertheless be used as a foundation of meaningful analyses and valuable predictions, even if the term is misused by many, and the fundamental uncertainty of homology statements is neglected all too often.

[...] from a controlled vocabulary of biological terms such as the Enzyme Commission (EC, www.hem.qmul.ac.uk/iubmb/enzyme/) classification scheme or the Gene Ontology Consortium (GO) scheme (2006, www.geneontology.org), or a term which is not yet part of such a controlled vocabulary, but which can be added to it by specializing an existing term.

We mentioned "duplicates" of a biological sequence; but we have to distinguish between two scenarios:

1) The standard duplication of a sequence within a single species, e.g. the appearance of two copies of a gene and their subsequent divergence. Whole-genome duplication, segmental duplication, tandem duplication, retrotransposition, and other processes may cause such a duplication.

2) The other common mechanism that brings two copies into existence is speciation, that is the "duplication" of the entire species hosting the gene. Glossing over the speciation process itself (which involves individuals in a population, giving rise to a wide array of complicating factors, see e.g. Maddison and Knowles, 2006), the result is that the gene is found (and usually continues to be found) in the two species, and in their subsequent descendants, and it diverges in these.

Standard duplication gives rise to paralogs, speciation gives rise to orthologs (Fitch 1970), and a history of duplication and speciation events gives rise to bewildering scenarios.
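The distinction between the two scenarios can be made concrete with a small sketch. It assumes a gene tree whose internal nodes have already been labeled as duplication or speciation events; in practice these labels are themselves inferred and carry the uncertainty discussed above. Two genes are then treated as orthologs if their most recent common ancestor in the tree is a speciation node, and as paralogs if it is a duplication node. The `Node` class, the function names, and the gene names (`human_A`, `mouse_B`, etc.) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Gene-tree node. `event` is set on internal nodes only (assumed known)."""
    name: str
    event: Optional[str] = None  # "duplication" or "speciation"; None for leaves
    children: List["Node"] = field(default_factory=list)


def leaf_paths(node: Node, path: Tuple[Node, ...] = ()) -> Dict[str, Tuple[Node, ...]]:
    """Map each leaf name to the internal nodes on the way from the root to it."""
    if not node.children:
        return {node.name: path}
    paths: Dict[str, Tuple[Node, ...]] = {}
    for child in node.children:
        paths.update(leaf_paths(child, path + (node,)))
    return paths


def classify_pairs(root: Node) -> Dict[Tuple[str, str], str]:
    """Label every pair of leaves by the event at their last common ancestor."""
    paths = leaf_paths(root)
    relations: Dict[Tuple[str, str], str] = {}
    for a, b in combinations(sorted(paths), 2):
        lca = None
        # Walk both root-to-leaf paths in parallel; the last shared node is the LCA.
        for x, y in zip(paths[a], paths[b]):
            if x is not y:
                break
            lca = x
        relations[(a, b)] = "orthologs" if lca.event == "speciation" else "paralogs"
    return relations


# Hypothetical history: an ancestral gene is duplicated into copies A and B,
# and the species then splits into two lineages ("human" and "mouse").
tree = Node("ancestral_gene", "duplication", [
    Node("A", "speciation", [Node("human_A"), Node("mouse_A")]),
    Node("B", "speciation", [Node("human_B"), Node("mouse_B")]),
])

relations = classify_pairs(tree)
for pair, relation in relations.items():
    print(pair, relation)
```

In this toy history, `human_A` and `mouse_B` come out as paralogs even though they reside in different species, because the duplication that separates them predates the speciation. This is one simple instance of how duplication and speciation histories can become confusing.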
Unfortunately, in this case confusion is all too often heightened by a misuse of terminology which suggests a certainty that does not exist: Two most similar genes found in two species are often called orthologs without any further justification. Even if they are each other's
doi:10.1007/s00114-008-0348-1 pmid:18288471 fatcat:epi54zhudffl7ms3avrht2qgri