Extending the Reach of Phylogenetic Inference [chapter]

Bernard M. E. Moret
2013 Lecture Notes in Computer Science  
One of the most cited articles in biology is a 1973 piece by Theodosius Dobzhansky in The American Biology Teacher entitled "Nothing in biology makes sense except in the light of evolution." It was also around that time that the development of computational approaches for the inference of phylogenies (evolutionary histories) started. Since then, phylogenetic inference has grown to become one of the standard research tools throughout biological and biomedical research. Today, phylogenetic
... ls receive over 10,000 citations every year. Concurrently, many groups are engaged in fundamental research in phylogenetic methods and in the design and study of computationally oriented models of evolution for systems ranging from simple genetic sequences through entire genomes to interaction networks. Yet, in spite of the fame of Dobzhansky's article and the spread of phylogenetic methods beyond the original applications to systematics, the use of methods grounded in evolutionary biology is not as pervasive as it could be. In this talk, we illustrate some of the algorithmic problems raised by current research and some of the potential new applications of phylogenetic approaches through several projects carried out in our laboratory. The problems arise from combinatorial and algorithmic questions about models of evolution and approaches to the analysis of whole genomes. The new approaches include an extension of the time-tested and universally used comparative method, as well as applications of phylogenetic approaches to genomic transcripts and cell types, objects not typically studied through the lens of evolution.

Comparing the complete genomes of vertebrates is a daunting problem. Not only does each genome have billions of nucleotides, but almost nothing is known about 90% of even the best studied of these genomes. The standard approach today partitions the genomes into syntenic blocks, contiguous intervals along the genome that are viewed as homologous, that is, as descending from the same contiguous interval in the genome of the last common ancestor (LCA). Since mutations, rearrangements, insertions, and other evolutionary events have transformed the LCA genome in different ways along each evolutionary path, one cannot expect to find high levels of similarity between the sequences defined by these intervals.
Instead, one looks for markers, nearly perfectly conserved short sequences that are nevertheless long enough to make accidental conservation highly improbable. Homologous blocks should share (most of) their markers and have few, if any, shared markers with non-homologous blocks. Under most reasonable formulations, this problem is NP-hard and solutions to date are mostly ad hoc. The issue of genomic evolution "in the large," that is, at the scale of markers, genes, or blocks and through rearrangements, duplications, and losses, has been intensely studied for nearly 20 years now, with a number of remarkable algorithmic results. Every new algorithmic result, however, has served mostly to raise interest in more complete or more sophisticated models, or to motivate new and harder problems. Combining
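
The marker criterion described above (homologous blocks share most of their markers, non-homologous blocks share few or none) can be illustrated with a toy sketch. This is not the method used in the chapter: it assumes, purely for illustration, that a marker is an exact k-mer occurring exactly once in a block, which is a crude stand-in for "nearly perfectly conserved" sequences, and it ignores reverse complements and approximate matches. The function names and the choice of a Jaccard score are hypothetical.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Return the set of k-mers that occur exactly once in seq.

    Assumption: an exact, single-copy k-mer serves as a toy "marker".
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n == 1}


def shared_markers(block_a, block_b, k):
    """Return the k-mers that are single-copy in each block and present in both."""
    return unique_kmers(block_a, k) & unique_kmers(block_b, k)


def marker_similarity(block_a, block_b, k):
    """Jaccard index of the two blocks' marker sets (0.0 if both are empty)."""
    a = unique_kmers(block_a, k)
    b = unique_kmers(block_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Under this simplification, a pair of candidate blocks with a high score would be treated as putatively homologous and a pair with a score near zero as non-homologous. Real markers must be long enough that accidental conservation is highly improbable, so practical values of k are far larger than anything suited to a toy example, and assigning blocks consistently across many genomes is the part that is NP-hard under most formulations.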
doi:10.1007/978-3-642-40453-5_1 fatcat:ilmiouqtdnbzlodqtzumcnsywi