Rate Matrices for Analyzing Large Families of Protein Sequences

C. Devauchelle, A. Grossmann, A. Hénaut, M. Holschneider, M. Monnerot, J.L. Risler, B. Torrésani
2001 Journal of Computational Biology  
We propose and study a new approach for the analysis of families of protein sequences. This method is related to the LogDet distances used in phylogenetic reconstructions; it can be viewed as an attempt to embed these distances into a multi-dimensional framework. The proposed method starts by associating a Markov matrix to each pairwise alignment deduced from a given multiple alignment. The central objects under consideration here are matrix-valued logarithms L of these Markov matrices, which exist under conditions that are compatible with fairly large divergence between the sequences. These logarithms allow us to compare data from a family of aligned proteins with simple models (in particular, continuous reversible Markov models) and to test the adequacy of such models. If one neglects fluctuations arising from the finite length of sequences, any continuous reversible Markov model with a single rate matrix Q over an arbitrary tree predicts that all the observed matrices L are multiples of Q. Our method exploits this remark, without relying on any tree estimation. We test this prediction on a family of proteins encoded by the mitochondrial genome of 26 multicellular animals, which include vertebrates, arthropods, echinoderms, molluscs and nematodes. A principal component analysis of the observed matrices L shows that a single rate model can be used as a rough approximation to the data, but that systematic deviations from any such model are unmistakable, and related to the evolutionary history of the species under consideration.
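The remark that every observed L is a multiple of Q can be checked numerically on a toy case. The sketch below is an illustration only, not the authors' code: it assumes a hypothetical 3-state alphabet (instead of 20 amino acids), an arbitrary reversible rate matrix Q, and exact Markov matrices P(t) = exp(tQ) with no finite-length fluctuations. The matrix logarithm is computed with a plain power series, which is adequate here because the eigenvalues of P(t) lie in (0, 1].

```python
# Toy check: for a single rate matrix Q, log(exp(t*Q)) is a multiple of Q.
# Q, PI, S and the divergence values below are assumptions made for illustration.

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def scale(m, c):
    return [[c * x for x in row] for row in m]

def mat_exp(m, terms=60):
    """exp(m) by a truncated Taylor series (fine for small matrices of modest norm)."""
    n = len(m)
    result, term = identity(n), identity(n)
    for k in range(1, terms):
        term = scale(mat_mul(term, m), 1.0 / k)  # term is now m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def mat_log(p, terms=400):
    """log(p) = -sum_{k>=1} (I - p)^k / k, valid when the spectral radius of I - p is < 1."""
    n = len(p)
    a = [[(1.0 if i == j else 0.0) - p[i][j] for j in range(n)] for i in range(n)]
    power = identity(n)
    result = [[0.0] * n for _ in range(n)]
    for k in range(1, terms + 1):
        power = mat_mul(power, a)  # power is now (I - p)^k
        result = [[result[i][j] - power[i][j] / k for j in range(n)] for i in range(n)]
    return result

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Reversible rate matrix: Q[i][j] = S[i][j] * PI[j] off the diagonal, rows summing to zero.
PI = [0.5, 0.3, 0.2]                       # assumed stationary distribution
S = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]      # assumed symmetric exchangeabilities
Q = [[S[i][j] * PI[j] for j in range(3)] for i in range(3)]
for i in range(3):
    Q[i][i] = -sum(Q[i][j] for j in range(3) if j != i)

# Two "pairwise comparisons" at different divergences share the same Q.
for t in (0.3, 0.8):
    L = mat_log(mat_exp(scale(Q, t)))
    print(f"t = {t}: max |L - t*Q| = {max_abs_diff(L, scale(Q, t)):.2e}")
```

With real alignments the matrices L are estimated from finite sequences, so they would only be approximately proportional to one another; the abstract's principal component analysis addresses how far the observed L depart from a single direction.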
doi:10.1089/106652701752236205 pmid:11571074 fatcat:doquwcwvirbpjj7vi7dqfagnia