Barking Up The Wrong Treelength: The Impact of Gap Penalty on Alignment and Tree Accuracy
IEEE/ACM Transactions on Computational Biology & Bioinformatics
The current technique for estimating phylogenies from sequence data uses two phases: first, the sequences are aligned, and then the tree is estimated from the resulting alignment. More recently, however, several computational methods have been developed for simultaneous estimation of the alignment and the tree, of which POY (a heuristic for the NP-hard "minimum treelength" problem, which extends maximum parsimony (MP) so that gaps contribute to the cost) is the most popular. In a 2007 paper published in Systematic Biology, Ogden and Rosenberg reported on a simulation study in which they compared POY to the very simple two-phase method of estimating the alignment with ClustalW and then analyzing the resulting alignment with MP. They found that in the overwhelming majority of cases, ClustalW + MP outperformed POY with respect to both alignment accuracy and phylogenetic tree accuracy, and they concluded that simultaneous estimation techniques (collectively referred to as "Direct Optimization") are not competitive with two-phase techniques. Our paper presents a simulation study in which we take a closer look at the points raised by Ogden and Rosenberg. Instead of focusing specifically on POY, we focus on the NP-hard optimization problem that POY addresses: minimizing treelength. Since this optimization depends on the specific edit distance criterion used to score a tree, our study considers the impact of the gap penalty (in particular, affine versus simple) on the accuracy of the alignment and tree that optimize the treelength under that gap penalty function. Our study suggests that the poor performance observed for POY by Ogden and Rosenberg is due to the simple gap penalties they used to score alignment/tree pairs. It also suggests the intriguing possibility that optimizing under an affine gap penalty might produce alignments that are not only better than ClustalW alignments, but competitive with (or perhaps better than) those produced by the best current alignment methods. Finally, the study shows that optimizing under this affine gap penalty produces trees whose topological accuracy is better than that of ClustalW + MP, and competitive with that of the best current two-phase methods.
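To make the distinction between simple and affine gap penalties concrete, the following is a minimal sketch, not the scoring code used in the study. It assumes that a "simple" penalty charges a cost proportional to gap length, that an affine penalty adds a fixed opening cost per gap, and that treelength can be illustrated as the sum of pairwise costs over tree edges for already-aligned sequences. The function names, the toy sequences, and all cost values (`mismatch`, `gap_open`, `gap_extend`) are hypothetical choices for illustration only.

```python
from itertools import groupby


def alignment_cost(a, b, mismatch=1, gap_open=0, gap_extend=1):
    """Cost of a fixed pairwise alignment given as two equal-length gapped strings.

    A gap of length k costs gap_open + k * gap_extend (illustrative assumption).
    gap_open == 0 gives a simple (length-proportional) penalty;
    gap_open > 0 gives an affine penalty.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")

    # Classify each column, skipping columns that are gaps in both sequences.
    states = []
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        if x == "-":
            states.append("gap_a")
        elif y == "-":
            states.append("gap_b")
        elif x != y:
            states.append("mismatch")
        else:
            states.append("match")

    cost = 0
    # Consecutive columns in the same gap state form one gap of length k.
    for state, run in groupby(states):
        k = len(list(run))
        if state in ("gap_a", "gap_b"):
            cost += gap_open + k * gap_extend
        elif state == "mismatch":
            cost += k * mismatch
    return cost


def treelength(edges, seqs, **penalty):
    """Sum of pairwise alignment costs over the edges of a tree.

    edges: iterable of (node, node) pairs; seqs: node -> aligned sequence.
    """
    return sum(alignment_cost(seqs[u], seqs[v], **penalty) for u, v in edges)


# Toy data (hypothetical): one gap of length 3 versus three gaps of length 1.
ref = "ACGTTTAC"
one_long_gap = "AC---TAC"
three_short_gaps = "A-G-T-AC"

simple = dict(gap_open=0, gap_extend=1)
affine = dict(gap_open=4, gap_extend=1)
```

Under the simple penalty both toy alignments cost 3, so a single long indel is indistinguishable from three scattered ones. Under the affine penalty chosen here, `alignment_cost(ref, one_long_gap, **affine)` is 7 while `alignment_cost(ref, three_short_gaps, **affine)` is 15, so an optimizer minimizing treelength would favor fewer, longer gaps.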