Neighbor Joining Algorithms for Inferring Phylogenies via LCA Distances

Ilan Gronau, Shlomo Moran
2007 Journal of Computational Biology  
Reconstructing phylogenetic trees efficiently and accurately from distance estimates is an ongoing challenge in computational biology from both practical and theoretical considerations. We study algorithms which are based on a characterization of edge-weighted trees by distances to LCAs (Least Common Ancestors). This characterization enables a direct application of ultrametric reconstruction techniques to trees which are not necessarily ultrametric. A simple and natural neighbor joining ... n based on this observation is used to provide a family of efficient neighbor-joining algorithms. These algorithms are shown to reconstruct a refinement of the Buneman tree, which implies optimal robustness to noise under criteria defined by Atteson. In this sense, they outperform many popular algorithms such as Saitou and Nei's NJ. One member of this family is used to provide a new simple version of the 3-approximation algorithm for the closest additive metric under the $\ell_\infty$ norm. A byproduct of our work is a novel technique which yields a time optimal $O(n^2)$ implementation of common clustering algorithms such as UPGMA.

which elements are chosen to be joined. The simplest neighbor-joining criterion is probably the "closest-pair" criterion, which is used in several well known clustering algorithms such as UPGMA, WPGMA (Sneath and Sokal, 1973) and the single-linkage algorithm (Barthelemy and Guenoche, 1991; Krivanek, 1988). While this criterion is inconsistent in general, it is consistent for the special case of ultrametric trees, which contain a point (root) which is equidistant from all taxa. Ultrametric reconstruction algorithms typically have very efficient implementations: $O(n^2 \log n)$ for UPGMA and WPGMA, and $O(n^2)$ for the single linkage algorithm. Neighbor-joining algorithms which consistently reconstruct general trees (which are not necessarily ultrametric) typically use more complex neighbor joining criteria, significantly increasing their running time.

The problem of consistent reconstruction can be reduced to the special case of ultrametric reconstruction by applying the Farris transform (Farris, 1973). The Farris transform converts any additive metric into an ultrametric while conserving the topology of the corresponding tree (Fig. 1). After applying the Farris transform, ultrametric reconstruction methods (such as the ones listed above) can be used to obtain an intermediate ultrametric tree.
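The Farris transform described above can be illustrated with a minimal Python sketch. It uses the formula given in the caption of Fig. 1, U(i, j) = 2m + D(i, j) − D(r, i) − D(r, j), and is not the authors' implementation. The function names, the zero diagonal of U, the tolerance, and the example matrix (a four-taxon tree with topology (01 : 23), external edge weights 1, 2, 3, 1 and internal edge weight 2) are all assumptions made for illustration. The ultrametric test is the standard three-point condition: in every triple, the two largest distances are equal.

```python
from itertools import combinations


def farris_transform(D, r, m=None):
    """Return U with U[i][j] = 2m + D[i][j] - D[r][i] - D[r][j] for i != j.

    D is a symmetric dissimilarity matrix (list of lists), r is the index
    of the reference taxon, and m must satisfy m >= max_i D[r][i].
    The diagonal of U is set to 0 (an assumption of this sketch).
    """
    n = len(D)
    largest = max(D[r])
    if m is None:
        m = largest  # smallest admissible value of m
    if m < largest:
        raise ValueError("m must be at least max_i D(r, i)")
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                U[i][j] = 2 * m + D[i][j] - D[r][i] - D[r][j]
    return U


def is_ultrametric(U, tol=1e-9):
    """Three-point condition: in each triple the two largest distances agree."""
    for i, j, k in combinations(range(len(U)), 3):
        _, second, largest = sorted((U[i][j], U[i][k], U[j][k]))
        if largest - second > tol:
            return False
    return True


# Hypothetical additive metric of a tree with topology (01 : 23).
D_example = [
    [0, 3, 6, 4],
    [3, 0, 7, 5],
    [6, 7, 0, 4],
    [4, 5, 4, 0],
]

U_example = farris_transform(D_example, r=0)
print(is_ultrametric(D_example), is_ultrametric(U_example))
```

In this example the closest pair in U_example is (2, 3), followed by taxon 1 and then taxon 0, which matches the unrooted topology (01 : 23) of the assumed input tree.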
Finally, in order to obtain the desired tree, the weights of external edges need to be restored. This approach leads to several time optimal consistent reconstruction algorithms (see e.g., Agarwala et al., 1999; Gusfield, 1997).

In this paper we introduce an alternative technique for reducing the problem of consistent reconstruction to the problem of ultrametric reconstruction. Using distances to least common ancestors (LCAs), this technique directly reconstructs the desired tree, thus bypassing the intermediate ultrametric tree mentioned above. This direct approach enables the proof of certain robustness properties which are strictly stronger than consistency alone.

Consistency is a natural and basic requirement, guaranteeing correct reconstruction when distance estimates are accurate. However, in practice we are rarely able to obtain accurate distance estimates, and the input from which trees are reconstructed is seldom additive. The input dissimilarity matrix is often regarded as a noisy version of some original additive metric, and distance-based reconstruction methods are required to be robust to this noise. Informally, robustness of an algorithm to noise is measured by the amount of noise under which correct reconstruction of the tree's topology (or parts of it) is still guaranteed.

One notion of robustness is defined by the ability to reconstruct the correct topology given nearly additive input. A dissimilarity matrix $D$ is said to be nearly additive with respect to a binary edge-weighted tree $T$ (whose induced additive metric is denoted by $D_T$), if $\|D, D_T\|_\infty < \frac{1}{2} \cdot \min_{e \in T} \{w(e)\}$ (Atteson, 1999). The topology of $T$ is uniquely determined by any dissimilarity matrix $D$ which is nearly additive with respect to it. This is because the topology of $T$ is uniquely determined by the configurations of all taxon-quartets in the tree, and a matrix $D$ which is nearly additive w.r.t.
$T$ is also quartet-consistent with it in the following sense:

Definition 1.1 (Quartet consistency). Let $D$ be a dissimilarity matrix, then:
• $D$ is consistent with quartet-configuration $(ij : kl)$, if:
• $D$ is quartet-consistent with some tree $T$ if it is consistent with all quartet-configurations induced by $T$.

FIG. 1. The Farris transform. Given a dissimilarity matrix $D$, a taxon $r$ and some value $m \geq \max_i \{D(r, i)\}$, the Farris transform defines a dissimilarity matrix $U$ s.t. $U(i, j) = 2m + D(i, j) - D(r, i) - D(r, j)$. If $D$ is additive, consistent with some tree $T$, then $U$ is consistent with an ultrametric tree achieved by elongating the external edges of $T$ (elongation marked by dashed line).
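Returning to near-additivity and Definition 1.1, the two conditions can be sketched in Python as follows. Two assumptions are made here. First, the norm in the near-additivity condition is read as the largest absolute entrywise difference between D and D_T. Second, the inequality for quartet consistency does not appear in the text above, so the sketch assumes the usual four-point form D(i, j) + D(k, l) < min{D(i, k) + D(j, l), D(i, l) + D(j, k)}; this may differ in detail from the paper's own statement. The function names and the example matrices are hypothetical.

```python
def linf_distance(D1, D2):
    """Largest absolute entrywise difference between two square matrices."""
    n = len(D1)
    return max(abs(D1[i][j] - D2[i][j]) for i in range(n) for j in range(n))


def is_nearly_additive(D, D_T, edge_weights):
    """Check ||D, D_T||_inf < (1/2) * min edge weight of the tree T."""
    return linf_distance(D, D_T) < 0.5 * min(edge_weights)


def consistent_with_quartet(D, i, j, k, l):
    """Assumed four-point form of consistency with configuration (ij : kl)."""
    return D[i][j] + D[k][l] < min(D[i][k] + D[j][l], D[i][l] + D[j][k])


# Hypothetical tree with topology (01 : 23): external edges 1, 2, 3, 1
# and one internal edge of weight 2.
edge_weights = [1, 2, 3, 1, 2]
D_T = [
    [0, 3, 6, 4],
    [3, 0, 7, 5],
    [6, 7, 0, 4],
    [4, 5, 4, 0],
]

# Noisy version of D_T: every off-diagonal entry is perturbed by 0.4,
# which is below half the minimum edge weight (0.5).
D_noisy = [
    [0.0, 3.4, 5.6, 3.6],
    [3.4, 0.0, 6.6, 4.6],
    [5.6, 6.6, 0.0, 4.4],
    [3.6, 4.6, 4.4, 0.0],
]

print(is_nearly_additive(D_noisy, D_T, edge_weights))
print(consistent_with_quartet(D_noisy, 0, 1, 2, 3))
print(consistent_with_quartet(D_noisy, 0, 2, 1, 3))
```

The perturbation in D_noisy pushes the sum for the true configuration (01 : 23) upward and the two competing sums downward. Under the assumed inequality the true configuration is still the only one that D_noisy is consistent with, in line with the claim that near-additivity implies quartet consistency.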
doi:10.1089/cmb.2006.0115 pmid:17381342 fatcat:zplzpphu4vbgzp7lqy32ow4fgq