Benchmark datasets and software for developing and testing methods for large-scale multiple sequence alignment and phylogenetic inference

C. Randal Linder, Rahul Suri, Kevin Liu, Tandy Warnow
2010 PLOS Currents  
We have assembled a collection of web pages that contain benchmark datasets and software tools to enable the evaluation of the accuracy and scalability of computational methods for estimating evolutionary relationships. They provide a resource to the scientific community for development of new alignment and tree inference methods on very difficult datasets. The datasets are intended to help address three problems: multiple sequence alignment, phylogeny estimation given aligned sequences, and supertree estimation. Datasets from our work include empirical datasets with carefully curated alignments suitable for testing alignment and phylogenetic methods for large-scale systematics studies. Links to other empirical datasets, lacking curated alignments, are also provided. We also include simulated datasets with properties typical of large-scale systematics studies, including high rates of substitutions and indels, and we include the true alignment and tree for each simulated dataset. Finally, we provide links to software tools for generating simulated datasets, and for evaluating the accuracy of alignments and trees estimated on these datasets. We welcome contributions to the benchmark datasets from other researchers.
[...] initiative is "[a]ssembly of a framework phylogeny, or Tree of Life, for all major lineages of life." [1] Much of that effort has focused on accumulating and analyzing data for the major taxonomic groups. However, because of the scale of the problems (numbers of species and amount of sequence information), the initiative has also required development of methods for sequence alignment, phylogenetic inference, and supertree estimation that can handle hundreds, thousands, or even tens of thousands of sequences. In the last decade, many new methods have been developed to address these challenging computational problems, including RAxML [2], GARLI [3], POY [4], SATé [5], and MrBayes [6]. However, evaluations of the efficacy of these methods for large-scale alignment and tree estimation (required for highly accurate estimations of the Tree of Life) have lagged behind method development. To facilitate testing of large-scale alignment and phylogeny estimation methods, we have assembled a collection of web pages of (1) benchmark datasets and (2) software appropriate for creating new simulated benchmark datasets (http://www.cs.utexas.edu/users/phylo/datasets/).
Because these datasets have been assembled with an eye to their usefulness for Tree of Life-scale projects, only datasets that have large numbers of taxa and/or present other difficulties for phylogenetic reconstruction and alignment (e.g., high rates of substitution, insertion, and deletion) are included. The datasets range in size from a few hundred to more than 300,000 sequences. They are divided into sets most appropriate for three types of phylogenetic problems: phylogenetic estimation given aligned sequences, supertree estimation, and multiple sequence alignment. Some datasets are appropriate for more than one type of problem and are therefore referenced more than once. Reference information and links are provided for all published datasets.
Benchmarks for phylogenetic estimation
The benchmark datasets for phylogenetic estimation are both empirical and simulated. They have been used in large-scale systematics studies, and so present challenges for maximum likelihood, maximum parsimony, and Bayesian estimation. A subset of the empirical datasets (Table 1) includes curated alignments and reference trees (generated using RAxML version 7.0.4 [2]). Reference trees have been assessed by bootstrapping, with edges having less than 75% support contracted. The remaining empirical datasets lack curated alignments and reference trees, but are appropriate for assessing the ability of alignment and phylogenetic software to operate on large and/or difficult datasets. They can also be used to compare how well algorithms optimize particular criteria, e.g., maximum parsimony or maximum likelihood. The empirical datasets include [...]
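The contraction of weakly supported edges described above for the reference trees can be illustrated with a minimal Python sketch. It is not the procedure or software used to build the benchmark; the `Node` class, the `contract_low_support` and `to_newick` helpers, and the toy five-taxon tree are all hypothetical names and data introduced here for illustration. The sketch assumes a rooted tree in which each internal node stores the bootstrap percentage of the edge above it, and it collapses every internal edge whose support falls below the 75% threshold, leaving a polytomy in its place.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical rooted-tree node; `support` is the bootstrap % of the edge above it."""
    name: Optional[str] = None        # taxon label (leaves only)
    support: Optional[float] = None   # None for leaves and for the root
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def contract_low_support(node: Node, threshold: float = 75.0) -> Node:
    """Return a copy of the tree with internal edges below `threshold` collapsed."""
    new_children: List[Node] = []
    for child in node.children:
        kept = contract_low_support(child, threshold)
        weak = (
            not kept.is_leaf()
            and kept.support is not None
            and kept.support < threshold
        )
        if weak:
            # Collapse the edge: the child's subtrees attach directly to `node`.
            new_children.extend(kept.children)
        else:
            new_children.append(kept)
    return Node(node.name, node.support, new_children)


def to_newick(node: Node) -> str:
    """Serialize to Newick (without the trailing semicolon), support as node labels."""
    if node.is_leaf():
        return node.name or ""
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else f"{node.support:g}"
    return f"({inner}){label}"


# Toy example (made-up support values): ((A,B)90,(C,(D,E)60)80);
tree = Node(children=[
    Node(support=90, children=[Node("A"), Node("B")]),
    Node(support=80, children=[
        Node("C"),
        Node(support=60, children=[Node("D"), Node("E")]),
    ]),
])

collapsed = contract_low_support(tree)
print(to_newick(collapsed) + ";")  # ((A,B)90,(C,D,E)80);
```

In the toy tree, only the edge above (D,E) has support below 75%, so D and E become direct children of their former grandparent while the two well-supported clades are kept. Leaf edges are never contracted, and the input tree is left unmodified.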
doi:10.1371/currents.rrn1195 pmid:21113335 pmcid:PMC2989560 fatcat:h6zwnb74vnfwnayowketgddohu