eMAFFTadd: scaling MAFFT-linsi-add to large datasets

Chengze Shen, Baqiao Liu, Tandy Warnow
2022 bioRxiv   pre-print
Multiple sequence alignment is essential for many downstream biological analyses. Yet accurate alignment of large datasets is challenging and can require very long running times; since new sequence data are obtained frequently and regularly, this calls for methods that can add sequences into large alignments rather than requiring re-estimation from scratch. In addition, sequence datasets exhibiting substantial sequence length heterogeneity are also difficult to align with high accuracy.
Methods such as UPP have been able to provide good accuracy; they operate by extracting a subset of the sequences deemed to be full length, aligning that subset (thus producing a "backbone alignment"), and then adding the remaining sequences into the backbone alignment. There are also standalone methods, such as MAFFT--add, that can add sequences into backbone alignments, but the best version of this method (which uses --linsi) is computationally intensive. Because adding sequences into alignments is a basic and important step in bioinformatics analyses, the development of new approaches with high scalability and accuracy is important. In this study, we present a new sequence-adding method, eMAFFTadd, that achieves high accuracy and scalability. In essence, eMAFFTadd is a way of scaling MAFFT-linsi-add to large datasets. We show that eMAFFTadd is more accurate than UPP, can run on datasets too large for MAFFT-linsi-add, and is fast enough to use on very large sequence datasets. Our software for eMAFFTadd is available in open-source form at https://github.com/c5shen/eMAFFTadd.
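The backbone-then-add workflow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the eMAFFTadd or UPP implementation: the function names are hypothetical, and the rule that a sequence counts as "full length" when its length is within 25% of the median is an assumption made for the example. The second function only builds a MAFFT command line (the L-INS-i options `--localpair --maxiterate 1000` combined with `--add`); it does not run MAFFT, and the file names passed to it are placeholders.

```python
from statistics import median


def split_backbone_and_queries(seqs, tolerance=0.25):
    """Split unaligned sequences into backbone and query sets.

    `seqs` maps a sequence name to its unaligned sequence string.
    A sequence is treated as full length when its length lies within
    `tolerance` of the median length (an assumed rule for illustration).
    """
    med = median(len(s) for s in seqs.values())
    lo, hi = (1 - tolerance) * med, (1 + tolerance) * med
    # Full-length sequences form the backbone; everything else is added later.
    backbone = {name: s for name, s in seqs.items() if lo <= len(s) <= hi}
    queries = {name: s for name, s in seqs.items() if name not in backbone}
    return backbone, queries


def mafft_linsi_add_command(query_fasta, backbone_alignment_fasta):
    """Build (but do not run) a MAFFT command that adds query sequences
    to an existing backbone alignment using the L-INS-i settings.
    MAFFT writes the resulting alignment to standard output.
    """
    return [
        "mafft", "--localpair", "--maxiterate", "1000",
        "--add", query_fasta, backbone_alignment_fasta,
    ]
```

For example, given four sequences of lengths 10, 9, 11, and 3, the median length is 9.5, so the first three land in the backbone and the short fragment becomes a query sequence to be added afterwards.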
doi:10.1101/2022.05.23.493139 fatcat:lkwchbd4hnak7copjv3txnkyzu