Peer Review #1 of "Evaluation of computational methods for human microbiome analysis using simulated data (v0.1)" [peer_review]

A Gorska
2020 unpublished
Background. Our understanding of the composition, function, and health implications of human microbiota has been advanced by high-throughput sequencing and the development of new genomic analyses. However, trade-offs among alternative strategies for the acquisition and analysis of sequence data remain understudied. Methods. We assessed eight popular taxonomic profiling pipelines (MetaPhlAn2, metaMix, PathoScope 2.0, Sigma, Kraken, ConStrains, Centrifuge and Taxator-tk) against a battery of metagenomic datasets simulated from real data. The metagenomic datasets were modelled on 426 complete or permanent draft genomes stored in the Human Oral Microbiome Database and were designed to simulate various experimental conditions, both in the design of a putative experiment (read length, 75-1000 bp; sequence depth, 100K-10M) and in metagenomic composition (number of species present, 10, 100 or 426; species distribution). The sensitivity and specificity of each of the pipelines under various scenarios were measured. We also estimated the relative root mean square error and average relative error to assess the abundance estimates produced by different methods. Additional datasets were generated for five of the pipelines to simulate the presence within a metagenome of an unreferenced species, closely related to other referenced species. Additional datasets were also generated in order to measure computational time on datasets of ever-increasing sequencing depth (up to 6 × 10^7). Results. Testing of eight pipelines against 144 simulated metagenomic datasets initially produced 1,104 discrete results. Pipelines using a marker gene strategy (MetaPhlAn2 and ConStrains) were overall less sensitive than other pipelines, with the notable exception of Taxator-tk. This lower sensitivity was largely offset by runtime, which was significantly lower than that of more sensitive pipelines that rely on whole-genome alignments, such as PathoScope 2.0. However, pipelines that used strategies to speed up alignment between genomic references and metagenomic reads, such as kmerization, were able to combine both high sensitivity and low run time, as is the case with Kraken and Centrifuge. Absent species genomes in the database mostly led to assignment of reads to the most closely related species available in all pipelines.
Our results therefore suggest that taxonomic profilers that use kmerization have largely superseded those that use gene markers, coupling low run times with high sensitivity and specificity. Taxonomic profilers using more time-consuming read reassignment, such as PathoScope 2.0, provided the most sensitive profiles under common metagenomic sequencing scenarios. All the results described and discussed in this paper can be visualized using the dedicated R Shiny application (https://github.com/microgenomics/HumanMicrobiomeAnalysis). All of our datasets, pipelines and results are made readily available through the Shiny App for future benchmarking.
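The abstract names four evaluation metrics (sensitivity, specificity, relative root mean square error and average relative error) without giving their formulas. The sketch below shows one common way such metrics could be computed for a single simulated dataset. The formulas, function names and toy species labels are assumptions made for illustration, not the definitions used in the manuscript: detection is scored per species against a fixed reference list, and abundance errors are taken relative to the true abundance of each species actually present.

```python
from math import sqrt


def detection_metrics(true_species, predicted_species, reference_species):
    """Species-level sensitivity and specificity (assumed definitions).

    Negatives are reference species that are absent from the simulated sample.
    """
    true_set = set(true_species)
    pred_set = set(predicted_species)
    ref_set = set(reference_species)

    tp = len(true_set & pred_set)             # present and reported
    fn = len(true_set - pred_set)             # present but missed
    fp = len(pred_set - true_set)             # reported but absent
    tn = len(ref_set - true_set - pred_set)   # absent and not reported

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def abundance_errors(true_abund, est_abund):
    """Relative RMSE and average relative error (assumed definitions).

    Species missing from the estimate are treated as having abundance 0.
    """
    rel = [
        (est_abund.get(species, 0.0) - truth) / truth
        for species, truth in true_abund.items()
        if truth > 0
    ]
    rrmse = sqrt(sum(r * r for r in rel) / len(rel))
    avg_rel_err = sum(abs(r) for r in rel) / len(rel)
    return rrmse, avg_rel_err


# Hypothetical toy data: five reference species, three truly present.
reference = ["A", "B", "C", "D", "E"]
truth = {"A": 0.5, "B": 0.25, "C": 0.25}
estimate = {"A": 0.4, "B": 0.25, "D": 0.35}

sens, spec = detection_metrics(truth, estimate, reference)
rrmse, avg_rel_err = abundance_errors(truth, estimate)
print(sens, spec, rrmse, avg_rel_err)
```

Under these assumed definitions, a pipeline that misses a species entirely contributes a relative error of -1 for that species, so missed detections are penalized in the abundance metrics as well as in sensitivity.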
PeerJ reviewing PDF | (2019:12:43682:2:0:NEW 15 Jul 2020) Manuscript to be reviewed

Introduction

Metagenomics is emerging as the highest-resolution approach to study the human microbiome, providing essential insights into human health (Belizário and Napolitano, 2015). High-throughput sequencing (HTS) techniques have unleashed vast amounts of metagenomic data, which in turn have prompted the rapid development of sophisticated bioinformatic tools and computational pipelines. A subset of these tools and pipelines are dedicated to providing an answer to the quintessential question "who is there?" (Grice and Segre, 2012). By using shotgun metagenome sequencing data and aligning reads against reference databases, we can answer this question at various levels of interest. With taxonomic binning, individual sequence reads are clustered into new or existing operational taxonomic units (OTUs), obtained through sequence similarity and other intrinsic shared features present in reads. With taxonomic profiling, the focus is on estimating the presence and quantity of taxa in a microbial population as well as the relative abundance of each species present. The taxonomic profiling of several metagenomes can in turn give us an understanding of the alpha diversity of a given microbiome across several individuals.

The fact that most microbial species, what Rappé and Giovannoni categorize as the "uncultured microbial majority" (2003), cannot be grown in laboratory conditions poses a real challenge for taxonomic profiling. Given the extreme richness of certain communities, such as those residing in soil or sea/freshwater, we can anticipate that a large proportion of microorganisms extracted and sequenced from those environments will be entirely novel, precluding any taxonomic profiling.
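To make the idea of taxonomic profiling against a reference database concrete, the following toy sketch assigns reads to species by exact k-mer matching and then reports relative abundances. It is a deliberately simplified illustration and not the algorithm of Kraken, Centrifuge or any other pipeline named in the manuscript. The function names, the value of k, the majority-vote rule and the tiny sequences are all hypothetical, and abundances are taken as plain fractions of classified reads with no correction for genome length.

```python
from collections import Counter, defaultdict


def build_kmer_index(references, k):
    """Map every k-mer in each reference genome to the species containing it."""
    index = defaultdict(set)
    for species, genome in references.items():
        for i in range(len(genome) - k + 1):
            index[genome[i:i + k]].add(species)
    return index


def classify_read(read, index, k):
    """Assign a read to the species with the most k-mer hits (toy voting rule).

    Returns None when the read has no hits or the top two species tie.
    """
    votes = Counter()
    for i in range(len(read) - k + 1):
        for species in index.get(read[i:i + k], ()):
            votes[species] += 1
    if not votes:
        return None
    (best, best_hits), *runner_up = votes.most_common(2)
    if runner_up and runner_up[0][1] == best_hits:
        return None  # ambiguous between two species
    return best


def profile(reads, references, k=5):
    """Relative abundance as the fraction of classified reads per species."""
    index = build_kmer_index(references, k)
    counts = Counter()
    for read in reads:
        species = classify_read(read, index, k)
        if species is not None:
            counts[species] += 1
    total = sum(counts.values())
    return {sp: n / total for sp, n in counts.items()} if total else {}


# Hypothetical reference "genomes" and reads, far shorter than real data.
references = {"sp1": "ACGTACGTTAGC", "sp2": "GGCCTTAAGGCC"}
reads = ["ACGTACG", "GTTAGC", "CCTTAAGG", "TTTTTTT"]
print(profile(reads, references))
```

In this toy setting, the last read shares no k-mer with either reference and is left unclassified. A read from an unreferenced but closely related species would instead share many k-mers with its nearest relative and be assigned to it, which is the behaviour the abstract reports for absent genomes.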
doi:10.7287/peerj.9688v0.1/reviews/1 fatcat:o77ciioh6bcb5lac4auq53irim