Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting

Davide Albanese, Carlotta De Filippo, Duccio Cavalieri, Claudio Donati, Rachel Brem
2015 PLoS Computational Biology  
Metagenomics is revolutionizing our understanding of microbial communities, showing that their structure and composition have profound effects on the ecosystem and in a variety of health and disease conditions. Despite the flourishing of new analysis methods, current approaches based on statistical comparisons between high-level taxonomic classes often fail to identify the microbial taxa that are differentially distributed between sets of samples, since in many cases the taxonomic schema do not allow an adequate description of the structure of the microbiota. This constitutes a severe limitation to the use of metagenomic data in therapeutic and diagnostic applications. To provide a more robust statistical framework, we introduce a class of feature-weighting algorithms that discriminate the taxa responsible for the classification of metagenomic samples. The method unambiguously groups the relevant taxa into clades without relying on pre-defined taxonomic categories, thus including in the analysis also those sequences for which a taxonomic classification is difficult. The phylogenetic clades are weighted and ranked according to their abundance measuring their contribution to the differentiation of the classes of samples, and a criterion is provided to define a reduced set of most relevant clades. Applying the method to public datasets, we show that the data-driven definition of relevant phylogenetic clades accomplished by our ranking strategy identifies features in the samples that are lost if phylogenetic relationships are not considered, improving our ability to mine metagenomic datasets. Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.
Author Summary
In metagenomics, the composition of complex microbial communities is characterized using Next Generation Sequencing technologies. Thanks to the decreasing cost of sequencing, large amounts of data have been generated for environmental samples and for a variety of health-associated conditions. In parallel there has been a flourishing of statistical methods to analyze metagenomic datasets, concentrating mainly on the problem of assessing the existence of significant differences between microbial communities in different conditions.
However, for a large number of therapeutic and diagnostic applications it would be essential to identify and rank the microbial taxa that are most relevant in these comparisons. Here we present PhyloRelief, a novel feature-ranking algorithm that fills this gap by integrating the phylogenetic relationships amongst the taxa into a statistical feature weighting procedure. Without relying on a precompiled taxonomy, PhyloRelief determines the lineages most relevant to the diversification of the samples guided by the data. As such, PhyloRelief can be applied both to cases in which sequences can be classified according to a known taxonomy, and to cases in which this is not feasible, a common occurrence in metagenomic data analysis given the increasing number of new and uncultivable taxa that are discovered using these technologies.
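The feature-weighting idea summarized above can be illustrated with the classic Relief update, which the name PhyloRelief appears to allude to. The sketch below is not the authors' implementation: it runs plain Relief on a small abundance table whose columns are assumed to be already-aggregated clades, it uses a Manhattan distance between samples rather than any phylogenetic distance, and the function name `relief_weights` and the toy data are hypothetical. It also assumes every class contains at least two samples.

```python
import numpy as np


def relief_weights(X, y):
    """Classic Relief weights for a samples-by-features matrix X and labels y.

    Illustrative only: columns stand in for clades, and no tree is used.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape

    # Rescale every feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span

    # Pairwise Manhattan distances; a sample is never its own neighbour.
    dist = np.abs(Xs[:, None, :] - Xs[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)

    weights = np.zeros(n_features)
    for i in range(n_samples):
        same = y == y[i]
        hit = np.argmin(np.where(same, dist[i], np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist[i], np.inf))  # nearest other-class sample
        # A feature gains weight when it differs from the miss more than from the hit.
        weights += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return weights / n_samples


# Hypothetical toy table: column 0 separates the two groups, column 1 does not.
X = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.1, 0.2],
    [0.2, 0.1],
])
y = np.array([0, 0, 1, 1])

w = relief_weights(X, y)
ranking = np.argsort(w)[::-1]  # feature indices, most discriminative first
```

In this toy table the first column receives the larger weight and is ranked first, which is the kind of ordering a ranking strategy of this family is meant to produce. Extending the sketch to clades defined on a tree would require a sample-to-sample distance and a feature definition that the abstract does not specify.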
doi:10.1371/journal.pcbi.1004186 pmid:25815895 pmcid:PMC4376673 fatcat:fgambesv5jdtpms47sqxkswmce