Supervised learning with decision tree-based methods in computational and systems biology

Pierre Geurts, Alexandre Irrthum, Louis Wehenkel
2009 Molecular Biosystems  
At the intersection between artificial intelligence and statistics, supervised learning provides algorithms to automatically build predictive models only from observations of a system. During the last twenty years, supervised learning has been a tool of choice to analyze the always increasing and complexifying data generated in the context of molecular biology, with successful applications in genome annotation, function prediction, or biomarker discovery. Among supervised learning methods, decision tree-based methods stand out as non parametric methods that have the unique feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this class of methods. The first part of the paper is devoted to an intuitive but complete description of decision tree-based methods and a discussion of their strengths and limitations with respect to other supervised learning methods. The second part of the paper provides a survey of their applications in the context of computational and systems biology. The supplementary material provides information about various non-standard extensions of the decision tree-based approach to modeling, some practical guidelines for the choice of parameters and algorithm variants depending on the practical objectives of their application, pointers to freely accessible software packages, and a brief primer going through the different manipulations needed to use the tree-induction packages available in the R statistical tool.
doi:10.1039/b907946g pmid:20023720 fatcat:25bpsowcznco5f6xs2cn73ke4u