Supervised Machine Learning for Population Genetics: A New Paradigm

Daniel R. Schrider, Andrew D. Kern
2018 Trends in Genetics  
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.

Population genetics over the past 50 years has been squarely focused on reconciling molecular genetic data with theoretical models that describe patterns of variation produced by a combination of evolutionary forces. This interplay between empiricism and theory means that many advances in the field have come from the introduction of new stochastic population genetic models, often of increasing complexity, that describe how population parameters (e.g., recombination or mutation rates) might generate specific features of genetic polymorphism (e.g., the site frequency spectrum, SFS; see Glossary). The goal, broadly stated, is to formulate a model that describes how nature will produce patterns of variation that we observe. With such a model in hand, all one would need to do would be to estimate its parameters, and in so doing learn everything about the evolution of a given population. Thus an overwhelming majority of population genetics research has focused on classical statistical estimation from a convenient probabilistic model (i.e., the Wright-Fisher model), or through an approximation to that model (i.e., the coalescent). The central assertion here is that the model sufficiently describes the data such that insights into nature can be made through parameter estimation. This mode of analysis that pervades population genetics is what Leo Breiman [1] famously referred to as the 'data modeling culture', wherein

This is an open access article under the CC BY license.
doi:10.1016/j.tig.2017.12.005 pmid:29331490 pmcid:PMC5905713 fatcat:xzqm7666breqflmwtno5ntvxty