Probabilistic machine learning and artificial intelligence

Zoubin Ghahramani
Nature 521:452-459, 2015
How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, plays a central role in scientific data analysis, machine learning, robotics, cognitive science, and artificial intelligence. This article provides an introduction to this probabilistic framework, and reviews some state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimisation, data compression, and automatic model discovery.
Introduction

The key idea behind the probabilistic framework for machine learning is that learning can be thought of as inferring plausible models to explain observed data. A machine can use such models to make predictions about future data, and decisions that are rational given these predictions. Uncertainty plays a fundamental role in all of this. Observed data can be consistent with many models, and therefore which model is appropriate given the data is uncertain. Similarly, predictions about future data and the future consequences of actions are uncertain. Probability theory provides a framework for modelling uncertainty.

This article starts with an introduction to the probabilistic approach to machine learning and Bayesian inference, and then reviews some of the state-of-the-art in the field. The central thesis is that many aspects of learning and intelligence depend crucially on the careful probabilistic representation of uncertainty. Probabilistic approaches have only recently become a mainstream paradigm in artificial intelligence [1], robotics [2], and machine learning [3, 4]. Even now, there is controversy in these fields about how important it is to fully represent uncertainty. For example, recent advances using deep neural networks to solve challenging pattern recognition problems such as speech recognition [5], image classification [6, 7], and prediction of words in text [8] do not overtly represent the uncertainty in the structure or parameters of those neural networks. However, my focus will not be on these types of pattern recognition problems, characterised by the availability of large amounts of data, but rather on problems where uncertainty is really a key ingredient, for example where a decision may depend on the amount of uncertainty.

In particular, I highlight five areas of current research at the frontiers of probabilistic machine learning, emphasising areas which are of broad relevance to scientists across many fields: (1) probabilistic programming, a general framework for expressing probabilistic models as computer programs, which could have a major impact on scientific modelling; (2) Bayesian optimisation, an approach to globally optimising unknown functions; (3) probabilistic data compression; (4) automating the discovery of plausible and interpretable models from data; and (5) hierarchical modelling for learning many related models, for example for personalised medicine or recommendation. While significant challenges remain, the coming decade promises substantial advances in artificial intelligence and machine learning based on the probabilistic framework.

Probabilistic modelling and the representation of uncertainty

At the most basic level, machine learning seeks to develop methods for computers to improve their performance at certain tasks based on observed data. Typical examples of such tasks might include detecting pedestrians in images taken from an autonomous vehicle, classifying gene-expression patterns from leukaemia patients into subtypes by clinical outcome, or translating English sentences into French.
However, as we will see, the scope of machine learning tasks is even broader than these pattern classification or mapping tasks, and can include optimisation and decision making, compressing data, and automatically extracting interpretable models from data.

Data are the key ingredient of all machine learning systems. But data, even so-called "Big Data", are useless on their own until one extracts knowledge or inferences from them. Almost all machine learning tasks can be formulated as making inferences about missing or latent data from the observed data; we will variously use the terms inference, prediction or forecasting to refer to this general task. Elaborating the example mentioned above, consider classifying leukaemia patients into one of the four main subtypes of this disease, based on each patient's measured gene-expression patterns. Here the observed data are pairs of gene-expression patterns and labelled subtypes, and the unobserved or missing data to be inferred are the subtypes for new patients.

In order to make inferences about unobserved data from the observed data, the learning system needs to make some assumptions; taken together these assumptions constitute a model. A model can be very simple and rigid, such as a classical statistical linear regression model, or complex and flexible, such as a large and deep neural network or even a model with infinitely many parameters. We return to this point in the next section. A model is considered to be well-defined if it can make forecasts or predictions about unobserved data having been trained on observed data (otherwise, if the model cannot make predictions it cannot be falsified, in the sense of Karl Popper, or, as Wolfgang Pauli said, the model is "not even wrong"). For example, in the classification setting, a well-defined model should be able to provide predictions of class labels for new patients. Since any sensible model will be uncertain when predicting unobserved data, uncertainty plays a fundamental role in modelling.

There are many forms of uncertainty in modelling. At the lowest level, model uncertainty is introduced from measurement noise, e.g., pixel noise or blur in images. At higher levels, a model may have many parameters, such as the coefficients of a linear regression, and there is uncertainty about which values of these parameters will be good at predicting new data. Finally, at the highest levels, there is often uncertainty about even the general structure of the model: is linear regression or a neural network appropriate, and if the latter, how many layers should it have?

The probabilistic approach to modelling uses probability theory to express all forms of uncertainty [9]. Probability theory is the mathematical language for representing and manipulating uncertainty [10], in much the same way as calculus is the language for representing and manipulating rates of change. Fortunately, the probabilistic approach to modelling is conceptually very simple: probability distributions are used to represent all the uncertain unobserved quantities in a model (including structural, parametric, and noise-related quantities) and how they relate to the data. Then the basic rules of probability theory are used to infer the unobserved quantities given the observed data. Learning from data occurs through the transformation of the prior probability distributions (defined before observing the data) into posterior distributions (after observing data). The application of probability theory to learning from data is called Bayesian learning (Box 1).
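For readers who want this machinery in symbols, the standard formulation of Bayesian learning is the following (the notation here is mine rather than reproduced from the article's Box 1: D denotes the observed data, θ the model parameters, and m the model class):

```latex
% Learning: Bayes' rule turns the prior over parameters into a posterior.
P(\theta \mid D, m) = \frac{P(D \mid \theta, m)\, P(\theta \mid m)}{P(D \mid m)}

% Prediction: average over the posterior rather than committing to one \theta.
P(x \mid D, m) = \int P(x \mid \theta, D, m)\, P(\theta \mid D, m)\, \mathrm{d}\theta

% Model comparison: the marginal likelihood P(D | m) scores whole model classes.
P(m \mid D) = \frac{P(D \mid m)\, P(m)}{P(D)}
```

The denominator P(D | m), obtained by marginalising over θ, is the same quantity used for model comparison, which is why the Bayesian framework handles uncertainty at the level of parameters and of model structure in one consistent calculus.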
Apart from its conceptual simplicity, there are several appealing properties of the probabilistic framework for machine intelligence. Simple probability distributions over single or a few variables can be composed together to form the building blocks of larger, more complex models. The dominant paradigm in machine learning over the last two decades for representing such compositional probabilistic models has been graphical models [11], with variants including directed graphs (a.k.a. Bayesian networks and belief networks), undirected graphs (a.k.a. Markov networks and random fields), and mixed graphs with both directed and undirected edges. A simple example of Bayesian inference in a directed graph is given in Figure 1. As we will see in a subsequent section, probabilistic programming offers an elegant way of generalising graphical models, allowing much richer representations of models. The compositionality of probabilistic models means that the behaviour of these building blocks in the context of the larger model is often much easier to understand than, say, what will happen if one couples a nonlinear dynamical system (e.g., a recurrent neural network) to another. In particular, for a well-defined probabilistic model, it is always possible to generate data from the model; such "imaginary" data provide a window into the "mind" of the probabilistic model, helping us understand both the initial prior assumptions and what the model has learned at any later stage.

Probabilistic modelling also has some conceptual advantages over alternatives as a normative theory for learning in artificially intelligent (AI) systems. How should an AI system represent and update its beliefs about the world in light of data? The Cox axioms define some desiderata for representing beliefs; a consequence of these axioms is that 'degrees of belief', ranging from "impossible" to "absolutely certain", must follow all the rules of probability theory [12, 10, 13]. This justifies the use of subjective Bayesian probabilistic representations in AI. An argument for Bayesian representations in AI that is motivated by decision theory is given by the Dutch-Book theorems. The argument rests on the idea that the strength of beliefs of an agent can be assessed by asking the agent whether it would be willing to accept bets at various odds (ratios of payoffs). The Dutch-Book theorems state that unless an AI system's (or human's, for that matter) degrees of belief are consistent with the rules of probability, it will be willing to accept bets that are guaranteed to lose money [14]. Because of the force of these and many other arguments on the importance of a principled handling of uncertainty for intelligence, Bayesian probabilistic modelling has emerged not only as the theoretical foundation for rationality in AI systems but also as a model for normative behaviour in humans and animals [15, 16, 17, 18] (but see [19] and [20]), and much research is devoted to understanding how neural circuitry may be implementing Bayesian inference [21, 22].

Although conceptually simple, a fully probabilistic approach to machine learning poses a number of computational and modelling challenges. Computationally, the main challenge is that learning involves marginalising (i.e., summing out) all the variables in the model except for the variables of interest (cf. Box 1).
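To make the marginalisation step concrete, the toy sketch below (the network, variable names and all the probabilities are invented for illustration; they are not taken from the article) performs exact Bayesian inference in a three-variable directed graphical model by brute-force enumeration, summing out every variable except the one of interest:

```python
# Toy illustration (all numbers invented): exact Bayesian inference in a tiny
# directed graphical model Rain -> WetGrass <- Sprinkler, by enumerating joint
# configurations and summing out (marginalising) everything except Rain.

P_rain = {True: 0.2, False: 0.8}            # prior on rain
P_sprinkler = {True: 0.1, False: 0.9}       # prior on sprinkler
P_wet = {                                   # P(grass wet | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.90,
    (False, True): 0.80, (False, False): 0.01,
}

def joint(rain, sprinkler, wet):
    """Joint probability of one full configuration of the three variables."""
    p_w = P_wet[(rain, sprinkler)]
    return P_rain[rain] * P_sprinkler[sprinkler] * (p_w if wet else 1.0 - p_w)

# Posterior P(rain | wet grass observed): sum out the sprinkler, then normalise.
unnorm = {r: sum(joint(r, s, True) for s in (True, False)) for r in (True, False)}
evidence = sum(unnorm.values())             # P(wet = True), the normalising constant
posterior = {r: p / evidence for r, p in unnorm.items()}
print(posterior)                            # P(rain = True | wet) is about 0.72 here
```

Enumeration of this kind is exact, but its cost grows exponentially with the number of variables, which is why the approximate integration methods discussed next are needed for realistic models.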
Such high-dimensional sums and integrals are generally computationally hard, in the sense that for many models there is no known polynomial-time algorithm for performing them exactly. Fortunately, a number of approximate integration algorithms have been developed, including Markov chain Monte Carlo (MCMC) methods, variational approximations, expectation propagation, and sequential Monte Carlo [23, 24, 25, 26]. It is worth noting that computational techniques are one area where Bayesian machine learning differs from much of the rest of machine learning: for Bayesians the main computational problem is integration, whereas for much of the rest of the community the focus is on optimisation of model parameters. However, this dichotomy is not as stark as it appears: many gradient-based optimisation methods can be turned into integration methods through the use of Langevin and Hamiltonian Monte Carlo methods [27, 28], while integration problems can be turned into optimisation problems through the use of variational approximations [24]. We revisit optimisation in a later section.

The main modelling challenge for probabilistic machine learning is that the model should be flexible enough to capture all the properties of the data required to achieve the prediction task of interest. One approach to addressing this challenge is to develop a prior that encompasses an open-ended universe of models that can adapt in complexity to the data. The key statistical concept underlying flexible models that grow in complexity with the data is nonparametrics.

Flexibility through nonparametrics

One of the lessons of modern machine learning is that the best predictive performance is often obtained from highly flexible learning systems, especially when learning from large data sets. Flexible models can make better predictions because to a greater extent they allow the data to "speak for themselves" (though note that all predictions involve assumptions, and therefore the data are never exclusively "speaking for themselves"). There are essentially two ways of achieving flexibility. The model could have a large number of parameters compared to the data set (for example, the neural network used recently to achieve near-state-of-the-art translation of English and French sentences is a probabilistic model with 384 million parameters [29]). Alternatively, the model can be defined using nonparametric components.

The best way to understand nonparametric models is through comparison to parametric ones. In a parametric model, there is a fixed, finite number of parameters, and no matter how much training data are observed, all the data can do is set these finitely many parameters that control future predictions. In contrast, nonparametric approaches have predictions that grow in complexity with the amount of training data, either by considering a nested sequence of parametric models with increasing numbers of parameters or by starting out with a model with infinitely many parameters. For example, in a classification problem, whereas a linear (i.e., parametric) classifier will always predict using a linear boundary between classes, a nonparametric classifier can learn a nonlinear boundary whose shape becomes more complex with more data. Many nonparametric models can be derived by starting from a parametric model and considering what happens as the model grows to the limit of infinitely many parameters [30].
Clearly, fitting a model with infinitely many parameters to finite training data would result in "overfitting", in the sense that the model's predictions might reflect quirks of the training data rather than regularities that can be generalised to test data. Fortunately, Bayesian approaches are not prone to this kind of overfitting since they average over, rather than fit, the parameters (cf. Box 1). Moreover, for many applications we have such huge data sets that the main concern is underfitting from the choice of an overly simplistic parametric model, rather than overfitting.

A review of Bayesian nonparametrics is outside the scope of this article (see, for example, [31, 32, 9]), but it is worth mentioning a few of the key models. Gaussian processes (GPs) are a very flexible nonparametric model for unknown functions, and are widely used for regression, classification, and many other applications that require inference on functions [33]. Consider learning a function relating the dose of some chemical to the response of an organism to that chemical. Instead of modelling this relationship with, say, a linear parametric function, a GP could be used to directly learn a nonparametric distribution of nonlinear functions consistent with the data (a minimal numerical sketch of this construction appears at the end of this section). A notable example of a recent application of Gaussian processes is GaussianFace, a state-of-the-art approach to face recognition that outperforms humans and deep learning methods [34].

Dirichlet processes (DPs) are a nonparametric model with a long history in statistics [35] and are used for density estimation, clustering, time series analysis, and modelling the topics of documents [36]. To illustrate DPs, consider an application to modelling friendships in a social network, where each person can belong to one of many communities. A DP makes it possible to have a model where the number of inferred communities (i.e., clusters) grows with the number of people [37]. DPs have also been used for clustering gene-expression patterns [38, 39].

The Indian buffet process (IBP) [40] is a nonparametric model which can be used for latent feature modelling, learning overlapping clusters, sparse matrix factorisation, or to nonparametrically learn the structure of a deep network [41]. Elaborating the social network modelling example, an IBP-based model allows each person to belong to some subset of a large number of potential communities (e.g., as defined by different families, workplaces, schools, hobbies, etc.) rather than a single community, and the probability of friendship between two people depends on the number of overlapping communities they have [42]. In this case, the latent features of each person correspond to the communities, which are not assumed to be observed directly. The IBP can be thought of as a way of endowing Bayesian nonparametric models with "distributed representations", as popularised in the neural network literature [43].

An interesting link between Bayesian nonparametrics and neural networks is that, under fairly general conditions, a neural network with infinitely many hidden units is equivalent to a Gaussian process [44]. Note that the above nonparametric components should be thought of again as building blocks, which can be composed into more complex models as described in the introduction. The following section describes an even more powerful way of composing models, via probabilistic programming.
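As promised above, here is a minimal numerical sketch of Gaussian-process regression for the dose-response setting. It follows the standard GP posterior equations; the squared-exponential kernel, its hyperparameters and the toy data are arbitrary choices of mine rather than anything from the article:

```python
# Minimal Gaussian-process regression sketch (hyperparameters and data invented).
# Posterior follows the standard equations:
#   mean = K_*^T (K + sigma^2 I)^{-1} y,   cov = K_** - K_*^T (K + sigma^2 I)^{-1} K_*
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
X = np.array([0.5, 1.0, 2.0, 3.5, 4.0])            # observed doses (made up)
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))  # noisy responses (made up)
Xs = np.linspace(0.0, 5.0, 100)                    # test doses to predict at

noise = 0.1 ** 2
K = rbf_kernel(X, X) + noise * np.eye(len(X))      # training covariance + noise
Ks = rbf_kernel(X, Xs)                             # train/test covariance
Kss = rbf_kernel(Xs, Xs)                           # test covariance

mean = Ks.T @ np.linalg.solve(K, y)                # posterior mean of the function
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)          # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))    # pointwise uncertainty band
```

The pointwise standard deviation grows away from the observed doses, which is precisely the kind of calibrated uncertainty about functions that the probabilistic framework is designed to deliver.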
Probabilistic Programming

The basic idea in probabilistic programming is to use computer programs to represent probabilistic models [45, 46, 47] (see http://probabilistic-programming.org). One way to do this is for the computer program to define a generator for data from the probabilistic model, i.e., a simulator (Figure 2). This simulator makes calls to a random number generator in such a way that repeated runs from the simulator would sample different possible data sets from the model. This simulation framework is more general than the graphical model framework described previously, since computer programs can allow constructs such as recursion (functions calling themselves) and control flow statements (e.g., if statements resulting in multiple paths a program can follow) which are difficult or impossible to represent in a finite graph. In fact, for many of the recent probabilistic programming languages that are based on extending Turing-complete languages (a class that includes almost all commonly used languages), it is possible to represent any computable probability distribution as a probabilistic program [48].

The full potential of probabilistic programming comes from automating the process of inferring unobserved variables in the model conditioned on the observed data (cf. Box 1). Conceptually, conditioning corresponds to computing the input states of the program that generate data matching the observed data. Whereas normally we think of programs running from inputs to outputs, conditioning involves solving the inverse problem of inferring the inputs (in particular the random number calls) that match a certain program output. Such conditioning is performed by a universal inference engine, usually implemented by Monte Carlo sampling over possible executions of the simulator program that are consistent with the observed data (a toy illustration of this idea appears at the end of this section). The fact that defining such universal inference algorithms for computer programs is even possible is somewhat surprising, but it is related to the generality of certain key ideas from sampling such as rejection sampling, sequential Monte Carlo [25], and "approximate Bayesian computation" [49].

As an example, imagine you write a probabilistic program that simulates a gene regulatory model relating unmeasured transcription factors to the expression levels of certain genes. Your uncertainty in each part of the model would be represented by the probability distributions used in the simulator. The universal inference engine can then condition the output of this program on the measured expression levels, and automatically infer the activity of the unmeasured transcription factors and other uncertain model parameters. Another application of probabilistic programming implements a computer vision system as the inverse of a computer graphics program [50].

There are several reasons why probabilistic programming could prove revolutionary for machine intelligence and scientific modelling (indeed, its potential has been noticed by DARPA, which is currently funding a major programme called PPAML). First, the universal inference engine obviates the need to manually derive inference methods for models. Since deriving and implementing inference methods is generally the most rate-limiting and bug-prone step in modelling, often taking months, automating this step so that it takes minutes or seconds will massively accelerate the deployment of machine learning systems. Second, probabilistic programming could be potentially transformative for the sciences, since it allows for rapid prototyping and testing of different models of data.
Probabilistic programming languages create a very clear separation between the model and the inference procedures, encouraging model-based thinking [51].

There are a growing number of probabilistic programming languages. BUGS, Stan, AutoBayes, and Infer.NET [52, 53, 54, 55] allow only a restricted class of models to be represented, as compared to systems based on Turing-complete languages. In return for this restriction, inference in such languages can be much faster than for the more general languages, such as IBAL, BLOG, Figaro, Church, Venture, and Anglican [56, 57, 58, 59, 60, 61, 62]. A major emphasis of recent work is on fast inference in general languages (e.g., [63]). Nearly all approaches to probabilistic programming are Bayesian, since it is hard to create other coherent frameworks for automated reasoning about uncertainty. Notable exceptions are systems such as Theano, which is not itself a probabilistic programming language but uses symbolic differentiation to speed up and automate the optimisation of parameters of neural networks and other probabilistic models [64]. While parameter optimisation is commonly used to improve probabilistic models, in the next section we will describe recent work on how probabilistic modelling can be used to improve optimisation!
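To illustrate the simulator-plus-universal-inference idea promised above in its simplest possible form, the toy program below is written in plain Python rather than in any of the languages just listed, and the model, prior probabilities and coin-flip data are all invented for illustration. The generative model is an ordinary function, and conditioning on observed data is done by rejection sampling over its executions:

```python
# Toy probabilistic-programming sketch (model and numbers invented): a generative
# model written as ordinary Python, conditioned on data by rejection sampling
# over complete executions of the program.
import random

def generative_program():
    """Simulate one run: a latent coin type and six observed flips."""
    biased = random.random() < 0.3            # latent variable: is the coin biased?
    p_heads = 0.9 if biased else 0.5
    flips = [random.random() < p_heads for _ in range(6)]
    return biased, flips

def condition(observed_flips, n_runs=200_000):
    """Crude 'universal inference': keep only runs whose output matches the data."""
    accepted = [biased for biased, flips in
                (generative_program() for _ in range(n_runs))
                if flips == observed_flips]
    return sum(accepted) / len(accepted)      # estimate of P(biased | observed flips)

observed = [True] * 6                         # observed data: six heads in a row
print(condition(observed))                    # roughly 0.94 for these invented numbers
```

Real probabilistic programming systems replace this brute-force rejection step with far more efficient inference engines (sequential Monte Carlo, MCMC over program traces, and so on), but the division of labour is the same: the user writes the simulator and the engine performs the conditioning.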
doi:10.1038/nature14541 pmid:26017444