Functional programming for modular Bayesian inference

Adam Ścibior, Ohad Kammar, Zoubin Ghahramani
Proceedings of the ACM on Programming Languages, 2018
We present an architectural design of a library for Bayesian modelling and inference in modern functional programming languages. The novel aspect of our approach is modular implementations of existing state-of-the-art inference algorithms. Our design relies on three inherently functional features: higher-order functions, inductive data-types, and support for either type-classes or an expressive module system. We provide a performant Haskell implementation of this architecture, demonstrating that
high-level and modular probabilistic programming can be added as a library in sufficiently expressive languages. We review the core abstractions in this architecture: inference representations, inference transformations, and inference representation transformers. We then implement concrete instances of these abstractions, counterparts to particle filters and Metropolis-Hastings samplers, which form the basic building blocks of our library. By composing these building blocks we obtain state-of-the-art inference algorithms: Resample-Move Sequential Monte Carlo, Particle Marginal Metropolis-Hastings, and Sequential Monte Carlo Squared. We evaluate our implementation against existing probabilistic programming systems and find it is already competitively performant, although we conjecture that existing functional programming optimisation techniques could reduce the overhead associated with the abstractions we use. We show that our modular design enables deterministic testing of inherently stochastic Monte Carlo algorithms. Finally, we demonstrate using OCaml that an expressive module system can also implement our design.

Probabilistic programs give us a formalism for expressing distributions by programming with the abstractions of Bayesian statistics: sampling from the prior distribution and changing the likelihood, also known as conditioning. We explain these two effects using Haskell types and monads, following Staton [2017]. For a traditional explanation of Bayesian modelling and inference we refer the reader to one of the many excellent textbooks available [Barber 2012; Bishop 2006; MacKay 2003].

A monad m supporting these two constructs should support the following two functions:

random :: m Double
score  :: Log Double → m ()

As an aside, values of type Log Double are non-negative reals represented internally as logarithms. The use of logarithms replaces multiplication with addition of double-precision floating-point numbers, somewhat reducing underflow issues.

Each computation of type m a represents a distribution over the values of type a. The function random draws a random value uniformly from the unit interval [0, 1]. The computation score r scales the distribution by the factor r ∈ [0, ∞), and its effect is more subtle. If we ignore the scores in a given computation c of type m a, then c represents a probability distribution over values of type a, called the prior distribution. Given a value x of type a, let the likelihood w(x) of x be the "sum", over all program traces that result in x, of the product of scores along each trace.

As an example, consider the following program, often used to explain Bayesian modelling. Using the distribution bernoulli p = fmap (< p) random, i.e., True with probability p and False with probability 1 - p, define the following computation c:

rain      ← bernoulli 0.2
sprinkler ← bernoulli 0.1
let prob_lawn_wet = case (rain, sprinkler) of
      (True , True ) → 0.99
      (True , False) → 0.70
      (False, True ) → 0.90
      (False, False) → 0.01
score prob_lawn_wet  -- observe lawn wet
return rain

Such programs are called (probabilistic) models. In this program, we observe that the lawn is wet, which can be caused by either rain or the sprinkler, and try to infer the probability that it was raining. The prior is bernoulli 0.2 and the likelihood is:

w(True ) = 0.1 * 0.99 + 0.9 * 0.70 = 0.729
w(False) = 0.1 * 0.90 + 0.9 * 0.01 = 0.099

Given the prior distribution prior_c and the likelihood w_c of a computation c :: m a, the distribution µ_c that c represents is given by rescaling the prior by the likelihood, assigning to each event E the weighted sum

µ_c(E) = ∫_E w_c(x) prior_c(dx),

i.e., Lebesgue integrating the likelihood over E according to the prior.
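To make the two effects concrete, the following is a minimal, self-contained sketch of the sampling and scoring interface as Haskell type classes, with the lawn-wet model written against it. The class names MonadSample and MonadCond are illustrative assumptions rather than the paper's actual library interface, and Log is assumed to come from the log-domain package's Numeric.Log module.

import Numeric.Log (Log)   -- assumed: Log Double, reals stored in log-space

-- Hypothetical class names for illustration only.
class Monad m => MonadSample m where
  random :: m Double                  -- uniform draw from [0, 1]

class MonadSample m => MonadCond m where
  score :: Log Double -> m ()         -- rescale the current trace by a factor

-- A Bernoulli draw built from the uniform primitive, as in the text.
bernoulli :: MonadSample m => Double -> m Bool
bernoulli p = fmap (< p) random

-- The lawn-wet model from the text, written against the interface.
model :: MonadCond m => m Bool
model = do
  rain      <- bernoulli 0.2
  sprinkler <- bernoulli 0.1
  let probLawnWet :: Log Double
      probLawnWet = case (rain, sprinkler) of
        (True , True ) -> 0.99
        (True , False) -> 0.70
        (False, True ) -> 0.90
        (False, False) -> 0.01
  score probLawnWet                   -- observe that the lawn is wet
  return rain

Splitting sampling and scoring into separate classes lets purely generative code such as bernoulli be reused in monads that do not support conditioning.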
The example above is discrete, so the integral amounts to a weighted sum:

µ_c(True ) = 0.2 * 0.729 = 0.1458
µ_c(False) = 0.8 * 0.099 = 0.0792

The value of µ_c on the entire domain is called the model evidence. Bayesian statisticians regard model evidence close to 0 as indicating a bad fit of the model to the data, and in general they regard the model evidence as a score for the whole model. The model evidence of our example is:

µ_c(Bool) = µ_c(True) + µ_c(False) = 0.1458 + 0.0792 = 0.225
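The arithmetic above can be checked mechanically by exact enumeration. The following is a small, self-contained sketch using a weighted-list monad; the names Enum', bernoulliE, scoreE, and mass are invented here for illustration and are not the library's representation of enumeration.

-- A weighted list of outcomes: each entry carries its unnormalised mass.
newtype Enum' a = Enum' { runEnum :: [(a, Double)] }

instance Functor Enum' where
  fmap f (Enum' xs) = Enum' [ (f x, w) | (x, w) <- xs ]

instance Applicative Enum' where
  pure x = Enum' [(x, 1)]
  Enum' fs <*> Enum' xs = Enum' [ (f x, wf * wx) | (f, wf) <- fs, (x, wx) <- xs ]

instance Monad Enum' where
  Enum' xs >>= k = Enum' [ (y, wx * wy) | (x, wx) <- xs, (y, wy) <- runEnum (k x) ]

bernoulliE :: Double -> Enum' Bool
bernoulliE p = Enum' [(True, p), (False, 1 - p)]

-- Scoring multiplies the weight of the current trace.
scoreE :: Double -> Enum' ()
scoreE w = Enum' [((), w)]

-- The lawn-wet model written directly in the enumeration monad.
modelE :: Enum' Bool
modelE = do
  rain      <- bernoulliE 0.2
  sprinkler <- bernoulliE 0.1
  scoreE (case (rain, sprinkler) of
            (True , True ) -> 0.99
            (True , False) -> 0.70
            (False, True ) -> 0.90
            (False, False) -> 0.01)
  return rain

-- Total unnormalised mass assigned to a value.
mass :: Eq a => a -> Enum' a -> Double
mass x (Enum' xs) = sum [ w | (y, w) <- xs, y == x ]

-- mass True  modelE  ≈ 0.1458
-- mass False modelE  ≈ 0.0792   (model evidence ≈ 0.225, matching the text)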
doi:10.1145/3236778 dblp:journals/pacmpl/ScibiorKG18