One of my computational learning goals for 2019 is probabilistic machine learning. This article provides an introduction to solving a linear regression problem, Bayesian style, with Markov Chain Monte Carlo simulations!
Alert! If you have a basic machine learning background, then I highly recommend reading Prof. Zoubin Ghahramani’s (Director of Uber AI, University of Cambridge) excellent and persuasive review on Probabilistic Machine Learning (Nature 2015), and watching his NIPS talk on the same topic. His presentations made the subject feel less intimidating, and I felt bold enough to finally dive into probabilistic machine learning.
Why Probabilistic Programming?
A model describes data that could be observed from a system, given a set of assumptions. Uncertainty plays a fundamental role in modelling since it is inherent in all data. Sources of uncertainty include:
- choosing the appropriate model given the data, and
- the parameter values that define the model and are used to predict new data.
A well-defined model is one that forecasts or makes predictions about unobserved data after being trained on observed data. Probability theory is the mathematical framework for modelling all forms of uncertainty. Bayesian probabilistic models are used to quantify uncertainty and to incorporate this information during forecasting.
Components of Bayesian Probabilistic Models
Key Point: Instead of the conventional approach of learning single point estimates for the weights that define a model, Bayesian models learn the distribution underlying the weights.
The following are the 3 fundamental components of probabilistic models:
- Model Prior: this is the probability distribution, p(Θ), that represents our beliefs or domain knowledge about the model parameters Θ before taking the observed data into consideration.
- Model Posterior: this is the probability distribution, p(Θ|X), that represents the updated version of the prior after learning from the data (X).
- Likelihood: this is the probability distribution, p(X|Θ), that describes how we think the observed data are distributed for given parameter values.
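The three components are tied together by Bayes' Theorem. Written out in the notation used above (with p(X), the probability of the data averaged over all parameter values, being the only new symbol):

```latex
p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)},
\qquad
p(X) = \int p(X \mid \Theta)\, p(\Theta)\, d\Theta
```

The denominator p(X) does not depend on Θ, so the posterior is proportional to the likelihood times the prior.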
Therefore, probabilistic models allow us to update our beliefs in light of the data. Supervised learning in the Bayesian setting looks like this: p(y|X,Θ), i.e., having seen X, what can we now say about y?
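To make the idea of updating beliefs concrete, here is a minimal sketch that computes a posterior by brute force on a grid of candidate values. Everything in it is an assumption chosen for illustration: the five data points are made up, the model has a single unknown slope `theta` (no intercept), and the noise standard deviation is treated as known. It uses only the Python standard library (`math.prod` needs Python 3.8 or later).

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a Normal(mu, sd) distribution evaluated at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical observations, assumed to follow y = theta * x + noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
noise_sd = 1.0  # assumed known, to keep the example one-dimensional

# Candidate slope values: a grid from -5.00 to 5.00 in steps of 0.01
grid = [i / 100.0 for i in range(-500, 501)]

# Prior p(theta): Normal(0, 2), a vague belief that the slope is near zero
prior = [normal_pdf(t, 0.0, 2.0) for t in grid]

# Likelihood p(X | theta): product of the Normal densities of the residuals
likelihood = [
    math.prod(normal_pdf(y, t * x, noise_sd) for x, y in zip(xs, ys))
    for t in grid
]

# Posterior is proportional to likelihood * prior; normalise over the grid
unnormalised = [lik * pri for lik, pri in zip(likelihood, prior)]
total = sum(unnormalised)
posterior = [u / total for u in unnormalised]

# Summaries of the updated belief about the slope
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
posterior_sd = math.sqrt(
    sum((t - posterior_mean) ** 2 * p for t, p in zip(grid, posterior))
)

print(f"posterior mean of slope: {posterior_mean:.3f}")
print(f"posterior sd of slope:   {posterior_sd:.3f}")
```

The prior is centred on zero with a standard deviation of 2, while the posterior concentrates tightly around the slope suggested by the data. That narrowing is what "learning from the data" means here.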
PyMC3 is a Python package for probabilistic modelling. These examples are mostly from the original PyMC3 article published in PeerJ Computer Science.
The full implementation of my code with a detailed walk-through can be found on my GitHub here.
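For orientation, the linear regression model behind the parameters alpha, beta and sigma can be written out explicitly. The version below assumes the setup of the introductory example in the PyMC3 paper (two predictors and weakly informative priors). Treat the specific priors as an assumption rather than a record of the linked code.

```latex
\begin{aligned}
y &\sim \mathcal{N}(\mu, \sigma^2), \qquad \mu = \alpha + \beta_1 x_1 + \beta_2 x_2 \\
\alpha &\sim \mathcal{N}(0, 10^2), \qquad \beta_i \sim \mathcal{N}(0, 10^2), \qquad \sigma \sim \lvert \mathcal{N}(0, 1) \rvert
\end{aligned}
```

In the notation of the previous section, Θ = (α, β, σ), and the first line defines the likelihood while the second line defines the prior.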
Why we use Markov Chain Monte Carlo
Simulation-based algorithms are necessary because, for many real-world high-dimensional problems, we cannot analytically compute the posterior distribution according to Bayes’ Theorem. Simulation-based sampling methods like Markov Chain Monte Carlo algorithms can instead generate samples from the posterior distribution. There are several different MCMC algorithms, for example Hamiltonian Monte Carlo, Metropolis, NUTS and Slice sampling.
Metropolis-Hastings MCMC is the most basic flavour of MCMC. MH-MCMC samples from a generic target probability distribution by constructing a Markov Chain whose stationary distribution is the target distribution. In this case, the target distribution is the posterior distribution of the unknown model parameters that we want to estimate.
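One way to see why exact computation breaks down: the most naive alternative to sampling is to evaluate the posterior on a grid of candidate values for every parameter. The short sketch below counts how many evaluations that would take, assuming a hypothetical resolution of 100 grid points per parameter.

```python
# Number of posterior evaluations a brute-force grid would need,
# assuming a hypothetical resolution of 100 grid points per parameter.
points_per_parameter = 100

for n_parameters in (1, 2, 3, 10, 50):
    evaluations = points_per_parameter ** n_parameters
    print(f"{n_parameters:>2} parameters -> {evaluations:.1e} evaluations")
```

The count grows exponentially with the number of parameters, which is why sampling methods that spend their effort only where the posterior has mass become the practical choice.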
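To show the mechanics, here is a hand-written random-walk sampler for a linear regression, using only the Python standard library. It is a sketch under stated assumptions, not PyMC3's implementation: the data are simulated with made-up true values, there is a single predictor, the priors are assumed to be Normal(0, 10) for alpha and beta and HalfNormal(1) for sigma, and the proposal is a symmetric Gaussian, so the Hastings correction cancels and this is the plain Metropolis special case.

```python
import math
import random

random.seed(42)

# Simulate hypothetical data from y = alpha + beta * x + Normal(0, sigma) noise
TRUE_ALPHA, TRUE_BETA, TRUE_SIGMA = 1.0, 2.5, 1.0
xs = [random.gauss(0.0, 1.0) for _ in range(100)]
ys = [TRUE_ALPHA + TRUE_BETA * x + random.gauss(0.0, TRUE_SIGMA) for x in xs]


def log_posterior(alpha, beta, sigma):
    """Unnormalised log posterior: log prior + log likelihood."""
    if sigma <= 0.0:
        return -math.inf  # the half-normal prior puts no mass on sigma <= 0
    # Assumed priors: alpha, beta ~ Normal(0, 10); sigma ~ HalfNormal(1)
    log_prior = (
        -0.5 * (alpha / 10.0) ** 2 - 0.5 * (beta / 10.0) ** 2 - 0.5 * sigma ** 2
    )
    # Gaussian likelihood of the residuals (additive constants dropped)
    sse = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    log_lik = -len(xs) * math.log(sigma) - 0.5 * sse / sigma ** 2
    return log_prior + log_lik


def metropolis(n_draws, start, step=0.1):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    current = list(start)
    current_lp = log_posterior(*current)
    draws, accepted = [], 0
    for _ in range(n_draws):
        proposal = [value + random.gauss(0.0, step) for value in current]
        proposal_lp = log_posterior(*proposal)
        # Accept with probability min(1, posterior ratio)
        if random.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        draws.append(tuple(current))  # a rejected proposal repeats the current state
    return draws, accepted / n_draws


draws, acceptance_rate = metropolis(n_draws=10_000, start=(0.0, 0.0, 1.0))
kept = draws[2_000:]  # discard the early draws as burn-in

alpha_mean = sum(d[0] for d in kept) / len(kept)
beta_mean = sum(d[1] for d in kept) / len(kept)
sigma_mean = sum(d[2] for d in kept) / len(kept)

print(f"acceptance rate: {acceptance_rate:.2f}")
print(f"posterior means: alpha={alpha_mean:.2f}, beta={beta_mean:.2f}, sigma={sigma_mean:.2f}")
```

Note that the sampler only ever needs the posterior up to a constant, because the acceptance test uses a ratio of posterior values. This is what lets MCMC sidestep the intractable normalising term in Bayes' Theorem.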
Below is a screenshot from my code, where I used MH-MCMC to estimate the model parameters alpha, beta and sigma.
from Thomas Wiecki’s blog post on MCMC
The basic idea is to sample from the posterior distribution by combining a “random search” (the Monte Carlo aspect) with a mechanism for intelligently “jumping” around, but in a manner that ultimately doesn’t depend on where we started from (the Markov Chain aspect). Hence Markov Chain Monte Carlo methods are memoryless searches performed with intelligent jumps.
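The claim that the result ultimately does not depend on the starting point can be checked with a small experiment. The sketch below runs two random-walk Metropolis chains from very different starting values on a stand-in target (a standard normal, chosen purely for illustration) and compares their means after discarding the early draws. The function names, step size and burn-in length are all assumptions of this example.

```python
import math
import random


def log_target(theta):
    """Log density of a standard normal up to a constant (a stand-in posterior)."""
    return -0.5 * theta * theta


def run_chain(start, n_draws, step, rng):
    """Random-walk Metropolis chain; each move depends only on the current state."""
    theta, draws = start, []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)  # the random "jump"
        log_ratio = log_target(proposal) - log_target(theta)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal  # accept; otherwise stay where we are
        draws.append(theta)
    return draws


rng = random.Random(0)
chain_a = run_chain(start=-20.0, n_draws=20_000, step=2.0, rng=rng)
chain_b = run_chain(start=20.0, n_draws=20_000, step=2.0, rng=rng)

burn_in = 1_000  # draws discarded while each chain "forgets" its start
mean_a = sum(chain_a[burn_in:]) / (len(chain_a) - burn_in)
mean_b = sum(chain_b[burn_in:]) / (len(chain_b) - burn_in)

print(f"chain A (start -20): mean after burn-in = {mean_a:.3f}")
print(f"chain B (start +20): mean after burn-in = {mean_b:.3f}")
```

Both chains start far out in the tails, yet after the burn-in period they describe the same distribution. Running several chains from dispersed starting points like this is also a common sanity check on real models.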
N.B.: Prof. Matthew Heiner (UC Santa Cruz) gives an intuitive breakdown of MCMC in his Bayesian Statistics: Techniques and Models course, which you can audit for free on Coursera.
No U-Turn Sampler (NUTS)
However, PyMC3’s most robust MCMC method is the No-U-Turn Sampler, an extension of Hamiltonian MCMC. The properties of NUTS enable faster convergence on high-dimensional target distributions, especially those with many continuous variables, a situation where older MCMC algorithms work very slowly.
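To give a feel for what NUTS extends, here is a minimal sketch of a plain Hamiltonian Monte Carlo step. It is not NUTS and not PyMC3's implementation: the step size and the number of leapfrog steps are fixed by hand (choosing the trajectory length automatically is precisely what NUTS adds), and the target is a one-dimensional standard normal chosen only for illustration. The key difference from random-walk Metropolis is that proposals follow the gradient of the log density instead of being blind jumps.

```python
import math
import random


def neg_log_target(theta):
    """Negative log density of a standard normal, up to a constant."""
    return 0.5 * theta * theta


def grad_neg_log_target(theta):
    """Gradient of neg_log_target with respect to theta."""
    return theta


def hmc_step(theta, step_size, n_leapfrog, rng):
    """One HMC transition: simulate a trajectory, then accept or reject it."""
    momentum = rng.gauss(0.0, 1.0)  # fresh random momentum for each transition
    start_energy = neg_log_target(theta) + 0.5 * momentum ** 2

    # Leapfrog integration of the Hamiltonian dynamics
    new_theta, p = theta, momentum
    p -= 0.5 * step_size * grad_neg_log_target(new_theta)  # half step for momentum
    for i in range(n_leapfrog):
        new_theta += step_size * p  # full step for position
        if i < n_leapfrog - 1:
            p -= step_size * grad_neg_log_target(new_theta)  # full step for momentum
    p -= 0.5 * step_size * grad_neg_log_target(new_theta)  # final half step

    # Metropolis correction for the integration error
    end_energy = neg_log_target(new_theta) + 0.5 * p ** 2
    if rng.random() < math.exp(min(0.0, start_energy - end_energy)):
        return new_theta, True
    return theta, False


rng = random.Random(1)
theta, draws, n_accepted = 0.0, [], 0
for _ in range(5_000):
    theta, accepted = hmc_step(theta, step_size=0.2, n_leapfrog=10, rng=rng)
    draws.append(theta)
    n_accepted += int(accepted)

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
acceptance_rate = n_accepted / len(draws)

print(f"acceptance rate: {acceptance_rate:.2f}")
print(f"sample mean: {sample_mean:.3f}, sample variance: {sample_var:.3f}")
```

Because each proposal travels a long way along the density while keeping the energy nearly constant, almost every proposal is accepted and successive draws are far less correlated than with a random walk. That property is what carries over to high-dimensional posteriors.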
Compare the NUTS-MCMC performance below with the Metropolis-Hastings.
Even if you don’t know exactly what you should look out for when comparing these two methods, you can intuit that the NUTS plots look better than the Metropolis-Hastings ones.
In a follow-up article I will explain in more detail how to evaluate the performance of Markov Chains in order to find the best model parameters.
My primary learning resources for theory are:
- Machine Learning: A Probabilistic Perspective by Kevin Murphy (e-book, effective so far),
- Course Notes for Bayesian Models for Machine Learning by John Paisley (Columbia University, google it),
- Information Theory, Inference, and Learning Algorithms by David MacKay (e-book, a more intense read but worthwhile so far).