# Bayesian Inference for Big Data with Stochastic Gradient Markov Chain Monte Carlo

Lead Research Organisation: University of Oxford

Department Name: Statistics

### Abstract

We are in the midst of an information revolution, in which advances in science and technology, as well as the day-to-day operation of successful organisations and businesses, are increasingly reliant on the analysis of data. Driving these advances is a deluge of data that is far outstripping the increase in available computational power. The importance of managing, analysing, and deriving useful understanding from such large-scale data is highlighted by high-profile reports from McKinsey and The Economist, among other outlets, and by the EPSRC's recent ICT priority of "Towards an Intelligent Information Infrastructure".

Bayesian analysis is one of the most successful families of methods for analysing data, and one now widely adopted in the statistical sciences as well as in AI technologies like machine learning. The Bayesian approach offers a number of attractive advantages over other methods: flexibility in constructing complex models from simple parts; fully coherent inferences from data; natural incorporation of prior knowledge; explicit modelling assumptions; precise reasoning about uncertainties over model order and parameters; and protection against overfitting.

On the other hand, there is a general perception that Bayesian methods can be too slow to be practically useful on big data sets. This is because exact Bayesian computations are typically intractable, so a range of more practical approximate algorithms are needed, including variational approximations, sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). MCMC methods arguably form the most popular class of Bayesian computational techniques, due to their flexibility, general applicability and asymptotic exactness. Unfortunately, MCMC methods do not scale well to big data sets, since they require many iterations to reduce Monte Carlo noise, and each iteration already involves an expensive sweep through the whole data set.

In this project we propose to develop the theoretical foundations for a new class of MCMC inference procedures that can scale to billions of data items, thus unlocking the strengths of Bayesian methods for big data. The basic idea is to use a small subset of the data during each parameter update iteration of the algorithm, so that many iterations can be performed cheaply. This introduces excess stochasticity into the algorithm, which can be controlled by annealing the update step sizes towards zero as the number of iterations increases. The resulting algorithm is a cross between an MCMC algorithm and a stochastic optimization algorithm. We recently initiated an exploration of this procedure, which we call stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, ICML 2011).
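To make the idea concrete, here is a minimal sketch of an SGLD update on a toy one-dimensional Gaussian model: each iteration uses a small random minibatch in place of a full sweep through the data, and the step size is annealed towards zero with a polynomial schedule. The model, the schedule constants `a`, `b`, `gamma`, and the function name `sgld` are illustrative assumptions for this sketch, not the project's actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with conjugate prior theta ~ N(0, 10),
# so the exact posterior is available for comparison.
N, theta_true = 10_000, 2.0
x = rng.normal(theta_true, 1.0, size=N)

def sgld(x, n_iters=5_000, batch=100, a=1e-4, b=10.0, gamma=0.55):
    """One-dimensional SGLD: minibatch gradients plus injected Gaussian
    noise, with step sizes annealed as eps_t = a * (b + t)^(-gamma)."""
    N = len(x)
    theta, samples = 0.0, np.empty(n_iters)
    for t in range(n_iters):
        eps = a * (b + t) ** (-gamma)            # annealed step size
        idx = rng.integers(0, N, size=batch)     # cheap minibatch, no full sweep
        grad_log_prior = -theta / 10.0           # d/dtheta log N(theta; 0, 10)
        grad_log_lik = (N / batch) * np.sum(x[idx] - theta)  # rescaled minibatch gradient
        theta += (0.5 * eps * (grad_log_prior + grad_log_lik)
                  + rng.normal(0.0, np.sqrt(eps)))  # Langevin noise ~ N(0, eps)
        samples[t] = theta
    return samples

samples = sgld(x)
exact_mean = x.sum() / (N + 0.1)  # exact conjugate posterior mean
print(f"SGLD mean: {samples[2000:].mean():.3f}, exact: {exact_mean:.3f}")
```

Each update costs O(batch) rather than O(N), which is the source of the speed-up; the injected noise of variance eps is what turns the stochastic gradient step into an (approximate) posterior sampler rather than an optimizer.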

Our proposal is to lay the mathematical foundations for understanding the theoretical properties of such stochastic MCMC algorithms, and to build on these foundations to develop more sophisticated algorithms. We aim to understand the conditions under which the algorithm is guaranteed to converge, and the type and speed of convergence. Using this understanding, we aim to develop algorithmic extensions and generalizations with better convergence properties, including preconditioning, adaptive and Riemannian methods, Hamiltonian Monte Carlo methods, online Bayesian learning methods, and approximate methods with large step sizes. These algorithms will be empirically validated on real-world problems, including large-scale data analysis problems for text processing and collaborative filtering, which are standard problems in machine learning, and large-scale data from ID Analytics, a partner company interested in detecting identity theft and fraud.


### People

Arnaud Doucet (Principal Investigator)

### Publications

- Bardenet R. (2014) *Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach*, in 31st International Conference on Machine Learning (ICML 2014)
- Bardenet R. (2017) *On Markov chain Monte Carlo methods for tall data*, in Journal of Machine Learning Research
- Bishop A. (2014) *Distributed Nonlinear Consensus in the Space of Probability Measures*, in IFAC Proceedings Volumes
- Bouchard-Côté A. (2018) *The Bouncy Particle Sampler: A Nonreversible Rejection-Free Markov Chain Monte Carlo Method*, in Journal of the American Statistical Association
- Del Moral P. (2015) *Uniform Stability of a Particle Approximation of the Optimal Filter Derivative*, in SIAM Journal on Control and Optimization
- Hasenclever L. *Distributed Bayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server*, in Journal of Machine Learning Research
- Lienart T. (2015) *Expectation particle belief propagation*, in Advances in Neural Information Processing Systems
- Paige B. (2014) *Asynchronous anytime sequential Monte Carlo*, in Advances in Neural Information Processing Systems
- Rainforth T. (2016) *Interacting particle Markov chain Monte Carlo*, in 33rd International Conference on Machine Learning (ICML 2016)
- Yildirim S. (2013) *An Online Expectation-Maximization Algorithm for Changepoint Models*, in Journal of Computational and Graphical Statistics

Description: We are studying sophisticated new statistical methods for analysing big data sets. Current methods are very computationally intensive and do not scale to big data. We are developing scalable yet sophisticated techniques for extracting useful information from massive datasets.

Exploitation Route: There is still considerable room for improvement, both methodologically and theoretically, so over the forthcoming year we expect to develop our new algorithms further.

Sectors: Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Electronics; Security and Diplomacy

URL: http://www.stats.ox.ac.uk/~doucet/journalsbysubject.html