# Bayesian Inference for Big Data with Stochastic Gradient Markov Chain Monte Carlo

Lead Research Organisation:
University of Oxford

Department Name: Statistics

### Abstract

We are in the midst of an information revolution, where advances in science and technology, as well as the day-to-day operation of successful organisations and businesses, are increasingly reliant on the analyses of data. Driving these advances is a deluge of data, which is far outstripping the increase in computational power available. The importance of managing, analysing, and deriving useful understanding from such large scale data is highlighted by high-profile reports by McKinsey and The Economist as well as other outlets, and by the EPSRC's recent ICT priority of "Towards an Intelligent Information Infrastructure".

Bayesian analysis is one of the most successful family of methods for analysing data, and one now widely adopted in the statistical sciences as well as in AI technologies like machine learning. The Bayesian approach offers a number of attractive advantages over other methods: flexibility in constructing complex models from simple parts; fully coherent inferences from data; natural incorporation of prior knowledge; explicit modelling assumptions; precise reasoning of uncertainties over model order and parameters; and protection against overfitting.

On the other hand, there is a general perception that they can be too slow to be practically useful on big data sets. This is because exact Bayesian computations are typically intractable, so a range of more practical approximate algorithms are needed, including variational approximations, sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). MCMC methods arguably form the most popular class of Bayesian computational techniques, due to their flexibility, general applicability and asymptotic exactness. Unfortunately, MCMC methods do not scale well to big data sets, since they require many iterations to reduce Monte Carlo noise, and each iteration already involves an expensive sweep through the whole data set.

In this project we propose to develop the theoretical foundations for a new class of MCMC inference procedures that can scale to billions of data items, thus unlocking the strengths of Bayesian methods for big data. The basic idea is to use a small subset of the data during each parameter update iteration of the algorithm, so that many iterations can be performed cheaply. This introduces excess stochasticity in the algorithm, which can be controlled by annealing the update step sizes towards zero as the number of iterations increases. The resulting algorithm is a cross between an MCMC and a stochastic optimization algorithm. An initial exploration of this procedure, which we call stochastic gradient Langevin dynamics (SGLD), was initiated by us recently (Welling and Teh, ICML 2011).

Our proposal is to lay the mathematical foundations for understanding the theoretical properties of such stochastic MCMC algorithms, and to build on these foundations to develop more sophisticated algorithms. We aim to understand the conditions under which the algorithm is guaranteed to converge, and the type and speed of convergence. Using this understanding, we aim to develop algorithmic extensions and generalizations with better convergence properties, including preconditioning, adaptive and Riemannian methods, Hamiltonian Monte Carlo methods, Online Bayesian learning methods, and approximate methods with large step sizes. These algorithms will be empirically validated on real world problems, including large scale data analysis problems for text processing and collaborative filtering which are standard problems in machine learning, and large scale data from ID Analytics, a partner company interested in detecting identity theft and fraud.

Bayesian analysis is one of the most successful family of methods for analysing data, and one now widely adopted in the statistical sciences as well as in AI technologies like machine learning. The Bayesian approach offers a number of attractive advantages over other methods: flexibility in constructing complex models from simple parts; fully coherent inferences from data; natural incorporation of prior knowledge; explicit modelling assumptions; precise reasoning of uncertainties over model order and parameters; and protection against overfitting.

On the other hand, there is a general perception that they can be too slow to be practically useful on big data sets. This is because exact Bayesian computations are typically intractable, so a range of more practical approximate algorithms are needed, including variational approximations, sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). MCMC methods arguably form the most popular class of Bayesian computational techniques, due to their flexibility, general applicability and asymptotic exactness. Unfortunately, MCMC methods do not scale well to big data sets, since they require many iterations to reduce Monte Carlo noise, and each iteration already involves an expensive sweep through the whole data set.

In this project we propose to develop the theoretical foundations for a new class of MCMC inference procedures that can scale to billions of data items, thus unlocking the strengths of Bayesian methods for big data. The basic idea is to use a small subset of the data during each parameter update iteration of the algorithm, so that many iterations can be performed cheaply. This introduces excess stochasticity in the algorithm, which can be controlled by annealing the update step sizes towards zero as the number of iterations increases. The resulting algorithm is a cross between an MCMC and a stochastic optimization algorithm. An initial exploration of this procedure, which we call stochastic gradient Langevin dynamics (SGLD), was initiated by us recently (Welling and Teh, ICML 2011).

Our proposal is to lay the mathematical foundations for understanding the theoretical properties of such stochastic MCMC algorithms, and to build on these foundations to develop more sophisticated algorithms. We aim to understand the conditions under which the algorithm is guaranteed to converge, and the type and speed of convergence. Using this understanding, we aim to develop algorithmic extensions and generalizations with better convergence properties, including preconditioning, adaptive and Riemannian methods, Hamiltonian Monte Carlo methods, Online Bayesian learning methods, and approximate methods with large step sizes. These algorithms will be empirically validated on real world problems, including large scale data analysis problems for text processing and collaborative filtering which are standard problems in machine learning, and large scale data from ID Analytics, a partner company interested in detecting identity theft and fraud.

### Planned Impact

The Bayesian approach to data analysis offers significant advantages over conventional techniques. It has become very popular over the past twenty years thanks to Markov chain Monte Carlo (MCMC) algorithms which allow us to carry out Bayesian computation for complex models. Unfortunately, current MCMC methods do not scale to big data sets and consequently Bayesian techniques are almost never used in a data rich environment. At a time where the amount of data available is growing exponentially fast, this project proposes to unlock the strengths of Bayesian analysis for big data by developing new MCMC-type algorithms that can scale to billions of data items.

The methodologies developed in this proposal will allow the development of Bayesian approaches to large scale data analysis problems now frequently encountered in healthcare, industries and services to the general public. In the medium term, the applications to collaborative filtering, text processing, and to the data of our industrial partner ID Analytics will benefit the general public with more powerful search engines, recommender systems, better credit card scoring techniques and improved methods for identity fraud detection. Other industrial beneficiaries include business analytics, finance, pharmaceuticals, and the defence industry, where the developments of this proposal will have a significant impact on how large scale data can be analysed.

The longer term benefits of this project are also closely linked to the RCUK "Digital Economy" programme. For example the "digital hospital" component of this programme involves the real-time accurate data fusion and tracking of patients. It could directly benefit from the techniques we aim to develop in this proposal.

The methodologies developed in this proposal will allow the development of Bayesian approaches to large scale data analysis problems now frequently encountered in healthcare, industries and services to the general public. In the medium term, the applications to collaborative filtering, text processing, and to the data of our industrial partner ID Analytics will benefit the general public with more powerful search engines, recommender systems, better credit card scoring techniques and improved methods for identity fraud detection. Other industrial beneficiaries include business analytics, finance, pharmaceuticals, and the defence industry, where the developments of this proposal will have a significant impact on how large scale data can be analysed.

The longer term benefits of this project are also closely linked to the RCUK "Digital Economy" programme. For example the "digital hospital" component of this programme involves the real-time accurate data fusion and tracking of patients. It could directly benefit from the techniques we aim to develop in this proposal.

## People |
## ORCID iD |

Yee Whye Teh (Principal Investigator) |

### Publications

Hasenclever L
(2017)

*Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server*in Journal of Machine Learning Research
Jacob P
(2018)

*Bayesian Inference in Non-Markovian State-Space Models With Applications to Battery Fractional-Order Systems*in IEEE Transactions on Control Systems Technology
Jacob P
(2015)

*On nonnegative unbiased estimators*in The Annals of Statistics
Jacob P
(2013)

*Path storage in the particle filter*in Statistics and Computing
Jacob P
(2014)

*The Wang-Landau algorithm reaches the flat histogram criterion in finite time*in The Annals of Applied Probability
Lakshminarayanan B
(2015)

*Particle Gibbs for Bayesian additive regression trees*
Lakshminarayanan B
(2016)

*Mondrian Forests for Large-Scale Regression when Uncertainty Matters*
Lienart T
(2015)

*Expectation Particle Belief Propagation*
Lu X
(2017)

*Relativistic Monte Carlo*Description | We have met the original objectives of the project. The major question we set out to investigate on the first phase of this grant is on the theoretical foundations of stochastic gradient MCMC methods for Bayesian inference on Big Data problems. A first answer is a recently accepted paper (Teh, Thiery, Vollmer, JMLR 2016) at the Journal of Machine Learning Research (JMLR), the top journal in machine learning. In that paper we showed that SGLD is indeed a consistent method, and developed a central limit theorem which indicates that the rate of convergence is O(m^{-1/3}) where m is the number of iterations of the algorithm. This is slower than standard Monte Carlo convergence rates of O(m^{-1/2}). We are following this result up with a finite time analysis of SGLD without decreasing step size, where we found the same convergence rate of O(m^{-1/3}). This has now been accepted as a second paper (Vollmer, Zygalakis, Teh, JMLR 2016). Our postdoc Vollmer has continued working in the direction and has a number of follow up works (not funded by this project). Vollmer is now an associate professor at Warwick Statistics and Turing. We have published an extension of SGLD (Patterson and Teh, NIPS 2013) taking into account Riemannian manifold structure of probabilistic models at Neural Information Processing Systems 2013, the premier international conference in machine learning. This allowed SGLD type inference algorithms to be applicable to a large class of models defined over probability simplices. A second extension develops Relativistic Monte Carlo (Lu et al, AISTATS 2016) which imposes a maximum speed (of light) on updates to parameters, which is important in applications in deep learning. Recently we have developed a new class of optimization methods called Hamiltonian descent (Maddison et al, 2018, 2019, both on arxiv) which shed light on why relativistic methods work well in practice. We have also developed methods for Bayesian inference for Big Data using distributed and parallel computing architectures. This culminated in two recent papers (Xu et al and Paige et al, both at NIPS 2014). We have extended (Xu et al), where we developed a novel alternative to EP which is more robust to Monte Carlo errors in moment estimates, and a system for distributed learning which we call the Posterior Server, which was accepted at Journal of Machine Learning Research (Hasenclever et al, JMLR 2017). In addition a number of side projects investigated computational methods based on sequential Monte Carlo and Markov chain Monte Carlo and have been published as well, and we have also worked on scalable Bayesian learning methods based on Mondrian forests (Lakshminarayanan et al NIPS 2014, AISTATS 2016, UAI 2016). |

Exploitation Route | Development of Bayesian inference algorithms that are scalable are hugely important in scaling up the powerful methods for data analysis based on Bayesian statistics to the age of Big Data. The project partners held a project meeting in August 2014, with a summary appearing as an invited letter to the President's Newsletter of the International Society for Bayesian Analysis. We have also followed this meeting with another one in summer of 2015 in Oxford, and a NIPS workshop in December 2015. All workshops were well-attended and instigated interesting discussions both among research project members as well as others. The ideas first explored in this project have blossomed in the scientific community since the project ended, demonstrating the impact the project has had. A number of research groups around the world have built on our work and developed a number of sophisticated extensions. This includes the stochastic gradient Hamilitonian Monte Carlo method (Chen et al ICML 2014) and the stochastic gradient thermostat (Ding et al NIPS 2014, Shang & Leimkuhler 2015), as well as many superb theoretical analyses building on our work. The notion that we can link optimization methods and Monte Carlo methods via a study of randomness injected into the system, along with studying such systems in continuous time, are now standard in the community. Our original paper on the SGLD (Welling & Teh 2011) has garnered over 800 citations on Google Scholar, demonstrating the impact and excitement around research in this area, while the papers which are outcomes of this project has garnered over 300 citations. Over the last year there has been more than 300 citations to our associated papers on Google Scholar, with contributions from many leading research groups across the world. |

Sectors | Digital/Communication/Information Technologies (including Software) |

URL | http://mlcs.stats.ox.ac.uk/projects/sgmcmc/ |

Description | SGMCMC |

Organisation | University of California, Irvine |

Department | Donald Bren School of Information and Computer Sciences (ICS) |

Country | United States |

Sector | Academic/University |

PI Contribution | This is a joint NSF-EPSRC funded project, with US participants based at UCI and Amsterdam funded by NSF, and Oxford and Bristol participants funded by EPSRC. We have been working towards methods for scaling up Bayesian inference using Markov chain Monte Carlo to big data settings. The partnership have provided a forum to exchange ideas on this topic. We have organised a series of 3 workshops (one in Amsterdam, one in Oxford and one at Neural Information Processing Systems (Montreal, Canada). |

Collaborator Contribution | See above. |

Impact | We have organised a series of 3 workshops (one in Amsterdam, one in Oxford and one at Neural Information Processing Systems (Montreal, Canada). Workshop websites: https://github.com/BigBayes/bigbayes.github.io/wiki/BIBiD-2015 http://babaks.github.io/ScalableMonteCarlo/ |

Start Year | 2013 |

Title | Expectation Particle Belief Propagation (EPBP) |

Description | Research software for approximate Bayesian inference in graphical models based on EPBP (Lienart NIPS 2015). Software written in Julia. |

Type Of Technology | Software |

Year Produced | 2015 |

Impact | None. |

URL | https://github.com/tlienart/EPBP |

Title | PGBart |

Description | Software used in the paper "Particle Gibbs for Bayesian Additive Regression Trees" by Lakshminarayanan, Roy and Teh (AISTATS 2015). |

Type Of Technology | Software |

Year Produced | 2015 |

Impact | n/a |

URL | https://github.com/balajiln/pgbart |

Title | Posterior Server and SNEP |

Description | This Julia code implements stochastic natural gradient expectation propagation, a novel algorithm for distributed Bayesian learning (see http://arxiv.org/abs/1512.09327). In addition, there is code for various SGD methods such as asynchronous SGD (https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf) and elastic averaging SGD (https://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf). The code is research code that is fairly modular to allow testing on various different models. It is not production code meant to compete eg with standard neural network packages. We believe it should be possible to implement our algorithm much more efficiently in such a package. |

Type Of Technology | Software |

Year Produced | 2016 |

Open Source License? | Yes |

Impact | None yet. Software associated with paper that was just accepted. |

URL | https://github.com/BigBayes/PosteriorServer |

Title | SGMCMC.jl |

Description | SGMCMC.jl is a Julia test bed for stochastic gradient Markov chain Monte Carlo algorithms. There is a large range of SGMCMC algorithms but it can be difficult for practitioners to get a feel for different algorithms. We provide a simple package to try out different samplers on commonly used models. SGMCMC.jl is can also speed up experiments for researchers working on SGMCMC. It is simple to define new samplers and test them against existing ones on a number of commonly used models. |

Type Of Technology | Software |

Year Produced | 2016 |

Open Source License? | Yes |

Impact | None yet. Just released. |

URL | https://github.com/BigBayes/SGMCMC.jl |

Title | Sampling via Moment Sharing (SMS) |

Description | Method for performing Markov chain Monte Carlo in a distributed networked setting. Development software based on MATLAB for research purposes only. |

Type Of Technology | Software |

Year Produced | 2014 |

Open Source License? | Yes |

Impact | None. |

URL | https://github.com/BigBayes/SMS |

Title | Stochastic Gradient Riemannian Langevin Dynamics |

Description | The Python code implements the stochastic gradient Riemannian Langevin dynamics (SGRLD) algorithm presented in the paper "Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex" by Sam Patterson and Yee Whye Teh at NIPS 2013. |

Type Of Technology | Software |

Year Produced | 2013 |

Open Source License? | Yes |

Impact | Our software was released as research software (rather than production software) for transparency and to help the research community build on and make use of our EPSRC funded work. We have not noticed notable impacts besides other researchers using our software and citing our work. |

URL | https://github.com/BigBayes/SGRLD |