Big Hypotheses: A Fully Parallelised Bayesian Inference Solution

Lead Research Organisation: University of Liverpool

Department Name: Electrical Engineering and Electronics

Abstract

Bayesian inference is a process which allows us to extract information from data. The process uses prior knowledge articulated as statistical models for the data. We are focused on developing a transformational solution to Data Science problems that can be posed as such Bayesian inference tasks.

An existing family of algorithms, Markov chain Monte Carlo (MCMC), offers impressive accuracy but at a significant computational cost. For a significant subset of the users of Data Science that we interact with, while the accuracy offered by MCMC is recognised as potentially transformational, the computational load is just too great for MCMC to be a practical alternative to existing approaches. These users include academics working in science (e.g., Physics, Chemistry, Biology and the social sciences) as well as government and industry (e.g., in the pharmaceutical, defence and manufacturing sectors). The problem is then how to make the accuracy offered by MCMC accessible at a fraction of the computational cost.

The solution we propose is based on replacing MCMC with a more recently developed family of algorithms, Sequential Monte Carlo (SMC) samplers. While MCMC, at its heart, manipulates a single sampling process, SMC samplers are inherently population-based algorithms that manipulate a population of samples. This makes SMC samplers well suited to implementations that exploit parallel computational resources. It is therefore possible to use emerging hardware (e.g., Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and Intel's Xeon Phis as well as High Performance Computing (HPC) clusters) to make SMC samplers run faster. Indeed, our recent work (which has had to remove some algorithmic bottlenecks before making the progress we have achieved) has shown that SMC samplers can offer accuracy similar to MCMC but with implementations that are better suited to such emerging hardware.
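
As an illustration of the population-based structure described above, the following is a minimal sketch of a tempered SMC sampler in Python. It is not the project's implementation: the toy model (a standard normal prior with Gaussian observations), the function names, the linear tempering schedule, the resampling threshold and the tuning values are all assumptions chosen for brevity.

```python
import math
import random


def log_prior(x):
    # Assumed toy prior: standard normal, up to an additive constant.
    return -0.5 * x * x


def log_likelihood(x, data, noise_sd=1.0):
    # Assumed toy model: independent Gaussian observations with mean x.
    return sum(-0.5 * ((y - x) / noise_sd) ** 2 for y in data)


def normalise(log_w):
    # Convert log-weights to normalised weights without overflow.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def smc_sampler(data, n_particles=500, n_steps=20, step_sd=0.5, seed=0):
    rng = random.Random(seed)
    # The initial population is drawn from the prior.
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    log_w = [0.0] * n_particles
    weights = [1.0 / n_particles] * n_particles
    # Hypothetical linear schedule: beta = 0 is the prior, beta = 1 the posterior.
    betas = [t / n_steps for t in range(n_steps + 1)]

    for beta_prev, beta in zip(betas, betas[1:]):
        # Reweight: each particle is updated independently of the others.
        log_w = [lw + (beta - beta_prev) * log_likelihood(x, data)
                 for lw, x in zip(log_w, particles)]
        weights = normalise(log_w)

        # Resample when the effective sample size drops below half the population.
        ess = 1.0 / sum(w * w for w in weights)
        if ess < n_particles / 2:
            particles = rng.choices(particles, weights=weights, k=n_particles)
            log_w = [0.0] * n_particles
            weights = [1.0 / n_particles] * n_particles

        # Move: one random-walk Metropolis step per particle, targeting
        # prior(x) * likelihood(x) ** beta. Again independent per particle.
        moved = []
        for x in particles:
            proposal = x + rng.gauss(0.0, step_sd)
            log_ratio = (log_prior(proposal) + beta * log_likelihood(proposal, data)
                         - log_prior(x) - beta * log_likelihood(x, data))
            accept = rng.random() < math.exp(min(0.0, log_ratio))
            moved.append(proposal if accept else x)
        particles = moved

    return particles, weights
```

Calling `smc_sampler([1.8, 2.2, 2.0, 1.9, 2.1])` returns the particles and their normalised weights; the weighted average of the particles then estimates the posterior mean. In this sketch the reweight and move steps touch each particle independently, which is what makes them natural candidates for parallel hardware, whereas resampling requires communication across the whole population.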

The benefits of using an SMC sampler in place of MCMC go beyond those made possible by simply posing a (tough) parallel computing challenge. The parameters of an MCMC algorithm necessarily differ from those of an SMC sampler. These differences offer opportunities for SMC samplers to be developed in directions that are not possible with MCMC. For example, SMC samplers, in contrast to MCMC algorithms, can be configured to exploit a memory of their historic behaviour and can be designed to smoothly transition between problems. It seems likely that by exploiting such opportunities, we will generate SMC samplers that can outperform MCMC even more than is possible by using parallelised implementations alone.

Our interactions with users, our experience of parallelising SMC samplers and the preliminary results we have obtained when comparing SMC samplers and MCMC make us excited about the potential that SMC samplers offer as a "New Approach for Data Science".

Our current work has only begun to explore the potential offered by SMC samplers. We perceive that significant benefit could result from a larger programme of work that helps us understand the extent to which users will benefit from replacing MCMC with SMC samplers. We propose a programme of work that combines a focus on users' problems with a systematic investigation into the opportunities offered by SMC samplers.

Our strategy for achieving impact comprises multiple tactics. Specifically, we will: use identified users to act as "evangelists" in each of their domains; work with our hardware-oriented partners to produce high-performance reference implementations; engage with the developer team for Stan (the most widely used generic MCMC implementation); and work with the Industrial Mathematics Knowledge Transfer Network and the Alan Turing Institute to engage with both users and other algorithmic developers.

Planned Impact

The core of our strategy for achieving impact is to work with each of a number of users to improve their ability to access MCMC's accuracy at a tiny fraction of its computational cost. The potential impacts associated with those specific users are summarised as follows (listed in the (feature-oriented) order used in Table 1 of the case for support):
1. An improved ability to profile offenders would result in the National Crime Agency (and the other international law enforcement agencies who are using the profiling tool developed at the University of Liverpool) protecting more children and prosecuting more sex offenders;
2. An improved ability to use nuclear forensics to understand the source of nuclear material being smuggled into the UK would reduce the threat to the safety of the UK that is posed by international terrorism;
3. A rapid and improved ability to characterise crystals from powder diffraction data would promote a step-change in how new materials are discovered and applied across many industries (from the design of improved batteries to the formulation of new drugs, as relevant to EPSRC's Directed Assembly and Dial-a-molecule grand challenges for Chemical sciences and engineering);
4. An improved estimate of the source of a chemical or biological agent's release would help save the lives of those soldiers who would otherwise be exposed to the agent;
5. An improved ability to handle missing data when analysing results from clinical trials would maximise the information extracted from people undertaking the trials, thereby resulting in safer drugs which have greater efficacy;
6. An improved ability to analyse quantum field theory by first-principles computational methods would transform our ability to understand the fabric of the Universe (e.g., the dark matter and dark energy that we know so little about yet together make up roughly 95% of that Universe);
7. An improved ability to combine expert knowledge and experimental data would enable the manufacturing industry to explore the formulation space and thereby manufacture products that are more attractive to consumers;
8. An improved ability to quantify protein abundance would have a dramatic impact on our understanding of biological systems (e.g., mankind).

We will help these users to be "evangelists" in each of their domains, and will work with Stan's developer team, the Industrial Mathematics Knowledge Transfer Network and the Alan Turing Institute to engage new users (and other algorithmic developers). We anticipate further impact will result from this indirect engagement. We now summarise a generic view of such impact.

We have been working with clusters of up to 130,000 cores. To provide some context, there are 86,400 seconds in a day, so if we can fully capitalise on the processing power of such a cluster, tasks that might historically have taken 6 months on one core can run in just 3 minutes. The users we have engaged with, and continue to engage with, see that kind of speed-up as transformational: problems that are currently beyond reach would be solved routinely. We believe other users will have a similar view. We hope to change academic users' abilities to conduct their science and to change industrial users' abilities to conduct their business.
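
The arithmetic behind that estimate can be checked with a short Python sketch. It assumes perfect linear scaling and takes "6 months" to mean 182.5 days; both are simplifications introduced here. The second function applies Amdahl's law, which is not mentioned in the text above and is included only to show how sensitive the figure is to any part of the work that cannot be parallelised. The function names and the example parallel fraction are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def ideal_runtime_seconds(serial_days, cores):
    # Runtime if the work divides perfectly across all cores (an idealisation).
    return serial_days * SECONDS_PER_DAY / cores


def amdahl_runtime_seconds(serial_days, cores, parallel_fraction):
    # Amdahl's law: only the parallel fraction of the work is sped up.
    serial_seconds = serial_days * SECONDS_PER_DAY
    return serial_seconds * ((1.0 - parallel_fraction) + parallel_fraction / cores)


if __name__ == "__main__":
    ideal = ideal_runtime_seconds(182.5, 130_000)
    print(f"Ideal scaling: {ideal / 60:.1f} minutes")
    # Hypothetical case in which 99.9% of the work parallelises.
    partial = amdahl_runtime_seconds(182.5, 130_000, 0.999)
    print(f"99.9% parallel: {partial / 3600:.1f} hours")
```

Under ideal scaling the runtime comes out at about two minutes, the same order of magnitude as the figure quoted above. The Amdahl case shows why removing serial bottlenecks matters: even a very small non-parallel fraction dominates the runtime at this core count.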

We do recognise that, in some users' contexts, it is possible to develop problem-specific parallel implementations of MCMC (e.g., using Spark) that give rise to such speed-ups without the need for the parallelised SMC samplers that we plan to develop. However, the scope to exploit the very nature of SMC samplers to get additional computational gains beyond those associated with parallel implementation makes us optimistic that we will develop a truly game-changing capability to solve Bayesian inference tasks.

Our (overtly ambitious) hope is that this project's legacy will be that we live in a future where MCMC is no longer deemed to be the state-of-the-art and SMC samplers are the de facto standard for numerical Bayesian inference.

Funded Value:

£2,557,654

Funded Period:

Apr 18 - Mar 24

Funder:

EPSRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

EP/R018537/1

Principal Investigator:

Simon Maskell

Research Subject:

Info. & commun. Technol. (60%)

Mathematical sciences (40%)

Research Topic:

Fundamentals of Computing (60%)

Statistics & Appl. Probability (40%)

Organisations

People	ORCID iD
Simon Maskell (Principal Investigator)
Alejandro Diaz (Co-Investigator)
Robin Pinning (Co-Investigator)
Jeyan Thiyagalingam (Co-Investigator)
Peter Green (Co-Investigator)
Andrew Jones (Co-Investigator)
Samantha Yu-Ling Chong (Co-Investigator)	http://orcid.org/0000-0002-3095-875X
Laurence Alison (Co-Investigator)
Kurt Langfeld (Co-Investigator)	http://orcid.org/0000-0002-4368-3580

Publications

Chatzopoulou A (2021) SMC samplers for Bayesian Optimisation and Discovery of Additive Kernel Structure

Green P (2022) Increasing the efficiency of Sequential Monte Carlo samplers through the use of approximately optimal L-kernels in Mechanical Systems and Signal Processing

Maskell S (2022) Control Variates for Constrained Variables in IEEE Signal Processing Letters

Moore RE (2022) Refining epidemiological forecasts with simple scoring rules in Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences

Rosato C (2022) Efficient Learning of the Parameters of Non-Linear Models Using Differentiable Resampling in Particle Filters in IEEE Transactions on Signal Processing

Rosato C (2023) Extracting Self-Reported COVID-19 Symptom Tweets and Twitter Movement Mobility Origin/Destination Matrices to Inform Disease Models in Information

Varsi A (2021) An O(log₂N) Fully-Balanced Resampling Algorithm for Particle Filters on Distributed Memory Architectures in Algorithms

Varsi A (2020) A Fast Parallel Particle Filter for Shared Memory Systems in IEEE Signal Processing Letters

Wu J (2022) Ensemble Kalman filter based sequential Monte Carlo sampler for sequential Bayesian inference in Statistics and Computing

Key Findings
Impact Summary
Policy Influence
Further Funding
Collaboration


Description	We've been successful in demonstrating an ability to outperform the state-of-the-art in terms of time (by exploiting parallel processing resources) as well as per unit of computation. The second of these two is particularly important since it has implications for how a team (albeit of computers at this point) might be able to outperform an individual given the same total amount of effort. This appears to be a result of an individual necessarily being sufficiently cautious that it can't go wrong, whereas a team only has to be cautious enough that all individuals don't simultaneously go wrong.
Exploitation Route	We are working with our project partners (and co-investigators working in scientific disciplines) to see this work taken forwards.
Sectors	Aerospace, Defence and Marine; Chemicals; Creative Economy; Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology; Retail; Security and Diplomacy; Transport


Description	Our work has led to the use, since February 2020, of a specific Bayesian model as one of the portfolio of models used by UK government to calculate the COVID-19 R-number.
First Year Of Impact	2020
Sector	Healthcare
Impact Types	Policy & public services


Description	Calculation of COVID R-number for UK government
Geographic Reach	National
Policy Influence Type	Participation in a guidance/advisory committee
Impact	The calculation of the R-number has been a critical input to UK government's decision making regarding interventions related to COVID-19.
URL	https://www.gov.uk/government/publications/reproduction-number-r-and-growth-rate-methodology/reprodu...


Description	Bayesian Localisation in the Underwater Environment
Amount	£1,429,518 (GBP)
Organisation	Defence Science & Technology Laboratory (DSTL)
Sector	Public
Country	United Kingdom
Start	11/2019
End	10/2021


Description	Fusion and Information Theory
Amount	£929,241 (GBP)
Organisation	Defence Science & Technology Laboratory (DSTL)
Sector	Public
Country	United Kingdom
Start	11/2019
End	11/2021


Description	Novel Fusion Approaches to Mitigate Deception
Amount	£269,926 (GBP)
Organisation	Defence Science & Technology Laboratory (DSTL)
Sector	Public
Country	United Kingdom
Start	01/2021
End	03/2023


Description	Scalable Online Machine Learning
Amount	£100,000 (GBP)
Funding ID	2599530
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	09/2021
End	09/2025


Description	To be confirmed
Amount	£100,000 (GBP)
Funding ID	2599529
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	09/2021
End	09/2025


Description	Co-development of Stan
Organisation	Stan
Sector	Charity/Non Profit
PI Contribution	We are actively contributing to Stan's code base.
Collaborator Contribution	Access to a route to impact.
Impact	So far, we have contributed only a small change to the Stan maths library, but that change is now in the latest release and so is used by 100,000+ researchers.
Start Year	2018


Description	Joint Study Agreement with IBM
Organisation	IBM
Country	United States
Sector	Private
PI Contribution	We are developing next-generation data science techniques that can support both internal activity within IBM and its interactions with its customers.
Collaborator Contribution	IBM are providing people, access to large computers and, for example, secondment opportunities.
Impact	None as yet.
Start Year	2018
