Big Hypotheses: A Fully Parallelised Bayesian Inference Solution

Lead Research Organisation: University of Liverpool
Department Name: Electrical Engineering and Electronics

Abstract

Bayesian inference is a process which allows us to extract information from data. The process uses prior knowledge articulated as statistical models for the data. We are focused on developing a transformational solution to Data Science problems that can be posed as such Bayesian inference tasks.

An existing family of algorithms, Markov chain Monte Carlo (MCMC), offers impressive accuracy but demands a significant computational load. For a significant subset of the Data Science users we interact with, the accuracy offered by MCMC is recognised as potentially transformational, but the computational load is simply too great for MCMC to be a practical alternative to existing approaches. These users include academics working in science (e.g., Physics, Chemistry, Biology and the social sciences) as well as government and industry (e.g., in the pharmaceutical, defence and manufacturing sectors). The problem is then how to make the accuracy offered by MCMC accessible at a fraction of the computational cost.

The solution we propose is based on replacing MCMC with a more recently developed family of algorithms, Sequential Monte Carlo (SMC) samplers. While MCMC, at its heart, manipulates a single sampling process, SMC samplers are inherently population-based: they manipulate a population of samples. This makes SMC samplers well suited to implementations that exploit parallel computational resources. It is therefore possible to use emerging hardware (e.g., Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and Intel's Xeon Phis, as well as High Performance Computing (HPC) clusters) to make SMC samplers run faster. Indeed, our recent work (which first had to remove some algorithmic bottlenecks) has shown that SMC samplers can offer accuracy similar to MCMC but with implementations that are better suited to such emerging hardware.
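
To make the population-based idea concrete, the following is a minimal sketch of one common form of SMC sampler, written in plain Python. It is an illustration under stated assumptions, not the project's implementation: the toy model (a Gaussian prior with standard deviation 5 and a single observation y = 2 with unit noise), the linear tempering schedule, the multinomial resampling and the names smc_sampler, log_prior and log_likelihood are all hypothetical choices made for this example. For this toy model the exact posterior mean is 2 / 1.04, roughly 1.92, which gives something to compare the particle average against.

```python
import math
import random


def log_prior(x):
    # Assumed toy prior: zero-mean Gaussian with standard deviation 5.
    return -0.5 * x * x / 25.0


def log_likelihood(x, y=2.0):
    # Assumed toy likelihood: one observation y with unit-variance noise.
    return -0.5 * (y - x) ** 2


def smc_sampler(n_particles=2000, n_steps=20, step_size=1.0, seed=0):
    """Tempered SMC sampler: move a population from the prior to the posterior."""
    rng = random.Random(seed)
    # Start with a population drawn from the prior (tempering exponent 0).
    particles = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]
    betas = [t / n_steps for t in range(n_steps + 1)]

    for beta_prev, beta in zip(betas, betas[1:]):
        # Reweight: each particle is scored independently of the others.
        log_w = [(beta - beta_prev) * log_likelihood(x) for x in particles]
        max_log_w = max(log_w)
        weights = [math.exp(lw - max_log_w) for lw in log_w]

        # Resample: the one step here that needs the whole population at once.
        particles = rng.choices(particles, weights=weights, k=n_particles)

        # Move: one random-walk Metropolis step per particle, targeting
        # prior(x) * likelihood(x) ** beta. Each move is independent.
        moved = []
        for x in particles:
            proposal = x + rng.gauss(0.0, step_size)
            log_accept = (log_prior(proposal) + beta * log_likelihood(proposal)
                          - log_prior(x) - beta * log_likelihood(x))
            if log_accept >= 0 or rng.random() < math.exp(log_accept):
                x = proposal
            moved.append(x)
        particles = moved

    return particles


if __name__ == "__main__":
    samples = smc_sampler()
    print("posterior mean estimate:", sum(samples) / len(samples))
```

In this sketch the reweighting and move steps treat every particle independently, which is what makes a parallel implementation natural; only the resampling step requires information about the whole population.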

The benefits of using an SMC sampler in place of MCMC go beyond those made possible by simply posing a (tough) parallel computing challenge. The parameters of an MCMC algorithm necessarily differ from those of an SMC sampler. These differences offer opportunities for SMC samplers to be developed in directions that are not possible with MCMC. For example, SMC samplers, in contrast to MCMC algorithms, can be configured to exploit a memory of their historic behaviour and can be designed to transition smoothly between problems. It seems likely that, by exploiting such opportunities, we will generate SMC samplers that can outperform MCMC by even more than is possible using parallelised implementations alone.

Our interactions with users, our experience of parallelising SMC samplers and the preliminary results we have obtained when comparing SMC samplers and MCMC make us excited about the potential that SMC samplers offer as a "New Approach for Data Science".

Our current work has only begun to explore the potential offered by SMC samplers. We believe significant benefit could result from a larger programme of work that helps us understand the extent to which users will benefit from replacing MCMC with SMC samplers. We propose a programme of work that combines a focus on users' problems with a systematic investigation into the opportunities offered by SMC samplers.

Our strategy for achieving impact comprises multiple tactics. Specifically, we will: use identified users to act as "evangelists" in each of their domains; work with our hardware-oriented partners to produce high-performance reference implementations; engage with the developer team for Stan (the most widely used generic MCMC implementation); and work with the Industrial Mathematics Knowledge Transfer Network and the Alan Turing Institute to engage with both users and other algorithmic developers.

Planned Impact

The core of our strategy for achieving impact is to work with each of a number of users to improve their ability to access MCMC's accuracy at a tiny fraction of its computational cost. The potential impacts associated with those specific users are summarised as follows (listed in the (feature-oriented) order used in Table 1 of the case for support):
1. An improved ability to profile offenders would result in the National Crime Agency (and the other international law enforcement agencies who are using the profiling tool developed at the University of Liverpool) protecting more children and prosecuting more sex offenders;
2. An improved ability to use nuclear forensics to understand the source of nuclear material being smuggled into the UK would reduce the threat to the safety of the UK that is posed by international terrorism;
3. A rapid and improved ability to characterise crystals from powder diffraction data would promote a step-change in how new materials are discovered and applied across many industries (from the design of improved batteries to the formulation of new drugs, as relevant to EPSRC's Directed Assembly and Dial-a-molecule grand challenges for Chemical sciences and engineering);
4. An improved estimate of the source of a chemical or biological agent's release would help save the lives of those soldiers who would otherwise be exposed to the agent;
5. An improved ability to handle missing data when analysing results from clinical trials would maximise the information extracted from people undertaking the trials, thereby resulting in safer drugs which have greater efficacy;
6. An improved ability to analyse quantum field theory by first-principles computational methods would transform our ability to understand the fabric of the Universe (e.g., the dark matter and dark energy that we know so little about yet that together make up around 95% of that Universe);
7. An improved ability to combine expert knowledge and experimental data would enable the manufacturing industry to explore the formulation space and thereby manufacture products that are more attractive to consumers;
8. An improved ability to quantify protein abundance would have a dramatic impact on our understanding of biological systems (e.g., humans).

We will help these users to be "evangelists" in each of their domains, and will work with Stan's developer team, the Industrial Mathematics Knowledge Transfer Network and the Alan Turing Institute to engage new users (and other algorithmic developers). We anticipate further impact will result from this indirect engagement. We now summarise a generic view of such impact.

We have been working with clusters of up to 130,000 cores. To provide some context, there are 86,400 seconds in a day such that, if we can fully capitalise on the processing power of such a cluster, tasks that might historically have taken 6 months with one core can run in just a few minutes (about 2 minutes under ideal linear scaling). The users we have engaged with, and continue to engage with, see that kind of speed-up as transformational: problems that are currently beyond reach would be solved routinely. We believe other users will have a similar view. We hope to change academic users' abilities to conduct their science and to change industrial users' abilities to conduct their business.
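
The arithmetic behind that comparison can be checked with a short Python sketch. It assumes perfect linear scaling (no communication or load-balancing overhead) and takes six months to be half of a 365-day year; both are simplifying assumptions made for this example, and the function name ideal_runtime_seconds is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def ideal_runtime_seconds(serial_days, cores):
    """Wall-clock time if a serial job is split perfectly across all cores."""
    return serial_days * SECONDS_PER_DAY / cores


serial_days = 182.5   # assumption: six months as half of a 365-day year
cores = 130_000       # cluster size quoted in the text

runtime = ideal_runtime_seconds(serial_days, cores)
print(f"{runtime:.0f} s = {runtime / 60:.1f} min")  # prints "121 s = 2.0 min"
```

Under these idealised assumptions the six-month task takes about two minutes, the same order of magnitude as the figure quoted above; any real implementation would lose some of this to overheads.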

We do recognise that, in some users' contexts, it is possible to develop problem-specific parallel implementations of MCMC (e.g., using Spark) that give rise to such speed-ups without the need for the parallelised SMC samplers that we plan to develop. However, the scope to exploit the very nature of SMC samplers to get additional computational gains beyond those associated with parallel implementation makes us optimistic that we will develop a truly game-changing capability to solve Bayesian inference tasks.

Our (overtly ambitious) hope is that this project's legacy will be a future where MCMC is no longer deemed to be the state-of-the-art and SMC samplers are the de facto standard for numerical Bayesian inference.

Publications

 
Description: We have been successful in demonstrating an ability to outperform the state-of-the-art both in terms of time (by exploiting parallel processing resources) and per unit of computation. The second of these is particularly important since it has implications for how a team (albeit of computers at this point) might be able to outperform an individual given the same total amount of effort. This appears to be a result of an individual necessarily being sufficiently cautious that it cannot go wrong, whereas a team only has to be cautious enough that all individuals do not simultaneously go wrong.
Exploitation Route: We are working with our project partners (and co-investigators working in scientific disciplines) to see this work taken forwards.
Sectors: Aerospace, Defence and Marine; Chemicals; Creative Economy; Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology; Retail; Security and Diplomacy; Transport

 
Description: Our work has led to the use, since February 2020, of a specific Bayesian model as one of the portfolio of models used by the UK government to calculate the COVID-19 R-number.
First Year Of Impact: 2020
Sector: Healthcare
Impact Types: Policy & public services

 
Description: Calculation of COVID R-number for UK government
Geographic Reach: National
Policy Influence Type: Participation in a guidance/advisory committee
Impact: The calculation of the R-number has been a critical input to UK government's decision making regarding interventions related to COVID-19.
URL: https://www.gov.uk/government/publications/reproduction-number-r-and-growth-rate-methodology/reprodu...
 
Description: Bayesian Localisation in the Underwater Environment
Amount: £1,429,518 (GBP)
Organisation: Defence Science & Technology Laboratory (DSTL)
Sector: Public
Country: United Kingdom
Start: 11/2019
End: 10/2021
 
Description: Fusion and Information Theory
Amount: £929,241 (GBP)
Organisation: Defence Science & Technology Laboratory (DSTL)
Sector: Public
Country: United Kingdom
Start: 11/2019
End: 11/2021
 
Description: Novel Fusion Approaches to Mitigate Deception
Amount: £269,926 (GBP)
Organisation: Defence Science & Technology Laboratory (DSTL)
Sector: Public
Country: United Kingdom
Start: 01/2021
End: 03/2023
 
Description: Scalable Online Machine Learning
Amount: £100,000 (GBP)
Funding ID: 2599530
Organisation: Engineering and Physical Sciences Research Council (EPSRC)
Sector: Public
Country: United Kingdom
Start: 09/2021
End: 09/2025
 
Description: To be confirmed
Amount: £100,000 (GBP)
Funding ID: 2599529
Organisation: Engineering and Physical Sciences Research Council (EPSRC)
Sector: Public
Country: United Kingdom
Start: 09/2021
End: 09/2025
 
Description: Co-development of Stan
Organisation: Stan
Sector: Charity/Non Profit
PI Contribution: We are actively contributing to Stan's code base.
Collaborator Contribution: Access to a route to impact.
Impact: So far, we have contributed only a small change to the Stan maths library, but that change is now in the latest release and so is used by 100,000+ researchers.
Start Year: 2018
 
Description: Joint Study Agreement with IBM
Organisation: IBM
Country: United States
Sector: Private
PI Contribution: We are developing next-generation data science techniques that can support both internal activity within IBM and their interactions with their customers.
Collaborator Contribution: IBM are providing people, access to large computers and, for example, secondment opportunities.
Impact: None as yet.
Start Year: 2018