Big Hypotheses: A Fully Parallelised Bayesian Inference Solution

Lead Research Organisation: University of Liverpool
Department Name: Electrical Engineering and Electronics

Abstract

Bayesian inference is a process which allows us to extract information from data. The process uses prior knowledge articulated as statistical models for the data. We are focused on developing a transformational solution to Data Science problems that can be posed as such Bayesian inference tasks.

An existing family of algorithms, Markov chain Monte Carlo (MCMC), offers impressive accuracy but demands significant computation. For a significant subset of the users of Data Science that we interact with, while the accuracy offered by MCMC is recognised as potentially transformational, the computational load is just too great for MCMC to be a practical alternative to existing approaches. These users include academics working in science (e.g., Physics, Chemistry, Biology and the social sciences) as well as government and industry (e.g., in the pharmaceutical, defence and manufacturing sectors). The problem is then how to make the accuracy offered by MCMC accessible at a fraction of the computational cost.

The solution we propose is based on replacing MCMC with a more recently developed family of algorithms, Sequential Monte Carlo (SMC) samplers. While MCMC, at its heart, manipulates a single sampling process, SMC samplers are inherently population-based algorithms that manipulate a population of samples. This makes SMC samplers well suited to implementations that exploit parallel computational resources. It is therefore possible to use emerging hardware (e.g., Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and Intel's Xeon Phis as well as High Performance Computing (HPC) clusters) to make SMC samplers run faster. Indeed, our recent work (which had to remove some algorithmic bottlenecks before making the progress we have achieved) has shown that SMC samplers can offer accuracy similar to MCMC but with implementations that are better suited to such emerging hardware.
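To make the contrast with a single-chain method concrete, the following is a minimal sketch of a tempered SMC sampler in Python with NumPy. It is an illustration only, not the project's implementation: the one-dimensional Gaussian prior and likelihood, the linear temperature schedule, the resampling threshold and the function names are all assumptions chosen for brevity. Every step acts on the whole population at once, which is the property that makes the algorithm a natural fit for parallel hardware.

```python
import numpy as np

def log_prior(x):
    # Assumed prior for illustration: N(0, 5^2), up to an additive constant
    return -0.5 * (x / 5.0) ** 2

def log_like(x):
    # Assumed likelihood for illustration: one observation y = 3 with unit noise
    return -0.5 * (x - 3.0) ** 2

def smc_sampler(n_particles=2000, n_steps=20, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 5.0, n_particles)        # population drawn from the prior
    logw = np.zeros(n_particles)                 # log importance weights
    betas = np.linspace(0.0, 1.0, n_steps + 1)   # assumed linear temperature schedule
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Reweight: bridge from prior * like^b_prev to prior * like^b
        logw += (b - b_prev) * log_like(x)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resample when the effective sample size falls below half the population
        if 1.0 / np.sum(w ** 2) < n_particles / 2:
            x = x[rng.choice(n_particles, size=n_particles, p=w)]
            logw = np.zeros(n_particles)
        # Move: one random-walk Metropolis step per particle, targeting prior * like^b
        prop = x + rng.normal(0.0, 1.0, n_particles)
        log_acc = (log_prior(prop) + b * log_like(prop)) - (log_prior(x) + b * log_like(x))
        accept = np.log(rng.uniform(size=n_particles)) < log_acc
        x = np.where(accept, prop, x)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return x, w

particles, weights = smc_sampler()
posterior_mean = float(np.sum(weights * particles))
```

For this toy model the exact posterior mean is 3 / 1.04, so `posterior_mean` can be checked against a known answer. The reweight and move steps are independent across particles; resampling is the step that requires communication between them.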

The benefits of using an SMC sampler in place of MCMC go beyond those made possible by simply posing a (tough) parallel computing challenge. The parameters of an MCMC algorithm necessarily differ from those of an SMC sampler. These differences offer opportunities for SMC samplers to be developed in directions that are not possible with MCMC. For example, SMC samplers, in contrast to MCMC algorithms, can be configured to exploit a memory of their historic behaviour and can be designed to smoothly transition between problems. It seems likely that by exploiting such opportunities, we will generate SMC samplers that can outperform MCMC even more than is possible by using parallelised implementations alone.

Our interactions with users, our experience of parallelising SMC samplers and the preliminary results we have obtained when comparing SMC samplers and MCMC make us excited about the potential that SMC samplers offer as a "New Approach for Data Science".

Our current work has only begun to explore the potential offered by SMC samplers. We perceive significant benefit could result from a larger programme of work that helps us understand the extent to which users will benefit from replacing MCMC with SMC samplers. We propose a programme of work that combines a focus on users' problems with a systematic investigation into the opportunities offered by SMC samplers.

Our strategy for achieving impact comprises multiple tactics. Specifically, we will: use identified users to act as "evangelists" in each of their domains; work with our hardware-oriented partners to produce high-performance reference implementations; engage with the developer team for Stan (the most widely used generic MCMC implementation); and work with the Industrial Mathematics Knowledge Transfer Network and the Alan Turing Institute to engage with both users and other algorithmic developers.

Planned Impact

The core of our strategy for achieving impact is to work with each of a number of users to improve their ability to access MCMC's accuracy at a tiny fraction of its computational cost. The potential impact associated with those specific users is summarised as follows (listed in the (feature-oriented) order used in Table 1 of the case for support):
1. An improved ability to profile offenders would result in the National Crime Agency (and the other international law enforcement agencies who are using the profiling tool developed at the University of Liverpool) protecting more children and prosecuting more sex offenders;
2. An improved ability to use nuclear forensics to understand the source of nuclear material being smuggled into the UK would reduce the threat to the safety of the UK that is posed by international terrorism;
3. A rapid and improved ability to characterise crystals from powder diffraction data would promote a step-change in how new materials are discovered and applied across many industries (from the design of improved batteries to the formulation of new drugs, as relevant to EPSRC's Directed Assembly and Dial-a-molecule grand challenges for Chemical sciences and engineering);
4. An improved estimate of the source of a chemical or biological agent's release would help save the lives of those soldiers who would otherwise be exposed to the agent;
5. An improved ability to handle missing data when analysing results from clinical trials would maximise the information extracted from people undertaking the trials, thereby resulting in safer drugs which have greater efficacy;
6. An improved ability to analyse quantum field theory by first-principles computational methods would transform our ability to understand the fabric of the Universe (e.g., the dark matter and dark energy that we know so little about yet together make up roughly 95% of that Universe);
7. An improved ability to combine expert knowledge and experimental data would enable the manufacturing industry to explore the formulation space and thereby manufacture products that are more attractive to consumers;
8. An improved ability to quantify protein abundance would have a dramatic impact on our understanding of biological systems (e.g., mankind).

We will help these users to be "evangelists" in each of their domains. We will also work with Stan's developer team, the Industrial Mathematics Knowledge Transfer Network and the Alan Turing Institute to engage new users (and other algorithmic developers). We anticipate further impact will result from this indirect engagement. We now summarise a generic view of such impact.

We have been working with clusters of up to 130,000 cores. To provide some context, there are 86,400 seconds in a day such that, if we can fully capitalise on the processing power of such a cluster, tasks that might have historically taken 6 months with one core can run in just 3 minutes. The users we have engaged with, and continue to engage with, see that kind of speed-up as transformational: problems that are currently beyond reach would be solved routinely. We believe other users will have a similar view. We hope to change academic users' abilities to conduct their science and to change industrial users' abilities to conduct their business.
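The arithmetic behind that figure can be sketched as follows. This is a back-of-the-envelope illustration, not a benchmark: taking 6 months as 182.5 days is an assumption, and the `efficiency` parameter (the fraction of ideal speed-up actually achieved) is a hypothetical knob introduced here.

```python
SECONDS_PER_DAY = 86_400
CORES = 130_000

def parallel_seconds(serial_days, cores, efficiency=1.0):
    """Wall-clock seconds when serial work is spread over `cores`
    at the given parallel efficiency (1.0 means perfect scaling)."""
    return serial_days * SECONDS_PER_DAY / (cores * efficiency)

six_months_days = 182.5  # assumption: half of a 365-day year

perfect = parallel_seconds(six_months_days, CORES)
two_thirds = parallel_seconds(six_months_days, CORES, efficiency=2 / 3)

print(f"perfect scaling:       {perfect / 60:.1f} minutes")
print(f"two-thirds efficiency: {two_thirds / 60:.1f} minutes")
```

Under perfect scaling the six-month task takes about two minutes; at roughly two-thirds parallel efficiency it takes about three, so the quoted figure leaves some room for overheads.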

We do recognise that, in some users' contexts, it is possible to develop problem-specific parallel implementations of MCMC (e.g., using Spark) that give rise to such speed-ups without the need for the parallelised SMC samplers that we plan to develop. However, the scope to exploit the very nature of SMC samplers to get additional computational gains beyond those associated with parallel implementation makes us optimistic that we will develop a truly game-changing capability to solve Bayesian inference tasks.

Our (overtly ambitious) hope is that this project's legacy will be that we live in a future where MCMC is no longer deemed to be the state-of-the-art and SMC samplers are the de-facto standard for numerical Bayesian inference.

Publications

 
Description We've been successful in demonstrating an ability to outperform the state-of-the-art both in terms of time (by exploiting parallel processing resources) and per unit of computation. The second of these is particularly important since it has implications for how a team (albeit of computers at this point) might be able to outperform an individual given the same total amount of effort. This appears to be a result of an individual necessarily being sufficiently cautious that it can't go wrong, whereas a team only has to be cautious enough that all individuals don't simultaneously go wrong.
Exploitation Route We are working with our project partners (and co-investigators working in scientific disciplines) to see this work taken forwards.
Sectors Aerospace, Defence and Marine; Chemicals; Creative Economy; Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology; Retail; Security and Diplomacy; Transport
 
Description Our work has led to the use, since February 2021, of a specific Bayesian model as one of the portfolio of models used by UK government to calculate the COVID-19 R-number. We have also recently had approval for two spin-outs from the University (both licensing IP generated in or around this project). In addition, we have been working towards a new search for MH370, which has resulted in, amongst other things, our work being featured on BBC News (and we even ended up with a numerical Bayesian algorithm being on daytime television).
First Year Of Impact 2021
Sector Aerospace, Defence and Marine; Financial Services, and Management Consultancy; Healthcare
Impact Types Societal; Economic; Policy & public services

 
Description Calculation of COVID R-number for UK government
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
Impact The calculation of the R-number has been a critical input to UK government's decision making regarding interventions related to COVID-19.
URL https://www.gov.uk/government/publications/reproduction-number-r-and-growth-rate-methodology/reprodu...
 
Description Bayesian Localisation in the Underwater Environment
Amount £1,429,518 (GBP)
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 11/2019 
End 10/2021
 
Description EPSRC IAA Award: Capitalising on Big Hypotheses for Significantly Better Decision Support for the Insurance Industry
Amount £16,134 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 05/2024 
End 07/2025
 
Description Fusion and Information Theory
Amount £929,241 (GBP)
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 11/2019 
End 11/2021
 
Description Novel Fusion Approaches to Mitigate Deception
Amount £269,926 (GBP)
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 01/2021 
End 03/2023
 
Description Scalable Online Machine Learning
Amount £100,000 (GBP)
Funding ID 2599530 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 08/2021 
End 09/2025
 
Description Scalable Online Machine Learning
Amount £100,000 (GBP)
Funding ID 2599529 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2021 
End 09/2025
 
Description Co-development of Stan 
Organisation Stan
Sector Charity/Non Profit 
PI Contribution We are actively contributing to Stan's code base.
Collaborator Contribution Access to a route to impact.
Impact So far, we have contributed only a small change to the Stan maths library, but that change is now in the latest release and so is used by 100,000+ researchers.
Start Year 2018
 
Description Joint Study Agreement with IBM 
Organisation IBM
Country United States 
Sector Private 
PI Contribution We are developing next-generation data science techniques that can support both internal activity within IBM and IBM's interactions with its customers.
Collaborator Contribution IBM are providing people, access to large computers and, for example, secondment opportunities.
Impact None as yet.
Start Year 2018
 
Title METHOD OF PARALLEL IMPLEMENTATION IN DISTRIBUTED MEMORY ARCHITECTURES 
Description The present techniques relate to a method for a parallel implementation of a sequential Monte Carlo (SMC) method of modelling an industrial process on a distributed memory architecture, and a system for implementing the same. The method may comprise receiving, from at least one sensor, a measurement of at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and implementing, on a server comprising a distributed memory architecture, a sequential Monte Carlo (SMC) process using a plurality of statistically independent particles and the at least one measured parameter to estimate the true state of the physical system, wherein the distributed memory architecture has a plurality of cores, each of which is ranked. The method and architecture provide an efficient parallel implementation by effectively parallelising a redistribute step, which may be considered to be a constituent part of a resampling step. The SMC method may be used to perform state estimation of dynamic or static models under non-linear, non-Gaussian noise.
IP Reference WO2022162386 
Protection Patent / Patent application
Year Protection Granted 2022
Licensed No
Impact The University of Liverpool has approved the formation of a spin-out, Voyant, which will seek to commercialise the application of the patent (and other work we have done) for tracking and data fusion problems pertinent to defence and security. Voyant is currently in the process of being formed.
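
For readers unfamiliar with the terminology in the patent description above, the following serial sketch shows what a redistribute step computes within resampling. It is not the patented parallel method: it uses textbook systematic resampling on a single process, and the function names are hypothetical. Resampling first decides how many copies of each particle to keep, and redistribute then writes out those copies; on a distributed memory architecture it is this copying that requires data movement between cores.

```python
import numpy as np

def systematic_ncopies(weights, rng):
    """Number of copies of each particle under systematic resampling.
    `weights` must be non-negative and sum to 1."""
    n = len(weights)
    u = (rng.uniform() + np.arange(n)) / n   # one random offset, n evenly spaced points
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                            # guard against floating-point round-off
    idx = np.searchsorted(cdf, u, side="right")
    return np.bincount(idx, minlength=n)

def redistribute(particles, ncopies):
    """Replicate particle i exactly ncopies[i] times (serial version)."""
    return np.repeat(particles, ncopies, axis=0)

rng = np.random.default_rng(0)
particles = np.array([10.0, 20.0, 30.0, 40.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])

ncopies = systematic_ncopies(weights, rng)
new_particles = redistribute(particles, ncopies)
```

The copy counts always sum to the population size, so the new population has the same length as the old one; particles with larger weights tend to receive more copies.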
 
Description Working towards a future Search for MH370 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact 35 million people in 200 countries watched my live interview on BBC News about the potential to analyse radio amateurs' data in support of a future search for MH370. Others watched a documentary I was involved with on BBC 1, watched my live interview on Morning Live, listened to my live interview on BBC Radio 4's The World Tonight, or read my interviews in the Times and the Telegraph.
Year(s) Of Engagement Activity 2024
URL https://www.youtube.com/watch?v=9SEMSQDO-pg