New Approaches to Bayesian Data Science: Tackling Challenges from the Health Sciences
Lead Research Organisation:
Lancaster University
Department Name: Mathematics and Statistics
Abstract
The health sciences have seen an explosion in the amount of data collected at both individual and population levels. This data can be varied, including genetic information, health records, data on activity levels obtained from wearable devices, and image data from scans. There is huge potential for improved diagnoses, timely interventions and more effective treatments if we can fully extract understanding from this data. Example applications include real-time monitoring of patients, developing personalised treatment, and real-time monitoring and decision-making for epidemics. However, the data science challenges in extracting these insights are vast.
Features of these challenges include the need to make inferences about and decisions for individuals from within a population, and the need to synthesise information from disparate data sources and data types. Whilst we have substantial data collected at a population level, the amount of information on any given individual may still be limited. Appropriately quantifying uncertainty is crucial for making decisions, with the optimal decision often being driven by the probability of relatively rare events (e.g. extreme reaction to a drug). We need model-based approaches to data science that can leverage scientific understanding, but we need the statistical analyses to be robust to unavoidable inadequacies of these models. Underpinning many of these applications is the requirement to develop new understanding, and this differs from the focus on making predictions that is most common among current statistical or machine learning methods.
Bayesian data science provides a natural framework for tackling these challenges. Bayesian methods are model-based, can appropriately quantify and propagate uncertainty, and through hierarchical models are able to use population-level information when making inferences about individuals. Repeated application of Bayes' theorem gives a natural paradigm for synthesising information across multiple data sources. However, current Bayesian data science methods are not feasible for many modern big-data applications in the health sciences. Bayesian methods require integrating over uncertainty. Such high-dimensional integration carries a substantial computational overhead when compared to alternative, often optimisation-based, data science methods. So while the motivation for Bayesian analysis is clear, this computational overhead means that, currently, implementing Bayesian approaches is often not feasible.
This programme of research will develop the new approaches to Bayesian data science that are needed both within the health sciences and more widely. It builds on recent breakthroughs in Monte Carlo integration methods that show great promise for being efficient for large data; and on new paradigms for Bayesian-like updates that are suitable for complex models and which focus modelling effort just on the aspects of these models that are most important. It will address key research challenges in the health sciences -- directly developing new insights and understanding for these.
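As an illustration of the abstract's point that repeated application of Bayes' theorem synthesises information across data sources, the following is a minimal sketch, not a method from this project. It assumes a conjugate Beta-Binomial model for the probability of a rare adverse reaction and two data sources that are independent given that probability. The class name `BetaPosterior` and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) belief about an event probability (hypothetical helper)."""

    alpha: float
    beta: float

    def update(self, events: int, trials: int) -> "BetaPosterior":
        # Conjugate Bayes update for binomial data:
        # add observed events to alpha and non-events to beta.
        return BetaPosterior(self.alpha + events, self.beta + (trials - events))

    @property
    def mean(self) -> float:
        # Posterior mean of the event probability.
        return self.alpha / (self.alpha + self.beta)


# Illustrative, made-up counts of (events, trials) from two data sources.
source_a = (3, 50)
source_b = (7, 200)

prior = BetaPosterior(1.0, 1.0)  # uniform prior, an assumption

# Apply Bayes' theorem once per source: the posterior from the first
# source acts as the prior for the second.
sequential = prior.update(*source_a).update(*source_b)

# Single update on the pooled data, for comparison.
pooled = prior.update(source_a[0] + source_b[0], source_a[1] + source_b[1])

print(sequential, pooled, sequential.mean)
```

Under these assumptions the sequential and pooled updates give the same posterior. Outside conjugate models the update generally requires numerical integration, which is the computational overhead the abstract refers to.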
Planned Impact
Who will benefit?
This proposal will benefit a variety of different stakeholders including:
(a) A range of public bodies, academic groups and companies within the health sciences;
(b) Society more generally through the application of this research;
(c) The academic data science research community;
(d) Project personnel: the PDRAs and PhD students.
How will they benefit?
Solution to current health science challenges [groups a,b]
The research project will tackle current key health science challenges, such as real-time decision-making for epidemics, structured association studies and personalised medicine. This will involve working directly with scientists within each of these areas to develop and apply new data science methods. Impact will arise immediately from new insights found, for example within association studies; and from a suite of new methods that can be applied more widely. Direct work with project partners will see the quick uptake of new methods in practice. The wider impact will be supported through the development, by one of our project partners, of software for specific applications aimed at health scientists and practitioners.
New Bayesian data science techniques [groups a,b,c]
We will generalise the new data science methods developed to address specific health science challenges. This will lead to a new suite of Bayesian data science techniques, together with associated theory and insight. These methods will cover generic challenges such as scalable computational methods, robust Bayesian procedures and how to fuse information from disparate data sources and types. The impact of this work will be supported by making the code developed freely available; and by dissemination at international data science conferences and in journals that span a range of data science related disciplines. Part of this dissemination will be through an annual workshop linked to the research project.
Targeted knowledge exchange [group a]
Our project partners will benefit directly from this research project. To maximise this we have plans for two research retreats per year, each around a different specific health science challenge, and which will have appropriate project partner involvement. PDRAs and PhD students on the grant will spend periods of time working at and directly with our partners. The grant will also have an external advisory board with strong end-user involvement.
Developing good people [groups a,c,d]
We will develop highly skilled researchers in Bayesian data science and its applications in health science. All PDRAs and PhD students will spend substantial time within research groups specialising both in fundamental data science and in health science applications. They will benefit from supportive training environments and opportunities provided by the five participating institutions; for example bespoke training courses run by the STOR-i and OxWasp doctoral training centres.
This grant will lead to an increase in the number of high-quality researchers working in a skill-shortage area, and able to seek future employment within both academia and industry.
Organisations
- Lancaster University (Lead Research Organisation)
- AstraZeneca (United Kingdom) (Project Partner)
- Wellcome Sanger Institute (Project Partner)
- Health Protection Scotland (Project Partner)
- Public Health England (Project Partner)
- GlaxoSmithKline (United Kingdom) (Project Partner)
- MRC Harwell Institute (Project Partner)
Publications
- Baker J (2018) Large-Scale Stochastic Sampling from the Probability Simplex
- Baker J (2019) sgmcmc: An R Package for Stochastic Gradient Markov Chain Monte Carlo, in Journal of Statistical Software
- Baker J (2018) Control variates for stochastic gradient MCMC, in Statistics and Computing
- Baker J (2018) Large-scale stochastic sampling from the probability simplex, in Advances in Neural Information Processing Systems
- Behr M (2019) Testing for dependence on tree structures
- Behr M (2020) Testing for dependence on tree structures, in Proceedings of the National Academy of Sciences of the United States of America
- Benschop J (2021) Still 'dairy farm fever'? A Bayesian model for leptospirosis notification data in New Zealand, in Journal of the Royal Society, Interface
- Benton J (2022) From Denoising Diffusions to Denoising Markov Models, in arXiv e-prints
- Bierkens J (2022) High-dimensional scaling limits of piecewise deterministic sampling algorithms, in The Annals of Applied Probability
- Bierkens J (2020) The Boomerang Sampler
Description | Substantial progress has been made in five key areas that cut across theory, methods and application: (i) New methods for large-scale epidemic modelling, and application to Covid-19 monitoring. We have developed new Sequential Monte Carlo methods for analysing epidemics, including approaches that make inference for individual-based epidemic models possible. Grant members were involved in analysing Covid-19 data, including through SPI-M, which fed into government decisions on health policies. (ii) New non-reversible, continuous-time Monte Carlo methods. Substantial progress has been made in extending the scope of powerful new computational statistics algorithms for sampling. These methods, based on stochastic processes called piecewise deterministic Markov processes (PDMPs), are now easier to implement and can be applied to a wider range of sampling problems. Software for implementing these methods has been developed. (iii) Generalising Bayesian inference. Work funded by the grant has been at the forefront of developing generalisations of Bayesian update rules that allow for modelling data through general loss functions, that use bootstrap ideas, or that use martingale ideas to recast the Bayesian posterior in terms of predictive distributions. These ideas widen the applicability of Bayesian methods. (iv) Falsification of digital twins. We have shown that it is not possible to certify that a twin is "correct" using real-world observational data unless potentially tenuous assumptions are made. To avoid these assumptions, we propose an assessment strategy that aims to find cases where the twin is not correct, and present a general-purpose statistical procedure for doing so that may be used across a wide variety of applications and twin models. (v) New theory for analysing convergence of MCMC algorithms. We have introduced new theoretical tools for analysing MCMC algorithms, which underpin most applications of Bayesian statistics. These tools lead to sharper and more general theoretical results on the performance of different algorithms, and help with understanding which algorithms are suitable for different applications. |
Exploitation Route | New algorithms for epidemic models can be used to analyse future epidemic outbreaks. New non-reversible MCMC algorithms could be developed further and applied to a wide range of Bayesian statistical and other sampling problems. New theoretical tools can be used to derive new theoretical results on convergence and other properties of MCMC algorithms. |
Sectors | Aerospace, Defence and Marine; Energy; Healthcare |
Description | Spatial epidemic analysis for COVID-19. Members of Bayes4Health (Jewell, Roberts, Corbella) have developed a data augmentation MCMC framework enabling spatial stochastic epidemic models to be fitted to the emerging local-authority level multidimensional case time series in the UK. Jewell was invited to the SPI-M-O (Scientific Pandemic Influenza - Modelling - Operational) subcommittee of SAGE. Through this platform, they have been providing results on spatial COVID-19 dynamics to multiple groups, including SPI-M-O for their weekly consensus statement, Cabinet Office for prioritising outbreak control in various UK regions, the Scottish government for spatial epidemiological intelligence, and Lancashire County Council for COVID-19 resource management purposes. The UK government's Chief Scientific Advisor stated that the Lancaster outputs on spatial modelling have led to improved efforts by the national health agencies to provide COVID-19 case data publicly (https://coronavirus.data.gov.uk/), which are now routinely used by SPI-M to generate ongoing COVID-19 reports. The Co-Chair of SPI-M-O said, "At this level, the Lancaster framework is the best available in the UK ... I would be surprised if there were better globally ... these interventions have been instrumental in the formation of our advice to Government". |
First Year Of Impact | 2020 |
Sector | Healthcare |
Impact Types | Societal; Policy & public services |
Description | Data Evaluation and Learning for Viral Epidemics group |
Geographic Reach | National |
Policy Influence Type | Membership of a guideline committee |
Description | Expert Advisory Group for Foreign Office and Cabinet Office |
Geographic Reach | National |
Policy Influence Type | Membership of a guideline committee |
Description | Expert Advisory Group for the International Comparators |
Geographic Reach | National |
Policy Influence Type | Membership of a guideline committee |
Description | International Best Practice Advisory Group on COVID-19 |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Membership of a guideline committee |
Description | Joint Biosecurity Centre for COVID-19 |
Geographic Reach | National |
Policy Influence Type | Membership of a guideline committee |
URL | https://www.gov.uk/government/groups/joint-biosecurity-centre |
Description | Membership of SPI-M-O (Scientific Pandemic Influenza - Modelling - Operational) subcommittee of SAGE |
Geographic Reach | National |
Policy Influence Type | Participation in a guidance/advisory committee |
Impact | Jewell was invited to the SPI-M-O (Scientific Pandemic Influenza - Modelling - Operational) subcommittee of SAGE. Through this platform, they have been providing results on spatial COVID-19 dynamics to multiple groups, including SPI-M-O for their weekly consensus statement, Cabinet Office for prioritising outbreak control in various UK regions, the Scottish government for spatial epidemiological intelligence, and Lancashire County Council for COVID-19 resource management purposes. The UK government's Chief Scientific Advisor stated that the Lancaster outputs on spatial modelling have led to improved efforts by the national health agencies to provide COVID-19 case data publicly (https://coronavirus.data.gov.uk/), which are now routinely used by SPI-M to generate ongoing COVID-19 reports. The Co-Chair of SPI-M-O said, "At this level, the Lancaster framework is the best available in the UK ... I would be surprised if there were better globally ... these interventions have been instrumental in the formation of our advice to Government". |
Description | Royal Statistical Task Force on COVID-19 |
Geographic Reach | National |
Policy Influence Type | Membership of a guideline committee |
Impact | The Royal Statistical Society (RSS) has a task force to marshal and promote, at a high level, the Society's expertise relevant to the COVID-19 crisis. The task force was co-founded by Sylvia Richardson, who co-chairs it with RSS past president Sir David Spiegelhalter; it is supported by a steering group. |
URL | https://rss.org.uk/policy-campaigns/policy/covid-19-task-force/ |
Description | SPI-M-O reporting |
Geographic Reach | National |
Policy Influence Type | Implementation circular/rapid advice/letter to e.g. Ministry of Health |
Impact | Our weekly reports have fed into spatial resource allocation for the SARS-CoV-2 outbreak, informing decisions on where to send medical supplies and test kits, and where to focus field investigations of super-spreading events. |
URL | https://chicas-covid19.gitlab.io/bayesstm |
Description | COVID-19: Bayesian inference for high resolution stochastic modelling for the UK |
Amount | £151,402 (GBP) |
Funding ID | EP/W011840/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 07/2021 |
End | 01/2023 |
Title | R package -- rjpdmp |
Description | R package, available on CRAN, that implements the methods in Chevallier, Augustin, Paul Fearnhead, and Matthew Sutton. "Reversible jump PDMP samplers for variable selection." arXiv preprint arXiv:2010.11771 (2020). |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | None |
URL | https://cran.r-project.org/web/packages/rjpdmp/index.html |
Title | ccpdmp |
Description | R package that implements adaptive concave-convex sampling for PDMP samplers. |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | None |
URL | https://github.com/matt-sutton/ccpdmp |
Title | prevdebiasr R package |
Description | https://github.com/alan-turing-institute/prevdebiasr |
Type Of Technology | Software |
Year Produced | 2021 |
Impact | Currently not known |
URL | https://github.com/alan-turing-institute/prevdebiasr |
Description | Article for Significance magazine - 'A perspective on real-time epidemic surveillance for Covid-19' |
Form Of Engagement Activity | A magazine, newsletter or online publication |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Sylvia Richardson co-wrote an article for Significance magazine entitled 'A perspective on real-time epidemic surveillance for Covid-19' |
Year(s) Of Engagement Activity | 2020 |
URL | https://www.significancemagazine.com/science/685-a-perspective-on-real-time-epidemic-surveillance-fo... |
Description | Article for The Conversation - 'Vaccine rollouts, school testing and contact tracing could all be improved - here's how' |
Form Of Engagement Activity | A magazine, newsletter or online publication |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Sylvia Richardson wrote an article for The Conversation entitled 'Vaccine rollouts, school testing and contact tracing could all be improved - here's how' |
Year(s) Of Engagement Activity | 2021 |
URL | https://theconversation.com/vaccine-rollouts-school-testing-and-contact-tracing-could-all-be-improve... |
Description | Article for The Observer - 'Coronavirus statistics: what can we trust and what should we ignore?' |
Form Of Engagement Activity | A press release, press conference or response to a media enquiry/interview |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Sylvia Richardson co-wrote an article for The Observer newspaper on 'Coronavirus statistics: what can we trust and what should we ignore?' |
Year(s) Of Engagement Activity | 2020 |
URL | https://www.theguardian.com/world/2020/apr/12/coronavirus-statistics-what-can-we-trust-and-what-shou... |
Description | Bayes4Health Workshop 2021 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | Sylvia Richardson co-organised and hosted the Bayes4Health 2021 workshop. |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.lancaster.ac.uk/bayes-for-health/workshops/workshop-2021/ |
Description | Buchholz-Contributions to approximate Bayesian Inference, Glasgow |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | Alexander Buchholz: Contributions to approximate Bayesian Inference, University of Glasgow, UK, 6/12/2020 |
Year(s) Of Engagement Activity | 2020 |
Description | Buchholz-Contributions to approximate Bayesian Inference, Warwick |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | Alexander Buchholz: Contributions to approximate Bayesian Inference, University of Warwick, UK, 17/1/2020 |
Year(s) Of Engagement Activity | 2020 |
Description | CSAP Podcast: Science, Policy and Pandemics |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Sylvia Richardson took part in a podcast organised by the Centre for Science and Policy, discussing lessons learnt during the COVID-19 pandemic and how researchers might use those lessons to prepare for future pandemics. |
Year(s) Of Engagement Activity | 2020 |
URL | https://www.youtube.com/watch?v=LB1aQb6Q3vc&feature=youtu.be |
Description | Cambridge Science Festival |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | Contributed to the Cambridge Science Festival, the online (Twitter) public engagement campaign of the MRC Biostatistics Unit. The festival is aimed at the general public. |
Year(s) Of Engagement Activity | 2022 |
Description | Interview for Sky News |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Sylvia Richardson carried out a live interview for Sky News about the Brazilian variant of COVID-19 being spread in the UK. |
Year(s) Of Engagement Activity | 2021 |
Description | RSS Presidency Speech |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Sylvia Richardson gave a talk to mark becoming the new President of the Royal Statistical Society |
Year(s) Of Engagement Activity | 2021 |
Description | Seaman- Adjusting for time-dependent confounding in survival analysis using structural nested cumulative survival time models |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Shaun Seaman, 02/10/19, De Morgan House, London Mathematical Society, London. International Biometric Society half-day event on 'New perspectives on studying the effects of treatment on a time to event outcome'. Title of talk: Adjusting for time-dependent confounding in survival analysis using structural nested cumulative survival time models |
Year(s) Of Engagement Activity | 2019 |
Description | Seaman- Handling time-dependent confounding using a structural nested cumulative survival time model- London |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Shaun Seaman, 15/12/19, CM Statistics conference, Senate House, University of London, London. Title of talk: Handling time-dependent confounding using a structural nested cumulative survival time model |
Year(s) Of Engagement Activity | 2019 |