Real-time phylogenetics using sequential Monte Carlo with tree sequences
Lead Research Organisation: University of Warwick
Department Name: Statistics
Abstract
The COVID-19 pandemic has highlighted the importance of a number of different scientific fields in helping to tackle the spread of an infectious disease. One of these fields is pathogen genomics, which has, through the analysis of sequenced SARS-CoV-2 genomes, enabled the detection and tracking of different variants of the virus. The UK is particularly strong in this area, with the COVID-19 Genomics UK Consortium (COG-UK) providing important inputs into the government response to the pandemic.
Rapid sequencing of pathogen genomes is now possible. One possible use of this data is in real-time tracking of pathogen evolution and transmission through reconstruction of the ancestral history of sequenced genomes (phylogenetic inference). The pandemic has shown the value of having such information available (see, for example, the work of Nextstrain, https://nextstrain.org/).
The state of the art in phylogenetic inference is to use Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference of the ancestral history, an approach preferred because it rigorously describes the uncertainty associated with inferences drawn from the data. This approach is implemented in the BEAST (https://beast.community/) and BEAST2 (beast2.org) packages, but in their current form these are unsuitable for "real-time" inference, since they perform inference on a fixed batch of genome sequences. If a new sequence becomes available after an MCMC run has started, the algorithm must be restarted to take account of the new data. These MCMC algorithms are often run for tens of millions of iterations, so restarting is computationally wasteful and hinders the goal of real-time inference. This project proposes an alternative approach, with the aim of making real-time Bayesian inference feasible for large numbers of sequences, preparing the ground for a deployable system that could be used during a pandemic.
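The contrast with restarting a batch MCMC run can be illustrated with a minimal sketch of a sequential Monte Carlo update. This is a toy under stated assumptions, not the project's method: the unknown is a single scalar `theta` rather than a phylogenetic tree, each observation is assumed to be Normal(theta, sigma^2), and the names `smc_update`, `normalise` and `effective_sample_size` are hypothetical. When a new observation arrives, the existing weighted particles are reweighted by its likelihood and resampled only if the weights have degenerated, so earlier computation is reused.

```python
import math
import random


def normalise(log_w):
    """Return log-weights shifted so that the weights sum to one."""
    m = max(log_w)  # subtract the maximum for numerical stability
    log_total = m + math.log(sum(math.exp(lw - m) for lw in log_w))
    return [lw - log_total for lw in log_w]


def effective_sample_size(log_w):
    """ESS of normalised log-weights: 1 / sum(w_i^2)."""
    return 1.0 / sum(math.exp(2.0 * lw) for lw in log_w)


def smc_update(particles, log_w, y, sigma=1.0, ess_fraction=0.5, rng=None):
    """Absorb one new observation y into an existing particle approximation.

    Toy model (an assumption for illustration): y ~ Normal(theta, sigma^2),
    where each particle is a candidate value of the scalar theta.
    """
    rng = rng or random
    # Reweight: add the log-likelihood of the new observation (up to a constant).
    log_w = [lw - 0.5 * ((y - th) / sigma) ** 2 for th, lw in zip(particles, log_w)]
    log_w = normalise(log_w)
    n = len(particles)
    # Resample only when the effective sample size drops below the threshold.
    if effective_sample_size(log_w) < ess_fraction * n:
        probs = [math.exp(lw) for lw in log_w]
        particles = rng.choices(particles, weights=probs, k=n)
        log_w = [-math.log(n)] * n
    return particles, log_w


# Usage: start from prior draws, then absorb observations one at a time.
rng = random.Random(1)
n = 2000
particles = [rng.gauss(0.0, 5.0) for _ in range(n)]  # prior: Normal(0, 5^2)
log_w = [-math.log(n)] * n
for y in [1.8, 2.3, 2.1, 1.9]:  # made-up observations arriving sequentially
    particles, log_w = smc_update(particles, log_w, y, rng=rng)

posterior_mean = sum(math.exp(lw) * th for th, lw in zip(particles, log_w))
```

The sketch omits the MCMC move step that practical SMC samplers apply after resampling to restore particle diversity, and it does not address the additional difficulty in phylogenetics that each new sequence enlarges the space being sampled.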
Publications
Roberts I
(2024)
Bayesian Inference of Pathogen Phylogeography using the Structured Coalescent Model
in bioRxiv
Everitt R G
(2024)
Ensemble Kalman inversion approximate Bayesian computation
in arXiv
Ripoli L
(2025)
Improved MCMC with active subspaces
in arXiv
Ripoli L
(2024)
Sequential Monte Carlo with active subspaces
in arXiv
| Description | During the COVID-19 pandemic we saw the value of "online" updating of phylogenetic trees (in that case for the novel coronavirus) as new sequence data became available. This project investigated the details of implementing this online inference with a Bayesian approach (i.e. one that accounts for uncertainty), using a well-established methodology known as sequential Monte Carlo (SMC). The project had the following outputs: - The discovery that standard approaches to SMC for online phylogenetics may result in estimates with large errors, and the proposal of a remedy to this problem. Estimators can have infinite variance because of the behaviour of posterior distributions on phylogenies as the number of sequences increases. The project developed a Pareto-smoothed SMC approach to alleviate this issue. - The use of tree sequences to reduce the memory requirements of SMC. The project made use of the "waste-free" SMC framework, which uses MCMC to generate particle diversity, allowing us to leverage the existing literature on MCMC for phylogenies. To gain the memory reduction from tree sequences, and to allow the easy use of parallelism within the SMC, the focus was on using tree sequences with MCMC. - Contributions to the development of a framework that allows the estimation of a time-varying reproduction number from genetic and epidemiological data. - General-purpose software (https://github.com/richardgeveritt/ilike) for implementing SMC and related approaches, in particular updated for use on non-standard spaces such as phylogenetic trees. - An R package (still under development) that implements a range of priors, likelihoods and MCMC moves to allow applied users to specify and perform inference for a wide range of phylogenetic models. |
| Exploitation Route | This project has contributed to the development of a system for performing accurate online inference of phylogenies whilst properly accounting for uncertainty. Work on a general software implementation is ongoing. There are three main areas in which the outcomes may be taken forward: - Academics working on methodology for online phylogenetics may build on the approach taken in this project. - Academics and other researchers in the life sciences may make use of the software implementation of the methods, for applications similar to the online phylogenetic inference performed for COVID-19. - Academics working in computational statistics may build on the Pareto-smoothed sequential Monte Carlo approach conceived during this project. |
| Sectors | Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology |
| URL | https://github.com/richardgeveritt/ilike |
| Title | ggsmc |
| Description | Visualising output from SMC samplers and EnK methods. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Open Source License? | Yes |
| Impact | This software allows the visualisation of the output from SMC samplers and EnK methods. It was developed in particular for output from the ilike package, but can be used whenever the algorithm output is provided in an R data frame of the required format. |
| URL | https://github.com/richardgeveritt/ggsmc |
| Title | ilike |
| Description | Software for Bayesian inference, with a particular focus on intractable models. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Open Source License? | Yes |
| Impact | This software allows researchers to access a number of state-of-the-art algorithms for Bayesian Computation. |
| URL | https://github.com/richardgeveritt/ilike |
| Title | ilike.output |
| Description | Running, and processing the output from, the ilike package. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Open Source License? | Yes |
| Impact | This software facilitates running code from the ilike package, including on compute clusters. |
| URL | https://github.com/richardgeveritt/ilike.output |
| Title | stromboli_cpp |
| Description | Adaptation of strom (https://stromtutorial.github.io/) for online Bayesian inference for the coalescent. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Open Source License? | Yes |
| Impact | Uses functions from the "strom" package for online inference of the coalescent, when used in conjunction with the ilike package. |
| URL | https://github.com/richardgeveritt/stromboli_cpp |
| Description | Statistics and mathematical modelling public talk |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Public/other audiences |
| Results and Impact | A public talk as part of the Resonate Festival 2024, on combining mathematical models with statistical approaches for producing predictions. The event was sold out (30 people) and the talk provoked a number of questions. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://www.resonatefestival.co.uk/events/weather |
