Real-time phylogenetics using sequential Monte Carlo with tree sequences

Lead Research Organisation: University of Warwick
Department Name: Statistics

Abstract

The COVID-19 pandemic has highlighted the importance of a number of different scientific fields in helping to tackle the spread of an infectious disease. One of these fields is pathogen genomics, which has, through the analysis of sequenced SARS-CoV-2 genomes, enabled the detection and tracking of different variants of the virus. The UK is particularly strong in this area, with the COVID-19 Genomics UK Consortium (COG-UK) providing important inputs into the government response to the pandemic.

Rapid sequencing of pathogen genomes is now possible. One possible use of this data is in real-time tracking of pathogen evolution and transmission through reconstruction of the ancestral history of sequenced genomes (phylogenetic inference). The pandemic has shown the value of having such information available (see, for example, the work of Nextstrain, https://nextstrain.org/).

The state of the art in phylogenetic inference is Bayesian inference of the ancestral history using Markov chain Monte Carlo (MCMC) algorithms, an approach preferred because it rigorously describes the uncertainty associated with inferences drawn from the data. This approach is implemented in the BEAST (https://beast.community/) and BEAST2 (beast2.org) packages, but in their current form these are unsuitable for "real-time" inference, since they perform inference on a fixed batch of genome sequences. If a new sequence becomes available after a run of the MCMC has started, the algorithm must be restarted to take account of the new data. These MCMC algorithms are often run for tens of millions of iterations, so restarting the algorithm is computationally wasteful and hinders the goal of real-time inference. This project proposes an alternative approach, with the aim of making real-time Bayesian inference feasible for large numbers of sequences, preparing the ground for a deployable system that could be used during a pandemic.
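
The contrast between restarting a batch MCMC run and updating online can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not the project's method: the unknown is a single Gaussian mean rather than a phylogeny, and the names `log_lik`, `log_post`, `smc_update` and `run_online` are hypothetical. Each new observation is absorbed by reweighting the existing particles, resampling, and applying one Metropolis step, so earlier computation is reused rather than discarded.

```python
import math
import random

def log_lik(mu, y):
    # Toy assumption: Gaussian observation model with unit variance
    return -0.5 * (y - mu) ** 2

def log_post(mu, data, prior_sd=10.0):
    # Unnormalised log posterior: Gaussian prior plus Gaussian likelihood
    return -0.5 * (mu / prior_sd) ** 2 + sum(log_lik(mu, y) for y in data)

def smc_update(particles, data_so_far, y_new, rng, step=0.5):
    """Absorb one new observation: reweight, resample, then MCMC move."""
    # 1. Reweight each particle by the likelihood of the new observation only
    logw = [log_lik(mu, y_new) for mu in particles]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    # 2. Multinomial resampling in proportion to the weights
    particles = rng.choices(particles, weights=w, k=len(particles))
    # 3. One random-walk Metropolis step per particle to restore diversity
    data = data_so_far + [y_new]
    moved = []
    for mu in particles:
        prop = mu + rng.gauss(0.0, step)
        diff = log_post(prop, data) - log_post(mu, data)
        if rng.random() < math.exp(min(0.0, diff)):
            mu = prop
        moved.append(mu)
    return moved, data

def run_online(observations, n_particles=2000, seed=1):
    rng = random.Random(seed)
    # Initial particles are draws from the prior N(0, 10^2)
    particles = [rng.gauss(0.0, 10.0) for _ in range(n_particles)]
    data = []
    for y in observations:
        particles, data = smc_update(particles, data, y, rng)
    return particles
```

Calling `run_online([1.8, 2.2, 2.0, 1.9, 2.1])` returns particles approximating the posterior for the mean after all five observations, having processed them one at a time. In the phylogenetic setting each particle would be a tree and the move step would use tree-specific MCMC proposals, which this sketch does not attempt.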
 
Description During the COVID-19 pandemic we saw the value of "online" updating of phylogenetic trees (in that case for the novel coronavirus) as new sequence data became available. This project investigated the details of implementing this online inference with a Bayesian approach (i.e. accounting for uncertainty), via a well-established methodology known as sequential Monte Carlo (SMC). The project had the following outputs:

- The discovery that standard approaches to SMC for online phylogenetics may result in estimates with large errors, and the proposal of a remedy to this problem. There is the potential for infinite variance estimators due to the behaviour of posterior distributions on phylogenies with an increasing number of sequences. The project developed a Pareto-smoothed SMC approach to alleviate this issue.

- The use of tree sequences to reduce the memory requirements of SMC. The project made use of the "waste-free" SMC framework, which uses MCMC to generate particle diversity, allowing us to leverage the existing literature on MCMC for phylogenies. To gain the memory reduction offered by tree sequences, and to allow the easy use of parallelism within the SMC, the focus was on tree sequences with MCMC.

- Contributions to the development of a framework that allows the estimation of a time-varying reproduction number from genetic and epidemiological data.

- General-purpose software (https://github.com/richardgeveritt/ilike) for implementing SMC and related approaches, in particular updated for use on non-standard spaces such as phylogenetic trees.

- An R package (still under development) that implements a range of priors, likelihoods and MCMC moves to allow applied users to specify and perform inference for a wide range of phylogenetic models.
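
The infinite-variance issue noted in the first output above can be diagnosed from the importance weights themselves. The sketch below is an illustration only and is not the Pareto-smoothed SMC approach developed in the project: it computes the effective sample size and a Hill estimate of the tail shape k of the weights. If the weights have a Pareto tail with shape k of 1/2 or more, their variance is infinite. The function names and the choice of tail size are assumptions made for this example.

```python
import math
import random

def effective_sample_size(weights):
    # ESS = (sum of weights)^2 / (sum of squared weights)
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

def hill_tail_index(weights, n_tail):
    """Hill estimate of the Pareto shape k from the n_tail largest weights.

    Weights must be strictly positive and len(weights) > n_tail.
    """
    ordered = sorted(weights)
    threshold = ordered[-n_tail - 1]  # largest weight below the tail
    tail = ordered[-n_tail:]
    return sum(math.log(w / threshold) for w in tail) / n_tail

def simulate_pareto_weights(k, n, seed=0):
    # Synthetic weights with a Pareto tail of shape k (illustration only)
    rng = random.Random(seed)
    return [(1.0 - rng.random()) ** (-k) for _ in range(n)]
```

For instance, `hill_tail_index(simulate_pareto_weights(0.8, 20000), 2000)` gives an estimate close to 0.8, flagging an infinite-variance regime, and the corresponding `effective_sample_size` is a small fraction of the number of weights. Pareto smoothing goes further than this diagnostic by replacing the largest weights with quantiles of a fitted generalised Pareto distribution.
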
Exploitation Route This project has contributed to the development of a system for performing accurate online inference of phylogenies whilst properly accounting for uncertainty. Work on a general software implementation is ongoing. There are three main areas in which the outcomes may be taken forward:

- Academics working on methodology for online phylogenetics may build on the approach to online phylogenetics taken in this project.

- Academics and other researchers in the life sciences may make use of the software implementation of the methods, for applications similar to the online phylogenetic inference performed for COVID-19.

- Academics working in computational statistics may build on the Pareto smoothed sequential Monte Carlo approach conceived during this project.
Sectors Digital/Communication/Information Technologies (including Software)

Pharmaceuticals and Medical Biotechnology

URL https://github.com/richardgeveritt/ilike
 
Title ggsmc 
Description Visualising output from SMC samplers and EnK methods. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact This software allows the visualisation of the output from SMC samplers and EnK methods. It was developed in particular for output from the ilike package, but can be used whenever the algorithm output is provided in an R data frame of the required format. 
URL https://github.com/richardgeveritt/ggsmc
 
Title ilike 
Description Software for Bayesian inference, with a particular focus on intractable models. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact This software allows researchers to access a number of state-of-the-art algorithms for Bayesian Computation. 
URL https://github.com/richardgeveritt/ilike
 
Title ilike.output 
Description Running, and processing the output from, the ilike package. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact This software facilitates running code from the ilike package, including on compute clusters. 
URL https://github.com/richardgeveritt/ilike.output
 
Title stromboli_cpp 
Description Adaptation of strom (https://stromtutorial.github.io/) for online Bayesian inference for the coalescent. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact Uses functions from the "strom" package for online inference of the coalescent, when used in conjunction with the ilike package. 
URL https://github.com/richardgeveritt/stromboli_cpp
 
Description Statistics and mathematical modelling public talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact A public talk as part of the Resonate Festival 2024, on combining mathematical models with statistical approaches for producing predictions. The event was sold out (30 people) and the talk provoked a number of questions.
Year(s) Of Engagement Activity 2024
URL https://www.resonatefestival.co.uk/events/weather