Computational methods for pandemic-scale genomic epidemiology

Lead Research Organisation: European Bioinformatics Institute
Department Name: Goldman Group

Abstract

Phylogenetic analyses of genome sequences from infectious pathogens can reveal essential information regarding their evolution and transmission history. As the COVID-19 pandemic exemplified, these analyses and data play a crucial role in epidemiology and are essential to track and reconstruct the spread of infectious disease within communities and between countries; to understand the dynamics of transmission; to estimate the efficacy of containment measures; to predict epidemiological dynamics; and to monitor pathogen evolution as showcased by the identification of new SARS-CoV-2 mutations and variants of concern.

With ongoing improvement and widespread adoption of genome sequencing technologies, genomic epidemiology will become a key medical asset. Improvements to genomic epidemiological data analysis methods therefore will not only help us tackle ongoing infectious disease epidemics, but will also enhance our preparedness towards future pandemics.

However, current investigations of genomic epidemiological data are predominantly based on computational methods that are not tailored to their needs, but rather were developed for evolutionary biology studies where typically few, highly diverged genomes are considered. Most desirable analyses of large genome sequence data sets, such as those that emerged during the COVID-19 pandemic, are thus currently unfeasible.

In this project we address this limitation by developing computational methods tailored for pandemic-scale genomic epidemiology. These methods will enable accurate real-time analyses of large genomic epidemiological data sets. These objectives fall within several priorities of the MRC, such as "Global health", "Infections and immunity", "Antimicrobial resistance", and "Biomedical and health data science". Our specific aims are to:

1) Develop algorithms for genomic epidemiology. We will develop new algorithms for analysing genome sequence data. We will exploit the fact that sequences in genomic epidemiology are typically very closely related, and thus very similar to each other, to devise algorithms and mathematical approaches tailored for this field. Based on our past experience, we expect these approaches to be thousands of times more efficient than traditional methods, allowing the analysis of millions rather than thousands of genome sequences.

2) Increase realism and accuracy. Highly variable mutation rates and recurrent sequence errors, while common, cause errors and uncertainty in current genomic epidemiological analyses. To increase the accuracy of our methods without affecting their efficiency, we will develop bespoke mathematical models of genome evolution that take into account these complexities of genomic epidemiological data.

3) Pave the way to wider implementation. We will develop an efficient open-source software library to easily integrate our new methods within other highly impactful software packages for the analysis of genetic data. This will allow the broadest application of our methods, as users will be able to adopt them for a variety of analyses, such as the estimation of transmission histories or the timely identification of variants of concern.

4) Enable pandemic-scale Bayesian phylogenetics. Bayesian phylogenetics is at the core of most advanced applications in genomic epidemiology, such as phylogeography (the study of the spread of pathogens within and between borders) and phylodynamics (the study of pathogen prevalence changes through time). We will integrate our methods within the widely used Bayesian phylogenetic package BEAST to allow the analysis of data sets of millions of genomes.

Technical Summary

Phylogenetic analyses of genome sequences from infectious pathogens reveal essential information regarding their evolution and transmission history. As the COVID-19 pandemic exemplified, these data and their analysis play a crucial role in epidemiology and are essential to (e.g.) track and reconstruct the spread of infectious disease within communities and between countries, to understand the dynamics of transmission, and to monitor pathogen evolution as showcased by the identification of new SARS-CoV-2 variants of concern. Improvements to genomic epidemiological data analysis methods therefore will enhance our preparedness towards future pandemics.

Current investigations of genomic epidemiological data often rely on computational methods that are not tailored to their needs, but rather were developed for evolutionary biology studies where typically few, highly diverged genomes are considered. Most desirable analyses of large genome sequence data sets, such as those that emerged during the COVID-19 pandemic, are thus currently unfeasible.

We will address this limitation by developing computational methods tailored for pandemic-scale genomic epidemiology. These methods will enable accurate real-time analyses of large genomic epidemiological data sets. We will develop new algorithms for analysing genome sequence data, allowing the analysis of millions of genome sequences, based on our "mutation-annotated tree" approach; we will develop bespoke mathematical models of genome evolution that take into account complexities of genomic epidemiological data such as variation in mutation rates and recurrent sequencing errors; we will develop an efficient open-source C++ software library to permit easy integration of our methods with other software packages for the analysis of genetic data; and we will undertake such integration with the BEAST Bayesian phylogenetic package to extend its applicability to data sets of millions of genomes.
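
As a rough illustration of why closely related sequences allow more efficient algorithms, the sketch below stores each genome only as its differences from a reference and compares two genomes by looking at just those differing sites. This is a simplified toy under stated assumptions, not the data structure or algorithm used in MAPLE or CMAPLE; the function names `encode`, `decode` and `distance` are hypothetical, and the sequences are assumed to be pre-aligned and of equal length.

```python
from typing import Dict


def encode(reference: str, genome: str) -> Dict[int, str]:
    """Keep only the sites where the genome differs from the reference."""
    if len(reference) != len(genome):
        raise ValueError("sequences must be aligned to the same length")
    return {i: g for i, (r, g) in enumerate(zip(reference, genome)) if r != g}


def decode(reference: str, diffs: Dict[int, str]) -> str:
    """Rebuild the full genome from the reference plus stored differences."""
    seq = list(reference)
    for pos, base in diffs.items():
        seq[pos] = base
    return "".join(seq)


def distance(a: Dict[int, str], b: Dict[int, str]) -> int:
    """Count sites where two encoded genomes differ.

    Only sites where at least one genome departs from the reference are
    visited, so the cost scales with the number of mutations rather than
    with genome length. A missing key means "same as reference".
    """
    return sum(1 for pos in a.keys() | b.keys() if a.get(pos) != b.get(pos))


# Hypothetical toy data: a 10-base reference and two near-identical genomes.
reference = "ACGTACGTAC"
genome_1 = "ACGTACTTAC"  # one substitution, at index 6
genome_2 = "ACATACTTAC"  # substitutions at indices 2 and 6

enc_1 = encode(reference, genome_1)
enc_2 = encode(reference, genome_2)
```

With genomes of tens of thousands of bases that differ from a reference at only a handful of sites, a representation of this kind keeps memory use and comparison cost proportional to the number of mutations, which is the general property the project's methods are described as exploiting.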

 
Description Bui / ANU Canberra 
Organisation Australian National University (ANU)
Country Australia 
Sector Academic/University 
PI Contribution We are collaborating with Dr. Minh Bui and his team on phylogenetic methods appropriate for pandemic-scale data. This is helping us to progress the aims of the funded project. Some of our advanced methods are being incorporated in Dr. Bui's world-leading IQ-TREE software package, and we are jointly working on CMAPLE, which makes our methods available to the wider community via a C++ package.
Collaborator Contribution We are collaborating with Dr. Minh Bui and his team on phylogenetic methods appropriate for pandemic-scale data. This is helping us to progress the aims of the funded project. Some of our advanced methods are being incorporated in Dr. Bui's world-leading IQ-TREE software package, and we are jointly working on CMAPLE, which makes our methods available to the wider community via a C++ package.
Impact Peer-reviewed publication: https://academic.oup.com/mbe/article/41/7/msae134/7700168
Start Year 2023
 
Description Iqbal / University of Bath 
Organisation University of Bath
Country United Kingdom 
Sector Academic/University 
PI Contribution We collaborated with Prof. Iqbal and his team during his time at EMBL-European Bioinformatics Institute and are continuing now that he has moved to the University of Bath. We are working with him to better understand errors that have occurred during sequencing of large numbers of SARS-CoV-2 genomes in diverse labs around the world, and on methods to identify and, where possible, correct these errors.
Collaborator Contribution We collaborated with Prof. Iqbal and his team during his time at EMBL-European Bioinformatics Institute and are continuing now that he has moved to the University of Bath. We are working with him to better understand errors that have occurred during sequencing of large numbers of SARS-CoV-2 genomes in diverse labs around the world, and on methods to identify and, where possible, correct these errors.
Impact Preprint: https://www.biorxiv.org/content/10.1101/2024.07.12.603240v1 Preprint: https://www.biorxiv.org/content/10.1101/2024.04.29.591666v3
Start Year 2024
 
Title CMAPLE 
Description CMAPLE is a fast and efficient implementation of the software MAPLE. It enables fast computation of pandemic-scale phylogenetic trees. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact TBD 
URL https://github.com/iqtree/cmaple
 
Title MAPLE - ongoing development 
Description Continued development of Maximum parsimonious likelihood estimation for pandemic-scale phylogenetics. Software for https://doi.org/10.1038/s41588-023-01368-0. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact TBD 
URL https://doi.org/10.1038/s41588-023-01368-0
 
Description Talk at MASAMB 2024 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact Conference talk at Mathematical and Statistical Aspects of Molecular Biology 2024
Year(s) Of Engagement Activity 2024
URL https://masamb2024.wixsite.com/masamb2024