Computational methods for pandemic-scale genomic epidemiology

Lead Research Organisation: European Bioinformatics Institute
Department Name: Goldman Group

Abstract

Phylogenetic analyses of genome sequences from infectious pathogens can reveal essential information regarding their evolution and transmission history. As the COVID-19 pandemic exemplified, these analyses and data play a crucial role in epidemiology and are essential to track and reconstruct the spread of infectious disease within communities and between countries; to understand the dynamics of transmission; to estimate the efficacy of containment measures; to predict epidemiological dynamics; and to monitor pathogen evolution as showcased by the identification of new SARS-CoV-2 mutations and variants of concern.

With ongoing improvement and widespread adoption of genome sequencing technologies, genomic epidemiology will become a key medical asset. Improvements to genomic epidemiological data analysis methods therefore will not only help us tackle ongoing infectious disease epidemics, but will also enhance our preparedness towards future pandemics.

However, current investigations of genomic epidemiological data are predominantly based on computational methods that are not tailored to their needs, but rather were developed for evolutionary biology studies where typically few, highly diverged genomes are considered. Most desirable analyses of large genome sequence data sets, such as those that emerged during the COVID-19 pandemic, are thus currently unfeasible.

In this project we address this limitation by developing computational methods tailored for pandemic-scale genomic epidemiology. These methods will enable accurate real-time analyses of large genomic epidemiological data sets. These objectives fall within several priorities of the MRC, such as "Global health", "Infections and immunity", "Antimicrobial resistance", and "Biomedical and health data science". Our specific aims are to:

1) Develop algorithms for genomic epidemiology. We will develop new algorithms for analysing genome sequence data. We will exploit the fact that sequences in genomic epidemiology are typically very closely related, and thus very similar to each other, to devise algorithms and mathematical approaches tailored for this field. Based on our past experience, we expect these approaches to be thousands of times more efficient than traditional methods, allowing the analysis of millions rather than thousands of genome sequences.

2) Increase realism and accuracy. Highly variable mutation rates and recurrent sequence errors, while common, cause errors and uncertainty in current genomic epidemiological analyses. To increase the accuracy of our methods without affecting their efficiency, we will develop bespoke mathematical models of genome evolution that take into account these complexities of genomic epidemiological data.

3) Pave the way to wider implementation. We will develop an efficient open-source software library to easily integrate our new methods within other highly impactful software packages for the analysis of genetic data. This will allow the broadest application of our methods, as users will be able to adopt them for a variety of analyses, such as the estimation of transmission histories or the timely identification of variants of concern.

4) Enable pandemic-scale Bayesian phylogenetics. Bayesian phylogenetics is at the core of most advanced applications in genomic epidemiology, such as phylogeography (the study of the spread of pathogens within and between borders) and phylodynamics (the study of pathogen prevalence changes through time). We will integrate our methods within the widely used Bayesian phylogenetic package BEAST to allow the analysis of data sets of millions of genomes.
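
As a rough illustration of why closely related sequences allow cheaper computation (aim 1), the sketch below stores each genome only as its differences from a reference, so that comparing two genomes touches a handful of variant positions rather than the whole sequence. This is a toy example under stated assumptions, not the project's algorithm: the reference string and the function names `encode`, `decode` and `hamming` are hypothetical, and sequences are assumed to be pre-aligned and of equal length.

```python
from typing import Dict

# Hypothetical toy reference; real pathogen genomes are far longer.
REFERENCE = "ACGTACGTAC"


def encode(genome: str, reference: str = REFERENCE) -> Dict[int, str]:
    """Keep only the 0-based positions where the genome differs from the reference."""
    if len(genome) != len(reference):
        raise ValueError("this sketch assumes aligned sequences of equal length")
    return {i: b for i, (a, b) in enumerate(zip(reference, genome)) if a != b}


def decode(diffs: Dict[int, str], reference: str = REFERENCE) -> str:
    """Rebuild the full genome by applying the stored differences to the reference."""
    bases = list(reference)
    for pos, base in diffs.items():
        bases[pos] = base
    return "".join(bases)


def hamming(d1: Dict[int, str], d2: Dict[int, str], reference: str = REFERENCE) -> int:
    """Count sites at which two encoded genomes differ.

    Only positions where at least one genome departs from the reference
    can differ, so the cost scales with the number of variants,
    not with the genome length.
    """
    return sum(
        1
        for pos in d1.keys() | d2.keys()
        if d1.get(pos, reference[pos]) != d2.get(pos, reference[pos])
    )


if __name__ == "__main__":
    a = encode("ACGTACGTAT")
    b = encode("ACTTACGTAT")
    print(a, b, hamming(a, b))
```

When genomes differ from the reference at only a few sites, both storage and pairwise comparison in this representation depend on the number of differences rather than on genome length, which is the general property the aim describes.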

Technical Summary

Phylogenetic analyses of genome sequences from infectious pathogens reveal essential information regarding their evolution and transmission history. As the COVID-19 pandemic exemplified, these data and their analysis play a crucial role in epidemiology and are essential to (e.g.) track and reconstruct the spread of infectious disease within communities and between countries, to understand the dynamics of transmission, and to monitor pathogen evolution as showcased by the identification of new SARS-CoV-2 variants of concern. Improvements to genomic epidemiological data analysis methods therefore will enhance our preparedness towards future pandemics.

Current investigations of genomic epidemiological data often rely on computational methods that are not tailored to their needs, but rather were developed for evolutionary biology studies where typically few, highly diverged genomes are considered. Most desirable analyses of large genome sequence data sets, such as those that emerged during the COVID-19 pandemic, are thus currently unfeasible.

We will address this limitation by developing computational methods tailored for pandemic-scale genomic epidemiology. These methods will enable accurate real-time analyses of large genomic epidemiological data sets. We will develop new algorithms for analysing genome sequence data, allowing the analysis of millions of genome sequences, based on our "mutation-annotated tree" approach; we will develop bespoke mathematical models of genome evolution that take into account complexities of genomic epidemiological data such as variation in mutation rates and recurrent sequencing errors; we will develop an efficient open-source C++ software library to permit easy integration of our methods with other software packages for the analysis of genetic data; and we will undertake such integration with the BEAST Bayesian phylogenetic package to extend its applicability to data sets of millions of genomes.
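
The summary does not define the "mutation-annotated tree" approach. A minimal sketch of one plausible reading follows, assuming that each tree node records only the mutations on the branch leading to it and that a node's genome is recovered by applying those mutations along the path from the root. The class `Node` and the functions `genome_at` and `total_mutations` are hypothetical names for illustration and are not taken from the project's software.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Mutation = Tuple[int, str]  # (0-based position, new base); an assumed encoding


@dataclass
class Node:
    name: str
    mutations: List[Mutation] = field(default_factory=list)
    # parent is excluded from repr/compare to avoid recursing through the tree
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, name: str, mutations: List[Mutation]) -> "Node":
        """Attach a child whose branch carries the given mutations."""
        child = Node(name, list(mutations), parent=self)
        self.children.append(child)
        return child


def genome_at(node: Node, root_sequence: str) -> str:
    """Reconstruct a node's genome by replaying mutations from root to node."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    bases = list(root_sequence)
    for n in reversed(path):  # root first, so later mutations override earlier ones
        for pos, base in n.mutations:
            bases[pos] = base
    return "".join(bases)


def total_mutations(root: Node) -> int:
    """Sum the mutations stored on all branches (iterative traversal)."""
    stack, count = [root], 0
    while stack:
        n = stack.pop()
        count += len(n.mutations)
        stack.extend(n.children)
    return count


if __name__ == "__main__":
    root = Node("root")
    a = root.add_child("A", [(2, "T")])
    b = a.add_child("B", [(9, "T")])
    print(genome_at(b, "ACGTACGTAC"), total_mutations(root))
```

Under this assumption the tree stores one full sequence at the root plus a short list of changes per branch, so memory grows with the number of mutations rather than with the number of genomes multiplied by genome length.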
