Designing a better genomic surveillance system using generative AI models of pathogen evolution
Lead Research Organisation:
EMBL - European Bioinformatics Institute
Department Name: European Bioinformatics Institute
Abstract
Genomic surveillance is the practice of routinely sampling pathogenic microbes such as bacteria and viruses and then sequencing their genomes. The initial uses of these sequences mostly looked at a single sample: determining the species and strain of the pathogen and its overall virulence and resistance properties; predicting antimicrobial resistance from the genome; and checking whether the sample is a vaccine-escape mutant. As surveillance programs have expanded, particularly during the COVID-19 pandemic, large-scale population-based analyses have become possible for infection control, offering practical information including:
Determining whether an infection was part of an ongoing local epidemic or an imported case from travel.
Inferring transmission patterns and risk factors for infection, using combined genomic and epidemiological data (such as exposure information).
Rapidly detecting emergence of new strains/variants, and whether these are likely to be treatment (vaccine, drug) resistant.
Predicting growth rates - whether a particular strain or gene is likely to spread further, and how rapidly.
Genomic surveillance initiatives are constrained by resources such as funding and the number of samples available. Data producers need to consider carefully how to use their limited resources effectively. How much of a given pathogen should they sequence? From where should the samples be taken? Which types of cases should be sequenced? SARS-CoV-2 sequencing volumes have declined from a peak of around 200,000 sequences per week, but analyses of much smaller subsamples of SARS-CoV-2 data have shown that many of the questions which genomic surveillance addressed could have been answered with fewer samples [1], although a more even global spread of sampling would have been more effective still.
Rather than sequence analysis 'making do' with the available data, a more fruitful approach would be to design a surveillance strategy which can answer the most important questions for a given pathogen. A major obstacle is that we do not possess the methods to undertake the kinds of power calculations that determine how much data is needed, and what the sampling frames should be.
We will use new developments in AI to fill this need, and by extension benefit scientific study in the UK: the funding will facilitate the design of sensible, economic and useful genomic surveillance strategies. We will use large amounts of publicly available pathogen sequence data to train an AI model that can generate realistic pathogen sequences. The model will be similar in structure to recently released language models such as ChatGPT and LLaMA, but rather than using a sequence of words to generate text, we will use a sequence of bases or genes to generate genomes. By using this model to simulate large numbers of genomes, we can then test common surveillance questions and guide how much sampling is needed, who and where to sample, and how often.
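To make the idea of a simulation-based power calculation concrete, the following is a minimal sketch of the simplest such question: how many cases must be sequenced to detect a variant at least once with a chosen probability? It is an illustration only, not the method proposed here. It assumes cases are sampled independently and at random and that the variant has a fixed prevalence; the function names are hypothetical, and simple random draws stand in for the simulated genomes that a generative model would supply.

```python
import math
import random


def detection_probability(prevalence: float, n_samples: int) -> float:
    """Probability that at least one of n_samples sequenced cases is the variant."""
    return 1.0 - (1.0 - prevalence) ** n_samples


def min_samples_for_detection(prevalence: float, power: float) -> int:
    """Smallest sample size whose detection probability reaches `power`."""
    if not (0.0 < prevalence < 1.0 and 0.0 < power < 1.0):
        raise ValueError("prevalence and power must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= power for n.
    n = math.ceil(math.log(1.0 - power) / math.log(1.0 - prevalence))
    # Guard against floating-point rounding at the boundary.
    while detection_probability(prevalence, n) < power:
        n += 1
    return n


def simulate_detection(prevalence: float, n_samples: int,
                       trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the detection probability.

    Each trial draws n_samples independent cases (an assumption of this sketch)
    and records whether the variant appears at least once.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < prevalence for _ in range(n_samples)):
            hits += 1
    return hits / trials


# Example: a variant at 1% prevalence, 95% chance of seeing it at least once.
n_needed = min_samples_for_detection(prevalence=0.01, power=0.95)
print(n_needed)
print(detection_probability(0.01, n_needed))
print(simulate_detection(0.01, n_needed))
```

The closed-form answer is only available because this question is so simple. For questions such as inferring transmission patterns or estimating growth rates there is no such formula, which is where repeated analysis of simulated genomes, as in the Monte Carlo function above, would take its place.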
Compared to existing kinds of genomic simulations, our new approach using AI and large language models will offer the following key advantages: the model structure will help identify evolutionary patterns, filling gaps where existing models are lacking; it offers the potential to forecast evolution and flag unexpected deviations (e.g. highly mutated variants); and its computational performance will reach the numbers of genomes needed to assess future surveillance initiatives.
As new surveillance questions are designed, our generative model can be adapted to understand how best to answer them. This research will also unlock future potential uses in understanding the function of genes and general mechanisms of pathogen evolution.