Designing a better genomic surveillance system using generative AI models of pathogen evolution
Lead Research Organisation:
European Bioinformatics Institute
Department Name: European Bioinformatics Institute
Abstract
Genomic surveillance is the practice of routinely sampling pathogenic microbes such as bacteria and viruses and then sequencing their genomes. Early uses of these sequences mostly focused on a single sample: determining the species and strain of the pathogen and its overall virulence and resistance properties, predicting antimicrobial resistance from the genome, and assessing whether the sample is a vaccine-escape mutant. As surveillance programs have expanded, particularly during the COVID-19 pandemic, large-scale population-based analyses have become possible for infection control, offering practical information including:
Determining whether an infection was part of an ongoing local epidemic or an imported case from travel.
Inferring transmission patterns and risk factors for infection, using combined genomic and epidemiological data (such as exposure information).
Rapidly detecting emergence of new strains/variants, and whether these are likely to be treatment (vaccine, drug) resistant.
Predicting growth rates - whether a particular strain or gene is likely to spread further, and how rapidly.
Genomic surveillance initiatives are constrained by resources such as funding and the number of samples available. Data producers need to consider carefully how to use their limited resources effectively. How much of a given pathogen should they sequence? From where should the samples be taken? Which types of cases should be sequenced? SARS-CoV-2 sequencing volumes have declined from a peak of around 200,000 sequences per week, but analyses of much smaller subsamples of SARS-CoV-2 data have shown that many of the questions that genomic surveillance addressed could have been answered with fewer samples [1], although a more even global spread of sampling would have been more effective still.
Rather than sequence analysis 'making do' with the available data, a more fruitful approach would be to design a surveillance strategy which can answer the most important questions for a given pathogen. A major obstacle is that we do not possess the methods to undertake the kinds of power calculations that determine how much data is needed, and what the sampling frames should be.
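As an illustration of the simplest kind of power calculation referred to above, the sketch below asks how many genomes must be sequenced to detect at least one case of a variant circulating at a given prevalence. It assumes independent, uniformly random sampling, which real surveillance rarely achieves, and the function names are hypothetical rather than part of any project software.

```python
import math


def detection_probability(prevalence: float, n_samples: int) -> float:
    """Probability that at least one of n independent random samples carries the variant."""
    return 1.0 - (1.0 - prevalence) ** n_samples


def samples_needed(prevalence: float, power: float) -> int:
    """Smallest n for which detection_probability(prevalence, n) >= power."""
    if not (0.0 < prevalence < 1.0 and 0.0 < power < 1.0):
        raise ValueError("prevalence and power must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= power for n.
    n = max(1, math.ceil(math.log(1.0 - power) / math.log(1.0 - prevalence)))
    # Guard against floating-point rounding in either direction.
    while detection_probability(prevalence, n) < power:
        n += 1
    while n > 1 and detection_probability(prevalence, n - 1) >= power:
        n -= 1
    return n


if __name__ == "__main__":
    # Variant at 1% prevalence, 95% chance of sampling it at least once.
    print(samples_needed(0.01, 0.95))
```

Questions such as distinguishing imported from local cases or estimating growth rates have no closed form of this kind, which is why simulation from a generative model is proposed instead.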
We will use new developments in AI to fill this need, and by extension benefit scientific study in the UK: the funding will facilitate the design of sensible, economic and useful genomic surveillance strategies. We will use large amounts of publicly available pathogen sequence data to train an AI model that can generate realistic pathogen sequences. The model will be similar in structure to recently released language models such as ChatGPT and LLaMA, but rather than using a sequence of words to generate text, we will use a sequence of bases or genes to generate genomes. By using this model to simulate large numbers of genomes, we can then test common surveillance questions and guide how much sampling is needed, whom and where to sample, and how often.
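The idea of generating a genome one gene token at a time can be sketched with a deliberately simple stand-in. The example below fits a first-order Markov chain (a bigram model) to gene-order sequences and samples new ones; it is not a transformer and not the project's model. The gene names, training data and function names are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary tokens


def fit_bigram(genomes):
    """Count how often each gene token follows each other token."""
    counts = defaultdict(Counter)
    for genes in genomes:
        tokens = [START, *genes, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_genome(counts, rng, max_genes=50):
    """Generate one gene order by repeatedly sampling the next token."""
    genome, current = [], START
    while len(genome) < max_genes:
        options = counts[current]
        nxt = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        if nxt == END:
            break
        genome.append(nxt)
        current = nxt
    return genome


# Toy training set: each genome is an ordered list of gene tokens.
training = [
    ["dnaA", "dnaN", "recF", "gyrB"],
    ["dnaA", "dnaN", "gyrB"],
    ["dnaA", "recF", "gyrB", "gyrA"],
]
model = fit_bigram(training)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_genome(model, rng))
```

A language model replaces the bigram counts with a learned distribution conditioned on a much longer context, but the sampling loop has the same shape.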
Compared to existing kinds of genomic simulations, our new approach using AI and large language models will offer the following key advantages: the model structure will help identify evolutionary patterns, filling gaps in current models; it offers potential for forecasting evolution and flagging unexpected deviations (e.g. highly mutated variants); and its computational performance will scale to the numbers of genomes needed to assess future surveillance initiatives.
As new surveillance questions are designed, our generative model can be adapted to understand how best to answer them. This research will also unlock future potential uses in understanding the function of genes and general mechanisms of pathogen evolution.
Publications
Hellewell J
(2024)
CELEBRIMBOR: core and accessory genes from metagenomes.
in Bioinformatics (Oxford, England)
| Description | Data-driven exploration of the chemical and genomic space of bacterial capsules |
| Amount | € 150,000 (EUR) |
| Organisation | European Molecular Biology Laboratory |
| Sector | Academic/University |
| Country | Germany |
| Start | 01/2024 |
| End | 12/2025 |
| Description | Developing and deploying a comprehensive database and web-based AI tool for accurate rapid pneumococcal serotyping from genomic data |
| Amount | $25,000 (USD) |
| Organisation | Pfizer Inc |
| Sector | Private |
| Country | United States |
| Start | 02/2024 |
| End | 01/2025 |
| Description | Generative AI for bacterial genomes |
| Organisation | Simon Fraser University |
| Country | Canada |
| Sector | Academic/University |
| PI Contribution | We are leading this partnership to create AI models to study gene order in bacterial genomes. We are developing new models, creating open training datasets, fitting the models and iteratively evaluating and improving them. |
| Collaborator Contribution | SFU partners: providing expert input and training on model architecture and epidemiological use cases. Sanger partners: providing expert input on genomics and evolutionary considerations. All partners: regular expert meetings. In-person meeting in March 2025. |
| Impact | Outputs: Open source pangenome simulator https://github.com/bacpop/Pansim Open source model architecture and training https://github.com/samhorsfield96/pangenome_LLM Disciplines: AI, statistics, genomics, evolutionary biology, epidemiology |
| Start Year | 2024 |
| Description | Generative AI for bacterial genomes |
| Organisation | The Wellcome Trust Sanger Institute |
| Country | United Kingdom |
| Sector | Charity/Non Profit |
| PI Contribution | We are leading this partnership to create AI models to study gene order in bacterial genomes. We are developing new models, creating open training datasets, fitting the models and iteratively evaluating and improving them. |
| Collaborator Contribution | SFU partners: providing expert input and training on model architecture and epidemiological use cases. Sanger partners: providing expert input on genomics and evolutionary considerations. All partners: regular expert meetings. In-person meeting in March 2025. |
| Impact | Outputs: Open source pangenome simulator https://github.com/bacpop/Pansim Open source model architecture and training https://github.com/samhorsfield96/pangenome_LLM Disciplines: AI, statistics, genomics, evolutionary biology, epidemiology |
| Start Year | 2024 |
| Title | Pangenome LLM |
| Description | A transformer model based on tokenised genes in bacterial genomes - used for predicting the next gene in a genome given a prompt, or masked genes either side. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Applications include improved annotation, finding unusual clusters, and coselected genes. |
| Title | Pansim |
| Description | A forward simulator for bacterial pangenomes, which runs rapidly |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Generated synthetic data for fitting pangenome parameters |
| Title | PopPIPE |
| Description | This pipeline can be used to automatically cluster and subcluster bacterial genomes. Visualisation and transmission analyses are also supported downstream. The pipeline is automated and reproducible. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | This software was independently used by NHS Scotland collaborators to retrospectively determine transmission clusters of vancomycin-resistant Enterococcus faecium in hospital wards, achieving better resolution and automation than previous methods. |
| URL | https://poppunk.bacpop.org/subclustering.html |
| Title | PopPUNK-mod |
| Description | Simulates core and accessory genome divergence and calculates Hamming and Jaccard distances. Used to fit pangenome data to mechanistic models of evolution. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Estimated whether accessory genome was likely adaptive for >25 bacterial species |
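The two distance measures named in the PopPUNK-mod description can be illustrated as follows. This is a minimal sketch, not code from PopPUNK-mod: it assumes the Hamming distance is taken over equal-length aligned core sequences and reported as a proportion, and that the Jaccard distance is taken over sets of accessory gene identifiers. The function names are hypothetical.

```python
def hamming_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def jaccard_distance(genes_a: set, genes_b: set) -> float:
    """One minus the shared fraction of the combined gene content."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)


if __name__ == "__main__":
    # Core divergence: one mismatch in four aligned bases.
    print(hamming_distance("ACGT", "ACGA"))
    # Accessory divergence: two shared genes out of four in total.
    print(jaccard_distance({"geneA", "geneB", "geneC"}, {"geneB", "geneC", "geneD"}))
```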
| Title | WTBcluster |
| Description | WTBcluster calls bacterial proteins using Prodigal, then iteratively clusters them using Linclust, part of the MMseqs2 suite of tools. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Clustered 10 billion proteins from the AllTheBacteria dataset, ready for use in transformer models. These clusters have also been used by UniProt to identify redundancy. The software is being extended to compare clusters between different datasets. |
| Description | Young Adult Carers Public outreach |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Schools |
| Results and Impact | ~150 Suffolk Young Adult Carers (children and parents) attended an EMBL-EBI public outreach day, run in Bury St Edmunds. We ran two interactive activities on understanding proteins, and talked to the children about science questions related to our research. |
| Year(s) Of Engagement Activity | 2024 |
