Designing a better genomic surveillance system using generative AI models of pathogen evolution

Lead Research Organisation: European Bioinformatics Institute
Department Name: European Bioinformatics Institute

Abstract

Genomic surveillance is the practice of routinely sampling pathogenic microbes such as bacteria and viruses and then sequencing their genomes. The initial uses of these sequences mostly looked at a single sample: determining the species and strain of the pathogen and its overall virulence and resistance properties; predicting antimicrobial resistance from the genome; and identifying whether the sample is a vaccine-escape mutant. As surveillance programs have expanded, particularly during the COVID-19 pandemic, large-scale population-based analyses have become possible for infection control, offering practical information including:

Determining whether an infection was part of an ongoing local epidemic or an imported case from travel.
Inferring transmission patterns and risk factors for infection, using combined genomic and epidemiological data (such as exposure information).
Rapidly detecting emergence of new strains/variants, and whether these are likely to be treatment (vaccine, drug) resistant.
Predicting growth rates - whether a particular strain or gene is likely to spread further, and how rapidly.

Genomic surveillance initiatives are constrained by resources such as funding and the number of samples available. Data producers need to consider carefully how to use their limited resources effectively. How much of a given pathogen should they sequence? From where should the samples be taken? Which types of cases should be sequenced? SARS-CoV-2 sequencing volumes have declined from a peak of around 200,000 sequences per week, but analyses of much smaller subsamples of SARS-CoV-2 data have shown that many of the questions addressed by genomic surveillance could have been answered with fewer samples [1], although a more even global spread of sampling would have been more effective still.

Rather than sequence analysis 'making do' with the available data, a more fruitful approach would be to design a surveillance strategy which can answer the most important questions for a given pathogen. A major obstacle is that we do not possess the methods to undertake the kinds of power calculations that determine how much data is needed, and what the sampling frames should be.

We will use new developments in AI to fill this need, and by extension benefit scientific study in the UK: the funding will facilitate the design of sensible, economical and useful genomic surveillance strategies. We will use large amounts of publicly available pathogen sequence data to train an AI model that can generate realistic pathogen sequences. The model will be similar in structure to recently released language models such as ChatGPT and LLaMA, but rather than using a sequence of words to generate text, we will use a sequence of bases or genes to generate genomes. By using this model to simulate large numbers of genomes, we can then test common surveillance questions and guide how much sampling is needed, who and where to sample, and how often.
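As a rough illustration of the kind of power calculation described above, the sketch below estimates by repeated subsampling how many genomes must be sequenced to detect a variant at least once. It is not the project's method: the population is a hypothetical list of strain labels standing in for genomes simulated by a generative model, and the function names, the 1% variant frequency and the 95% target are all assumptions chosen for illustration.

```python
import random


def detection_power(population, variant, sample_size, trials=2000, seed=0):
    """Estimate the probability that a random sample of `sample_size` genomes
    (drawn without replacement) contains at least one genome labelled `variant`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        if variant in sample:
            hits += 1
    return hits / trials


def min_sample_size(population, variant, target=0.95, step=10, **kwargs):
    """Return the smallest sample size (in multiples of `step`) whose estimated
    detection power reaches `target`, or None if no size up to the full
    population does."""
    for n in range(step, len(population) + 1, step):
        if detection_power(population, variant, n, **kwargs) >= target:
            return n
    return None


# Hypothetical stand-in for simulated genomes: strain labels only, with a
# new variant at 1% frequency in a population of 10,000 (assumed values).
population = ["variant"] * 100 + ["background"] * 9900

if __name__ == "__main__":
    print("power at n=300:", detection_power(population, "variant", 300))
    print("smallest n for 95% power:", min_sample_size(population, "variant", step=50))
```

For this simple question the estimate can be checked against the closed-form approximation 1 - (1 - p)^n. Simulation becomes useful for questions without such a formula, for example those that depend on transmission structure or uneven sampling.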

Compared to existing kinds of genomic simulations, our new approach using AI and large language models will offer the following key advantages: the model structure will help identify evolutionary patterns, filling gaps where existing models are lacking; it offers potential for forecasting evolution and flagging unexpected deviations (e.g. highly mutated variants); and its computational performance will reach the numbers of genomes needed to assess future surveillance initiatives.

As new surveillance questions are designed, our generative model can be adapted to understand how best to answer them. This research will also unlock future potential uses in understanding the function of genes and general mechanisms of pathogen evolution.

Description Data-driven exploration of the chemical and genomic space of bacterial capsules
Amount € 150,000 (EUR)
Organisation European Molecular Biology Laboratory 
Sector Academic/University
Country Germany
Start 01/2024 
End 12/2025
 
Description Developing and deploying a comprehensive database and web-based AI tool for accurate rapid pneumococcal serotyping from genomic data
Amount $25,000 (USD)
Organisation Pfizer Inc 
Sector Private
Country United States
Start 02/2024 
End 01/2025
 
Description Generative AI for bacterial genomes 
Organisation Simon Fraser University
Country Canada 
Sector Academic/University 
PI Contribution We are leading this partnership to create AI models to study gene order in bacterial genomes. We are developing new models, creating open training datasets, fitting the models and iteratively evaluating and improving them.
Collaborator Contribution SFU partners: providing expert input and training on model architecture and epidemiological use cases. Sanger partners: providing expert input on genomics and evolutionary considerations. All partners: regular expert meetings. In-person meeting in March 2025.
Impact Outputs: open source pangenome simulator (https://github.com/bacpop/Pansim); open source model architecture and training (https://github.com/samhorsfield96/pangenome_LLM). Disciplines: AI, statistics, genomics, evolutionary biology, epidemiology.
Start Year 2024
 
Description Generative AI for bacterial genomes 
Organisation The Wellcome Trust Sanger Institute
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution We are leading this partnership to create AI models to study gene order in bacterial genomes. We are developing new models, creating open training datasets, fitting the models and iteratively evaluating and improving them.
Collaborator Contribution SFU partners: providing expert input and training on model architecture and epidemiological use cases. Sanger partners: providing expert input on genomics and evolutionary considerations. All partners: regular expert meetings. In-person meeting in March 2025.
Impact Outputs: open source pangenome simulator (https://github.com/bacpop/Pansim); open source model architecture and training (https://github.com/samhorsfield96/pangenome_LLM). Disciplines: AI, statistics, genomics, evolutionary biology, epidemiology.
Start Year 2024
 
Title Pangenome LLM 
Description A transformer model based on tokenised genes in bacterial genomes, used for predicting the next gene in a genome given a prompt, or a masked gene given the genes on either side.
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact Applications include improved annotation, finding unusual clusters, and coselected genes. 
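To make the idea of tokenised genes concrete, the sketch below shows one way gene labels could be turned into integer tokens and arranged into next-gene and masked-gene training examples. It is an illustrative assumption, not code from the pangenome_LLM repository: the function names, the special tokens and the toy gene orders are all hypothetical.

```python
def build_vocab(genomes, specials=("<pad>", "<mask>")):
    """Map each gene label to an integer token id, reserving the first ids
    for special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for genome in genomes:
        for gene in genome:
            vocab.setdefault(gene, len(vocab))
    return vocab


def encode(genome, vocab):
    """Convert an ordered list of gene labels into token ids."""
    return [vocab[gene] for gene in genome]


def next_gene_examples(tokens, context=3):
    """Build (prompt, target) pairs: up to `context` preceding genes are the
    prompt, and the following gene is the target."""
    return [
        (tokens[max(0, i - context):i], tokens[i])
        for i in range(1, len(tokens))
    ]


def masked_example(tokens, position, mask_id):
    """Hide the gene at `position`; a model would predict it from the genes
    on either side. Returns the masked sequence and the hidden target."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_id
    return masked, target


# Toy genomes written as ordered gene labels (illustrative only).
genomes = [
    ["dnaA", "dnaN", "recF", "gyrB"],
    ["dnaA", "dnaN", "gyrB", "gyrA"],
]
vocab = build_vocab(genomes)
tokens = encode(genomes[0], vocab)

if __name__ == "__main__":
    print(next_gene_examples(tokens))
    print(masked_example(tokens, 2, vocab["<mask>"]))
```

In practice the labels would be gene cluster identifiers rather than gene names, so that homologous genes in different genomes share a token.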
 
Title Pansim 
Description A forward simulator for bacterial pangenomes, which runs rapidly 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact Created synthetic data for fitting pangenome parameters
 
Title PopPIPE 
Description This pipeline can be used to automatically cluster and subcluster bacterial genomes. Visualisation and transmission analyses are also supported downstream. The pipeline is automated and reproducible. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact This software was independently used by NHS Scotland collaborators to retrospectively determine transmission clusters of vancomycin-resistant Enterococcus faecium in hospital wards, achieving better resolution and automation than previous methods.
URL https://poppunk.bacpop.org/subclustering.html
 
Title PopPUNK-mod 
Description Simulates core and accessory genome divergence and calculates Hamming and Jaccard distances. Used to fit pangenome data to mechanistic models of evolution. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact Estimated whether accessory genome was likely adaptive for >25 bacterial species 
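The two distances named above can be sketched as follows. This is a minimal illustration, not PopPUNK-mod's implementation. It assumes that the Hamming distance is taken as the proportion of differing sites between two aligned core sequences, and that the Jaccard distance is computed on the sets of accessory genes each genome carries. The record does not state this pairing, and the function names are hypothetical.

```python
def hamming_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ between two equal-length
    sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def jaccard_distance(genes_a, genes_b):
    """One minus the number of shared genes divided by the number of distinct
    genes found in either genome."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


if __name__ == "__main__":
    # Two of eight aligned sites differ.
    print(hamming_distance("ACGTACGT", "ACGTTCGA"))
    # Two shared genes out of four distinct genes.
    print(jaccard_distance({"geneA", "geneB", "geneC"}, {"geneB", "geneC", "geneD"}))
```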
 
Title WTBcluster 
Description WTBcluster calls bacterial proteins using Prodigal, then iteratively clusters them using Linclust, part of the MMseqs2 suite of tools.
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact Clustered 10 billion proteins from the AllTheBacteria dataset, ready for use in transformer models. These clusters have also been used by UniProt to identify redundancy. The software is being extended to compare clusters between different datasets. 
 
Description Young Adult Carers Public outreach 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact ~150 Suffolk Young Adult Carers (children and parents) attended an EMBL-EBI public outreach day, run in Bury St Edmunds. We ran two interactive activities on understanding proteins, and talked to the children about science questions related to our research.
Year(s) Of Engagement Activity 2024