Infection-AID: AI assisted genomic profiling to inform the Diagnosis, personalised treatment and control of infections

Lead Research Organisation: London School of Hygiene and Tropical Medicine
Department Name: Infectious and Tropical Diseases

Abstract

Characterising the genetic code ("genome") of an organism can inform on its ability to survive, tolerate drugs and treatments, and its likely geographical source. Researchers can investigate the genome of an organism, and its important mutations (genome "spelling mistakes"), by applying sequencing technologies to its DNA. Cost-effective and rapid sequencing technologies are now being rolled out in hospitals and clinics to identify important mutations, and thereby prevent disease, inform diagnosis, and personalise the treatment of patients. Genome sequencing has become an important diagnostic tool in infectious disease settings, including to identify microorganisms causing infections ("pathogens") and their resistance to drugs, and to track outbreaks. Such knowledge is revolutionising clinical decision making, public health surveillance and infection control; as demonstrated during the COVID-19 pandemic, where rapid sequencing of the causal SARS-CoV-2 viral genomes has assisted the detection of clinically important mutations (e.g., omicron variants) and informed on their geographical spread ("transmission patterns"). To assist the analysis of the large datasets arising from the sequencing of pathogens, it is important to identify key mutations linked to (severe) patient outcomes, drug resistance, likely geographical source, and other important "barcoding" information that can provide a "profile" of the pathogen underlying any infection. Computer software tools have been developed (e.g., our TB-Profiler and Malaria-Profiler software) that can rapidly analyse sequence data to provide such pathogen profiles, for easy interpretation by medical doctors and infection control specialists.

With the increasing use of sequencing technologies in hospitals and clinics, there is a need for Artificial Intelligence (AI) computational methods to analyse the resulting "big data" in real time, including to update the lists of barcoding genetic mutations and to identify whether the pathogen genome is related to those previously sequenced (i.e., whether it is being transmitted). We have previously applied AI methods to identify known and novel genetic mutations linked to drug resistance and transmission, as well as created computing repositories (e.g., TB-ML) where the underlying software can be stored, allowing comparisons between statistical models and AI approaches. Our proposed project will integrate these AI-based tools into our profiling software to reveal drug resistance mutation and transmission patterns, and generate informative reports for clinical and infection control decision making. Working within established collaborations involving the UK Health Security Agency and health ministries in Asia (Bangladesh, Philippines, Thailand, Vietnam), which are routinely using sequencing technologies to inform clinical diagnosis, we will attempt to implement the resulting AI systems software in the UK and in overseas settings endemic for infectious diseases. We will initially focus on three main infectious diseases of high global burden: tuberculosis, malaria and Klebsiella infections, with the potential to extend the work to other infections. All sequence data and software developed will be made publicly accessible, leading to their use by other biomedical researchers and healthcare stakeholders. Ultimately, the implementation of such AI-based tools will reduce the burden of infectious diseases, leading to healthier populations and associated economic benefits.

 
Description We have developed methods to systematically download and analyse sequence data across infectious diseases (e.g., malaria, TB) and integrate these into AI models that predict key clinical and epidemiological insights, such as drug resistance, geographic origin, strain types, and transmission dynamics. These models, along with the key predictive mutations they identify, are currently undergoing validation through additional sequencing efforts, with three manuscripts in preparation. The malaria AI model is being incorporated into UKHSA workflows, while the TB model is being implemented within Thailand's health systems.
Exploitation Route The AI models, along with the underlying software and data, are being made accessible to the research community. We plan to seek follow-up funding to expand implementations to additional countries and pathogens. As noted, the UKHSA and Thailand Ministry of Public Health are integrating these AI and informatics tools into their systems and can serve as key advocates for future initiatives.
Sectors Digital/Communication/Information Technologies (including Software), Education, Healthcare

 
Description Generated sequence data have been processed through AI models to characterise pathogen genotypic profiles, including drug resistance and geographic origin. These insights have supported the UKHSA in cryptic malaria investigations and informed clinical decision-making and outbreak investigations within the Thailand Ministry of Public Health.
First Year Of Impact 2024
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Economic, Policy & public services

 
Title Bioinformatic and AI tools 
Description We have established bioinformatic pipelines for all the pathogens considered in this project (e.g., Mycobacterium tuberculosis, Klebsiella, Plasmodium species), which process raw sequences into variants that are used in the machine learning models. To assist the application of the machine learning models, we have developed Docker containers: functional software modules that cover data inputs, processing and outputs. These allow different machine learning methods and models to be compared across datasets. We propose to share this framework, linked to a scientific publication in preparation.
Type Of Material Improvements to research infrastructure 
Year Produced 2024 
Provided To Others? No  
Impact The use of Docker containers means that we have a framework for sharing computing code and outputs from the implementation of different machine learning methods.
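
As a rough illustration of how pipeline variants might feed a model, the sketch below turns per-sample variant calls into a presence/absence matrix and ranks variants by how much more often they occur in resistant than in susceptible samples. It is a minimal example under stated assumptions, not the project's own software: the function names, variant labels, sample identifiers and phenotypes are all hypothetical, and the frequency-difference score is a simple stand-in for the machine learning models described above.

```python
def build_matrix(sample_variants):
    """Return (sorted variant names, {sample: list of 0/1 presence flags})."""
    variants = sorted({v for calls in sample_variants.values() for v in calls})
    rows = {
        sample: [1 if v in calls else 0 for v in variants]
        for sample, calls in sample_variants.items()
    }
    return variants, rows


def rank_variants(sample_variants, phenotypes):
    """Rank variants by (frequency in resistant) - (frequency in susceptible)."""
    variants, rows = build_matrix(sample_variants)
    resistant = [s for s in rows if phenotypes[s] == 1]
    susceptible = [s for s in rows if phenotypes[s] == 0]
    if not resistant or not susceptible:
        raise ValueError("need at least one resistant and one susceptible sample")
    scores = {}
    for j, variant in enumerate(variants):
        freq_res = sum(rows[s][j] for s in resistant) / len(resistant)
        freq_sus = sum(rows[s][j] for s in susceptible) / len(susceptible)
        scores[variant] = freq_res - freq_sus
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: variant calls per sample and a binary phenotype
# (1 = resistant, 0 = susceptible). None of these values come from the project.
toy_variants = {
    "s1": {"geneA_c.10A>G", "geneB_c.5C>T"},
    "s2": {"geneA_c.10A>G"},
    "s3": {"geneB_c.5C>T"},
    "s4": set(),
}
toy_phenotypes = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}

if __name__ == "__main__":
    for variant, score in rank_variants(toy_variants, toy_phenotypes):
        print(f"{variant}\t{score:+.2f}")
```

In practice the matrix would be passed to whichever model is under comparison; the ranking step here only shows the general shape of the variant-to-phenotype association being sought.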
 
Title Infection genomics datasets 
Description We have been automatically downloading sequence data and metadata linked to the pathogens of interest in our project (e.g., Mycobacterium, Plasmodium, Klebsiella), and passing them through our bioinformatic pipelines. This is resulting in large datasets for each pathogen (e.g., M. tuberculosis n>100K), which we then use in our machine learning approaches.
Type Of Material Improvements to research infrastructure 
Year Produced 2023 
Provided To Others? No  
Impact This approach means that we have growing datasets to inform and validate our machine learning models, which in turn provide insights into mutations linked to drug resistance, strain-types and geographical source. The raw data are mostly in the public domain, but by combining them and developing machine learning models, we are creating resources that will be useful to those who lack computational expertise but can use them to drive their research.
 
Description Thailand Ministry of Public Health - Sequence data and informatics 
Organisation Ministry of Public Health
Country Thailand 
Sector Public 
PI Contribution We have developed the bioinformatic pipelines and adapted our informatic tools (e.g., TB-Profiler) for use by the MOPH.
Collaborator Contribution The MOPH are sharing TB sequence and AMR phenotypic data that is being used to update our machine learning models. They are also assessing the mutations being found by our machine learning models, for their biological and potential clinical relevance.
Impact Outputs include: (1) >1,200 M. tuberculosis isolates with whole genome sequencing data to date; (2) TB-Profiler installed at the MOPH, and generating outputs in the Thai language.
Start Year 2023
 
Description UK Health Security Agency 
Organisation Public Health England
Country United Kingdom 
Sector Public 
PI Contribution We are working with the UKHSA Malaria reference laboratory (UKHSA-MRL) to sequence isolate DNA sourced from clinical cases, to infer parasite species and drug resistance. These data are being used in our machine learning models.
Collaborator Contribution The UKHSA-MRL are contributing Plasmodium DNA and linked anonymised clinical and parasitology data.
Impact To date, we have accrued sequence data and drug resistance phenotypes from 300 Plasmodium parasites sourced from the UKHSA. When these data are used in our machine learning models, we detect mutations that are linked to geographical source and drug resistance. Follow-up experimental validation of drug resistance mutations by the UKHSA-MRL is ongoing.
Start Year 2023
 
Description Workshop on Genomics in Bangkok 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact 60 researchers attended training on genomic and AI data analysis, which strengthens capacity in genomics-based investigations.
Year(s) Of Engagement Activity 2025