Infection-AID: AI assisted genomic profiling to inform the Diagnosis, personalised treatment and control of infections

Lead Research Organisation: London School of Hygiene and Tropical Medicine

Department Name: Infectious and Tropical Diseases

Abstract

Characterising the genetic code ("genome") of an organism can reveal its ability to survive and to tolerate drugs and treatments, as well as its likely geographical source. Researchers can investigate the genome of an organism, and its important mutations (genome "spelling mistakes"), by applying sequencing technologies to its DNA. Cost-effective and rapid sequencing technologies are now being rolled out in hospitals and clinics to identify important mutations, and thereby to prevent and diagnose disease and personalise the treatment of patients. Genome sequencing has become an important diagnostic tool in infectious disease settings, where it is used to identify the microorganisms causing infections ("pathogens") and their resistance to drugs, and to track outbreaks. Such knowledge is revolutionising clinical decision making, public health surveillance and infection control, as demonstrated during the COVID-19 pandemic, when rapid sequencing of genomes of the causal SARS-CoV-2 virus assisted the detection of clinically important mutations (e.g., Omicron variants) and revealed their geographical spread ("transmission patterns"). To assist the analysis of the large datasets arising from the sequencing of pathogens, it is important to identify key mutations linked to (severe) patient outcomes, drug resistance, likely geographical source, and other important "barcoding" information that can provide a "profile" of the pathogen underlying any infection. Computer software tools have been developed (e.g., our TB-Profiler and Malaria-Profiler software) that can rapidly analyse sequence data to provide such pathogen profiles, for easy interpretation by medical doctors and infection control specialists.

With the increasing use of sequencing technologies in hospitals and clinics, there is a need for Artificial Intelligence (AI) computational methods to analyse the resulting "big data" in real time, including to update the lists of barcoding genetic mutations and to identify whether the pathogen genome is related to those previously sequenced (i.e., whether it is being transmitted). We have previously applied AI methods to identify known and novel genetic mutations linked to drug resistance and transmission, and have created computing repositories (e.g., TB-ML) where the underlying software can be stored, allowing comparisons between statistical models and AI approaches. Our proposed project will integrate these AI-based tools into our profiling software to reveal drug resistance mutations and transmission patterns, and to generate informative reports for clinical and infection control decision making. Working within established collaborations involving the UK Health Security Agency and health ministries in Asia (Bangladesh, Philippines, Thailand, Vietnam), which are routinely using sequencing technologies to inform clinical diagnosis, we will attempt to implement the resulting AI systems software in the UK and in overseas settings endemic for infectious diseases. We will initially focus on three main infectious diseases of high global burden: tuberculosis, malaria and Klebsiella infections, with the potential to extend the work to other infections. All sequence data and software developed will be made publicly accessible, leading to their use by other biomedical researchers and healthcare stakeholders. Ultimately, the implementation of such AI-based tools will reduce the burden of infectious diseases, leading to healthier populations and associated economic benefits.

Funded Value:

£518,745

Funded Period:

Oct 23 - Mar 25

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/Y018842/1

Principal Investigator:

Taane Clark

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (100%)

Organisations

People	ORCID iD
Taane Clark (Principal Investigator)
Keertan Dheda (Co-Investigator)
Jody Phelan (Co-Investigator)
Susana Campino (Co-Investigator)	http://orcid.org/0000-0003-1403-6138
Colin Sutherland (Co-Investigator)

Publications

Asghar M (2024) Exploring the Antimicrobial Resistance Profile of Salmonella typhi and Its Clinical Burden. in Antibiotics (Basel, Switzerland)

Azra (2025) Antibiotic Susceptibility Patterns and Virulence Profiles of Classical and Hypervirulent Klebsiella pneumoniae Strains Isolated from Clinical Samples in Khyber Pakhtunkhwa, Pakistan. in Pathogens

Billows N (2024) Large-scale statistical analysis of Mycobacterium tuberculosis genome sequences identifies compensatory mutations associated with multi-drug resistance. in Scientific Reports

Elias R (2025) Dissemination of arr-2 and arr-3 is associated with class 1 integrons in Klebsiella pneumoniae clinical isolates from Portugal. in Medical Microbiology and Immunology

Higgins M (2024) New reference genomes to distinguish the sympatric malaria parasites, Plasmodium ovale curtisi and Plasmodium ovale wallikeri. in Scientific Reports

J. P. Thorpe (2024) Multi-platform whole genome sequencing for tuberculosis clinical and surveillance applications

Jody Phelan (2023) Rapid profiling of Plasmodium parasites from genome sequences to assist malaria control

Khan MF (2024) Exploring optimal drug targets through subtractive proteomics analysis and pangenomic insights for tailored drug design in tuberculosis. in Scientific Reports

Key Findings
Impact Summary
Research Tools and Methods
Collaboration
Engagement Activities


Description	We have developed methods to systematically download and analyse sequence data across infectious diseases (e.g., malaria, TB) and integrate these into AI models that predict key clinical and epidemiological insights, such as drug resistance, geographic origin, strain types, and transmission dynamics. These models, along with the key predictive mutations they identify, are currently undergoing validation through additional sequencing efforts, with three manuscripts in preparation. The malaria AI model is being incorporated into UKHSA workflows, while the TB model is being implemented within Thailand's health systems.
Exploitation Route	The AI models, along with the underlying software and data, are being made accessible to the research community. We plan to seek follow-up funding to expand implementations to additional countries and pathogens. As noted, the UKHSA and Thailand Ministry of Public Health are integrating these AI and informatics tools into their systems and can serve as key advocates for future initiatives.
Sectors	Digital/Communication/Information Technologies (including Software) Education Healthcare


Description	Generated sequence data have been processed through AI models to characterise pathogen genotypic profiles, including drug resistance and geographic origin. These insights have supported the UKHSA in cryptic malaria investigations and informed clinical decision-making and outbreak investigations within the Thailand Ministry of Public Health.
First Year Of Impact	2024
Sector	Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types	Economic Policy & public services


Title	Bioinformatic and AI tools
Description	We have established bioinformatic pipelines for all the pathogens considered in this project (e.g., Mycobacterium tuberculosis, Klebsiella, Plasmodium species), which process raw sequences into variants that are used in the machine learning models. To assist the application of the machine learning models, we have developed Docker containers, which act as functional software modules covering data inputs, processing and outputs. These allow different machine learning methods and models to be compared across datasets. We propose to share this framework, linked to a scientific publication in preparation.
Type Of Material	Improvements to research infrastructure
Year Produced	2024
Provided To Others?	No
Impact	The use of Docker containers gives us a framework for sharing computing code and outputs from the implementation of different machine learning methods.


Title	Infection genomics datasets
Description	We have been automatically downloading sequence data and metadata linked to the pathogens of interest in our project (e.g., Mycobacterium, Plasmodium, Klebsiella), and passing them through our bioinformatic pipelines. This is producing large datasets for each pathogen (e.g., M. tuberculosis n>100K), which we then use in our machine learning approaches.
Type Of Material	Improvements to research infrastructure
Year Produced	2023
Provided To Others?	No
Impact	This approach means that we have growing datasets to inform and validate our machine learning models, which in turn provide insights into mutations linked to drug resistance, strain types and geographical source. The raw data are mostly in the public domain, but by combining them and developing machine learning models, we are creating resources that will be useful to those who lack computational expertise but can use them to drive their research.


Description	Thailand Ministry of Public Health - Sequence data and informatics
Organisation	Ministry of Public Health
Country	Thailand
Sector	Public
PI Contribution	We have developed the bioinformatic pipelines and adapted our informatic tools (e.g., TB-Profiler) for use by the MOPH.
Collaborator Contribution	The MOPH are sharing TB sequence and AMR phenotypic data that is being used to update our machine learning models. They are also assessing the mutations being found by our machine learning models, for their biological and potential clinical relevance.
Impact	Outputs include: (1) >1,200 M. tuberculosis isolates with whole genome sequencing data to date; (2) TB-Profiler installed at the MOPH and generating outputs in the Thai language.
Start Year	2023


Description	UK Health Security Agency
Organisation	Public Health England
Country	United Kingdom
Sector	Public
PI Contribution	We are working with the UKHSA Malaria reference laboratory (UKHSA-MRL) to sequence isolate DNA sourced from clinical cases, to infer parasite species and drug resistance. These data are being used in our machine learning models.
Collaborator Contribution	The UKHSA-MRL are contributing Plasmodium DNA and linked anonymised clinical and parasitology data.
Impact	To date, we have accrued sequence data and drug resistance phenotypes from 300 Plasmodium parasites sourced from the UKHSA. Using these data in our machine learning models, we are detecting mutations that are linked to geographical source and drug resistance. Follow-up experimental validation of drug resistance mutations by the UKHSA-MRL is ongoing.
Start Year	2023


Description	Workshop on Genomics in Bangkok
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Postgraduate students
Results and Impact	60 researchers attended training on genomic and AI data analysis, which strengthens capacity in genomics-based investigations.
Year(s) Of Engagement Activity	2025
