Infection-AID: AI assisted genomic profiling to inform the Diagnosis, personalised treatment and control of infections

Lead Research Organisation: London School of Hygiene & Tropical Medicine
Department Name: Infectious and Tropical Diseases

Abstract

Characterising the genetic code ("genome") of an organism can reveal its ability to survive and to tolerate drugs and treatments, as well as its likely geographical source. Researchers can investigate the genome of an organism, and its important mutations (genome "spelling mistakes"), by applying sequencing technologies to its DNA. Cost-effective and rapid sequencing technologies are now being rolled out in hospitals and clinics to identify important mutations, and thereby to prevent disease, diagnose patients, and personalise their treatment. Genome sequencing has become an important diagnostic tool in infectious disease settings, where it is used to identify the microorganisms causing infections ("pathogens") and their resistance to drugs, and to track outbreaks. Such knowledge is revolutionising clinical decision making, public health surveillance and infection control. This was demonstrated during the COVID-19 pandemic, where rapid sequencing of the causal SARS-CoV-2 viral genomes assisted the detection of clinically important mutations (e.g., omicron variants) and informed on their geographical spread ("transmission patterns"). To assist the analysis of the large datasets arising from the sequencing of pathogens, it is important to identify key mutations linked to (severe) patient outcomes, drug resistance, likely geographical source, and other important "barcoding" information that can provide a "profile" of the pathogen underlying any infection. Computer software tools have been developed (e.g., our TB-Profiler and Malaria-Profiler software) that can rapidly analyse sequence data to provide such pathogen profiles, for easy interpretation by medical doctors and infection control specialists.

With the increasing use of sequencing technologies in hospitals and clinics, there is a need for Artificial Intelligence (AI) computational methods to analyse the resulting "big data" in real time, including to update the lists of barcoding genetic mutations and to identify whether the pathogen genome is related to those previously sequenced (i.e., whether it is being transmitted). We have previously applied AI methods to identify known and novel genetic mutations linked to drug resistance and transmission, and have created computing repositories (e.g., TB-ML) where the underlying software can be stored, allowing comparisons between statistical models and AI approaches. Our proposed project will integrate these AI-based tools into our profiling software to reveal drug resistance mutations and transmission patterns, and to generate informative reports for clinical and infection control decision making. Working within established collaborations involving the UK Health Security Agency and health ministries in Asia (Bangladesh, Philippines, Thailand, Vietnam), which are routinely using sequencing technologies to inform clinical diagnosis, we will attempt to implement the resulting AI systems software in the UK and in overseas settings endemic for infectious diseases. We will initially focus on three infectious diseases of high global burden (tuberculosis, malaria and Klebsiella infections), with the potential to extend the work to other infections. All sequence data and software developed will be made publicly accessible, leading to their use by other biomedical researchers and healthcare stakeholders. Ultimately, the implementation of such AI-based tools will reduce the burden of infectious diseases, leading to healthier populations and associated economic benefits.
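
As an illustration of what "related to those previously sequenced" can mean in practice, the sketch below flags pairs of isolates whose genomes differ at only a few positions. It is a minimal example under stated assumptions, not the project's AI method: it uses a simple distance threshold in place of a trained model, the isolate names and variant calls are invented toy data, and the function names and the cut-off of 5 differences are hypothetical choices made only for this example.

```python
from itertools import combinations


def snp_distance(calls_a, calls_b):
    """Count genome positions where two isolates carry different calls.

    Each isolate is a dict mapping a genome position to its non-reference
    base. A position missing from a dict is treated as reference.
    """
    positions = calls_a.keys() | calls_b.keys()
    return sum(1 for pos in positions if calls_a.get(pos) != calls_b.get(pos))


def linked_pairs(isolates, max_distance):
    """Return (id_a, id_b, distance) for every pair within max_distance."""
    pairs = []
    for id_a, id_b in combinations(sorted(isolates), 2):
        distance = snp_distance(isolates[id_a], isolates[id_b])
        if distance <= max_distance:
            pairs.append((id_a, id_b, distance))
    return pairs


# Invented toy data: position -> non-reference base (assumption, not real calls).
isolates = {
    "isolate_1": {1000: "A", 2500: "T", 40000: "G"},
    "isolate_2": {1000: "A", 2500: "T", 40000: "G", 51000: "C"},
    "isolate_3": {700: "C", 9000: "G", 12000: "T", 30000: "A"},
}

# The threshold of 5 is an illustrative placeholder, not a recommended cut-off.
for id_a, id_b, distance in linked_pairs(isolates, max_distance=5):
    print(f"{id_a} and {id_b} differ at {distance} position(s)")
```

In this toy example only isolate_1 and isolate_2 are flagged as closely related. An appropriate threshold depends on the pathogen and setting, and would need to be chosen and validated separately.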
 
Title Bioinformatic and AI tools 
Description We have established bioinformatic pipelines for all the pathogens considered in this project (e.g., Mycobacterium tuberculosis, Klebsiella, Plasmodium species), which process raw sequences into variants that are used in the machine learning models. To assist the application of the machine learning models, we have developed Docker containers: functional software modules that cover data inputs, processing and outputs. These allow different machine learning methods and models to be compared across datasets. We propose to share this framework, linked to a scientific publication in preparation. 
Type Of Material Improvements to research infrastructure 
Year Produced 2024 
Provided To Others? No  
Impact The use of Docker containers means that we have a framework for sharing computing code and outputs from the implementation of different machine learning methods. 
 
Title Infection genomics datasets 
Description We have been automatically downloading sequence data and metadata linked to the pathogens of interest in our project (e.g., Mycobacterium, Plasmodium, Klebsiella), and passing them through our bioinformatic pipelines. This is producing large datasets for each pathogen (e.g., M. tuberculosis n>100K), which we then use in our machine learning approaches. 
Type Of Material Improvements to research infrastructure 
Year Produced 2023 
Provided To Others? No  
Impact This approach means that we have growing datasets to inform and validate our machine learning models, which in turn provide insights into mutations linked to drug resistance, strain-types and geographical source. The raw data are mostly in the public domain, but by combining them and developing machine learning models, we are creating resources that those without computational expertise can use to drive their research. 
 
Description Thailand Ministry of Public Health - Sequence data and informatics 
Organisation Ministry of Public Health
Country Thailand 
Sector Public 
PI Contribution We have developed the bioinformatic pipelines and adapted our informatic tools (e.g., TB-Profiler) for use by the MOPH.
Collaborator Contribution The MOPH are sharing TB sequence and AMR phenotypic data that are being used to update our machine learning models. They are also assessing the mutations found by our machine learning models for their biological and potential clinical relevance.
Impact Outputs include: (1) >1,200 M. tuberculosis isolates with whole genome sequencing data to date; (2) TB-Profiler installed at the MOPH, and generating outputs in the Thai language.
Start Year 2023
 
Description UK Health Security Agency 
Organisation Public Health England
Country United Kingdom 
Sector Public 
PI Contribution We are working with the UKHSA Malaria reference laboratory (UKHSA-MRL) to sequence isolate DNA sourced from clinical cases, to infer parasite species and drug resistance. These data are being used in our machine learning models.
Collaborator Contribution The UKHSA-MRL are contributing Plasmodium DNA and linked anonymised clinical and parasitology data.
Impact To date, we have accrued sequence data and drug resistance phenotypes from 300 Plasmodium parasites sourced from the UKHSA. When these data are used in our machine learning models, we detect mutations that are linked to geographical source and drug resistance. Follow-up experimental validation of drug resistance mutations by the UKHSA-MRL is ongoing.
Start Year 2023