Artificial intelligence methods applied to Genomic Data for improved health (AGENDA)

Lead Research Organisation: University of Southampton
Department Name: Human Development and Health

Abstract

Variation between genomes is the driving force behind inter-individual differences in health outcomes. Some patterns of genetic variation that cause disease or alter susceptibility to poor health have been very difficult to detect using limited genetic information on modest numbers of patients. But sequencing data captures the wealth of genetic variation. By 2024, at great cost to UK taxpayers, 500k patients and 100k newborn babies will have had their genomes sequenced through NHS Genomic Strategy initiatives.

Genomic data generation has outpaced the development of new methods to best realise their value. Older approaches, such as genome-wide association studies, detect punctate points of common variation across the genome. These methods are statistically underpowered for sequencing data, whose hallmark is rare, very rare, and unique genetic changes. As more and more people have their DNA sequenced, the complete set of observed genetic variants is expanding, but because most variants occur in few people, the data are increasingly sparse. New methods are essential to collapse the vastness of genomic data into more intuitive and useful data.

Lack of new methods means that currently, interpretation of genomic data is lagging behind data generation. Many of the new mutations found in a patient's genome are of uncertain clinical significance. This is causing huge delays in reviewing and reporting genomics test results. There is a national shortage of clinical bioinformaticians with expertise in genomic data interpretation and reporting - yet much of their valuable time is being spent on labour-intensive manual curation. Scalable, digital, knowledge-inference tools are essential to improve turnaround times so that patients can benefit from accurate diagnoses and targeted therapies.

Applied to massive cohorts, AI has the power to reveal cryptic, non-linear patterns between patient subgroups. Methods developed within this project will help dissolve the discipline-specific barriers between genomicists and computer scientists. This project develops algorithms to assimilate and reduce the dimensionality of immense yet sparse genomic data into intuitive gene-level 'GenePy' matrices. For each individual variant, information on its population frequency, its conservation across species, and its impact on protein function and interaction is retained. For each patient, these data are then collapsed across the set of variants observed in their sequence of an entire gene, providing a pathogenic burden score - for each person, for each gene. We have demonstrated that these scores accurately detect the majority of established diagnoses for thousands of Genomics England patients with recessive diseases. In addition, our hypothesis-free methods detect hundreds of causal variants missed by manual curation. These methods can be implemented by limited manpower, in a fraction of the time, for thousands of samples. This project will develop these tools to incorporate more complex genetic variants and to harness the value of long read sequencing data.
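The collapsing step described above can be illustrated with a minimal sketch. It is not the GenePy implementation: the record fields (`patient`, `gene`, `deleteriousness`, `allele_freq`, `allele_count`), the `min_freq` floor and the weighting itself are assumptions chosen only to show how per-variant annotations might be reduced to one score per person per gene, with rarer and more damaging variants contributing more.

```python
import math
from collections import defaultdict


def gene_burden_scores(variants, min_freq=1e-5):
    """Collapse per-variant annotations into one score per (patient, gene).

    Hypothetical scheme for illustration only: each variant contributes
    deleteriousness * -log10(population frequency) * number of alleles carried.
    """
    scores = defaultdict(float)
    for v in variants:
        # Floor the frequency so unobserved (frequency 0) variants stay finite.
        freq = max(v["allele_freq"], min_freq)
        rarity = -math.log10(freq)
        scores[(v["patient"], v["gene"])] += (
            v["deleteriousness"] * rarity * v["allele_count"]
        )
    return dict(scores)


# Toy input: two variants in one gene for P1, one novel variant for P2.
example_variants = [
    {"patient": "P1", "gene": "GENE_A", "deleteriousness": 0.9,
     "allele_freq": 0.001, "allele_count": 2},
    {"patient": "P1", "gene": "GENE_A", "deleteriousness": 0.5,
     "allele_freq": 0.01, "allele_count": 1},
    {"patient": "P2", "gene": "GENE_B", "deleteriousness": 0.2,
     "allele_freq": 0.0, "allele_count": 1},
]
burden = gene_burden_scores(example_variants)
```

The result is a sparse patient-by-gene mapping that could be reshaped into a matrix for downstream modelling.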

As GenePy scores scale variants to gene-level, they are intuitive input data for various modelling approaches. By mapping GenePy scores onto gene-interaction networks, topology analyses can reveal biological pathway mechanisms, therapeutic targets and identify novel biomarkers for the development of future clinical tests.
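As a rough illustration of what mapping gene-level scores onto an interaction network can involve, the sketch below blends each gene's score with the mean score of its network neighbours. This one-step neighbour averaging is a simple stand-in chosen for brevity, not the topology analysis used in the project; the function name, the `alpha` blending weight and the edge-list format are all assumptions.

```python
def smooth_over_network(scores, edges, alpha=0.5):
    """Blend each gene's score with the mean score of its neighbours.

    scores: dict mapping gene -> score for one patient (missing genes count as 0).
    edges:  iterable of (gene_a, gene_b) pairs, treated as undirected.
    alpha:  weight given to the neighbour mean (hypothetical parameter).
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    smoothed = {}
    for gene in set(scores) | set(neighbours):
        own = scores.get(gene, 0.0)
        nbrs = neighbours.get(gene)
        if not nbrs:
            # Genes absent from the network keep their own score unchanged.
            smoothed[gene] = own
            continue
        nbr_mean = sum(scores.get(n, 0.0) for n in nbrs) / len(nbrs)
        smoothed[gene] = (1 - alpha) * own + alpha * nbr_mean
    return smoothed


# Toy chain A - B - C plus an isolated gene D.
toy_scores = {"A": 4.0, "B": 0.0, "C": 2.0, "D": 3.0}
toy_edges = [("A", "B"), ("B", "C")]
smoothed = smooth_over_network(toy_scores, toy_edges)
```

In this toy example gene B carries no variants of its own but sits between two burdened genes, so it receives a non-zero smoothed score, which is the kind of pathway-level signal a network view is meant to surface.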

Using the existing wealth of experimentally-derived functional evidence of impact for thousands of point mutations in the human genome, AI can help us learn to interpret the most likely clinical impact of the billions of new variants we are discovering. This project uses AI to train protein modelling software to categorise genetic variants as benign, as likely to impair protein function, or as indeterminate and requiring additional modelling. We will define the steps required to build an end-to-end automated pipeline that can provide functional support to interpret data for personalised medicine.

Publications

 
Description Our first publication (partly funded by the EPSRC AGENDA award) has been published in Genetics in Medicine. This paper demonstrated the value of the GenePy algorithm in improving processing efficiency and diagnostic detection rates for the genetic basis of disease across a subset of genes in patients within the Genomics England research environment. The paper represents a baseline from which the AGENDA project will build algorithmic performance and speed.

We have had projects approved within both the Genomics England and UK Biobank research clouds for testing algorithmic improvements using large-scale genomic data within these environments.

We have demonstrated first successes in mapping GenePy matrices within topological data analysis frameworks and are continuing to optimise the mapping.

We have generated a basic pipeline for inputting specific variants into a chemical modelling pipeline. Outputs are being tested and key data points interrogated to better understand how we might digitally infer model performance.
Exploitation Route Other groups are already communicating interest in applying the model to alternative clinical presentations.
Sectors Digital/Communication/Information Technologies (including Software)

Healthcare

Pharmaceuticals and Medical Biotechnology

 
Description Genomics Artificial Intelligence Network 
Organisation NHS England
Country United Kingdom 
Sector Public 
PI Contribution As part of this collaboration, we will take the GenePy algorithm, which has benefited from development during the AGENDA project, and implement it at scale to prove the concept of diagnostic benefit in real-world data. We will do this at scale to demonstrate alleviation of the manual curation burden currently experienced in the handling of large-scale genomics data in the NHS.
Collaborator Contribution The Genomics AI Network of Excellence will act at the intersection of NHS genomic medicine and Artificial Intelligence (AI) development and translational research to support the implementation of AI technology solutions in healthcare. The Network will build a national community in genomics and AI, create frameworks to support AI deployment, and deliver exemplar accelerator programmes to develop the evidence required to adopt AI for the benefit of NHS patients, including improved and accelerated diagnosis and personalised medicine. The Network will work in close partnership with existing major contributors in the AI and genomic medicine environment to collectively create the conditions necessary for the adoption of innovation. Key partners will include the NHS Genomic Medicine Service and its component entities (including GMSAs, GLHs and the new Genomic Networks of Excellence), Genomics England, the NHS AI Laboratory within the NHS England Transformation Directorate, the NIHR Biomedical Research Centres, a range of academic partners, industry partners, regulators, bioethics and AI ethics groups, and patient voices.
Impact Network formally initiates 1 April 2024
Start Year 2023
 
Title VDR 
Description Software to efficiently reweight GaMD molecular dynamics simulations. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact None as of yet. The software is only now about to be released. 
URL https://github.com/sct1g15/GaMD_Variable_Density_Reweighting