Artificial intelligence methods applied to Genomic Data for improved health (AGENDA)

Lead Research Organisation: University of Southampton

Department Name: Human Development and Health

Abstract

Variation between genomes is the driving force behind inter-individual differences in health outcomes. Some patterns of genetic variation that cause disease or alter susceptibility to poor health have been very difficult to detect using limited genetic information on modest numbers of patients. But sequencing data captures the wealth of genetic variation. By 2024, at great cost to UK taxpayers, 500k patients and 100k newborn babies will have had their genomes sequenced through NHS Genomic Strategy initiatives.

Genomic data generation has outpaced the development of new methods to best realise their value. Older approaches, such as genome-wide association studies, detect punctate points of common variation across the genome. These methods are statistically underpowered for sequencing data, whose hallmark is rare, very rare, and unique genetic changes. As more and more people have their DNA sequenced, the complete set of observed genetic variants is expanding, but because most variants occur in few people, the data are increasingly sparse. New methods are essential to collapse the vastness of genomic data into more intuitive and useful data.

Lack of new methods means that currently, interpretation of genomic data is lagging behind data generation. Many of the new mutations found in a patient's genome are of uncertain clinical significance. This is causing huge delays in reviewing and reporting genomics test results. There is a national shortage of clinical bioinformaticians with expertise in genomic data interpretation and reporting - yet much of their valuable time is being spent on labour-intensive manual curation. Scalable, digital, knowledge-inference tools are essential to improve turnaround times so that patients can benefit from accurate diagnoses and targeted therapies.

Applied to massive cohorts, AI has the power to reveal cryptic, non-linear patterns between patient subgroups. Methods developed within this project will help dissolve the discipline-specific barriers between genomicists and computer scientists. This project develops algorithms to assimilate and reduce the dimensionality of immense yet sparse genomic data into intuitive gene-level 'GenePy' matrices. For each individual variant, information on its population frequency, its conservation across species, and its impact on protein function and interaction is retained. For each patient, these data are then collapsed for the variant set observed across their sequence of an entire gene, providing a pathogenic burden score - for each person, for each gene. We have demonstrated that these scores accurately detect the majority of established diagnoses for thousands of Genomics England patients with recessive diseases. In addition, our hypothesis-free methods detect hundreds of causal variants missed by manual curation. These methods can be implemented by limited manpower, in a fraction of the time, for thousands of samples. This project will develop these tools to incorporate more complex genetic variants and to harness the value of long read sequencing data.
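
The collapse from variant-level annotations to one score per person per gene can be illustrated with a minimal sketch. This is not the published GenePy formula or code: the additive weighting (deleteriousness multiplied by a log-scaled rarity term and the number of alternate alleles carried), the frequency floor, and all records and names below are assumptions chosen only to show the shape of the computation.

```python
import math
from collections import defaultdict

# Hypothetical per-variant records (illustrative values only):
# (sample, gene, deleteriousness in [0, 1], population allele frequency,
#  number of alternate alleles carried: 1 = heterozygous, 2 = homozygous)
variants = [
    ("P1", "NOD2", 0.9, 0.001, 2),
    ("P1", "NOD2", 0.2, 0.30, 1),
    ("P2", "NOD2", 0.9, 0.001, 1),
    ("P2", "IL10", 0.7, 0.0001, 1),
]


def gene_burden(records, freq_floor=1e-5):
    """Collapse variant records into one additive score per (sample, gene).

    Assumed weighting: rarer and more deleterious variants contribute more.
    freq_floor stops unobserved (frequency 0) variants producing log10(0).
    """
    scores = defaultdict(float)
    for sample, gene, deleteriousness, freq, alt_count in records:
        rarity = -math.log10(max(freq, freq_floor))
        scores[(sample, gene)] += deleteriousness * rarity * alt_count
    return dict(scores)


burden = gene_burden(variants)
for (sample, gene), score in sorted(burden.items()):
    print(f"{sample}\t{gene}\t{score:.3f}")
```

In this toy data the homozygous rare variant dominates the score for P1 in NOD2, while the common variant adds very little, which is the behaviour a burden score of this kind is intended to have.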

As GenePy scores scale variants to gene-level, they are intuitive input data for various modelling approaches. By mapping GenePy scores onto gene-interaction networks, topology analyses can reveal biological pathway mechanisms, therapeutic targets and identify novel biomarkers for the development of future clinical tests.

Using the existing wealth of experimentally-derived functional evidence of impact for thousands of point mutations in the human genome, AI can help us learn to interpret the most likely clinical impact of the billions of new variants we are discovering. This project uses AI to train protein modelling software to categorise genetic variants as benign, as likely to impair protein function, or as indeterminate and requiring additional modelling. We will define the steps required to have an end-to-end automated pipeline that can provide functional support to interpret data for personalised medicine.

Funded Value:

£624,228

Funded Period:

Oct 23 - Mar 25

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/Y01720X/1

Principal Investigator:

Sarah Ennis

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (100%)

Organisations

People	ORCID iD
Sarah Ennis (Principal Investigator)	http://orcid.org/0000-0003-2648-0869
Paul Skipp (Co-Investigator)
Jonathan Essex (Co-Investigator)
Jagmohan Chauhan (Co-Investigator)	http://orcid.org/0000-0003-2080-3276
James Ashton (Co-Investigator)
Andrew Shapanis (Researcher)
Eleanor Seaby (Researcher Co-Investigator)

Publications

Ball AT (2024) Identification and Development of Cyclic Peptide Inhibitors of Hypoxia Inducible Factors 1 and 2 That Disrupt Hypoxia-Response Signaling in Cancer Cells in Journal of the American Chemical Society

Ennis S (2024) Tackling the role of rare functional variation in inflammatory bowel disease through application of GenePy2 as a potential DNA biomarker

Polese B (2024) Circulating inflammatory cytokines predict severity disease in hospitalized COVID-19 patients: A prospective multicenter study of the European DRAGON consortium in Journal of Infection and Public Health

Seaby E (2024) A gene pathogenicity tool "GenePy" identifies missed biallelic diagnoses in the 100,000 Genomes Project in Genetics in Medicine

Shapanis A (2023) Topological data analysis identifies molecular phenotypes of idiopathic pulmonary fibrosis in Thorax

Willsey H (2024) Modelling human genetic disorders in Xenopus tropicalis in Disease Models & Mechanisms

Key Findings
Further Funding
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Description	Our first publication (partly funded by the EPSRC AGENDA award) has been published in Genetics in Medicine. This paper demonstrated the value of the GenePy algorithm in improving processing efficiency and diagnostic detection rates for the genetic basis of disease across a subset of genes in patients within the Genomics England research environment. The paper represents a baseline from which the AGENDA project will build algorithmic performance and speed. Projects within the Genomics England and UK Biobank research clouds have been approved for testing algorithmic improvements on large-scale genomic data within these environments. We have demonstrated first successes in mapping GenePy matrices within topological data analysis frameworks and are continuing to optimise the mapping. We have generated a basic pipeline for inputting specific variants into a chemical modelling pipeline. Outputs are being tested and key data points interrogated to better understand how we might digitally infer model performance.
Exploitation Route	Other groups have already expressed interest in applying the model to alternative clinical presentations.
Sectors	Digital/Communication/Information Technologies (including Software) Healthcare Pharmaceuticals and Medical Biotechnology


Description	NHS Genomic Artificial Intelligence Network of Excellence - Work Package 3D - Application of GenePy
Amount	£1,570,000 (GBP)
Organisation	NHS England
Sector	Public
Country	United Kingdom
Start	03/2024
End	03/2026


Title	Bioinformatic workflow for processing genomic sequencing data into GenePy matrix.
Description	Modular pipeline that has been packaged with its dependencies into a Nextflow workflow for deployment. It can be applied to genomic sequencing data that has undergone alignment and variant calling to VCF files. VCF files are input and a GenePy matrix is output.
Type Of Material	Technology assay or reagent
Year Produced	2024
Provided To Others?	Yes
Impact	This is an updated version of the GenePy workflow. It uses a Nextflow wrapper for the modularised pipeline. The workflow accommodates genomic VCF files that have already been split into several chunks per chromosome (e.g. UK Biobank data or Genomics England 100K data). The input files are gVCFs from the data source. There are four modules: module 1 annotates all variants with their CADD score; module 2 annotates allele frequency using VEP; module 3 conducts depth- and quality-based filtering and CADD-based stratification, and also rejoins the variant data for any gene whose CCDS region was split by the chunking process; module 4 generates the GenePy scores for all remaining variant data for all genes for all individuals.
URL	https://github.com/UoS-HGIG/GenePy-2/tree/main/GenePy2_UKBiobank/Nextflow_Genepy2_UKBB_V3
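
The filtering, chunk rejoining and matrix generation steps described for this workflow can be sketched in a few lines. This is an illustration only and is not taken from the repository above: the record fields, the depth and quality thresholds, and the per-variant `weight` (standing in for an already computed variant contribution) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical annotated calls from two chunks of the same chromosome.
# GENE_A straddles the chunk boundary, so its variants appear in both chunks.
chunk_a = [
    {"sample": "S1", "gene": "GENE_A", "depth": 30, "qual": 60, "weight": 2.5},
    {"sample": "S2", "gene": "GENE_A", "depth": 5, "qual": 60, "weight": 1.0},
]
chunk_b = [
    {"sample": "S1", "gene": "GENE_A", "depth": 25, "qual": 45, "weight": 0.5},
    {"sample": "S2", "gene": "GENE_B", "depth": 40, "qual": 10, "weight": 3.0},
    {"sample": "S2", "gene": "GENE_B", "depth": 40, "qual": 50, "weight": 1.5},
]

MIN_DEPTH = 8  # assumed threshold, for illustration only
MIN_QUAL = 20  # assumed threshold, for illustration only


def passes_filter(call):
    """Depth- and quality-based filter (hypothetical cut-offs)."""
    return call["depth"] >= MIN_DEPTH and call["qual"] >= MIN_QUAL


def build_matrix(chunks):
    """Pool calls across chunks, then sum surviving weights per (sample, gene)."""
    totals = defaultdict(float)
    for chunk in chunks:
        for call in chunk:
            if passes_filter(call):
                totals[(call["sample"], call["gene"])] += call["weight"]
    samples = sorted({sample for sample, _ in totals})
    genes = sorted({gene for _, gene in totals})
    rows = [[totals.get((s, g), 0.0) for g in genes] for s in samples]
    return samples, genes, rows


samples, genes, matrix = build_matrix([chunk_a, chunk_b])
print("\t".join(["sample"] + genes))
for sample, row in zip(samples, matrix):
    print("\t".join([sample] + [f"{value:.1f}" for value in row]))
```

Keying the totals by gene rather than by chunk is what rejoins a gene whose coding region was split by chunking: both GENE_A calls for S1 land in the same matrix cell regardless of which chunk they came from.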


Title	Bioinformatic pipeline for processing long-read sequencing variant data into phased GenePy matrices based on the haplotypes of genes.
Description	GenePy-LRSEQ is a specialized pipeline designed to process phased variant data from long-read sequencing. Unlike the original GenePy pipeline developed for short-read sequencing, GenePy-LRSEQ specifically handles genomic VCF files containing phased variant information, transforming them into separate GenePy matrices for each haplotype of a gene.
Type Of Material	Computer model/algorithm
Year Produced	2024
Provided To Others?	Yes
Impact	The pipeline consists of four sequential modules: First, Quality and Region Filtration implements quality-based filtration, retains only autosomal variants within the consensus coding sequence (CCDS) regions (±25bp), and selectively preserves variants with phase-set information. Second, Variant Annotation annotates filtered variants with CADD scores to assess pathogenicity and incorporates allele frequency data from general population databases. Third, Haplotype Reconstruction converts genotypes to haplotypes using phase-set information, effectively splitting the genotypes of variants within each gene into two distinct haplotypes. Finally, Score Generation calculates GenePy scores for each gene haplotype using an additive model, producing two comprehensive GenePy matrices that represent the pathogenic burden of each haplotype. This approach enables more precise characterization of allele-specific effects and provides enhanced resolution for evaluating the pathogenic contribution of each haplotype.
URL	https://github.com/UoS-HGIG/GenePy_LRSEQ
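
The haplotype reconstruction and additive scoring steps can be illustrated with a short sketch. It is not the GenePy-LRSEQ code: it assumes VCF-style phased genotype strings (alleles separated by `|`), biallelic sites only, a single phase set per gene, and a hypothetical precomputed per-variant `weight`. Variants without phase-set information are dropped, mirroring the filtration step described above.

```python
from collections import defaultdict

# Hypothetical calls for one sample. "gt" is a VCF-style genotype string,
# "ps" is the phase-set identifier (None when the call is unphased).
calls = [
    {"gene": "GENE_A", "gt": "0|1", "ps": 100, "weight": 2.0},
    {"gene": "GENE_A", "gt": "1|0", "ps": 100, "weight": 1.5},
    {"gene": "GENE_A", "gt": "1|1", "ps": 100, "weight": 0.5},
    {"gene": "GENE_A", "gt": "0/1", "ps": None, "weight": 9.0},  # unphased
]


def haplotype_scores(records):
    """Split phased genotypes into two haplotypes and sum weights on each.

    Simplifying assumptions: biallelic sites (alternate allele coded "1")
    and one phase set per gene, so the left and right alleles of every
    call refer to the same two haplotypes.
    """
    hap1 = defaultdict(float)
    hap2 = defaultdict(float)
    for rec in records:
        if rec["ps"] is None or "|" not in rec["gt"]:
            continue  # keep only variants with phase information
        left, right = rec["gt"].split("|")
        if left == "1":
            hap1[rec["gene"]] += rec["weight"]
        if right == "1":
            hap2[rec["gene"]] += rec["weight"]
    return dict(hap1), dict(hap2)


hap1, hap2 = haplotype_scores(calls)
print("haplotype 1:", hap1)
print("haplotype 2:", hap2)
```

A real cohort would need extra handling where a gene spans more than one phase set, because the left/right ordering is only consistent within a phase set.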


Description	Collaboration with large US based IBD Genetics Consortium
Organisation	Icahn School of Medicine at Mount Sinai
Country	United States
Sector	Academic/University
PI Contribution	Collaboration with Judy Cho (PI for the NIDDK IBD Genetics Consortium) and Louis Cohen at Icahn School of Medicine. Collaborating on exchange of clinical, genomic and transcriptomic data, as well as implementation and validation of IBD-refined large language models for creating FAIR data from unstructured clinical records for the purposes of downstream integration with GenePy matrices.
Collaborator Contribution	They are providing validation data for their cohort for the LLM to extract structured FAIR data from redacted unstructured reports. We will generate GenePy scores on their exome data and assess the integration of structured clinical data with GenePy scores for both cohorts.
Impact	None yet
Start Year	2024


Description	Genomics Artificial Intelligence Network
Organisation	NHS England
Country	United Kingdom
Sector	Public
PI Contribution	As part of this collaboration, we will take the GenePy algorithm that has benefited from development during the AGENDA project and implement it at scale to prove the concept of diagnostic benefit in real-world data. We will do this at scale to demonstrate alleviation of the manual curation burden currently experienced in the handling of large-scale genomics data in the NHS.
Collaborator Contribution	The Genomics AI Network of Excellence will act at the intersection of NHS genomic medicine and Artificial Intelligence (AI) development and translational research to support the implementation of AI technology solutions in healthcare. The Network will build a national community in genomics and AI, create frameworks to support AI deployment, and deliver exemplar accelerator programmes to develop the evidence required to adopt AI for the benefit of NHS patients, including improved and accelerated diagnosis and personalised medicine. The Network will work in close partnership with existing major contributors in the AI and genomic medicine environment to collectively create the conditions necessary for the adoption of innovation. Key partners will include the NHS Genomic Medicine Service and its component entities (including GMSAs, GLHs and the new Genomic Networks of Excellence), Genomics England, the NHS AI Laboratory within the NHS England Transformation Directorate, the NIHR Biomedical Research Centres, a range of academic partners, industry partners, regulators, bioethics and AI ethics groups, and patient voices.
Impact	Network formally initiates 1 April 2024
Start Year	2023


Description	IBD Human Phenotype Ontology and Phenopackets
Organisation	Charité - University of Medicine Berlin
Country	Germany
Sector	Academic/University
PI Contribution	Collaboration with the Robinson Lab to extend standardised HPO terms to IBD for the purposes of refining phenotype integration with GenePy scores.
Collaborator Contribution	The Robinson Lab, led by Peter Robinson, was integral in the generation of HPO terms and the Exomiser software in use in medical genomics. A major focus of the lab is to create ontologies for representing and analyzing medical data. We are planning a collaboration to workshop terms specific to IBD and also to assess the value of Phenopackets for processing longitudinal common disease data for downstream integration with GenePy.
Impact	None yet
Start Year	2025


Description	LongRead Data Collaboration with Dept Human Genetics Amsterdam UMC
Organisation	University Medical Center Utrecht (UMC)
Department	Neurology UMC
Country	Netherlands
Sector	Academic/University
PI Contribution	This group hosts one of the very few large datasets of patients with Long Read PacBio data. Specifically, they host data on a cohort of 500 centenarians with and without Alzheimer's Disease (AD) diagnoses.
Collaborator Contribution	We plan to apply the GenePy algorithm, adapted for long-read sequencing as part of the AGENDA project, to this cohort.
Impact	NA to date. Contracts and agreements still underway.
Start Year	2024


Title	VDR
Description	Software to efficiently reweight GaMD molecular dynamics simulations.
Type Of Technology	Software
Year Produced	2024
Open Source License?	Yes
Impact	None as of yet. The software is only now about to be released.
URL	https://github.com/sct1g15/GaMD_Variable_Density_Reweighting


Description	KeyNote Panel Discussion
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The Festival of Genomics & Biodata is the largest genomics event in the UK, with 7,000 registrants, 360 speakers and 150 exhibitors. S Ennis (PI for this award) led a keynote panel discussion on the use of artificial intelligence in genomics and biodata.
Year(s) Of Engagement Activity	2025
URL	https://festivalofgenomics.com/?


Description	Public engagement with use of Genomic data for research
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	Wessex Public Panel event - discussion and engagement with public panel regarding the secure use of genomic data for medical research purposes. Held in Bournemouth (13 July 2024) and Southampton (Novotel 20 July 2024).
Year(s) Of Engagement Activity	2024


Description	Talk at the Festival of Genomics & Biodata
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Delivered a talk entitled: The devil's in the data: Tools for modelling genomic data and models for extracting clinical data. Subject matter included outcomes and application of work package 1 in the AGENDA grant.
Year(s) Of Engagement Activity	2025
URL	https://festivalofgenomics.com/?
