Identifying New Disease Genes & Mechanisms for Musculoskeletal Disorders in 100K Genomes Project using Bioinformatics, Phenotyping & Machine Learning

Lead Research Organisation: University of Oxford
Department Name: Wellcome Trust Centre for Human Genetics

Abstract

Genetic disorders which affect the development of the skeleton or muscles are collectively common, even if individually rare. Providing a genetic diagnosis for the patients and their families is important for ending what is often a lengthy diagnostic odyssey. For their clinicians, it may inform provision of the correct treatment. Understanding the genetic basis of these rare musculoskeletal (MSK) disorders may also provide insights into common MSK disorders, which are a major cause of disability and impairment of quality of life for millions of people in the UK.
In the past, genetic diagnosis of rare MSK diseases has relied on sequencing panels of known genes to identify the causative gene, but the diagnostic yield of such panel-based sequencing is low because many disease genes have not yet been identified.
With technological improvements and cost reductions, sequencing of patients' entire genomes (the full complement of their DNA) has become a possibility. Furthermore, many types of genetic variants can be interrogated from genome sequence data: not just those involving single base pairs, but also more complex duplications, deletions or transpositions of segments of the genome, as well as variants in non-coding regions, such as the introns within genes and the regions between genes. These non-coding regions have increasingly been recognised to play important roles in regulating gene expression, but we have considerably less understanding of their clinical significance.
Interrogation of patients' genomes to identify the disease-causing variants therefore still presents many challenges. Recognising the potential of this genome sequencing approach, the UK launched a national programme, the 100,000 Genomes Project (100KGP), to identify pathogenic variants in 100,000 patients, with the aim of improving diagnoses for these patients in ways that might also inform their personalised treatment. The programme is run by Genomics England; sequencing of these patients is now complete, and it is estimated that diagnoses have been found for a quarter of the rare disease patients so far. Solving the rest of these cases will require intense effort on the part of the research community to investigate the different variant types described above.
This proposal aims to contribute to that effort focusing on patients with musculoskeletal and related developmental conditions.
We will use both existing Genomics England (GeL) algorithms and our own bioinformatics tools to analyse the genome sequence data to ensure we have investigated all possible variants, and then employ a variety of genetic strategies to assess whether the candidate variants and genes are likely to be disease-causing. This assessment invariably requires clinical or x-ray data beyond that already collected by the GeL programme. Such data are often available in medical records, so we have identified routes to retrieving them that involve clinicians and patients themselves. We have established a clinical multi-disciplinary team to enable discussion of cases, and will employ expertise in clinical radiology assessments to ensure systematic analysis of x-ray data. We will also ask patients to provide us with self-reported data, as we know from other research studies that patients are very good at remembering which bones they have broken and when. Finally, we will see if machine learning or 'artificial intelligence' can help us identify patterns in these vast and complex datasets that could not be identified by manual inspection.
We anticipate that these efforts will help us provide diagnoses for many more patients with MSK disorders in the 100KGP, and that the approaches can then be adopted for other diseases in the 100KGP, extending genetic diagnoses to further patients.

Technical Summary

Whole genome sequencing (WGS) has the potential to revolutionise diagnosis of rare diseases. Recognising this, the UK has established a national programme to sequence 100,000 genomes (100KGP). To date, analysis in 100KGP has primarily focused on known disease genes for a given condition and on particular variant types, predominantly single nucleotide variants (SNVs), and a diagnostic yield of ~25% has been achieved. A more research-focused effort is now required to investigate novel disease genes and variant types that are largely unexplored in the 100KGP to date, such as copy number variants (CNVs), other structural variants (SVs) and non-coding variants. To address this requirement, we will focus on patients in the 100KGP with musculoskeletal (MSK) disorders, a clinically and genetically heterogeneous group of conditions accounting for >1,000 cases in 100KGP. In preliminary studies, we have already identified a non-coding variant that contributes 1% to the diagnostic yield of osteogenesis imperfecta, and 2 complex SVs in known genes.
We will use novel bioinformatics techniques to comprehensively analyse the WGS data, integrating the various variant types to identify putative novel disease genes. We will combine this with deep phenotyping, because core MSK clinical data have not been collected by 100KGP and are required to assess candidate genes. We will also evaluate whether machine learning can be used to identify clusters of genotypes or phenotypes in these complex, high-dimensional datasets, enabling novel genotype/phenotype correlations to be observed.
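The idea of grouping patients by phenotype can be illustrated with a small sketch. This is not the project's method: the patient identifiers, phenotype terms, similarity measure (Jaccard overlap) and threshold are all assumptions chosen for illustration, and a real analysis would use far richer data and more robust clustering.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of phenotype terms."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_patients(phenotypes, threshold=0.5):
    """Single-linkage grouping: link any two patients whose phenotype
    term sets have Jaccard similarity >= threshold (hypothetical cut-off)."""
    parent = {p: p for p in phenotypes}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in combinations(phenotypes, 2):
        if jaccard(phenotypes[p], phenotypes[q]) >= threshold:
            parent[find(p)] = find(q)

    clusters = {}
    for p in phenotypes:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical patients and phenotype terms, for illustration only.
patients = {
    "P1": {"recurrent fractures", "blue sclerae", "short stature"},
    "P2": {"recurrent fractures", "blue sclerae"},
    "P3": {"muscle weakness", "joint contractures"},
    "P4": {"muscle weakness", "joint contractures", "scoliosis"},
}
clusters = cluster_patients(patients)
print(clusters)
```

With these toy inputs, the two patients sharing fracture-related terms fall into one group and the two sharing muscle-related terms into another; patients in the same group would then be candidates for checking whether they share variants in the same gene.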
Although this proposal focuses on MSK conditions, we anticipate that evaluation of the bioinformatics algorithms, platforms for deep phenotyping at scale and machine learning approaches will be informative for other disease domains in the 100KGP and can be leveraged to increase diagnostic yield across the dataset, as well as helping to maximise the research potential of this unique resource.

 
Description Providing genetic diagnoses for patients
Geographic Reach National 
Policy Influence Type Contribution to new or improved professional practice
Impact A genetic diagnosis benefits patients and helps their family members.
 
Title Phenotype questionnaire for clinicians 
Description Developed a questionnaire for clinicians to augment the clinical data associated with patients in the 100,000 Genomes Project. This questionnaire has been developed following broad consultation with clinicians at our multi-disciplinary team (MDT) and in the UK more generally. 
Type Of Material Physiological assessment or outcome measure 
Year Produced 2023 
Provided To Others? Yes  
Impact The questionnaires will enable us to collect more detailed clinical characteristics of patients in the 100,000 Genomes Project, which will aid identification of causative genes for these rare musculoskeletal conditions. 
 
Title Phenotype questionnaire for patients 
Description Questionnaire developed in consultation with patients and the patient group, the Brittle Bone Society, to facilitate collection of self-reported patient data 
Type Of Material Physiological assessment or outcome measure 
Year Produced 2023 
Provided To Others? Yes  
Impact The questionnaires will enable us to collect more detailed phenotypic data on musculoskeletal characteristics of patients in the 100,000 Genomes Project, which will aid the identification of causative genes for these rare conditions. 
 
Description Collaboration on 100,000 Genomes Project 
Organisation Genomics England
Country United Kingdom 
Sector Public 
PI Contribution The grant focuses on analysis of whole genome sequencing data for patients enrolled in the musculoskeletal domain of the 100,000 Genomes Project.
Collaborator Contribution Genomics England has provided some bioinformatics support, and is generating resources within the Research Environment, to allow us, and other users, to more readily interrogate splicing variants across the whole project.
Impact The approach has helped us to identify some splicing variants in specific genes of interest
Start Year 2022
 
Title Patient2Genes 
Description Machine learning approach to identify genes associated with specific conditions from large whole genome sequencing datasets 
Type Of Technology Software 
Year Produced 2023 
Impact Prototype developed, still under development 
 
Title Patient2Mutations 
Description Software to identify pathogenic variants associated with specific diseases from whole genome sequencing datasets 
Type Of Technology Software 
Year Produced 2023 
Impact Prototype software written, still in development 
 
Title SVRare: discovering disease-causing structural variants in the 100K Genomes Project 
Description Software/bioinformatics pipeline to interrogate structural variants in whole genome sequencing data 
Type Of Technology Software 
Year Produced 2022 
Impact The software details were published in 2021. Since then, we have applied this tool in the 100,000 Genomes Project as part of this MRC grant and identified multiple structural variants that underlie patients' disease. 
URL https://www.medrxiv.org/content/10.1101/2021.10.15.21265069v1
 
Description Rare Disease video 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Video highlighting Oxford's programmes in rare disease genomics and the progression to developing advanced therapeutics.
These therapies are the type of research outcomes we anticipate will result from this grant into musculoskeletal disorders.
Year(s) Of Engagement Activity 2022,2023
URL https://www.youtube.com/watch?v=iGHis8MAjdc