Identifying New Disease Genes & Mechanisms for Musculoskeletal Disorders in 100K Genomes Project using Bioinformatics, Phenotyping & Machine Learning
Lead Research Organisation:
University of Oxford
Department Name: Wellcome Trust Centre for Human Genetics
Abstract
Genetic disorders which affect the development of the skeleton or muscles are collectively common, even if individually rare. Providing a genetic diagnosis for the patients and their families is important for ending what is often a lengthy diagnostic odyssey. For their clinicians, it may inform provision of the correct treatment. Understanding the genetic basis of these rare musculoskeletal (MSK) disorders may also provide insights into common MSK disorders, which are a major cause of disability and impairment of quality of life for millions of people in the UK.
In the past, genetic diagnosis of rare MSK diseases has relied on sequencing panels of known genes to identify the causative gene, but the diagnostic yield of such panel-based sequencing is low because many disease genes have not yet been identified.
With technological improvements and cost reductions, sequencing of patients' entire genomes (the full complement of their DNA) has become a possibility. Furthermore, many types of genetic variants can be interrogated from genome sequence data: not just those involving single base pairs, but also more complex duplications, deletions or transpositions of segments of the genome, as well as variants in non-coding regions such as introns and the regions between genes. These non-coding regions have increasingly been recognised to play important roles in regulating gene expression, but we understand considerably less about their clinical significance.
Interrogation of patients' genomes to identify the disease-causing variants therefore still presents many challenges. Recognising the potential of this genome sequencing approach, the UK launched a national programme (100KGP) to identify pathogenic variants in 100,000 patients, with the aim of improving diagnoses for these patients that might also inform their personalised treatment. Run by Genomics England, sequencing of these patients is now complete and it is estimated that diagnoses have been found for a quarter of the rare disease patients so far. Solving the rest of these cases will require intense effort on the part of the research community to investigate the different variant types described above.
This proposal aims to contribute to that effort focusing on patients with musculoskeletal and related developmental conditions.
We will use both existing Genomics England (GeL) algorithms and our own bioinformatics tools to analyse the genome sequence data to ensure we have investigated all possible variants, and then employ a variety of genetic strategies to assess whether the candidate variants are potentially pathogenic. We invariably need clinical or x-ray data beyond that already collected by the GeL programme. This is often available in medical records, so we have identified routes to retrieving it which involve clinicians and patients themselves. We have established a clinical multi-disciplinary team to enable discussion of cases, and will employ expertise in clinical radiology assessments to ensure systematic analysis of x-ray data. We will also ask patients to provide us with self-reported data, as we know from other research studies that patients are very good at remembering which bones they have broken and when. Finally, we will see if machine learning or 'artificial intelligence' can help us identify patterns in these vast and complex datasets which could not be identified by manual inspection.
We anticipate that these efforts will help us provide diagnoses for many more MSK patients in the 100KGP, and that the same approaches can then be adopted for other diseases in the 100KGP, extending genetic diagnoses to further patients.
Technical Summary
Whole genome sequencing (WGS) has the potential to revolutionise diagnosis of rare diseases. Recognising this, the UK has established a national programme to sequence 100,000 genomes (100KGP). To date, analysis in 100KGP has primarily focused on known disease genes for a given condition and on particular variant types, predominantly single nucleotide variants (SNVs), and a diagnostic yield of ~25% has been achieved. A more research-focused effort is now required to investigate novel disease genes and variant types, such as copy number variants and other structural variants (CNVs/SVs) and non-coding variants, that are largely unexplored in the 100KGP to date. To address this requirement, we will focus on patients in the 100KGP with musculoskeletal (MSK) disorders, a clinically and genetically heterogeneous group of conditions accounting for >1,000 cases in 100KGP. In preliminary studies, we have already identified a non-coding variant that contributes 1% to the diagnostic yield of osteogenesis imperfecta, and 2 complex SVs in known genes.
We will use novel bioinformatics techniques to comprehensively analyse the WGS data, integrating the various variant types to identify putative novel disease genes. We will combine this with deep phenotyping, as core MSK clinical data has not been collected by 100KGP and is required in the assessment of candidate genes. We will also evaluate whether machine learning can be used to identify clusters of genotypes or phenotypes from these complex, high-dimensional datasets, enabling novel genotype/phenotype correlations to be observed.
Although this proposal focuses on MSK conditions, we anticipate that evaluation of the bioinformatics algorithms, platforms for deep phenotyping at scale and machine learning approaches will be informative for other disease domains in the 100KGP and can be leveraged to increase diagnostic yield across the dataset, as well as helping to maximise the research potential of this unique resource.
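To make the idea of identifying phenotype clusters concrete, the following is a minimal sketch only, not the method used in this project. It assumes each patient is described by a set of phenotype terms, measures dissimilarity with the Jaccard distance, and groups patients by single-linkage clustering at a fixed threshold. The patient identifiers, term names, the function names and the 0.5 threshold are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_patients(phenotypes, max_distance=0.5):
    """Single-linkage clustering of patients by phenotype-term overlap.

    Two patients are linked when their Jaccard distance is at most
    max_distance; clusters are the connected components of those links.
    Returns a sorted list of sorted patient-ID lists.
    """
    # Union-find structure: every patient starts in its own cluster.
    parent = {p: p for p in phenotypes}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    # Merge the clusters of every sufficiently similar pair.
    for p, q in combinations(phenotypes, 2):
        if jaccard_distance(phenotypes[p], phenotypes[q]) <= max_distance:
            parent[find(p)] = find(q)

    clusters = {}
    for p in phenotypes:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


# Invented toy data: hypothetical patients and illustrative term labels.
toy_phenotypes = {
    "P1": {"fractures", "blue_sclerae", "short_stature"},
    "P2": {"fractures", "blue_sclerae"},
    "P3": {"muscle_weakness", "contractures"},
    "P4": {"muscle_weakness", "contractures", "scoliosis"},
    "P5": {"polydactyly"},
}

if __name__ == "__main__":
    for cluster in cluster_patients(toy_phenotypes):
        print(cluster)
```

On this toy input the two fracture-related patients group together, the two muscle-related patients group together, and the remaining patient stays alone. Real datasets of the kind described above would need richer features and a more careful choice of similarity measure and threshold.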
Publications


Chauhan V
(2022)
COPER: Continuous Patient State Perceiver

Chauhan V
(2022)
Continuous Patient State Attention Models

Ferla MP
(2022)
Venus: Elucidating the Impact of Amino Acid Variants on Protein Function Beyond Structure Destabilisation.
in Journal of molecular biology

Irving M
(2023)
European Achondroplasia Forum guiding principles for the detection and management of foramen magnum stenosis.
in Orphanet journal of rare diseases

Moore AR
(2023)
Use of genome sequencing to hunt for cryptic second-hit variants: analysis of 31 cases recruited to the 100 000 Genomes Project.
in Journal of medical genetics

Nagy S
(2024)
Autosomal recessive VWA1-related disorder: comprehensive analysis of phenotypic variability and genetic mutations.
in Brain communications

Pagnamenta AT
(2023)
The prevalence and phenotypic range associated with biallelic PKDCC variants.
in Clinical genetics

Pagnamenta AT
(2023)
Structural and non-coding variants increase the diagnostic yield of clinical whole genome sequencing for rare diseases.
in Genome medicine

Pagnamenta AT
(2022)
Variable skeletal phenotypes associated with biallelic variants in PRKG2.
in Journal of medical genetics
Description | Guiding principles for the detection and management of foramen magnum stenosis |
Geographic Reach | Europe |
Policy Influence Type | Influenced training of practitioners or researchers |
Impact | Improved diagnosis and treatment of an achondroplasia subtype. |
URL | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10375694 |
Description | Providing genetic diagnoses for patients |
Geographic Reach | National |
Policy Influence Type | Contribution to new or improved professional practice |
Impact | Diagnosis of a genetic condition benefits patients and helps their family members. To date (updated March 2024) diagnoses have been provided for ~150 patients since the start of this grant and 391 overall. |
Description | NIHR Biomedical Research Centres |
Amount | £4,700,000 (GBP) |
Funding ID | NIHR203311 |
Organisation | Oxford University Hospitals NHS Foundation Trust |
Sector | Academic/University |
Country | United Kingdom |
Start | 12/2022 |
End | 11/2027 |
Description | Research Chair |
Amount | £1,586,000 (GBP) |
Organisation | Royal Academy of Engineering |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 04/2023 |
End | 04/2028 |
Description | Research Professorship |
Amount | £1,823,387 (GBP) |
Funding ID | NIHR302440 |
Organisation | National Institute for Health Research |
Sector | Public |
Country | United Kingdom |
Start | 12/2022 |
End | 11/2027 |
Title | Phenotype questionnaire for clinicians |
Description | Developed a questionnaire for clinicians to augment the clinical data associated with patients in the 100,000 Genomes Project. This questionnaire has been developed following broad consultation with clinicians at our MDT and in the UK more generally. |
Type Of Material | Physiological assessment or outcome measure |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | The questionnaires will enable us to collect more detailed clinical characteristics of patients in the 100,000 Genomes Project, which will aid identification of causative genes for these rare musculoskeletal conditions. |
Title | Phenotype questionnaire for patients |
Description | Questionnaire developed in consultation with patients and the patient group, the Brittle Bone Society, to facilitate collection of self-reported patient data |
Type Of Material | Physiological assessment or outcome measure |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | The questionnaires will enable us to collect more detailed phenotypic data on musculoskeletal characteristics of patients in the 100,000 Genomes Project, which will aid the identification of causative genes for these rare conditions. |
Title | AI for Complex Healthcare Data |
Description | The primary output of this research activity is AI-based methods for training models from multimodal healthcare data, and for using the resulting models for phenotyping, prediction, and decision support. The activity described is one of the UK's largest "AI for Healthcare" teams, supported by this award. |
Type Of Material | Computer model/algorithm |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | Citations, collaborations, implementations. |
Description | Collaboration on 100,000 Genomes Project |
Organisation | Genomics England |
Country | United Kingdom |
Sector | Public |
PI Contribution | The grant focuses on analysis of whole genome sequencing data for patients enrolled in the musculoskeletal domain of the 100,000 Genomes Project. |
Collaborator Contribution | Genomics England has provided some bioinformatics support, and is generating resources within the Research Environment, to allow us, and other users, to more readily interrogate splicing variants across the whole project. |
Impact | The approach has helped us to identify some splicing variants in specific genes of interest |
Start Year | 2022 |
Description | Collaboration with Origins of Bone and Cartilage Disease Project |
Organisation | Imperial College London |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We are investigating whether genes identified from the International Mouse Phenotyping Consortium mouse mutagenesis project which have been found to cause defects in murine bone quality or quantity also cause skeletal phenotypes in humans, by analysing the whole genome sequencing data in the 100,000 Genomes Project |
Collaborator Contribution | Our partners, Prof Duncan Bassett and Prof Graham Williams, have generated the detailed skeletal phenotyping data on 1000 mouse single gene deletion lines |
Impact | We have received a list of the genes which cause defects in mouse bone quality or quantity and are checking these to see if there are any patients with variants in the equivalent human genes in the Genomics England dataset. |
Start Year | 2023 |
Title | Genetic diagnosis for ENPP1 deficiency enabling patient entry into clinical trial |
Description | By providing a patient with a genetic diagnosis for ENPP1 deficiency, the patient has been shown to be eligible for a clinical trial of a recombinant ENPP1 protein product being developed by Inozyme Pharma. The contribution from this MRC grant is to provide the genetic diagnosis from whole genome sequencing data available in the 100,000 Genomes Project. The patient's clinician subsequently referred the patient to the clinical trial once the diagnosis had been confirmed in an accredited genetics lab. Our award does not cover the clinical development of the ENPP1 protein, and we are not involved in the clinical trial, but it is important to record these outcomes too (and there is no other place to record such outcomes on this portal since the 'Other Outcomes' category has now been removed). |
Type | Management of Diseases and Conditions |
Current Stage Of Development | Early clinical assessment |
Year Development Stage Completed | 2023 |
Development Status | Under active development/distribution |
Impact | We think it important to record this as the aim of our MRC project is to ensure that genetic diagnoses for patients inform their treatment where possible. Inozyme announced positive interim data from its Phase 1/2 trial of 9 patients in Sept 2023. |
URL | https://www.biospace.com/article/releases/inozyme-pharma-announces-positive-interim-data-from-ongoin... |
Title | Patient2Genes |
Description | Machine learning approach to identify genes associated with specific conditions from large whole genome sequencing datasets |
Type Of Technology | Software |
Year Produced | 2023 |
Impact | Prototype developed, still under development |
Title | Patient2Mutations |
Description | Software to identify pathogenic variants associated with specific diseases from whole genome sequencing datasets |
Type Of Technology | Software |
Year Produced | 2023 |
Impact | Prototype software written, still in development |
Title | SVRare: discovering disease-causing structural variants in the 100K Genomes Project |
Description | Software/bioinformatics pipeline to interrogate structural variants in whole genome sequencing data |
Type Of Technology | Software |
Year Produced | 2022 |
Impact | The software details were published in 2021 but since then we have applied this tool in the 100,000 Genomes Project as part of this MRC grant and identified multiple structural variants responsible for the disease pathogenesis of patients. |
URL | https://www.medrxiv.org/content/10.1101/2021.10.15.21265069v1 |
Description | Medics4RareDiseases interview |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Recorded 5 minute interview at an event for the charity Medics4RareDiseases. Discussed importance of research into rare diseases and discussed the work of this particular Genomics England Clinical Interpretation Partnership. This interview was for the Medics4RareDiseases YouTube channel. |
Year(s) Of Engagement Activity | 2024 |
Description | PPI Activities |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Public/other audiences |
Results and Impact | PPI activities, undertaken at our AI lab in Oxford |
Year(s) Of Engagement Activity | 2023,2024 |
Description | Rare Disease video |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Video highlighting Oxford's programmes in rare disease genomics and the progression to developing advanced therapeutics. These therapies are the type of research outcomes we anticipate will result from this grant into musculoskeletal disorders. |
Year(s) Of Engagement Activity | 2022,2023 |
URL | https://www.youtube.com/watch?v=iGHis8MAjdc |