Novel Methodology for predicting the Functional Effects of Genetic Variation

Lead Research Organisation: University of Bristol
Department Name: Engineering Mathematics and Technology

Abstract

Rapid developments in next-generation sequencing technologies
have led to a rapid increase in the number of identified genetic variants
that may be implicated in disease. Laboratory-based experiments
to investigate the functional consequences of these genetic variants
are costly and time-consuming. The core application focus of the project
is to construct algorithmic methods for predicting
the functional consequences of genetic variants, a problem where we have already
achieved some state-of-the-art results in certain contexts. 50% of the project period will be
given over to the design of improved algorithms specific to this application.
The other 50% of the project period will be given over to achieving
state-of-the-art performance with these new methods, and existing methods,
in application to the stated goal of predicting which variants, insertions
and deletion events are associated with human disease. The pathways to impact
are briefly described in the Case and more fully detailed in the Pathways to Impact
statement.

Planned Impact

This is a multidisciplinary project in which algorithm development (to attain optimal predictive performance) is strongly linked to the very important application of predicting which single nucleotide variants and indels are functional in disease. The project will have the following impacts:

1. The algorithms we will develop have a broad range of applications beyond the immediate project.
The stochastic edit code algorithms we propose to develop (Section 3.2) could be used in text processing
in general. The MKL algorithms (Section 3.5) have potential applications in machine vision, bioinformatics
and supervised network inference, for example.

2. This project has an excellent tie-in with recently announced initiatives, such as Genomics England
(www.genomicsengland.co.uk), a 300 million pound project to gather 100,000 genomes (through
NHS facilities). The ability to functionally annotate genomes is of obvious paramount importance
to exploit this.

3. The developed tools would be useful in guiding experimentalists toward areas for investigation:

3.1. There is some prospect for the discovery of novel monogenic diseases, or highlighting the role of
SNVs in multi-factorial disease.

3.2. Interstitial deletion events leading to fusion genes are known disease-drivers in certain cancer contexts
and work in Section 3.1 [4.] and Section 3.2 could highlight novel drug targets.

In our benchmarking study (reported in Section 2.3), we have identified 90,000 SNVs that are indicated
(at 90% or higher confidence) as functional in disease.

4. As reported in Pathways to Impact, we are in a good position to pursue any follow-up experimental studies
of identified targets.

Publications

Campbell C (2023) Predicting pathogenicity from non-coding mutations. in Nature biomedical engineering

Richardson TG (2016) A pathway-centric approach to rare variant association analysis. in European journal of human genetics : EJHG

Rogers MF (2015) Probabilistic inference of biological networks via data integration. in BioMed research international

Shihab HA (2017) HIPred: an integrative approach to predicting haploinsufficient genes. in Bioinformatics (Oxford, England)

Shihab HA (2017) GTB - an online genome tolerance browser. in BMC bioinformatics

 
Description Our FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:
• Hashem A. Shihab, Julian Gough, David N. Cooper, Peter D. Stenson, Gary L.A. Barker, Keith J. Edwards, Ian N.M. Day, Tom R. Gaunt. Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. (2013), 34:57-65.
Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, they focused on the observation that SNVs in regions of the genome which are highly conserved across species are more likely to be deleterious than variants in regions with high variability across species.
As an improvement, we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources for predicting the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source. The method used multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of Learning with Support Vector Machines). The algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:
• Hashem Shihab, Mark Rogers, Julian Gough, Matthew Mort, David Cooper, Ian Day, Tom Gaunt and Colin Campbell. An Integrative Approach to Predicting the Functional Effects of Non-Coding and Coding Sequence Variation. Bioinformatics Vol. 31, No. 10, 2015, pages 1536-1543.
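The idea of weighting data sources by informativeness can be illustrated with a minimal sketch. This is not the FATHMM-MKL implementation: it swaps in a simple kernel-target alignment heuristic in place of a full multiple kernel learning solver, and the data source names ("conservation", "noise"), the feature values and all function names are hypothetical, chosen only for illustration.

```python
import math


def linear_kernel(X):
    """Gram matrix K[i][j] = <x_i, x_j> for a list of feature vectors."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def frobenius(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def alignment(K, y):
    """Kernel-target alignment between K and the ideal kernel y y^T."""
    Y = [[yi * yj for yj in y] for yi in y]
    denom = math.sqrt(frobenius(K, K) * frobenius(Y, Y))
    return frobenius(K, Y) / denom if denom > 0 else 0.0


def kernel_weights(kernels, y):
    """Normalise non-negative alignment scores so the weights sum to 1."""
    scores = {name: max(alignment(K, y), 0.0) for name, K in kernels.items()}
    total = sum(scores.values())
    if total == 0:
        # No source aligns with the labels: fall back to uniform weights.
        return {name: 1.0 / len(kernels) for name in kernels}
    return {name: s / total for name, s in scores.items()}


def combine(kernels, weights):
    """Weighted sum of Gram matrices: K = sum_m w_m * K_m."""
    n = len(next(iter(kernels.values())))
    return [[sum(weights[m] * K[i][j] for m, K in kernels.items())
             for j in range(n)] for i in range(n)]


# Toy data: four variants, label +1 = pathogenic, -1 = neutral (made-up values).
y = [1, 1, -1, -1]
kernels = {
    "conservation": linear_kernel([[0.9], [0.8], [-0.7], [-0.9]]),  # tracks the labels
    "noise": linear_kernel([[0.5], [-0.4], [0.5], [-0.5]]),         # nearly unrelated
}
weights = kernel_weights(kernels, y)
K_combined = combine(kernels, weights)
print(weights)
```

In this toy setting the source that tracks the labels receives almost all of the weight. The combined Gram matrix could then be passed to any kernel classifier that accepts a precomputed kernel, such as a support vector machine.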
We later improved the method a little:
• Mark Rogers, Hashem Shihab, Tom Gaunt, Matthew Mort, David Cooper, and Colin Campbell. Sequential Data Selection for Predicting the Pathogenic Effects of Sequence Variation. Proceedings, 2015 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2015, B394).
FATHMM-MKL has been found to be state-of-the-art in comparative surveys by other groups.
We have also devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome. Peaks near unity in the depicted plots indicate probable pathogenic SNVs and peaks near zero indicate neutral variants. Other prediction methods are presented, e.g. CADD, some as optional tracks. Reference:
• Hashem A. Shihab, Mark F. Rogers, Michael Ferlaino, Colin Campbell and Tom R. Gaunt. GTB - an online genome tolerance browser. BMC Bioinformatics 2017, 18:20. DOI: 10.1186/s12859-016-1436-4.
Subsequent to this we developed an indel predictor (for estimating the pathogenic effects of short insertions or deletions of genetic code). This predictor can handle indels in non-coding regions of the human genome:
• Michael Ferlaino, Mark F Rogers, Hashem A Shihab, Tom R Gaunt, Matthew Mort, David N Cooper, Colin Campbell. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. Journal submission.
A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. Thus we are devising a suite of predictors in the context of cancer under the generic title of CScape. Our first generic cancer predictor, CScape, uses a wide variety of data sources to predict whether a single nucleotide variant is potentially a disease driver for cancer:
• Mark F. Rogers, Hashem A. Shihab, Tom R. Gaunt and Colin Campbell. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Journal submission.
Our baseline predictor appears more accurate than competitors and was based on data from COSMIC and up to 30 different types of genomic data sources. The method was benchmarked on independent data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), in addition to other databases. It is able to make predictions in both the coding and non-coding regions of the cancer genome, though it is much more accurate in coding regions. We furthermore introduced a confidence measure for the predicted class label. By restricting prediction to the highest confidence instances, the resultant classifier can perform at approximately 90% test accuracy (in coding regions), though it achieves this level of accuracy at only a minority of nucleotide positions (about 17%). These high confidence predicted potential disease-driver variants are typically clustered by location in the cancer genome, and the method highlights exons in 191 autosomal genes in which mutational change could act as a disease driver.
Finally, we have developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:
• Hashem Shihab, Mark Rogers, Colin Campbell and Tom Gaunt. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics, https://doi.org/10.1093/bioinformatics/btx028 (2017).
The nuclei of many human cells are diploid: they contain two complete sets of chromosomes, one from each parent (in humans, germ cells are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of a gene product, leading to a disease trait.
***Added 16th February 2022***: the FATHMM-MKL tool developed in this research is an optional software filter for Ion Torrent Genome Sequencing machines (Google 'FATHMM scores Ion Torrent').
The method is the sole mutation impact predictor at COSMIC in Cambridge, the world's largest cancer genome archive (Google 'FATHMM-MKL cosmic' and 'cancer cosmic'). There are more than 100 papers benchmarking or analysing the method (search FATHMM-MKL and FATHMM in Google Scholar).
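The trade-off described above for the CScape confidence measure, higher accuracy at the cost of predicting at fewer positions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the published method: confidence is taken here to be max(p, 1 - p) for a predicted probability p, and the function name, probabilities and labels are hypothetical toy values.

```python
def selective_accuracy(probs, labels, threshold):
    """Return (coverage, accuracy) when predicting only at confident positions.

    probs     -- predicted probability that each variant is a driver
    labels    -- true labels, 1 = driver, 0 = neutral
    threshold -- minimum confidence max(p, 1 - p) required to make a call
    """
    kept = [(p, y) for p, y in zip(probs, labels) if max(p, 1 - p) >= threshold]
    coverage = len(kept) / len(probs)
    if not kept:
        return coverage, None  # no position met the confidence threshold
    correct = sum((p >= 0.5) == (y == 1) for p, y in kept)
    return coverage, correct / len(kept)


# Toy predictions for eight positions (made-up values).
probs = [0.97, 0.55, 0.45, 0.08, 0.60, 0.40, 0.93, 0.52]
labels = [1, 0, 1, 0, 1, 0, 1, 1]

print(selective_accuracy(probs, labels, 0.5))  # predict everywhere
print(selective_accuracy(probs, labels, 0.9))  # predict only when confident
```

Raising the threshold reduces coverage while typically raising accuracy on the retained positions, which is the pattern reported above (about 90% accuracy at about 17% of positions in coding regions).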
First Year Of Impact 2015
Sector Healthcare
Impact Types Societal

 
Title Sequence variant predictor (human disease) 
Description Available at http://fathmm.biocompute.org.uk. 
Type Of Material Model of mechanisms or symptoms - human 
Year Produced 2014 
Provided To Others? Yes  
Impact Plugin for highly used software tools such as Ensembl variant effect predictor VEP and COSMIC (Sanger Centre) 
URL http://fathmm.biocompute.org.uk
 
Description FATHMM-MKL adopted as software filter for Ion Torrent whole genome sequencers 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact FATHMM-MKL adopted as a software filter:
https://ionreporter.thermofisher.com/ionreporter/help/GUID-4E7D0B57-D7E5-4BA2-977F-D84260BF23E5.html
for Ion Torrent Genome Sequencing Machines:
https://www.thermofisher.com/uk/en/home/brands/ion-torrent.html
Year(s) Of Engagement Activity 2017,2018,2019,2020,2021,2022,2023
URL https://ionreporter.thermofisher.com/ionreporter/help/GUID-4E7D0B57-D7E5-4BA2-977F-D84260BF23E5.html
 
Description Research resource 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact From 2015 to 2022, our method, FATHMM-MKL, was adopted as the only mutation impact predictor at COSMIC in Cambridge (Forbes, D. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, 45(D1):D777-D783, 11 2016). COSMIC is by far the world's largest cancer genome archive. FATHMM-MKL predicted which single nucleotide variants in the human cancer genome are drivers of unregulated cell proliferation, with an associated probability score (SNVs are the most common driver mutation in cancer).
Year(s) Of Engagement Activity 2015,2016,2017,2018,2019,2020,2021,2022,2023
URL https://cancer.sanger.ac.uk/cosmic/analyses
 
Description Update 12 February 2024 (from C.Campbell): 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Update (2024): As mentioned previously, the algorithms developed on this project were very successful. Highlights include the use of FATHMM-MKL as the mutation impact predictor at COSMIC in Cambridge in the period 2015 to 2022 and its adoption as a filter for Ion Torrent sequencing machines. A current search via Google Scholar gives 1420 usages for FATHMM-MKL in the literature (648 citations), 5820 for FATHMM, 484 for FATHMM-XF, etc. A number of publications related to the project are available here: https://seis.bristol.ac.uk/~enicgc/software.htm. It is true to say that these FATHMM algorithms have become obsolete due to the efforts of competing research groups. Nevertheless, the application of machine learning to genomics, and this particular area of variant effect predictors, remains very rich for new and potentially high impact research projects and proposed new prediction tools. A constraint has been a lack of research funding and some disruption during the pandemic. However, we remain committed to developing higher accuracy and more capable tools beyond FATHMM-MKL and CScape. We have gained a PhD studentship from Cancer Research UK and have devised a new data resource with over 1500 data sources (DrivR-Base: paper under review, but the GitHub directory is here: https://github.com/amyfrancis97/DrivR-Base), in place of the 30 types of data used in the construction of the original FATHMM-MKL. We are therefore well positioned to build a replacement for FATHMM-MKL which will be more accurate. We have evaluated a cancer-specific mutation impact predictor, to be called CanDrivR, which is more accurate and capable than FATHMM-MKL or CScape, in application to variants in the human cancer genome. 
Much further work is needed to perfect this new tool, and an eventual step is to derive predictions at all 3 billion positions in the human genome. This last step requires substantial computing, and we have gained, or applied for, accounts on the University of Bristol based Isambard3 and Isambard-AI machines, and Archer2. With two leading research groups, we have also discussed a more immediate application of these variant effect predictors to the identification of drivers within circulating tumour DNA (ctDNA) from liquid biopsy. The above research investigations also integrate very well with multiple government initiatives such as Our Future Health, the Genomics England initiatives, the Generations Study, and other initiatives such as UK Biobank.
Year(s) Of Engagement Activity 2024
URL https://seis.bristol.ac.uk/~enicgc/software.htm