Novel Methodology for predicting the Functional Effects of Genetic Variation

Lead Research Organisation: University of Bristol
Department Name: Engineering Mathematics and Technology

Abstract

Rapid developments in next-generation sequencing technologies
have led to a sharp increase in identified genetic variants,
which may be implicated in disease. Laboratory-based experiments
to investigate the functional consequences of these genetic variants
are costly and time-consuming. The core application focus of the project
is to construct algorithmic methods for predicting
the functional consequences of genetic variants, a problem where we have already
achieved state-of-the-art results in certain contexts. 50% of the project period will be
given over to the design of improved algorithms specific to this application.
The other 50% of the project period will be given over to achieving
state-of-the-art performance with these new methods, and existing methods,
in application to the stated goal of predicting which variants, insertions
and deletion events are associated with human disease. The pathways to impact
are briefly described in the Case and more fully detailed in the Pathways to Impact
statement.

Planned Impact

This is a multidisciplinary project in which algorithm development (to attain optimal predictive performance) is strongly linked to the very important application of predicting which single nucleotide variants and indels are functional in disease. The project will have the following impacts:

1. The algorithms we will develop have a range of applications beyond the immediate project.
The stochastic edit code algorithms we propose to develop (Section 3.2) could be used in text processing
more generally. The MKL algorithms (Section 3.5) have potential applications in, for example, machine vision,
bioinformatics and supervised network inference.

2. This project ties in well with recently announced initiatives, such as Genomics England
(www.genomicsengland.co.uk), a 300 million pound project to gather 100,000 genomes through
NHS facilities. The ability to functionally annotate genomes is of obvious paramount importance in
exploiting such an initiative.

3. The developed tools would be useful in guiding experimentalists toward areas for investigation:

3.1. There is some prospect for the discovery of novel monogenic diseases, or highlighting the role of
SNVs in multi-factorial disease.

3.2. Interstitial deletion events leading to fusion genes are known disease-drivers in certain cancer contexts
and work in Section 3.1 [4.] and Section 3.2 could highlight novel drug targets.

In our benchmarking study (reported in Section 2.3), we have identified 90,000 SNVs indicated, at 90% or higher
confidence, as functional in disease.

4. As reported in Pathways to Impact, we are in a good position to pursue any follow-up experimental studies
of identified targets.

Publications

 
Description Our FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:
• Hashem A. Shihab, Julian Gough, David N. Cooper, Peter D. Stenson, Gary L.A. Barker, Keith J. Edwards, Ian N.M. Day, Tom R. Gaunt. Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. (2013), 34:57-65.
Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, the focus was on the observation that SNVs in regions of the genome which are highly conserved across species are more likely to be deleterious than variants in regions with high variability across species. As an improvement we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources for predicting the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source of information. This method used multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of Learning with Support Vector Machines). The algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:
• Hashem Shihab, Mark Rogers, Julian Gough, Matthew Mort, David Cooper, Ian Day, Tom Gaunt and Colin Campbell. An Integrative Approach to Predicting the Functional Effects of Non-Coding and Coding Sequence Variation. Bioinformatics Vol. 31, No. 10, 2015, pages 1536-1543.
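The idea of weighting data sources by their informativeness, as in the multiple kernel learning approach described above, can be illustrated with a minimal sketch. It is not the FATHMM-MKL implementation: it swaps in a simple kernel-target alignment heuristic in place of a learned MKL optimisation, and the two feature groups (`informative`, `noise`) and all function names are hypothetical, chosen only for illustration.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of a linear kernel for one data source."""
    return X @ X.T

def alignment(K, y):
    """Kernel-target alignment: cosine similarity between K and y y^T."""
    Y = np.outer(y, y)
    return float(np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y)))

def combine_kernels(kernels, y):
    """Weight each kernel by its non-negative alignment, normalised to sum to 1.

    Assumption: alignment is used here as a stand-in for the weights that a
    full MKL solver would learn.
    """
    scores = np.array([max(alignment(K, y), 0.0) for K in kernels])
    weights = scores / scores.sum()
    K_combined = sum(w * K for w, K in zip(weights, kernels))
    return weights, K_combined

rng = np.random.default_rng(0)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)  # toy labels: pathogenic / neutral

# Hypothetical feature groups: one tracks the label closely, one is pure noise.
informative = y[:, None] + 0.1 * rng.normal(size=(6, 3))
noise = rng.normal(size=(6, 3))

weights, K = combine_kernels(
    [linear_kernel(informative), linear_kernel(noise)], y
)
print("kernel weights:", weights)
```

In this toy setting the kernel built from the informative feature group receives the larger weight, which mirrors the statement above that the most informative data source dominates the combined predictor.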
We later improved the method a little:
• Mark Rogers, Hashem Shihab, Tom Gaunt, Matthew Mort, David Cooper, and Colin Campbell. Sequential Data Selection for Predicting the Pathogenic Effects of Sequence Variation. Proceedings, 2015 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2015, B394).
FATHMM-MKL has been found to be state-of-the-art in comparative surveys by other groups. We have also devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome. Peaks near unity in the depicted plots indicate probable pathogenic SNVs and peaks near zero indicate neutral SNVs. Other prediction methods, e.g. CADD, are also presented, some as optional tracks. Reference:
• Hashem A. Shihab, Mark F. Rogers, Michael Ferlaino, Colin Campbell and Tom R. Gaunt. GTB - an online genome tolerance browser. BMC Bioinformatics 2017, 18:20. DOI: 10.1186/s12859-016-1436-4.
Subsequent to this we developed an indel predictor (for estimating the pathogenic effects of short insertions or deletions of genetic code). This predictor can handle indels in non-coding regions of the human genome:
• Michael Ferlaino, Mark F Rogers, Hashem A Shihab, Tom R Gaunt, Matthew Mort, David N Cooper, Colin Campbell. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. Journal submission.
A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. Thus we are devising a suite of predictors in the context of cancer under the generic title of CScape. Our first generic cancer predictor CScape uses a wide variety of data sources to predict if a single nucleotide variant is potentially a disease-driver for cancer:
• Mark F. Rogers, Hashem A. Shihab, Tom R. Gaunt and Colin Campbell. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Journal submission.
Our baseline predictor appears more accurate than competitors and was based on data from COSMIC and up to 30 different types of genomic data sources. The method was benchmarked on independent data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), in addition to other databases. It is able to make predictions in both the coding and non-coding regions of the cancer genome, though it is much more accurate in coding regions. We furthermore introduced a confidence measure for the predicted class label. By restricting prediction to the highest confidence instances, the resultant classifier achieves approximately 90% test accuracy (in coding regions), though it is only able to achieve this level of accuracy at a minority of nucleotide positions (about 17% of nucleotide positions). These high confidence predicted potential disease-driver variants are typically clustered by location in the cancer genome, and the method highlights exons in 191 autosomal genes in which mutational change could act as a disease-driver.
Finally, we have developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:
• Hashem Shihab, Mark Rogers, Colin Campbell and Tom Gaunt. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics, https://doi.org/10.1093/bioinformatics/btx028 (2017).
The nuclei of many human cells are diploid: they contain two complete sets of chromosomes, one from each parent (in humans, germ cells are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of a gene product, leading to a disease trait.
***Added 16th February 2022***: the FATHMM-MKL tool developed in this research is an optional software filter for Ion Torrent Genome Sequencing machines (Google 'FATHMM scores Ion Torrent'). The method is the sole mutation impact predictor at COSMIC in Cambridge, the world's largest cancer genome archive (Google 'FATHMM-MKL cosmic' and 'cancer cosmic'). There are at least 100 papers benchmarking or analysing the method (search FATHMM-MKL and FATHMM in Google Scholar).
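The confidence-restricted prediction described above for CScape trades coverage for accuracy: predictions are only made where the classifier is most confident. A minimal sketch of that trade-off follows. The definition of confidence as max(p, 1 - p), the 0.9 threshold, the function names and the toy scores are all assumptions for illustration; the source does not specify how the CScape confidence measure is computed.

```python
import numpy as np

def restrict_to_confident(probs, threshold=0.9):
    """Return a mask of confident positions and the predicted class labels.

    Assumption: confidence is max(p, 1 - p) for a predicted probability p
    that a variant is a disease-driver.
    """
    probs = np.asarray(probs, dtype=float)
    confidence = np.maximum(probs, 1.0 - probs)
    keep = confidence >= threshold
    labels = (probs >= 0.5).astype(int)
    return keep, labels

def coverage_and_accuracy(probs, truth, threshold=0.9):
    """Fraction of positions predicted, and accuracy on those positions."""
    keep, labels = restrict_to_confident(probs, threshold)
    truth = np.asarray(truth)
    coverage = float(keep.mean())
    if keep.any():
        accuracy = float((labels[keep] == truth[keep]).mean())
    else:
        accuracy = float("nan")
    return coverage, accuracy

# Toy scores and labels, invented for illustration only.
probs = [0.97, 0.55, 0.08, 0.62, 0.93, 0.40, 0.02, 0.51]
truth = [1, 0, 0, 1, 1, 1, 0, 0]

coverage, accuracy = coverage_and_accuracy(probs, truth)
print(f"coverage={coverage:.2f}, accuracy on confident positions={accuracy:.2f}")
```

Raising the threshold generally shrinks the set of positions at which a prediction is made while raising accuracy on that set, which is the behaviour reported above (high accuracy at a minority of nucleotide positions).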
First Year Of Impact 2015
Sector Healthcare
Impact Types Societal

 
Title Sequence variant predictor (human disease) 
Description Available at http://fathmm.biocompute.org.uk. 
Type Of Material Model of mechanisms or symptoms - human 
Year Produced 2014 
Provided To Others? Yes  
Impact Plugin for widely used software tools such as the Ensembl Variant Effect Predictor (VEP) and COSMIC (Sanger Centre) 
URL http://fathmm.biocompute.org.uk
 
Description FATHMM-MKL adopted as a software filter for Ion Torrent whole genome sequencers 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact FATHMM-MKL adopted as a software filter:
https://ionreporter.thermofisher.com/ionreporter/help/GUID-4E7D0B57-D7E5-4BA2-977F-D84260BF23E5.html
for Ion Torrent Genome Sequencing Machines:
https://www.thermofisher.com/uk/en/home/brands/ion-torrent.html
Year(s) Of Engagement Activity 2017,2018,2019,2020,2021,2022,2023
URL https://ionreporter.thermofisher.com/ionreporter/help/GUID-4E7D0B57-D7E5-4BA2-977F-D84260BF23E5.html
 
Description Research resource 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact From 2015 to 2022, our method, FATHMM-MKL, was adopted as the only mutation impact predictor at COSMIC in Cambridge (Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, 45(D1):D777-D783, 11 2016). COSMIC is by far the world's largest cancer genome archive. FATHMM-MKL predicted which single nucleotide variants in the human cancer genome are drivers of unregulated cell proliferation, with an associated probability score (SNVs are the most common driver mutation in cancer).
Year(s) Of Engagement Activity 2015,2016,2017,2018,2019,2020,2021,2022,2023
URL https://cancer.sanger.ac.uk/cosmic/analyses