Novel Methodology for predicting the Functional Effects of Genetic Variation

Lead Research Organisation: University of Bristol

Department Name: Engineering Mathematics and Technology

Abstract

Rapid developments in next-generation sequencing technologies
have led to a rapid increase in identified genetic variants,
which may be implicated in disease. Laboratory-based experiments
to investigate the functional consequences of these genetic variants
are costly and time-consuming. The core application focus of the project
is to construct algorithmically-based methods for predicting
the functional consequence of genetic variants, a problem where we have already
achieved some state-of-the-art results in certain contexts. 50% of the project period will be
given over to the design of improved algorithms specific to this application.
The other 50% of the project period will be given over to achieving
state-of-the-art performance with these new methods, and existing methods,
in application to the stated goal of predicting which variants, insertions
and deletion events are associated with human disease. The pathways to impact
are briefly described in the Case and more fully detailed in the Pathways to Impact
statement.

Planned Impact

This is a multidisciplinary project in which algorithm development (to attain optimal predictive performance) is strongly linked to the very important application of predicting which single nucleotide variants and indels are functional in disease. The project will have the following impacts:

1. The algorithms we will develop have a range of applications beyond the immediate project.
The stochastic edit code algorithms we propose to develop (Section 3.2) could be used in text processing
in general. The MKL algorithms (Section 3.5) have potential applications in, for example, machine vision,
bioinformatics and supervised network inference.

2. This project has an excellent tie-in with recently announced initiatives, such as Genomics England
(www.genomicsengland.co.uk), a 300 million pound project to gather 100,000 genomes (through
NHS facilities). The ability to functionally annotate genomes is of obvious paramount importance in
exploiting such initiatives.

3. The developed tools would be useful in guiding experimentalists toward areas for investigation:

3.1. There is some prospect for the discovery of novel monogenic diseases, or highlighting the role of
SNVs in multi-factorial disease.

3.2. Interstitial deletion events leading to fusion genes are known disease-drivers in certain cancer contexts
and work in Section 3.1 [4.] and Section 3.2 could highlight novel drug targets.

In our benchmarking study (reported in Section 2.3), we have identified 90,000 SNVs indicated (at 90% or higher
confidence) as functional in disease.

4. As reported in Pathways to Impact, we are in a good position to pursue any follow-up experimental studies
of identified targets.

Funded Value:

£271,071

Funded Period:

Jun 15 - May 18

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/M01715X/1

Principal Investigator:

Colin Campbell

Research Subject:

Info. & commun. Technol. (50%)

Tools, technologies & methods (50%)

Research Topic:

Fundamentals of Computing (50%)

See subject area (50%)

Organisations

University of Bristol (Lead Research Organisation)

People	ORCID iD
Colin Campbell (Principal Investigator)	http://orcid.org/0000-0002-4775-9802
Tom Gaunt (Co-Investigator)	http://orcid.org/0000-0003-0924-3247

Publications

Campbell C (2022) Predicting pathogenicity from non-coding mutations in Nature Biomedical Engineering

Darbyshire M (2019) Estimating the Frequency of Single Point Driver Mutations across Common Solid Tumours. in Scientific reports

Ferlaino M (2017) An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome in BMC Bioinformatics

Fernandez-Lozano C (2016) Texture analysis in gel electrophoresis images using an integrative kernel-based approach. in Scientific reports

Flynn B (2021) Corticosterone pattern-dependent glucocorticoid receptor binding and transcriptional regulation within the liver in PLOS Genetics

Jiang L (2016) RNA sequencing analysis of human podocytes reveals glucocorticoid regulated gene networks targeting non-immune pathways. in Scientific reports

Loh SY (2017) Unsupervised Network Analysis of the Plastic Supraoptic Nucleus Transcriptome Predicts Caprin2 Regulatory Interactions. in eNeuro

Luca BA (2020) A novel stratification framework for predicting outcome in patients with prostate cancer. in British journal of cancer

Richardson T (2016) A pathway-centric approach to rare variant association analysis in European Journal of Human Genetics

Richardson TG (2016) A Protein Domain and Family Based Approach to Rare Variant Association Analysis. in PloS one

Impact Summary
Research Tools and Methods
Engagement Activities


Description	Our FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:
• Hashem A. Shihab, Julian Gough, David N. Cooper, Peter D. Stenson, Gary L.A. Barker, Keith J. Edwards, Ian N.M. Day, Tom R. Gaunt. Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. (2013), 34:57-65.
Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, they focused on the observation that SNVs in regions of the genome which are highly conserved across species are more likely to be deleterious than variants in regions with high variability across species. As an improvement, we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources for predicting the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source. The method uses multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of Learning with Support Vector Machines). The algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:
• Hashem Shihab, Mark Rogers, Julian Gough, Matthew Mort, David Cooper, Ian Day, Tom Gaunt and Colin Campbell. An Integrative Approach to Predicting the Functional Effects of Non-Coding and Coding Sequence Variation. Bioinformatics Vol. 31, No. 10, 2015, pages 1536-1543.
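The idea of weighting data sources by relative informativeness can be illustrated with a small sketch. This is not the FATHMM-MKL algorithm: it swaps in a simple heuristic (kernel-target alignment) in place of a full multiple kernel learning solver, and the feature groups, the `gamma` value and all function names are hypothetical, chosen only for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix computed over the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def alignment(K, y):
    """Kernel-target alignment between K and the ideal kernel y y^T."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

def kernel_weights(kernels, y):
    """One non-negative weight per data source, normalised to sum to one."""
    a = np.array([max(alignment(K, y), 0.0) for K in kernels])
    return a / a.sum()

def combine(kernels, weights):
    """Weighted sum of kernel matrices (still a valid kernel)."""
    return sum(w * K for w, K in zip(weights, kernels))

# Toy data (assumption): 20 "pathogenic" (+1) and 20 "neutral" (-1) variants.
rng = np.random.default_rng(0)
y = np.array([1.0] * 20 + [-1.0] * 20)

# Hypothetical feature groups: one that tracks the label, one that is pure noise.
informative = 1.5 * y[:, None] + rng.normal(size=(40, 3))
noise = rng.normal(size=(40, 3))

kernels = [rbf_kernel(informative, gamma=0.1), rbf_kernel(noise, gamma=0.1)]
weights = kernel_weights(kernels, y)
K = combine(kernels, weights)
print("kernel weights:", weights)
```

The combined matrix `K` could then be passed to any kernel classifier that accepts a precomputed kernel. In a genuine MKL method the weights are typically learned jointly with the classifier rather than fixed beforehand as they are here.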
We later improved the method a little:
• Mark Rogers, Hashem Shihab, Tom Gaunt, Matthew Mort, David Cooper, and Colin Campbell. Sequential Data Selection for Predicting the Pathogenic Effects of Sequence Variation. Proceedings, 2015 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2015, B394).
FATHMM-MKL has been found to be state-of-the-art in comparative surveys by other groups. We have also devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome. Peaks near unity in the depicted plots indicate probable pathogenic SNVs and values near zero indicate neutral SNVs. Other prediction methods, e.g. CADD, are also presented, some as optional tracks. Reference:
• Hashem A. Shihab, Mark F. Rogers, Michael Ferlaino, Colin Campbell and Tom R. Gaunt. GTB - an online genome tolerance browser. BMC Bioinformatics 2017, 18:20. DOI: 10.1186/s12859-016-1436-4.
Subsequent to this we developed an indel predictor (for estimating the pathogenic effects of short insertions or deletions of genetic code). This predictor can handle indels in non-coding regions of the human genome:
• Michael Ferlaino, Mark F Rogers, Hashem A Shihab, Tom R Gaunt, Matthew Mort, David N Cooper, Colin Campbell. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. Journal submission.
A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. Thus we are devising a suite of predictors in the context of cancer under the generic title of CScape. Our first generic cancer predictor, CScape, uses a wide variety of data sources to predict whether a single nucleotide variant is potentially a disease-driver for cancer:
• Mark F. Rogers, Hashem A. Shihab, Tom R. Gaunt and Colin Campbell. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Journal submission.
Our baseline predictor appears more accurate than competitors and was based on data from COSMIC and up to 30 different types of genomic data sources. The method was benchmarked on independent data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), in addition to other databases. It is able to make predictions in both the coding and non-coding regions of the cancer genome, though it is much more accurate in coding regions. We furthermore introduced a confidence measure for the predicted class label. By restricting prediction to the highest-confidence instances, the resultant classifier can perform at approximately 90% test accuracy (in coding regions), though it achieves this level of accuracy at only a minority of nucleotide positions (about 17%). These high-confidence predicted potential disease-driver variants are typically clustered by location in the cancer genome, and the method highlights exons in 191 autosomal genes in which mutational change could act as a disease-driver.
Finally, we have developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:
• Hashem Shihab, Mark Rogers, Colin Campbell and Tom Gaunt. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics, https://doi.org/10.1093/bioinformatics/btx028 (2017).
The nuclei of many human cells are diploid: they contain two complete sets of chromosomes, one from each parent (in humans, germ cells are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of a gene product, leading to a disease trait.
*Added 16th February 2022*: the FATHMM-MKL tool developed in this research is an optional software filter for Ion Torrent Genome Sequencing machines (Google 'FATHMM scores Ion Torrent'). The method is the sole mutation impact predictor at COSMIC in Cambridge, the world's largest cancer genome archive (Google 'FATHMM-MKL cosmic' and 'cancer cosmic'). There are more than 100 papers benchmarking or analysing the method (search FATHMM-MKL and FATHMM in Google Scholar).
First Year Of Impact	2015
Sector	Healthcare
Impact Types	Societal
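
The trade-off described in the impact summary above, where restricting prediction to the highest-confidence instances raises accuracy while reducing the share of positions covered, can be sketched as follows. The confidence definition used here (distance of a probability-like score from 0.5) is an assumption and may differ from the measure used in CScape. The synthetic scores, cutoffs and function names are hypothetical.

```python
import numpy as np

def confident_mask(p_scores, cutoff):
    """True where the confidence in the predicted label is at least `cutoff`.

    Confidence is taken to be max(p, 1 - p), an assumed definition.
    """
    confidence = np.maximum(p_scores, 1.0 - p_scores)
    return confidence >= cutoff

def accuracy_and_coverage(p_scores, labels, cutoff):
    """Accuracy on the retained instances and the fraction of instances retained."""
    mask = confident_mask(p_scores, cutoff)
    predicted = (p_scores >= 0.5).astype(int)
    coverage = mask.mean()
    if mask.any():
        accuracy = (predicted[mask] == labels[mask]).mean()
    else:
        accuracy = float("nan")
    return accuracy, coverage

# Synthetic example (assumption): noisy scores that lean towards the true label.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=5000)
p_scores = np.clip(
    0.5 + 0.3 * (labels - 0.5) + rng.normal(scale=0.25, size=5000), 0.0, 1.0
)

acc_all, cov_all = accuracy_and_coverage(p_scores, labels, cutoff=0.5)
acc_hi, cov_hi = accuracy_and_coverage(p_scores, labels, cutoff=0.9)
print(f"all instances:   accuracy={acc_all:.3f}, coverage={cov_all:.3f}")
print(f"high confidence: accuracy={acc_hi:.3f}, coverage={cov_hi:.3f}")
```

Raising the cutoff trades coverage for accuracy, so a suitable value depends on how many positions a downstream user needs predictions for.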


Title	Sequence variant predictor (human disease)
Description	Available at http://fathmm.biocompute.org.uk.
Type Of Material	Model of mechanisms or symptoms - human
Year Produced	2014
Provided To Others?	Yes
Impact	Plugin for widely used software tools such as the Ensembl Variant Effect Predictor (VEP) and COSMIC (Sanger Centre)
URL	http://fathmm.biocompute.org.uk


Description	FATHMM-MKL adopted as software filter for Ion Torrent whole genome sequencers
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	FATHMM-MKL adopted as a software filter: https://ionreporter.thermofisher.com/ionreporter/help/GUID-4E7D0B57-D7E5-4BA2-977F-D84260BF23E5.html for Ion Torrent Genome Sequencing Machines: https://www.thermofisher.com/uk/en/home/brands/ion-torrent.html
Year(s) Of Engagement Activity	2017,2018,2019,2020,2021,2022,2023
URL	https://ionreporter.thermofisher.com/ionreporter/help/GUID-4E7D0B57-D7E5-4BA2-977F-D84260BF23E5.html


Description	Research resource
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	From 2015 to 2022, our method, FATHMM-MKL, was adopted as the only mutation impact predictor at COSMIC in Cambridge (Forbes, D. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, 45(D1):D777-D783, November 2016). COSMIC is by far the world's largest cancer genome archive. FATHMM-MKL predicted which single nucleotide variants in the human cancer genome are drivers of unregulated cell proliferation, with an associated probability score (SNVs are the most common driver mutation in cancer).
Year(s) Of Engagement Activity	2015,2016,2017,2018,2019,2020,2021,2022,2023
URL	https://cancer.sanger.ac.uk/cosmic/analyses
