Data mining and bioinformatics cross-cutting theme

Lead Research Organisation: University of Bristol

Abstract

The investigation of how molecules influence health and disease (molecular epidemiology) has benefited from rapid development of methods to collect molecular data. This has created many challenges in how to handle, analyse and interpret the resulting “big data”. For example, we hold data on up to 2.5 million genetic variants and 450,000 epigenetic measures collected from thousands of samples. There is also a wealth of publicly available data that has the potential to give us further insights into disease processes when combined with the new data we collect.
The data mining and bioinformatics theme aims to: (a) develop and apply methods to combine data from a wide range of sources; (b) use publicly available data to filter and prioritise our results; (c) develop and apply methods to summarise data to reduce the computing power required; (d) “mine” the resulting datasets to identify relationships between molecular, lifestyle and disease measures; and (e) train new and existing researchers in the methods required to work with “big data”.
This theme will build on core computational expertise in the key areas of molecular epidemiology, maximising the potential discoveries from the wealth of data generated by the programmes within the Unit.

Technical Summary

A core feature of the Unit will be the generation, storage and analysis of large volumes of epidemiological and “omics” (genomic, epigenomic, metabolomic and metagenomic) data from a range of cohort studies. The Data Mining and Bioinformatics theme will be responsible for the development of novel tools and strategies to access, integrate, merge, visualise and analyse the multi-dimensional datasets within and across the various studies involved in the Unit Programmes. In particular the application of data mining and machine learning algorithms will allow hypothesis-free approaches to data interrogation (Programme 1), which can be followed by hypothesis testing in the other Programmes. Bioinformatics will also underpin the genetic, epigenetic and omics association studies that will be undertaken in all of the Programmes.
Objectives and plans:
Integration of diverse data types, including large-scale “omics” datasets: integration of very high-dimensional omics data to facilitate data mining and specific hypothesis-based analyses.
Filtering and annotation of omics association results: integration of internally generated data with public data from projects such as ENCODE, the NIH roadmap epigenomics project and others to prioritise and filter results.
Developing approaches for combining variables in omics data: collapsing and dimensionality-reduction approaches will be developed and applied to reduce the computational burden of association analysis between high-dimensional omics datasets.
Development of methods for hypothesis-free mining of complex data: using allele-score based approaches for de novo identification of potential causal relationships across a wide range of phenotypes/outcomes.
Training in bioinformatics and data-mining: specific training will be developed to build in-house skills in bioinformatics and data mining methodology in the context of high-dimensional data.
This theme will further develop the existing core expertise in bioinformatics and data mining to enable Unit researchers to work effectively with the data generated by the Unit and collaborators. The new methods developed will be of broad applicability in the field of molecular epidemiology, and the Unit will train a new cadre of researchers with essential skills in medical bioinformatics.
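As a minimal illustration of the allele-score idea named in the objectives: an allele score summarises several genetic variants for one individual as a (weighted) sum of effect-allele dosages. The Python sketch below is not the Unit's implementation; the function name, dosages and weights are hypothetical and chosen only for illustration.

```python
def allele_score(dosages, weights=None):
    """Return a (weighted) allele score for one individual.

    dosages: copies of the effect allele at each variant (0, 1 or 2).
    weights: per-variant effect sizes; all 1.0 (unweighted) if omitted.
    """
    if weights is None:
        weights = [1.0] * len(dosages)
    if len(weights) != len(dosages):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical individual genotyped at three variants.
person = [0, 1, 2]
unweighted = allele_score(person)                     # counts effect alleles
weighted = allele_score(person, [0.5, 0.25, 0.125])   # made-up weights
```

In a hypothesis-free setting, a score of this kind would typically be tested for association against many outcomes in turn.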


 
Description Research Grant
Amount £349,099 (GBP)
Organisation GlaxoSmithKline (GSK) 
Sector Private
Country Global
Start 02/2017 
End 02/2020
 
Description Research Grant
Amount £117,059 (GBP)
Organisation CHDI Foundation 
Sector Charity/Non Profit
Country United States
Start 04/2017 
End 03/2019
 
Description Research Grant
Amount £349,099 (GBP)
Organisation Biogen 
Sector Private
Country United Kingdom
Start 08/2017 
End 08/2020
 
Description Stratified Medicine Initiative
Amount £2,500,000 (GBP)
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 04/2018 
End 03/2022
 
Description Vice chancellor fellowship
Amount £163,000 (GBP)
Organisation University of Bristol 
Sector Academic/University
Country United Kingdom
Start 10/2017 
End 09/2020
 
Title ARIES-Explorer 
Description ARIES-Explorer provides an openly accessible web interface to explore epigenome-wide methylation data from the Accessible Resource for Integrated Epigenomics Studies 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact Cited in a key review in the field (Mill et al., Nature Reviews Genetics, volume 14, August 2013, p. 585). The site received 419 unique visitors in its first month of operation.
URL http://ariesepigenomics.org.uk/ariesexplorer
 
Title GoDMC mQTL database 
Description Database of methylation quantitative trait loci (mQTL) due to be openly released on publication of the GoDMC consortium paper. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? No  
Impact This is the largest mQTL analysis to date, providing genetic instruments for use in Mendelian randomization analyses of DNA methylation. 
URL http://mqtldb.godmc.org.uk/
 
Title IEU GWAS database 
Description Database of GWAS results underpinning the MR-Base and LD Hub web applications 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact The database forms the basis of the MR-Base and LD Hub web applications and is widely used by researchers within and outside the IEU 
URL http://www.mrbase.org/
 
Title Supporting data for "PhenoSpD: an integrated toolkit for phenotypic correlation estimation and multiple testing correction using GWAS summary statistics"
Description Identifying phenotypic correlations between complex traits and diseases can provide useful etiological insights. Restricted access to much individual-level phenotype data makes it difficult to estimate large-scale phenotypic correlation across the human phenome. Two state-of-the-art methods, metaCCA and LD score regression, provide an alternative approach to estimate phenotypic correlation using only genome-wide association study (GWAS) summary results.
Here, we present an integrated R toolkit, PhenoSpD, to 1) use LD score regression to estimate phenotypic correlations using GWAS summary statistics; and 2) utilize the estimated phenotypic correlations to inform correction of multiple testing for complex human traits using the spectral decomposition of matrices (SpD). The simulations suggest that 1) it is possible to identify non-independence of phenotypes using samples with partial overlap: as overlap decreases, the estimated phenotypic correlations will attenuate towards zero and multiple testing correction will be more stringent than in perfectly overlapping samples; and 2) in contrast to LD score regression, metaCCA will provide approximate genetic correlations rather than phenotypic correlation, which limits its application for multiple testing correction. In a case study, PhenoSpD using UK Biobank GWAS results suggested 399.6 independent tests among 487 human traits, which is close to the 352.4 independent tests estimated using true phenotypic correlation. We further applied PhenoSpD to an estimated 5618 pair-wise phenotypic correlations among 107 metabolites using GWAS summary statistics from Kettunen et al., and PhenoSpD suggested the equivalent of 33.5 independent tests for these metabolites.
PhenoSpD extends the use of summary level results, providing a simple and conservative way to reduce dimensionality for complex human traits using GWAS summary statistics. This is particularly valuable in the age of large-scale biobank and consortia studies, where GWAS results are much more accessible than individual-level data.
R code and documentation for PhenoSpD V1.0.0 are available online at https://github.com/MRCIEU/PhenoSpD.
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
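To make the SpD step in the description above concrete, the Python sketch below computes an effective number of independent tests from a phenotypic correlation matrix using one widely cited formula (Nyholt, 2004): Meff = 1 + (M - 1)(1 - Var(λ)/M), where λ are the eigenvalues of the M x M correlation matrix. This is an assumption-laden illustration, not PhenoSpD's R code, and the toolkit may use a different SpD variant. The function names and example matrices are hypothetical. The sketch assumes a symmetric matrix with a unit diagonal, which lets the eigenvalue variance be obtained from the squared entries without an eigendecomposition.

```python
def effective_tests(corr):
    """Effective number of independent tests (Nyholt 2004 formula).

    Meff = 1 + (M - 1) * (1 - Var(eigenvalues) / M)

    Assumes corr is a symmetric correlation matrix with ones on the
    diagonal. Then the eigenvalues sum to M and their squares sum to the
    sum of all squared entries, so their sample variance needs no
    eigendecomposition.
    """
    m = len(corr)
    if m == 0 or any(len(row) != m for row in corr):
        raise ValueError("corr must be a non-empty square matrix")
    if m == 1:
        return 1.0
    sum_sq = sum(r * r for row in corr for r in row)  # sum of eigenvalue^2
    var_eig = (sum_sq - m) / (m - 1)                  # sample variance, mean 1
    return 1 + (m - 1) * (1 - var_eig / m)


def corrected_alpha(alpha, corr):
    """Bonferroni-style threshold using the effective number of tests."""
    return alpha / effective_tests(corr)


# Hypothetical correlation matrices.
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identical = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
partial = [[1.0, 0.5], [0.5, 1.0]]
```

Three uncorrelated traits count as three tests, three perfectly correlated traits as one, and partially correlated traits fall in between, so the corrected threshold is less severe than a plain Bonferroni correction over all traits.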
 
Title mQTLdb 
Description mQTLdb is a database of methylation QTL, initially set up for methylation QTL from the ALSPAC-ARIES project. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact The database is being used by researchers to identify methylation QTL for Mendelian randomization analyses 
URL http://www.mqtldb.org/
 
Description Genetics of DNA Methylation Consortium 
Organisation CeMM Research Center for Molecular Medicine
Country Austria 
Sector Academic/University 
PI Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: database of methylation QTL (http://mqtldb.godmc.org.uk/). Publication pending.
Start Year 2013
 
Description Genetics of DNA Methylation Consortium 
Organisation King's College London
Department Brain Bank
Country United Kingdom 
Sector Academic/University 
PI Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: database of methylation QTL (http://mqtldb.godmc.org.uk/). Publication pending.
Start Year 2013
 
Description Genetics of DNA Methylation Consortium 
Organisation Leiden University Medical Center
Country Netherlands 
Sector Academic/University 
PI Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: database of methylation QTL (http://mqtldb.godmc.org.uk/). Publication pending.
Start Year 2013
 
Description Genetics of DNA Methylation Consortium 
Organisation Newcastle University
Country United Kingdom 
Sector Academic/University 
PI Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: database of methylation QTL (http://mqtldb.godmc.org.uk/). Publication pending.
Start Year 2013
 
Description Genetics of DNA Methylation Consortium 
Organisation University of Bristol
Country United Kingdom 
Sector Academic/University 
PI Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: database of methylation QTL (http://mqtldb.godmc.org.uk/). Publication pending.
Start Year 2013
 
Description Genetics of DNA Methylation Consortium 
Organisation University of Exeter
Country United Kingdom 
Sector Academic/University 
PI Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: database of methylation QTL (http://mqtldb.godmc.org.uk/). Publication pending.
Start Year 2013
 
Description MR-Base collaboration 
Organisation Biogen
Country United Kingdom 
Sector Private 
PI Contribution We are collaborating with GlaxoSmithKline and Biogen on the further development and enhancement of the MR-Base platform, with a particular focus on the evaluation of potential drug targets.
Collaborator Contribution The industry partners are providing scientific input on the project and advising on how to maximise the translational value of the MR-Base platform.
Impact Outputs/outcomes: * expansion of the database underlying MR-Base. Papers: * Baird DA, Liu JZ, Zheng J, Sieberts SK, Perumal T, Elsworth B, Richardson TG... AMP-AD eQTL working group. (2021). Identifying drug targets for neurological and psychiatric disease via genetics and the brain transcriptome. PLoS Genetics, 17 (1), e1009224 * Zheng J, Haberland V, Baird D, Walker V, Haycock PC, Hurle MR, Gutteridge A... Gaunt TR. (2020). Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nature Genetics, 52 (10), pp. 1122-1131
Start Year 2017
 
Description MR-Base collaboration 
Organisation GlaxoSmithKline (GSK)
Country Global 
Sector Private 
PI Contribution We are collaborating with GlaxoSmithKline and Biogen on the further development and enhancement of the MR-Base platform, with a particular focus on the evaluation of potential drug targets.
Collaborator Contribution The industry partners are providing scientific input on the project and advising on how to maximise the translational value of the MR-Base platform.
Impact Outputs/outcomes: * expansion of the database underlying MR-Base. Papers: * Baird DA, Liu JZ, Zheng J, Sieberts SK, Perumal T, Elsworth B, Richardson TG... AMP-AD eQTL working group. (2021). Identifying drug targets for neurological and psychiatric disease via genetics and the brain transcriptome. PLoS Genetics, 17 (1), e1009224 * Zheng J, Haberland V, Baird D, Walker V, Haycock PC, Hurle MR, Gutteridge A... Gaunt TR. (2020). Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nature Genetics, 52 (10), pp. 1122-1131
Start Year 2017
 
Title ARIES-Explorer 
Description Novel web interface to enable browsing of epigenomic data. 
Type Of Technology Webtool/Application 
Year Produced 2013 
Impact Implemented with data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
URL http://ariesepigenomics.org.uk/ariesexplorer
 
Title CScape 
Description CScape predicts the oncogenic status (disease-driver or neutral) of somatic point mutations in the coding and non-coding regions of the cancer genome. 
Type Of Technology Webtool/Application 
Year Produced 2017 
Open Source License? Yes  
Impact None yet 
URL http://cscape.biocompute.org.uk/
 
Title FATHMM 
Description Our software and server are capable of predicting the functional effects of protein missense mutations by combining sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved protein domains, with "pathogenicity weights", representing the overall tolerance of the protein/domain to mutations.
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact The software has been implemented by COSMIC (Catalogue Of Somatic Mutations In Cancer) and as an add-in for the widely used ANNOVAR tool. Three publications describe different variants of the algorithm: (1) Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, Day INM, Gaunt TR. (2013). Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat., 34:57-65. (2) Shihab HA, Gough J, Cooper DN, Day INM, Gaunt TR. (2013). Predicting the Functional Consequences of Cancer-Associated Amino Acid Substitutions. Bioinformatics, 29:1504-1510. (3) Shihab HA, Gough J, Mort M, Cooper DN, Day INM, Gaunt TR. (2014). Ranking Non-Synonymous Single Nucleotide Polymorphisms based on Disease Concepts. Human Genomics, 8:11.
URL http://fathmm.biocompute.org.uk/
 
Title FATHMM-MKL 
Description FATHMM-MKL provides a high-throughput web-server capable of predicting the functional consequences of non-coding variants. Our MKL algorithm integrates functional annotations from ENCODE with nucleotide-based HMMs. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact The FATHMM-MKL algorithm has been adopted by dbNSFP and as a plugin for the European Bioinformatics Institute's Variant Effect Predictor, and is thus widely available to researchers predicting the functional effects of genetic variants.
URL http://fathmm.biocompute.org.uk/fathmmMKL.htm
 
Title FSMKL 
Description The software provides multiple-kernel learning (MKL) with feature selection, and has been applied by us in the context of predicting cancer outcomes using combinations of molecular, pathway and clinical information. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact Published in Bioinformatics (Bioinformatics. 2014 Mar 15;30(6):838-45. doi: 10.1093/bioinformatics/btt610) 
URL https://github.com/jseoane/FSMKL
 
Title GTB 
Description The Genome Tolerance Browser (GTB) provides a browsable and searchable view of the predicted effect of genetic variation across the genome using predictions from a range of published algorithms 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact The Genome Tolerance Browser has been published 
URL http://gtb.biocompute.org.uk/
 
Title HIPred 
Description The HIPred haploinsufficiency predictor aims to predict whether a gene will function effectively in single copy (i.e. in the presence of a heterozygous deletion or heterozygous nonsense mutation).
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The software has been published 
URL https://github.com/HAShihab/HIPred
 
Title LD Hub 
Description LD Hub is a web application that enables estimation of heritability of one trait and analysis of genetic correlation between pairs of traits using LD Score regression 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact The web application is being widely used by researchers around the world 
URL http://ldsc.broadinstitute.org/
 
Title MELODI 
Description MELODI is a web application built on a Neo4j graph database that identifies mechanistic pathways between risk factors and disease outcomes using text from the scientific literature
Type Of Technology Webtool/Application 
Year Produced 2017 
Impact MELODI is being used by a number of researchers to identify mechanisms underpinning causal relationships between risk factors and disease 
URL http://melodi.biocompute.org.uk/
 
Title MR-Base 
Description MR-Base is a web application and R package that provides a range of methods for two-sample Mendelian randomization and is designed to be used with the IEU GWAS database
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact MR-Base is being widely used by researchers to perform two-sample MR
URL http://www.mrbase.org/
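For readers unfamiliar with two-sample Mendelian randomization, the Python sketch below shows two standard summary-data estimators: the Wald ratio for a single variant and the fixed-effect inverse-variance weighted (IVW) estimate across several variants. It is an illustrative sketch only and does not reproduce MR-Base's code or API; the function names and all effect sizes are hypothetical.

```python
import math


def wald_ratio(beta_exposure, beta_outcome):
    """Causal estimate from one variant: outcome effect / exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its standard error.

    Each variant's Wald ratio is weighted by beta_exposure^2 / se_outcome^2,
    which simplifies to the two sums below.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)


# Hypothetical summary statistics for two variants (made-up numbers).
bx = [0.1, 0.2]     # variant-exposure effects, from the exposure GWAS
by = [0.05, 0.1]    # variant-outcome effects, from the outcome GWAS
se = [0.01, 0.02]   # standard errors of the variant-outcome effects
estimate, std_error = ivw_estimate(bx, by, se)
```

Both variants here have the same Wald ratio, so the IVW estimate equals that shared ratio; with real data the ratios differ and the weights determine how much each variant contributes.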
 
Title MRC IEU UK Biobank GWAS pipeline 
Description Genome-wide association study (GWAS) pipeline developed by the MRC-IEU for the full UK Biobank (July 2017) genetic data.
Type Of Technology Software 
Year Produced 2017 
Impact The aim was to create a GWAS method that is directly accessible to all researchers. The pipeline has been used directly by 14 researchers within the MRC Integrative Epidemiology Unit (IEU) at the University of Bristol and has performed over 2000 GWAS, with all output available to the MRC IEU.
 
Title TeMMPo 
Description TeMMPo (Text Mining for Mechanism Prioritisation) is a web-based tool to enable researchers to identify the quantity of published evidence for specific mechanisms between an exposure and outcome. The tool identifies co-occurrence of MeSH headings in scientific publications to indicate papers that link an intermediate mechanism to either an exposure or an outcome. TeMMPo is particularly useful when a specific lifestyle or dietary exposure is known to associate with a disease outcome, but little is known about the underlying mechanisms. Understanding these mechanisms may help develop interventions, sub-classify disease or establish evidence for causality. TeMMPo quantifies the body of published literature to establish which mechanisms have been researched the most, enabling these mechanisms to be subjected to systematic review. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact The TeMMPo tool forms part of a protocol for systematic review of mechanistic studies developed by the University of Bristol with funding from the World Cancer Research Fund (WCRF), and will be used by the WCRF for their "continuous update project" which provides key summary data on risk factors for cancer. 
URL https://www.temmpo.org.uk/
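The co-occurrence idea in the description above can be sketched in a few lines of Python: for each candidate mechanism, count the publications whose MeSH headings include both the mechanism and the exposure, and those that include both the mechanism and the outcome. This is a simplified illustration under stated assumptions, not TeMMPo's implementation; the function name and the example annotations are hypothetical.

```python
def mechanism_evidence(papers, exposure, outcome, mechanisms):
    """Count papers linking each mechanism to the exposure or the outcome.

    papers: list of sets of MeSH headings, one set per publication.
    Returns {mechanism: (n_exposure_links, n_outcome_links)}.
    """
    counts = {}
    for mech in mechanisms:
        exp_links = sum(1 for terms in papers
                        if mech in terms and exposure in terms)
        out_links = sum(1 for terms in papers
                        if mech in terms and outcome in terms)
        counts[mech] = (exp_links, out_links)
    return counts


# Hypothetical MeSH annotations for four publications.
papers = [
    {"Milk", "Insulin-Like Growth Factor I"},
    {"Insulin-Like Growth Factor I", "Prostatic Neoplasms"},
    {"Milk", "Vitamin D"},
    {"Milk", "Insulin-Like Growth Factor I", "Prostatic Neoplasms"},
]
evidence = mechanism_evidence(
    papers, "Milk", "Prostatic Neoplasms",
    ["Insulin-Like Growth Factor I", "Vitamin D"],
)
```

Mechanisms with many links on both sides are the ones with the largest body of published evidence, and so are natural candidates for systematic review.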
 
Description GW4 Data Science Talk - T Gaunt 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Tom Gaunt was invited to present at the GW4 Data Science Meeting in Bristol. Dr Gaunt gave a talk on Informatics in Epidemiology, raising the profile of his research.
Year(s) Of Engagement Activity 2017
 
Description Invited Seminar at University of Southampton (T Gaunt) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Tom Gaunt was invited to give a bioinformatics seminar at the University of Southampton to a higher education audience. His talk, entitled "Automating causal inference in epidemiology", raised the profile of Dr Gaunt and his research.
Year(s) Of Engagement Activity 2017
 
Description MR-Base user workshop at Mendelian Randomization Conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Half-day workshop to disseminate the MR-Base platform to researchers and industry. Approximately 60 participants.
Year(s) Of Engagement Activity 2017
URL https://www.mendelianrandomization.org.uk/mr-base-user-workshop/