Data mining and bioinformatics cross-cutting theme
Lead Research Organisation:
University of Bristol
Abstract
The investigation of how molecules influence health and disease (molecular epidemiology) has benefited from rapid development of methods to collect molecular data. This has created many challenges in how to handle, analyse and interpret the resulting “big data”. For example, we hold data on up to 2.5 million genetic variants and 450,000 epigenetic measures collected from thousands of samples. There is also a wealth of publicly available data that has the potential to give us further insights into disease processes when combined with the new data we collect.
The data mining and bioinformatics theme aims to: (a) develop and apply methods to combine data from a wide range of sources; (b) use publicly available data to filter and prioritise our results; (c) develop and apply methods to summarise data to reduce the computing power required; (d) “mine” the resulting datasets to identify relationships between molecular, lifestyle and disease measures; (e) train new and existing researchers in the methods required to work with “big data”.
This theme will build on core computational expertise in the key areas of molecular epidemiology, maximising the potential discoveries from the wealth of data generated by the programmes within the Unit.
Technical Summary
A core feature of the Unit will be the generation, storage and analysis of large volumes of epidemiological and “omics” (genomic, epigenomic, metabolomic and metagenomic) data from a range of cohort studies. The Data Mining and Bioinformatics theme will be responsible for the development of novel tools and strategies to access, integrate, merge, visualise and analyse the multi-dimensional datasets within and across the various studies involved in the Unit Programmes. In particular the application of data mining and machine learning algorithms will allow hypothesis-free approaches to data interrogation (Programme 1), which can be followed by hypothesis testing in the other Programmes. Bioinformatics will also underpin the genetic, epigenetic and omics association studies that will be undertaken in all of the Programmes.
Objectives and plans:
Integration of diverse data types, including large-scale “omics” datasets: integration of very high-dimensional omics data to facilitate data mining and specific hypothesis-based analyses.
Filtering and annotation of omics association results: integration of internally generated data with public data from projects such as ENCODE, the NIH roadmap epigenomics project and others to prioritise and filter results.
Developing approaches for combining variables in omics data: collapsing and dimensionality-reduction approaches will be developed and applied to reduce the computational burden of association analysis between high-dimensional omics datasets.
Development of methods for hypothesis-free mining of complex data: using allele-score based approaches to de novo identification of potential causal relationships across a wide range of phenotypes/outcomes.
Training in bioinformatics and data-mining: specific training will be developed to build in-house skills in bioinformatics and data mining methodology in the context of high-dimensional data.
This theme will further develop the existing core expertise in bioinformatics and data mining to enable Unit researchers to work effectively with the data generated by the Unit and collaborators. The new methods developed will be of broad applicability in the field of molecular epidemiology, and the Unit will train a new cadre of researchers with essential skills in medical bioinformatics.
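As a rough illustration of the allele-score idea named in the objectives above, the sketch below computes a per-individual score as a weighted sum of allele counts. The function name `allele_scores`, the 0/1/2 genotype coding and the example weights are assumptions made for illustration; they are not taken from the Unit's own pipelines.

```python
from typing import List, Optional, Sequence


def allele_scores(
    genotypes: Sequence[Sequence[int]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return one allele score per individual.

    genotypes: one row per individual, one column per variant; each entry is
               the count (0, 1 or 2) of the trait-raising allele (assumed coding).
    weights:   optional per-variant weights (for example published effect
               sizes); when omitted every variant gets weight 1.
    """
    if not genotypes:
        return []
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants
    if len(weights) != n_variants:
        raise ValueError("need exactly one weight per variant")
    scores = []
    for row in genotypes:
        if len(row) != n_variants:
            raise ValueError("all individuals must have the same variants")
        # Weighted sum of allele counts for this individual.
        scores.append(float(sum(w * g for w, g in zip(weights, row))))
    return scores


# Hypothetical data: two individuals genotyped at three variants.
example = [[0, 1, 2], [2, 2, 0]]
print(allele_scores(example))                   # unweighted score
print(allele_scores(example, [0.5, 1.0, 2.0]))  # weighted score
```

A score built this way can then be tested for association with many outcomes in turn, which is the sense in which the approach is hypothesis-free.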
Organisations
- University of Bristol, United Kingdom (Collaboration, Lead Research Organisation)
- Biogen (Collaboration)
- CeMM Research Center for Molecular Medicine (Collaboration)
- GlaxoSmithKline (GSK) (Collaboration)
- Newcastle University, United Kingdom (Collaboration)
- Leiden University Medical Center (Collaboration)
- University of Exeter, United Kingdom (Collaboration)
- King's College London, United Kingdom (Collaboration)
People |
ORCID iD |
Peter Flach (Co-Investigator) |
Publications

Alharbi KK
(2014)
Influence of adiposity-related genetic markers in a population of Saudi Arabians where other variables influencing obesity may be reduced.
in Disease markers

Alsaadi MM
(2014)
Nonsense mutation in coiled-coil domain containing 151 gene (CCDC151) causes primary ciliary dyskinesia.
in Human mutation

Arathimos R
(2017)
Epigenome-wide association study of asthma and wheeze in childhood and adolescence.
in Clinical epigenetics


Barker ED
(2018)
Inflammation-related epigenetic risk and child and adolescent mental health: A prospective study from pregnancy to middle adolescence.
in Development and psychopathology

Beaney KE
(2017)
Functional Analysis of the Coronary Heart Disease Risk Locus on Chromosome 21q22.
in Disease markers

Bonilla C
(2021)
Investigating DNA methylation as a potential mediator between pigmentation genes, pigmentary traits and skin cancer.
in Pigment cell & melanoma research

Bonilla C
(2016)
Assessing the role of insulin-like growth factors and binding proteins in prostate cancer using Mendelian randomization: Genetic variants as instruments for circulating levels.
in International journal of cancer

Borges M
(2020)
Circulating Fatty Acids and Risk of Coronary Heart Disease and Stroke: Individual Participant Data Meta-Analysis in Up to 16 126 Participants
in Journal of the American Heart Association

Bright HD
(2019)
Epigenetic gestational age and trajectories of weight and height during childhood: a prospective cohort study.
in Clinical epigenetics
Description | Research Grant |
Amount | £349,099 (GBP) |
Organisation | GlaxoSmithKline (GSK) |
Sector | Private |
Country | Global |
Start | 02/2017 |
End | 02/2020 |
Description | Research Grant |
Amount | £349,099 (GBP) |
Organisation | Biogen |
Sector | Private |
Country | United Kingdom |
Start | 07/2017 |
End | 08/2020 |
Description | Research Grant |
Amount | £117,059 (GBP) |
Organisation | CHDI Foundation |
Sector | Charity/Non Profit |
Country | United States |
Start | 03/2017 |
End | 03/2019 |
Description | Stratified Medicine Initiative |
Amount | £2,500,000 (GBP) |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2018 |
End | 03/2022 |
Description | Vice chancellor fellowship |
Amount | £163,000 (GBP) |
Organisation | University of Bristol |
Sector | Academic/University |
Country | United Kingdom |
Start | 09/2017 |
End | 09/2020 |
Title | ARIES-Explorer |
Description | ARIES-Explorer provides an openly accessible web interface to explore epigenome-wide methylation data from the Accessible Resource for Integrated Epigenomics Studies |
Type Of Material | Database/Collection of data |
Year Produced | 2013 |
Provided To Others? | Yes |
Impact | Cited in a key review in the field (Mill et al., Nature Reviews Genetics, volume 14, August 2013, p. 585). 419 unique visitors to the site in its first month of operation. |
URL | http://ariesepigenomics.org.uk/ariesexplorer |
Title | GoDMC mQTL database |
Description | Database of methylation quantitative trait loci (mQTL) due to be openly released on publication of the GoDMC consortium paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | No |
Impact | This is the largest mQTL analysis to date, providing genetic instruments for use in Mendelian randomization analyses of DNA methylation. |
URL | http://mqtldb.godmc.org.uk/ |
Title | IEU OpenGWAS database |
Description | Database of GWAS results underpinning the MR-Base and LD Hub web applications |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | The database forms the basis of the MR-Base and LD Hub web applications and is widely used by researchers within and outside the IEU |
URL | http://gwas.mrcieu.ac.uk |
Title | Supporting data for "PhenoSpD: an integrated toolkit for phenotypic correlation estimation and multiple testing correction using GWAS summary statistics" |
Description | Identifying phenotypic correlations between complex traits and diseases can provide useful etiological insights. Restricted access to much individual-level phenotype data makes it difficult to estimate large-scale phenotypic correlation across the human phenome. Two state-of-the-art methods, metaCCA and LD score regression, provide an alternative approach to estimate phenotypic correlation using only genome-wide association study (GWAS) summary results.
Here, we present an integrated R toolkit, PhenoSpD, to 1) use LD score regression to estimate phenotypic correlations using GWAS summary statistics; and 2) utilize the estimated phenotypic correlations to inform correction of multiple testing for complex human traits using the spectral decomposition of matrices (SpD). The simulations suggest 1) it is possible to identify non-independence of phenotypes using samples with partial overlap: as overlap decreases, the estimated phenotypic correlations will attenuate towards zero and multiple testing correction will be more stringent than in perfectly overlapping samples; 2) in contrast to LD score regression, metaCCA will provide approximate genetic correlations rather than phenotypic correlation, which limits its application for multiple testing correction. In a case study, PhenoSpD using UK Biobank GWAS results suggested 399.6 independent tests among 487 human traits, which is close to the 352.4 independent tests estimated using true phenotypic correlation. We further applied PhenoSpD to an estimated 5618 pair-wise phenotypic correlations among 107 metabolites using GWAS summary statistics from Kettunen et al. and PhenoSpD suggested the equivalent of 33.5 independent tests for these metabolites. PhenoSpD extends the use of summary level results, providing a simple and conservative way to reduce dimensionality for complex human traits using GWAS summary statistics. This is particularly valuable in the age of large-scale biobank and consortia studies, where GWAS results are much more accessible than individual-level data. R code and documentation for PhenoSpD V1.0.0 are available online at https://github.com/MRCIEU/PhenoSpD. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
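The multiple-testing step described in this record rests on the spectral decomposition of a correlation matrix. One widely cited estimator of this kind is Nyholt's (2004) effective number of independent tests, M_eff = 1 + (M - 1)(1 - Var(λ)/M), where λ are the eigenvalues of the M x M correlation matrix. Whether PhenoSpD applies exactly this formula is an assumption here, and the sketch below is not PhenoSpD's code. It relies on two facts about a symmetric correlation matrix with unit diagonal: the eigenvalues sum to M, and their squares sum to the sum of all squared entries, so no eigen-solver is needed.

```python
from typing import Sequence


def effective_tests_nyholt(corr: Sequence[Sequence[float]]) -> float:
    """Nyholt-style effective number of independent tests.

    corr is assumed to be a symmetric correlation matrix with 1s on the
    diagonal (hypothetical input format: a list of rows).
    """
    m = len(corr)
    if m == 0 or any(len(row) != m for row in corr):
        raise ValueError("corr must be a non-empty square matrix")
    if m == 1:
        return 1.0
    # Sum of squared eigenvalues equals the sum of squared matrix entries.
    sum_sq_eigen = sum(v * v for row in corr for v in row)
    # Sample variance of the eigenvalues; their mean is 1 because trace = m.
    var_eigen = (sum_sq_eigen - m) / (m - 1)
    return 1.0 + (m - 1) * (1.0 - var_eigen / m)


# Hypothetical correlation matrix for three traits.
corr = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
m_eff = effective_tests_nyholt(corr)
print("effective tests:", m_eff)
print("adjusted threshold:", 0.05 / m_eff)  # Bonferroni over M_eff tests
```

With uncorrelated traits the estimate equals the number of traits, and with perfectly correlated traits it falls to 1, which is the behaviour the abstract describes when it reports fewer independent tests than traits.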
Title | mQTLdb |
Description | mQTLdb is a database of methylation QTL, initially set up for methylation QTL from the ALSPAC-ARIES project. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | The database is being used by researchers to identify methylation QTL for Mendelian randomization analyses |
URL | http://www.mqtldb.org/ |
Description | Genetics of DNA Methylation Consortium |
Organisation | CeMM Research Center for Molecular Medicine |
Country | Austria |
Sector | Academic/University |
PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
Start Year | 2013 |
Description | Genetics of DNA Methylation Consortium |
Organisation | King's College London |
Department | Brain Bank |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
Start Year | 2013 |
Description | Genetics of DNA Methylation Consortium |
Organisation | Leiden University Medical Center |
Country | Netherlands |
Sector | Academic/University |
PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
Start Year | 2013 |
Description | Genetics of DNA Methylation Consortium |
Organisation | Newcastle University |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
Start Year | 2013 |
Description | Genetics of DNA Methylation Consortium |
Organisation | University of Bristol |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
Start Year | 2013 |
Description | Genetics of DNA Methylation Consortium |
Organisation | University of Exeter |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
Start Year | 2013 |
Description | MR-Base collaboration |
Organisation | Biogen |
Country | United Kingdom |
Sector | Private |
PI Contribution | We are collaborating with GlaxoSmithKline and Biogen on the further development and enhancement of the MR-Base platform, with a particular focus on the evaluation of potential drug targets. |
Collaborator Contribution | The industry partners are providing scientific input on the project and advising on how to maximise translational value of the MR-Base platform. |
Impact | Outputs/outcomes: * expansion of the database underlying MR-Base. Papers: * Baird DA, Liu JZ, Zheng J, Sieberts SK, Perumal T, Elsworth B, Richardson TG... AMP-AD eQTL working group. (2021). Identifying drug targets for neurological and psychiatric disease via genetics and the brain transcriptome. PLoS genetics, 17 (1), pp. e1009224 * Zheng J, Haberland V, Baird D, Walker V, Haycock PC, Hurle MR, Gutteridge A... Gaunt TR. (2020). Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nature genetics, 52 (10), pp. 1122-1131 |
Start Year | 2017 |
Description | MR-Base collaboration |
Organisation | GlaxoSmithKline (GSK) |
Country | Global |
Sector | Private |
PI Contribution | We are collaborating with GlaxoSmithKline and Biogen on the further development and enhancement of the MR-Base platform, with a particular focus on the evaluation of potential drug targets. |
Collaborator Contribution | The industry partners are providing scientific input on the project and advising on how to maximise translational value of the MR-Base platform. |
Impact | Outputs/outcomes: * expansion of the database underlying MR-Base. Papers: * Baird DA, Liu JZ, Zheng J, Sieberts SK, Perumal T, Elsworth B, Richardson TG... AMP-AD eQTL working group. (2021). Identifying drug targets for neurological and psychiatric disease via genetics and the brain transcriptome. PLoS genetics, 17 (1), pp. e1009224 * Zheng J, Haberland V, Baird D, Walker V, Haycock PC, Hurle MR, Gutteridge A... Gaunt TR. (2020). Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nature genetics, 52 (10), pp. 1122-1131 |
Start Year | 2017 |
Title | ARIES-Explorer |
Description | Novel web interface to enable browsing of epigenomic data. |
Type Of Technology | Webtool/Application |
Year Produced | 2013 |
Impact | Implementation with data from the Accessible Resource for Integrated Epigenomics Studies |
URL | http://ariesepigenomics.org.uk/ariesexplorer |
Title | CScape |
Description | CScape predicts the oncogenic status (disease-driver or neutral) of somatic point mutations in the coding and non-coding regions of the cancer genome. |
Type Of Technology | Webtool/Application |
Year Produced | 2017 |
Open Source License? | Yes |
Impact | None yet |
URL | http://cscape.biocompute.org.uk/ |
Title | FATHMM |
Description | Our software and server are capable of predicting the functional effects of protein missense mutations by combining sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved protein domains, with "pathogenicity weights", representing the overall tolerance of the protein/domain to mutations. |
Type Of Technology | Software |
Year Produced | 2013 |
Open Source License? | Yes |
Impact | The software has been implemented by COSMIC (Catalogue of somatic mutations in cancer) and as an add-in for the widely used ANNOVAR tool. Three publications with different variants of the algorithm: Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, Day INM, Gaunt, TR. (2013). Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat., 34:57-65 Shihab HA, Gough J, Cooper DN, Day INM, Gaunt, TR. (2013). Predicting the Functional Consequences of Cancer-Associated Amino Acid Substitutions. Bioinformatics 29:1504-1510. Shihab HA, Gough J, Mort M, Cooper DN, Day INM, Gaunt, TR. (2014). Ranking Non-Synonymous Single Nucleotide Polymorphisms based on Disease Concepts. Human Genomics, 8:11 |
URL | http://fathmm.biocompute.org.uk/ |
Title | FATHMM-MKL |
Description | FATHMM-MKL provides a high-throughput web-server capable of predicting the functional consequences of non-coding variants. Our MKL algorithm integrates functional annotations from ENCODE with nucleotide-based HMMs. |
Type Of Technology | Webtool/Application |
Year Produced | 2015 |
Impact | The FATHMM-MKL algorithm has been adopted by dbNSFP and as a plugin for the European Bioinformatics Institute's Variant Effect Predictor, and is thus widely available to researchers predicting the functional effects of genetic variants. |
URL | http://fathmm.biocompute.org.uk/fathmmMKL.htm |
Title | FSMKL |
Description | The software provides multiple-kernel learning (MKL) with feature selection, and has been applied by us in the context of predicting cancer outcomes using combinations of molecular, pathway and clinical information. |
Type Of Technology | Software |
Year Produced | 2013 |
Open Source License? | Yes |
Impact | Published in Bioinformatics (Bioinformatics. 2014 Mar 15;30(6):838-45. doi: 10.1093/bioinformatics/btt610) |
URL | https://github.com/jseoane/FSMKL |
Title | GTB |
Description | The Genome Tolerance Browser (GTB) provides a browsable and searchable view of the predicted effect of genetic variation across the genome using predictions from a range of published algorithms |
Type Of Technology | Webtool/Application |
Year Produced | 2016 |
Impact | The Genome Tolerance Browser has been published |
URL | http://gtb.biocompute.org.uk/ |
Title | HIPred |
Description | The HIPred haploinsufficiency predictor aims to predict whether a gene will function effectively in single copy (i.e. in the presence of a heterozygous deletion or heterozygous nonsense mutation). |
Type Of Technology | Software |
Year Produced | 2017 |
Open Source License? | Yes |
Impact | The software has been published |
URL | https://github.com/HAShihab/HIPred |
Title | LD Hub |
Description | LD Hub is a web application that enables estimation of heritability of one trait and analysis of genetic correlation between pairs of traits using LD Score regression |
Type Of Technology | Webtool/Application |
Year Produced | 2016 |
Impact | The web application is being widely used by researchers around the world |
URL | http://ldsc.broadinstitute.org/ |
Title | MELODI |
Description | MELODI is a web application built on a Neo4J graph database that identifies mechanistic pathways between risk factors and disease outcomes using text from the scientific literature |
Type Of Technology | Webtool/Application |
Year Produced | 2017 |
Impact | MELODI is being used by a number of researchers to identify mechanisms underpinning causal relationships between risk factors and disease |
URL | http://melodi.biocompute.org.uk/ |
Title | MR-Base |
Description | MR-Base is a web application and R package providing a range of different methods for two-sample Mendelian randomization, and designed to be used with the IEU GWAS database |
Type Of Technology | Webtool/Application |
Year Produced | 2016 |
Impact | MR-Base is being widely used by researchers to perform two-sample MR |
URL | http://www.mrbase.org/ |
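For readers unfamiliar with two-sample Mendelian randomization, the sketch below shows one standard estimator, the fixed-effect inverse-variance weighted (IVW) estimate built from per-variant Wald ratios. It is an illustration only and not MR-Base's own code; the function names and the summary statistics are hypothetical, and MR-Base offers a range of other methods that are not shown.

```python
import math
from typing import List, Sequence, Tuple


def wald_ratios(beta_exposure: Sequence[float],
                beta_outcome: Sequence[float]) -> List[float]:
    """Per-variant causal estimates: outcome effect / exposure effect."""
    if len(beta_exposure) != len(beta_outcome):
        raise ValueError("inputs must have the same length")
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]


def ivw_estimate(beta_exposure: Sequence[float],
                 beta_outcome: Sequence[float],
                 se_outcome: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effect IVW estimate and its standard error.

    Weights are beta_exposure^2 / se_outcome^2 (first-order weights),
    which treats the variant-exposure effects as measured without error.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("inputs must have the same length")
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    if denominator == 0:
        raise ValueError("at least one variant must affect the exposure")
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for two genetic variants.
bx = [0.1, 0.2]      # variant-exposure effects (first sample)
by = [0.05, 0.1]     # variant-outcome effects (second sample)
se_y = [0.01, 0.01]  # standard errors of the variant-outcome effects
print(wald_ratios(bx, by))
print(ivw_estimate(bx, by, se_y))
```

The IVW estimate is a weighted average of the Wald ratios, so when every variant gives the same ratio, as in this toy input, the pooled estimate equals that ratio.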
Title | MRC IEU UK Biobank GWAS pipeline |
Description | Genome wide association study (GWAS) pipeline developed by the MRC-IEU for the full UK Biobank (July 2017) genetic data. |
Type Of Technology | Software |
Year Produced | 2017 |
Impact | The idea was to create a GWAS method that was directly accessible to all researchers. The pipeline has been used directly by 14 researchers within the MRC Integrative Epidemiology Unit (IEU) at the University of Bristol and has performed over 2000 GWAS with all output available to the MRC IEU. |
Title | TeMMPo |
Description | TeMMPo (Text Mining for Mechanism Prioritisation) is a web-based tool to enable researchers to identify the quantity of published evidence for specific mechanisms between an exposure and outcome. The tool identifies co-occurrence of MeSH headings in scientific publications to indicate papers that link an intermediate mechanism to either an exposure or an outcome. TeMMPo is particularly useful when a specific lifestyle or dietary exposure is known to associate with a disease outcome, but little is known about the underlying mechanisms. Understanding these mechanisms may help develop interventions, sub-classify disease or establish evidence for causality. TeMMPo quantifies the body of published literature to establish which mechanisms have been researched the most, enabling these mechanisms to be subjected to systematic review. |
Type Of Technology | Webtool/Application |
Year Produced | 2015 |
Impact | The TeMMPo tool forms part of a protocol for systematic review of mechanistic studies developed by the University of Bristol with funding from the World Cancer Research Fund (WCRF), and will be used by the WCRF for their "continuous update project" which provides key summary data on risk factors for cancer. |
URL | https://www.temmpo.org.uk/ |
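The co-occurrence idea in the TeMMPo description can be sketched as follows: for each candidate mechanism, count the papers whose MeSH headings pair it with the exposure, and the papers that pair it with the outcome. This is a simplified illustration under assumed inputs (each paper reduced to a set of heading strings); the function name, the example headings and the counting rule are hypothetical and do not reproduce TeMMPo's actual scoring.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple


def mechanism_evidence(
    papers: Iterable[Set[str]],
    exposure: str,
    outcome: str,
    mechanisms: Sequence[str],
) -> Dict[str, Tuple[int, int]]:
    """Map each mechanism to (papers linking it to the exposure,
    papers linking it to the outcome)."""
    counts = {m: [0, 0] for m in mechanisms}
    for headings in papers:
        for mechanism in counts:
            if mechanism not in headings:
                continue
            if exposure in headings:
                counts[mechanism][0] += 1
            if outcome in headings:
                counts[mechanism][1] += 1
    return {m: (c[0], c[1]) for m, c in counts.items()}


# Hypothetical MeSH headings for five papers.
papers = [
    {"Obesity", "Insulin"},
    {"Insulin", "Colorectal Neoplasms"},
    {"Obesity", "Inflammation"},
    {"Obesity", "Insulin", "Colorectal Neoplasms"},
    {"Inflammation"},
]
result = mechanism_evidence(
    papers, "Obesity", "Colorectal Neoplasms", ["Insulin", "Inflammation"]
)
for mechanism, (n_exposure, n_outcome) in result.items():
    print(mechanism, n_exposure, n_outcome)
```

Mechanisms with many papers on both sides are the ones with the largest body of literature, which is what makes them natural candidates for systematic review.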
Description | GW4 Data Science Talk - T Gaunt |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Tom Gaunt was invited to present at the GW4 Data Science Meeting in Bristol. Dr Gaunt gave a talk on Informatics in Epidemiology, raising the profile of his research. |
Year(s) Of Engagement Activity | 2017 |
Description | Invited Seminar at University of Southampton (T Gaunt) |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Postgraduate students |
Results and Impact | Tom Gaunt was invited to give a bioinformatics seminar at the University of Southampton to a higher education audience. His talk, entitled "Automating causal inference in epidemiology", raised the profile of Dr Gaunt and his research. |
Year(s) Of Engagement Activity | 2017 |
Description | MR-Base user workshop at Mendelian Randomization Conference |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | Half-day workshop to disseminate the MR-Base platform to researchers and industry. Approximately 60 participants. |
Year(s) Of Engagement Activity | 2017 |
URL | https://www.mendelianrandomization.org.uk/mr-base-user-workshop/ |