Data mining epidemiological relationships

Lead Research Organisation: University of Bristol

Department Name: UNLISTED

Abstract

We aim to develop and use cutting-edge data mining tools to identify risk factors that cause common diseases and potential drug targets that could prevent or treat these diseases. Methods developed within the MRC Integrative Epidemiology Unit use genetic data to help identify lifestyle risk factors that could be modified to reduce the risk or impact of disease, and can also identify potential drug targets. This programme is developing tools and databases to automate this type of analysis and apply it to large-scale population datasets to help us discover new ways to prevent and treat disease. We are also combining the evidence from these analyses with other types of biomedical information in a “knowledge graph” to enable us to investigate the mechanisms underlying disease, identify new targets for treatment or prevention, predict side effects of drugs and identify opportunities to repurpose existing drugs for other diseases. The methods, software and knowledge graph we are developing are made openly available to the research community to maximise their potential to improve population health.

Technical Summary

Background: Mendelian randomization (MR) is typically used to address specific causal hypotheses. Our MR-Base platform and OpenGWAS database now enable more systematic MR analyses of causal relationships between many traits and diseases, whilst our EpiGraphDB knowledge graph integrates these results with other biomedical evidence. Despite successes in identifying intervention targets and repurposing opportunities, such systematic MR analyses still face unsolved challenges in their interpretation and integration with other knowledge.
Aims: We aim to further advance approaches for systematically generating and integrating evidence to identify and prioritize intervention targets for disease prevention and treatment, and make these approaches and data resources widely accessible.
Objectives:
(1) Developing and applying knowledge graphs (KGs) to generate hypotheses: we will use EpiGraphDB (and other KGs) for systematic analysis of specific disease outcomes, explore the use of graph embedding/link prediction methods to improve KGs and identify novel hypotheses, and develop natural language KG query interfaces to broaden their applicability.
(2) Automating triangulation and evidence synthesis: we will develop new approaches to extracting evidence from the literature, websites and clinical trials databases. We will then systematically integrate this with evidence from MR and observational studies (including target trial emulation) and explore approaches to automating triangulation and synthesis of evidence for intervention targets.
(3) Identifying and prioritizing intervention targets: we will use transcriptomic signatures to identify off-target side effects, strengthen the evidence for drug targets by integrating molecular QTL (molQTL) across traits and tissues with literature, coding mutations (including autozygous loss of function mutations) and animal knockouts, and implement approaches for identifying interactions. We will further develop trans-ancestry MR for prediction of cross-population generalisability of both pharmaceutical and non-pharmaceutical interventions.
(4) New software and data resources: we will develop new open data and software resources based on IEU methodological innovations. We will enhance OpenGWAS by integrating non-European GWAS datasets to support multi-ancestry MR, implementing variance GWAS to identify potential interactions, and improving automated phenotype curation and clustering. We will also implement a new curated molQTL catalogue to support drug-target MR.
Importance: This programme will develop and apply systematic approaches to prioritise and validate causal hypotheses, linking methodology developed in the unit with applied epidemiological research. Implementing these approaches in open software/data resources and applying them to emerging datasets will yield new discoveries to improve population health.
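
For readers unfamiliar with the summary-data form of Mendelian randomization referred to above, the following is a minimal sketch of two standard estimators: the single-variant Wald ratio and the fixed-effect inverse-variance weighted (IVW) estimate. It is an illustration only and is not taken from MR-Base, OpenGWAS or any IEU software; the function names and the toy numbers are assumptions made for this example, and the standard errors use the common first-order approximation that ignores uncertainty in the variant-exposure effects.

```python
import math
from typing import Sequence, Tuple


def wald_ratio(beta_exposure: float, beta_outcome: float,
               se_outcome: float) -> Tuple[float, float]:
    """Causal estimate from one genetic variant (hypothetical helper).

    Returns (estimate, standard error) using the first-order
    approximation se = se_outcome / |beta_exposure|.
    """
    if beta_exposure == 0:
        raise ValueError("variant has no effect on the exposure")
    return beta_outcome / beta_exposure, se_outcome / abs(beta_exposure)


def ivw_estimate(beta_exposure: Sequence[float],
                 beta_outcome: Sequence[float],
                 se_outcome: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effect inverse-variance weighted estimate across variants.

    Equivalent to a weighted regression of outcome effects on exposure
    effects through the origin, with weights 1 / se_outcome**2.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("inputs must have the same length")
    if len(beta_exposure) == 0:
        raise ValueError("at least one variant is required")
    if any(se <= 0 for se in se_outcome):
        raise ValueError("standard errors must be positive")

    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    if denominator == 0:
        raise ValueError("no variant has an effect on the exposure")
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Toy summary statistics (made up for illustration, not real GWAS data).
toy_beta_exposure = [0.1, 0.2]
toy_beta_outcome = [0.05, 0.10]
toy_se_outcome = [0.01, 0.02]

toy_estimate, toy_se = ivw_estimate(toy_beta_exposure, toy_beta_outcome, toy_se_outcome)
```

In a systematic analysis of the kind described, an estimator like this would be applied across many exposure-outcome pairs, which is why automated data access and harmonisation matter as much as the estimator itself.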

Total Expenditure April 2006 - March 2025:

£1,559,000

Funded Period:

Mar 23 - Mar 28

Funder:

MRC

Project Status:

Active

Project Category:

Intramural

Project Reference:

MC_UU_00032/3

Principal Investigator:

Tom Gaunt

Health Category:

Unclassified

Organisations

People	ORCID iD
Tom Gaunt (Principal Investigator)	http://orcid.org/0000-0003-0924-3247

Publications

Barry C (2023) How to estimate heritability: a guide for genetic epidemiologists. in International Journal of Epidemiology

Barry CS (2024) Genetic Insights Into Perinatal Outcomes of Maternal Antihypertensive Therapy During Pregnancy. in JAMA network open

Bull CJ (2024) Impact of weight loss on cancer-related proteins in serum: results from a cluster randomised controlled trial of individuals with type 2 diabetes. in EBioMedicine

Elmore AR (2024) Protein Identification for Stroke Progression via Mendelian Randomization in Million Veteran Program and UK Biobank. in Stroke

Francis A (2024) DrivR-Base: a feature extraction toolkit for variant effect prediction model construction. in Bioinformatics (Oxford, England)

Hazelwood E (2024) Plasma Ghrelin and Risks of Sex-Specific, Site-Specific, and Early-Onset Colorectal Cancer: A Mendelian Randomization Analysis. in Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology

Lee MA (2025) Exploring the role of circulating proteins in multiple myeloma risk: a Mendelian randomization study. in Scientific reports

Leyden GM (2024) Characterizing the Causal Pathway From Childhood Adiposity to Right Heart Physiology and Pulmonary Circulation Using Lifecourse Mendelian Randomization. in Journal of the American Heart Association

Liu Y (2024) Triangulating evidence in health sciences with Annotated Semantic Queries. in Bioinformatics (Oxford, England)

Liu Y (2023) Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets. in Bioinformatics (Oxford, England)

Further Funding
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Description	CHECKPOINT
Amount	£3,499,252 (GBP)
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	03/2024
End	02/2029


Description	Causal inference methods to integrate genetics and multi-omics data for target discovery and validation
Amount	$493,127 (USD)
Organisation	Biogen Idec
Sector	Private
Country	United States
Start	11/2023
End	11/2027


Description	Skin Genetics Consortium grant
Amount	4,046,238 kr. (DKK)
Organisation	LEO Foundation
Sector	Charity/Non Profit
Country	Denmark
Start	03/2025
End	09/2027


Title	DrivR-Base
Description	DrivR-Base is a pipeline for extracting feature information from different databases for single nucleotide variants (SNVs). These features are designed to be inputs for machine learning models, aiding in the prediction of functional impacts of genetic variants in human genome sequencing.
Type Of Material	Computer model/algorithm
Year Produced	2023
Provided To Others?	Yes
Impact	This is forming the basis of ongoing work for variant effect prediction (in preparation for publication)
URL	https://github.com/amyfrancis97/DrivR-Base


Description	CUP-Global
Organisation	Imperial College London
Country	United Kingdom
Sector	Academic/University
PI Contribution	We are collaborating with the Global Cancer Update Programme (CUP-Global) team on methods to automate the systematic review processes used in the CUP project.
Collaborator Contribution	The CUP-Global team are providing information on the challenges of information extraction from the literature, and human-curated training datasets.
Impact	None yet
Start Year	2023


Description	CVD-COVID-UK
Organisation	Health Data Research UK
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	Analyses on the potential role of drug targets in COVID-19
Collaborator Contribution	This is an HDR-UK consortium with wide contributions from partners in terms of data, expertise, analyses and technologies.
Impact	N/A
Start Year	2020


Description	Genetics of DNA Methylation Consortium
Organisation	CeMM Research Center for Molecular Medicine
Country	Austria
Sector	Academic/University
PI Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact	Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending
Start Year	2013


Description	Genetics of DNA Methylation Consortium
Organisation	King's College London
Department	Brain Bank
Country	United Kingdom
Sector	Academic/University
PI Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact	Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending
Start Year	2013


Description	Genetics of DNA Methylation Consortium
Organisation	Leiden University Medical Center
Country	Netherlands
Sector	Academic/University
PI Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact	Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending
Start Year	2013


Description	Genetics of DNA Methylation Consortium
Organisation	Newcastle University
Country	United Kingdom
PI Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact	Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending
Start Year	2013


Description	Genetics of DNA Methylation Consortium
Organisation	University of Bristol
Country	United Kingdom
Sector	Academic/University
PI Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact	Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending
Start Year	2013


Description	Genetics of DNA Methylation Consortium
Organisation	University of Exeter
Country	United Kingdom
Sector	Academic/University
PI Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES)
Collaborator Contribution	Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies.
Impact	Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending
Start Year	2013


Description	IEU/UPenn collaboration
Organisation	University of Pennsylvania
Country	United States
Sector	Academic/University
PI Contribution	Mendelian randomization projects: conception, design, analysis and interpretation
Collaborator Contribution	Mendelian randomization projects: conception, design, data and compute resources and interpretation
Impact	Multi-disciplinary, integrating clinical, epidemiological and informatics expertise. Outputs: doi: 10.1007/s00125-022-05653-1
Start Year	2019


Title	ASQ
Description	The EpiGraphDB-ASQ (ASQ; /ɑːsk/, i.e. "ask") interface is a natural language interface for querying the integrated epidemiological evidence in the EpiGraphDB data and ecosystem. A query starts from either a short paragraph of text, from which ASQ derives and extracts claim triples, or claim triples supplied directly by the user. ASQ then retrieves data from EpiGraphDB, both biomedical entities and evidence from various sources, to facilitate triangulation of the evidence regarding a specific claim.
Type Of Technology	Webtool/Application
Year Produced	2022
Open Source License?	Yes
Impact	Publication pre-printed and in submission
URL	https://asq.epigraphdb.org/
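
To make the notion of a claim triple concrete, the sketch below represents a claim as a (subject, predicate, object) record and pulls one out of a simple sentence with a toy pattern match. This is not ASQ's data model or extraction method, neither of which is described in this record; the `ClaimTriple` class, the `parse_simple_claim` function and the predicate list are all hypothetical and exist only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClaimTriple:
    """A claim of the form subject -> predicate -> object (hypothetical)."""
    subject: str
    predicate: str
    object_: str


def parse_simple_claim(
    sentence: str,
    predicates: Sequence[str] = ("causes", "increases", "reduces"),
) -> Optional[ClaimTriple]:
    """Toy extraction: split a sentence on the first known predicate word.

    Real text mining is far more involved; this only handles sentences
    shaped like "<subject> <predicate> <object>."
    """
    text = sentence.strip().rstrip(".")
    for predicate in predicates:
        match = re.search(r"\b" + re.escape(predicate) + r"\b", text,
                          flags=re.IGNORECASE)
        if match:
            subject = text[:match.start()].strip()
            object_ = text[match.end():].strip()
            if subject and object_:
                return ClaimTriple(subject, predicate, object_)
    return None


example_triple = parse_simple_claim("Obesity causes asthma.")
```

A triple in this form can then serve as a lookup key: the subject and object are matched to biomedical entities, and the predicate indicates which kinds of evidence are relevant to the claim.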


Title	CanDrivR-CS
Description	CanDrivR-CS is a cancer-specific machine learning framework for distinguishing recurrent and rare variants
Type Of Technology	Software
Year Produced	2024
Open Source License?	Yes
Impact	Pre-print published, and in submission for journal publication. The key finding is that cancer-specific predictors of somatic driver mutations perform better than pan-cancer predictors. This is likely to be important to drug discovery research.
URL	https://github.com/amyfrancis97/CanDrivR-CS


Description	Organising Uganda Hub for Mendelian Randomization Conference 2024
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	We organised an international hub at the MRC/UVRI and LSHTM Uganda Research Unit for participants from Africa to remotely join the international Mendelian Randomization conference hosted in Bristol 19-21 June 2024. This Hub aimed to promote both inclusivity in the global research community and environmental sustainability. In addition, we hope that the success of this Hub will provide the infrastructure support to include more international hubs for future events, allowing for greater opportunities to connect on a global platform.
Year(s) Of Engagement Activity	2024
URL	https://www.mendelianrandomization.org.uk/uganda-conference-hub/


Description	Patient and Public Involvement Workshops
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Patients, carers and/or patient groups
Results and Impact	Patient and Public Involvement and Engagement (PPIE) Workshop to inform the development of a Cancer Research UK grant application
Year(s) Of Engagement Activity	2024
