Data mining epidemiological relationships
Lead Research Organisation: University of Bristol
Department Name: UNLISTED
Abstract
We aim to develop and use cutting-edge data mining tools to identify risk factors that cause common diseases and potential drug targets that could prevent or treat these diseases. Methods developed within the MRC Integrative Epidemiology Unit use genetic data to help identify lifestyle risk factors that could be modified to reduce the risk or impact of disease, and can also identify potential drug targets. This programme is developing tools and databases to automate this type of analysis and apply it to large-scale population datasets to help us discover new ways to prevent and treat disease. We are also combining the evidence from these analyses with other types of biomedical information in a “knowledge graph” to enable us to investigate the mechanisms underlying disease, identify new targets for treatment or prevention, predict side effects of drugs and identify opportunities to repurpose existing drugs for other diseases. The methods, software and knowledge graph we are developing are made openly available to the research community to maximise their potential to improve population health.
Technical Summary
Background: Mendelian randomization (MR) is typically used to address specific causal hypotheses. Our MR-Base platform and OpenGWAS database now enable more systematic MR analyses of causal relationships between many traits and diseases, whilst our EpiGraphDB knowledge graph integrates these results with other biomedical evidence. Despite successes in identifying intervention targets and repurposing opportunities, such systematic MR analyses still face unsolved challenges in their interpretation and integration with other knowledge.
Aims: We aim to further advance approaches for systematically generating and integrating evidence to identify and prioritize intervention targets for disease prevention and treatment, and make these approaches and data resources widely accessible.
Objectives:
(1) Developing and applying knowledge graphs (KGs) to generate hypotheses: we will use EpiGraphDB (and other KGs) for systematic analysis of specific disease outcomes, explore the use of graph embedding/link prediction methods to improve KGs and identify novel hypotheses, and develop natural language KG query interfaces to broaden their applicability.
(2) Automating triangulation and evidence synthesis: we will develop new approaches to extracting evidence from the literature, websites and clinical trials databases. We will then systematically integrate this with evidence from MR and observational studies (including target trial emulation) and explore approaches to automating triangulation and synthesis of evidence for intervention targets.
(3) Identifying and prioritizing intervention targets: we will use transcriptomic signatures to identify off-target side effects, strengthen the evidence for drug targets by integrating molecular QTL (molQTL) across traits and tissues with literature, coding mutations (including autozygous loss of function mutations) and animal knockouts, and implement approaches for identifying interactions. We will further develop trans-ancestry MR for prediction of cross-population generalisability of both pharmaceutical and non-pharmaceutical interventions.
(4) New software and data resources: we will develop new open data and software resources based on IEU methodological innovations. We will enhance OpenGWAS by integrating non-European GWAS datasets to support multi-ancestry MR, implementing variance GWAS to identify potential interactions, and improving automated phenotype curation and clustering. We will also implement a new curated molQTL catalogue to support drug-target MR.
Importance: This programme will develop and apply systematic approaches to prioritise and validate causal hypotheses, linking methodology developed in the unit with applied epidemiological research. Implementing these approaches in open software/data resources and applying them to emerging datasets will yield new discoveries to improve population health.
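For readers unfamiliar with the kind of estimate that summary-data MR produces, the following is a minimal sketch of a fixed-effect inverse-variance weighted (IVW) estimate built from per-variant Wald ratios. It is an illustration only and is not taken from MR-Base, OpenGWAS or EpiGraphDB: the function names, argument names and the numbers in the usage example are all hypothetical, and the standard errors use the common first-order approximation that ignores uncertainty in the variant-exposure effects.

```python
import math


def wald_ratios(beta_exposure, beta_outcome, se_outcome):
    """Per-variant Wald ratios and their first-order standard errors.

    Each argument is a list with one entry per genetic variant
    (hypothetical summary statistics, all on the same allele coding).
    """
    ratios = [bo / be for be, bo in zip(beta_exposure, beta_outcome)]
    # First-order approximation: treats the exposure effect as known exactly.
    ses = [so / abs(be) for be, so in zip(beta_exposure, se_outcome)]
    return ratios, ses


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted combination of Wald ratios."""
    ratios, ses = wald_ratios(beta_exposure, beta_outcome, se_outcome)
    weights = [1.0 / s ** 2 for s in ses]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return estimate, standard_error


if __name__ == "__main__":
    # Made-up effect sizes for three variants, for illustration only.
    est, se = ivw_estimate(
        beta_exposure=[0.1, 0.2, 0.4],
        beta_outcome=[0.05, 0.10, 0.20],
        se_outcome=[0.01, 0.01, 0.01],
    )
    print(f"IVW estimate: {est:.3f} (SE {se:.3f})")
```

Variants with stronger exposure effects receive more weight because their Wald ratios are more precise. A systematic analysis of the kind described above would repeat this calculation across many exposure-outcome pairs and add sensitivity analyses that this sketch omits.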
Organisations
- University of Bristol (Lead Research Organisation)
- Leiden University Medical Center (Collaboration)
- UNIVERSITY OF BRISTOL (Collaboration)
- Health Data Research UK (Collaboration)
- KING'S COLLEGE LONDON (Collaboration)
- University of Pennsylvania (Collaboration)
- Newcastle University (Collaboration)
- CeMM Research Center for Molecular Medicine (Collaboration)
- IMPERIAL COLLEGE LONDON (Collaboration)
- UNIVERSITY OF EXETER (Collaboration)
Publications
- Barry C (2023). How to estimate heritability: a guide for genetic epidemiologists. International Journal of Epidemiology.
- Barry CS (2024). Genetic Insights Into Perinatal Outcomes of Maternal Antihypertensive Therapy During Pregnancy. JAMA Network Open.
- Francis A (2024). DrivR-Base: a feature extraction toolkit for variant effect prediction model construction. Bioinformatics (Oxford, England).
- Hazelwood E (2024). Plasma Ghrelin and Risks of Sex-Specific, Site-Specific, and Early-Onset Colorectal Cancer: A Mendelian Randomization Analysis. Cancer Epidemiology, Biomarkers & Prevention: a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology.
- Lee MA (2025). Exploring the role of circulating proteins in multiple myeloma risk: a Mendelian randomization study. Scientific Reports.
- Leyden GM (2024). Characterizing the Causal Pathway From Childhood Adiposity to Right Heart Physiology and Pulmonary Circulation Using Lifecourse Mendelian Randomization. Journal of the American Heart Association.
- Liu Y (2024). Triangulating evidence in health sciences with Annotated Semantic Queries. Bioinformatics (Oxford, England).
- Liu Y (2023). Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets. Bioinformatics (Oxford, England).
| Description | CHECKPOINT |
| Amount | £3,499,252 (GBP) |
| Organisation | Medical Research Council (MRC) |
| Sector | Public |
| Country | United Kingdom |
| Start | 03/2024 |
| End | 02/2029 |
| Description | Causal inference methods to integrate genetics and multi-omics data for target discovery and validation |
| Amount | $493,127 (USD) |
| Organisation | Biogen Idec |
| Sector | Private |
| Country | United States |
| Start | 11/2023 |
| End | 11/2027 |
| Description | Skin Genetics Consortium grant |
| Amount | 4,046,238 kr. (DKK) |
| Organisation | LEO Foundation |
| Sector | Charity/Non Profit |
| Country | Denmark |
| Start | 03/2025 |
| End | 09/2027 |
| Title | DrivR-Base |
| Description | DrivR-Base is a pipeline for extracting feature information from different databases for single nucleotide variants (SNVs). These features are designed to be inputs for machine learning models, aiding in the prediction of functional impacts of genetic variants in human genome sequencing. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| Impact | This is forming the basis of ongoing work for variant effect prediction (in preparation for publication) |
| URL | https://github.com/amyfrancis97/DrivR-Base |
| Description | CUP-Global |
| Organisation | Imperial College London |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | We are collaborating with the Global Cancer Update Programme (CUP-Global) team to automate the systematic review processes used in the CUP project. |
| Collaborator Contribution | The CUP-Global team are providing information on the challenges of information extraction from the literature, and human-curated training datasets. |
| Impact | None yet |
| Start Year | 2023 |
| Description | CVD-COVID-UK |
| Organisation | Health Data Research UK |
| Country | United Kingdom |
| Sector | Charity/Non Profit |
| PI Contribution | Analyses on the potential role of drug targets in COVID-19 |
| Collaborator Contribution | This is an HDR-UK consortium with wide contributions from partners in terms of data, expertise, analyses and technologies. |
| Impact | N/A |
| Start Year | 2020 |
| Description | Genetics of DNA Methylation Consortium |
| Organisation | CeMM Research Center for Molecular Medicine |
| Country | Austria |
| Sector | Academic/University |
| PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
| Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
| Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
| Start Year | 2013 |
| Description | Genetics of DNA Methylation Consortium |
| Organisation | King's College London |
| Department | Brain Bank |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
| Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
| Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
| Start Year | 2013 |
| Description | Genetics of DNA Methylation Consortium |
| Organisation | Leiden University Medical Center |
| Country | Netherlands |
| Sector | Academic/University |
| PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
| Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
| Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
| Start Year | 2013 |
| Description | Genetics of DNA Methylation Consortium |
| Organisation | Newcastle University |
| Country | United Kingdom |
| PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
| Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
| Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
| Start Year | 2013 |
| Description | Genetics of DNA Methylation Consortium |
| Organisation | University of Bristol |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
| Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
| Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
| Start Year | 2013 |
| Description | Genetics of DNA Methylation Consortium |
| Organisation | University of Exeter |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) |
| Collaborator Contribution | Contributing to a consortial analysis of methylation quantitative trait loci using data from other studies. |
| Impact | Multi-disciplinary collaboration involving molecular epidemiology, statistics and bioinformatics. Outputs: Database of methylation QTL: http://mqtldb.godmc.org.uk/ Publication pending |
| Start Year | 2013 |
| Description | IEU/UPenn collaboration |
| Organisation | University of Pennsylvania |
| Country | United States |
| Sector | Academic/University |
| PI Contribution | Mendelian randomization projects: conception, design, analysis and interpretation |
| Collaborator Contribution | Mendelian randomization projects: conception, design, data and compute resources and interpretation |
| Impact | Multi-disciplinary, integrating clinical, epidemiological and informatics expertise. Outputs: doi: 10.1007/s00125-022-05653-1 |
| Start Year | 2019 |
| Title | ASQ |
| Description | The EpiGraphDB-ASQ (ASQ, pronounced "ask") interface is a natural language interface for querying the integrated epidemiological evidence in the EpiGraphDB data and ecosystem. A query starts either from a short paragraph of text, from which ASQ derives claim triples, or from claim triples that users supply directly. ASQ then retrieves data from EpiGraphDB, both biomedical entities and evidence from various sources, to facilitate triangulation of the evidence regarding a specific claim. |
| Type Of Technology | Webtool/Application |
| Year Produced | 2022 |
| Open Source License? | Yes |
| Impact | Publication pre-printed and in submission |
| URL | https://asq.epigraphdb.org/ |
| Title | CanDrivR-CS |
| Description | CanDrivR-CS is a cancer-specific machine learning framework for distinguishing recurrent and rare variants |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Pre-print published and in submission for journal publication. The key finding is that cancer-specific predictors of somatic driver mutations perform better than pan-cancer predictors. This is likely to be important to drug discovery research. |
| URL | https://github.com/amyfrancis97/CanDrivR-CS |
| Description | Organising Uganda Hub for Mendelian Randomization Conference 2024 |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | We organised an international hub at the MRC/UVRI and LSHTM Uganda Research Unit for participants from Africa to remotely join the international Mendelian Randomization conference hosted in Bristol 19-21 June 2024. This Hub aimed to promote both inclusivity in the global research community and environmental sustainability. In addition, we hope that the success of this Hub will provide the infrastructure support to include more international hubs for future events, allowing for greater opportunities to connect on a global platform. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://www.mendelianrandomization.org.uk/uganda-conference-hub/ |
| Description | Patient and Public Involvement Workshops |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Patients, carers and/or patient groups |
| Results and Impact | Patient and Public Involvement and Engagement (PPIE) Workshop to inform the development of a Cancer Research UK grant application |
| Year(s) Of Engagement Activity | 2024 |
