Using semantics to leverage health and research Big Data

Lead Research Organisation: University of Leicester
Department Name: Genetics

Abstract

Connecting health-related research big data enables datasets to be compared and larger sample/participant sizes to be identified for analysis. Big data pose integration challenges with regard to their complexity. Phenotype is a data type that shows particular variety and variability across clinical and non-clinical (e.g. 'omics) health research big data. The term "phenotype" is used to describe an aggregated set of medically and semantically distinct concepts such as a trait (e.g. blood glucose level), medical signs and symptoms (e.g. hyperglycemia), and disease (e.g. type 2 diabetes). To be able to compare values across big data we need to know whether the values have the same meaning between datasets, and the semantic rigour required to do this is provided by the use of "ontologies". There are several ontologies that describe overlapping phenotype domains but have been developed for different purposes; for example, SNOMED CT is used by the NHS and the Human Phenotype Ontology (HPO) is used by research databases. If datasets are coded to different ontologies, or no ontology at all, then they can be linked to a common ontology via a process of harmonisation. This involves mapping terms from different ontologies, and text mining (TM) of free-text to associate the original values with ontology codes. Unfortunately, there are several barriers to this, such as gaps in the ontologies and the publicly available ontology mappings, the need to adapt current TM approaches which are optimised to perform well in clearly defined areas, and the need to scale current mapping and TM methods to work with big data.

During this fellowship I will create enhanced capabilities for connecting clinical and non-clinical research big data by using ontologies to harmonise phenotype data. This will involve bridging the gaps in current ontologies, and adapting current state-of-the-art TM and ontology mapping approaches so they are optimised for this context and can be applied to big data. The approaches I develop will be disease agnostic; however, in the first instance they will be applied to disease areas of local interest. The Leicester Biomedical Research Centre focuses on cardiovascular, respiratory and lifestyle diseases and encompasses datasets from primary and secondary care, and clinical research studies which include participant questionnaires and biological sample data. I will connect clinical research data within and between disease areas and with local and publicly available 'omics data, for example genome- and epigenome-wide association studies.

The study-specific and tiered opt-in consent completed by study participants can be incompatible or ambiguous when connecting data across multiple studies. This blocks a harmonised clinical research dataset being used to answer a new research question. Some projects have worked on developing consent ontologies, but there is not currently a suitable consent ontology that fits with NHS guidance on collecting consent. Leicester already co-leads the Global Alliance for Genomics and Health efforts in this area, which I will help extend towards an ontology-based approach for representing NHS consents and data use conditions, to allow consent harmonisation in line with the requirements of new UK data protection laws.

Harmonised datasets can be connected to public sources of standardised data to bridge the gap to translational research. Examples of this are cross-disciplinary collaborations that have mapped between human and mouse phenotype ontologies, to allow the discovery of mouse disease models for a collection of human phenotypic abnormalities. Where a disease does not have a known genetic cause, the ability to perform a cross-species phenotype comparison allows potential mouse gene-knockout models for the disease to be discovered. These cross-species mappings have been applied to public standardised databases and I will investigate their utility with real-world health-related big data.

Technical Summary

The key aim of this research is to develop new methods for connecting clinical and non-clinical research big data. I will focus on harmonisation of two domains: "phenotype" due to its complexity, and "consent" due to its importance when making existing clinical data available for new purposes. The principles from this work will be applied to other domains, such as demographic and environmental data, to connect heterogeneous data of many dimensions.

To harmonise phenotype data I will bring together clinical (Read Codes/SNOMED CT and ICD-10) and research (HPO and MeSH) focussed ontologies with recent developments in ontology mapping and text mining, to code clinical research data with both clinical and research ontologies. This will provide the interface for connecting with NHS and non-clinical 'omics data. A range of state-of-the-art text mining tools (e.g. MetaMap, Bio-LarK, NCBO Annotator, cTAKES) will be assessed. Suitable tools will be optimised for concept selection and for computationally intensive processing of big data, exploiting scalable storage (e.g. NoSQL) and infrastructure provided by the Leicester HPC facility. The finding, gathering and harmonisation of disease-centric public sources of 'omics data from databases, websites and scientific literature will extend techniques we have developed for harmonising GWAS data and apply them to other types of 'omics, such as EWAS.
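As an illustration only, the two harmonisation routes described above (mapping existing codes, and text mining free-text) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's method: the lexicon, the code map, the `COMMON:`, `CLIN:` and `RES:` identifiers and the function names are all hypothetical placeholders, and simple dictionary matching stands in for tools such as MetaMap or cTAKES.

```python
import re

# Hypothetical lexicon: lower-cased label or synonym -> common-ontology code.
# These codes are placeholders, not real ontology identifiers.
LEXICON = {
    "hyperglycemia": "COMMON:0001",
    "hyperglycaemia": "COMMON:0001",
    "high blood sugar": "COMMON:0001",
    "type 2 diabetes": "COMMON:0002",
}

# Hypothetical cross-ontology mapping: source code -> common-ontology code.
CODE_MAP = {
    "CLIN:A1": "COMMON:0001",
    "RES:B7": "COMMON:0001",
    "CLIN:A2": "COMMON:0002",
}


def annotate_text(text):
    """Return the sorted common codes whose label occurs in the free text."""
    lowered = text.lower()
    found = set()
    for label, code in LEXICON.items():
        # Whole-word match so that short labels do not match inside other words.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            found.add(code)
    return sorted(found)


def harmonise(record):
    """Map a record's coded value and/or free text to common codes."""
    codes = set(annotate_text(record.get("text", "")))
    mapped = CODE_MAP.get(record.get("code"))
    if mapped:
        codes.add(mapped)
    return sorted(codes)
```

In this sketch a record coded to either source vocabulary, or described only in free text, resolves to the same placeholder code, which is what allows values to be compared across datasets. Real pipelines would also need to handle negation, ambiguity and unmapped terms.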

I will investigate the utility of cross-species harmonisation of human phenotype terms to closely associated mouse phenotypes to answer translational research questions. This will make use of publicly available HPO to MP mappings. I will collaborate with the relevant mapping projects to leverage existing semantic matching methods with real-world big data.
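The cross-species lookup described above can be illustrated with a small Python sketch. It assumes a hypothetical table of human-to-mouse term mappings with similarity scores and a hypothetical set of annotated mouse models; the identifiers, scores, threshold and the sum-of-best-match ranking are placeholders chosen for illustration and are not the semantic matching methods used by the mapping projects.

```python
# Hypothetical human -> mouse phenotype mappings: (mouse term, similarity score).
# Identifiers and scores are placeholders, not real HPO or MP content.
HUMAN_TO_MOUSE = {
    "HUMAN:0001": [("MOUSE:0101", 0.95), ("MOUSE:0102", 0.60)],
    "HUMAN:0002": [("MOUSE:0201", 0.80)],
}

# Hypothetical mouse models annotated with mouse phenotype terms.
MOUSE_MODELS = {
    "model_a": {"MOUSE:0101", "MOUSE:0201"},
    "model_b": {"MOUSE:0102"},
}


def rank_models(human_terms, min_score=0.5):
    """Rank mouse models by the summed best mapping score per human term."""
    scores = {}
    for model, phenotypes in MOUSE_MODELS.items():
        total = 0.0
        for term in human_terms:
            # Best-scoring mapped mouse term that this model is annotated with.
            best = max(
                (score
                 for mouse_term, score in HUMAN_TO_MOUSE.get(term, [])
                 if mouse_term in phenotypes and score >= min_score),
                default=0.0,
            )
            total += best
        if total > 0:
            scores[model] = total
    # Highest-scoring candidate models first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Given a patient's set of human phenotype terms, the function returns candidate models ordered by how well their annotated phenotypes cover those terms; models with no mapped phenotype are omitted.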
 
Description 2021-hCNVexchange ELIXIR Community-led Implementation Study
Amount € 232,782 (EUR)
Organisation ELIXIR 
Sector Charity/Non Profit
Country United Kingdom
Start 06/2021 
End 05/2023
 
Description 2022-Humanphengen ELIXIR Data Implementation Study
Amount € 166,750 (EUR)
Organisation ELIXIR 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2022 
End 12/2023
 
Description 2023-MLstandards ELIXIR Commissioned Services Implementation Study
Amount € 243,658 (EUR)
Organisation ELIXIR 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2023 
End 12/2024
 
Description Combatting diet related non-communicable disease through enhanced surveillance
Amount € 11,717,708 (EUR)
Funding ID 101084642 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 01/2023 
End 12/2026
 
Description ELIXIR Strategic Implementation Studies - Federated Human Data (2019-21)
Amount € 3,750 (EUR)
Organisation ELIXIR 
Sector Charity/Non Profit
Country United Kingdom
Start 04/2020 
End 12/2021
 
Description University of Leicester PhD Studentship
Amount £57,971 (GBP)
Organisation University of Leicester 
Sector Academic/University
Country United Kingdom
Start 09/2019 
End 09/2022
 
Title GWAS Central 
Description GWAS Central is a widely-used comprehensive collection of summary-level genetic association data. Within the last year we performed a significant update to the database, adding over 1.4K new studies and annotating the additional phenotype content with standardised vocabularies. A more complete description of the resource is included in a 2020 Nucleic Acids Research Database Issue manuscript (DOI: 10.1093/nar/gkz895). 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Immediately following the update, the resource experienced an increase in user activity/requests. Locally, the expanded GWAS dataset has been used in our collaborative projects (see collaborations section). More generally, we have experienced an increase in study authors submitting their findings to GWAS Central as a means to maximise the impact of their work. In January 2020 GWAS Central was awarded ELIXIR-UK Node Service status. 
URL https://www.gwascentral.org/
 
Description Imperial College London 
Organisation Imperial College London
Department Department of Metabolism, Digestion and Reproduction
Country United Kingdom 
Sector Academic/University 
PI Contribution Development of methods for generating programmatically accessible outputs from free-text scientific literature. Provide expertise on the Information Artifact Ontology (IAO) and the genome-wide association study (GWAS) natural language processing (NLP) use case. Co-lead on the project direction.
Collaborator Contribution Development of methods for predicting publication sections and aligning to the IAO. Provide expertise on the metabolome-wide association study (MWAS) NLP use case. Co-lead on the project direction.
Impact This work is being undertaken across several co-supervised postgraduate student projects (informatics, bioinformatics, biosciences). A preprint manuscript describes the status of this ongoing work at the start of 2021 (DOI: 10.1101/2021.01.08.425887).
Start Year 2020
 
Description NIHR Leicester BRC 
Organisation National Institute for Health Research
Department NIHR Leicester Biomedical Research Centre
Country United Kingdom 
Sector Public 
PI Contribution Development of strategies for the semantic integration of clinical research data across the BRC disease themes.
Collaborator Contribution Provides access to N3 (latterly HSCN) and clinical research data. Defines the requirements and parameters of data harmonisation and integration.
Impact Current outputs have included the development of application ontologies for collecting primary care data, local informatics strategy contributions to enable the cataloguing of cohort data, and establishment of local data discovery software to enable researchers to identify cohorts across disease themes. This work has involved a multi-disciplinary team of clinicians, bioinformaticians and NHS clinical informaticians.
Start Year 2018
 
Description Pompeu Fabra University 
Organisation Pompeu Fabra University
Country Spain 
Sector Academic/University 
PI Contribution NLP, text mining and ontology expertise related to GWAS publication text.
Collaborator Contribution Text mining and database expertise related to extracting gene-disease and variant-disease associations from publication text.
Impact The collaboration is ongoing through a funded ELIXIR implementation study (2022 - 2023).
Start Year 2022
 
Description University of Wisconsin-Madison 
Organisation University of Wisconsin-Madison
Department Department of Biochemistry
Country United States 
Sector Academic/University 
PI Contribution Ontology-driven mapping of mouse traits to human phenotypes, and derivation of human GWAS SNP data associated with mapped mouse traits.
Collaborator Contribution Identification of mouse QTL associated with lifestyle disease-related traits, and statistical analysis to determine if human loci that are syntenic to mouse QTL are enriched with lifestyle disease-related SNPs.
Impact This work was undertaken by a multi-disciplinary team of mouse and human genetics researchers, bioinformaticians and biostatisticians. A manuscript describing this work is published in The Journal of Clinical Investigation (DOI: 10.1172/JCI129143).
Start Year 2018
 
Title Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature 
Description Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications) is a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Full-text is converted to BioC-JSON format, publication tables are converted to tables-JSON format, and abbreviations declared within the publication are output in abbreviations-JSON format that relates abbreviations with full definitions. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact This software is being used and developed within the text mining "Humangenphen" implementation study funded by ELIXIR (2022 - 2023). The software reduces the barrier to enabling text mining by standardising and optimising the text to be processed. 
 
Description ELIXIR-UK Health Data Workshop 2021 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I organised this workshop to bring together UK and international health data research expertise to explore the formation of a UK health data community which identifies with the goals of ELIXIR. This event was aimed at those working in health data within the UK, including those working in academic institutions, healthcare and life science companies.

The event led to follow-up discussions from participants around engagement with the ELIXIR Health Data Communities (HDCs) and strategic health data collaborations.
Year(s) Of Engagement Activity 2021
URL https://elixiruknode.org/elixir-uk-health-data-workshop-2021/
 
Description Established and co-lead the ELIXIR Health Data Focus Group 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The ELIXIR Health Data Focus Group meets regularly and is open to all ELIXIR members. It aims to be the incubator of ideas relating to health data and is ELIXIR's formal channel to organise this. A range of topics around improving health data are discussed, including the use of semantics to harmonise clinical and research health datasets.
Year(s) Of Engagement Activity 2020,2021
URL https://elixir-europe.org/focus-groups/health-data
 
Description European i2b2 Conference Scientific Committee 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Member of the Scientific Committee for the European i2b2 tranSMART Conference. After a presentation, questions and subsequent discussions, interest was expressed in our research activities and in our partnering on a major EU-funded project.
Year(s) Of Engagement Activity 2018