Using semantics to leverage health and research Big Data

Lead Research Organisation: University of Leicester

Department Name: Genetics

Abstract

Connecting health-related research big data enables them to be compared, and allows larger sample/participant sizes to be discovered for analysis. Big data pose integration challenges because of their complexity. Phenotype is a data type that shows particular variety and variability across clinical and non-clinical (e.g. 'omics) health research big data. The term "phenotype" is used to define an aggregated set of medically and semantically distinct concepts such as a trait (e.g. blood glucose level), medical signs and symptoms (e.g. hyperglycemia), and disease (e.g. type 2 diabetes). To compare values across big data we need to know whether the values have the same meaning between datasets, and the semantic rigour required to do this is provided by the use of "ontologies". Several ontologies describe overlapping phenotype domains but have been developed for different purposes; for example, SNOMED CT is used by the NHS and the Human Phenotype Ontology (HPO) is used by research databases. If datasets are coded to different ontologies, or to no ontology at all, they can be linked to a common ontology via a process of harmonisation. This involves mapping terms from different ontologies, and text mining (TM) of free text to associate the original values with ontology codes. Unfortunately, there are several barriers to this, such as gaps in the ontologies and in the publicly available ontology mappings, the need to adapt current TM approaches, which are optimised to perform well in clearly defined areas, and the need to scale current mapping and TM methods to work with big data.

During this fellowship I will create enhanced capabilities for connecting clinical and non-clinical research big data by using ontologies to harmonise phenotype data. This will involve bridging the gaps in current ontologies, and adapting current state-of-the-art TM and ontology mapping approaches so they are optimised for this context and can be applied to big data. The approaches I develop will be disease agnostic; however, in the first instance they will be applied to disease areas of local interest. The Leicester Biomedical Research Centre focuses on cardiovascular, respiratory and lifestyle diseases and encompasses datasets from primary and secondary care, and clinical research studies which include participant questionnaires and biological sample data. I will connect clinical research data within and between disease areas and with local and publicly available 'omics data, for example genome- and epigenome-wide association studies.

The study-specific and tiered opt-in consent completed by study participants can be incompatible or ambiguous when connecting data across multiple studies. This blocks a harmonised clinical research dataset being used to answer a new research question. Some projects have worked on developing consent ontologies, but there is not currently a suitable consent ontology that fits with NHS guidance on collecting consent. Leicester already co-leads the Global Alliance for Genomics and Health efforts in this area, which I will help extend towards an ontology-based approach for representing NHS consents and data use conditions, to allow consent harmonisation in line with the requirements of new UK data protection laws.
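As a minimal sketch of why a shared consent representation helps: once each study's consent has been coded to a common set of data use categories, checking whether a participant's data can be used for a new research question becomes a simple membership test. The category names, the `Participant` record and the `eligible_participants` function below are hypothetical and are not taken from any existing consent ontology or from NHS guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    participant_id: str
    study: str
    # Data uses this participant consented to, already coded to a
    # shared (hypothetical) vocabulary of use categories.
    permitted_uses: frozenset


def eligible_participants(participants, requested_use):
    """Return IDs of participants whose consent covers the requested use."""
    return [
        p.participant_id
        for p in participants
        if requested_use in p.permitted_uses
    ]


# Illustrative records from two studies with different tiered consents.
COHORT = [
    Participant("P1", "study_a", frozenset({"cardiovascular", "genetic"})),
    Participant("P2", "study_a", frozenset({"cardiovascular"})),
    Participant("P3", "study_b", frozenset({"respiratory", "genetic"})),
]

genetic_ok = eligible_participants(COHORT, "genetic")
```

Without harmonised categories the same check would need study-by-study interpretation of each consent form, which is the blocker described above.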

Harmonised datasets can be connected to public sources of standardised data to bridge the gap to translational research. One example is the cross-disciplinary collaborations that have mapped between human and mouse phenotype ontologies, to allow the discovery of mouse disease models for a collection of human phenotypic abnormalities. Where a disease does not have a known genetic cause, the ability to perform a cross-species phenotype comparison allows potential mouse gene-knockout models for the disease to be discovered. These cross-species mappings have been applied to public standardised databases and I will investigate their utility with real-world health-related big data.

Technical Summary

The key aim of this research is to develop new methods for connecting clinical and non-clinical research big data. I will focus on harmonisation of two domains: "phenotype" due to its complexity, and "consent" due to its importance when making existing clinical data available for new purposes. The principles from this work will be applied to other domains, such as demographic and environmental data, to connect heterogeneous data of many dimensions.

To harmonise phenotype data I will bring together clinical (Read Codes/SNOMED CT and ICD-10) and research (HPO and MeSH) focussed ontologies with recent developments in ontology mapping and text mining, to code clinical research data with both clinical and research ontologies. This will provide the interface for connecting with NHS and non-clinical 'omics data. A range of state-of-the-art text mining tools (e.g. MetaMap, Bio-LarK, NCBO Annotator, cTAKES) will be assessed. Suitable tools will be optimised for concept selection and for computationally intensive processing of big data that will exploit scalable storage (e.g. NoSQL) and infrastructure provided by the Leicester HPC facility. The finding, gathering and harmonisation of disease-centric public sources of 'omics data from databases, websites and scientific literature will extend techniques we have developed for harmonising GWAS data and apply them to other types of 'omics, such as EWAS.
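The two harmonisation routes described above (mapping existing codes, and text mining free text) can be illustrated with a minimal sketch. It assumes a hand-built mapping table and a small lexicon; every identifier below is a placeholder rather than a real SNOMED CT, ICD-10 or HPO code, and the simple dictionary lookup stands in for the far more capable tools named above.

```python
import re

# Hypothetical mapping: (source vocabulary, source code) -> common concept.
# Codes are placeholders, not real ontology identifiers.
CODE_MAP = {
    ("SNOMED", "S-001"): "COMMON:hyperglycemia",
    ("ICD10", "I-001"): "COMMON:hyperglycemia",
    ("HPO", "H-001"): "COMMON:hyperglycemia",
    ("HPO", "H-002"): "COMMON:type_2_diabetes",
}

# Hypothetical lexicon of labels and synonyms for free-text matching.
LEXICON = {
    "hyperglycemia": "COMMON:hyperglycemia",
    "hyperglycaemia": "COMMON:hyperglycemia",
    "high blood sugar": "COMMON:hyperglycemia",
    "type 2 diabetes": "COMMON:type_2_diabetes",
}


def harmonise_code(vocabulary, code):
    """Map a coded value to the common concept, or None if unmapped."""
    return CODE_MAP.get((vocabulary, code))


def annotate_text(text):
    """Return the sorted common concepts whose labels occur in free text."""
    lowered = text.lower()
    found = set()
    for label, concept in LEXICON.items():
        # Word boundaries avoid matching a label inside a longer word.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            found.add(concept)
    return sorted(found)


clinical = harmonise_code("SNOMED", "S-001")
research = harmonise_code("HPO", "H-001")
notes = annotate_text("Patient has High Blood Sugar and type 2 diabetes.")
```

Here `clinical` and `research` resolve to the same common concept, so the two records become comparable, while an unmapped code returns `None` and exposes a gap in the mappings.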

I will investigate the utility of cross-species harmonisation of human phenotype terms to closely associated mouse phenotypes to answer translational research questions. This will make use of publicly available HPO to MP mappings. I will collaborate with the relevant mapping projects to leverage existing semantic matching methods with real-world big data.
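A minimal sketch of how such cross-species mappings might be used, assuming each HPO to MP mapping carries a similarity score and that mouse genes are annotated with MP terms. The identifiers, scores, gene names and the summed-score ranking are all illustrative assumptions, not the method used by the mapping projects.

```python
# Hypothetical HPO -> MP mappings with similarity scores (placeholders).
HP_TO_MP = {
    "HP:h1": [("MP:m1", 0.9), ("MP:m2", 0.6)],
    "HP:h2": [("MP:m2", 0.8)],
}

# Hypothetical mouse gene annotations: MP term -> knockout genes showing it.
MP_TO_GENES = {
    "MP:m1": {"GeneA"},
    "MP:m2": {"GeneA", "GeneB"},
}


def rank_mouse_models(human_terms, min_score=0.5):
    """Rank mouse genes by the summed score of their matched phenotypes."""
    scores = {}
    for hp_term in human_terms:
        for mp_term, score in HP_TO_MP.get(hp_term, []):
            if score < min_score:
                continue  # ignore weak cross-species matches
            for gene in MP_TO_GENES.get(mp_term, ()):
                scores[gene] = scores.get(gene, 0.0) + score
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_mouse_models(["HP:h1", "HP:h2"])
strict = rank_mouse_models(["HP:h1", "HP:h2"], min_score=0.85)
```

Raising `min_score` trades recall for precision: fewer candidate models are returned, but each rests on a closer phenotype match.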

Funded Value:

£293,153

Funded Period:

Feb 18 - Nov 21

Funder:

MRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

MR/S003703/1

Principal Investigator:

Tim Beck

Health Category:

Unclassified

Organisations

People	ORCID iD
Tim Beck (Principal Investigator / Fellow)

Publications

Beck T (2020) GWAS Central: a comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. in Nucleic acids research

Beck T (2021) Auto-CORPus: A Natural Language Processing Tool for Standardising and Reusing Biomedical Literature

Beck T (2023) GWAS Central: an expanding resource for finding and visualising genotype and phenotype data from genome-wide association studies. in Nucleic acids research

Beck T (2022) Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature. in Frontiers in digital health

Keller MP (2019) Gene loci associated with insulin secretion in islets from non-diabetic mice. in The Journal of clinical investigation

Price T (2023) Identification of genetic drivers of plasma lipoprotein size in the Diversity Outbred mouse population. in Journal of Lipid Research

Rambla J (2022) Beacon v2 and Beacon networks: A "lingua franca" for federated data discovery in biomedical genomics, and beyond. in Human mutation

Rehm HL (2021) GA4GH: International policies and standards for data sharing across genomic research and healthcare. in Cell genomics

Wang M (2024) Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. in Journal of proteome research

Yeung C (2022) MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition

Further Funding
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Description	2021-hCNVexchange ELIXIR Community-led Implementation Study
Amount	€ 232,782 (EUR)
Organisation	ELIXIR
Sector	Charity/Non Profit
Country	United Kingdom
Start	05/2021
End	05/2023


Description	2022-Humanphengen ELIXIR Data Implementation Study
Amount	€ 166,750 (EUR)
Organisation	ELIXIR
Sector	Charity/Non Profit
Country	United Kingdom
Start	01/2022
End	12/2023


Description	2023-MLstandards ELIXIR Commissioned Services Implementation Study
Amount	€ 243,658 (EUR)
Organisation	ELIXIR
Sector	Charity/Non Profit
Country	United Kingdom
Start	01/2023
End	12/2024


Description	Combatting diet related non-communicable disease through enhanced surveillance
Amount	€ 11,717,708 (EUR)
Funding ID	101084642
Organisation	European Commission
Sector	Public
Country	Belgium
Start	01/2023
End	12/2026


Description	ELIXIR Strategic Implementation Studies - Federated Human Data (2019-21)
Amount	€ 3,750 (EUR)
Organisation	ELIXIR
Sector	Charity/Non Profit
Country	United Kingdom
Start	03/2020
End	12/2021


Description	University of Leicester PhD Studentship
Amount	£57,971 (GBP)
Organisation	University of Leicester
Sector	Academic/University
Country	United Kingdom
Start	08/2019
End	09/2022


Title	GWAS Central
Description	GWAS Central is a widely-used comprehensive collection of summary-level genetic association data. Within the last year we performed a significant update to the database, adding over 1.4K new studies and annotating the additional phenotype content with standardised vocabularies. A more complete description of the resource is included in a 2020 Nucleic Acids Research Database Issue manuscript (DOI: 10.1093/nar/gkz895).
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	Immediately following the update, the resource experienced an increase in user activity/requests. Locally, the expanded GWAS dataset has been used in our collaborative projects (see collaborations section). More generally, we have experienced an increase in study authors submitting their findings to GWAS Central as a means to maximise the impact of their work. In January 2020 GWAS Central was awarded ELIXIR-UK Node Service status.
URL	https://www.gwascentral.org/


Description	Imperial College London
Organisation	Imperial College London
Department	Department of Metabolism, Digestion and Reproduction
Country	United Kingdom
Sector	Academic/University
PI Contribution	Development of methods for generating programmatically accessible outputs from free-text scientific literature. Provide expertise on the Information Artifact Ontology (IAO) and the genome-wide association study (GWAS) natural language processing (NLP) use case. Co-lead on the project direction.
Collaborator Contribution	Development of methods for predicting publication sections and aligning to the IAO. Provide expertise on the metabolome-wide association study (MWAS) NLP use case. Co-lead on the project direction.
Impact	This work is being undertaken across several co-supervised postgraduate student projects (informatics, bioinformatics, biosciences). A preprint manuscript describes the status of this ongoing work at the start of 2021 (DOI: 10.1101/2021.01.08.425887).
Start Year	2020


Description	NIHR Leicester BRC
Organisation	National Institute for Health and Care Research
Department	NIHR Leicester Biomedical Research Centre
Country	United Kingdom
Sector	Public
PI Contribution	Development of strategies for the semantic integration of clinical research data across the BRC disease themes.
Collaborator Contribution	Provides access to N3 (latterly HSCN) and clinical research data. Defines the requirements and parameters of data harmonisation and integration.
Impact	Current outputs have included the development of application ontologies for collecting primary care data, local informatics strategy contributions to enable the cataloguing of cohort data, and establishment of local data discovery software to enable researchers to identify cohorts across disease themes. This work has involved a multi-disciplinary team of clinicians, bioinformaticians and NHS clinical informaticians.
Start Year	2018


Description	Pompeu Fabra University
Organisation	Pompeu Fabra University
Country	Spain
Sector	Academic/University
PI Contribution	NLP, text mining and ontology expertise related to GWAS publication text.
Collaborator Contribution	Text mining and database expertise related to extracting gene-disease and variant-disease associations from publication text.
Impact	The collaboration is ongoing through a funded ELIXIR implementation study (2022 - 2024).
Start Year	2022


Description	University of Wisconsin-Madison
Organisation	University of Wisconsin-Madison
Department	Department of Biochemistry
Country	United States
Sector	Academic/University
PI Contribution	Ontology-driven mapping of mouse traits to human phenotypes, and derivation of human GWAS SNP data associated with mapped mouse traits.
Collaborator Contribution	Identification of mouse QTL associated with lifestyle disease-related traits, and statistical analysis to determine if human loci that are syntenic to mouse QTL are enriched with lifestyle disease-related SNPs.
Impact	This work was undertaken by a multi-disciplinary team of mouse and human genetics researchers, bioinformaticians and biostatisticians. A manuscript describing this work is published in The Journal of Clinical Investigation (DOI: 10.1172/JCI129143).
Start Year	2018


Title	Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature
Description	Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications) is a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Full-text is converted to BioC-JSON format, publication tables are converted to tables-JSON format, and abbreviations declared within the publication are output in abbreviations-JSON format that relates abbreviations with full definitions.
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	This software is being used and developed within the text mining "Humangenphen" implementation study funded by ELIXIR (2022 - 2023). The software reduces the barrier to enabling text mining by standardising and optimising the text to be processed.


Title	TABoLiSTM (BERT-embedding) model
Description	TABoLiSTM weights and model files to run the code at https://github.com/omicsNLP/MetaboliteNER
Type Of Technology	Software
Year Produced	2022
Impact	The software has achieved state-of-the-art performance on metabolite named entity recognition (NER) and has been used by EU research projects for automatic dataset annotation.
URL	https://zenodo.org/record/6340002


Description	ELIXIR-UK Health Data Workshop 2021
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	I organised this workshop to bring together UK and international health data research expertise to explore the formation of a UK health data community which identifies with the goals of ELIXIR. This event was aimed at those working in health data within the UK, including those working in academic institutions, healthcare and life science companies. The event led to follow-up discussions from participants around engagement with the ELIXIR Health Data Communities (HDCs) and strategic health data collaborations.
Year(s) Of Engagement Activity	2021
URL	https://elixiruknode.org/elixir-uk-health-data-workshop-2021/


Description	Established and co-lead the ELIXIR Health Data Focus Group
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The ELIXIR Health Data Focus Group meets regularly and is open to all ELIXIR members. It aims to be the incubator of ideas relating to health data and is ELIXIR's formal channel to organise this. A range of topics around improving health data are discussed, including the use of semantics to harmonise clinical and research health datasets.
Year(s) Of Engagement Activity	2020, 2021
URL	https://elixir-europe.org/focus-groups/health-data


Description	European i2b2 Conference Scientific Committee
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Member of the Scientific Committee for the European i2b2 tranSMART Conference. After a presentation, questions and subsequent discussions, interest was expressed in our research activities and in our partnering on a major EU-funded project.
Year(s) Of Engagement Activity	2018
