Using semantics to leverage health and research Big Data
Lead Research Organisation: University of Leicester
Department Name: Genetics
Abstract
Connecting health-related research big data allows datasets to be compared and larger sample and participant sizes to be assembled for analysis. Big data pose integration challenges because of their complexity. Phenotype is a data type that shows particular variety and variability across clinical and non-clinical (e.g. 'omics) health research big data. The term "phenotype" is used for an aggregated set of medically and semantically distinct concepts such as a trait (e.g. blood glucose level), medical signs and symptoms (e.g. hyperglycemia), and disease (e.g. type 2 diabetes). To compare values across big data we need to know whether the values have the same meaning between datasets, and the semantic rigour required to do this is provided by "ontologies". Several ontologies describe overlapping phenotype domains but have been developed for different purposes; for example, SNOMED CT is used by the NHS and the Human Phenotype Ontology (HPO) is used by research databases. If datasets are coded to different ontologies, or to no ontology at all, they can be linked to a common ontology via a process of harmonisation. This involves mapping terms between ontologies, and text mining (TM) of free text to associate the original values with ontology codes. Unfortunately, there are several barriers to this: gaps in the ontologies and in the publicly available ontology mappings, the need to adapt current TM approaches, which are optimised to perform well in clearly defined areas, and the need to scale current mapping and TM methods to work with big data.
During this fellowship I will create enhanced capabilities for connecting clinical and non-clinical research big data by using ontologies to harmonise phenotype data. This will involve bridging the gaps in current ontologies, and adapting current state-of-the-art TM and ontology mapping approaches so that they are optimised for this context and can be applied to big data. The approaches I develop will be disease agnostic; however, in the first instance they will be applied to disease areas of local interest. The Leicester Biomedical Research Centre focuses on cardiovascular, respiratory and lifestyle diseases and encompasses datasets from primary and secondary care, and clinical research studies which include participant questionnaires and biological sample data. I will connect clinical research data within and between disease areas and with local and publicly available 'omics data, for example genome- and epigenome-wide association studies.
The study-specific and tiered opt-in consent completed by study participants can be incompatible or ambiguous when connecting data across multiple studies. This blocks a harmonised clinical research dataset being used to answer a new research question. Some projects have worked on developing consent ontologies, but there is not currently a suitable consent ontology that fits with NHS guidance on collecting consent. Leicester already co-leads the Global Alliance for Genomics and Health efforts in this area, which I will help extend towards an ontology-based approach for representing NHS consents and data use conditions, to allow consent harmonisation in line with the requirements of new UK data protection laws.
Harmonised datasets can be connected to public sources of standardised data to bridge the gap to translational research. One example is the cross-disciplinary collaborations that have mapped between human and mouse phenotype ontologies, to allow the discovery of mouse disease models for a collection of human phenotypic abnormalities. Where a disease does not have a known genetic cause, the ability to perform a cross-species phenotype comparison allows potential mouse gene-knockout models for the disease to be discovered. These cross-species mappings have been applied to public standardised databases, and I will investigate their utility with real-world health-related big data.
Technical Summary
The key aim of this research is to develop new methods for connecting clinical and non-clinical research big data. I will focus on harmonisation of two domains: "phenotype" due to its complexity, and "consent" due to its importance when making existing clinical data available for new purposes. The principles from this work will be applied to other domains, such as demographic and environmental data, to connect heterogeneous data of many dimensions.
To harmonise phenotype data I will bring together clinical (Read Codes/SNOMED CT and ICD-10) and research (HPO and MeSH) focussed ontologies with recent developments in ontology mapping and text mining, in order to code clinical research data with both clinical and research ontologies. This will provide the interface for connecting with NHS and non-clinical 'omics data. A range of state-of-the-art text mining tools (e.g. MetaMap, Bio-LarK, NCBO Annotator, cTAKES) will be assessed. Suitable tools will be optimised for concept selection and for computationally intensive processing of big data, exploiting scalable storage (e.g. NoSQL) and infrastructure provided by the Leicester HPC facility. The finding, gathering and harmonisation of disease-centric public sources of 'omics data from databases, websites and scientific literature will extend techniques we have developed for harmonising GWAS data and apply them to other types of 'omics, such as EWAS.
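As a rough illustration of coding free text with both a research and a clinical ontology, the sketch below uses a tiny dictionary lookup followed by a cross-ontology mapping table. It is not one of the tools named above (MetaMap, Bio-LarK, NCBO Annotator, cTAKES), which use far richer methods; the lexicon, the mapping table and the function names `annotate` and `dual_code` are assumptions made for this example, and the identifiers shown should be verified against current HPO and SNOMED CT releases.

```python
import re

# Illustrative lexicon: surface form -> HPO identifier.
# Assumption: a real pipeline would load labels and synonyms from an ontology release.
HPO_LEXICON = {
    "hyperglycemia": "HP:0003074",
    "hyperglycaemia": "HP:0003074",
    "type 2 diabetes": "HP:0005978",
}

# Illustrative cross-ontology mapping: HPO identifier -> SNOMED CT concept id.
# Assumption: real mappings would come from a curated mapping resource.
HPO_TO_SNOMED = {
    "HP:0003074": "80394007",
    "HP:0005978": "44054006",
}


def annotate(text, lexicon=HPO_LEXICON):
    """Return (start, end, matched_text, code) tuples, ordered by position in the text."""
    hits = []
    for term, code in lexicon.items():
        # Whole-word, case-insensitive match of each lexicon term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), code))
    return sorted(hits)


def dual_code(text):
    """Code free text with a research (HPO) and, where mapped, a clinical (SNOMED CT) id."""
    return [
        {"text": matched, "hpo": code, "snomed_ct": HPO_TO_SNOMED.get(code)}
        for _, _, matched, code in annotate(text)
    ]


if __name__ == "__main__":
    for row in dual_code("Patient has Type 2 diabetes with persistent hyperglycaemia."):
        print(row)
```

A term with no entry in the mapping table is still returned, with `snomed_ct` set to `None`, so gaps in the mappings stay visible rather than being silently dropped.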
I will investigate the utility of cross-species harmonisation of human phenotype terms to closely associated mouse phenotypes to answer translational research questions. This will make use of publicly available HPO to MP mappings. I will collaborate with the relevant mapping projects to leverage existing semantic matching methods with real-world big data.
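To make the cross-species idea concrete, the following sketch ranks candidate mouse models against a set of human phenotype terms using a table of HPO to MP mappings. Everything in it is hypothetical: the identifiers are placeholders rather than real ontology terms, the mapping scores are invented, and the scoring rule (sum of the best mapping score per human term) is a simple stand-in for the semantic matching methods used by the mapping projects.

```python
# Hypothetical HPO -> MP mappings with a similarity score in [0, 1].
# Placeholder identifiers only; real mappings would be loaded from a published resource.
HP_TO_MP = {
    "HP:demo1": [("MP:demo1", 1.0), ("MP:demo2", 0.5)],
    "HP:demo2": [("MP:demo3", 0.75)],
}

# Hypothetical mouse models annotated with MP terms.
MODEL_ANNOTATIONS = {
    "knockout_A": {"MP:demo1", "MP:demo3"},
    "knockout_B": {"MP:demo2"},
    "knockout_C": {"MP:demo9"},
}


def rank_models(human_terms, hp_to_mp=HP_TO_MP, model_annotations=MODEL_ANNOTATIONS):
    """Rank mouse models by summed best-mapping score across the human terms.

    Models sharing no mapped phenotype with the human term set are omitted.
    """
    scores = {}
    for model, mp_terms in model_annotations.items():
        total = 0.0
        for hp in human_terms:
            # Best score among mappings of this human term that the model exhibits.
            best = max(
                (score for mp, score in hp_to_mp.get(hp, []) if mp in mp_terms),
                default=0.0,
            )
            total += best
        if total > 0:
            scores[model] = total
    # Highest score first; ties broken alphabetically by model name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for model, score in rank_models(["HP:demo1", "HP:demo2"]):
        print(model, score)
```

With real-world data the human term set would typically come from harmonised clinical records rather than a curated disease profile, which is where the utility of the mappings remains to be tested.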
People | ORCID iD
Tim Beck (Principal Investigator / Fellow) |
Publications
Beck T (2023). GWAS Central: an expanding resource for finding and visualising genotype and phenotype data from genome-wide association studies. Nucleic Acids Research.
Beck T (2022). Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature. Frontiers in Digital Health.
Beck T (2020). GWAS Central: a comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. Nucleic Acids Research.
Keller MP (2019). Gene loci associated with insulin secretion in islets from non-diabetic mice. The Journal of Clinical Investigation.
Price TR (2023). Identification of genetic drivers of plasma lipoprotein size in the Diversity Outbred mouse population. Journal of Lipid Research.
Rambla J (2022). Beacon v2 and Beacon networks: A "lingua franca" for federated data discovery in biomedical genomics, and beyond. Human Mutation.
Rehm HL (2021). GA4GH: International policies and standards for data sharing across genomic research and healthcare. Cell Genomics.
Wang M (2024). Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. Journal of Proteome Research.
Description | 2021-hCNVexchange ELIXIR Community-led Implementation Study |
Amount | € 232,782 (EUR) |
Organisation | ELIXIR |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 05/2021 |
End | 05/2023 |
Description | 2022-Humanphengen ELIXIR Data Implementation Study |
Amount | € 166,750 (EUR) |
Organisation | ELIXIR |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 01/2022 |
End | 12/2023 |
Description | 2023-MLstandards ELIXIR Commissioned Services Implementation Study |
Amount | € 243,658 (EUR) |
Organisation | ELIXIR |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 01/2023 |
End | 12/2024 |
Description | Combatting diet related non-communicable disease through enhanced surveillance |
Amount | € 11,717,708 (EUR) |
Funding ID | 101084642 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 01/2023 |
End | 12/2026 |
Description | ELIXIR Strategic Implementation Studies - Federated Human Data (2019-21) |
Amount | € 3,750 (EUR) |
Organisation | ELIXIR |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 03/2020 |
End | 12/2021 |
Description | University of Leicester PhD Studentship |
Amount | £57,971 (GBP) |
Organisation | University of Leicester |
Sector | Academic/University |
Country | United Kingdom |
Start | 08/2019 |
End | 09/2022 |
Title | GWAS Central |
Description | GWAS Central is a widely-used comprehensive collection of summary-level genetic association data. Within the last year we performed a significant update to the database, adding over 1.4K new studies and annotating the additional phenotype content with standardised vocabularies. A more complete description of the resource is included in a 2020 Nucleic Acids Research Database Issue manuscript (DOI: 10.1093/nar/gkz895). |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Impact | Immediately following the update, the resource experienced an increase in user activity/requests. Locally, the expanded GWAS dataset has been used in our collaborative projects (see collaborations section). More generally, we have experienced an increase in study authors submitting their findings to GWAS Central as a means to maximise the impact of their work. In January 2020 GWAS Central was awarded ELIXIR-UK Node Service status. |
URL | https://www.gwascentral.org/ |
Description | Imperial College London |
Organisation | Imperial College London |
Department | Department of Metabolism, Digestion and Reproduction |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Development of methods for generating programmatically accessible outputs from free-text scientific literature. Provide expertise on the Information Artifact Ontology (IAO) and the genome-wide association study (GWAS) natural language processing (NLP) use case. Co-lead on the project direction. |
Collaborator Contribution | Development of methods for predicting publication sections and aligning to the IAO. Provide expertise on the metabolome-wide association study (MWAS) NLP use case. Co-lead on the project direction. |
Impact | This work is being undertaken across several co-supervised postgraduate student projects (informatics, bioinformatics, biosciences). A preprint manuscript describes the status of this ongoing work at the start of 2021 (DOI: 10.1101/2021.01.08.425887). |
Start Year | 2020 |
Description | NIHR Leicester BRC |
Organisation | National Institute for Health Research |
Department | NIHR Leicester Biomedical Research Centre |
Country | United Kingdom |
Sector | Public |
PI Contribution | Development of strategies for the semantic integration of clinical research data across the BRC disease themes. |
Collaborator Contribution | Provides access to N3 (latterly HSCN) and clinical research data. Defines the requirements and parameters of data harmonisation and integration. |
Impact | Current outputs have included the development of application ontologies for collecting primary care data, local informatics strategy contributions to enable the cataloguing of cohort data, and establishment of local data discovery software to enable researchers to identify cohorts across disease themes. This work has involved a multi-disciplinary team of clinicians, bioinformaticians and NHS clinical informaticians. |
Start Year | 2018 |
Description | Pompeu Fabra University |
Organisation | Pompeu Fabra University |
Country | Spain |
Sector | Academic/University |
PI Contribution | NLP, text mining and ontology expertise related to GWAS publication text. |
Collaborator Contribution | Text mining and database expertise related to extracting gene-disease and variant-disease associations from publication text. |
Impact | The collaboration is ongoing through a funded ELIXIR implementation study (2022 - 2023). |
Start Year | 2022 |
Description | University of Wisconsin-Madison |
Organisation | University of Wisconsin-Madison |
Department | Department of Biochemistry |
Country | United States |
Sector | Academic/University |
PI Contribution | Ontology-driven mapping of mouse traits to human phenotypes, and derivation of human GWAS SNP data associated with mapped mouse traits. |
Collaborator Contribution | Identification of mouse QTL associated with lifestyle disease-related traits, and statistical analysis to determine if human loci that are syntenic to mouse QTL are enriched with lifestyle disease-related SNPs. |
Impact | This work was undertaken by a multi-disciplinary team of mouse and human genetics researchers, bioinformaticians and biostatisticians. A manuscript describing this work is published in The Journal of Clinical Investigation (DOI: 10.1172/JCI129143). |
Start Year | 2018 |
Title | Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature |
Description | Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications) is a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Full-text is converted to BioC-JSON format, publication tables are converted to tables-JSON format, and abbreviations declared within the publication are output in abbreviations-JSON format that relates abbreviations with full definitions. |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | This software is being used and developed within the text mining "Humangenphen" implementation study funded by ELIXIR (2022 - 2023). The software reduces the barrier to enabling text mining by standardising and optimising the text to be processed. |
Title | TABoLiSTM (BERT-embedding) model |
Description | TABoLiSTM weights and model files to run the code at https://github.com/omicsNLP/MetaboliteNER |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
URL | https://zenodo.org/record/6340001 |
Description | ELIXIR-UK Health Data Workshop 2021 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | I organised this workshop to bring together UK and international health data research expertise to explore the formation of a UK health data community which identifies with the goals of ELIXIR. This event was aimed at those working in health data within the UK, including those working in academic institutions, healthcare and life science companies. The event led to follow-up discussions from participants around engagement with the ELIXIR Health Data Communities (HDCs) and strategic health data collaborations. |
Year(s) Of Engagement Activity | 2021 |
URL | https://elixiruknode.org/elixir-uk-health-data-workshop-2021/ |
Description | Established and co-lead the ELIXIR Health Data Focus Group |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The ELIXIR Health Data Focus Group meets regularly and is open to all ELIXIR members. It aims to be the incubator of ideas relating to health data and is ELIXIR's formal channel to organise this. A range of topics around improving health data are discussed, including the use of semantics to harmonise clinical and research health datasets. |
Year(s) Of Engagement Activity | 2020, 2021 |
URL | https://elixir-europe.org/focus-groups/health-data |
Description | European i2b2 Conference Scientific Committee |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Member of the Scientific Committee for the European i2b2 tranSMART Conference. After a presentation, questions and subsequent discussions, interest was expressed in our research activities and in our partnering on a major EU-funded project. |
Year(s) Of Engagement Activity | 2018 |