Using semantics to leverage health and research Big Data

Lead Research Organisation: University of Leicester
Department Name: Genetics


Connecting health-related research big data allows datasets to be compared and larger sample or participant numbers to be identified for analysis. Big data pose integration challenges because of their complexity. Phenotype is a data type that shows particular variety and variability across clinical and non-clinical (e.g. 'omics) health research big data. The term "phenotype" is used for an aggregated set of medically and semantically distinct concepts such as a trait (e.g. blood glucose level), medical signs and symptoms (e.g. hyperglycemia), and disease (e.g. type 2 diabetes). To compare values across big data we need to know whether the values have the same meaning between datasets; the semantic rigour required to do this is provided by the use of "ontologies". Several ontologies describe overlapping phenotype domains but have been developed for different purposes: for example, SNOMED CT is used by the NHS and the Human Phenotype Ontology (HPO) is used by research databases. If datasets are coded to different ontologies, or to no ontology at all, they can be linked to a common ontology via a process of harmonisation. This involves mapping terms between different ontologies, and text mining (TM) of free text to associate the original values with ontology codes. Unfortunately, there are several barriers to this, such as gaps in the ontologies and in the publicly available ontology mappings, the need to adapt current TM approaches, which are optimised to perform well in clearly defined areas, and the need to scale current mapping and TM methods to work with big data.

During this fellowship I will create enhanced capabilities for connecting clinical and non-clinical research big data by using ontologies to harmonise phenotype data. This will involve bridging the gaps in current ontologies, and adapting current state-of-the-art TM and ontology mapping approaches so they are optimised for this context and can be applied to big data. The approaches I develop will be disease agnostic; however, in the first instance they will be applied to disease areas of local interest. The Leicester Biomedical Research Centre focuses on cardiovascular, respiratory and lifestyle diseases and encompasses datasets from primary and secondary care, and clinical research studies which include participant questionnaires and biological sample data. I will connect clinical research data within and between disease areas and with local and publicly available 'omics data, for example genome- and epigenome-wide association studies.

The study-specific and tiered opt-in consent completed by study participants can be incompatible or ambiguous when connecting data across multiple studies. This can prevent a harmonised clinical research dataset from being used to answer a new research question. Some projects have worked on developing consent ontologies, but there is currently no suitable consent ontology that fits with NHS guidance on collecting consent. Leicester already co-leads the Global Alliance for Genomics and Health efforts in this area, which I will help extend towards an ontology-based approach for representing NHS consents and data use conditions, to allow consent harmonisation in line with the requirements of new UK data protection laws.
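As a rough illustration of why consent harmonisation matters, the sketch below treats each study's consent as a set of permitted data uses and asks which uses survive when studies are combined. The study names and use labels are hypothetical placeholders, not terms from any existing consent ontology or from NHS guidance, and real consents carry conditions that a flat set cannot express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consent:
    """Hypothetical, simplified view of one study's consent."""
    study: str
    permitted_uses: frozenset


def shared_uses(consents):
    """Return the data uses permitted by every contributing study."""
    if not consents:
        return frozenset()
    common = set(consents[0].permitted_uses)
    for consent in consents[1:]:
        common &= consent.permitted_uses
    return frozenset(common)


def blocking_studies(consents, use):
    """List the studies whose consent does not cover the requested use."""
    return sorted(c.study for c in consents if use not in c.permitted_uses)


# Placeholder studies and use labels, for illustration only.
consents = [
    Consent("study_A", frozenset({"disease_specific_research", "commercial_use"})),
    Consent("study_B", frozenset({"disease_specific_research"})),
]

print(shared_uses(consents))
print(blocking_studies(consents, "commercial_use"))
```

An ontology-based representation would replace the plain string labels with terms that have defined relationships, so that a broad consent could be recognised as covering a narrower request.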

Harmonised datasets can be connected to public sources of standardised data to bridge the gap to translational research. One example is the cross-disciplinary collaborations that have mapped between human and mouse phenotype ontologies, allowing mouse disease models to be discovered for a collection of human phenotypic abnormalities. Where a disease does not have a known genetic cause, the ability to perform a cross-species phenotype comparison allows potential mouse gene-knockout models for the disease to be discovered. These cross-species mappings have been applied to public standardised databases, and I will investigate their utility with real-world health-related big data.

Technical Summary

The key aim of this research is to develop new methods for connecting clinical and non-clinical research big data. I will focus on harmonisation of two domains: "phenotype" due to its complexity, and "consent" due to its importance when making existing clinical data available for new purposes. The principles from this work will be applied to other domains, such as demographic and environmental data, to connect heterogeneous data of many dimensions.

To harmonise phenotype data I will bring together clinical (Read Codes/SNOMED CT and ICD 10) and research (HPO and MeSH) focussed ontologies with recent developments in ontology mapping and text mining, in order to code clinical research data with both clinical and research ontologies. This will provide the interface for connecting with NHS and non-clinical 'omics data. A range of state-of-the-art text mining tools (e.g. MetaMap, Bio-LarK, NCBO Annotator, cTAKES) will be assessed. Suitable tools will be optimised for concept selection and for computationally intensive processing of big data, exploiting scalable storage (e.g. NoSQL) and infrastructure provided by the Leicester HPC facility. The finding, gathering and harmonisation of disease-centric public sources of 'omics data from databases, websites and scientific literature will extend techniques we have developed for harmonising GWAS data and apply them to other types of 'omics, such as EWAS.
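The two harmonisation routes described above (mapping existing codes, and text mining free text) can be sketched in a few lines. This is a minimal illustration only: the small dictionaries stand in for real ontology mapping files and for a text mining tool, the function names are hypothetical, and the identifiers are illustrative and should be verified against current ontology releases.

```python
import re

# Illustrative mappings from clinical codes to a common ontology (HPO here).
# Identifiers are examples only; verify against current ontology releases.
CODE_TO_HPO = {
    ("SNOMEDCT", "44054006"): "HP:0005978",  # type 2 diabetes mellitus
    ("ICD10", "E11"): "HP:0005978",          # type 2 diabetes mellitus
    ("SNOMEDCT", "80394007"): "HP:0003074",  # hyperglycemia
}

# Tiny label/synonym dictionary standing in for a text mining tool.
LABEL_TO_HPO = {
    "type 2 diabetes": "HP:0005978",
    "hyperglycemia": "HP:0003074",
    "hyperglycaemia": "HP:0003074",
}


def harmonise_code(system, code):
    """Map a (coding system, code) pair to the common ontology, or None."""
    return CODE_TO_HPO.get((system.upper(), code))


def annotate_text(text):
    """Dictionary-based matching of free text against known labels."""
    normalised = re.sub(r"\s+", " ", text.lower())
    found = set()
    for label, hpo_id in LABEL_TO_HPO.items():
        if re.search(r"\b" + re.escape(label) + r"\b", normalised):
            found.add(hpo_id)
    return sorted(found)


def harmonise_record(record):
    """Combine mapped codes and text-mined terms for one record."""
    terms = set(annotate_text(record.get("notes", "")))
    for system, code in record.get("codes", []):
        mapped = harmonise_code(system, code)
        if mapped is not None:
            terms.add(mapped)
    return sorted(terms)


record = {"codes": [("ICD10", "E11")], "notes": "Patient has hyperglycaemia."}
print(harmonise_record(record))
```

Exact string matching of this kind misses negation, abbreviations and misspellings, which is part of why dedicated tools such as those listed above need to be assessed and adapted.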

I will investigate the utility of cross-species harmonisation of human phenotype terms to closely associated mouse phenotypes to answer translational research questions. This will make use of publicly available HPO to MP mappings. I will collaborate with the relevant mapping projects to leverage existing semantic matching methods with real-world big data.
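One way to picture the cross-species step is as a lookup from human phenotype terms to mouse phenotype terms, followed by ranking mouse models by how well their annotations match. The sketch below assumes a mapping table with a similarity score per pair; the scores, the model names and the `rank_models` helper are hypothetical, the identifiers are illustrative, and the scoring is a simple sum rather than any published semantic matching method.

```python
# Illustrative HPO -> MP mappings with an assumed similarity score (0 to 1).
# Identifiers and scores are examples only, not taken from a mapping release.
HPO_TO_MP = {
    "HP:0003074": [("MP:0001559", 1.0)],  # hyperglycemia
    "HP:0001513": [("MP:0001261", 0.9)],  # obesity -> obese
}

# Hypothetical mouse models annotated with MP terms.
MODEL_ANNOTATIONS = {
    "model_A": {"MP:0001559", "MP:0001261"},
    "model_B": {"MP:0001559"},
    "model_C": {"MP:9999999"},  # placeholder for an unrelated phenotype
}


def rank_models(hpo_terms, min_score=0.5):
    """Rank mouse models by summed mapping scores for the given HPO terms."""
    # Keep the best score seen for each mapped MP term.
    mp_scores = {}
    for hpo_id in hpo_terms:
        for mp_id, score in HPO_TO_MP.get(hpo_id, []):
            if score >= min_score:
                mp_scores[mp_id] = max(score, mp_scores.get(mp_id, 0.0))

    ranked = []
    for model, annotations in MODEL_ANNOTATIONS.items():
        total = sum(mp_scores.get(mp_id, 0.0) for mp_id in annotations)
        if total > 0:
            ranked.append((model, total))
    # Highest score first; ties broken alphabetically by model name.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


for model, score in rank_models({"HP:0003074", "HP:0001513"}):
    print(model, round(score, 2))
```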


Title: GWAS-Central
Description: GWAS Central is a widely-used comprehensive collection of summary-level genetic association data. Within the last year we performed a significant update to the database, adding over 1.4K new studies and annotating the additional phenotype content with standardised vocabularies.
Type Of Material: Database/Collection of data
Year Produced: 2018
Provided To Others?: Yes
Impact: Immediately following the update, the resource experienced an increase in user activity/requests. Locally, the expanded GWAS dataset has been used in our collaborative projects (see collaborations section). More generally, we have experienced an increase in study authors submitting their findings to GWAS Central as a means to maximise the impact of their work.

Description: Leicester-BRC
Organisation: National Institute for Health Research
Department: NIHR Leicester Biomedical Research Centre
Country: United Kingdom
Sector: Public
PI Contribution: Development of strategies for the semantic integration of clinical research data across the BRC disease themes.
Collaborator Contribution: Provides access to N3 and clinical research data. Defines the requirements and parameters of data harmonisation and integration.
Impact: Within the first year of collaboration, outputs have included the development of application ontologies for collecting primary care data, local informatics strategy contributions to enable the cataloguing of cohort data, and establishment of local data discovery software to enable researchers to identify cohorts across disease themes. This work has involved a multi-disciplinary team of clinicians, bioinformaticians and NHS clinical informaticians.
Start Year: 2018

Description: MRC-Tox
Organisation: Medical Research Council (MRC)
Department: MRC Toxicology Unit
Country: United Kingdom
Sector: Public
PI Contribution: Development of new methods for semantically integrating, displaying and interrogating translational profile data. Co-lead on the project direction.
Collaborator Contribution: Provides translational status data of human mRNAs under differing conditions, and defines the community requirements for a system to share translational profiles. Co-lead on the project direction.
Impact: This is the first year of collaboration in which we are testing new approaches to integrating and publicly sharing translational profile data.
Start Year: 2018

Description: Wisconsin-Keller
Organisation: University of Wisconsin-Madison
Department: Department of Biochemistry
Country: United States
Sector: Academic/University
PI Contribution: Ontology-driven mapping of mouse traits to human phenotypes, and derivation of human GWAS SNP data associated with mapped mouse traits.
Collaborator Contribution: Identification of mouse QTL associated with lifestyle disease-related traits, and statistical analysis to determine if human loci that are syntenic to mouse QTL are enriched with lifestyle disease-related SNPs.
Impact: A manuscript describing this work is under review. This work was undertaken by a multi-disciplinary team of mouse and human genetics researchers, bioinformaticians and biostatisticians.
Start Year: 2018

Description: i2b2-EU
Form Of Engagement Activity: Participation in an activity, workshop or similar
Part Of Official Scheme?: No
Geographic Reach: International
Primary Audience: Professional Practitioners
Results and Impact: Member of the Scientific Committee for the European i2b2 tranSMART Conference. After a presentation, questions and subsequent discussions, interest was expressed in our research activities and us partnering a major EU-funded project.
Year(s) Of Engagement Activity: 2018