Probabilistic knowledge representation of big data for efficient integration and useful inference in bio-medicine

Lead Research Organisation: University of Manchester

Department Name: School of Health Sciences

Abstract

Background: To realise the opportunities offered by big data in biomedicine, datasets must be
combined, integrated, and made available for cross-cutting research. To achieve this,
many issues stemming from the inherent heterogeneity of datasets, such as the use of different data
models, naming conventions and levels of abstraction, must be overcome. Traditional approaches
aimed at improving compatibility of discrete datasets have focused on establishing universal
standards for data representation at the source; for example, by using standard vocabularies or
ontologies to consistently and unambiguously represent concepts of interest. However, while such
approaches are ideal in theory, they have had few successes in practice: universal standards are
poorly adopted in the biomedical community. Conventional integration methods, the alternative, are
laborious and usually reduce dimensionality and lose information in the integrated
dataset. We propose the development of a Bayesian approach to harmonisation and data
integration, allowing for more inclusive and flexible knowledge representation.

Hypothesis: Novel methods for data integration, adopting a probabilistic approach, can improve the
inference from heterogeneous datasets for biomedical research.

Objectives:
1. Review existing literature on probabilistic learning/mining methodologies to identify potential
starting points for this project.
2. Extend and/or develop methods for generic data description and visualisation, to assist in rapid
exploration of new datasets.
3. Develop a Bayesian approach for knowledge representation and integration.
4. Evaluate performance, accuracy and limitations, and compare to existing integration methods.
5. Apply newly developed methods to real-world datasets in biomedicine and health such as
cancer clinical trials and studies in Lupus.

Methods: This project requires the development, application, and evaluation of computational
methods in machine learning, data processing, and knowledge representation. The student will also
be exposed to methods for data analysis and data mining in the biomedical domain. The eLab platform
developed by Farr@HeRC, adopted by multiple projects to integrate data gathered across
international studies, will be used as a testbed for the developed methods.

Outcomes/Impact:
1. A new framework for efficient, probabilistic data integration, with an evaluation of its
performance, accuracy and limitations.
2. New approaches for knowledge representation in the biomedical domain, with potential for wider adoption in other fields.
3. One or more applied use cases that demonstrate the utility of the developed methodologies.

Training:
Dr Niels Peek - Informatics and machine learning input.
Dr Nophar Geifman - Informatics, knowledge representation and biomedical input.
Dr Philip Couch - Information sciences input.

The student will sit in the MRC-funded Farr@HeRC, with exposure to a range of informatics,
statistics, clinical, and epidemiological expertise. The student will belong to the HeRC Doctoral Training
Network and will have the opportunity to receive training through it, as well as other in-house
training such as CPD programmes and MSc Health Data Science modules as required.

Sep 17 - Sep 21

Funder:

EPSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

1949035

Research Topic:

Unclassified

Organisations

University of Manchester (Lead Research Organisation)

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/N509565/1			30/09/2016	29/09/2021
1949035	Studentship	EP/N509565/1	30/09/2017	29/09/2021

Key Findings


Description	Existing methods for dataset integration rely on mapping to common data models, often resulting in a substantial loss of information that is present in the source datasets. One promising alternative relies on probabilistic methodologies. Our research findings have illustrated this approach using a real-world example from Lupus cohort studies. Rather than relying on perfectly harmonised data items, our method propagates the uncertainty that results from imperfect harmonisation into the statistical analysis, thus obviating the need for data integration through a common data model. This work is due to be published in the Medical Informatics Europe 2020 proceedings.
Exploitation Route	New approaches for knowledge representation in the biomedical domain, with potential for wider adoption in other fields.
Sectors	Healthcare, Other
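
The idea in the Key Findings description, carrying the uncertainty of an imperfect harmonisation through to the statistical analysis, can be illustrated with a minimal sketch. This is not the project's published method. It assumes a simple Monte Carlo scheme in which each source code maps to a target concept with some probability. The mapping table `P_ACTIVE`, the source codes, the record counts, and the function names are all hypothetical and invented for illustration only.

```python
import random
import statistics

# Hypothetical mapping (invented numbers): probability that each source code
# corresponds to the target concept "active disease".
P_ACTIVE = {"flare": 0.95, "mild": 0.40, "remission": 0.05}


def sample_prevalence(records, p_active, rng):
    """Draw one plausible harmonised dataset; return the share of active cases."""
    draws = [rng.random() < p_active[code] for code in records]
    return sum(draws) / len(draws)


def propagate(records, p_active, n_draws=2000, seed=0):
    """Repeat the draw so mapping uncertainty appears in the final estimate."""
    rng = random.Random(seed)
    estimates = sorted(
        sample_prevalence(records, p_active, rng) for _ in range(n_draws)
    )
    lower = estimates[int(0.025 * n_draws)]
    upper = estimates[int(0.975 * n_draws) - 1]
    return statistics.mean(estimates), (lower, upper)


# Hypothetical source dataset: 100 records coded in the source vocabulary.
records = ["flare"] * 30 + ["mild"] * 50 + ["remission"] * 20
mean, interval = propagate(records, P_ACTIVE)
print(f"estimated prevalence: {mean:.3f}, 95% interval: {interval}")
```

Instead of forcing every record into a single harmonised value, the sketch reports an estimate together with an interval that reflects how uncertain the mapping is. A fully Bayesian treatment would also place priors on the mapping probabilities themselves, which this sketch does not attempt.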