Probabilistic knowledge representation of big data for efficient integration and useful inference in bio-medicine

Lead Research Organisation: University of Manchester
Department Name: School of Health Sciences


Background: To realise the opportunities offered by big data in biomedicine, datasets must be
combined, integrated, and made available for cross-cutting research. Achieving this means
overcoming many issues that stem from the inherent heterogeneity of datasets, such as the use of
different data models, naming conventions and levels of abstraction. Traditional approaches
aimed at improving the compatibility of discrete datasets have focused on establishing universal
standards for data representation at the source; for example, by using standard vocabularies or
ontologies to represent concepts of interest consistently and unambiguously. However, while such
approaches are ideal in theory, there are few successful examples in practice, and universal
standards remain poorly adopted in the biomedical community. Conventional integration methods,
in turn, are laborious and usually lead to a reduction in dimensionality and a loss of information
in the integrated dataset. We propose the development of a Bayesian approach to harmonisation and
data integration, allowing for more inclusive and flexible knowledge representation.

Hypothesis: Novel methods for data integration that adopt a probabilistic approach can improve
inference from heterogeneous datasets for biomedical research.

Aims:
1. Review the existing literature on probabilistic learning/mining methodologies to identify potential
starting points for this project.
2. Extend and/or develop methods for generic data description and visualisation, to assist in rapid
exploration of new datasets.
3. Develop a Bayesian approach for knowledge representation and integration.
4. Evaluate performance, accuracy and limitations, and compare to existing integration methods.
5. Apply newly developed methods to real-world datasets in biomedicine and health such as
cancer clinical trials and studies in Lupus.
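
As a purely illustrative sketch of the simplest form a Bayesian approach to harmonisation (aim 3) might take, Bayes' rule can assign a probability to each candidate mapping between a source field and a set of target concepts, rather than committing to a single mapping. This is not the project's method. The field name, candidate concepts, priors and likelihoods below are all invented for illustration, and a real system would need to estimate them from data or expert input.

```python
def mapping_posterior(priors, likelihoods):
    """Return P(concept | evidence) for each candidate target concept.

    priors:      dict concept -> P(concept), prior belief in each mapping
    likelihoods: dict concept -> P(evidence | concept)
    Both inputs are hypothetical; nothing here is estimated from real data.
    """
    if set(priors) != set(likelihoods):
        raise ValueError("priors and likelihoods must cover the same concepts")
    # Bayes' rule: posterior is proportional to prior times likelihood.
    unnormalised = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every candidate")
    return {c: p / total for c, p in unnormalised.items()}


# Hypothetical example: a source column named "bp" could hold systolic or
# diastolic blood pressure, or something unrelated.
priors = {"systolic_bp": 0.45, "diastolic_bp": 0.45, "other": 0.10}
# Assumed likelihood of the observed value range under each candidate.
likelihoods = {"systolic_bp": 0.80, "diastolic_bp": 0.10, "other": 0.05}

posterior = mapping_posterior(priors, likelihoods)
for concept, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{concept}: {prob:.3f}")
```

Keeping the full posterior, instead of only the most likely mapping, is what would allow the remaining uncertainty to be carried into later analysis.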

Methods: This project requires the development, application, and evaluation of computational
methods in machine learning, data processing, and knowledge representation. The student will also
be exposed to methods for data analysis and data mining in the biomedical domain. The eLab
platform, developed by Farr@HeRC and adopted by multiple projects to integrate data gathered across
international studies, will be used as a testbed for the developed methods.

Expected outcomes:
1. A new framework for efficient, probabilistic data integration, with an evaluation of its
performance, accuracy and limitations.
2. New approaches for knowledge representation in the biomedical domain, with potential for wider
adoption in other fields.
3. One or more applied use cases that demonstrate the utility of the developed methodologies.

Dr Niels Peek - Informatics and machine learning input.
Dr Nophar Geifman - Informatics, knowledge representation and biomedical input.
Dr Philip Couch - Information sciences input.

The student will sit in the MRC-funded Farr@HeRC, with exposure to a range of informatics,
statistical, clinical, and epidemiological expertise. The student will belong to the HeRC Doctoral
Training Network and will have the opportunity to receive training through it, as well as other
in-house training such as CPD programmes and MSc Health Data Science modules as required.



Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/N509565/1                                   01/10/2016  30/09/2021
1949035            Studentship   EP/N509565/1  01/10/2017  30/09/2021  Alexia Sampri
Description: Existing methods for dataset integration rely on mapping to common data models, often resulting in a substantial loss of information that is present in the source datasets. One promising alternative relies on probabilistic methodologies. Our research findings have illustrated this approach using a real-world example from Lupus cohort studies. Rather than relying on perfectly harmonised data items, our method propagates the uncertainty that results from imperfect harmonisation into the statistical analysis, thus obviating the need for data integration through a common data model.

This work is due to be published in the proceedings of Medical Informatics Europe 2020.
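
To make the idea of propagating harmonisation uncertainty into an analysis more concrete, the sketch below uses a generic Monte Carlo technique. It is not the method from the research described above, only a minimal illustration under stated assumptions: some records have an unambiguous harmonised binary value, while others carry only a probability that the value is 1. The function name, the inputs and the example numbers are all hypothetical.

```python
import random
import statistics


def pooled_proportion(certain, uncertain_probs, n_draws=2000, seed=0):
    """Estimate a proportion while propagating harmonisation uncertainty.

    certain:         list of 0/1 values whose harmonisation is unambiguous
    uncertain_probs: list of P(value == 1) for records whose harmonised
                     value is uncertain (hypothetical inputs)
    Returns (mean, standard deviation) of the proportion across draws.
    The spread reflects harmonisation uncertainty only, not sampling error.
    """
    n = len(certain) + len(uncertain_probs)
    if n == 0:
        raise ValueError("no records supplied")
    rng = random.Random(seed)
    base = sum(certain)
    estimates = []
    for _ in range(n_draws):
        # Draw one plausible harmonised value for each uncertain record.
        sampled = sum(1 for p in uncertain_probs if rng.random() < p)
        estimates.append((base + sampled) / n)
    return statistics.mean(estimates), statistics.pstdev(estimates)


# Invented example: eight unambiguous records and four uncertain ones.
certain = [1, 0, 0, 1, 0, 0, 0, 1]
uncertain_probs = [0.9, 0.6, 0.5, 0.2]
mean, spread = pooled_proportion(certain, uncertain_probs)
print(f"estimated proportion: {mean:.3f} (spread from harmonisation: {spread:.3f})")
```

The design choice illustrated here is that uncertain records are neither discarded nor forced to a single value by thresholding; each draw is one plausible harmonised dataset, and the variation across draws shows how much the result depends on imperfect harmonisation.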
Exploitation Route: New approaches for knowledge representation in the biomedical domain, with potential for wider adoption in other fields.
Sectors: Healthcare, Other