Latent feature models for multi-omic data analysis

Lead Research Organisation: University of Oxford

Abstract

Prostate cancer is a heterogeneous disease, displaying a multitude of genetic alterations, histological patterns and clinical outcomes. This heterogeneity has confounded our ability to identify subtypes or signatures of aggressive disease, and as such clinical decision making in prostate cancer is still informed only through histopathological grading and biochemical markers. Analysis of data obtained through DNA sequencing and other 'omics technologies has revealed patterns that could be used to inform diagnosis, prognosis and treatment, but these are difficult to define in a clear and consistent way. Recently, several countries have joined together to form the Pan Prostate Cancer Group (PPCG), pooling resources and harmonising bioinformatics pipelines so their data can be analysed together. This data set consists of samples from approximately 2000 men and includes data from whole genome sequencing, RNA sequencing, methylation arrays, histopathological features, and clinical variables.
The data is now of such a scale that machine learning methods could potentially provide deep insights by identifying underlying patterns, but there are a number of requirements that are difficult to fulfil with conventional methods. In particular:
1) Integration of data from multiple sources across several countries, which may have different biases and scales
2) Ability to deal with large amounts of missing data, as not all data sources are available for all patients
3) Retaining interpretable links to the underlying biology so clinicians and patients can understand the rationale behind any computational output
4) Identification of patterns that are linked to aggressive disease or reflect disrupted biological processes that could be targeted therapeutically

We therefore propose to develop novel machine learning approaches that fulfil these requirements and to apply them to the PPCG data set. Our methodology will focus on extracting latent features that encapsulate relationships both within and between data from different sources, and on presenting them in an interpretable form. We will then evaluate the applicability of feature scoring methods to our approaches and develop methods to circumvent any shortcomings. The patient data can then be represented in terms of these latent features, and this representation will form the basis for further analysis.
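As a purely illustrative sketch of what latent feature extraction under these requirements can look like, the example below standardises each data source separately, concatenates the sources, and fits a low-rank factorisation using only the observed entries. It is a generic masked ridge alternating-least-squares routine, not the project's method, which the abstract does not specify. The function names, the simulated "rna" and "meth" blocks, and all parameter values are assumptions made for the example.

```python
import numpy as np


def standardise_block(X):
    """Z-score each feature using observed (non-NaN) entries only,
    so that sources measured on different scales become comparable."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X - mean) / std


def masked_factorisation(X, n_factors=2, n_iter=50, ridge=0.1, seed=0):
    """Approximate X by Z @ W.T, fitting observed entries only.

    Z holds per-patient latent features; W holds per-feature loadings,
    which can be inspected to link each latent feature to measurements.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)
    X0 = np.where(mask, X, 0.0)
    n, p = X.shape
    Z = rng.normal(size=(n, n_factors))
    W = rng.normal(size=(p, n_factors))
    penalty = ridge * np.eye(n_factors)
    for _ in range(n_iter):
        for j in range(p):  # update loadings from patients with feature j
            Zo = Z[mask[:, j]]
            W[j] = np.linalg.solve(Zo.T @ Zo + penalty, Zo.T @ X0[mask[:, j], j])
        for i in range(n):  # update patient i from its observed features
            Wo = W[mask[i]]
            Z[i] = np.linalg.solve(Wo.T @ Wo + penalty, Wo.T @ X0[i, mask[i]])
    return Z, W


def demo(seed=1):
    """Simulate two data sources sharing two latent features, then fit."""
    rng = np.random.default_rng(seed)
    n = 60
    Z_true = rng.normal(size=(n, 2))
    rna = Z_true @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(n, 8))
    meth = 100.0 * (Z_true @ rng.normal(size=(2, 6))) + 10.0 * rng.normal(size=(n, 6))
    meth[:15] = np.nan  # the first 15 patients have no second data source
    X = np.hstack([standardise_block(rna), standardise_block(meth)])
    Z, W = masked_factorisation(X, n_factors=2)
    mask = ~np.isnan(X)
    rmse = np.sqrt(np.mean((X - Z @ W.T)[mask] ** 2))
    return Z, W, rmse
```

Calling `demo()` returns a latent representation for every simulated patient, including those missing an entire data source, together with the reconstruction error on the observed entries. Real multi-omic data would need source-specific noise models and far more care than this sketch provides.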

As we experience diminishing returns from data sets on the scale that can be generated by individual groups, we will inevitably see a rise in pooled data sets like that from the PPCG. These provide greater sample numbers and also increase the breadth of available data types, which could contain information leading to deep insights. However, the processing and analysis of this data also present unique challenges. In this project we will design methods that address these challenges and open up this data for analysis. As we focus on interpretability, any findings should be amenable to translation to clinical use. We will create our methodology with the expectation that it should be applicable to data from other cancer types. This project falls within the EPSRC Artificial Intelligence and Robotics research area and will run in conjunction with projects funded by Prostate Cancer Research, Cancer Research UK and Prostate Cancer UK.

Planned Impact

In the same way that bioinformatics has transformed genomic research and clinical practice, health data science will have a dramatic and lasting impact upon the broader fields of medical research, population health, and healthcare delivery. The beneficiaries of the proposed training programme, and of the research that it delivers and enables, will include academia, industry, healthcare, and the broader UK economy.

Academia: Graduates of the training programme will be well placed to start their post-doctoral careers in leading academic institutions, engaging in high-impact multi-disciplinary research, helping to build training and research capacity, and sharing their experience within the wider academic community.

Industry: Partner organisations will benefit from close collaboration with leading researchers, from the joint exploration of research priorities, and from the commercialisation of arising intellectual property. Other organisations will benefit from the availability of highly-qualified graduates with skills in big health data analytics.

Healthcare: Healthcare organisations and patients will benefit from the results of enabled and accelerated health research, leading to new treatments and technologies, and an improved ability to identify and evaluate potential improvements in practice through the analysis of real-world health data.

Economy: The life sciences sector is a key component of the UK economy. The programme will provide partner companies with direct access to leading-edge research. Graduates of the programme will be well-qualified to contribute to economic growth - supporting health research and the development of new products and services - and will be able to inform policy and decision making at organisational, regional, and national levels.

Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S02428X/1      |              |              | 01/04/2019 | 30/09/2027 |
2431819           | Studentship  | EP/S02428X/1 | 01/10/2020 | 30/09/2024 | Aleksandra Krepa