Interpretable statistical machine learning approaches for the molecular investigation of cancer

Lead Research Organisation: University of Oxford

Abstract

Ovarian cancer is the 6th most common cancer among women in the UK. High-grade serous ovarian cancer (HGSOC) accounts for most cases and has a low 5-year survival rate of 30%. Two main factors contribute to this poor prognosis: 1) late diagnosis of the disease, and 2) a high rate of relapse despite initial response to treatment. The latter suggests that small populations of treatment-resistant cancer cells may exist that can repopulate the disease. It is therefore of interest to identify such cells and to understand how they differ from other cancer cell types that might also affect how patients respond to treatment and how long they survive.
The molecular basis of ovarian cancer can be unravelled using a plethora of modern technologies such as sequencing and imaging at both bulk tissue and single-cell level. This is creating an unprecedented opportunity to use a data-driven approach to enable the precise characterisation of ovarian cancer and the possibility of developing targeted treatment options. However, to make effective use of the molecular data, robust analytical approaches are required to characterise cell populations of interest (particularly rare ones) and to integrate heterogeneous data modalities.
This research aims to:
1) develop a robust and interpretable approach to identify rare cell populations from high-dimensional molecular data, and
2) develop a statistical framework for the integration of multimodal data for survival prediction.

Novelty of the research methodology
We will develop statistical techniques that are specifically designed to identify rare cell types from molecular data. Classical statistical discovery methods tend to be biased toward the most common cell populations, as there is more information about them. There is often a penalty associated with suggesting a rare cell type, as it may not be real, so a balance must be struck between proposing new cell types and the chance that these proposals turn out to be false after further examination. We will develop techniques that allow us to control the balance between these competing needs, allowing experimental scientists to adjust expectations based on the level of risk that is acceptable to them.
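The trade-off described above can be illustrated with a standard multiple-testing procedure. The following is a minimal sketch only: it uses the well-known Benjamini-Hochberg procedure as a stand-in, not the technique this project will develop, and the p-values for the candidate cell populations are hypothetical numbers chosen for illustration. The parameter `alpha` plays the role of the acceptable level of risk.

```python
def benjamini_hochberg(p_values, alpha):
    """Return a list of booleans marking which candidates are accepted
    when the expected proportion of false discoveries is capped at alpha."""
    m = len(p_values)
    # Indices of the candidates, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    accepted = [False] * m
    for idx in order[:cutoff_rank]:
        accepted[idx] = True
    return accepted


# Hypothetical p-values, one per candidate rare cell population.
candidate_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]

strict = benjamini_hochberg(candidate_p, alpha=0.01)   # low risk tolerance
lenient = benjamini_hochberg(candidate_p, alpha=0.10)  # higher risk tolerance

print(sum(strict), "candidates proposed at alpha = 0.01")
print(sum(lenient), "candidates proposed at alpha = 0.10")
```

With these illustrative inputs, the stricter setting proposes fewer candidate populations than the lenient one, which is the kind of adjustable balance the paragraph describes.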
We will also develop techniques to combine different sources of data, such as clinical records, magnetic resonance imaging and whole genome sequencing. These techniques will address a specific limitation of existing approaches, which typically do not account for the information imbalance between different types of data. For example, a clinical record might contain 30-40 data entries describing a patient's condition, but a whole genome sequence might reveal tens of thousands of cancer mutations. If naively combined, the sheer number of mutations can overwhelm the clinical information, which can bias analysis and interpretation, for instance, by failing to consider important socioeconomic or ethnicity information. We will develop approaches that equalise the importance placed on different sources of data such that they can be combined in a fair and equitable way.
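The information imbalance described above can be made concrete with a small simulation. This is a sketch under stated assumptions, not the approach the project will develop: the data are randomly generated, the feature counts (30 clinical entries, 10,000 mutation indicators) are illustrative, and the function names are hypothetical. It uses simple block scaling, in which each data source is rescaled to the same total variance before the sources are combined.

```python
import random


def variance(column):
    """Population variance of one feature across patients."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)


def standardise(column):
    """Centre a feature and scale it to unit variance (constant features become 0)."""
    mean = sum(column) / len(column)
    sd = variance(column) ** 0.5
    return [(x - mean) / sd if sd > 0 else 0.0 for x in column]


def block_variance(block):
    """Total variance of a data source: the sum of its per-feature variances."""
    return sum(variance(col) for col in block)


def equalise_block(block):
    """Standardise every feature, then rescale the block to total variance 1."""
    standardised = [standardise(col) for col in block]
    total = block_variance(standardised)
    scale = total ** 0.5 if total > 0 else 1.0
    return [[x / scale for x in col] for col in standardised]


random.seed(0)
n_patients = 50
# Simulated data: each block is a list of features, each feature a list over patients.
clinical = [[random.gauss(0, 1) for _ in range(n_patients)] for _ in range(30)]
mutations = [[float(random.random() < 0.1) for _ in range(n_patients)]
             for _ in range(10_000)]

# Naive combination: standardise each feature and concatenate.
clin_naive = block_variance([standardise(c) for c in clinical])
mut_naive = block_variance([standardise(c) for c in mutations])
naive_share = clin_naive / (clin_naive + mut_naive)

# Block scaling: each source contributes the same total variance.
clin_eq = block_variance(equalise_block(clinical))
mut_eq = block_variance(equalise_block(mutations))
balanced_share = clin_eq / (clin_eq + mut_eq)

print(f"clinical share of variance, naive:    {naive_share:.4f}")
print(f"clinical share of variance, balanced: {balanced_share:.4f}")
```

In the naive combination the clinical block accounts for well under 1% of the total variance simply because it has fewer features, whereas after block scaling both sources contribute equally. Equal variance is only one possible notion of equal importance; other weightings could be substituted in `equalise_block`.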
This project falls within the EPSRC Healthcare Technologies research area, where "Optimising disease prediction, diagnosis and intervention" is one of the listed themes. It will create new methods for analysing large data sets, underpin patient-specific predictive models, and support the identification of opportunities for prevention of disease or its recurrence.
This project will involve a collaboration with the Oxford-based cancer immunology company, Singula Bio.

Planned Impact

In the same way that bioinformatics has transformed genomic research and clinical practice, health data science will have a dramatic and lasting impact upon the broader fields of medical research, population health, and healthcare delivery. The beneficiaries of the proposed training programme, and of the research that it delivers and enables, will include academia, industry, healthcare, and the broader UK economy.

Academia: Graduates of the training programme will be well placed to start their post-doctoral careers in leading academic institutions, engaging in high-impact multi-disciplinary research, helping to build training and research capacity, sharing their experience within the wider academic community.

Industry: Partner organisations will benefit from close collaboration with leading researchers, from the joint exploration of research priorities, and from the commercialisation of arising intellectual property. Other organisations will benefit from the availability of highly-qualified graduates with skills in big health data analytics.

Healthcare: Healthcare organisations and patients will benefit from the results of enabled and accelerated health research, leading to new treatments and technologies, and an improved ability to identify and evaluate potential improvements in practice through the analysis of real-world health data.

Economy: The life sciences sector is a key component of the UK economy. The programme will provide partner companies with direct access to leading-edge research. Graduates of the programme will be well-qualified to contribute to economic growth - supporting health research and the development of new products and services - and will be able to inform policy and decision making at organisational, regional, and national levels.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S02428X/1                                   01/04/2019  30/09/2027
2728935            Studentship   EP/S02428X/1  01/10/2021  30/09/2025  Ellen Visscher