Scalable Bayesian Statistical Machine Learning for Multi-modal Data with Applications to Multiple Sclerosis

Lead Research Organisation: University of Oxford

Abstract

Within this project, we undertake the challenge of effectively managing extensive and complex datasets, as exemplified by the NO.MS clinical dataset. The primary objective is to streamline the inherent complexity of these datasets by uncovering underlying latent variables. This simplification is particularly vital when contending with high-dimensional data, where the challenge lies in distilling an abundance of dimensions into a more concise set of broader covariates that remain accurate representations of the data. This concept of distilling latent variables extends its relevance to diverse fields beyond our immediate study.

The NO.MS dataset, provided by Novartis, serves as a focal point of interest. It encompasses a wealth of data on individuals affected by multiple sclerosis, distinguishing itself as one of the largest datasets of its kind. This distinction arises from the inclusion of numerous MRI brain scans, contributing to its high dimensionality due to the myriad of pixels within each scan. Consequently, this dataset bears significant potential for unravelling insights into the disease. However, from an analytical standpoint, it presents considerable challenges. This complexity stems from the dataset's amalgamation of discrete data, such as disability scores, and continuous data, exemplified by the pixel values within the MRI scans. Furthermore, the dataset draws from multiple studies, each capturing distinct facets of patient visits, resulting in a substantial amount of missing data.

The project's core objectives comprise two contributions:
First and foremost, we aim to construct a model that can effectively unveil an interpretable representation of the lower-dimensional latent space. Our approach relies heavily on Bayesian statistics, a statistical framework that integrates prior beliefs into modelling, subsequently updating them based on the incoming data. This model must possess the versatility to accommodate both continuous and discrete data, offering a solution for datasets like NO.MS. Furthermore, we prioritize scalability, recognizing the impracticality of conventional methods for managing large, high-dimensional datasets, such as the Novartis multiple sclerosis data. In addition, our model should autonomously determine the optimal number of latent variables required to represent the data accurately. While existing models may address individual aspects of these challenges, the unique aspect of our approach is the integration of solutions into a cohesive whole.
Secondly, we intend to apply this comprehensive model to the NO.MS dataset to deepen our understanding of multiple sclerosis. This can be achieved by analysing the latent factors unveiled, in collaboration with medical experts. Additionally, these latent factors furnish a simplified representation of the data, which can in turn be used alongside more computationally intensive models. This streamlined representation enhances the efficiency of our analyses compared to conventional approaches.
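As a toy illustration of the Bayesian principle described above (a prior belief updated by incoming data), the following minimal sketch performs a conjugate Beta-Binomial update. It is not the project's model: the function names, the prior parameters and the observed counts are all hypothetical and chosen purely for illustration.

```python
# Toy illustration only: a conjugate Beta-Binomial update.
# This is NOT the project's latent variable model; all names and numbers are hypothetical.

def update_beta(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing binary outcomes."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2.0, 2.0)  # hypothetical prior belief, centred on 0.5
posterior = update_beta(*prior, successes=7, failures=3)  # hypothetical observations

print("prior mean:", beta_mean(*prior))
print("posterior mean:", beta_mean(*posterior))
```

The posterior mean moves from the prior mean towards the observed proportion of successes, and it moves further as more data arrive. The models envisaged in the project would apply the same principle to far richer structures, such as latent factors shared across discrete and continuous measurements.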

This project falls within the EPSRC Statistics and Applied Probability research area and is carried out in collaboration with Novartis. It is supervised by Dr Habib Ganjgahi, Prof Tom Nichols and Prof Chris Holmes.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S023151/1       -             -             01/04/2019  30/09/2027  -
2740724            Studentship   EP/S023151/1  01/10/2022  30/09/2026  George Hutchings