End-to-End Probabilistic Modelling of Longitudinal Electronic Health Records

Lead Research Organisation: University of Oxford

Abstract

The recent availability of large data sets of electronic health records (EHR) has led to a surge in the literature on machine learning for health care. This thesis aims to build a bridge between the newly proposed methods and the classical literature on probabilistic modelling. It pays special attention to four typical properties of medical data: (1) extensive missingness, (2) irregularly spaced observations, (3) model interpretation and (4) semi-supervision.

In a first step, we explore how deep generative models (non-linear models that generate new data) can be used to impute missing data. We provide intuition for why some of these deep generative imputation methods work and make it possible to encode prior information about the missing data into the model.
In a second step, we deal with irregularly spaced observations by modelling latent categorical variables with continuous-time neural networks. This allows us to model time-conditional probabilities in the latent space. Such a model can then be used to cluster the disease states of a patient over time.
In a third step, we develop interpretable methodology that reveals the latent dynamics of a disease from observed lab values and survey answers. These latent dynamics can then be used as efficacy measures in clinical trials to assess whether a drug is working. Common latent factor models that can be applied to this problem are often restricted to capturing linear interactions. Continuous-time non-linear models, however, fail to give interpretable results. Our goal is to find a sparse linear approximation to the non-linear dynamics found by continuous deep latent models.
In a fourth step, we explore how such a deep latent model can be controlled through semi-supervision. In clinical trials, existing efficacy measures are available for some of the clinical visits. The goal is to find novel efficacy measures of a disease that do not suffer from the naivety of the existing measures but are to some degree positively correlated with them. An additional model is intended to explain possible deviations, which should in turn ensure that these deviations are reasonable.
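
As a rough illustration of the second step above, the sketch below shows how transition probabilities between latent categorical states can depend on the elapsed time between two visits. It is not the thesis model: the neural network is replaced by a two-state continuous-time Markov chain with fixed rates, and the function names, rates and visit times are hypothetical values chosen only for this example.

```python
import math


def two_state_transition_probs(rate_01, rate_10, dt):
    """Transition matrix P(dt) of a two-state continuous-time Markov chain.

    rate_01: assumed rate of moving from state 0 to state 1 (hypothetical)
    rate_10: assumed rate of moving from state 1 to state 0 (hypothetical)
    dt:      elapsed time between two visits; any non-negative gap is allowed
    Requires rate_01 + rate_10 > 0.
    """
    total = rate_01 + rate_10
    decay = math.exp(-total * dt)
    pi0 = rate_10 / total  # long-run probability of state 0
    pi1 = rate_01 / total  # long-run probability of state 1
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]


def propagate(belief, visit_times, rate_01, rate_10):
    """Push a distribution over the two latent states through irregular gaps."""
    beliefs = [list(belief)]
    for t_prev, t_next in zip(visit_times, visit_times[1:]):
        P = two_state_transition_probs(rate_01, rate_10, t_next - t_prev)
        belief = [
            belief[0] * P[0][0] + belief[1] * P[1][0],
            belief[0] * P[0][1] + belief[1] * P[1][1],
        ]
        beliefs.append(belief)
    return beliefs


if __name__ == "__main__":
    # Hypothetical visit times (in months) with uneven spacing.
    visit_times = [0.0, 0.5, 3.0, 3.2]
    for t, b in zip(visit_times, propagate([1.0, 0.0], visit_times, 0.3, 0.1)):
        print(f"t={t:.1f}  P(state 0)={b[0]:.3f}  P(state 1)={b[1]:.3f}")
```

Because the transition matrix is a function of the gap itself, long and short intervals between visits are handled by the same model without resampling the record onto a regular grid. A continuous-time neural model would let the dynamics vary with the latent state and the data rather than fixing two rates.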

To demonstrate the effectiveness of our models, we evaluate the proposed methodology on various real-world medical data sets, such as the medical benchmark data set MIMIC-III and clinical trial data sets provided by Novartis.
My research is partly funded by Novartis. My supervisor is Chris Holmes. This project falls within the EPSRC Healthcare technologies research area.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximise reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Studentship Projects

Project Reference | Relationship | Related To | Start | End | Student Name
EP/S023151/1 | | | 01/04/2019 | 30/09/2027 |
2247906 | Studentship | EP/S023151/1 | 01/10/2019 | 30/09/2023 | Sahra Ghalebikesabi