Interpretable representation learning

Lead Research Organisation: University of Oxford
Department Name: Computer Science


Deep learning approaches have achieved considerable success across a wide range of domains in recent years, from image segmentation and classification to speech recognition and language translation. They have also started to demonstrate promising results in healthcare applications, supported by the growing size and diversity of available patient data.

However, despite clear performance improvements, their adoption by the healthcare community is hindered by two factors: many perceive deep learning models as indecipherable black boxes, and current state-of-the-art approaches in the medical domain do not offer a good handle on the uncertainty of model predictions.

The objective of my research will be to fill these gaps by developing novel representation learning approaches that are more interpretable and robust. Of particular interest will be the extension of these methods to heterogeneous and/or non-stationary data inputs, building, for example, on early work on Bayesian Recurrent Neural Networks applied to language modeling and image captioning.
The research will aim to demonstrate how these can be leveraged for healthcare applications, where patient data may come from a wide range of different sources (e.g., medical imaging, high-throughput sequencing, electronic medical records) and vary over time under the influence of disease progression and treatment effects.
The goal will be to show concretely how they can provide a better understanding of the elements that underpin the predictions of a machine learning model, and lead to new insights into disease (e.g., identification of patient subgroups for a given pathology).

This cross-disciplinary project, which combines theoretical machine learning developments with their applications to healthcare, sits at the intersection of two of EPSRC's core research areas, namely "Artificial Intelligence Technologies" and "Healthcare Technologies".
The project will be supervised by Professor Yarin Gal (Oxford Applied & Theoretical Machine Learning Group, Department of Computer Science, University of Oxford) and Dr. Lindsay Edwards (Vice President, AI/ML Engineering, GlaxoSmithKline).



Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S513866/1      |              |              | 01/10/2018 | 31/03/2024 |
2287801           | Studentship  | EP/S513866/1 | 01/10/2019 | 30/09/2022 | Pascal Notin
Description Disease variant prediction with deep generative models of evolutionary data:
Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.
Exploitation Route Our data and results, available at, provide gene-by-gene information that lets researchers and physicians examine individual variants in detail, including model predictions for each variant across 3k proteins. We are working to extend our predictions to the full proteome and are collaborating closely with several research teams and private institutions to integrate our models and predictions into their workflows and analyses. Our objective is thereby to support the early diagnosis of genetic diseases by clinical geneticists and to solidify our understanding of the mechanisms underlying genetic disorders.
Sectors Healthcare
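
The label-free scoring idea in the description above (modelling the distribution of sequence variation across organisms, then asking how plausible a variant is under that distribution) can be illustrated with a deliberately simplified sketch. It does not reproduce EVE, which uses a deep generative model; it swaps in a much simpler site-independent frequency model. The alignment, function names and pseudocount below are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(alignment, pseudocount=1.0):
    """Estimate per-position amino acid frequencies from aligned sequences.

    This is a site-independent stand-in for a generative model of
    evolutionary sequences; no disease labels are used.
    """
    length = len(alignment[0])
    freqs = []
    for pos in range(length):
        # Count residues at this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return freqs


def variant_score(freqs, position, wild_type, mutant):
    """Log-ratio of mutant to wild-type probability at one position.

    More negative values mean the mutant residue is rarer in the
    alignment than the wild type, hinting at a less tolerated change.
    """
    site = freqs[position]
    return math.log(site[mutant]) - math.log(site[wild_type])


# Hypothetical toy alignment (five homologous sequences, five positions).
toy_alignment = ["MKVLA", "MKILA", "MRVLA", "MKVLG", "MKVLA"]
freqs = site_frequencies(toy_alignment)

# Position 0 is fully conserved; position 2 already shows V/I variation.
conserved = variant_score(freqs, 0, "M", "W")
tolerated = variant_score(freqs, 2, "V", "I")
print(f"M1W score: {conserved:.3f}")
print(f"V3I score: {tolerated:.3f}")
```

In this toy setting, the substitution at the fully conserved position scores lower than the substitution already observed among the homologues. A site-independent model ignores dependencies between positions, which a deep generative model is able to capture.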

Description Models developed as part of this grant (in particular the EVE models, discussed in the paper "Disease variant prediction with deep generative models of evolutionary data") have started to be used in hospitals to identify genes potentially responsible for genetic pathologies.
First Year Of Impact 2022
Sector Healthcare
Impact Types Societal