Generative Models: Theory and Applications

Lead Research Organisation: University of Oxford

Abstract

Recently, a range of methods, known collectively as generative modelling, has been developed for understanding and approximating complex probability distributions. These techniques work by learning to generate synthetic data points that look as though they come from the original distribution. They have grown significantly in prominence over the last few years, with two classes of model, diffusion models and variational autoencoders, attracting particular attention. Such models have been very successful in a variety of applications, including generating synthetic images, improving image quality, generating synthetic protein structures, approximating geospatial data and more. However, our theoretical understanding of such models is patchy at best, with applications often running significantly ahead of their theoretical underpinnings.

The aim of the proposed research project is to develop a stronger theoretical basis for these generative modelling techniques than currently exists. The key questions we aim to answer include:
- Under what conditions do techniques currently in use work well or fail? Can we derive new theoretical guarantees of success or failure, or empirical evidence about when such models are suitable?
- Can we find alterations to existing methods that make them more robust or applicable in a wider range of scenarios? Alternatively, can we develop new methods that expand the range of distributions we can approximate?
- Can we find new underlying connections between different versions of these methods, or a new theoretical framework underpinning these methods?
The overarching aim is for answers to these questions to yield generative models that approximate a wider range of distributions more accurately, and to give us a better understanding of when and why such methods are suitable and when they are not.

We will approach these questions through a combination of theoretical work and empirical work. The theoretical work will aim to develop suitable mathematical frameworks for generative models and theoretical guarantees on their effectiveness. The empirical work will involve building examples of these models and testing them on toy and real-world data in order to gain insights about their behaviour and empirical evidence on their effectiveness or otherwise. It is our hope that the theory will be used to explain the empirical behaviour and the empirical results will provide substance to the theoretical frameworks.

This project falls within the EPSRC statistics and applied probability research area.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/S023151/1        -              -              01/04/2019   30/09/2027   -
2564794             Studentship    EP/S023151/1   01/10/2021   30/09/2025   Joseph Benton