Generative Modelling and Representation Learning

Lead Research Organisation: University of Oxford

Abstract

Raw data is abundant in the modern world, and models that can make sense of this large flow of information would be helpful for many tasks. Unfortunately, most data is not neatly packaged in a format that traditional machine learning methods can use, so it is desirable to have methods that can extract useful information from data in any format. One learning paradigm that meets this criterion is generative modelling. Generative models look at example data from a given source and learn to synthesize new, artificial data that resembles the original. Synthesizing artificial examples may not in itself be useful for many applications. However, inherent in the ability to create realistic examples is a deep understanding of the structure and form of the data. Within these generative models there must therefore exist 'representations' that summarize pertinent aspects of the data, such as its structure and form. These representations are useful both to humans and to further models. Generative models can be coaxed into producing representations that are interpretable to humans, providing automatic and extensive summaries of large amounts of data that go beyond simple metrics such as the mean and variance. In parallel, further models can be trained directly on the representations instead of on the raw data. Because the representations contain condensed information learnt by the original generative model, models trained on them usually require much less data to reach the same level of performance as a model trained directly on the raw data.
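
As a purely illustrative sketch of the idea above (and not the method studied in this project), the following Python snippet fits a very simple linear generative model, probabilistic PCA, to toy data. It then uses the fitted model in the two ways the abstract describes: to synthesize new examples and to extract low-dimensional representations that a further model could be trained on. The function names, the toy data and the choice of probabilistic PCA are all assumptions made for this example.

```python
import numpy as np

def fit_linear_generative_model(X, latent_dim):
    """Fit a probabilistic-PCA-style model: x = mean + W z + noise, z ~ N(0, I).

    Hypothetical helper for illustration; assumes more samples than features.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions and eigenvalues of the sample covariance via SVD.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = S**2 / X.shape[0]
    # Maximum-likelihood noise variance: mean of the discarded eigenvalues.
    noise_var = eigvals[latent_dim:].mean() if latent_dim < len(eigvals) else 0.0
    scale = np.sqrt(np.maximum(eigvals[:latent_dim] - noise_var, 0.0))
    W = Vt[:latent_dim].T * scale  # shape: (n_features, latent_dim)
    return mean, W, noise_var

def encode(X, mean, W, noise_var):
    """Representation of each row of X: posterior mean of z given x."""
    M = W.T @ W + noise_var * np.eye(W.shape[1])
    return np.linalg.solve(M, W.T @ (X - mean).T).T

def sample(n, mean, W, noise_var, rng):
    """Synthesize n new examples from the fitted model."""
    z = rng.standard_normal((n, W.shape[1]))
    eps = np.sqrt(noise_var) * rng.standard_normal((n, W.shape[0]))
    return mean + z @ W.T + eps

# Toy data (an assumption): 10-dimensional observations driven by 2 hidden factors.
rng = np.random.default_rng(0)
true_W = rng.standard_normal((10, 2))
latent = rng.standard_normal((500, 2))
X = latent @ true_W.T + 0.1 * rng.standard_normal((500, 10))

mean, W, noise_var = fit_linear_generative_model(X, latent_dim=2)
Z = encode(X, mean, W, noise_var)          # condensed representations, one per example
fake = sample(5, mean, W, noise_var, rng)  # synthetic examples resembling X
```

Here `Z` has two columns per example instead of ten, so a downstream model trained on `Z` sees a condensed summary of each observation. Modern generative models replace the linear map with deep networks, but the split between synthesizing data and extracting representations is the same.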

This project aims to improve upon existing generative modelling techniques and to formulate new ways of extracting representations from learnt generative models. New methods for generative modelling are regularly proposed in the research community, but our understanding of exactly what is being learnt from the data, and of how to extract this information, often lags behind. The project will focus on recently proposed methods for generative modelling and aims to bring our understanding of their inner workings up to date. This will be achieved through a combination of empirical investigation of state-of-the-art models and theoretical work to characterise their behaviour.

This project falls within the EPSRC Artificial Intelligence and Robotics research area.

Collaboration is currently planned internally, within the Oxford University Department of Statistics.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships and placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the leading statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S023151/1      |              |              | 01/04/2019 | 30/09/2027 |
2420772           | Studentship  | EP/S023151/1 | 01/10/2020 | 30/09/2024 | Andrew Campbell