Data Reduction and Large-Scale Inference - Bayesian Coresets

Lead Research Organisation: University of Bristol
Department Name: Mathematics

Abstract

Bayesian methods are attractive in large-scale data settings because of the coherent uncertainty quantification and prior specification they provide. Unfortunately, Bayesian inference algorithms are not generally computationally scalable, making their application to large datasets difficult or infeasible. As modern datasets continue to grow, it is essential for inference procedures to be scalable whilst retaining theoretical guarantees on the quality of their results. The question then naturally arises of how to reduce data in a principled manner, extracting the meaningful structure in massive, high-dimensional datasets and condensing it into smaller, lower-dimensional datasets that are less costly to analyse. Previous work on scaling Bayesian inference has focused on modifying algorithms to, for example, use only a random data subsample at each iteration. However, by leveraging the insight that data is often redundant, recent work on Bayesian coresets has provided numerous approaches to finding a weighted subset of the data (called a coreset) that is much smaller than the original dataset. This coreset can then be used in many existing posterior inference algorithms without alteration, providing computational speedup and guarantees on posterior approximation error.

Significant computational gains are achieved when the combined cost of coreset construction and the subsequent regression-parameter estimation from the coreset is less than the cost of estimating those parameters from the full dataset. These ideas extend to other applications too, for example to Bayesian inference where, rather than using point estimates, parameters are sampled from a posterior distribution using Markov chain Monte Carlo (MCMC) or sequential Monte Carlo (SMC) techniques. Such sampling involves repeatedly evaluating the likelihood function, which is less costly for a small coreset than for the full dataset.
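As a rough illustration of the saving in likelihood evaluations, the sketch below runs a random-walk Metropolis sampler against a weighted coreset log-likelihood. It is a minimal example under stated assumptions, not the construction methods this project studies: the coreset is a plain uniform subsample with equal weights N/m (a simple baseline), the model is a Gaussian with unknown mean and unit variance, the prior is flat, and all function names are hypothetical.

```python
import math
import random


def gaussian_log_lik(theta, x):
    """Log-density of one observation x under N(theta, 1) (assumed model)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2


def weighted_log_lik(theta, points, weights):
    """Weighted sum of per-point log-likelihoods: the coreset log-likelihood."""
    return sum(w * gaussian_log_lik(theta, x) for x, w in zip(points, weights))


def uniform_coreset(data, m, rng):
    """Baseline coreset (assumption): m points drawn uniformly without
    replacement, each given weight N/m so the weights sum to N."""
    n = len(data)
    points = rng.sample(data, m)
    return points, [n / m] * m


def random_walk_metropolis(log_post, theta0, n_steps, step, rng):
    """Random-walk Metropolis; log_post is evaluated once per iteration."""
    theta, lp = theta0, log_post(theta0)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        # 1 - random() lies in (0, 1], which keeps the logarithm finite.
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples.append(theta)
    return samples


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(10_000)]
points, weights = uniform_coreset(data, 200, rng)


def log_post(theta):
    # Flat prior assumed, so the log-posterior equals the coreset
    # log-likelihood up to an additive constant.
    return weighted_log_lik(theta, points, weights)


# Each iteration touches 200 weighted points instead of 10,000.
samples = random_walk_metropolis(log_post, 0.0, 2_000, 0.05, rng)
```

The sampler itself is unchanged by the reduction: only the function passed as `log_post` differs, which is the sense in which a coreset can be used by existing inference algorithms without alteration. A uniform subsample carries no approximation guarantee of the kind the coreset literature provides; it is used here only to keep the example short.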

Work on this topic could also be taken in the direction of hybridizing coreset methods with nonlinear dimensionality reduction techniques. Such techniques are designed to reduce not the number of data points but the dimension of each data point, by recognizing and exploiting the fact that data may be concentrated around a manifold of low intrinsic dimension embedded in a high-dimensional space. There are several other interesting research directions in which the work might be taken. Current coreset reduction methods rely on full or conditional independence of data points: to what extent can the methods be extended beyond this regime? Can dimensionality reduction methods be placed within a well-founded and unified probabilistic framework?

The University project supervisors will be Nick Whiteley and Robert Allison. "Industrial" co-supervisor(s) will be from the machine learning research group within the NCSC, which is actively engaged in research into large-scale Bayesian inference techniques, including data-reduction methods, and will join our regular detailed technical discussions. This group is well connected across the UK university research community in the areas of data science, computational statistics and machine learning, as well as with the Alan Turing Institute and with NCSC research activities.

Planned Impact

The COMPASS Centre for Doctoral Training will have the following impact.

Doctoral Students Impact.

I1. Recruit and train over 55 students and provide them with a broad and comprehensive education in contemporary Computational Statistics & Data Science, leading to the award of a PhD. The training environment will be built around a set of multilevel cohorts: a variety of group sizes, activities within and across year cohorts, and activities within and across disciplinary boundaries with internal and external partners, where statistics and computation are the common focus but which remain sensitive to disciplinary needs. Our novel doctoral training environment will have a powerful impact on students, opening their eyes not only to a range of modern technical benefits and opportunities, but also to the power of team-working with people from a range of backgrounds to solve the most important problems of the day. They will learn to apply their skills to achieve impact by working collaboratively with internal and external partners, for example via our Rapid Response Teams, Policy Workshops & Statistical Clinics.

I2. As well as advanced training in computational statistics and data science, our students will benefit from exposure to, and training in, important cognate topics such as ethics, responsible innovation, equality, diversity and inclusion, policy, effective communication and dissemination, enterprise, impact and consultancy skills. It is vital for our students to understand that their training will enable them to have a powerful impact on the wider world: for example, the AI algorithms they develop should not be discriminatory, statistical methodologies should be reproducible, and statistical results should be communicated accurately and comprehensibly to the general public and policymakers.

I3. The students will gain experience via collaborations with academic partners within the University in cognate disciplines, and a wide range of external industrial & government partners. The students will be impacted by the structured training programmes of the UK Academy of Postgraduate Training in Statistics, the Bristol Doctoral College, the Jean Golding Institute, the Alan Turing Institute and the Heilbronn Institute for Mathematical Sciences, which will be integrated into our programme.

I4. Having received excellent training, the students will go on to have a powerful impact on the world in their future careers, spreading excellence.

Impact on our Partners & ourselves.

I5. Direct impacts will be achieved by students engaging with, and working on projects with, our academic partners, with discipline-specific problems arising in engineering, education, medicine, economics, earth sciences, life sciences and geographical sciences, and our external partners Adarga, the Atomic Weapons Establishment, CheckRisk, EDF, GCHQ, GSK, the Office for National Statistics, Sciex, Shell UK, Trainline and the UK Space Agency. The students will demonstrate a wide range of innovation with these partners, will attract engagement from new partners, and will often provide attractive future employment matches for students and partners alike.

Wider Societal Impact

I6. COMPASS will greatly benefit the UK by providing over 55 highly trained PhD graduates in an area with well-known, extreme shortages in the national people pipeline. COMPASS CDT graduates will be equipped for jobs in sectors of high economic value and national priority, including data science, analytics, pharmaceuticals, security, energy, communications, government, and indeed all research labs that deal with data. Through their training, they will enable these organisations to make well-informed and statistically principled decisions that will allow them to maximise their international competitiveness and contribution to societal well-being. COMPASS will also impact positively on the wider student community, both now and sustainably into the future.

Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S023569/1      |              |              | 01/04/2019 | 30/09/2027 |
2592814           | Studentship  | EP/S023569/1 | 01/10/2021 | 22/03/2026 | Dominic Broadbent