Methods for Dealing with Misspecification in Bayesian Experimental Design

Lead Research Organisation: University of Oxford

Abstract

Much of the research in statistics and machine learning has focussed on methods of analysing data once it has already been acquired, but the question of how best to collect data in the first place has been under-explored. Gathering data can be expensive, so practitioners are limited in the amount of data they can collect. When collection is done without care, it can produce poor-quality data, potentially leading to inaccurate results and incorrect conclusions regardless of how advanced one's analytical toolkit is. It is therefore vital to gather good-quality data before analysis. Research in experimental design aims to address this issue, providing practitioners with methods of collecting informative data that will lead to reliable results and strong conclusions.


To outline Bayesian experimental design (BED), we consider the following setting: there are several beacons within an area, each emitting a signal, and a practitioner wishes to locate the beacons. The data-gathering process involves the practitioner choosing locations in which to probe the signal, then recording the strength of that signal at these locations. With infinite resources, the practitioner would be able to perfectly locate the beacons, but in practice they are constrained to performing only a finite number of experiments.


BED then works as follows: before collecting any data, the practitioner will first form a statistical model of the strength of the signal at a given location in terms of the unknown locations of the beacons, and they will specify their prior beliefs about the locations of the beacons with a prior distribution on these locations. BED procedures can then provide the practitioner with the best locations to probe the signal - where "best" is defined as the locations that will lead to the largest increase in information about the beacon locations. The above can be easily generalised to other settings.
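As a minimal sketch of how such a procedure can work, one common approach scores each candidate design by a nested Monte Carlo estimate of its expected information gain (EIG) and probes at the highest-scoring location. Everything below is an illustrative assumption rather than the project's actual setup: a single beacon at an unknown 1D location theta with a standard normal prior, and a signal whose mean strength decays as exp(-(d - theta)^2) with Gaussian measurement noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (illustrative, not from the project): one beacon at
# unknown location theta ~ N(0, 1); probing at design point d yields a noisy
# signal y ~ N(exp(-(d - theta)^2), SIGMA^2).
SIGMA = 0.1

def log_lik(y, theta, d):
    """Log-density of observing signal y at design d given beacon location theta."""
    mu = np.exp(-(d - theta) ** 2)
    return -0.5 * ((y - mu) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))

def eig(d, n_outer=2000, n_inner=2000):
    """Nested Monte Carlo estimate of the expected information gain of design d."""
    # Outer samples: draw (theta, y) pairs from the prior predictive.
    theta = rng.standard_normal(n_outer)
    y = np.exp(-(d - theta) ** 2) + SIGMA * rng.standard_normal(n_outer)
    # log p(y | theta, d) evaluated at the generating theta.
    ll = log_lik(y, theta, d)
    # log p(y | d): marginalise over fresh inner prior draws of theta.
    theta_in = rng.standard_normal(n_inner)
    lm = log_lik(y[:, None], theta_in[None, :], d)   # shape (n_outer, n_inner)
    log_marg = np.logaddexp.reduce(lm, axis=1) - np.log(n_inner)
    # EIG(d) = E[log p(y | theta, d) - log p(y | d)].
    return np.mean(ll - log_marg)

# Score a grid of candidate probe locations and pick the most informative one.
candidates = np.linspace(-2.0, 2.0, 9)
gains = [eig(d) for d in candidates]
best = candidates[int(np.argmax(gains))]
print(f"best probe location: {best:.2f}")
```

In a sequential version of this sketch, the practitioner would observe the signal at the chosen location, update the prior to the resulting posterior, and repeat; the nested estimator itself is a standard workhorse for EIG, though more sophisticated estimators exist.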


BED is theoretically sound and performs well in practice; however, it can break down if our statistical model of the data is misspecified, i.e., if the true data-generating process differs from the model we specified. Bayesian statistical methodology is always vulnerable to model misspecification, but this is particularly problematic in experimental design, where the model is used not only to analyse the data but also to collect it in the first place. In the worst case, there are models in which the optimal course of action is to pick all of one's designs in exactly the same place, regardless of the outcomes observed. Unless the model is correct, however, this will produce an extremely poor-quality dataset.


In collaboration with my supervisor, Dr Tom Rainforth, I will first aim to deepen understanding of this problem: categorising the ways in which misspecification causes BED to fail; forming metrics to diagnose this failure and best practices to avoid its occurrence; and providing theoretical guarantees for when failure will occur. Following this, we will develop methods to counteract misspecification, ideally extending the theoretical elegance and empirical performance of BED to cases where the model is misspecified. BED has enormous potential for application, including in quantum information experiments, psychology trials, constructing lifelike police sketches, and guiding drug discovery. As these applications become more complex, model misspecification becomes more prevalent; it is therefore pertinent to further investigate misspecification in BED.


This project falls within the EPSRC 'statistics and applied probability' research area.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximise reproducibility and replicability, source code and replication files will be made available as open-source software or, when relevant to an industrial collaboration, held as a patent or software copyright.


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S023151/1                                   01/04/2019  30/09/2027
2740638            Studentship   EP/S023151/1  01/10/2022  30/09/2026  Alexander Forster