Differential Model Inference with Imperfect Information

Lead Research Organisation: University of Bristol
Department Name: Mathematics

Abstract

Density ratio estimation (DRE) is the practice of estimating the ratio between two probability density functions (PDFs). DRE's ability to characterise the relationship between two PDFs naturally lends itself to many applications, such as outlier detection, Generative Adversarial Networks (GANs), and general binary classification. Furthermore, DRE is a technique which can be applied to the EPSRC research area of Natural Language Processing. In our research we aim to adapt various DRE methods, and downstream applications of these methods, to be robust when working with imperfect data. Imperfect data can take many forms, such as missing data, corrupted data, and even adversarial data, each of which needs careful consideration when adapting DRE approaches. Imperfect data is a key issue within DRE because a few key points can have a large impact on the estimate, making DRE very sensitive to irregularities in the data. While there is a vast number of DRE approaches, very few of them explicitly account for any form of imperfect data. Some work has been done on the impact of missing data on DRE; however, this work focuses exclusively on the case of uniform missing patterns. There are many applications in which such an assumption is unrealistic and the probability of an observation being missing depends in some way on the value of the observation itself. For example, many measuring instruments are more likely to err when attempting to measure more extreme values, while in questionnaires, participants are less likely to answer a question if they deem their answer to be embarrassing or unfavourable. Both of these examples lead to non-uniform missing patterns within the data. In such a case, naively applying a complete-case approach with any DRE procedure can lead to estimating a density ratio different from the true one, and thus to inconsistent estimates.
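The effect of non-uniform missingness on a complete-case estimate can be illustrated with a small simulation. The sketch below is not the project's method: the two Gaussian densities, the logistic missingness mechanism `observe_prob`, and the crude histogram-based estimator are all assumptions chosen for illustration, and the missingness mechanism is assumed to be known so that inverse-probability weighting can be used as a simple correction.

```python
from math import erf, sqrt
import numpy as np

def observe_prob(x):
    """Hypothetical mechanism: larger values are less likely to be observed."""
    return 1.0 / (1.0 + np.exp(2.0 * x))

def normal_bin_mass(edges, mean):
    """Exact probability of each bin under N(mean, 1)."""
    cdf = np.array([0.5 * (1.0 + erf((e - mean) / sqrt(2.0))) for e in edges])
    return np.diff(cdf)

def histogram_ratio(num, den, edges, num_weights=None):
    """Crude DRE: ratio of (optionally weighted) bin proportions."""
    num_hist, _ = np.histogram(num, bins=edges, weights=num_weights)
    den_hist, _ = np.histogram(den, bins=edges)
    num_total = len(num) if num_weights is None else num_weights.sum()
    return (num_hist / num_total) / (den_hist / len(den))

rng = np.random.default_rng(0)
n = 500_000
x_num = rng.normal(0.0, 1.0, n)   # numerator sample, subject to missingness
x_den = rng.normal(1.0, 1.0, n)   # denominator sample, fully observed

# Keep each numerator point with a probability that depends on its own value.
observed = rng.random(n) < observe_prob(x_num)
x_cc = x_num[observed]            # complete cases only

edges = np.linspace(-1.5, 1.5, 7)
truth = normal_bin_mass(edges, 0.0) / normal_bin_mass(edges, 1.0)

# Naive complete-case estimate versus an inverse-probability-weighted one.
naive = histogram_ratio(x_cc, x_den, edges)
ipw = histogram_ratio(x_cc, x_den, edges, num_weights=1.0 / observe_prob(x_cc))

naive_err = np.abs(naive / truth - 1.0)
ipw_err = np.abs(ipw / truth - 1.0)
print("max relative error, complete case:", naive_err.max())
print("max relative error, weighted:     ", ipw_err.max())
```

In this toy setting the complete-case estimate targets the true ratio multiplied by a factor proportional to the observation probability, so it does not improve with more data, whereas the weighted estimate does.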
Our initial aim is therefore to adapt DRE procedures to this scenario of non-uniform missing data. When doing so there are multiple considerations to take into account. An additional aim of the PhD will be to adapt downstream applications of DRE to the case of imperfect data. One of the applications we are looking to adapt is Neyman-Pearson (NP) classification, an application of DRE in which one wants to create a classification procedure which strictly controls misclassification for one class. A potential application of NP classification is disease diagnosis. Within this setting, falsely classifying a diseased individual as healthy could be far more damaging than classifying a healthy individual as diseased. As such, we would like to construct a procedure for classifying individuals which has strict control on the probability of misclassifying a diseased individual as healthy. While major NP classification procedures leverage DRE, there is still opportunity for imperfect data to impact the procedure outside of the DRE step. Therefore, we aim to address this and make the entirety of the NP classification procedure robust to non-uniform missing data. Again, we will look to extend this to multi-dimensional settings. Another way we intend to extend this work is to look into cases where the missingness structure is in some way unknown. In this case we will explore how the missingness structure can first be learned before we perform our adapted DRE procedure or any downstream applications.
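To make the idea of strict error control for one class concrete, the sketch below thresholds a density-ratio score using an order statistic of held-out scores from the controlled class, the construction used by the NP umbrella algorithm in the literature. It is an illustration under stated assumptions rather than the procedure developed in this project: the data are fully observed, the two classes are Gaussian with a known log density ratio (`log_ratio`), and the names `np_threshold_rank`, `alpha` and `delta` are introduced here for the example.

```python
from math import comb
import numpy as np

def np_threshold_rank(n, alpha, delta):
    """Smallest rank k such that thresholding at the k-th smallest of n
    controlled-class scores has error above alpha with probability at most
    delta. Returns None if n is too small. Float arithmetic limits this
    simple version to n up to roughly 1000."""
    tail, best = 0.0, None
    for k in range(n, 0, -1):
        # Accumulate P(Binomial(n, 1 - alpha) >= k) from the top down.
        tail += comb(n, k) * (1 - alpha) ** k * alpha ** (n - k)
        if tail > delta:
            break
        best = k
    return best

def log_ratio(x):
    """Log density ratio of N(0,1) (other class) over N(2,1) (controlled class)."""
    return 2.0 - 2.0 * x

alpha, delta = 0.05, 0.05
rng = np.random.default_rng(1)

# Held-out scores from the controlled class are used only to set the threshold.
calib = np.sort(log_ratio(rng.normal(2.0, 1.0, 500)))
rank = np_threshold_rank(len(calib), alpha, delta)
threshold = calib[rank - 1]

# A point is assigned to the other class when its score exceeds the threshold.
test_controlled = log_ratio(rng.normal(2.0, 1.0, 100_000))
test_other = log_ratio(rng.normal(0.0, 1.0, 100_000))
controlled_error = np.mean(test_controlled > threshold)
other_detected = np.mean(test_other > threshold)
print("rank used:", rank)
print("error rate on controlled class:", controlled_error)
print("detection rate on other class: ", other_detected)
```

The guarantee here is probabilistic: over repeated draws of the held-out sample, the controlled-class error exceeds `alpha` with probability at most `delta`. It relies on the held-out scores being a representative sample from the controlled class, which is exactly the assumption that non-uniform missingness can break.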

Planned Impact

The COMPASS Centre for Doctoral Training will have the following impact.

Doctoral Students Impact.

I1. Recruit and train over 55 students and provide them with a broad and comprehensive education in contemporary Computational Statistics & Data Science, leading to the award of a PhD. The training environment will be built around a set of multilevel cohorts: a variety of group sizes, within and across year cohort activities, within and across disciplinary boundaries with internal and external partners, where statistics and computation are the common focus, but remaining sensitive to disciplinary needs. Our novel doctoral training environment will have a powerful impact on students, opening their eyes not only to a range of modern technical benefits and opportunities, but also to the power of team-working with people from a range of backgrounds to solve the most important problems of the day. They will learn to apply their skills to achieve impact by working collaboratively with internal and external partners, such as via our Rapid Response Teams, Policy Workshops & Statistical Clinics.

I2. As well as advanced training in computational statistics and data science, our students will be impacted by exposure to, and training in, important cognate topics such as ethics, responsible innovation, equality, diversity and inclusion, policy, effective communication and dissemination, enterprise, impact and consultancy skills. It is vital for our students to understand that their training will enable them to have a powerful impact on the wider world: for example, the AI algorithms they develop should not be discriminatory, statistical methodologies should be reproducible, and statistical results should be communicated accurately and comprehensibly to the general public and policymakers.

I3. The students will gain experience via collaborations with academic partners within the University in cognate disciplines, and a wide range of external industrial & government partners. The students will be impacted by the structured training programmes of the UK Academy of Postgraduate Training in Statistics, the Bristol Doctoral College, the Jean Golding Institute, the Alan Turing Institute and the Heilbronn Institute for Mathematical Sciences, which will be integrated into our programme.

I4. Having received excellent training, the students will go on to have a powerful impact on the world in their future careers, spreading excellence.

Impact on our Partners & ourselves.

I5. Direct impacts will be achieved by students engaging with, and working on projects with, our academic partners, with discipline-specific problems arising in engineering, education, medicine, economics, earth sciences, life sciences and geographical sciences, and our external partners Adarga, the Atomic Weapons Establishment, CheckRisk, EDF, GCHQ, GSK, the Office for National Statistics, Sciex, Shell UK, Trainline and the UK Space Agency. The students will demonstrate a wide range of innovation with these partners, will attract engagement from new partners, and will often provide attractive future employment matches for students and partners alike.

Wider Societal Impact

I6. COMPASS will greatly benefit the UK by providing over 55 highly trained PhD graduates in an area that is known to be suffering from extreme shortages in the people pipeline nationally. COMPASS CDT graduates will be equipped for jobs in sectors of high economic value and national priority, including data science, analytics, pharmaceuticals, security, energy, communications, government, and indeed all research labs that deal with data. Through their training, they will enable these organisations to make well-informed and statistically principled decisions that will allow them to maximise their international competitiveness and contribution to societal well-being. COMPASS will also impact positively on the wider student community, both now and sustainably into the future.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S023569/1                                   01/04/2019  30/09/2027
2592959            Studentship   EP/S023569/1  01/10/2021  19/09/2025  Josh Givens