A suite of new nonparametric methods for missing data and data from heterogeneous sources based on the theory of Fréchet classes
Lead Research Organisation: University of Warwick
Department Name: Statistics
Abstract
From the traditional settings of clinical trials to the technologically driven mass collection of data in many modern application areas, the statistician's raw material is often plagued with missing data. Whether this is due to nonresponse or to the increasing heterogeneity of data sources, incompleteness is typically unavoidable in practice. The vast majority of statistical procedures are designed for use with complete information, and without it may become inapplicable, uninterpretable or unreliable. Restricting attention to complete cases, i.e. data points with no missing variables, will often drastically reduce the utility of a data set, both by throwing away useful information in the incomplete cases and by introducing the possibility of bias, since the complete cases need not provide a representative sample of the population.
When a practitioner encounters missing data, the first questions they must ask themselves concern the mechanism by which the data came to be missing, and whether the missingness will cause serious problems in the analysis of their data set and the interpretation of their results. If the absence of information on certain variables can be modelled as independent of the value of the data, then the data is said to be Missing Completely at Random (MCAR), and subsequent analysis is significantly simpler than it would otherwise be. However, the consequences of making this assumption without proper basis can be severe.
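As a minimal formal sketch (in standard notation; the precise formulation in the proposal may differ): writing $X = (X_1, \dots, X_d)$ for the full data vector and $M \in \{0,1\}^d$ for the indicator of which coordinates are observed, MCAR asserts that
\[
  \mathbb{P}(M = m \mid X) = \mathbb{P}(M = m) \quad \text{for every pattern } m \in \{0,1\}^d,
\]
that is, $M \perp\!\!\!\perp X$.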
We will begin with a rigorous study of the consequences of the MCAR assumption, presenting new characterisations of this property and providing novel connections to concepts studied in other fields, including copula theory and convex and computational geometry. Leveraging knowledge developed in these disciplines, we will design new tools for statisticians, bring new perspectives to the analysis of incomplete data, and open up new frontiers in the study of missingness. Specifically, we will link the property of MCAR to Fréchet classes and compatibility.
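To fix ideas, here is a minimal sketch of the objects involved, again in standard notation rather than the proposal's own: given a collection $\mathbb{S}$ of subsets $S \subseteq \{1, \dots, d\}$ and candidate marginal distributions $(P_S)_{S \in \mathbb{S}}$, the associated Fréchet class is
\[
  \mathcal{F}\big((P_S)_{S \in \mathbb{S}}\big) = \big\{ P \text{ on } \mathbb{R}^d : P \circ \pi_S^{-1} = P_S \text{ for all } S \in \mathbb{S} \big\},
\]
where $\pi_S$ denotes the coordinate projection onto $S$; the margins are called compatible when this class is non-empty. The relevance to missingness is that, under MCAR, the distributions of the observed coordinates within each missingness pattern are margins of a single joint law, so they must be compatible.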
With the necessary framework in place, we will introduce hypothesis tests for the assumption of MCAR. In the first instance these will be applicable to contingency tables, but they will be extended to continuous data through binning. Certain alternatives are indistinguishable from the null, but we will show that these tests have power against all fixed alternative hypotheses that are distinguishable, and give situations in which they have optimal power.
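For intuition only, the following Python sketch illustrates the kind of preprocessing that binning involves: each continuous variable is discretised and the observed entries are tabulated separately within each missingness pattern, yielding a collection of contingency tables to which a compatibility-based test of MCAR could then be applied. The helper name pattern_tables, the quantile binning rule and the default of three bins are illustrative assumptions, not the methodology proposed here.

```python
import pandas as pd

def pattern_tables(X: pd.DataFrame, n_bins: int = 3):
    """Illustrative sketch: bin each variable, then tabulate the observed,
    binned entries separately within each missingness pattern."""
    # Discretise each column by its empirical quantiles; missing entries stay missing.
    binned = X.apply(lambda col: pd.Series(
        pd.qcut(col, q=n_bins, labels=False, duplicates="drop"), index=col.index))
    # Missingness pattern of each row, e.g. "101" means the second variable is missing.
    patterns = X.notna().astype(int).astype(str).agg("".join, axis=1)
    tables = {}
    for pat, idx in binned.groupby(patterns).groups.items():
        observed = [c for c, flag in zip(X.columns, pat) if flag == "1"]
        if not observed:  # rows with every variable missing carry no table
            continue
        # Joint contingency table of the observed, binned variables in this pattern.
        tables[pat] = binned.loc[idx, observed].value_counts().sort_index()
    return tables
```

On a data frame containing NaNs, pattern_tables returns one table of binned counts per observed missingness pattern; a test of MCAR would then ask whether these tables are consistent with the margins of a single joint distribution.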
Although a crucial first step, the assumption of MCAR is often too restrictive to be useful in practice. However, it may be that the missingness can be explained by certain fully observed variables, an assumption sometimes termed covariate-dependent missingness (CDM). Using additional insights from the problem of conditional independence testing, we may extend our earlier work to test this more flexible assumption, which is similar to, though stronger than, the usual MAR assumption.
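In the same sketch notation as above (and again only as an indicative formulation): if $W$ denotes a sub-vector of always-observed covariates, this assumption requires
\[
  \mathbb{P}(M = m \mid X) = \mathbb{P}(M = m \mid W) \quad \text{for every pattern } m,
\]
i.e. $M \perp\!\!\!\perp X \mid W$. This is MCAR conditionally on $W$, and is stronger than MAR, which only requires that the missingness depends on the data through the coordinates observed under each pattern.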
In high-dimensional settings, the use of such flexible tests is likely to result in low power, limiting us to simpler tests. To circumvent this issue, our next goal will be to define and analyse new tests for a relaxed version of the problem, which attempt only to find departures from the null that manifest as incompatibility of means and covariance matrices. We will show that all such departures can be detected, even when the dimension grows polynomially in the sample size.
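As a rough sketch of this relaxed null (our notation): writing $\mu_S$ and $\Sigma_S$ for the mean vector and covariance matrix of the observed coordinates under pattern $S$, the first- and second-moment information is compatible if there exist a single $\mu \in \mathbb{R}^d$ and a positive semi-definite $\Sigma \in \mathbb{R}^{d \times d}$ with
\[
  \mu_S = (\mu_j)_{j \in S} \quad \text{and} \quad \Sigma_S = (\Sigma_{jk})_{j,k \in S} \quad \text{for every observed pattern } S;
\]
the relaxed tests look only for violations of this condition, which is why the problem connects to the matrix completion questions mentioned under Exploitation Route below.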
Once hypothesis tests have been carried out and reasonable assumptions developed, a practitioner will typically want to perform inference, for example estimating an unknown quantity together with a confidence interval. In the framework we provide, the construction of confidence intervals for linear estimands is dual to the testing problems we consider. We combine our new technology with empirical process theory to provide minimal-width confidence intervals, even in settings where consistent estimation is not possible.
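The duality appealed to here is the classical inversion of tests into confidence sets; as a generic sketch (not the proposal's specific construction), for a linear estimand $\theta(P)$ one may take
\[
  C_\alpha = \big\{ t : H_0(t) \text{ is not rejected at level } \alpha \big\}, \qquad H_0(t) : \theta(P) = t \text{ for some } P \text{ consistent with the observed margins},
\]
which has coverage at least $1 - \alpha$ whenever each test $H_0(t)$ has level $\alpha$.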
Publications
Berrett T (2023) Efficient functional estimation and the super-oracle phenomenon, in The Annals of Statistics
Berrett T (2023) Optimal nonparametric testing of Missing Completely At Random and its connections to compatibility, in The Annals of Statistics
Li M (2023) On robustness and local differential privacy, in The Annals of Statistics
Li M (2022) Network change point localisation under local differential privacy, in Advances in Neural Information Processing Systems
Sell T (2024) Nonparametric classification with missing data, in The Annals of Statistics
| Description | In the first major piece of work associated with this award, we introduced a rigorous new framework for hypothesis tests of the common assumption of Missing Completely At Random, and proved that our new methodology is optimal in important special cases. This work meets the objectives that I set out for Work Package 1 in the research proposal and has now been accepted by the Annals of Statistics, a leading journal. I am currently working with collaborators to take these findings further; in particular, the nonparametric framework introduced reveals interesting connections between the study of missing data and robust statistics, which we are now pursuing. Work on Work Package 2 of the proposal is available in preprint form. This was completed with a PhD student at Warwick and has been submitted to the Annals of Statistics. The work introduces a high-dimensional version of the framework discussed in Work Package 1, where the approach is related to matrix completion problems, which are of great interest to statisticians and mathematicians alike. Work on Work Package 3 is more diverse and is currently in progress. One piece of work is available in preprint form and has also been submitted to the Annals of Statistics. This is joint work with collaborators of mine at the University of Edinburgh and tackles the problem of nonparametric classification with missing data: we provide the first rigorous theoretical framework for this problem, introduce novel methodology and prove its optimality. Another piece of work, for which I am the sole author, is ongoing and is likely to be available as a preprint before the end of the grant; it proves new links between the estimation of statistical functionals with missing data and well-studied theoretical objects in statistics called generalised ANOVA decompositions. A further piece of work on the estimation of statistical functionals is underway with a collaborator in Rome, in which we bring techniques from the modern theory of optimal transport to bear on missing data. As well as the proposed work on missing data, I have concurrently made significant contributions to the field of differential privacy. |
| Exploitation Route | As described above, the work in the first package forges links between missing data and robust statistics, and we are currently developing these links further. I have also spoken to another researcher who has applied our methods to medical data. Work in package 2 provides new theory for statistical problems related to positive definite matrix completion. These problems have previously been studied in the mathematics literature, but allowing for noisy observations had not previously been treated in a rigorous statistical framework. |
| Sectors | Digital/Communication/Information Technologies (including Software); Security and Diplomacy |
