Two-sample testing with arbitrarily missing data

Lead Research Organisation: Imperial College London
Department Name: Mathematics

Abstract

Two-sample testing is a fundamental approach in statistics used to decide if two samples of data are significantly different in some way. Well-known two-sample testing methods include Student's t test [6], the Wilcoxon-Mann-Whitney U test [2], and the Maximum Mean Discrepancy (MMD) test [1]. Nearly all two-sample testing methods are designed solely for data that are fully observed. However, in many real-world datasets, a subset of univariate values may be missing, or multivariate values may only be partially observed.
When data are missing, common practice is either to discard the missing values or to fill them in using an imputation scheme, after which the data are treated as complete for testing. However, except in special cases, such as when the data are missing completely at random [3], these practices are often invalid because they can inflate the probability of a Type I error. Under certain assumptions, such as when the data are missing not at random but the missingness mechanisms can be explicitly specified [5], more sophisticated missing data methods are available, such as the expectation-maximization algorithm and multiple imputation [4]. Without these assumptions, however, relying on these methods is risky.
Our research focuses on two-sample hypothesis testing in the presence of missing data when no assumptions about the missingness mechanisms can be made. In one of our studies [8], we proposed an approach to detect location shifts in univariate data with missing values, without making assumptions about the missing data. The approach is a theoretical extension of the Wilcoxon-Mann-Whitney test: we derive exact bounds for the test statistic and reject the null hypothesis only after accounting for all possible values of the missing data. This avoids both ignoring and imputing the missing data, and is shown to control the Type I error without assumptions about the missingness mechanisms, while retaining good statistical power when the proportion of missing data is around 10%.
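The bounding idea described above can be illustrated with a minimal Python sketch. It is not the implementation from [8]: the function names are hypothetical, missing values are marked with None, the data are assumed continuous (no tie correction), and a normal approximation to the null distribution is used in place of an exact one. A missing value is allowed to take any real value, so every pair that involves a missing value can contribute either 0 or 1 to the Mann-Whitney U statistic.

```python
import math

def wmw_u_bounds(x, y):
    """Bounds on U = #{(i, j): x_i < y_j} when None marks a missing value.

    Assumption for this sketch: a missing value may be any real number,
    so each pair involving a missing value contributes between 0 and 1.
    """
    x_obs = [v for v in x if v is not None]
    y_obs = [v for v in y if v is not None]
    # U over fully observed pairs; ties count one half.
    u_obs = sum(1.0 if a < b else 0.5 if a == b else 0.0
                for a in x_obs for b in y_obs)
    pairs_with_missing = len(x) * len(y) - len(x_obs) * len(y_obs)
    return u_obs, u_obs + pairs_with_missing

def two_sided_p(u, n, m):
    """Normal-approximation p-value for U under the null (no tie correction)."""
    mu = n * m / 2.0
    sigma = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = abs(u - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))

def worst_case_p_value(x, y):
    """Largest p-value over every U in [U_min, U_max]."""
    n, m = len(x), len(y)
    lo, hi = wmw_u_bounds(x, y)
    mu = n * m / 2.0
    if lo <= mu <= hi:
        return 1.0  # some completion of the data looks exactly like the null
    return max(two_sided_p(lo, n, m), two_sided_p(hi, n, m))

# Usage: reject at level 0.05 only if even the least favourable
# completion of the missing values is significant.
x = [float(v) for v in range(1, 20)] + [None]
y = [float(v) for v in range(101, 120)] + [None]
reject = worst_case_p_value(x, y) <= 0.05
```

Because the decision is based on the largest attainable p-value, a rejection here implies a rejection for every possible completion of the missing data, which is why no assumption about the missingness mechanism is needed in this sketch.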
In related work [7], we extend this framework to multivariate data in order to detect any distributional shift. This work builds upon the MMD test and follows a similar approach to [8]: we first derive bounds for the unbiased MMD test statistic using the Laplacian kernel [1] in the presence of missing data, and then reject the null hypothesis only when all possible values of the test statistic are significant. We prove theoretically that this approach controls the Type I error, regardless of the values of the missing data. Numerical simulations show that the method has good testing power when the proportion of missing data is around 5% to 10%.
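As a rough illustration of the quantities involved, the following Python sketch computes the unbiased MMD statistic of [1] for complete data, together with elementary bounds on a single kernel value when some coordinates are missing. It is not the bound derivation of [7]: the function names, the L1 form of the Laplacian kernel, the bandwidth parameter beta, and the use of None for a missing coordinate are all assumptions made for this example.

```python
import math

def laplacian_kernel(a, b, beta=1.0):
    """Assumed form of the Laplacian kernel: exp(-beta * ||a - b||_1)."""
    return math.exp(-beta * sum(abs(p - q) for p, q in zip(a, b)))

def mmd2_unbiased(xs, ys, beta=1.0):
    """Unbiased estimate of squared MMD for fully observed samples.

    xs and ys are sequences of equal-length tuples, each with at least
    two points. The estimate can be negative.
    """
    n, m = len(xs), len(ys)
    k_xx = sum(laplacian_kernel(xs[i], xs[j], beta)
               for i in range(n) for j in range(n) if i != j)
    k_yy = sum(laplacian_kernel(ys[i], ys[j], beta)
               for i in range(m) for j in range(m) if i != j)
    k_xy = sum(laplacian_kernel(a, b, beta) for a in xs for b in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2.0 * k_xy / (n * m)

def kernel_bounds(a, b, beta=1.0):
    """Range of the kernel value when None marks a missing coordinate.

    The upper bound uses only coordinates observed in both vectors (the
    missing ones could coincide). If anything is missing, the lower
    bound 0 is an infimum, approached as a missing coordinate moves
    arbitrarily far away.
    """
    both = [(p, q) for p, q in zip(a, b) if p is not None and q is not None]
    upper = math.exp(-beta * sum(abs(p - q) for p, q in both))
    lower = upper if len(both) == len(a) else 0.0
    return lower, upper

# Usage: two well-separated samples give a clearly positive statistic.
stat = mmd2_unbiased([(0.0,), (0.1,)], [(10.0,), (10.1,)])
```

Combining such pairwise ranges term by term would give valid but generally loose bounds on the statistic, since one missing value appears in many kernel terms at once; the bounds referred to above are derived for the statistic as a whole.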
Our future work aims to explore two-sample testing in the presence of missing data for alternative scenarios, such as when the detection of location-scale shifts is of interest. We are also interested in developing methods for dealing with missing data in other problems, such as two-sample independence testing and changepoint detection.
[1] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723-773, 2012.
[2] Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50-60, 1947.
[3] Donald B Rubin. Inference and missing data. Biometrika, 63(3):581-592, 1976.
[4] Joseph L Schafer. Multiple imputation: a primer. Statistical Methods in Medical Research, 8(1):3-15, 1999.
[5] Joseph L Schafer and John W Graham. Missing data: our view of the state of the art. Psychological Methods, 7(2):147, 2002.
[7] Yijin Zeng, Niall M Adams, and Dean A Bodenham. MMD two-sample testing in the presence of arbitrarily missing data. arXiv preprint arXiv:2405.15531, 2024.
[8] Yijin Zeng, Niall M Adams, and Dean A Bodenham. On two-sample testing for data with ar

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

People

ORCID iD

Yijin Zeng (Student)

Publications

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S023151/1       -             -             31/03/2019  29/09/2027  -
2602749            Studentship   EP/S023151/1  01/10/2021  29/09/2025  Yijin Zeng