Partial recovery of missing responses - a toolbox for efficient design and analysis when data may be missing not at random

Lead Research Organisation: UNIVERSITY COLLEGE LONDON
Department Name: Statistical Science

Abstract

Missing data are a common problem in many application areas. The presence of missing values complicates analyses, and if not dealt with properly can result in incorrect conclusions being drawn from the data. It is often helpful to assume there is a process that produces the missing values, typically called a missing data mechanism. A particularly problematic scenario is when this mechanism is in part determined by some other unknown variables, such as the missing values themselves. This is known as a missing not at random (MNAR) mechanism.

If missing values arise from an MNAR mechanism, conclusions drawn from the data will typically be biased. Importantly, it is not possible to tell from the observed data alone whether this problem is present. This is the challenging problem that this proposal seeks to address, namely developing procedures that can best test whether or not MNAR occurs in the data.
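
As a rough illustration of the bias described above, the following sketch simulates a standard normal variable whose chance of being missing grows with its own value, then compares the mean of the observed values with the true mean of 0. The function name `complete_case_mean`, the logistic form of the missingness probability and the `slope` parameter are assumptions made for this example only; they are not taken from the proposal.

```python
import math
import random
import statistics


def complete_case_mean(n, slope, seed=0):
    """Mean of the observed values when P(missing | y) = logistic(slope * y).

    slope = 0 makes missingness independent of y (completely at random);
    slope > 0 makes large values of y more likely to be missing (MNAR).
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)  # true population mean is 0
        p_missing = 1.0 / (1.0 + math.exp(-slope * y))
        if rng.random() >= p_missing:  # this value is observed
            observed.append(y)
    return statistics.fmean(observed)


mcar_mean = complete_case_mean(20000, slope=0.0)  # missingness ignores y
mnar_mean = complete_case_mean(20000, slope=2.0)  # missingness depends on y
print(f"Complete-case mean, missingness independent of y: {mcar_mean:.3f}")
print(f"Complete-case mean, missingness depends on y:     {mnar_mean:.3f}")
```

In this toy setting the first mean stays close to 0 while the second is pulled well below it, and nothing in the observed values alone reveals which of the two situations applies.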

The proposal will consider scenarios where it is possible to recover some of the missing values through a follow-up sample. The main purpose of this is to learn about the missing data mechanism and, specifically, to test whether the MNAR assumption is valid. Further, the recovered data will also help to correct for the effect the missing data have on conclusions. The proposal makes use of optimal design techniques to decide which missing values to follow up. Essentially, certain missing values might yield more information about the type of missing data mechanism than others; in addition, some values might be more likely than others to be recovered. In this way we would ensure that maximum information is obtained from the recovered data. This will allow data analysts to determine whether the presence of MNAR is likely and take appropriate action.
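
To make the idea of learning from a follow-up sample concrete, here is a minimal sketch, not the method developed in the project. It assumes a single outcome with no covariates, a follow-up drawn by simple random sampling from the missing units (the proposal would instead choose them by optimal design), and full recovery of every unit followed up. Under those assumptions, a clear difference in means between recovered and originally observed values is evidence that missingness depends on the value itself. The names `followup_mean_test` and `simulate` are hypothetical, and the test is an ordinary Welch-type comparison with a normal approximation.

```python
import math
import random
import statistics


def followup_mean_test(observed, recovered):
    """Welch-type z statistic and two-sided normal-approximation p-value
    for the difference in means between recovered and observed values."""
    diff = statistics.fmean(recovered) - statistics.fmean(observed)
    se = math.sqrt(
        statistics.variance(observed) / len(observed)
        + statistics.variance(recovered) / len(recovered)
    )
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def simulate(n, slope, n_followup, seed=0):
    """Toy data: P(missing | y) = logistic(slope * y); the follow-up is a
    simple random sample of the missing units (an assumption of this sketch)."""
    rng = random.Random(seed)
    observed, missing = [], []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        p_missing = 1.0 / (1.0 + math.exp(-slope * y))
        if rng.random() < p_missing:
            missing.append(y)
        else:
            observed.append(y)
    recovered = rng.sample(missing, n_followup)
    return observed, recovered


# Missingness depends on y (MNAR in this toy setting).
z_mnar, p_mnar = followup_mean_test(*simulate(2000, slope=2.0, n_followup=100))
# Missingness independent of y.
z_mcar, p_mcar = followup_mean_test(*simulate(2000, slope=0.0, n_followup=100))
print(f"missingness depends on y: z = {z_mnar:.2f}, p = {p_mnar:.3g}")
print(f"missingness ignores y:    z = {z_mcar:.2f}, p = {p_mcar:.3g}")
```

With covariates present, a comparison of this kind would need to condition on them, since a difference in means could also arise from missingness that depends only on observed variables.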

We will collaborate with our project partners, the Office for National Statistics and NHS Blood and Transplant, in the development of these methods. Our project partners will provide relevant data so that we can consider realistic scenarios, and we will discuss interim results with them to ensure our methods are as useful as possible for practitioners. We will also present the work as part of a missing data course at the African Institute for Mathematical Sciences (AIMS) to maximise the global benefit of the work.

The methods developed in this proposal will be disseminated through papers and presentations. In addition, we will create a free-to-use R package that implements the methods to allow easy uptake by users. We will provide training in using this R package as part of a two-day workshop where we will describe our methods to users. A dedicated website will be updated throughout the project to describe developments and facilitate engagement with interested parties.

Planned Impact

The proposed research will have wide ranging benefits with the potential for improving how data is collected, analysed and subsequently used to make important decisions. We can broadly speaking categorise the proposed beneficiaries into three groups: 1) Academic Researchers, 2) Non Academic Researchers e.g. those in government and industry, and 3) Wider public. The benefits of the research are listed below in point form, with each indicating which of the above groups would benefit.

1. New methodology in handling the problem of Missing Not at Random (MNAR)

Academics working in the field would clearly benefit from learning of these developments (as detailed in the academic beneficiaries section). In addition, non academic researchers seeking to analyse data that might contain MNAR would also similarly benefit. Our project partners at the Office for National Statistics and NHS Blood and Transplant clearly show a wide ranging interest in handling this problem outside of academia. Groups to benefit from this component of the proposal are thus 1) and 2). The publications, presentations and the workshop will help to facilitate the transfer of knowledge and the free R package will also allow fast uptake of the methods.

2. A more efficient and appropriate analysis of data with the potential for MNAR

Being able to analyse data with increased certainty about whether MNAR is present, and to correct for it if necessary, will allow researchers to draw conclusions that are more appropriate than would otherwise be the case. This would clearly be beneficial for researchers (both academic and non academic). In addition, the wider public affected by the results of the analysis would benefit from greater certainty in their validity. Thus group 3) will benefit here in addition to groups 1) and 2).

3. A greater appreciation of the importance of handling missing data appropriately

Our dissemination strategy is designed to be as wide reaching as possible. In particular, the course we plan to teach at the African Institute for Mathematical Sciences (AIMS) will aim to educate students in Africa about the importance of careful data collection procedures and how to deal with the problem of missing data appropriately. In many parts of the world, poor quality data and analysis practices obstruct effective decision making. Disseminating these methods through a course will allow participants to learn them, building capability in regions where this is greatly needed. In addition, as important policy decisions are often made based on data analysis, wider society in these areas will also benefit from the implementation of appropriate methods to handle MNAR and missing values in general. Thus group 3) will benefit from this component of the proposal.

Related Projects

Project Reference | Relationship | Related To   | Start      | End        | Award Value
EP/V00641X/1      |              |              | 20/06/2021 | 30/08/2022 | £281,269
EP/V00641X/2      | Transfer     | EP/V00641X/1 | 31/08/2022 | 19/10/2023 | £104,252
 
Description: We have found that significant improvements in detecting a particularly problematic type of missing data are possible through a carefully designed follow-up sample that seeks to recover a proportion of the missing values. Importantly, we also now have an approach that is free from any model assumptions and is thus robust to model mis-specification.
Exploitation Route: We have developed a comprehensive framework to test for, and subsequently potentially account for, the effects of missing not at random (MNAR). This can translate into many benefits for practitioners who are affected by missing data problems and are concerned about the presence of MNAR. The research may transform how practitioners view follow-up sampling and how they implement the follow-up. In particular, our model-free framework gives practitioners extra confidence when applying it.
Sectors: Healthcare; Government, Democracy and Justice; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Description: Conference presentation at MODA, UK
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme? No
Geographic Reach: International
Primary Audience: Professional Practitioners
Results and Impact: The conference provides a high-level international forum for researchers, professionals and practitioners to present and discuss recent advances, new techniques and applications in the field of optimum experimental design. The discussions following the presentation have the potential to open new avenues of research.
Year(s) Of Engagement Activity: 2023
URL: https://statsdavew.github.io/mODa13/
 
Description: Conference presentation in Memphis, USA
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme? No
Geographic Reach: International
Primary Audience: Professional Practitioners
Results and Impact: The conference sought to highlight some of the newest areas for research in experimental design, as well as novel methodology in traditional areas. The interactions following the presentation have the potential to develop new lines of research in the area of design of experiments for missing values.
Year(s) Of Engagement Activity: 2023
URL: https://www.memphis.edu/msci/icodoe22/
 
Description: Internal seminar at the Open University
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme? No
Geographic Reach: Local
Primary Audience: Professional Practitioners
Results and Impact: This was a seminar given to the Open University Statistics research group to disseminate findings from this research. The discussions during the seminar have the potential to initiate new collaborations and directions of research.
Year(s) Of Engagement Activity: 2024
 
Description: Presentation at workshop at Brunel University
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme? No
Geographic Reach: National
Primary Audience: Public/other audiences
Results and Impact: This workshop sought to bring together UK leaders in the field of design and analysis of experiments. Its aim was to provide an avenue for disseminating the state of the art in the methodologies that underpin modern techniques in data collection and analysis, and hence to ensure that the results that statisticians and mathematicians provide to practitioners are robust. There was a good exchange of ideas during the workshop and following the presentation, with the potential to initiate interesting new avenues of research in design of experiments.
Year(s) Of Engagement Activity: 2023
URL: https://sites.google.com/view/robustexperimentation/home