Partial recovery of missing responses - a toolbox for efficient design and analysis when data may be missing not at random

Lead Research Organisation: University College London
Department Name: Statistical Science

Abstract

Missing data are a common problem in many application areas. The presence of missing values complicates analyses, and if not dealt with properly can result in incorrect conclusions being drawn from the data. It is often helpful to assume there is a process that produces the missing values, typically called a missing data mechanism. A particularly problematic scenario is when this mechanism is in part determined by some other unknown variables, such as the missing values themselves. This is known as a missing not at random (MNAR) mechanism.

If missing values arise due to an MNAR mechanism then conclusions drawn from the data will typically be biased. Importantly, it is also not possible to know from the observed data alone whether this problem occurs. This is the challenging problem area that this proposal seeks to address, namely developing procedures that can best test whether or not MNAR occurs in the data.
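The bias described above can be illustrated with a small simulation. This is a minimal sketch rather than part of the proposed methodology: the function name `simulate_observed_means`, the standard normal outcome, and the logistic missingness model with its `slope` parameter are all assumptions chosen for illustration.

```python
import math
import random


def simulate_observed_means(n=20000, slope=2.0, seed=1):
    """Compare the mean of the observed values under two missingness mechanisms.

    Illustrative assumptions: the outcome is standard normal, and under MNAR
    the probability that a value is missing is a logistic function of the
    value itself.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Missing completely at random: every value is dropped with probability 0.5.
    mcar_observed = [v for v in y if rng.random() >= 0.5]

    # Missing not at random: larger values are more likely to be dropped.
    mnar_observed = [
        v for v in y
        if rng.random() >= 1.0 / (1.0 + math.exp(-slope * v))
    ]

    def mean(values):
        return sum(values) / len(values)

    return mean(y), mean(mcar_observed), mean(mnar_observed)


full_mean, mcar_mean, mnar_mean = simulate_observed_means()
print(f"full data: {full_mean:.3f}")
print(f"observed under MCAR: {mcar_mean:.3f}")
print(f"observed under MNAR: {mnar_mean:.3f}")
```

In this sketch the mean of the observed values stays close to the full-data mean when values are dropped completely at random, but is pulled downwards when larger values are more likely to be missing. An analyst who sees only the observed values cannot tell which of the two situations applies.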

The proposal will consider scenarios where it is possible to recover some of the missing values through a follow-up sample. The main purpose of this is to learn about the missing data mechanism and specifically to test whether the MNAR assumption is valid or not. Further, the recovered data will also help to correct for the effect the missing data have on conclusions. The proposal makes use of optimal design techniques to decide which missing values to follow up. Essentially, certain missing values might yield more information about the type of missing data mechanism than others; in addition, some values might be more likely than others to be recovered. In this way we would ensure that the maximum information is obtained from the recovered data. This will allow data analysts to determine whether the presence of MNAR is likely and take appropriate action.
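The idea of using recovered values both to check for MNAR and to correct an estimate can be sketched as follows. This is a hedged illustration and not the method the proposal develops: it follows up a simple random sample of the missing values instead of an optimally designed one, assumes every followed-up value is recovered, and uses the classical two-phase (double sampling) estimator for nonresponse. The function name `follow_up_demo`, the logistic missingness model, and all parameter values are assumptions.

```python
import math
import random
import statistics


def follow_up_demo(n=20000, m=500, slope=2.0, seed=7):
    """Recover a random subsample of missing values and use it in two ways.

    Assumptions (illustrative only): standard normal outcome, logistic MNAR
    mechanism, simple random follow-up of m missing values, full recovery.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    missing = [rng.random() < 1.0 / (1.0 + math.exp(-slope * v)) for v in y]

    observed = [v for v, miss in zip(y, missing) if not miss]
    unobserved = [v for v, miss in zip(y, missing) if miss]

    # Follow-up: recover m of the missing values chosen at random.
    recovered = rng.sample(unobserved, m)

    # Crude check: with no covariates, a clear difference between recovered
    # and originally observed values suggests missingness depends on the
    # value itself. A two-sample z-statistic with unequal variances is used.
    diff = statistics.fmean(recovered) - statistics.fmean(observed)
    se = math.sqrt(
        statistics.variance(recovered) / len(recovered)
        + statistics.variance(observed) / len(observed)
    )
    z = diff / se

    # Two-phase estimator: weight each group mean by its share of the sample.
    share_missing = len(unobserved) / n
    corrected = (
        (1.0 - share_missing) * statistics.fmean(observed)
        + share_missing * statistics.fmean(recovered)
    )

    return {
        "true_mean": statistics.fmean(y),
        "naive_mean": statistics.fmean(observed),
        "corrected_mean": corrected,
        "z": z,
    }


for name, value in follow_up_demo().items():
    print(f"{name}: {value:.3f}")
```

With a random follow-up, every missing value is equally likely to be chosen. The proposal's point is that a designed follow-up could do better, because some missing values are more informative about the mechanism, and some are more likely to be recovered, than others.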

We will collaborate with our project partners, the Office for National Statistics and NHS Blood and Transplant in the development of these methods. Our project partners will provide relevant data for us to consider realistic scenarios and we will discuss interim results with them to ensure our methods are most useful for practitioners. We will also present the work as part of a missing data course at the African Institute of Mathematical Sciences (AIMS) to maximise the global benefit of the work.

The methods developed in this proposal will be disseminated through papers and presentations. In addition, we will create a free to use R package that will implement the methods to allow easy uptake by users. We will provide training in using this R package as part of a two-day workshop where we will describe our methods to users. A dedicated website will be updated throughout the project to describe developments and facilitate engagement with interested parties.

Planned Impact

The proposed research will have wide-ranging benefits, with the potential to improve how data are collected, analysed and subsequently used to make important decisions. Broadly speaking, we can categorise the proposed beneficiaries into three groups: 1) academic researchers, 2) non-academic researchers, e.g. those in government and industry, and 3) the wider public. The benefits of the research are listed below in point form, with each indicating which of the above groups would benefit.

1. New methodology in handling the problem of Missing Not at Random (MNAR)

Academics working in the field would clearly benefit from learning of these developments (as detailed in the academic beneficiaries section). In addition, non-academic researchers seeking to analyse data that might be affected by MNAR would similarly benefit. The involvement of our project partners at the Office for National Statistics and NHS Blood and Transplant shows that there is wide-ranging interest in handling this problem outside of academia. The groups that benefit from this component of the proposal are thus 1) and 2). The publications, presentations and workshop will help to facilitate the transfer of knowledge, and the free R package will allow fast uptake of the methods.

2. A more efficient and appropriate analysis of data with the potential for MNAR

Being able to analyse data with increased certainty about whether MNAR is present, and to correct for it where necessary, will allow researchers to draw conclusions that are more appropriate than would otherwise be the case. This would clearly be beneficial for researchers (both academic and non-academic). In addition, members of the wider public who are affected by the results of the analysis would benefit from greater certainty in their validity. Thus group 3) will benefit here in addition to groups 1) and 2).

3. A greater appreciation of the importance of handling missing data appropriately

Our dissemination strategy is designed to be as wide reaching as possible. In particular, the course we plan to teach at the African Institute of Mathematical Sciences (AIMS) will aim to educate students in Africa about the importance of careful data collection procedures and how to deal with the problem of missing data appropriately. In many parts of the world, poor quality data and analysis practices obstruct effective decision making. Disseminating these methods through a course will allow participants to learn them, building capability in regions where this is greatly needed. In addition, as important policy decisions are often made based on data analysis, wider society in these areas will also benefit from the implementation of appropriate methods to handle MNAR and missing values in general. Thus group 3) will benefit from this component of the proposal.
