Multiple imputation by chained equations for data that are missing not at random: methods development for randomised trials and observational studies

Lead Research Organisation: University College London

Abstract

Medical researchers often find that some data which they intended to collect could not be collected: for example, because
participants could not be contacted or were unwilling to provide data. These missing data present problems in the analysis
of the study, because including only participants who provided data may lead to incorrect results. The commonest way to
handle missing data assumes that missing values are similar to observed values within subgroups: for example, for
participants whose weight was observed at times 1 and 2 but missing at time 3, the missing weights at time 3 are assumed
to have the same average as observed weights at time 3 in participants whose weights were similar at times 1 and 2 and
observed at time 3. This assumption is called "Missing at Random" and provides a good starting point for analysis but is
unlikely to be entirely correct: for example, participants whose weight was unobserved at time 3 may have had a larger
weight gain. It is therefore important for researchers to do sensitivity analyses in which different assumptions are made
about the missing data.
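The weight example above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the function name, the distribution of weight gain, and the probabilities of being observed are all hypothetical and are not taken from any study described here. It shows that when participants with larger weight gain are less likely to be observed, the average among those observed understates the true average.

```python
import random
import statistics


def simulate_weight_gain(n=20000, seed=1):
    """Simulate weight gain at time 3 with missingness that depends on the gain itself.

    All numbers are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        gain = rng.gauss(0.0, 2.0)  # weight gain in kg (assumed distribution)
        full.append(gain)
        # Participants who gained weight are less likely to provide data,
        # so the data are not Missing at Random.
        p_observed = 0.9 if gain < 0 else 0.5
        if rng.random() < p_observed:
            observed.append(gain)
    return full, observed


full, observed = simulate_weight_gain()
print("mean gain, all participants:     ", round(statistics.mean(full), 2))
print("mean gain, observed participants:", round(statistics.mean(observed), 2))
```

In this sketch no other variables are measured, so there is no subgroup within which the missing values resemble the observed ones, and an analysis that assumes Missing at Random would inherit the same bias.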
Our research proposes to adapt a popular method for handling missing data called Multiple Imputation by Chained
Equations (MICE) to allow for a range of assumptions about the missing data. The idea of this approach is that missing
values are filled in iteratively using the relationships between all the variables, and this is then done multiple times in order
to express uncertainty about the missing data. At present, however, the MICE method is implemented assuming Missing at
Random. We have developed a new way to implement the MICE method which does not assume Missing at Random:
instead, the researcher has to specify how big the departures from Missing at Random are, by specifying the likely average
differences between missing values and observed values within subgroups. However, we have only explored the new
method in idealised settings, and in particular we have not explored its use in randomised trials or in studies where
outcomes are measured over time.
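The idea of specifying an average difference between missing and observed values within subgroups can be sketched for a single incomplete variable. This is a deliberately simplified illustration, not the method developed in this project: the function name `impute_with_delta` and the toy data are hypothetical, missing values are drawn from observed values in the same subgroup rather than from a fitted model, and only the point estimates are averaged (the variance combination used in multiple imputation is omitted).

```python
import random
import statistics


def impute_with_delta(values, groups, delta, n_imputations=5, seed=0):
    """Average the mean of `values` over several imputed data sets.

    values : list of floats, with None marking a missing value
    groups : subgroup label for each participant
    delta  : assumed average difference (missing minus observed) within subgroups;
             delta = 0 corresponds to Missing at Random in this toy setting
    """
    rng = random.Random(seed)
    observed_by_group = {}
    for value, group in zip(values, groups):
        if value is not None:
            observed_by_group.setdefault(group, []).append(value)

    estimates = []
    for _ in range(n_imputations):
        completed = []
        for value, group in zip(values, groups):
            if value is None:
                # Draw from observed values in the same subgroup,
                # then shift by the researcher-specified delta.
                value = rng.choice(observed_by_group[group]) + delta
            completed.append(value)
        estimates.append(statistics.mean(completed))
    return statistics.mean(estimates)


# Toy data: weights (kg) at time 3 in two subgroups, two values missing.
weights = [70.0, 72.0, None, 80.0, None, 82.0]
subgroup = ["a", "a", "a", "b", "b", "b"]
for delta in (0.0, 1.5, 3.0):
    print(delta, impute_with_delta(weights, subgroup, delta))
```

Running the analysis over a range of `delta` values, as in the loop above, is one way to present a sensitivity analysis: the reader can see how far the conclusion moves as the assumed departure from Missing at Random grows.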
The work will first extend the statistical theory to handle outcomes that are measured over time and assess how well the
method performs in randomised trials. It will then extend the methods to tackle a wide range of problems met in practice:
for example different types of variables, complex analysis questions, and very large data sets. This work will be supported
by writing user-friendly software to implement the new method in two widely used statistics packages.
We will implement the method in practice in several data sets, including the Avon Longitudinal Study of Parents and
Children where we will explore predictors of self-harm, and randomised trials in smoking cessation and weight loss. Missing
self-harm, smoking cessation and weight loss data are all very unlikely to be Missing at Random: we will use our subject
matter expertise to specify a range of likely average differences between missing values and observed values within
subgroups and hence reach more defensible conclusions. This work is likely to raise unexpected theoretical issues which
we will address.
Finally, we believe that this method will be widely applicable, so we will disseminate it to researchers via tutorial articles and by running courses.

Technical Summary

Awareness of the problem of missing data has increased in recent years, and multiple imputation is increasingly used to
handle it. Standard implementations of multiple imputation make a missing at random (MAR) assumption, which cannot be
tested from the data and can rarely be confidently justified. Hence analysis based on the MAR assumption should usually
be supplemented by sensitivity analyses exploring departures from MAR. For example, the US National Research Council's
report on The Prevention and Treatment of Missing Data in Clinical Trials (2010) highlighted "methods for sensitivity
analysis and principled decision making based on the results from sensitivity analyses" as an area of statistical research
where progress is particularly needed.
Multiple Imputation by Chained Equations (MICE) is a popular way to implement multiple imputation, but efforts to do
missing not at random analyses have lacked a principled foundation. Finbarr Leacy's PhD research has identified the
problem, and shown that the solution is to include missingness indicators in all imputation models.
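One way to write down an imputation model that includes missingness indicators is sketched below. The notation is an assumption introduced here for illustration and does not come from this summary: $Y_j$ is the $j$-th incomplete variable, $M_j$ equals 1 if $Y_j$ is missing and 0 otherwise, $Y_{-j}$ and $M_{-j}$ collect the other variables and their indicators, and $g$ is a link function suited to the type of $Y_j$.

```latex
\[
  g\bigl(\mathrm{E}[\,Y_j \mid Y_{-j},\, M_j,\, M_{-j}\,]\bigr)
  = \beta_{0j} + \beta_j^{\top} Y_{-j} + \gamma_j^{\top} M_{-j} + \delta_j M_j
\]
% \beta_{0j}, \beta_j and \gamma_j can be estimated from records where Y_j is observed (M_j = 0).
% \delta_j cannot be estimated from the observed data: it is a sensitivity parameter
% that the analyst specifies, for example from elicited expert opinion.
% Setting \delta_j = 0 and \gamma_j = 0 gives the usual MICE imputation model under MAR.
```

Under this sketch, $\delta_j$ is the assumed average difference (on the scale of $g$) between missing and observed values of $Y_j$ among individuals who share the same values of the other variables and the same missingness pattern in them.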
The proposed research aims to extend knowledge of this procedure. We aim to understand how the method works in
longitudinal data, comparing its simple implementation with simple implementations of pattern-mixture models and
selection models. We then aim to extend the method to realistic complex data sets, including different variable types,
complex analysis models and large numbers of variables.
Alongside the methods development, we will apply the method in case studies, including exploring predictors of self-harm
in the ALSPAC study and re-analysing longitudinal randomised trials in smoking cessation and weight loss. To do this we
will develop and implement methods to elicit the magnitude of departures from MAR.
Finally, the project will provide user-friendly software in R and Stata, and disseminate the methods and software in tutorial
articles and short courses.

Planned Impact

The aims of this research are primarily to facilitate sensitivity analyses which allow for the possibility of data being missing
not at random, and hence to make it easier for investigators in randomised trials and observational studies to make realistic
allowance for the impact of missing data in their analyses.
More broadly, we hope the research will benefit:
- the pharmaceutical industry, which (like academic researchers) will be more able to allow for missing data;
- all investigators, who by thinking harder about how to handle missing data will, we hope, become more aware of its
importance and be motivated to take more steps to reduce the amount of missing data at the data
collection stage;
- regulatory authorities will benefit if these methods are used in the analysis of RCTs, because they will be more able to
assess whether missing data is an important source of bias;
- researchers beyond the health field will benefit from articles in the general literature: missing data presents problems
more widely than just in medical research;
- policy makers (e.g. NICE) who need reliable evidence from evaluations of interventions subject to missing data.
Indirectly, we believe the research can ultimately benefit clinicians and patients by avoiding biased studies being used to
support new interventions, but also by making it easier for unbiased studies to demonstrate their lack of bias and hence
have greater influence on practice.

Publications

 
Description NARMICE-Melbourne collaboration 
Organisation University of Melbourne
Department Murdoch Children's Research Institute
Country Australia 
Sector Academic/University 
PI Contribution Contributed to discussions on work led by the MCRI group on developing analysis of incomplete data through use of directed acyclic graphs.
Collaborator Contribution Contributed to discussions on work led by the MRC BSU group on developing analysis of incomplete data through not-at-random multiple imputation by chained equations.
Impact Two linked presentations at the "Missing Data Analysis and Imputation" group, London (and Skype), 9/2/2017. Two draft papers.
Start Year 2016
 
Title NARFCS extension to MICE for R 
Description This is ongoing work to move beyond the usual "missing at random" assumption in the handling of missing data. It extends the widely used "MICE" package for R.
Type Of Technology Software 
Year Produced 2017 
Impact None at present 
URL https://github.com/moreno-betancur/mice