Multiple imputation by chained equations for data that are missing not at random: methods development for randomised trials and observational studies

Lead Research Organisation: MRC Centre Cambridge
Department Name: MRC Biostatistics Unit

Abstract

Medical researchers often find that some data which they intended to collect could not be collected: for example, because participants could not be contacted or were unwilling to provide data. These missing data present problems in the analysis of the study, because including only participants who provided data may lead to incorrect results. The commonest way to handle missing data assumes that missing values are similar to observed values within subgroups: for example, for participants whose weight was observed at times 1 and 2 but missing at time 3, the missing weights at time 3 are assumed to have the same average as observed weights at time 3 in participants whose weights were similar at times 1 and 2 and observed at time 3. This approach is called "Missing at Random" and provides a good starting point for analysis but is unlikely to be entirely correct: for example, participants whose weight was unobserved at time 3 may have had a larger weight gain. It is therefore important for researchers to do sensitivity analyses in which different assumptions are made about the missing data.
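The weight example can be illustrated with a small simulation. This is a minimal sketch under invented assumptions: the sample size, the weight distributions and the dropout mechanism are all hypothetical, and a single regression fill-in stands in for a full Missing at Random analysis. Dropout at time 3 is made more likely for participants with larger weight gain, so the Missing at Random assumption is wrong by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: weight at time 2 is always observed.
weight_t2 = rng.normal(80.0, 10.0, n)
gain = rng.normal(1.0, 3.0, n)          # true change between times 2 and 3
weight_t3 = weight_t2 + gain            # true weight at time 3

# Assumed not-at-random mechanism: larger gains make dropout more likely.
p_missing = 1.0 / (1.0 + np.exp(-(gain - 1.0)))
missing = rng.random(n) < p_missing

# Fill-in under Missing at Random: predict time-3 weight from time-2 weight,
# using only participants observed at time 3.
slope, intercept = np.polyfit(weight_t2[~missing], weight_t3[~missing], 1)
imputed = weight_t3.copy()
imputed[missing] = intercept + slope * weight_t2[missing]

true_mean = weight_t3.mean()
mar_mean = imputed.mean()
print(f"true mean weight at time 3:       {true_mean:.2f} kg")
print(f"mean after Missing at Random fill: {mar_mean:.2f} kg")
```

In this set-up the filled-in mean falls below the true mean, because the participants who dropped out gained more weight than comparable participants who stayed in the study.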

Our research proposes to adapt a popular method for handling missing data called Multiple Imputation by Chained Equations (MICE) to allow for a range of assumptions about the missing data. The idea of this approach is that missing values are filled in iteratively using the relationships between all the variables, and this is then done multiple times in order to express uncertainty about the missing data. At present, however, the MICE method is implemented assuming Missing at Random. We have developed a new way to implement the MICE method which does not assume Missing at Random: instead, the researcher specifies how large the departures from Missing at Random are, in terms of the likely average differences between missing values and observed values within subgroups. So far we have explored the new method only in idealised settings; in particular, we have not explored its use in randomised trials or in studies where outcomes are measured over time.
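One way to write down the quantity the researcher specifies is sketched below. The symbols are assumptions introduced here for illustration: Y is a variable with missing values, X denotes the variables defining the subgroup, M is an indicator that Y is missing, and \delta is the specified average difference.

```latex
\delta \;=\; E[\,Y \mid X = x,\; M = 1\,] \;-\; E[\,Y \mid X = x,\; M = 0\,]
```

Setting \delta = 0 corresponds to no average difference between missing and observed values within subgroups, as under Missing at Random; a sensitivity analysis repeats the imputation over a range of plausible values of \delta.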

The work will first extend the statistical theory to handle outcomes that are measured over time and see how well the method performs in randomised trials. It will then extend the methods to tackle a wide range of problems met in practice: for example different types of variables, complex analysis questions, and very large data sets. This work will be supported by writing user-friendly software to implement the new method in two widely used statistics packages.

We will implement the method in practice in several data sets, including the Avon Longitudinal Study of Parents and Children where we will explore predictors of self-harm, and randomised trials in smoking cessation and weight loss. Missing self-harm, smoking cessation and weight loss data are all very unlikely to be Missing at Random: we will use our subject matter expertise to specify a range of likely average differences between missing values and observed values within subgroups and hence reach more defensible conclusions. This work is likely to raise unexpected theoretical issues which we will address.

Finally, we believe that this method will be widely applicable, so we will disseminate it to researchers via tutorial articles and by running courses.

Technical Summary

Awareness of the problem of missing data has increased in recent years, and multiple imputation is increasingly used to handle it. Standard implementations of multiple imputation make a missing at random (MAR) assumption, which cannot be tested from the data and can rarely be confidently justified. Hence analysis based on the MAR assumption should usually be supplemented by sensitivity analyses exploring departures from MAR. For example, the US National Research Council's report on The Prevention and Treatment of Missing Data in Clinical Trials (2010) highlighted "methods for sensitivity analysis and principled decision making based on the results from sensitivity analyses" as an area of statistical research where progress is particularly needed.

Multiple Imputation by Chained Equations (MICE) is a popular way to implement multiple imputation, but efforts to do missing not at random analyses have lacked a principled foundation. Finbarr Leacy's PhD research has identified the problem, and shown that the solution is to include missingness indicators in all imputation models.
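The idea of including missingness indicators in all imputation models can be sketched as follows. This is an illustrative simplification, not the project's software: the function name `impute_with_missingness_indicators` and its arguments are hypothetical, every variable is treated as continuous with a linear imputation model, and regression coefficients are not redrawn between imputations as a full multiple imputation procedure would require. Each variable is regressed on the other variables and on the missingness indicators of the other variables; the user-supplied `deltas[j]` plays the role of the coefficient of the variable's own missingness indicator, which cannot be estimated from the data.

```python
import numpy as np

def impute_with_missingness_indicators(data, deltas, n_imputations=5,
                                       n_iterations=10, seed=0):
    """Return completed copies of `data`, where NaN marks a missing value.

    deltas[j] is the assumed mean shift for missing values of column j;
    all zeros means no departure from the fitted model for observed values.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    n_rows, n_cols = data.shape
    completed_sets = []
    for _ in range(n_imputations):
        filled = data.copy()
        # Initialise each missing entry with a randomly chosen observed value.
        for j in range(n_cols):
            n_missing = int(missing[:, j].sum())
            if n_missing:
                filled[missing[:, j], j] = rng.choice(
                    data[~missing[:, j], j], n_missing)
        for _ in range(n_iterations):
            for j in range(n_cols):
                mis = missing[:, j]
                if not mis.any():
                    continue
                obs = ~mis
                # Predictors: intercept, the other variables, and the
                # missingness indicators of the other variables.
                others = np.delete(filled, j, axis=1)
                flags = np.delete(missing, j, axis=1).astype(float)
                design = np.column_stack([np.ones(n_rows), others, flags])
                coef, *_ = np.linalg.lstsq(design[obs], filled[obs, j],
                                           rcond=None)
                sigma = (filled[obs, j] - design[obs] @ coef).std()
                # Own missingness indicator enters through the fixed shift.
                filled[mis, j] = (design[mis] @ coef + deltas[j]
                                  + rng.normal(0.0, sigma, int(mis.sum())))
        completed_sets.append(filled)
    return completed_sets

# Usage on invented data: column 1 is missing for the first 50 rows.
demo_rng = np.random.default_rng(1)
x = demo_rng.normal(size=200)
y = 2.0 * x + demo_rng.normal(size=200)
y[:50] = np.nan
demo = np.column_stack([x, y])
mar_sets = impute_with_missingness_indicators(demo, [0.0, 0.0], seed=42)
mnar_sets = impute_with_missingness_indicators(demo, [0.0, 2.0], seed=42)
```

Running the same analysis on `mar_sets` and `mnar_sets` shows how far the conclusions move when missing values are assumed to be, on average, 2 units higher than comparable observed values.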

The proposed research aims to extend knowledge of this procedure. We aim to understand how the method works in longitudinal data, comparing its simple implementation with simple implementations of pattern-mixture models and selection models. We then aim to extend the method to realistic complex data sets, including different variable types, complex analysis models and large numbers of variables.

Alongside the methods development, we will apply the method in case studies, including exploring predictors of self-harm in the ALSPAC study and re-analysing longitudinal randomised trials in smoking cessation and weight loss. To do this we will develop and implement methods to elicit the magnitude of departures from MAR.

Finally, the project will provide user-friendly software in R and Stata, and disseminate the methods and software in tutorial articles and short courses.

Planned Impact

The aims of this research are primarily to facilitate sensitivity analyses which allow for the possibility of data being missing not at random, and hence to make it easier for investigators in randomised trials and observational studies to make realistic allowance for the impact of missing data in their studies.

More broadly, we hope the research will benefit:

- the pharmaceutical industry, which (like academic researchers) will be more able to allow for missing data;

- all investigators, who by thinking harder about how to handle missing data will, we hope, become more aware of the importance of missing data and be motivated to take more steps to reduce the amount of missing data at the data collection stage;

- regulatory authorities, which will be more able to assess whether missing data are an important source of bias if these methods are used in the analysis of RCTs;

- researchers beyond the health field, who will benefit from articles in the general literature, since missing data present problems more widely than just in medical research;

- policy makers (e.g. NICE) who need reliable evidence from evaluations of interventions subject to missing data.

Indirectly, we believe the research can ultimately benefit clinicians and patients by avoiding biased studies being used to support new interventions, but also by making it easier for unbiased studies to demonstrate their lack of bias and hence have greater influence on practice.

Publications

 
Description Multiple imputation by chained equations for data that are missing not at random: methods development for randomised trials and observational studies
Amount £166,749 (GBP)
Funding ID MC_EX_MR/M025012 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 04/2016 
End 04/2019
 
Description NARMICE-Melbourne collaboration 
Organisation Murdoch Children's Research Institute
Country Australia 
Sector Academic/University 
PI Contribution Contributed to discussions on work led by the MCRI group on developing analysis of incomplete data through use of directed acyclic graphs.
Collaborator Contribution Contributed to discussions on work led by the MRC BSU group on developing analysis of incomplete data through not-at-random multiple imputation by chained equations.
Impact Two linked presentations at the "Missing Data Analysis and Imputation" group, London (and Skype), 9/2/2017. Two draft papers.
Start Year 2016
 
Title NARFCS extension to MICE for R 
Description Ongoing work to move beyond the usual "missing at random" assumption in the handling of missing data. It extends the widely used "MICE" package for R.
Type Of Technology Software 
Year Produced 2017 
Impact None at present 
URL https://github.com/moreno-betancur/mice
 
Description MNAR in MI course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Methods developed for handling missing data that may be missing not at random were taught in an annual course on multiple imputation.
Year(s) Of Engagement Activity 2018,2019,2020,2021