Multiple imputation by chained equations for data that are missing not at random: methods development for randomised trials and observational studies
Lead Research Organisation:
MRC Centre Cambridge
Abstract
Medical researchers often find that some data which they intended to collect could not be collected: for example, because participants could not be contacted or were unwilling to provide data. These missing data present problems in the analysis of the study, because including only participants who provided data may lead to incorrect results. The commonest way to handle missing data assumes that missing values are similar to observed values within subgroups: for example, for participants whose weight was observed at times 1 and 2 but missing at time 3, the missing weights at time 3 are assumed to have the same average as observed weights at time 3 in participants whose weights were similar at times 1 and 2 and observed at time 3. This approach is called "Missing at Random" and provides a good starting point for analysis but is unlikely to be entirely correct: for example, participants whose weight was unobserved at time 3 may have had a larger weight gain. It is therefore important for researchers to do sensitivity analyses in which different assumptions are made about the missing data.
Our research proposes to adapt a popular method for handling missing data, called Multiple Imputation by Chained Equations (MICE), to allow for a range of assumptions about the missing data. The idea of this approach is that missing values are filled in iteratively using the relationships between all the variables, and that this is done multiple times in order to express uncertainty about the missing data. At present, however, the MICE method is implemented assuming Missing at Random. We have developed a new way to implement the MICE method which does not assume Missing at Random: instead, the researcher specifies how large the departures from Missing at Random are, by specifying the likely average differences between missing values and observed values within subgroups. So far we have only explored the new method in idealised settings; in particular, we have not explored its use in randomised trials or in studies where outcomes are measured over time.
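As a rough illustration of the idea in the paragraph above (and not the project's own software), the following Python sketch imputes a single incomplete outcome under Missing at Random and then shifts the imputed values by a user-specified average difference, here called `delta`. The function names `impute_with_delta` and `pooled_mean`, the simulated weights, and the use of a bootstrap to approximate parameter uncertainty are all assumptions made for this example. The results from the multiple imputed data sets are combined with Rubin's rules.

```python
import numpy as np

def impute_with_delta(X, y, delta, rng):
    """Return one completed copy of y (hypothetical helper).

    Missing values get a regression prediction plus noise (the Missing at
    Random part) plus delta (the assumed departure from Missing at Random).
    """
    obs = ~np.isnan(y)
    # Bootstrap the observed rows so that each imputation reflects
    # uncertainty in the regression parameters (an approximation).
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    A = np.column_stack([np.ones(len(idx)), X[idx]])
    beta, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    resid = y[idx] - A @ beta
    sigma = np.sqrt(resid @ resid / (len(idx) - A.shape[1]))
    mis = ~obs
    A_mis = np.column_stack([np.ones(mis.sum()), X[mis]])
    y_imp = y.copy()
    y_imp[mis] = A_mis @ beta + rng.normal(0, sigma, mis.sum()) + delta
    return y_imp

def pooled_mean(X, y, delta, m=20, seed=0):
    """Estimate the mean of y from m imputations, pooled by Rubin's rules."""
    rng = np.random.default_rng(seed)
    n = len(y)
    est = np.empty(m)
    var = np.empty(m)
    for j in range(m):
        y_imp = impute_with_delta(X, y, delta, rng)
        est[j] = y_imp.mean()
        var[j] = y_imp.var(ddof=1) / n
    within = var.mean()
    between = est.var(ddof=1)
    total = within + (1 + 1 / m) * between
    return est.mean(), np.sqrt(total)

# Simulated data (purely illustrative): weights at times 1 and 2 predict
# weight at time 3, which is missing for roughly 30% of participants.
rng = np.random.default_rng(1)
n = 500
X = rng.normal(70, 10, size=(n, 2))
y = 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 3, n)
y[rng.random(n) < 0.3] = np.nan

# Sensitivity analysis: repeat the analysis over a range of assumed deltas.
for delta in (0.0, 1.0, 2.0):
    estimate, se = pooled_mean(X, y, delta)
    print(f"delta={delta:.1f}  mean={estimate:.2f}  se={se:.2f}")
```

Setting `delta` to zero gives an ordinary Missing at Random analysis. Repeating the analysis over a plausible range of `delta` values shows how far the conclusions depend on the assumption about the missing data.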
The work will first extend the statistical theory to handle outcomes that are measured over time and see how well the method performs in randomised trials. It will then extend the methods to tackle a wide range of problems met in practice: for example different types of variables, complex analysis questions, and very large data sets. This work will be supported by writing user-friendly software to implement the new method in two widely used statistics packages.
We will implement the method in practice in several data sets, including the Avon Longitudinal Study of Parents and Children where we will explore predictors of self-harm, and randomised trials in smoking cessation and weight loss. Missing self-harm, smoking cessation and weight loss data are all very unlikely to be Missing at Random: we will use our subject matter expertise to specify a range of likely average differences between missing values and observed values within subgroups and hence reach more defensible conclusions. This work is likely to raise unexpected theoretical issues which we will address.
Finally, we believe that this method will be widely applicable, so we will disseminate it to researchers via tutorial articles and by running courses.
Technical Summary
Awareness of the problem of missing data has increased in recent years, and multiple imputation is increasingly used to handle it. Standard implementations of multiple imputation make a missing at random (MAR) assumption, which cannot be tested from the data and can rarely be confidently justified. Hence analysis based on the MAR assumption should usually be supplemented by sensitivity analyses exploring departures from MAR. For example, the US National Research Council's report on The Prevention and Treatment of Missing Data in Clinical Trials (2010) highlighted "methods for sensitivity analysis and principled decision making based on the results from sensitivity analyses" as an area of statistical research where progress is particularly needed.
Multiple Imputation by Chained Equations (MICE) is a popular way to implement multiple imputation, but efforts to do missing not at random analyses have lacked a principled foundation. Finbarr Leacy's PhD research has identified the problem, and shown that the solution is to include missingness indicators in all imputation models.
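One way to read "include missingness indicators in all imputation models" is sketched below in Python. This is a simplified illustration under stated assumptions, not the project's implementation: every variable is treated as continuous with a linear imputation model, the function name `narfcs_sketch` and the argument `deltas` are hypothetical, and regression parameters are not redrawn between iterations as a full multiple imputation procedure would require. The coefficient of a variable's own missingness indicator cannot be estimated from the data, because that indicator is zero wherever the variable is observed, so it is supplied by the analyst as a sensitivity parameter.

```python
import numpy as np

def narfcs_sketch(Y, deltas, n_iter=10, seed=0):
    """Return one completed copy of Y (hypothetical, simplified sketch).

    Y      : (n, p) array with np.nan marking missing values
    deltas : length-p sequence; deltas[j] is the assumed shift applied to
             imputed values of variable j (0 corresponds to no shift)
    """
    rng = np.random.default_rng(seed)
    M = np.isnan(Y)                       # missingness indicators
    n, p = Y.shape
    Yc = Y.copy()
    # Start from observed column means.
    col_means = np.nanmean(Y, axis=0)
    Yc[M] = np.take(col_means, np.nonzero(M)[1])

    for _ in range(n_iter):
        for j in range(p):
            mis = M[:, j]
            if not mis.any():
                continue
            others = [k for k in range(p) if k != j]
            # Indicators of the OTHER incomplete variables enter as
            # ordinary predictors whose coefficients are estimated.
            ind = [k for k in others if M[:, k].any()]
            Z = np.column_stack(
                [np.ones(n), Yc[:, others], M[:, ind].astype(float)]
            )
            obs = ~mis
            # lstsq returns a minimum-norm solution if columns are collinear.
            beta, *_ = np.linalg.lstsq(Z[obs], Yc[obs, j], rcond=None)
            resid = Yc[obs, j] - Z[obs] @ beta
            dof = max(obs.sum() - Z.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # The variable's OWN indicator equals 1 for every imputed row,
            # so its user-specified coefficient is simply added on.
            Yc[mis, j] = (
                Z[mis] @ beta + rng.normal(0, sigma, mis.sum()) + deltas[j]
            )
    return Yc

# Simulated example (illustrative only): three correlated variables,
# two of them incomplete.
rng = np.random.default_rng(2)
cov = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
Y_full = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=300)
Y = Y_full.copy()
Y[rng.random(300) < 0.2, 1] = np.nan
Y[rng.random(300) < 0.3, 2] = np.nan

completed = narfcs_sketch(Y, deltas=[0.0, 0.5, -0.5])
```

In practice the function would be called several times with different seeds to produce multiple completed data sets, and the `deltas` would be varied over a range elicited from subject-matter experts.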
The proposed research aims to extend knowledge of this procedure. We aim to understand how the method works in longitudinal data, comparing its simple implementation with simple implementations of pattern-mixture models and selection models. We then aim to extend the method to realistic complex data sets, including different variable types, complex analysis models and large numbers of variables.
Alongside the methods development, we will apply the method in case studies, including exploring predictors of self-harm in the ALSPAC study and re-analysing longitudinal randomised trials in smoking cessation and weight loss. To do this we will develop and implement methods to elicit the magnitude of departures from MAR.
Finally, the project will provide user-friendly software in R and Stata, and disseminate the methods and software in tutorial articles and short courses.
Planned Impact
The primary aim of this research is to facilitate sensitivity analyses which allow for the possibility of data being missing not at random, and hence to make it easier for investigators in randomised trials and observational studies to make realistic allowance for the impact of missing data in their studies.
More broadly, we hope the research will benefit:
- the pharmaceutical industry, which (like academic researchers) will be better able to allow for missing data;
- all investigators, who, by thinking harder about how to handle missing data, should become more aware of its importance and be motivated to take more steps to reduce the amount of missing data at the data collection stage;
- regulatory authorities, who will be better able to assess whether missing data are an important source of bias if these methods are used in the analysis of RCTs;
- researchers beyond the health field, who will benefit from articles in the general literature, since missing data present problems more widely than just in medical research;
- policy makers (e.g. NICE), who need reliable evidence from evaluations of interventions subject to missing data.
Indirectly, we believe the research can ultimately benefit clinicians and patients by avoiding biased studies being used to support new interventions, but also by making it easier for unbiased studies to demonstrate their lack of bias and hence have greater influence on practice.
Publications
Baker R (2016). New models for describing outliers in meta-analysis. Research synthesis methods.
Chen Y (2016). Inference for correlated effect sizes using multiple univariate meta-analyses. Statistics in medicine.
Jackson D (2017). Borrowing of strength and study weights in multivariate and network meta-analysis. Statistical methods in medical research.
Jackson D (2016). The design-by-treatment interaction model: a unifying framework for modelling loop inconsistency in network meta-analysis. Research synthesis methods.
Moreno-Betancur M (2018). Canonical Causal Diagrams to Guide the Treatment of Missing Data in Epidemiologic Studies. American journal of epidemiology.
Tompsett D (2020). A general method for elicitation, imputation, and sensitivity analysis for incomplete repeated binary data. Statistics in medicine.
Tompsett DM (2018). On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice. Statistics in medicine.
Description | Multiple imputation by chained equations for data that are missing not at random: methods development for randomised trials and observational studies |
Amount | £166,749 (GBP) |
Funding ID | MC_EX_MR/M025012 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2016 |
End | 04/2019 |
Description | NARMICE-Melbourne collaboration |
Organisation | Murdoch Children's Research Institute |
Country | Australia |
Sector | Academic/University |
PI Contribution | Contributed to discussions on work led by the MCRI group on developing analysis of incomplete data through use of directed acyclic graphs. |
Collaborator Contribution | Contributed to discussions on work led by the MRC BSU group on developing analysis of incomplete data through not-at-random multiple imputation by chained equations. |
Impact | Two linked presentations at the "Missing Data Analysis and Imputation" group, London (and Skype), 9/2/2017. Two draft papers. |
Start Year | 2016 |
Title | NARFCS extension to MICE for R |
Description | This is ongoing work to move beyond the usual "missing at random" assumption in the handling of missing data. It extends the widely used "MICE" package for R. |
Type Of Technology | Software |
Year Produced | 2017 |
Impact | None at present |
URL | https://github.com/moreno-betancur/mice |
Description | MNAR in MI course |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Methods developed for handling missing data that may be missing not at random were taught in an annual course on multiple imputation |
Year(s) Of Engagement Activity | 2018,2019,2020,2021 |