Development of miDOC: an expert system and methodology for multiple imputation

Lead Research Organisation: University of Bristol
Department Name: Bristol Medical School

Abstract

Much health and social research is done using studies of people - e.g. randomised trials (comparing those who do have a treatment to those who don't), cohort studies (examining how the health of a group of people changes over time, and what causes these changes) or case-control studies (examining the risk factors for getting a relatively rare disease). All these studies can suffer from missing data - either when people drop out completely, or don't answer some questions, or forget to give some information. This missing data can mean that the results of the study are wrong ("biased"), or that they are less precise than they should be, or both.

Much research has been done into how to deal with missing data, and one commonly-used method is multiple imputation (MI). In MI, other information (e.g. details of someone's previous health, and medications they are currently using) is used to predict ("impute") the missing information. The success of this technique depends crucially on why the information is missing in the first place, and how well the missing information can be predicted. There are guidelines for researchers carrying out MI, but some of the guidelines are not correct, and some are complex and hard to follow. Different researchers use MI in different ways, and do not usually document what they did - so it is hard to replicate analyses, or to see if analysts have followed best practice.
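As a rough illustration of the idea, the sketch below imputes a partly missing outcome several times from a related variable and pools the results with Rubin's rules. It is a minimal example on simulated data, not the project's method: the function names (`fit_line`, `multiply_impute_mean`, `simulate`), the simulated values and the use of a bootstrap to reflect uncertainty in the imputation model are all assumptions made for this example, and Python is used only for convenience.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd


def multiply_impute_mean(xs, ys, m=20, seed=0):
    """Estimate mean(y) by multiple imputation; ys holds None where missing."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Refit the imputation model on a bootstrap sample so that each
        # imputed dataset reflects uncertainty in the model itself.
        boot = [rng.choice(observed) for _ in observed]
        a, b, sd = fit_line([x for x, _ in boot], [y for _, y in boot])
        # Fill each gap with a prediction plus random residual noise.
        completed = [
            y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)
        ]
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    # Rubin's rules: combine within- and between-imputation variance.
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


def simulate(n=2000, seed=1):
    """Simulated data: y is more often missing when x is high (true mean of y is 2)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys_full = [2 + x + rng.gauss(0, 1) for x in xs]
    ys = [None if (x > 0 and rng.random() < 0.8) else y
          for x, y in zip(xs, ys_full)]
    return xs, ys


xs, ys = simulate()
observed_only = statistics.fmean([y for y in ys if y is not None])
estimate, std_error = multiply_impute_mean(xs, ys)
print("mean of observed values only:", round(observed_only, 2))
print("multiple imputation estimate:", round(estimate, 2), "SE:", round(std_error, 3))
```

In this simulation the chance that a value is missing depends only on the fully observed variable, and the imputation model has the right form, so the pooled estimate should sit close to the true mean while the mean of the observed values alone is too low. When either condition fails, MI can itself give biased results, which is the situation the work described below is concerned with.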

We aim to develop methods to address some remaining issues about how to carry out MI. We will assess what problems are caused when people use the wrong analysis model, what problems may arise from including some variables in the imputation model that do not predict the missing information very well, and how to choose which other variables to include in the imputation model. We will also investigate how researchers can best check whether their MI is working well. We will then pull these new methods together with existing knowledge into a new automated expert system, the 'multiple imputation Doctor' (miDOC). miDOC will guide researchers through their analyses, examining the structure of the dataset to advise on whether multiple imputation is needed, and if so how to perform it. miDOC will be useful for all researchers using incomplete data, but will be particularly aimed at those who may have relatively little formal training in statistical analysis of missing data. Not only will miDOC give users access to expert advice on their analysis, but by providing documented decisions and code it will increase reproducibility and transparency of analyses.

We will run focus groups with researchers to help us develop miDOC, and refine it on the basis of feedback. We will make miDOC freely available, and also include the methods and information about miDOC in courses we already run on how to deal with missing data, on www.missingdata.org.uk and in the second edition of a textbook authored by one of the co-applicants. We will run two free workshops (which will be made permanently available online), in order to help as many people as possible benefit from these methods and miDOC. The methods and miDOC will be useful for all types of study - randomised trials, cohort studies, case-control studies - and thus have the potential to improve much research in both health and medicine, and beyond.

We will use our links with other cohorts, academic and non-academic agencies to ensure that our methods are widely used, and thus improve the level of evidence informing policy and practice in the UK and worldwide.

Technical Summary

Missing data are common in health research and are increasingly addressed by multiple imputation (MI). There are unresolved methodological questions around how to choose the best imputation model for each incomplete variable. Application of MI can be complex and involve multiple decisions which are rarely justified (e.g. which variables to include in each imputation model, how to specify the functional form of each imputation model, which diagnostics to use for the MI procedure, etc.).
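To make one of these decisions concrete, the sketch below screens candidate auxiliary variables by their correlation with the incomplete variable and with its missingness indicator, and records the result so that the choice is documented. It is a simple heuristic for illustration only, not the variable selection methodology the project sets out to develop: the function names (`pearson`, `screen_auxiliaries`) and the 0.1 threshold are assumptions made for this example.

```python
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (saa * sbb) ** 0.5


def screen_auxiliaries(target, candidates, threshold=0.1):
    """Report, per candidate, its correlation with the target and with missingness.

    target:     list of numbers, with None where the value is missing
    candidates: dict mapping a name to a fully observed list of numbers
    threshold:  arbitrary cut-off chosen for this example only
    """
    missing = [1.0 if t is None else 0.0 for t in target]
    obs_idx = [i for i, t in enumerate(target) if t is not None]
    obs_target = [target[i] for i in obs_idx]
    report = {}
    for name, values in candidates.items():
        # Correlation with the target can only be assessed where it is observed.
        r_value = pearson([values[i] for i in obs_idx], obs_target)
        r_missing = pearson(values, missing)
        report[name] = {
            "r_with_value": r_value,
            "r_with_missingness": r_missing,
            # Suggest inclusion only if the candidate predicts the value itself.
            "include": abs(r_value) >= threshold,
        }
    return report
```

The returned dictionary keeps both correlations for every candidate, so a variable that predicts missingness but not the missing value can be flagged for closer inspection rather than silently included or dropped.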

We will tackle these outstanding issues, and combine our insights with current knowledge into a new automated expert system, the 'multiple imputation Doctor' (miDOC). miDOC will guide researchers through their analyses and, by providing documented decisions and code, will increase reproducibility and transparency of analyses.
Objectives:
1-3: Resolve outstanding questions around bias due to incorrect specification of the imputation model (even when compatible with the analysis model), and bias due to including variables that are strongly predictive of missingness, or due to over-fitting of the imputation model.
4: Develop methodology and an associated algorithm to identify the optimum choice of variables to include in an imputation model, for imputation of different types and roles of variables.
5: Extend diagnostics to address issues in (1), including diagnostics for over-fitting of imputation models.
6: Incorporate the results of (1) - (5), together with current knowledge, into an expert system, miDOC, developed in R. miDOC will take the scientific model and data, then (i) identify whether MI is likely to be biased, (ii) implement a sensible MI strategy and (iii) provide diagnostics, including a summary of the MI assumptions.
7: Apply miDOC to exemplar analyses.
8: Disseminate the results through conference presentations, research articles, workshops, courses and guidance for researchers, our established website, www.missingdata.org, and the next edition of Carpenter & Kenward.
