Developing and disseminating robust methods for handling missing data in epidemiological studies

Lead Research Organisation: University of Bristol
Department Name: Social Medicine


Longitudinal studies (studies in which individuals are followed over periods of many months or years) are of great importance in understanding how aspects of people's lifestyle or environment influence their health and wellbeing. When many individuals are followed over extended periods it is inevitable that measurements on particular variables are sometimes missing, for example because a measuring device broke down, or a subject did not answer certain questions or did not attend an examination. Individuals may also drop out of the study altogether. Missing values raise difficult issues in the analysis of data from longitudinal studies, and failing to address these appropriately can lead to results that are both biased (they differ from the results that would be observed if the missing values could have been included) and inefficient (there is more uncertainty about the results than there would be if the missing values could have been included). New statistical methods that do address these issues have been proposed, and have the potential to decrease bias and increase efficiency in analyses of longitudinal studies. However, these methods can be highly complex and difficult to apply, and their incorrect use may actually increase bias in certain circumstances. We will develop solutions to the remaining problems with applying one of these methods (multiple imputation), including developing strategies for deciding whether missing values are likely to cause bias in analyses, and checks for whether the multiple imputation models are appropriate. We will also develop new methods which still work even when aspects of the chosen statistical models are incorrect. We will incorporate our new methods into existing software, to maximise their future use, as well as publishing the results in scientific journals.

Technical Summary

Missing data are a problem common to almost every clinical and epidemiological study, especially when large cohorts are followed over long periods of time. Traditionally, missing data have been dealt with by complete-case analysis, that is, by including in the analysis only those participants with complete data. Medical and epidemiological researchers are increasingly aware that such analyses fail to allow appropriately for missing data, and can lead to both bias and inefficiency. Practical tools for analysing datasets with missing data are now available, and those based on multiple imputation (MI) are increasingly recommended. However, their typical use is too uncritical, does not make assumptions explicit, and may replace the potential bias associated with complete-case analyses with different biases arising from inappropriate assumptions or mis-specification of imputation models. The proposed research will focus on tackling the practical barriers to the most effective use of MI, by developing preliminary analyses and diagnostic tools. We will adapt existing software to improve model diagnostics that may alert the user to problems in imputation procedures. We will use simulated and real data to investigate the size and direction of bias caused by ignoring the structure of the data (e.g. longitudinal, clustered) in the imputation model, and develop and compare different ways in which the structure(s) can be incorporated in the imputation model. We will also develop more robust methods for handling missing data. These new methods will include both doubly robust weighted analyses and a second generation of MI based on the doubly robust principle. We will develop methodological approaches to sensitivity analyses in situations where data are missing not at random (MNAR), including parameters to be used in doubly robust weighted analyses. In particular, we will focus on the situation where it is suspected that different MNAR mechanisms operate in different parts of a complex dataset.
These methods will be applied to simulated data and to longitudinal studies, using data from the ALSPAC birth cohort study, the National Child Development Study (NCDS) and the Millennium Cohort Study (MCS).
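
The contrast between complete-case analysis and MI described in this summary can be illustrated with a small simulation. The sketch below is not the project's method or software; it is a minimal illustration under stated assumptions: a single covariate x, an outcome y that is missing with a probability depending only on x, and a correctly specified linear imputation model. All function names (simulate, impute_once, pool_rubin, compare) and all parameter values are hypothetical. Estimates from the imputed datasets are combined with Rubin's rules.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Simulate x and y, with y missing depending on x only (assumed setup)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)        # true mean of y is 1.0
    p_observed = 1.0 / (1.0 + np.exp(-2.0 * x))   # larger x: y more often seen
    observed = rng.random(n) < p_observed
    return x, y, observed


def impute_once(x, y, observed, rng):
    """Create one completed copy of y using a Bayesian linear regression on x."""
    design = np.column_stack([np.ones(observed.sum()), x[observed]])
    beta_hat, *_ = np.linalg.lstsq(design, y[observed], rcond=None)
    resid = y[observed] - design @ beta_hat
    df = design.shape[0] - design.shape[1]
    # Draw the residual variance and coefficients so that the imputations
    # reflect uncertainty in the fitted imputation model.
    sigma2 = resid @ resid / rng.chisquare(df)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    beta = rng.multivariate_normal(beta_hat, cov)
    missing = ~observed
    y_completed = y.copy()
    noise = rng.normal(scale=np.sqrt(sigma2), size=missing.sum())
    y_completed[missing] = beta[0] + beta[1] * x[missing] + noise
    return y_completed


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    within = variances.mean()             # average within-imputation variance
    between = estimates.var(ddof=1)       # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return estimates.mean(), total


def compare(m=20, seed=0):
    """Return the complete-case mean of y and the pooled MI mean and variance."""
    x, y, observed = simulate(seed=seed)
    rng = np.random.default_rng(seed + 1)
    complete_case_mean = y[observed].mean()
    estimates, variances = [], []
    for _ in range(m):
        y_imp = impute_once(x, y, observed, rng)
        estimates.append(y_imp.mean())
        variances.append(y_imp.var(ddof=1) / len(y_imp))
    mi_mean, mi_var = pool_rubin(estimates, variances)
    return complete_case_mean, mi_mean, mi_var


if __name__ == "__main__":
    cc_mean, mi_mean, mi_var = compare()
    print("complete-case mean:", cc_mean)
    print("MI pooled mean:", mi_mean, "variance:", mi_var)
```

In this assumed setup the complete-case mean is pulled well away from the true value of 1.0, because participants with larger x are over-represented among those with observed y, while the pooled MI estimate stays close to it. That result depends on the imputation model being correct and on missingness depending only on the observed x; if either assumption fails, MI can itself be biased, which is the concern this summary raises.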

