Methods for handling missing data and covariate measurement error in individual participant data meta-analysis
Lead Research Organisation: London School of Hygiene & Tropical Medicine
Department Name: Epidemiology and Population Health
Abstract
In recent decades there has been a concerted drive towards ensuring medicine is evidence based, meaning that decisions about patient care and public health are made in light of the current best available evidence. Central to establishing what constitutes the best available evidence with regard to a particular clinical or public health question is the process of evidence synthesis. For clinical questions which can be numerically quantified, the primary tool for synthesizing evidence is meta-analysis, which involves taking the results from previous studies and combining them to give a single summary estimate of the quantity of interest.
The gold standard approach to meta-analysis involves collating the individual participant data (IPD) from all of the previously conducted relevant studies and analysing the resulting combined dataset. Pooling the individual level data confers a number of advantages compared to the traditional meta-analysis approach which involves combining the overall results of studies (as opposed to analysing their original, individual level data). These advantages include the ability to make statistical adjustments for a consistent set of variables, exploration of whether treatment effects vary between different groups of patients, and the ability to investigate the shape of relationships between variables.
However, there are a number of issues which threaten the potential of IPD meta-analysis. Principal among these are missing data and measurement error. Missing data arise in two ways in IPD meta-analyses. The first is when some studies did not collect data on one or more variables of interest, such that the values of these variables are missing for all participants in these studies. The second is when, for a variety of reasons, some participants have missing values even though the study intended to collect the variable. Missing data cause results to be less precise and possibly biased. Measurement error occurs when variables of interest can only be measured imprecisely. If ignored, measurement error also causes bias in results.
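As a standard textbook illustration of the bias from measurement error (it is not taken from the proposal), consider a linear regression of an outcome Y on a single continuous covariate X with true slope \beta_X, where only an error-prone measurement W is observed. Assuming classical additive error that is independent of X and of the regression error, the slope from regressing Y on W is attenuated towards zero by the factor \lambda:

```latex
W = X + U, \qquad \operatorname{Var}(X) = \sigma_X^2, \qquad \operatorname{Var}(U) = \sigma_U^2

\hat{\beta}_W \;\xrightarrow{\;p\;}\; \lambda \, \beta_X,
\qquad
\lambda = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2}
```

For example, if the error variance equals the variance of the true covariate, then \lambda = 0.5 and the estimated slope converges to half the true slope. With several covariates or a non-linear model the bias need not be a simple attenuation.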
The proposed research seeks to develop new statistical methods to deal with these two issues. These methods will enable researchers to obtain more precise and less biased estimates from IPD meta-analyses, thereby giving more accurate answers to important clinical and public health questions. The new methods will be published in scientific journals and implemented in statistical software packages so that researchers can use them. This will help medical practitioners and public health experts to base their decisions and policies on the best available evidence, thus improving health outcomes for patients and the population more generally.
The work will be conducted by the Fellowship applicant.
Technical Summary
The overall aim of the proposed research is to develop and apply methods based on multiple imputation (MI) to tackle the issues of missing data and covariate measurement error in IPD meta-analysis. This will be achieved through pursuing the following objectives:
1) I will critically evaluate existing MI approaches which can be used for imputing systematically missing data. Parameter identifiability problems will be tackled through use of ridge priors. I will explore extending a full conditional specification approach for imputing systematically missing data so that it accommodates missingness in categorical variables.
2) I will develop an MI approach for (sporadically and systematically) missing data which accommodates non-linear covariate effects and interactions in the substantive model, by extending the approach I have recently proposed for the setting of single studies.
3) I will develop an MI approach for covariate measurement error which accommodates non-linear covariate effects and interactions in the substantive model, by extending the approach I have recently proposed for imputing partially observed covariates in the context of single studies.
4) I will develop an MI approach for covariate measurement error which handles studies without repeat measurements. This approach will explicitly model between-study heterogeneity in the measurement error distribution and distribution of underlying covariates.
5) Existing theory on congeniality between imputation and substantive models will be used to establish the order in which meta-analysis and pooling across imputations should be performed when a multi-level imputation model is used.
6) I will investigate the feasibility of doubly-robust estimators for missing data in IPD meta-analysis, and their use for investigating sensitivity to mis-specification of the imputation model.
7) I will actively disseminate the methods developed through their implementation in statistical packages.
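The question raised in objective 5, the order in which meta-analysis and pooling across imputations are performed, can be made concrete with a minimal sketch. It is an illustration under simplifying assumptions and not the method proposed here: it uses a fixed-effect inverse-variance meta-analysis rather than a random-effects or multi-level model, the function names `rubin_pool` and `inverse_variance_pool` are hypothetical, and all numbers are made up.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Combine M imputation-specific estimates using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def inverse_variance_pool(estimates, variances):
    """Fixed-effect inverse-variance meta-analysis of study-specific estimates."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, 1 / sum(weights)


# Hypothetical inputs: est[s][k] is the estimate from study s in imputed
# dataset k, and var[s][k] is its estimated variance.
est = [[0.42, 0.47, 0.40], [0.31, 0.35, 0.29]]
var = [[0.010, 0.011, 0.010], [0.020, 0.019, 0.021]]
n_studies, n_imputations = len(est), len(est[0])

# Order A: apply Rubin's rules within each study, then meta-analyse.
per_study = [rubin_pool(e, v) for e, v in zip(est, var)]
order_a = inverse_variance_pool([q for q, _ in per_study],
                                [t for _, t in per_study])

# Order B: meta-analyse within each imputed dataset, then apply Rubin's rules.
per_imputation = [
    inverse_variance_pool([est[s][k] for s in range(n_studies)],
                          [var[s][k] for s in range(n_studies)])
    for k in range(n_imputations)
]
order_b = rubin_pool([q for q, _ in per_imputation],
                     [v for _, v in per_imputation])

print("Rubin then meta-analysis:", order_a)
print("Meta-analysis then Rubin:", order_b)
```

The two orders generally give different point estimates and variances, because the meta-analysis weights depend on the variances fed into them. Which order is appropriate for a given imputation model is the question the objective sets out to answer.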
Planned Impact
The proposed research will enable more accurate estimates to be obtained in IPD meta-analyses in a number of contexts, and this has the potential to improve clinical and public health outcomes. For example, improved estimation of prognostic models would enable patients at high risk of disease to be better identified, potentially leading to earlier intervention and improved health outcomes. A better understanding of how treatment effects vary between individuals would enable treatments to be better targeted, so that the most appropriate interventions are used for each individual and fewer drugs are prescribed to patients who would derive no benefit. Lastly, the methods will be suitable for analyses which further our understanding of disease aetiology, which in turn should translate into improved health outcomes for patients and healthy populations generally.
Publications
Bartlett JW (2015). Asymptotically Unbiased Estimation of Exposure Odds Ratios in Complete Records Logistic Regression. American Journal of Epidemiology.
Shah AD (2014). Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology.
Bartlett JW (2016). Missing covariates in competing risks analysis. Biostatistics (Oxford, England).
Bartlett JW (2014). Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics (Oxford, England).
Bartlett JW (2014). Systematically missing data in individual participant data meta-analysis: a semiparametric inverse probability weighting approach. International Biometric Conference 2014.
Bartlett JW (2014). Methodology for multiple imputation for missing data in electronic health record data. International Biometric Conference 2014.
Bartlett JW (2015). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Statistical Methods in Medical Research.
Bartlett J (2014). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Statistical Methods in Medical Research.
Hossain A (2017). Missing continuous outcomes under covariate dependent missingness in cluster randomised trials. Statistical Methods in Medical Research.
Bartlett JW (2018). Bayesian correction for covariate measurement error: A frequentist evaluation and comparison with regression calibration. Statistical Methods in Medical Research.
Hossain A (2017). Missing binary outcomes under covariate-dependent missingness in cluster randomised trials. Statistics in Medicine.
Resche-Rigon M (2013). Multiple imputation for handling systematically missing confounders in meta-analysis of individual participant data. Statistics in Medicine.
Beesley LJ (2016). Multiple imputation of missing covariates for the Cox proportional hazards cure model. Statistics in Medicine.
Welch CA (2014). Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data. Statistics in Medicine.
Welch C (2014). Application of multiple imputation using the two-fold fully conditional specification algorithm in longitudinal clinical data. The Stata Journal.
Bartlett J (2015). Multiple Imputation of Covariates by Substantive-model Compatible Fully Conditional Specification. The Stata Journal: Promoting communications on statistics and Stata.
Hossain A (2017). Missing data in cluster randomised trials.
Title | Stata program for predictive value weighting to allow for covariate misclassification |
Description | pvw is a Stata program which implements the predictive value weighting approach for adjustment for misclassification in a binary covariate in a logistic regression model, as proposed by Lyles and Lin (2010). |
Type Of Technology | Software |
Year Produced | 2014 |
Open Source License? | Yes |
Impact | The program is being used in teaching materials for Masters students at the London School of Hygiene & Tropical Medicine, giving an easy to use method for allowing for the effects of misclassification in a binary covariate in logistic regression models. |
URL | https://ideas.repec.org/c/boc/bocode/s457825.html |
Title | smcfcs package for R |
Description | The software is a package for R which implements the Substantive Model Compatible Fully Conditional Specification approach to multiple imputation of covariates. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | The package has been used in teaching courses on methods for handling missing data. |
URL | https://cran.r-project.org/web/packages/smcfcs/index.html |
Title | smcfcs package for Stata |
Description | The software implements the Substantive Model Compatible Fully Conditional Specification (SMC-FCS) multiple imputation for missing covariates in Stata. The software can be installed freely into Stata, and used to impute missing covariates using the SMC-FCS approach. An accompanying publication in the Stata Journal has been published which describes the software package and its use. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | The package is used as part of the LSHTM short course "Statistical Analysis of Missing Data with Multiple Imputation and Inverse Probability Weighting". I believe it is also planned to be used in a similar course on missing data run by the MRC Biostatistics Unit in Cambridge. |
URL | https://ideas.repec.org/c/boc/bocode/s457968.html |