HMD: Missing data in propensity score analyses of Electronic Health Records Data
Lead Research Organisation:
London School of Hygiene & Tropical Medicine
Department Name: Epidemiology and Population Health
Abstract
Electronic storage and linkage of routinely-collected health data has opened up substantial opportunities to address important questions, not least those relating to the possible harms and benefits of long term medication use. Such information is important to patients and health care professionals alike. Indeed, the expectation that EHR and related data will be used to measure medication effects is now written into EU legislation. Thus we expect the use of electronic health records for research will increase dramatically.
Using data taken from electronic health records to investigate medication effects raises substantial challenges. In particular, patients who are prescribed a particular medication will tend to be very different from those who are not. Disentangling these patient differences from effects of the medication is a key aim of observational epidemiology. This process of disentangling, which is challenging even when information concerning patient characteristics (such as their cholesterol level or age) is available, is greatly complicated when some information is unavailable.
A propensity score analysis is a statistical approach that accounts for differing characteristics between patients prescribed a medication and those who are not, in order to measure effects of the medication. By modelling the process of medication prescription, propensity score methods attempt to identify patients prescribed the medication and patients not prescribed it who are otherwise comparable, and measure effects of the medication by comparing health outcomes between these groups. Methods for accounting for missing patient information within a propensity score analysis, however, remain poorly understood.
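To make the idea concrete, the following is a minimal sketch of one common propensity score estimator, inverse probability weighting, on simulated and fully observed data. It is an illustration only, not the project's own method or software: the function names, the simulated data and the true effect of 1.0 are all assumptions chosen for the example.

```python
import numpy as np


def fit_logistic(X, t, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    X must already contain an intercept column; t is a 0/1 treatment vector.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (t - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def ipw_effect(x, t, y):
    """Weighted difference in mean outcome, using normalised inverse probability weights."""
    X = np.column_stack([np.ones(len(x)), x])
    ps = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, t)))  # estimated propensity score
    w1 = t / ps
    w0 = (1 - t) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)


# Hypothetical simulated data: x confounds both prescription (t) and outcome (y).
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 1.0 * t + 2.0 * x + rng.normal(size=n)  # assumed true treatment effect is 1.0

naive = y[t == 1].mean() - y[t == 0].mean()  # ignores confounding by x
ipw = ipw_effect(x, t, y)
print(f"naive difference: {naive:.2f}, IPW estimate: {ipw:.2f}")
```

In this simulation the naive comparison is distorted by the confounder, while the weighted comparison should land close to the assumed effect. The sketch requires every value of x to be observed, which is exactly the condition that fails in practice.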
Failure to adequately handle missing information in an investigation of medication effects could lead to incorrect conclusions regarding the benefits, or harms, of the medication. In order to avoid this, it is vital to develop appropriate ways of dealing with missing information within propensity score analyses. There is an established literature concerning how to handle missing information within other types of analyses, particularly those focused around modelling the outcome as a function of the patient characteristics. However, the way in which the patient characteristics are used in this outcome modelling approach and propensity score analyses differs in practically important ways. Thus the way in which missing data should be handled cannot be directly learnt from our experiences within the outcome regression modelling context.
Our proposal aims to develop guidelines for researchers undertaking these analyses to help them select an appropriate method for handling their missing data, and to understand the assumptions under which their conclusions regarding the effects of the medication are valid. As part of this, we will take sophisticated statistical methods for handling missing data that have proved themselves outside the propensity score setting, such as multiple imputation, and develop and apply them in a way that is consistent with the goals and structure of propensity score analyses.
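As background on multiple imputation, one widely described way of combining results across imputed datasets is Rubin's rules: the analysis is run in each imputed dataset and the estimates and variances are then pooled. The sketch below shows only that pooling step. The function name and the example numbers are hypothetical, and whether this is the appropriate way to combine imputation with a propensity score analysis is among the questions the project sets out to examine.

```python
import numpy as np


def rubins_rules(estimates, variances):
    """Pool per-imputation estimates and their variances using Rubin's rules.

    Returns the pooled point estimate and its total variance.
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = len(est)
    pooled = est.mean()
    within = var.mean()          # average within-imputation variance
    between = est.var(ddof=1)    # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical treatment effect estimates from three imputed datasets.
pooled, total = rubins_rules([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```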
Because medication use of a particular patient will often change over time, as will many of the patient's characteristics, it is often desirable to take this into account in the statistical analysis. This can be done through the application of an extension of the propensity score approach, called marginal structural models. A final aspect of our proposal, therefore, seeks to understand how to extend our proposed missing data methods to this setting.
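For readers unfamiliar with marginal structural models, they are commonly fitted using stabilised inverse probability of treatment weights. A standard textbook form of the weight is given below. The notation is an assumption introduced here and does not appear in the proposal: $A_k$ is treatment at time $k$, $L_k$ the time-varying patient characteristics at time $k$, and an overbar denotes history up to that time.

```latex
sw_i(t) = \prod_{k=0}^{t}
  \frac{\Pr\left(A_k = a_{ik} \mid \bar{A}_{k-1} = \bar{a}_{i,k-1}\right)}
       {\Pr\left(A_k = a_{ik} \mid \bar{A}_{k-1} = \bar{a}_{i,k-1},\ \bar{L}_k = \bar{l}_{ik}\right)}
```

The denominator depends on the full history of patient characteristics, so a missing value at any time point affects the weights at every later time point.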
Through our broad-based dissemination strategy (described elsewhere), our work will be relevant to a wide range of quantitative researchers in medical and social science, in academic, pharmaceutical, regulatory and policy settings.
Technical Summary
Propensity score methods are often used to assess treatment effects in observational data, particularly where a large number of confounding variables need to be accounted for. However, the confounding variables often have a non-trivial proportion of missing values. Methods for handling missing data within propensity score analyses are relatively under-investigated; while missing data methodology is now well established for a range of standard substantive scientific models, this does not necessarily directly translate to the propensity score context, due to the differing ways in which the confounders are used.
Fundamental to analyses of partially observed data is the accessible framing of the additional assumptions entailed, and statistical methods for valid inference under these assumptions. This project brings this approach to this setting through (i) evaluation of missing value indicator methods; (ii) development of multiple imputation strategies consistent with propensity score methodologies; (iii) using these to lift the restrictions on our doubly robust estimators (robust to misspecification of confounding or substantive models) via doubly robust multiple imputation, and (iv) applying the methodologies to the practically important area of time varying confounding.
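As an illustration of item (i), the missing value indicator method replaces each missing confounder value with a fixed constant and adds a 0/1 indicator of missingness as an extra covariate in the propensity score model. The sketch below shows only the construction of those two columns. The function name and the fill value of 0 are assumptions for the example, and the conditions under which the approach is valid are among the things the project evaluates.

```python
import numpy as np


def add_missing_indicator(x, fill_value=0.0):
    """Return a two-column array: the covariate with missing values filled in,
    and a 0/1 indicator marking which values were missing."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    filled = np.where(missing, fill_value, x)
    return np.column_stack([filled, missing.astype(float)])


# Hypothetical cholesterol-like covariate with one unrecorded value.
design = add_missing_indicator([1.0, np.nan, 3.0])
print(design)
```

Both columns would then enter the propensity score model in place of the original covariate.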
Through the dissemination strategy outlined elsewhere (including development of software) our research will enable practitioners to:
(1) Frame appropriate assumptions regarding missing data in the context of their data and research questions;
(2) Understand the impact of these on the validity of complete records and missing indicator type analyses;
(3) Choose appropriate multiple imputation models consistent with the substantive propensity score model;
(4) Apply methods with a degree of robustness to misspecification of key components of the propensity score analysis;
(5) Understand, and clearly report, the assumptions, strengths and limitations of the analyses performed.
Planned Impact
Our proposed research tackles outstanding missing data issues hindering the widespread use of propensity scores for robust inference from electronic health record data. Although motivated by assessing long-term medication effects, our findings will be applicable across the range of medical and social sciences, where routinely collected data is increasingly being used to understand effects of treatment and/or policy interventions.
It is therefore of interest to, and stands to benefit, a broad range of stakeholders in this area, including pharmaceutical companies, academic researchers in health and social science, policy makers and bodies which are responsible for pharmacovigilance, such as the Medicines and Healthcare Products Regulatory Agency (MHRA).
Through the analyses conducted by researchers working for these bodies, drawing on the insights, methods and software arising from this project, we expect the research to in turn benefit clinicians and their patients.
Organisations
Publications
Ali M
(2019)
Propensity Score Methods in Health Technology Assessment: Principles, Extended Applications, and Recent Advances
in Frontiers in Pharmacology
Blake HA
(2020)
Propensity scores using missingness pattern information: a practical guide.
in Statistics in medicine
Blake HA
(2020)
Estimating treatment effects with partially observed covariates using outcome regression with missing indicators.
in Biometrical journal. Biometrische Zeitschrift
Chatton A
(2022)
G-computation and doubly robust standardisation for continuous-time data: A comparison with inverse probability weighting.
in Statistical methods in medical research
Crellin E
(2018)
Trimethoprim use for urinary tract infection and risk of adverse outcomes in older patients: cohort study.
in BMJ (Clinical research ed.)
Elze MC
(2019)
Evaluation in four cardiovascular studies.
in JACC
Honeyford K
(2020)
Evaluating a digital sepsis alert in a London multisite hospital network: a natural experiment using electronic health record data.
in Journal of the American Medical Informatics Association : JAMIA
Leyrat C
(2021)
Common Methods for Handling Missing Data in Marginal Structural Models: What Works and Why.
in American journal of epidemiology
Leyrat C
(2019)
Propensity score analysis with partially observed covariates: How should multiple imputation be used?
in Statistical methods in medical research
Description | Australian propensity score work |
Organisation | University of Melbourne |
Department | Centre for Epidemiology & Biostatistics |
Country | Australia |
Sector | Academic/University |
PI Contribution | We are advising our Australian collaborators on how to deal with the missing data issues in their data, which they are analysing using propensity score methods. |
Collaborator Contribution | Our collaborators have a very interesting dataset, posing particular methodological challenges. This is helping to guide our methodological thinking about how to handle missing data in this context. |
Impact | We are still analysing the data. |
Start Year | 2015 |
Description | Brigham and Women's Hospital |
Organisation | Harvard University |
Department | Harvard T.H. Chan School of Public Health |
Country | United States |
Sector | Academic/University |
PI Contribution | Sebastian Schneeweiss's group, Brigham and Women's Hospital, Division of Pharmacoepidemiology and Pharmacoeconomics at the Harvard School of Public Health, developed the High Dimensional Propensity Score. Growing out of an initial MRC MRP project grant, further developed in a second, we have established a collaboration with them to further explore the HDPS in UK electronic health record data. |
Collaborator Contribution | A PhD student and early career fellow both visited our collaborators in Boston. We have regular teleconferences and email exchanges regarding our collaborative projects. |
Impact | Publications in process |
Start Year | 2018 |
Description | Farr Institute |
Organisation | Farr Institute of Health Informatics Research |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Members of our team have advised a number of researchers at the Farr Institute about handling missing data within propensity score analyses. |
Collaborator Contribution | Our collaborators at the Farr have a range of interesting real-life examples, which are throwing up unexpected methodological challenges and enabling us to broaden our focus in our methodological work. |
Impact | Analyses underway |
Start Year | 2015 |
Description | GSK |
Organisation | GlaxoSmithKline (GSK) |
Country | Global |
Sector | Private |
PI Contribution | We are working with researchers at GSK to investigate the potential of a relatively novel design (the prevalent new user design), for analyses of UK electronic health record data. |
Collaborator Contribution | They have hosted a doctoral student as an intern for a number of weeks. |
Impact | NA |
Start Year | 2017 |
Description | Missing data case studies |
Organisation | Maastricht University (UM) |
Country | Netherlands |
Sector | Academic/University |
PI Contribution | Our collaborators in Maastricht University were applying propensity score methods in electronic health data and wished to understand how robust their results were to the missing data methods they were using. They have provided the case study, and analysis, for an investigation of various missing data methods. |
Collaborator Contribution | We have provided the methodological underpinnings of the case study and advice on the analyses. |
Impact | Abstract submitted to the 2016 Conference for the International Society for Pharmacoepidemiology. |
Start Year | 2015 |
Title | R package MatchThem |
Description | This R package aims to facilitate the use of multiple imputation in propensity score matched analyses. |
Type Of Technology | Webtool/Application |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | Not known yet. |
URL | https://cran.r-project.org/web/packages/MatchThem/index.html |
Description | PPI forum |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | We ran a public debate and discussion about missing data: how it arises, how it affects medical research and how we deal with it in analyses. It was called "Listening to the silence: What does unrecorded information in the electronic health record tell us?" There was a lively debate about how missing data arises, and how researchers should go about investigating and thinking about missingness mechanisms. We have written a report about the utility of patient and public involvement in such methodological questions and submitted an abstract to the 2016 International Population Data Linkage Conference. |
Year(s) Of Engagement Activity | 2016 |