HMD: Missing data in propensity score analyses of Electronic Health Records Data

Lead Research Organisation: London Sch of Hygiene and Trop Medicine
Department Name: Epidemiology and Population Health


Electronic storage and linkage of routinely-collected health data have opened up substantial opportunities to address important questions, not least those relating to the possible harms and benefits of long-term medication use. Such information is important to patients and health care professionals alike. Indeed, the expectation that electronic health record (EHR) and related data will be used to measure medication effects is now written into EU legislation. Thus we expect the use of electronic health records for research to increase dramatically.

Using data taken from electronic health records to investigate medication effects raises substantial challenges. In particular, patients who are prescribed a particular medication will tend to be very different from those who are not. Disentangling these patient differences from effects of the medication is a key aim of observational epidemiology. This process of disentangling, which is challenging even when information concerning patient characteristics (such as their cholesterol level or age) is available, is greatly complicated when some information is unavailable.

A propensity score analysis is a statistical approach that is very useful for accounting for differences in characteristics between patients who are prescribed a medication and those who are not, in order to measure effects of the medication. By modelling the process of medication prescription, propensity score methods attempt to identify patients prescribed the medication and patients not prescribed it who are otherwise comparable, and measure effects of the medication by comparing health outcomes between these groups. Methods for accounting for missing patient information within a propensity score analysis, however, remain poorly understood.
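To make the idea concrete, the following is a minimal sketch of one common propensity score variant, inverse probability weighting, on a hypothetical toy cohort with a single binary patient characteristic. It is an illustration only and not the project's method: the data, the function names and the choice of weighting rather than matching are all assumptions made for this example, and the sketch assumes every level of the characteristic contains both treated and untreated patients.

```python
from collections import defaultdict


def make_cohort():
    """Hypothetical toy cohort of (characteristic x, treated 0/1, outcome y).

    Outcomes are built as y = 1 + 2*x + 1*treated, so the true effect is 1.0.
    Patients with x = 1 are far more likely to be treated.
    """
    cohort = []
    cohort += [(0, 1, 2.0)] * 2 + [(0, 0, 1.0)] * 6  # x = 0: 2 of 8 treated
    cohort += [(1, 1, 4.0)] * 6 + [(1, 0, 3.0)] * 2  # x = 1: 6 of 8 treated
    return cohort


def estimate_propensity(records):
    """Estimate P(treated | x) as the treated proportion within each level of x."""
    counts = defaultdict(lambda: [0, 0])  # level -> [number treated, number total]
    for x, treated, _ in records:
        counts[x][0] += treated
        counts[x][1] += 1
    return {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}


def naive_effect(records):
    """Unadjusted difference in mean outcome, treated minus untreated."""
    treated = [y for _, t, y in records if t == 1]
    untreated = [y for _, t, y in records if t == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


def ipw_effect(records):
    """Difference in weighted mean outcomes, weighting by inverse propensity."""
    ps = estimate_propensity(records)
    sums = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # arm -> [weighted outcome sum, weight sum]
    for x, treated, y in records:
        weight = 1 / ps[x] if treated else 1 / (1 - ps[x])
        sums[treated][0] += weight * y
        sums[treated][1] += weight
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]


cohort = make_cohort()
print("naive difference:", naive_effect(cohort))
print("weighted difference:", ipw_effect(cohort))
```

In this toy cohort the unadjusted comparison gives a difference of 2.0, twice the built-in effect, because treated patients are concentrated in the x = 1 group; weighting by the inverse of the estimated propensity rebalances the two groups and gives a difference of approximately 1.0. If x were unrecorded for some patients, the propensity could not be computed for them, which is the problem this proposal addresses.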

Failure to adequately handle missing information in an investigation of medication effects could lead to incorrect conclusions regarding the benefits, or harms, of the medication. To avoid this, it is vital to develop appropriate ways of dealing with missing information within propensity score analyses. There is an established literature on how to handle missing information within other types of analyses, particularly those that model the outcome as a function of the patient characteristics. However, the ways in which patient characteristics are used in outcome modelling and in propensity score analyses differ in practically important respects. Thus the way in which missing data should be handled cannot be learnt directly from our experience within the outcome regression modelling context.

Our proposal aims to develop guidelines for researchers undertaking these analyses to help them select an appropriate method for handling their missing data, and to understand the assumptions under which their conclusions regarding the effects of the medication are valid. As part of this, we will take sophisticated statistical methods for handling missing data that have proved themselves outside the propensity score setting, such as multiple imputation, and develop and apply them in a way that is consistent with the goals and structure of propensity score analyses.

Because medication use of a particular patient will often change over time, as will many of the patient's characteristics, it is often desirable to take this into account in the statistical analysis. This can be done through the application of an extension of the propensity score approach, called marginal structural models. A final aspect of our proposal, therefore, seeks to understand how to extend our proposed missing data methods to this setting.

Through our broad-based dissemination strategy (described elsewhere) our work will be relevant to a wide range of quantitative researchers in medical and social science, in academic, pharmaceutical, regulatory and policy settings.

Technical Summary

Propensity score methods are often used to assess treatment effects in observational data, particularly where a large number of confounding variables need to be accounted for. However, the confounding variables often have a non-trivial proportion of missing values. Methods for handling missing data within propensity score analyses are relatively under-investigated; while missing data methodology is now well established for a range of standard substantive scientific models, this does not necessarily directly translate to the propensity score context, due to the differing ways in which the confounders are used.

Fundamental to analyses of partially observed data is the accessible framing of the additional assumptions entailed, and statistical methods for valid inference under these assumptions. This project brings this approach to this setting through (i) evaluation of missing value indicator methods; (ii) development of multiple imputation strategies consistent with propensity score methodologies; (iii) using these to lift the restrictions on our doubly robust estimators (robust to misspecification of confounding or substantive models) via doubly robust multiple imputation, and (iv) applying the methodologies to the practically important area of time varying confounding.

Through the dissemination strategy outlined elsewhere (including development of software) our research will enable practitioners to:
(1) Frame appropriate assumptions regarding missing data in the context of their data and research questions;
(2) Understand the impact of these on the validity of complete records and missing indicator type analyses;
(3) Choose appropriate multiple imputation models consistent with the substantive propensity score model;
(4) Apply methods with a degree of robustness to misspecification of key components of the propensity score analysis;
(5) Understand, and clearly report, the assumptions, strengths and limitations of the analyses performed.
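
As an illustration of the two simplest approaches named above, the sketch below contrasts a complete records analysis with a missing indicator analysis when estimating the propensity score from one partially observed binary confounder. It is a hedged toy example, not the project's software: the data, the `MISSING` label and all function names are hypothetical, and the propensity score is estimated by simple proportions rather than by a regression model.

```python
MISSING = "missing"  # hypothetical label for the extra "not recorded" category


def make_records():
    """Hypothetical records of (confounder x, treated 0/1); x is None when unrecorded."""
    records = []
    records += [(0, 1)] * 1 + [(0, 0)] * 3        # x = 0: 1 of 4 treated
    records += [(1, 1)] * 3 + [(1, 0)] * 1        # x = 1: 3 of 4 treated
    records += [(None, 1)] * 2 + [(None, 0)] * 2  # x unrecorded: 2 of 4 treated
    return records


def complete_records(records):
    """Complete records analysis: drop every patient whose confounder is missing."""
    return [(x, t) for x, t in records if x is not None]


def missing_indicator(records):
    """Missing indicator analysis: keep everyone, treating 'missing' as its own level."""
    return [(MISSING if x is None else x, t) for x, t in records]


def propensity_by_level(records):
    """Estimate P(treated | level) as the treated proportion within each level."""
    totals, treated = {}, {}
    for x, t in records:
        totals[x] = totals.get(x, 0) + 1
        treated[x] = treated.get(x, 0) + t
    return {x: treated[x] / totals[x] for x in totals}


records = make_records()
print("complete records:", propensity_by_level(complete_records(records)))
print("missing indicator:", propensity_by_level(missing_indicator(records)))
```

The complete records analysis discards a third of this toy cohort, while the missing indicator analysis keeps every patient but assigns all those with an unrecorded value a single shared propensity. Whether either yields valid conclusions depends on why the values are missing, which is the kind of assumption items (1) and (2) ask practitioners to frame.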

Planned Impact

Our proposed research tackles outstanding missing data issues hindering the widespread use of propensity scores for robust inference from electronic health record data. Although motivated by assessing long-term medication effects, our findings will be applicable across the range of medical and social sciences, where routinely collected data is increasingly being used to understand effects of treatment and/or policy interventions.

It is therefore of interest to, and stands to benefit, a broad range of stakeholders in this area, including pharmaceutical companies, academic researchers in health and social science, policy makers and bodies which are responsible for pharmacovigilance, such as the Medicines and Healthcare Products Regulatory Agency (MHRA).

Through the analyses conducted by researchers working for these bodies, drawing on the insights, methods and software arising from this project, we expect the research to in turn benefit clinicians and their patients.


Honeyford K (2020) Evaluating a digital sepsis alert in a London multisite hospital network: a natural experiment using electronic health record data. in Journal of the American Medical Informatics Association : JAMIA

Description Australian propensity score work 
Organisation University of Melbourne
Department Centre for Epidemiology & Biostatistics
Country Australia 
Sector Academic/University 
PI Contribution We are advising our Australian collaborators on how to deal with the missing data issues in their data, which they are analysing using propensity score methods.
Collaborator Contribution Our collaborators have a very interesting dataset, posing particular methodological challenges. This is helping to guide our methodological thinking about how to handle missing data in this context.
Impact We are still analysing the data.
Start Year 2015
Description Brigham and Women's Hospital 
Organisation Harvard University
Department Harvard T.H. Chan School of Public Health
Country United States 
Sector Academic/University 
PI Contribution Sebastian Schneeweiss's group (Brigham and Women's Hospital, Division of Pharmacoepidemiology and Pharmacoeconomics, at the Harvard School of Public Health) developed the High Dimensional Propensity Score (HDPS). Our collaboration with them grew out of an initial MRC MRP project grant and was developed further in a second; through it we are exploring the HDPS further in UK electronic health record data.
Collaborator Contribution A PhD student and early career fellow both visited our collaborators in Boston. We have regular teleconferences and email exchanges regarding our collaborative projects.
Impact Publications in process
Start Year 2018
Description Farr Institute 
Organisation Farr Institute of Health Informatics Research
Country United Kingdom 
Sector Academic/University 
PI Contribution Members of our team have advised a number of researchers at the Farr Institute about handling missing data within propensity score analyses.
Collaborator Contribution Our collaborators at the Farr have a range of interesting real-life examples, which are throwing up unexpected methodological challenges and enabling us to broaden our focus in our methodological work.
Impact Analyses underway
Start Year 2015
Description GSK 
Organisation GlaxoSmithKline (GSK)
Country Global 
Sector Private 
PI Contribution We are working with researchers at GSK to investigate the potential of a relatively novel design (the prevalent new user design), for analyses of UK electronic health record data.
Collaborator Contribution They have hosted a doctoral student as an intern for a number of weeks.
Impact NA
Start Year 2017
Description Missing data case studies 
Organisation Maastricht University (UM)
Country Netherlands 
Sector Academic/University 
PI Contribution Our collaborators in Maastricht University were applying propensity score methods in electronic health data and wished to understand how robust their results were to the missing data methods they were using. They have provided the case study, and analysis, for an investigation of various missing data methods.
Collaborator Contribution We have provided the methodological underpinnings of the case study and advice on the analyses.
Impact Abstract submitted to the 2016 Conference for the International Society for Pharmacoepidemiology.
Start Year 2015
Description PPI forum 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact We ran a public debate and discussion about missing data: how it arises, how it affects medical research and how we deal with it in analyses. It was called "Listening to the silence: What does unrecorded information in the electronic health record tell us?"

There was a lively debate about how missing data arises, and how researchers should go about investigating and thinking about missingness mechanisms. We have written a report about the utility of patient and public involvement in such methodological questions and submitted an abstract to the 2016 International Population Data Linkage Conference.
Year(s) Of Engagement Activity 2016