HOD2: Data driven semi-automated approaches to comparative effectiveness research using electronic health record data

Lead Research Organisation: London Sch of Hygiene and Trop Medicine
Department Name: Epidemiology and Population Health


Electronic storage and linkage of routinely-collected health data have opened up substantial opportunities to assess the effectiveness of medications, potentially paving the way for more personalised treatment advice to be given to patients. The expectation that electronic health records and related data will be used to measure medication effects is now written into EU legislation. However, prominent examples in which analyses of these types of data have provided misleading results have led to questions about whether these data should be used to address such questions.

More recently, data-driven approaches which attempt to harness the wealth of data available in electronic health records, to recapture information which is overlooked in traditional analyses, have been used to re-analyse studies in which misleading results were obtained. In these early examples, data-driven approaches have been able to retrieve valid conclusions regarding medication effects. These methods, therefore, offer great potential to overcome the limitations of previous methods, and pave the way for a semi-automated process of obtaining estimated medication effects from routinely-collected data.

Our proposal focuses on data-driven methods based on the propensity score. A propensity score analysis is a statistical approach that accounts for differences in characteristics between patients who are prescribed a medication and those who are not, allowing a fair comparison between the two groups to determine the effects of the medication. By modelling the process of medication prescription, propensity score methods attempt to identify patients prescribed the medication and patients not prescribed it who are otherwise comparable, and measure the effects of the medication by comparing health outcomes between these patients. Typically, investigators performing the analysis select which patient characteristics are relevant to the process of medication prescription. However, much of the information that investigators would like to include is not directly available in routinely-collected data; instead, there is a large amount of information about the patients' previous medical history that might collectively be relevant to the medication prescription. Manually selecting relevant information from the thousands of measurements available is not an easy task. Data-driven approaches that select the relevant information to include in the analysis are therefore necessary.
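As an illustration only, the idea of a propensity score analysis can be sketched on a small, entirely hypothetical cohort. The sketch assumes a single binary patient characteristic `x`, estimates the propensity score as the proportion treated within each level of `x`, and uses inverse probability weighting to compare outcomes. Weighting is only one of several ways of using a propensity score (matching and stratification are others), and the proposal does not state which is intended. The data, the names `COHORT`, `propensity_scores`, `naive_difference` and `ipw_difference`, and the built-in effect of 1.0 are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy cohort of (x, treated, outcome) records.
# Outcomes are constructed as 1.0 * treated + 2.0 * x, so the effect of
# treatment is 1.0 by design, and x influences both prescription and outcome.
COHORT = [
    (0, 1, 1.0), (0, 0, 0.0), (0, 0, 0.0), (0, 0, 0.0),
    (1, 1, 3.0), (1, 1, 3.0), (1, 1, 3.0), (1, 0, 2.0),
]


def propensity_scores(cohort):
    """Estimate P(treated | x) as the proportion treated at each level of x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number treated, number of patients]
    for x, treated, _ in cohort:
        counts[x][0] += treated
        counts[x][1] += 1
    return {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}


def naive_difference(cohort):
    """Unadjusted difference in mean outcome, treated minus untreated."""
    treated = [y for _, t, y in cohort if t == 1]
    untreated = [y for _, t, y in cohort if t == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


def ipw_difference(cohort):
    """Difference in inverse-probability-weighted mean outcomes."""
    ps = propensity_scores(cohort)
    sums = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # arm -> [weighted outcome sum, weight sum]
    for x, t, y in cohort:
        weight = 1 / ps[x] if t == 1 else 1 / (1 - ps[x])
        sums[t][0] += weight * y
        sums[t][1] += weight
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]


print(naive_difference(COHORT))  # 2.0: distorted by the imbalance in x
print(ipw_difference(COHORT))    # close to 1.0, the effect built into the data
```

In this toy cohort the unadjusted comparison overstates the effect because patients with `x = 1` are both more likely to be treated and have higher outcomes; weighting by the propensity score balances `x` across the two groups. Real electronic health record analyses involve many characteristics and a fitted prescription model rather than simple proportions.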

Our proposal explores the use of these data-driven approaches, aiming to increase the transparency of what might appear to be a "black box" approach while automating the process as far as possible. As part of this, we will develop a suite of visual plots that can be generated automatically during the analysis and that give investigators an understanding of its key inner workings. We will also identify the optimal data-driven methods to deploy in these analyses.

Unlike traditional research data, routinely-collected data are not recorded according to a regular schedule, so information that investigators would like to include in an analysis is often missing from the patient's record. We will therefore explore how best to handle missing data within the data-driven analytic approaches discussed above.

The final element of our project will be to disseminate the results of our work through a broad range of channels. Our work will be relevant to a broad spectrum of quantitative researchers in medical and social science, in academic, pharmaceutical, regulatory and policy settings.

Technical Summary

Propensity score methods are often used to assess treatment effects in electronic health record (EHR) data, but enthusiasm has waned amid prominent examples providing misleading results. Traditional propensity score analysis relies on investigators selecting confounding variables to be included in the model. EHRs contain a wealth of data in the form of codes (prescriptions, diagnoses, etc.); these data may collectively contain important confounding information, which could help remove bias apparently due to unmeasured confounding. How best to incorporate information from thousands of codes remains unclear.

Data-driven approaches such as the high-dimensional propensity score (HDPS) offer a semi-automated algorithm for selecting relevant codes to include in the propensity score model. The HDPS relies on a relatively crude algorithm, based on marginal associations of codes with the exposure and outcome. Other, more sophisticated, approaches are likely to provide superior inference. A key aspect of our proposed work is to identify the optimal way of selecting which codes to include in the analysis, exploring various data-driven approaches that have been proposed in this context. This work will also explore optimal methods for handling missing data.
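To make "marginal associations of codes with the exposure and outcome" concrete, the following is a simplified sketch of ranking codes by a Bross-type bias multiplier, which combines how unevenly a code is distributed between exposed and unexposed patients with how strongly the code is associated with the outcome. It is an illustration under stated assumptions, not the published HDPS algorithm, which includes further steps (for example, deriving covariates from how often a code occurs and pre-filtering codes by prevalence) that are omitted here. The records and the names `PATIENTS`, `bross_bias` and `rank_codes` are hypothetical, and the sketch assumes every code is present in some patients and absent in others, with a non-zero outcome risk among patients without the code.

```python
import math

# Hypothetical patient records: the set of codes in each record, exposure, outcome.
PATIENTS = [
    {"codes": {"A"}, "exposed": True, "outcome": True},
    {"codes": {"A"}, "exposed": True, "outcome": True},
    {"codes": {"A", "B"}, "exposed": True, "outcome": False},
    {"codes": {"B"}, "exposed": True, "outcome": False},
    {"codes": {"A"}, "exposed": False, "outcome": True},
    {"codes": {"B"}, "exposed": False, "outcome": False},
    {"codes": set(), "exposed": False, "outcome": True},
    {"codes": {"B"}, "exposed": False, "outcome": False},
]


def bross_bias(patients, code):
    """Bias multiplier for one code, from its marginal associations only."""
    exposed = [p for p in patients if p["exposed"]]
    unexposed = [p for p in patients if not p["exposed"]]
    # Prevalence of the code among exposed and unexposed patients.
    p_c1 = sum(code in p["codes"] for p in exposed) / len(exposed)
    p_c0 = sum(code in p["codes"] for p in unexposed) / len(unexposed)
    # Relative risk of the outcome, comparing patients with and without the code.
    with_code = [p for p in patients if code in p["codes"]]
    without_code = [p for p in patients if code not in p["codes"]]
    risk_with = sum(p["outcome"] for p in with_code) / len(with_code)
    risk_without = sum(p["outcome"] for p in without_code) / len(without_code)
    rr_cd = risk_with / risk_without
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)


def rank_codes(patients, codes):
    """Order codes from largest to smallest potential for confounding bias."""
    return sorted(codes, key=lambda c: abs(math.log(bross_bias(patients, c))), reverse=True)


print(rank_codes(PATIENTS, ["B", "A"]))
```

In this toy data, code `A` is more common among exposed patients and is associated with the outcome, so it ranks first. Code `B` is strongly associated with the outcome but equally common in both exposure groups, so its bias multiplier is 1 and it ranks last. Because each code is scored on its own, a ranking of this kind cannot account for relationships between codes.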

Data-driven approaches can be perceived as a "black box", reducing the investigator's ability to understand the data and therefore to fully consider potential biases, leading to a lack of confidence in the results. Therefore, we propose to develop a suite of visual diagnostics to provide intuition about the key drivers of the results in a clinically interpretable way.

Through the dissemination strategy outlined elsewhere, our research will enable practitioners to obtain robust estimates of comparative effectiveness, in a timely and cost-effective manner, while also having the tools to fully understand the results of these analyses.

Planned Impact

Our proposed research tackles outstanding issues hindering the use of electronic health record (EHR) data for comparative effectiveness research. Although our research will focus on this setting, the results will also apply more broadly to other areas where routinely collected data are used to draw causal conclusions.

Our results will be of interest to, and stand to benefit, a broad range of stakeholders in this area, including pharmaceutical companies; academic researchers in health and social science; data holders (e.g. Clinical Practice Research Datalink) that provide electronic health record data to researchers interested in questions of comparative effectiveness; and policy makers and bodies responsible for pharmacovigilance, such as the Medicines and Healthcare Products Regulatory Agency (MHRA).

Ultimately, through enabling robust, timely and cost-efficient ways of addressing questions of comparative effectiveness, we expect the research to benefit clinicians and their patients.

