HOD2: Data driven semi-automated approaches to comparative effectiveness research using electronic health record data

Lead Research Organisation: London School of Hygiene & Tropical Medicine

Department Name: Epidemiology and Population Health

Abstract

Electronic storage and linkage of routinely-collected health data has opened up substantial opportunities to assess the effectiveness of medications, potentially paving the way for more personalised treatment advice to be given to patients. The expectation that electronic health records and related data will be used to measure medication effects is now written into EU legislation. However, prominent examples in which analyses of these types of data have provided misleading results have led to questions about whether these data should be used to address such questions.

More recently, data-driven approaches which attempt to harness the wealth of data available in electronic health records, to recapture information which is overlooked in traditional analyses, have been used to re-analyse studies in which misleading results were obtained. In these early examples, data-driven approaches have been able to retrieve valid conclusions regarding medication effects. These methods, therefore, offer great potential to overcome the limitations of previous methods, and pave the way for a semi-automated process of obtaining estimated medication effects from routinely-collected data.

Our proposal focuses on data-driven methods based on the propensity score. A propensity score analysis is a statistical approach that accounts for differences in characteristics between patients who are prescribed a medication and those who are not, allowing a fair comparison between the two groups to determine the effects of the medication. By modelling the process of medication prescription, propensity score methods attempt to identify patients prescribed the medication and patients not prescribed it who are otherwise comparable, and measure the effects of the medication by comparing health outcomes between these patients. Typically, investigators performing the analysis will select which patient characteristics are relevant to the process of medication prescription. However, much of the information that investigators would like to include is not directly available in routinely-collected data; instead, there is a large amount of information about the patients' previous medical history that might collectively be relevant to the medication prescription. The task of manually selecting relevant information from the thousands of measurements available is not an easy one. Thus, data-driven approaches which select relevant information to include in the analysis are necessary.
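The idea of modelling prescription and then comparing otherwise similar patients can be illustrated with a minimal sketch. It is not the project's method: the toy records, the single binary characteristic (`frail`) and the function names are all hypothetical, the propensity score is simply the proportion treated at each level of that characteristic, and inverse probability weighting is used as one common way of applying the score.

```python
from collections import defaultdict

# Hypothetical toy records: (frail, treated, outcome).
# Outcome = 1 * treated + 2 * frail, so the true treatment effect is 1,
# and frail patients are both more likely to be treated and have higher outcomes.
records = (
    [(0, 1, 1.0)] * 2 + [(0, 0, 0.0)] * 6 +
    [(1, 1, 3.0)] * 6 + [(1, 0, 2.0)] * 2
)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_effect(records):
    """Difference in mean outcome, ignoring patient characteristics."""
    treated = [y for _, t, y in records if t]
    untreated = [y for _, t, y in records if not t]
    return mean(treated) - mean(untreated)


def estimate_propensity(records):
    """Propensity score: proportion treated at each level of the characteristic."""
    counts = defaultdict(lambda: [0, 0])  # level -> [n_treated, n_total]
    for x, t, _ in records:
        counts[x][0] += t
        counts[x][1] += 1
    return {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}


def ipw_effect(records):
    """Inverse-probability-weighted difference in mean outcome.

    Assumes every level contains both treated and untreated patients;
    otherwise a weight would require division by zero.
    """
    ps = estimate_propensity(records)
    num = {1: 0.0, 0: 0.0}
    den = {1: 0.0, 0: 0.0}
    for x, t, y in records:
        w = 1 / ps[x] if t else 1 / (1 - ps[x])
        num[t] += w * y
        den[t] += w
    return num[1] / den[1] - num[0] / den[0]


print(round(naive_effect(records), 3))  # 2.0: confounded by frailty
print(round(ipw_effect(records), 3))    # 1.0: the effect built into the toy data
```

In real analyses the propensity score would be estimated from many characteristics with a regression model; the single-characteristic version is used here only so the arithmetic can be checked by hand.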

Our proposal aims to explore the use of these data-driven approaches, increasing the transparency of what might appear to be a "black box" approach while automating the process as far as possible. As part of this, we will develop a suite of visual plots that can be automatically generated in the analysis, which will provide investigators with an understanding of the key inner workings of the analysis. We will also identify the optimal data-driven methods to deploy in these analyses.

Unlike traditional research data, routinely-collected data is not collected according to a regular schedule, so information that investigators would like to include in an analysis can often be missing from the patient's record. We will therefore explore how best to handle missing data within the data-driven analytic approaches discussed above.

The final element of our project will be to disseminate the results of our work through a broad range of channels. Our work will be relevant to a broad spectrum of quantitative researchers in medical and social science, in academic, pharmaceutical, regulatory and policy settings.

Technical Summary

Propensity score methods are often used to assess treatment effects in electronic health record (EHR) data, but enthusiasm has waned amid prominent examples providing misleading results. Traditional propensity score analysis relies on investigators selecting confounding variables to be included in the model. EHRs contain a wealth of data, in the form of codes (prescriptions, diagnoses, etc.); these data may collectively contain important confounding information, which could help remove bias apparently due to unmeasured confounding. How best to incorporate information from thousands of codes remains unclear.

Data-driven approaches such as the high dimensional propensity score (HDPS) offer a semi-automated algorithm for selecting relevant codes to include in the propensity score model. The HDPS relies on a relatively crude algorithm, based on marginal associations of codes with the exposure and outcome. Other, more sophisticated, approaches are likely to provide superior inference. A key aspect of our proposed work seeks to identify the optimal way of selecting which codes to include in the analysis, exploring various data-driven approaches that have been proposed in this context. This work will also explore optimal methods for handling missing data.
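To make "marginal associations of codes with the exposure and outcome" concrete, the following is a simplified sketch of ranking codes by a Bross-type bias multiplier, a formula commonly described for HDPS-style prioritisation. It is an illustration under stated assumptions rather than the project's code or the `hdps` Stata command: the patient records, code names and function names are hypothetical, and codes whose relative risk cannot be computed are simply treated as uninformative.

```python
import math

# Hypothetical toy patients: exposure status, outcome, and recorded codes.
patients = [
    {"exposed": 1, "outcome": 1, "codes": {"code_A", "code_C"}},
    {"exposed": 1, "outcome": 1, "codes": {"code_A", "code_B", "code_C"}},
    {"exposed": 1, "outcome": 0, "codes": {"code_A", "code_C"}},
    {"exposed": 1, "outcome": 0, "codes": {"code_B"}},
    {"exposed": 0, "outcome": 1, "codes": {"code_A"}},
    {"exposed": 0, "outcome": 0, "codes": {"code_B"}},
    {"exposed": 0, "outcome": 1, "codes": set()},
    {"exposed": 0, "outcome": 0, "codes": {"code_B"}},
]


def proportion(flags):
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def bias_multiplier(patients, code):
    """Bross-type multiplier: (p1 * (rr - 1) + 1) / (p0 * (rr - 1) + 1).

    p1, p0: prevalence of the code among exposed and unexposed patients.
    rr: marginal relative risk of the outcome for patients with vs without the code.
    Returns 1.0 (no apparent bias) when a quantity is undefined; this fallback
    is a simplification made for this sketch.
    """
    p1 = proportion(code in p["codes"] for p in patients if p["exposed"])
    p0 = proportion(code in p["codes"] for p in patients if not p["exposed"])
    risk_with = proportion(p["outcome"] for p in patients if code in p["codes"])
    risk_without = proportion(p["outcome"] for p in patients if code not in p["codes"])
    if None in (p1, p0, risk_with, risk_without) or risk_with == 0 or risk_without == 0:
        return 1.0
    rr = risk_with / risk_without
    return (p1 * (rr - 1) + 1) / (p0 * (rr - 1) + 1)


def rank_codes(patients, codes):
    """Order codes from most to least apparent confounding, by |log multiplier|."""
    return sorted(
        codes,
        key=lambda c: abs(math.log(bias_multiplier(patients, c))),
        reverse=True,
    )


ranking = rank_codes(patients, ["code_A", "code_B", "code_C"])
print(ranking)  # ['code_A', 'code_C', 'code_B']
```

Here `code_B` is equally common among exposed and unexposed patients, so its multiplier is 1 and it ranks last even though it is associated with the outcome. Each code is scored on its own, which is the sense in which the ranking is marginal.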

Data-driven approaches can be perceived as a "black box", reducing the investigator's ability to understand the data and therefore to fully consider potential biases, leading to a lack of confidence in the results. Therefore, we propose to develop a suite of visual diagnostics to provide intuition about the key drivers of the results in a clinically interpretable way.
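The proposal does not specify which diagnostics will be developed. As one assumed example of the kind of quantity such plots often display, the sketch below computes the standardised difference in the prevalence of a characteristic between treated and untreated patients, before and after inverse probability weighting. The records, the propensity scores (taken as known here) and the function names are all hypothetical.

```python
import math

# Hypothetical toy records: (frail, treated).
records = [(0, 1)] * 2 + [(0, 0)] * 6 + [(1, 1)] * 6 + [(1, 0)] * 2

# Assumed propensity scores: probability of treatment given frailty status.
propensity = {0: 0.25, 1: 0.75}


def weighted_prevalence(records, weights, arm):
    """Weighted proportion of frail patients within one treatment arm."""
    num = sum(w * x for (x, t), w in zip(records, weights) if t == arm)
    den = sum(w for (x, t), w in zip(records, weights) if t == arm)
    return num / den


def standardised_difference(records, weights):
    """Difference in prevalence between arms, scaled by a pooled standard deviation.

    Undefined (division by zero) if the prevalence is 0 or 1 in both arms.
    """
    p1 = weighted_prevalence(records, weights, 1)
    p0 = weighted_prevalence(records, weights, 0)
    pooled_sd = math.sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
    return (p1 - p0) / pooled_sd


unweighted = [1.0] * len(records)
ipw = [1 / propensity[x] if t else 1 / (1 - propensity[x]) for x, t in records]

smd_before = standardised_difference(records, unweighted)
smd_after = standardised_difference(records, ipw)
print(round(smd_before, 3), round(abs(smd_after), 3))  # 1.155 0.0
```

Plotting such values for every selected code, before and after weighting, is one way an investigator could see which characteristics the analysis has balanced.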

Through the dissemination strategy outlined elsewhere, our research will enable practitioners to obtain robust estimates of comparative effectiveness, in a timely and cost-effective manner, while also having the tools to fully understand the results of these analyses.

Planned Impact

Our proposed research tackles outstanding issues hindering the use of electronic health record (EHR) data for comparative effectiveness research. Although our research will focus on this setting, the results will also apply more broadly to other areas where routinely collected data is being used to draw causal conclusions.

Our results will be of interest to, and stand to benefit, a broad range of stakeholders in this area, including pharmaceutical companies; academic researchers in health and social science; data holders (e.g. Clinical Practice Research Datalink) providing electronic health record data to researchers interested in questions of comparative effectiveness; and policy makers and bodies responsible for pharmacovigilance, such as the Medicines and Healthcare Products Regulatory Agency (MHRA).

Ultimately, through enabling robust, timely and cost-efficient ways of addressing questions of comparative effectiveness, we expect the research to benefit clinicians and their patients.

Funded Value:

£465,792

Funded Period:

Aug 19 - Feb 24

Funder:

MRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

MR/S01442X/1

Principal Investigator:

Elizabeth Williamson

Health Category:

Unclassified

Organisations

People	ORCID iD
Elizabeth Williamson (Principal Investigator)
Clemence Leyrat (Co-Investigator)
Ian Douglas (Co-Investigator)
Stijn Vansteelandt (Co-Investigator)
Liam Smeeth (Co-Investigator)
James Carpenter (Co-Investigator)
Karla Diaz-Ordaz (Co-Investigator)	http://orcid.org/0000-0003-3155-1561

Publications

Tackney MS (2021) A framework for handling missing accelerometer outcome data in trials. in Trials

Horne E (2023) Challenges in estimating waning effectiveness of two doses of BNT162b2 and ChAdOx1 COVID-19 vaccines beyond six months: an OpenSAFELY cohort study using linked electronic health records

Leyrat C (2021) Common Methods for Handling Missing Data in Marginal Structural Models: What Works and Why. in American journal of epidemiology

Hulme W (2022) Comparative effectiveness of BNT162b2 versus mRNA-1273 boosting in England: a cohort study in OpenSAFELY-TPP

Bell L (2020) Engagement With a Behavior Change App for Alcohol Reduction: Data Visualization for Longitudinal Observational Study. in Journal of medical Internet research

Chatton A (2022) G-computation and doubly robust standardisation for continuous-time data: A comparison with inverse probability weighting. in Statistical methods in medical research

Tazare J (2020) Implementing high-dimensional propensity score principles to improve confounder adjustment in UK electronic health records. in Pharmacoepidemiology and Drug Safety

Smith MJ (2022) Introduction to computational causal inference using reproducible Stata, R, and Python code: A tutorial. in Statistics in medicine

Tackney MS (2023) Multiple imputation approaches for epoch-level accelerometer data in trials. in Statistical methods in medical research

Morris TP (2022) Planning a method for covariate adjustment in individually randomised trials: a practical guide. in Trials

Research Tools and Methods
Collaboration


Title	HDPS code
Description	We have developed code to apply the HDPS in Stata, described in the following publication (in press): John Tazare, Liam Smeeth, Stephen JW Evans, Ian J Douglas, Elizabeth J Williamson. hdps: a suite for applying high-dimensional propensity score approaches (Accepted, Stata Journal).
Type Of Material	Improvements to research infrastructure
Year Produced	2023
Provided To Others?	Yes
Impact	The code package is in press and not yet published, so it has not yet been used outside our group.


Title	Stata command
Description	We have developed a Stata command to apply the high dimensional propensity score. Previously, use of this approach was restricted to SAS and R. John Tazare & Ian Douglas & Elizabeth Williamson, 2019. "hdps: Implementation of high-dimensional propensity score approaches in Stata," London Stata Conference 2019 05, Stata Users Group.
Type Of Material	Improvements to research infrastructure
Year Produced	2021
Provided To Others?	Yes
Impact	This code greatly widens access to the HDPS method, which is an important tool to remove confounding in observational data.
URL	https://ideas.repec.org/p/boc/usug19/05.html


Description	GSK
Organisation	GlaxoSmithKline (GSK)
Department	GlaxoSmithKline, Stevenage
Country	United Kingdom
Sector	Private
PI Contribution	Our previous collaboration with GSK has been sustained through this grant, via the idea of applying the HDPS to a new design in which GSK are interested: the prevalent new user design. As a preliminary step to investigating the HDPS in this context, we have reviewed current implementation practice in a collaborative paper currently under review: John Tazare, Daniel C Gibbons, Marleen Bokern, Elizabeth J Williamson, Iain A. Gillespie, Marianne Cunnington, John Logie, Ian J Douglas. Prevalent new user designs: a literature review of current implementation practice (Under Review).
Collaborator Contribution	This work was undertaken collaboratively by the Research Fellow employed on this grant and a GSK staff member
Impact	Publications: one under review, one to be submitted shortly. PhD studentship: we are currently advertising for a GSK-funded PhD student to work in an area arising from this work.
Start Year	2019