ROBEST: Ensuring robustness of evidence in public health research for increased policy impact: widened use of advanced causal inference techniques

Lead Research Organisation: London School of Hygiene & Tropical Medicine
Department Name: Epidemiology and Population Health

Abstract

Coherent and effective public health policies rest on reliable evidence, which allows researchers to identify, demonstrate, and raise awareness of the need for change, as well as to measure the causal effect of proposed changes. Such evidence can be built upon the rich electronic health records now available in many research fields, including public health, health economics, epidemiology and clinical science. The potential of these data is enormous, as they offer a valuable source of real-world evidence to inform public health policies. Nonetheless, reliable evidence can only be obtained through widespread use of robust statistical methodology among applied researchers with an interest in evaluative research.
The large number of potential confounders and their possibly complex relationships with the outcome make the use of standard regression methods challenging, or even impossible in some instances. Furthermore, the observational nature of such data makes any causal interpretation of findings obtained with conventional analytic approaches hazardous. These caveats call for specific causal inference methodology, aimed at approaching observational data with a randomised trial mindset.
Alongside the growing availability of data, there has been a rapid development of statistical tools designed to further the use of observational data to answer causal questions. One recently developed algorithm, which blends machine learning techniques with causal inference methodology, is targeted maximum likelihood estimation (TMLE). This approach is double-robust, meaning that it remains consistent if either the outcome model or the treatment model is correctly specified, and it has good statistical properties that support valid causal inference.
Nonetheless, the adoption of these innovative methods among applied researchers lags behind the speed of methodological development. We identified three reasons for this misalignment: a gap in the understanding of the new methods, a lack of ready-to-use software, and the scarcity of published studies showcasing the superiority of TMLE. We aim to address these shortcomings in this proposal.
We will provide applied researchers with tutorials designed to demystify complex mathematical and statistical concepts used in the latest developments of targeted machine learning estimation. We also propose to implement the latest TMLE developments in Stata, a statistical software package favoured by most applied researchers in public health, health economics, epidemiology and clinical science. We will extend the eltmle (https://github.com/migariane/eltmle) Stata command we developed, together with its extensive help file, by adding new functionalities to allow robust statistical inference. In addition, we plan to publish a simple yet detailed article in the Stata Journal, along with online tutorials and empirical applications illustrating the use of eltmle. Lastly, we will demonstrate the good properties of TMLE in simulated scenarios.
We will apply the eltmle command to estimate how the working environment causally affects cancer incidence and mortality, and to evaluate the causal effect of the type of colon cancer surgery (laparoscopic vs. open) on 30-day mortality.
Our dissemination strategy will target both applied researchers and stakeholders. It includes several channels, from classical publications and conference presentations, to dissemination through online open-source tutorials and technical support using open-source tools such as GitHub, as well as early engagement with stakeholders to develop the applied studies. Furthermore, we will run a two-day workshop hosted at the London School of Hygiene and Tropical Medicine, aiming to foster a network of eltmle users.

Technical Summary

Questions that motivate studies in the health, social and behavioral sciences are often causal, yet they tend to be answered using classical statistical methods. Causal inference methods are needed when causality cannot be guaranteed by design (i.e., in observational studies) or when randomisation fails to provide the required balance in trials. Rapid ongoing advances in the field of causal inference for observational data have resulted in several algorithms to estimate the causal effect of a treatment on an outcome. Recently, data-adaptive estimation using machine learning techniques has been incorporated in the development of causal inference estimators. One of these algorithms is targeted maximum likelihood estimation (TMLE). TMLE is a semiparametric, double-robust, efficient substitution estimator that allows for data-adaptive estimation while still yielding valid statistical inference. In addition to being double-robust, TMLE allows the inclusion of machine learning algorithms that reduce the risk of model misspecification, a problem that persists for competing estimators. Nonetheless, TMLE rests on relatively complex statistical and mathematical concepts that need to be demystified for wider adoption. Furthermore, some questions remain for statistical inference in non-parametric settings (e.g., whether confidence intervals achieve nominal coverage). This is an area of ongoing work in which cross-validation is used to overcome TMLE issues in non-parametric settings (e.g., the Donsker class condition). The Donsker class condition refers to the restriction on the complexity (loosely, the smoothness) of the estimated nuisance functions that is needed to establish asymptotic linearity and to base statistical inference on the influence function. We plan to extend a previous Stata implementation of TMLE we developed to implement the most recent theoretical advances: i) to produce robust statistical inference in finite samples, and ii) to include other functionalities making the package readily accessible for applied researchers.
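To make the estimator described above more concrete, the following is a minimal sketch of TMLE for the average treatment effect of a binary treatment on a binary outcome, with influence-function-based standard errors. It is an illustration under stated assumptions and not the eltmle implementation: the data are simulated, all function and variable names (fit_logistic, tmle_ate, W, A, Y) are hypothetical, and plain logistic regressions stand in for the data-adaptive machine learning step. Python with NumPy is used here purely for illustration.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, offset=None, n_iter=50):
    """Logistic regression by Newton-Raphson, with an optional fixed offset."""
    n, k = X.shape
    offset = np.zeros(n) if offset is None else offset
    beta = np.zeros(k)
    for _ in range(n_iter):
        p = expit(X @ beta + offset)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def tmle_ate(W, A, Y, bound=0.01):
    """TMLE of the average treatment effect E[Y(1)] - E[Y(0)] (illustrative)."""
    n = len(Y)
    ones, zeros = np.ones(n), np.zeros(n)

    # Step 1: initial outcome model Q(A, W) = P(Y = 1 | A, W).
    # Assumption: parametric logistic regression replaces machine learning here.
    beta_q = fit_logistic(np.column_stack([ones, A, W]), Y)
    Q1 = np.clip(expit(np.column_stack([ones, ones, W]) @ beta_q), 1e-6, 1 - 1e-6)
    Q0 = np.clip(expit(np.column_stack([ones, zeros, W]) @ beta_q), 1e-6, 1 - 1e-6)
    QA = np.where(A == 1, Q1, Q0)

    # Step 2: treatment model g(W) = P(A = 1 | W), bounded away from 0 and 1.
    Xg = np.column_stack([ones, W])
    g = np.clip(expit(Xg @ fit_logistic(Xg, A)), bound, 1 - bound)

    # Step 3: clever covariate and one-parameter fluctuation of the outcome model.
    H1, H0 = 1.0 / g, -1.0 / (1.0 - g)
    HA = np.where(A == 1, H1, H0)
    eps = fit_logistic(HA[:, None], Y, offset=logit(QA))[0]

    # Step 4: targeted predictions and substitution (plug-in) estimate.
    Q1s = expit(logit(Q1) + eps * H1)
    Q0s = expit(logit(Q0) + eps * H0)
    QAs = np.where(A == 1, Q1s, Q0s)
    ate = np.mean(Q1s - Q0s)

    # Step 5: inference from the estimated efficient influence function.
    ic = HA * (Y - QAs) + (Q1s - Q0s) - ate
    se = np.std(ic, ddof=1) / np.sqrt(n)
    return ate, se, (ate - 1.96 * se, ate + 1.96 * se)

# Simulated example (all coefficients are arbitrary illustrative choices).
rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.3 * W[:, 1]))
Y = rng.binomial(1, expit(-0.5 + 1.0 * A + 0.6 * W[:, 0] + 0.4 * W[:, 1]))

ate, se, ci = tmle_ate(W, A, Y)
print(f"ATE estimate: {ate:.3f}, SE: {se:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

In this sketch both nuisance models are correctly specified by construction, so it does not exercise the double-robustness property or the finite-sample issues discussed above. Replacing the two logistic fits with cross-validated machine learning predictions would be the natural next step in that direction.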