ROBEST: Ensuring robustness of evidence in public health research for increased policy impact: widened use of advanced causal inference techniques
Lead Research Organisation:
London School of Hygiene and Tropical Medicine
Department Name: Epidemiology and Population Health
Abstract
Coherent and effective public health policies rest on reliable evidence, such that researchers are able to identify, demonstrate, and raise awareness of a need for change, as well as measure the causal effect of proposed changes. Such evidence can be built upon the rich electronic health records now available in many research fields, including public health, health economics, epidemiology and clinical science. The potential of these data is enormous, as they offer a valuable source of real-world evidence to inform public health policies. Nonetheless, reliable evidence can only be obtained through widespread use of robust statistical methodology among applied researchers with an interest in evaluative research.
The large number of potential confounders and their possibly complex relationships with the outcome make the use of standard regression methods challenging, or even impossible in some instances. Furthermore, the observational nature of such data makes any causal interpretation of findings from conventional analytic approaches hazardous. These caveats call for specific causal inference methodology, aimed at approaching observational data with a randomised trial mindset.
Alongside the growing availability of data, there has been rapid development of statistical tools designed to further the use of observational data to answer causal questions. One of the recently developed algorithms, blending machine learning techniques with causal inference methodology, is targeted maximum likelihood estimation (TMLE). This cutting-edge approach combines double-robust estimation with good statistical properties, supporting valid causal inference.
Nonetheless, there is a discrepancy between the speed of methodological development and the adoption of these innovative methods among applied researchers. We identified three reasons for this misalignment: a gap in the understanding of the new methods, a lack of ready-to-use software, and the scarcity of published studies showcasing the superiority of TMLE. We aim to address these shortcomings in this proposal.
We will provide applied researchers with tutorials designed to demystify complex mathematical and statistical concepts used in the latest developments of targeted machine learning estimation. Furthermore, we propose to implement the latest TMLE developments in Stata, a statistical software package favoured by most applied researchers in public health, health economics, epidemiology and clinical science. We will extend the eltmle (https://github.com/migariane/eltmle) Stata command we developed, together with its extensive help file, by adding new functionalities to allow robust statistical inference. We also plan to publish a simple yet detailed article in the Stata Journal, along with online tutorials and empirical applications illustrating the use of eltmle. Lastly, we will demonstrate the good properties of TMLE in simulated scenarios.
We will apply the eltmle command to estimate how working environment causally affects cancer incidence and mortality, and to evaluate the causal effect of the type of colon cancer surgery (laparoscopy vs. open) on 30-day mortality.
Our dissemination strategy will target both applied researchers and stakeholders. It includes several channels, from classical publications and conference presentations, to dissemination through online open-source tutorials and technical support using open-source tools such as GitHub, as well as early engagement with stakeholders to develop the applied studies. Furthermore, we will run a two-day workshop hosted at the London School of Hygiene and Tropical Medicine, aiming to foster a network of eltmle users.
Technical Summary
Often, the questions that motivate studies in the health, social and behavioral sciences are causal, but they tend to be answered using classical statistical methods. However, causal inference methods are needed when causality cannot be guaranteed by design (i.e., in observational studies) or when randomisation fails to provide the required balance in trials. Rapid ongoing advances in the field of causal inference for observational data have resulted in several algorithms to estimate the causal effect of a treatment on an outcome. Recently, data-adaptive estimation using machine learning techniques has been incorporated in the development of causal inference estimators. One of these algorithms is targeted maximum likelihood estimation (TMLE). TMLE is a semiparametric, double-robust, efficient substitution estimator that allows for data-adaptive estimation while retaining valid statistical inference. In addition to being double-robust, TMLE allows the inclusion of machine learning algorithms that reduce the risk of model misspecification, a problem that persists for competing estimators. Nonetheless, TMLE rests on relatively complex statistical and mathematical concepts that need to be demystified for wider adoption. Furthermore, some questions remain for statistical inference in non-parametric settings (e.g., the nominal coverage of confidence intervals). This is an area of ongoing work, in which cross-validation is used to overcome TMLE issues in non-parametric settings (i.e., the Donsker class condition). The Donsker class condition refers to the smoothness needed in finite samples to assume asymptotic linearity and implement statistical inference based on the influence function. We plan to extend a previous Stata implementation of TMLE that we developed to implement the most recent theoretical advances: i) to produce robust statistical inference in finite samples, and ii) to include other functionalities making the package readily accessible for applied researchers.
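As an illustration only, the main steps of a basic TMLE for the average treatment effect (ATE) of a binary treatment on a binary outcome can be sketched in Python as follows. This is not the eltmle implementation: plain logistic regressions stand in for the machine learning ensemble, the simulated data, the coefficient values, the propensity score bounds and all function names (`fit_logistic`, `tmle_ate`) are assumptions made for the example, and the confidence interval uses the simple influence-function variance rather than any cross-validated refinement.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, offset=None, n_iter=50):
    """Logistic regression (no automatic intercept) fitted by Newton-Raphson."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta + offset)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def tmle_ate(y, a, w):
    """Hypothetical helper: TMLE of the ATE with parametric nuisance models."""
    n = len(y)
    ones, zeros = np.ones(n), np.zeros(n)
    # Step 1: initial outcome model Q(A, W) = P(Y = 1 | A, W).
    XQ = np.column_stack([ones, a, w])
    bq = fit_logistic(XQ, y)
    q1 = expit(np.column_stack([ones, ones, w]) @ bq)   # prediction under A = 1
    q0 = expit(np.column_stack([ones, zeros, w]) @ bq)  # prediction under A = 0
    qa = np.clip(np.where(a == 1, q1, q0), 1e-6, 1 - 1e-6)
    # Step 2: propensity score g(W) = P(A = 1 | W), bounded away from 0 and 1.
    Xg = np.column_stack([ones, w])
    g1 = np.clip(expit(Xg @ fit_logistic(Xg, a)), 0.01, 0.99)
    # Step 3: clever covariate and one-parameter fluctuation (offset = logit Q).
    h1, h0 = 1.0 / g1, -1.0 / (1.0 - g1)
    ha = np.where(a == 1, h1, h0)
    eps = fit_logistic(ha[:, None], y, offset=logit(qa))[0]
    # Step 4: targeted update of the predictions, then substitution estimator.
    q1s = expit(logit(np.clip(q1, 1e-6, 1 - 1e-6)) + eps * h1)
    q0s = expit(logit(np.clip(q0, 1e-6, 1 - 1e-6)) + eps * h0)
    ate = np.mean(q1s - q0s)
    # Step 5: inference from the estimated efficient influence function.
    qas = np.where(a == 1, q1s, q0s)
    ic = ha * (y - qas) + (q1s - q0s) - ate
    se = np.sqrt(np.var(ic, ddof=1) / n)
    return ate, se, (ate - 1.96 * se, ate + 1.96 * se)

# Simulated example (all coefficients are arbitrary, illustrative choices).
rng = np.random.default_rng(0)
n = 5000
w = rng.normal(size=(n, 2))
a = rng.binomial(1, expit(0.5 * w[:, 0] - 0.5 * w[:, 1])).astype(float)
lin = -0.5 + 0.7 * w[:, 0] + 0.3 * w[:, 1]
y = rng.binomial(1, expit(lin + 1.0 * a)).astype(float)
true_ate = np.mean(expit(lin + 1.0) - expit(lin))  # known because data are simulated

ate, se, ci = tmle_ate(y, a, w)
print(f"TMLE ATE = {ate:.3f} (SE {se:.3f}), true ATE = {true_ate:.3f}")
```

In this sketch both nuisance models are correctly specified, so it does not by itself demonstrate double robustness; replacing one of the two models with a deliberately misspecified one is a simple way to explore that property in simulation.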
Publications
Exarchakou A (2024) What can hospital emergency admissions prior to cancer diagnosis tell us about socio-economic inequalities in cancer diagnosis? Evidence from population-based data in England. in British journal of cancer
Gaber CE (2024) De-Mystifying the Clone-Censor-Weight Method for Causal Research Using Observational Data: A Primer for Cancer Researchers. in Cancer medicine
Martinuka O (2023) Target trial emulation with multi-state model analysis to assess treatment effectiveness using clinical COVID-19 data. in BMC medical research methodology
Pilleron S (2023) Immortal-time bias in older vs younger age groups: a simulation study with application to a population-based cohort of patients with colon cancer. in British journal of cancer
Smith M (2024) Comparison of common multiple imputation approaches: An application of logistic regression with an interaction. in Research Methods in Medicine & Health Sciences
Smith MJ (2023) Application of targeted maximum likelihood estimation in public health and epidemiological studies: a systematic review. in Annals of epidemiology
Wei C (2024) Effectiveness of post-COVID-19 primary care attendance in improving survival in very old patients with multimorbidity: a territory-wide target trial emulation. in The British journal of general practice : the journal of the Royal College of General Practitioners
Zepeda-Tello R (2022) The Delta-Method and Influence Function in Medical Statistics: a Reproducible Tutorial
| Description | MS UCL PhD supervision |
| Organisation | University College London |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | MS has been invited to contribute to the supervision of PhD student Aasiyah Rashan on the topic of "Determining the transferability of treatment effects between international critical care populations". As such, MS was given an Honorary Research Fellow position at UCL. |
| Collaborator Contribution | MS is providing support to Aasiyah for specific aims and objectives of their PhD project. MS participates in regular calls with the supervisory team, and contributes to developing Aasiyah's skills and expertise in their research topic. |
| Impact | No outputs so far. |
| Start Year | 2023 |
| Description | ACIC - poster Matthew Smith |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Third sector organisations |
| Results and Impact | Matthew Smith presented a poster at the American Causal Inference Conference. He attended the conference and met with members of the team at the University of California, Berkeley, who develop TMLE. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://sci-info.org/wp-content/uploads/2024/05/event_202312_agenda_pdf_aoggo.pdf |
| Description | Miguel's ROBEST presentation (Granada) |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Professional Practitioners |
| Results and Impact | Miguel-Angel Luque Fernandez was invited to present the work conducted as part of the research funded through ROBEST to the Institute of Mathematics at the University of Granada, Spain. |
| Year(s) Of Engagement Activity | 2023 |
| URL | https://wpd.ugr.es/~imag/events/event/ensemble-learning-targeted-maximum-likelihood-estimation-for-s... |
| Description | Pacific Causal Inference Conference MS presentation |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | MS was invited to present on ongoing developments of our work on causal inference for the relative survival setting at the Pacific Causal Inference Conference, in a section dedicated to survival outcomes. |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://www.spco.cc/pcic/ |
| Description | REDICO advisory board |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | Camille Maringe is a member of the advisory board of the REDICO programme (University of Luxembourg). |
| Year(s) Of Engagement Activity | 2023,2024,2025 |
| URL | https://researchportal.lih.lu/en/projects/reducing-disparities-in-cancer-outcomes |
| Description | Yorkshire Cancer Research - research advisory board |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Other audiences |
| Results and Impact | Camille Maringe joined as a member of the Yorkshire Cancer Research Advisory Panel. Panel members assist with the assessment of funding applications by reviewing a few applications remotely (typically 2 or 3 each year) or attending the annual Research Advisory Meeting in Harrogate to consider shortlisted proposals. In addition, panel members may be asked for advice about future direction and planning of new activities, on an ad hoc basis. |
| Year(s) Of Engagement Activity | 2025 |
| URL | https://www.yorkshirecancerresearch.org.uk/ |
