Development of robust statistical and machine learning algorithms for extrapolation in causal inference

Lead Research Organisation: University of Oxford

Abstract

This project falls within the EPSRC mathematical sciences research area.

Extrapolation in causal inference refers to the process of making predictions or estimating causal effects for situations or contexts that lie outside the range of the observed data. It allows researchers to generalize treatment effect estimates obtained from a particular study to new or different settings. For example, if a clinical trial evaluates a drug's effectiveness in a specific patient population, researchers may want to extrapolate the results to assess the drug's efficacy in a different population. This is a crucial aspect of treatment effect estimation because real-world applications often require inferences beyond the scope of the available data. Extrapolation relies on assumptions about the similarity between the observed and target contexts, and these assumptions can introduce uncertainty and potential bias into the estimated treatment effects.
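
As a hedged illustration of the clinical trial example above, the following minimal Python sketch transports a trial estimate to a target population by standardization (post-stratification) on a single discrete covariate. It is not the project's method. The toy data, the function names stratum_effects and transported_effect, and the target shares are all hypothetical. The sketch assumes that stratum-specific effects are the same in the trial and in the target population, and that every stratum contains both treated and control participants.

```python
from collections import defaultdict


def stratum_effects(trial_rows):
    """Treated-minus-control difference in mean outcome within each stratum.

    trial_rows: iterable of (stratum, treated, outcome), with treated in {0, 1}.
    Assumes every stratum has at least one treated and one control row.
    """
    # For each stratum and arm, keep a running [sum of outcomes, count].
    sums = defaultdict(lambda: {0: [0.0, 0], 1: [0.0, 0]})
    for stratum, treated, outcome in trial_rows:
        cell = sums[stratum][treated]
        cell[0] += outcome
        cell[1] += 1
    effects = {}
    for stratum, arms in sums.items():
        mean_treated = arms[1][0] / arms[1][1]
        mean_control = arms[0][0] / arms[0][1]
        effects[stratum] = mean_treated - mean_control
    return effects


def transported_effect(effects, target_shares):
    """Weight stratum effects by the target population's stratum shares."""
    return sum(share * effects[s] for s, share in target_shares.items())


# Hypothetical toy trial: (age_group, treated, outcome).
trial = [
    ("young", 1, 3.0), ("young", 1, 5.0), ("young", 0, 2.0), ("young", 0, 2.0),
    ("old", 1, 2.0), ("old", 0, 1.0),
]

# Hypothetical target population with more older patients than the trial.
target_shares = {"young": 0.25, "old": 0.75}

effects = stratum_effects(trial)
target_effect = transported_effect(effects, target_shares)
print(effects, target_effect)
```

In this toy example the effect weighted by the trial's own age mix is about 1.67, while the effect weighted by the target age mix is 1.25, because the target population contains more older patients, for whom the estimated effect is smaller. If the stratum-specific effects differ between the two populations, or if an unmeasured variable modifies the effect, this reweighting is biased.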

Common challenges include differences in baseline characteristics, unmeasured confounders, and variations in treatment response between the observed and extrapolated contexts. Although machine learning has shown impressive predictive performance in the big data era when sufficient data are available, that performance is often unstable, which makes its contribution unreliable. Two intertwined problems remain: the lack of robust machine learning models that can extrapolate, and the difficulty of transferring what has been learned from existing data to a target population.

We aim to build a theory of robust extrapolation in causal inference that addresses the challenges above by combining machine learning and statistics. We will deliver scalable methods that extrapolate well, with rigorous theoretical guarantees for their uncertainty quantification. In collaboration with our industry partner (Novartis), we have made some progress on data collection and on identifying application scenarios. We expect to bring theory to practice, so that our methods can facilitate clinical trial design and the identification and estimation of treatment effects. We will begin by developing a thorough understanding of the simpler problem of clinical trial decision making in this context.

In summary, extrapolating treatment effect estimates is essential when researchers aim to apply causal effect estimates beyond the confines of their observed data. Although it is challenging and requires careful consideration of assumptions and validation, well-designed extrapolation methods enhance the applicability and generalizability of causal inference findings in various domains, including healthcare, social sciences, and policy analysis.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships and placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S023151/1      |              |              | 01/04/2019 | 30/09/2027 |
2740759           | Studentship  | EP/S023151/1 | 01/10/2022 | 30/09/2026 | Linying Yang