Instrumental Variables Estimation, Selection of Instruments and Two-Sample Mendelian Randomisation

Lead Research Organisation: University of Oxford

Abstract

Learning how one variable causes another is of broad interest and importance across the sciences. An example is the causal effect of obesity on heart disease. Many consider randomised experiments the only solution, but these are not always feasible for ethical, financial or practical reasons.
Instrumental variables (IV) estimation is a method for identifying and estimating causal effects from observational data. It addresses unobserved confounding, the scenario in which an unobserved variable affects both the exposure and the outcome and so biases naive estimates. An instrument is a third variable that affects the exposure and affects the outcome only through its impact on the exposure. Because a valid instrument removes the bias induced by unobserved confounding, the method is widely valuable. In observational genetic studies, Mendelian randomisation (MR) employs genetic variants as instrumental variables to discover the causal effects of modifiable health exposures on disease. Two-sample MR draws statistical inference from summary data in two independent data sets, which maintains privacy because access to individual-level information is not required.
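As a minimal illustration of the idea above, the following Python sketch simulates unobserved confounding and compares a naive regression slope with the simple ratio (Wald-type) IV estimate. The data-generating model, the true effect of 0.5 and all names are assumptions made for illustration; none of them come from the project itself.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=0.5, seed=0):
    """Assumed toy model: z -> x -> y, with a confounder u -> x and u -> y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)                   # instrument
        ui = rng.gauss(0, 1)                   # unobserved confounder
        xi = zi + ui + rng.gauss(0, 1)         # exposure
        yi = beta * xi + ui + rng.gauss(0, 1)  # outcome; z enters only via x
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()

# Naive slope of y on x: biased because u drives both x and y.
naive_slope = cov(x, y) / cov(x, x)

# Ratio IV estimate: uses only the variation in x that is induced by z.
iv_estimate = cov(z, y) / cov(z, x)

print(f"naive slope: {naive_slope:.3f}")
print(f"IV estimate: {iv_estimate:.3f}")
```

In this assumed model the naive slope settles well above the true effect, while the IV estimate stays close to it, because the instrument is independent of the confounder by construction.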
The research aims to develop novel statistical methods for improving instrumental variables estimation in causal inference to address biased estimates in practical applications.
The most popular MR estimator is the inverse variance weighted (IVW) estimator. However, IVW can be heavily biased when there are many weak instruments. Introducing a pre-selection step has become standard practice. Although such selection removes much of the weak-instrument bias, it introduces another bias, the winner's curse, because the instruments that survive selection are the "winners", whose estimated effects on the exposure tend to be overstated. Selecting instruments in a third independent sample removes this bias, but relying on such a sample is often infeasible. Finding a better estimator has therefore attracted attention from researchers. Recent proposals include the debiased IVW (dIVW) estimator, a refinement of the IVW estimator, and the Profile Score (PS) estimator.
Two properties are central when evaluating estimators: consistency and asymptotic normality, particularly when the number of instruments is large. Although all three estimators share these properties, they rely on different sets of assumptions. The central objective of this project is to examine these assumptions in depth, with the aim of refining them and establishing criteria for selecting the most suitable estimator. More specifically, the project will compare the conditions under which the dIVW and PS estimators perform well. The project is expected to yield both theoretical insights and empirical validation.
The project also concerns the selection of valid instruments, defined by two exclusion conditions: (1) the instrument is not directly related to the outcome, and (2) the instrument is not related to unobserved variables that affect both the exposure and the outcome. The existing Confidence Interval method can only select instruments that satisfy condition (1) from instruments already assumed to satisfy condition (2). The project plans to develop a methodology for selecting valid instruments without having to assume condition (1) or condition (2).
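To make the many-weak-instruments bias concrete, here is a small Python sketch of the IVW and dIVW estimators computed from two-sample summary statistics. The formulas are the forms commonly stated in the MR literature, not something specified in this abstract, and the simulated effect sizes, standard errors and variable names are illustrative assumptions. No pre-selection step is applied, so the winner's curse does not arise here.

```python
import random


def ivw(gamma_hat, Gamma_hat, se_y):
    """IVW: ratio of weighted cross-products to weighted squares."""
    num = sum(g * G / s ** 2 for g, G, s in zip(gamma_hat, Gamma_hat, se_y))
    den = sum(g ** 2 / s ** 2 for g, s in zip(gamma_hat, se_y))
    return num / den


def divw(gamma_hat, Gamma_hat, se_x, se_y):
    """dIVW: subtracts se_x^2 from each squared exposure estimate, which
    corrects the denominator for measurement error in gamma_hat."""
    num = sum(g * G / s ** 2 for g, G, s in zip(gamma_hat, Gamma_hat, se_y))
    den = sum((g ** 2 - sx ** 2) / sy ** 2
              for g, sx, sy in zip(gamma_hat, se_x, se_y))
    return num / den


# Assumed simulation: many instruments whose true effects are no larger
# than their standard errors (that is, weak instruments).
rng = random.Random(0)
p, beta, se = 10000, 0.5, 0.02
gamma_hat, Gamma_hat = [], []
for _ in range(p):
    gamma = rng.gauss(0, 0.02)                          # true SNP-exposure effect
    gamma_hat.append(gamma + rng.gauss(0, se))          # exposure sample
    Gamma_hat.append(beta * gamma + rng.gauss(0, se))   # outcome sample

se_x = [se] * p
se_y = [se] * p

ivw_estimate = ivw(gamma_hat, Gamma_hat, se_y)
divw_estimate = divw(gamma_hat, Gamma_hat, se_x, se_y)
print(f"IVW:  {ivw_estimate:.3f}")
print(f"dIVW: {divw_estimate:.3f}")
```

Under these assumptions the IVW estimate is pulled towards zero, roughly halving the true effect, while the dIVW estimate stays close to it.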
A further research direction is double/debiased machine learning. Many studies focus primarily on linear model specifications. I will allow for general functions in the model and use machine learning methods to estimate them. In addition, I shall use orthogonal moments to reduce the bias of estimators and cross-fitting to remove over-fitting bias, thereby improving inference. There are many more possible directions, such as heteroscedastic (non-constant) variances or algorithms for choosing the threshold in instrument selection. Addressing these questions, together with the necessary software development, is highly relevant and should benefit a large community of users in applied causal inference. This project falls within the EPSRC mathematical sciences area.
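The combination of an orthogonal moment and cross-fitting can be sketched in a few lines of Python. The example below assumes a partially linear model, y = theta * d + g(x) + noise, and uses a deliberately simple binned-mean regression as a hypothetical stand-in for a machine learning method; the model, the learner and all names are illustrative assumptions rather than the project's method.

```python
import math
import random


def fit_binned_mean(xs, ts, n_bins=20, lo=-3.0, hi=3.0):
    """Stand-in learner: piecewise-constant regression of t on x."""
    def bin_of(x):
        k = int((x - lo) / (hi - lo) * n_bins)
        return min(max(k, 0), n_bins - 1)

    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, t in zip(xs, ts):
        sums[bin_of(x)] += t
        counts[bin_of(x)] += 1
    overall = sum(ts) / len(ts)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_of(x)]


def dml_partially_linear(x, d, y, n_folds=2):
    """Cross-fitted partialling-out estimate of theta."""
    n = len(x)
    num = den = 0.0
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        held_out = [i for i in range(n) if i % n_folds == k]
        # Nuisance functions are fitted on the other folds only.
        m_hat = fit_binned_mean([x[i] for i in train], [d[i] for i in train])
        l_hat = fit_binned_mean([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            d_res = d[i] - m_hat(x[i])  # exposure with E[d | x] removed
            y_res = y[i] - l_hat(x[i])  # outcome with E[y | x] removed
            num += d_res * y_res
            den += d_res * d_res
    return num / den


# Assumed data: nonlinear dependence of both d and y on x.
rng = random.Random(1)
theta = 1.0
x, d, y = [], [], []
for _ in range(4000):
    xi = rng.uniform(-3, 3)
    di = math.sin(xi) + rng.gauss(0, 1)
    yi = theta * di + 2 * math.sin(xi) + xi ** 2 / 3 + rng.gauss(0, 1)
    x.append(xi)
    d.append(di)
    y.append(yi)

theta_hat = dml_partially_linear(x, d, y)
print(f"theta_hat: {theta_hat:.3f}")
```

Regressing residual on residual is what makes the moment orthogonal: small errors in either fitted nuisance function have only a second-order effect on the estimate. Fitting the nuisance functions on folds that exclude the evaluated observation is the cross-fitting step.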

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships and placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximise reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Publications

Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/S023151/1                                      01/04/2019   30/09/2027
2740743             Studentship    EP/S023151/1   01/10/2022   30/09/2026   Jeffrey Tse