The Causal Continuum - Transforming Modelling and Computation in Causal Inference

Lead Research Organisation: University College London
Department Name: Statistical Science

Abstract

A central task of science and engineering is answering "what if" questions. What will happen if this gene suffers a mutation? What are the public health consequences of cutting this social benefit? What can we do to mitigate disparities among social groups? To what extent are lockdowns useful in mitigating a pandemic? What are the ramifications if failures occur at key points of a major logistical operation such as a food supply chain?

These are cause-effect questions. Answering them is hard because it involves change: historical data may fail to capture the implications of change, placing causal questions outside the comfort zone in which data is routinely used to inform decisions. It is one thing to predict the life expectancy of a smoker, as public health officials or insurance companies do. It is much harder to understand what will happen if we convince someone to stop smoking, as historical data may contain a substantial number of people who stopped smoking shortly before dying of respiratory disease, because of the discomfort it caused. A statistical or machine learning method oblivious to these causal explanations may conclude that stopping smoking is bad for one's health.
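The smoking example can be made concrete with a toy simulation (all names and numbers below are invented for illustration): in this sketch, quitting truly halves each person's risk of death, yet a naive comparison of quitters against continuing smokers makes quitting look harmful.

```python
# Hypothetical toy simulation: severely ill smokers are more likely to
# quit shortly before death, so illness confounds quitting and mortality.
import random

random.seed(0)

def simulate(n=100_000):
    deaths_quit, n_quit = 0, 0
    deaths_smoke, n_smoke = 0, 0
    for _ in range(n):
        severely_ill = random.random() < 0.10               # latent health state
        quits = random.random() < (0.8 if severely_ill else 0.2)
        # Quitting truly HALVES the risk of death within the follow-up period.
        base_risk = 0.9 if severely_ill else 0.05
        dies = random.random() < base_risk * (0.5 if quits else 1.0)
        if quits:
            n_quit += 1
            deaths_quit += dies
        else:
            n_smoke += 1
            deaths_smoke += dies
    return deaths_quit / n_quit, deaths_smoke / n_smoke

rate_quit, rate_smoke = simulate()
print(f"death rate among quitters:   {rate_quit:.3f}")
print(f"death rate among continuers: {rate_smoke:.3f}")
# Quitters die more often in this data even though quitting is beneficial:
# illness drives both the decision to quit and the risk of death.
```

A method that simply compares the two observed death rates would recommend against quitting; only a causal reading of how the data was generated reveals the mistake.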

Ideally, we would like to perform randomised controlled trials, where the choice of action is decided by the flip of a coin, so that confounding factors between cause and effect are overridden. Removing confounding is necessary to show convincingly, for instance, that a COVID-19 vaccine works through biological processes rather than through sociological factors linking who chooses to be vaccinated with their health outcomes. However, in many cases such trials can be very expensive (understanding genetic networks involves a vast experimental space) or unethical (we cannot force someone to smoke or not to), and even when they take place, a controlled trial may not fully control the factor of interest (we can randomly assign a drug or placebo to a patient, but we may not have the means to make the patient comply with the treatment at home).
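A toy simulation of the vaccination example (all numbers invented) illustrates why the coin flip matters: a health-conscious group is both more likely to get vaccinated and less likely to be exposed, so an observational comparison exaggerates the vaccine's effect, while randomised assignment recovers it.

```python
# Hypothetical sketch: "cautious" is a sociological confounder that raises
# vaccine uptake and lowers exposure. The vaccine truly cuts infection
# risk by 60% (risk multiplier 0.4).
import random

random.seed(2)

def infection_rates(n=100_000, randomised=False):
    counts = {True: [0, 0], False: [0, 0]}    # vaccinated -> [infections, size]
    for _ in range(n):
        cautious = random.random() < 0.5
        if randomised:
            vaccinated = random.random() < 0.5                    # coin flip
        else:
            vaccinated = random.random() < (0.8 if cautious else 0.3)
        base = 0.05 if cautious else 0.20     # cautious people face less exposure
        infected = random.random() < base * (0.4 if vaccinated else 1.0)
        counts[vaccinated][0] += infected
        counts[vaccinated][1] += 1
    return {v: i / m for v, (i, m) in counts.items()}

obs, rct = infection_rates(), infection_rates(randomised=True)
for name, r in [("observational", obs), ("randomised", rct)]:
    print(f"{name}: apparent risk reduction {1 - r[True] / r[False]:.0%}")
# The observational comparison overstates the true 60% reduction, because
# the vaccinated group was already at lower risk; randomisation removes this.
```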

Data scientists have not ignored these problems, and we can thank the hard work of epidemiologists, for instance, for the convincing case establishing the harmful link between smoking and lung cancer. But without randomised trials, the answer to a "what if" question requires assumptions; otherwise it is unknowable. This means that causal inference progresses slowly and is prone to mistakes. Part of the reason is that methods for causal inference traditionally rely on pre-defined families of assumptions, chosen by the statisticians who design the methods so that they yield unambiguous answers. Applied scientists then adopt whichever method is a good enough approximation to their understanding of the world (one simple case: assume there are no common causes left unmeasured in the data!). Although there are tools for sensitivity analysis (what if the assumptions are violated in particular ways?), they do not address the main issue directly: a domain expert should be able to specify upfront whichever assumptions they see as appropriate, and then be told not a single, artificially convenient answer, but what can genuinely be disentangled from the observational data given the information provided. One reason this workflow is not popular is that deducing the consequences of such assumptions requires computationally intensive algorithms.
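A classical instance of reporting what the data can and cannot determine is the assumption-free (Manski-style) bound: with a binary treatment and binary outcome, and no assumptions at all about confounding, the average treatment effect is only identified up to an interval of width one. A minimal sketch, with invented numbers:

```python
# Hypothetical sketch of assumption-free bounds on the average treatment
# effect (ATE) for binary treatment T and binary outcome Y. Without
# assumptions, the unobserved potential outcomes can be anything in {0, 1},
# so the ATE is only known to lie in an interval.

def manski_bounds(p_t, p_y1_given_t1, p_y1_given_t0):
    """p_t: P(T=1); p_y1_given_t1: P(Y=1|T=1); p_y1_given_t0: P(Y=1|T=0)."""
    # P(Y(1)=1): observed on the treated, anything on the untreated.
    y1_lo = p_y1_given_t1 * p_t
    y1_hi = p_y1_given_t1 * p_t + (1 - p_t)
    # P(Y(0)=1): observed on the untreated, anything on the treated.
    y0_lo = p_y1_given_t0 * (1 - p_t)
    y0_hi = p_y1_given_t0 * (1 - p_t) + p_t
    return (y1_lo - y0_hi, y1_hi - y0_lo)    # bounds on the ATE

lo, hi = manski_bounds(p_t=0.4, p_y1_given_t1=0.7, p_y1_given_t0=0.5)
print(f"ATE is only known to lie in [{lo:.2f}, {hi:.2f}]")
# The interval always has width exactly 1: it is the extra assumptions
# (or experiments) supplied by the analyst that narrow it.
```

Honest answers of this interval form, computed under whatever assumptions the expert is actually willing to make, are exactly what the workflow described above would deliver.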

This project has the ambition of changing common practice in causal inference, increasing transparency and the speed at which we understand the limits of our knowledge and where to look in order to make progress. It will rely on cutting-edge algorithms to provide a sandbox in which domain experts can express their knowledge flexibly, while also offering the backend support for the sophisticated computational methods this requires.
