A general framework to adjust for missing confounders in observational studies

Lead Research Organisation: Imperial College London
Department Name: School of Public Health


Assessing the impact of a risk factor/exposure X on a health outcome Y in observational studies is invariably subject to confounding. Cohort studies are an ideal source of information as they typically contain a rich set of individual-level variables. Nevertheless, a study based only on a cohort may suffer from selection bias and a lack of population representativeness. Cohort studies may also lack the statistical power to assess rare outcomes and geographical or other group-level variations, which limits the extent to which contextual factors such as area-level social deprivation can be investigated.
Routinely collected administrative data are a good alternative in terms of representativeness; however, these data sources typically hold a limited number of variables for a large population and might miss important predictors/confounders, leading to potentially biased estimates of the risks.
We propose a general framework that integrates these two sources of data, taking advantage of the detailed information on confounders from cohorts/surveys while benefiting from the statistical power and population representativeness of the registries. This strategy entails missing data imputation: administrative datasets contain data on each individual in the target population, while cohorts/surveys typically cover only a subset of individuals, so the confounders obtained from the latter source will be partially measured (i.e. they will be missing for some of the units in the registries). Imputing every confounder separately could prove computationally infeasible and would rest on several assumptions, given the potentially large number of confounders to consider.
We will build a propensity-score-like index (which we will call the Partial Propensity Score, PPS) to summarise the values of the confounders from the cohorts/surveys, so that only one variable needs to be imputed when missing. The index will be included in the epidemiological analysis through a flexible model, and we will be able to provide a direct estimate of the causal link between X and Y, as all the confounders will have been taken into account.
We will build our framework first on individual-level data and then extend it to the aggregated level, e.g. small area studies, which are generally used to summarise spatial and spatio-temporal variations in epidemiological risks (e.g. for disease surveillance) or to focus on aetiological questions (e.g. to unveil environmental/social determinants of mortality or morbidity).
We will use Bayesian full probability modelling, which provides a flexible approach to incorporating different assumptions about the missing data mechanism and accommodating different patterns of missing data. Through realistic simulation studies we will evaluate the properties of the framework and compare it with other state-of-the-art methods. In addition, two real case studies will be considered. The first will assess the risk of low birth weight given exposure to chlorine in water in Northern England and will be based on individual-level data. The second will investigate the impact of air pollution concentration and noise exposure on hospital admissions from cardiovascular causes in England and Wales and will be at the small area level. Through the case studies we will be able to show how our proposed methodology changes the results of epidemiological analyses, in terms of the effect of exposure on the health outcomes, compared to the commonly used analysis based on data from population registries only. This has the potential to translate into changes in health policies and strategies that take the more accurate results into account, and the framework could become the new state-of-the-art method for the analysis of observational studies.

Technical Summary

We propose a general framework to deal with confounding in epidemiological studies by integrating different sources of data. Administrative registries used to study the relationship between an exposure X and an outcome Y ensure good population coverage and high statistical power to detect small effects or to analyse rare end points. However, they provide information on only a limited set of confounders, and so suffer from residual confounding, leading to potentially biased results. Our strategy is to integrate registries with cohorts/surveys that contain a rich set of potentially confounding variables; however, as these are based on population samples, information is only available for a subsample of the units in the registries, thus calling for imputation of the missing data. We will build a propensity-score-like index (PPS) to summarise the partially measured confounders from the cohorts/surveys so that only one variable needs to be imputed. The index is obtained by regressing X against all the confounders and is then included, in a flexible way, in the regression model that links X and Y. Framed in a Bayesian perspective, we will specify a joint model which incorporates the index construction, the imputation and the link with the health outcome, thus allowing uncertainty to be propagated across all parts. At the same time we will investigate the impact of information from the outcome and from the exposure on the imputation of the missing PPS (called feedback in the Bayesian literature). We will build the framework for individual-level data and extend it to deal with small area aggregated data, commonly used in spatial epidemiology for the analysis of geographic patterns of diseases with respect to environmental, demographic and socioeconomic factors. We will apply our framework to two case studies and investigate how the epidemiological results are modified compared to the standard modelling approaches.
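
The three steps described above (index construction, imputation, outcome model) can be illustrated with a small simulation. This is only a minimal sketch under strong simplifying assumptions, not the proposed method: it replaces the joint Bayesian model with a two-stage plug-in fitted by ordinary least squares, uses a continuous exposure so that the index is a linear predictor, enters the index linearly rather than flexibly, imputes the missing index by a single conditional-mean prediction from X and the registry confounder (no feedback from Y), and assumes the cohort is a simple random subsample of the registry. All variable names, sample sizes and coefficients are hypothetical and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(design, response):
    """Least-squares coefficients for response ~ design."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

# Simulated registry (hypothetical values): confounder C is recorded for
# everyone, the five confounders in U are measured in the cohort only.
n, n_cohort, true_effect = 20_000, 4_000, 0.5
C = rng.normal(size=n)
U = rng.normal(size=(n, 5))
X = 0.4 * C + U @ np.array([0.6, -0.4, 0.3, 0.5, -0.2]) + rng.normal(size=n)
Y = (true_effect * X + 0.3 * C
     + U @ np.array([0.5, -0.3, 0.4, 0.2, -0.1]) + rng.normal(size=n))

# Cohort membership: a simple random subsample of the registry (assumption).
in_cohort = np.zeros(n, dtype=bool)
in_cohort[rng.choice(n, size=n_cohort, replace=False)] = True
ones = np.ones(n)

# Step 1: in the cohort, regress X on all confounders; the index (PPS)
# is the part of the linear predictor that comes from U.
exposure_coef = ols(np.column_stack([ones, C, U])[in_cohort], X[in_cohort])
pps = np.full(n, np.nan)
pps[in_cohort] = U[in_cohort] @ exposure_coef[2:]

# Step 2: impute the single missing variable (the PPS) outside the cohort
# from the variables the registry does hold, here X and C.
registry_vars = np.column_stack([ones, X, C])
imputation_coef = ols(registry_vars[in_cohort], pps[in_cohort])
pps[~in_cohort] = registry_vars[~in_cohort] @ imputation_coef

# Step 3: outcome model on the full registry, with and without the index.
adjusted = ols(np.column_stack([ones, X, C, pps]), Y)[1]
naive = ols(np.column_stack([ones, X, C]), Y)[1]

print(f"true effect {true_effect:.2f}, "
      f"registry-only {naive:.2f}, PPS-adjusted {adjusted:.2f}")
```

In this linear, Gaussian setting the registry-only estimate is biased away from the simulated effect, while the index-adjusted estimate is close to it. A single conditional-mean imputation understates uncertainty, which is one motivation for the joint model described above.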

Planned Impact

The proposed research will bring new insights into the statistical methodology for dealing with confounding in the relationship between an exposure X and an outcome Y in observational studies. It is based on the integration of data sources (administrative registries and cohorts/surveys) and uses the partial propensity score index to summarise confounders that are available from cohorts/surveys for only a subset of the population. The index is imputed where missing and is then included in a flexible way in the analysis model, so that the direct effect of X on Y is estimated under the assumption of no unknown confounders.
This research will provide a step forward in small area epidemiological studies and individual-level studies based on administrative databases, as it will overcome the issue of residual confounding which is typically present due to the limited information on potential confounders. It will produce a methodological framework of analysis together with an easily accessible toolbox including software code, routines, tutorials and examples, which will be relevant for epidemiologists and health researchers. We will communicate the results of our methodological advances to epidemiologists and health researchers through publication in peer-reviewed journals and through conference presentations; in addition, we will organise a short course at the end of the project to train researchers to use our toolbox.
The proposed method will be applied to two environmental health studies, to investigate the effect of chlorination in water on low birth weight and to assess the role of air pollution and noise exposure in cardiovascular hospitalisations. For both applications we will compare the estimated exposure effects from our approach with those from the state-of-the-art methods used for small area and individual-level data. This will allow us to show the impact of residual confounding in the standard analyses. For instance, residual confounding might play a key role in the mixed evidence on the link between chlorination and birth outcomes available from the literature, and in the results of recent studies suggesting an inverse association at the small area level between air pollution and health in the centre of major cities after adjusting for social deprivation (e.g. the Traffic project in London http://www.kcl.ac.uk/lsm/research/divisions/aes/research/ERG/research-projects/traffic/index.aspx - the PI is providing the statistical lead on this; in New York, Krewski et al. http://www.ncbi.nlm.nih.gov/pubmed/19627030). Through our framework we will adjust for residual confounding via cohorts/surveys (e.g. Millennium Cohort, Health Survey for England, UK Biobank) and pinpoint the real exposure effects, thus potentially (i) unveiling the need for changes in policies and (ii) becoming the state-of-the-art approach for environmental studies.
We will disseminate the results from our framework through the Imperial College and MRC-PHE press offices, which will allow us to reach a wide audience and engage with the media. Our established link with Public Health England and their partnership with the Environment Agency will also allow us to easily get in touch with stakeholders and policy makers within the timeframe of the project.
Finally, the proposed framework will be extendable to other epidemiological studies where data sources need to be integrated to deal with residual confounding. For instance, we anticipate that within the timeframe of the project we will discuss the use of the methodology to integrate e-health databases (e.g. The Health Improvement Network, which has close ties with some members of the project team) and registries to investigate a range of clinical and public health conditions such as mental health, sexual health, infectious disease and drug prescriptions/doctor consultations, with the aim of carrying out future funded research.


Description MRC Population and System Medicine board
Amount £898,620 (GBP)
Funding ID MR/P023673/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 11/2017 
End 10/2020

Description Concluding Workshop
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We ran a workshop at the Royal Statistical Society to mark the end of the project. There was a range of talks from investigators across the project and from international experts in causality/data integration in observational studies, from a statistical, epidemiological and public health angle. The workshop was oversubscribed and we had an interesting discussion as the last session of the day, which provided room for interaction among participants.
Year(s) Of Engagement Activity 2019
URL https://www.statslife.org.uk/events/events-calendar/past-events-2018/eventdetail/1339/-/data-integra...