HCM: Using genetic associations to account for selection bias in epidemiology

Lead Research Organisation: University of Leicester
Department Name: Health Sciences


There are several ways in which epidemiological studies can be biased such that spurious or distorted relationships appear between variables. One particular problem is selection bias, which can occur when the subjects included in a study are in some way not representative of their source population. This can sometimes be accounted for by modelling the selection process in the data analysis, but one cannot always be sure that this modelling is accurate. There is a continuing need for further methods to account for selection bias, particularly in genetic epidemiological studies in which it can be difficult to adequately model a selection process. We have recently developed a novel approach to adjust for an effect known as index event bias, which occurs when performing a study of prognosis, or survival, among subjects with a certain disease. Because subjects with disease are not typical of the general population, having higher expected levels of the risk factors, the associations of those risk factors with prognosis may differ from their actual causal effects. In particular, false positive associations may occur, suggesting causal effects of risk factors on prognosis when in fact none exist. Since such studies often aim to identify effective treatments for disease, it is very important to reduce the likelihood of false positives. Our novel approach compares the results of genome-wide association studies of disease incidence to those of disease prognosis in order to derive a correction for the apparent association of any risk factor with prognosis.
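The mechanism described above can be illustrated with a small simulation. The Python sketch below is an illustration only and is not the project's method. It assumes a hypothetical risk factor `g` and an unmeasured factor `u` that both raise disease liability, while only `u` affects prognosis. All names, effect sizes and the liability threshold are invented for this example.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope from regressing y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate_index_event_bias(n=200_000, threshold=1.5, seed=0):
    """Return the slope of prognosis on a risk factor in the whole
    population and among disease cases only (hypothetical model)."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=n)  # risk factor with NO causal effect on prognosis
    u = rng.normal(size=n)  # unmeasured factor affecting incidence and prognosis
    liability = g + u + rng.normal(size=n)
    is_case = liability > threshold      # selection on the index event
    prognosis = u + rng.normal(size=n)   # depends on u only, not on g
    return slope(g, prognosis), slope(g[is_case], prognosis[is_case])


if __name__ == "__main__":
    pop, cases = simulate_index_event_bias()
    print(f"slope in whole population: {pop:+.3f}")
    print(f"slope among cases only:    {cases:+.3f}")
```

Under these assumptions the population slope should be close to zero, while the slope among cases is clearly negative. Cases with a high value of `g` needed less of `u` to develop disease, so `g` appears to affect prognosis even though it has no causal effect.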

Our approach shows great promise but has some limitations that we will address in this project. In its current form, it relies on a technical assumption that essentially requires genes to act on disease incidence and prognosis through different biological pathways. This is disputable and cannot be verified, but we have identified some related approaches that make different assumptions about the biological mechanism. We will develop and examine these complementary approaches in order to provide a suite of methods that, when they give similar results, will reassure researchers that their results are not significantly biased. We will also adapt our approach to analyse traits that precede the selection of study subjects, rather than being subsequent to it. Such methods will be useful for studies of large population cohorts, such as the UK Biobank, in which participation in the cohort itself can be influenced by the factors under study.

We are conducting two genome-wide association studies that will depend strongly on our methods. The first is a study of ~250,000 subjects with coronary heart disease, in which we are seeking the determinants of subsequent events including recurrent heart attacks, stroke, surgery and death. In addition to the discovery of novel genetic determinants, we are evaluating the role of traditional risk factors for heart disease, such as cholesterol, whose effects on subsequent events are currently unclear. Importantly, while our methods use genetic data to derive corrections for selection bias, the corrections can be applied to any non-genetic factor, in this case cholesterol. The second study is based on an international consortium of cases of idiopathic pulmonary fibrosis, a disease with a dismal prognosis for which we aim to identify novel therapeutic targets through genome-wide association studies of survival and of lung function decline. Of particular interest is the mucin 5B gene (MUC5B), which has a very strong effect on the risk of idiopathic pulmonary fibrosis, but a paradoxical association with longer survival time. Our proposed approach suggests that a strong selection bias is at play and that the true biological effect of MUC5B is to shorten survival. In this project we will obtain a more definitive assessment of MUC5B by applying the full suite of methods we will develop to a larger dataset that will become available.

Technical Summary

Selection bias occurs in epidemiological studies when the studied subjects are not representative of the source population, leading to biased associations between variables in the selected sample. Current adjustment methods model the selection process in various ways but cannot allow for unknown factors influencing selection. We have recently identified a new approach to selection bias using results of genome-wide association studies. We have initially developed an adjustment for index event bias, which occurs in studies of prognosis or survival among cases of disease. We obtain adjusted associations by regressing the total genetic effects on prognosis against the corresponding genetic effects on incidence. Our approach is analogous to Mendelian randomisation (MR), an increasingly prominent method that uses genetic associations to account for unmeasured confounding, and like MR it entails untestable assumptions. Currently we require direct genetic effects on prognosis to be independent of corresponding effects on incidence, implying that different biological pathways are involved. In this project we will develop complementary methods based on the same insights but requiring different assumptions, to provide a suite of sensitivity analyses to assess the robustness of a given study to selection bias. We will develop methods for a related selection bias in studies of factors preceding the trait under selection. This will be particularly relevant to large cohort studies such as UK Biobank, in which participants are healthier than the general population. We will integrate our methods with MR to provide methods that are potentially robust to both confounding and selection bias. We will apply our methods to a study of genetic and traditional risk factors for subsequent events in ~250,000 coronary heart disease patients. We will further apply our methods to a genome-wide association study of survival with idiopathic pulmonary fibrosis.
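The regression step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's implementation. It assumes that the adjusted effect for each variant is the observed prognosis effect minus the fitted slope times the incidence effect, and it ignores estimation error in the incidence effects, which a real analysis would need to handle. The function names, the simulated effect sizes and the bias slope of -0.4 are all hypothetical.

```python
import numpy as np


def adjust_for_index_event_bias(beta_incidence, beta_prognosis):
    """Regress per-variant prognosis effects on incidence effects and
    subtract the fitted slope term (assumed form of the adjustment)."""
    x = np.asarray(beta_incidence, dtype=float)
    y = np.asarray(beta_prognosis, dtype=float)
    slope, _intercept = np.polyfit(x, y, 1)  # degree-1 fit: [slope, intercept]
    adjusted = y - slope * x
    return slope, adjusted


def simulate_summary_statistics(n_snps=5000, bias_slope=-0.4, seed=1):
    """Simulate hypothetical per-variant effects in which direct prognosis
    effects are independent of incidence effects (the key assumption)."""
    rng = np.random.default_rng(seed)
    incidence = rng.normal(0.0, 0.05, n_snps)  # effects on incidence
    direct = rng.normal(0.0, 0.02, n_snps)     # true direct effects on prognosis
    noise = rng.normal(0.0, 0.01, n_snps)      # estimation noise
    observed = direct + bias_slope * incidence + noise  # biased estimates in cases
    return incidence, direct, observed


if __name__ == "__main__":
    incidence, direct, observed = simulate_summary_statistics()
    slope, adjusted = adjust_for_index_event_bias(incidence, observed)
    print(f"estimated bias slope: {slope:+.3f}")
    print(f"mean squared error before: {np.mean((observed - direct) ** 2):.2e}")
    print(f"mean squared error after:  {np.mean((adjusted - direct) ** 2):.2e}")
```

In this simulated setting the fitted slope recovers the bias term because the direct effects were generated independently of the incidence effects. If that independence fails, the slope absorbs genuine shared biology, and the adjustment itself becomes biased.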

Planned Impact

The immediate beneficiaries will be a wide range of academic and clinical researchers performing epidemiological research. Reducing the likelihood that their results are biased by selection effects will give greater confidence in the validity of those results. This will be particularly relevant for the identification or development of new treatments for disease, and for improving confidence in results obtained from large publicly funded cohorts such as UK Biobank. The pharmaceutical industry will benefit in its efforts to translate genomics into healthcare, particularly by improving the success rate of clinical trials through better target identification.

Description: IEU
Organisation: University of Bristol
Department: MRC Integrative Epidemiology Unit
Country: United Kingdom
Sector: Academic/University
PI Contribution: Development of statistical methods for causal inference
Collaborator Contribution: Development of statistical methods for causal inference
Impact: None
Start Year: 2016

Description: UCLEB
Organisation: University College London
Department: Research Department of Epidemiology and Public Health
Country: United Kingdom
Sector: Academic/University
PI Contribution: Statistical analysis and advice
Collaborator Contribution: Data collection, project management
Impact: Fine mapping of genetic loci for cardiovascular outcomes
Start Year: 2011