Strategy for analysing epidemiological data involving genetic, endogenous, environmental factors and their interactions

Lead Research Organisation: Imperial College London
Department Name: Epidemiology & Public Health


It is well known that many chronic diseases, in particular cancer, diabetes and heart disease, are multifactorial, involving complex relationships between genetic predisposition and other risk factors such as diet, lifestyle characteristics, and the physical environment. To study these, a number of large prospective cohort studies (i.e. studies that follow a group of individuals through time) currently underway worldwide conduct a series of sub-studies focussed on specific disease outcomes, in which increasingly large quantities of data from laboratory analysis and questionnaires are recorded. However, statistical tools to interpret such rich data have, as yet, failed to keep pace, and the expensive data being produced are not exploited to their full potential. This project will develop improved techniques for simultaneous analysis of data measuring a wide diversity of risk factors and for evaluating their interactions. These tools will be applied to two specific case studies related to breast and lung cancer that are part of the large European Prospective Investigation on Cancer and Nutrition (over 500,000 people followed). Besides these specific analyses, the generic strategy and methodological developments are of broad interest throughout the epidemiological community, with the potential to have a profound impact on the understanding of the causes of complex diseases and on the improvement of human health. To aid the interpretation of the rich output produced by our models, we will also develop visualisation tools that can be used easily by a wide community of experts in different domains (clinicians, epidemiologists, public health specialists), thus facilitating dissemination of important research. The proposal requires extensive interdisciplinary work, combining expertise in statistical modelling, genetics, epidemiology and computing. In view of their complementary skills, the team of investigators at Imperial College is uniquely placed to achieve these objectives.

Technical Summary

Large prospective cohort studies, in which extensive covariate data are collected simultaneously on genetic variants for candidate genes and on risk factors connected to the host, to lifestyle, and to exposure to the physical environment, have been recognized as one of the key designs in modern epidemiology. For such studies, estimating joint effects and interactions between the different domains considered (e.g. genetic, nutritional, environmental) is of great interest but challenging, and there is a need to develop effective inferential strategies. The aim of this proposal is to capitalize on current statistical research on methods of analysis for large data sets in order to formulate and implement novel analysis strategies tailored to the epidemiological, biological and genetic data arising from large cohort studies with biobanks. We will use the European Prospective Investigation on Cancer and Nutrition (EPIC) cohort study (over 500,000 people followed) as a paradigm to develop the implementation of our research strategy. In particular, we will employ two exemplar case studies, on (a) breast cancer and (b) lung cancer, to test our analysis strategy: (a) The current data set contains close to 2000 breast cancer cases and matched controls with detailed information on diet, anthropometric measurements, hormonal measurements in plasma and genotyping for 60 genes. (b) The study relates to the effect of environmental tobacco smoke and air pollution in non-smokers. 30 genes, including DNA repair genes and metabolic genes, have been analysed in 1500 cases and matched controls.

The methodological framework that we will mainly adopt is that of hierarchical models and associated Bayesian computations. The models that we propose to employ stay within the broad regression framework, thus preserving epidemiologic interpretability, but we shall considerably extend its flexibility by incorporating model-based clustering, variable selection techniques and stochastic search for interactions (e.g. gene-environment) within the regression formulation. Our aim is to develop a series of flexible regression models that will form the basis of a novel strategy for investigating main effects and patterns of interaction in complex multi-factorial epidemiological data sets. We also propose to compare our strategy to data-driven analysis techniques. The key novel aspects of our approach are (i) dimension reduction techniques that are tailored to the domains investigated; (ii) the building of interpretable models that can be enriched by epidemiological knowledge; (iii) accounting for uncertainty in each class of models implemented.
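
As a rough illustration of what stochastic search for variable selection inside a regression model can look like, the sketch below runs a single-chain Metropolis sampler over inclusion indicators for a linear model. It is not the project's algorithm and not ESS++: the Zellner g-prior, the choice g = n, the prior inclusion probability of 0.1, the simulated data and the function names `log_bayes_factor` and `stochastic_search` are all assumptions made here for illustration.

```python
import numpy as np


def log_bayes_factor(X, y, gamma, g):
    """Log Bayes factor of the model selected by `gamma` against the null
    model, under a Zellner g-prior (X and y are assumed to be centred)."""
    n = len(y)
    k = int(gamma.sum())
    if k == 0:
        return 0.0
    Xg = X[:, gamma]
    beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    resid = y - Xg @ beta
    r2 = 1.0 - (resid @ resid) / (y @ y)
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))


def stochastic_search(X, y, n_iter=5000, g=None, prior_incl=0.1, seed=0):
    """Metropolis search over inclusion indicators; returns the fraction of
    iterations in which each covariate was included (no burn-in discarded)."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    g = float(n) if g is None else g  # assumed default: unit-information prior
    log_prior_odds = np.log(prior_incl) - np.log1p(-prior_incl)

    gamma = np.zeros(p, dtype=bool)  # start from the empty model
    current = log_bayes_factor(X, y, gamma, g)
    counts = np.zeros(p)
    for _ in range(n_iter):
        j = rng.integers(p)  # propose flipping one indicator (symmetric move)
        proposal = gamma.copy()
        proposal[j] = not proposal[j]
        candidate = log_bayes_factor(X, y, proposal, g)
        prior_change = log_prior_odds if proposal[j] else -log_prior_odds
        if np.log(rng.random()) < candidate - current + prior_change:
            gamma, current = proposal, candidate
        counts += gamma
    return counts / n_iter


# Simulated example (hypothetical data): 20 covariates, two with real effects.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)
pip = stochastic_search(X, y)
print(np.round(pip, 2))
```

The returned vector estimates marginal posterior inclusion probabilities, so uncertainty about which covariates matter is reported rather than a single selected model. Searching for interactions in the same way would mean adding product columns to `X`, which enlarges the model space and is where more elaborate search schemes become necessary.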


Description Project grant
Amount £123,000 (GBP)
Organisation National Institute of Health and Medical Research (INSERM) 
Sector Academic/University
Country France
Title ESS++ 
Description A stochastic algorithm performing variable selection in the linear model for very large dimensional space (e.g. large genomics data sets). 
Type Of Material Data analysis technique 
Year Produced 2011 
Provided To Others? Yes  
Impact We have been able to analyse GWAS studies in a fully multivariate fashion. 
Title Profile regression 
Description Matlab clustering models 
Type Of Material Data analysis technique 
Year Produced 2009 
Provided To Others? Yes  
Impact The method will be disseminated in the statistical community 
Description Profile analysis of ICARE study 
Organisation National Institute of Health and Medical Research (INSERM)
Department UMRS 1018 (Centre for Research in Epidemiology and Population Health (CESP))
Country France 
Sector Public 
PI Contribution We are applying our developed methodology to French lung cancer case-control data.
Collaborator Contribution It has provided a very nice case study to demonstrate the utility of our method for epidemiological analysis
Impact Some of the results have been presented at an international statistical conference. A paper on this work has also recently been published.
Start Year 2010