Bayesian Discovery of Regression Structures: a tool kit for genetic epidemiology and integrative genomics analyses

Lead Research Organisation: Imperial College London
Department Name: School of Public Health


To better understand multifactorial diseases such as cancer, researchers are using new biotechnologies that probe the genetic code and measure a range of biological mechanisms that are fundamental to human health. Of particular interest is the use of these data to discover potential associations between mutations in the genetic code and perturbations of biological processes that can lead to disease; such perturbations may also result from exposure to lifestyle and environmental risk factors. These new biotechnologies produce vast amounts of information, and the sheer size of these data renders their analysis difficult. Statisticians are consequently faced with the difficult task of finding, amongst a vast array of possibilities, the specific combinations of genetic and environmental factors that are related to disease status. Proposing better statistical tools for this task is important so that the expensive data being collected in many cohort studies are exploited to their full potential. This project proposes to develop improved techniques for discovering structures in the data, with a focus on multidimensional and multivariate aspects of large genetic and genomics data sets. The project will take advantage of recent work by the team to create a novel framework for such analyses and deliver a set of techniques, a "tool kit", to analyse such data. Taking a comprehensive multivariate point of view to analyse large-scale data requires the development of sophisticated algorithms, which will be one key aspect of the research programme. To ensure feasibility, recent advances in computing with graphics cards will be exploited. The methods developed will be implemented in an open-source environment so that they can be easily adapted to a wide range of questions.
In the project, these tools will be applied to three specific case studies: (i) an analysis of the regulation of lipid mechanisms in a large Finnish cohort; (ii) an analysis of multifactorial pathways in breast cancer; and (iii) an analysis of the genetic influence on brain activity of psychotic patients. These case studies will serve as vignettes for the methods, showcasing their applicability and helping their dissemination. As demonstrated by the range of applications, the proposal requires extensive interdisciplinary work, combining expertise in statistical modelling, computing, genetics and epidemiology. In view of their complementary skills and access to rich databases, the team of investigators at Imperial College is uniquely placed to achieve these objectives.

Technical Summary

Finding important health determinants through regression analyses is a fundamental approach in all the health sciences. In this project, we focus on two domains, genetic epidemiology and integrative genomics, where advances are made by taking full advantage of new high-throughput technologies, leading to the collection of a vast set of explanatory variables. In these domains, desirable statistical outputs are reproducible regression models that select only a few relevant predictors (e.g. risk factors, SNPs, transcripts) amongst a very large set of possible candidates, together with a sound assessment of the uncertainty about their role. Our approach is to build upon the unifying Bayesian hierarchical modelling paradigm to construct parsimonious regression models that can reflect the underlying biology and facilitate the interpretation of results. The team of investigators has recently completed the development of a sophisticated algorithm, the Evolutionary Stochastic Search algorithm, which efficiently implements a Bayesian variable selection procedure for linear regression models in spaces containing thousands of predictors. The project's aim is to capitalise on this foundation work and substantially expand it to build a powerful and versatile tool kit of regression models applicable to a wide range of "cross-omics" analyses, i.e. analyses that involve two or more different types of "omics" data, each of large dimensions. Such cross-analyses will become a major focus of research in functional genomics in the years to come, in parallel with the advent of new biotechnologies.

We will develop a set of models aimed at Bayesian variable selection (i) in the presence of interactions; (ii) with multiple responses; (iii) including biologically structured prior knowledge. The scope of the algorithms will be considerably expanded by integrating new parallel computing techniques and novel software architecture (CUDA, Compute Unified Device Architecture) that enormously reduce computing time. We will use the methods to discover new associations and structures in three challenging case studies concerning: (a) the genetic regulation of lipid mechanisms in a large Finnish cohort; (b) multifactorial pathways in breast cancer; and (c) the genetic influence on brain activation of psychotic patients. These case studies are embedded in large collaborative projects coordinated by the epidemiology investigators and have been chosen to highlight different facets of the tool kit modules. The computer programmes implemented in the tool kit will be open source and made publicly available. Dissemination plans will benefit from the extensive network of collaborators in the case studies and also include two purposely designed workshops.


Gill D (2017) Age at menarche and lung function: a Mendelian randomization study. European Journal of Epidemiology.