Methods and tools for structural models integrating multiple high-throughput omics data sets in genetic epidemiology

Lead Research Organisation: Brunel University London
Department Name: Mathematics

Abstract

In recent years, new methods for biological measurements using sophisticated technologies have enabled the simultaneous measurement of thousands of potential molecular biomarkers of disease. These biomarkers range from genetic variants, which are fixed for each person throughout their life, through gene products such as proteins, which are produced dynamically and vary with time and across different cells in the body, to small molecule metabolites, which more closely reflect the processes involved in both normal functioning and disease development. Many biomarkers reflect environmental exposures, including lifestyle, occupational and dietary factors, and thus serve to study in a comprehensive way the interaction between genes and environment in relation to disease outcomes.

The complexity and size of these data sets make them difficult to analyse. Because of the limitations of traditional multivariate statistical methods, the majority of existing analyses rely on univariate methods, which consider each type of biomarker, and in fact each particular molecule, separately. This means that important information on how different molecules co-vary is lost. Producing robust statistical tools capable of analysing these large-scale data sets coherently is important to ensure the best exploitation of these expensive data.
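The loss of information under univariate analysis can be illustrated with a minimal simulated sketch. It is not part of the project's software; the function names, sample size and effect sizes below are assumptions chosen purely for illustration. Two correlated biomarkers are simulated, of which only the first affects the outcome. Analysed one at a time, both appear associated; analysed jointly, the second biomarker's coefficient shrinks towards zero.

```python
import numpy as np


def simulate_biomarkers(n=5000, seed=0):
    """Simulate two correlated biomarkers; only the first drives the outcome.

    All numerical settings here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    # x2 has unit variance and correlation 0.8 with x1, but no direct effect on y.
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
    y = 1.0 * x1 + rng.normal(size=n)
    return np.column_stack([x1, x2]), y


def marginal_slopes(X, y):
    """Univariate analysis: regress y on each column of X separately."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)


def joint_slopes(X, y):
    """Multivariate analysis: regress y on all columns of X together."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return coef


if __name__ == "__main__":
    X, y = simulate_biomarkers()
    print("one biomarker at a time:", marginal_slopes(X, y))
    print("both biomarkers jointly:", joint_slopes(X, y))
```

In this toy setting the univariate slope for the second biomarker reflects only its correlation with the first, which is the kind of structure a joint model is able to separate.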

In our project we propose to use structural equation models, which are able to model the relations between several different types of biomarkers and disease pathways in a single model. Traditionally these models have either been used on very small data sets, numbering tens of variables, or on sets of hundreds of variables but all of the same type. We propose to develop structural models which are capable of analysing multiple high-dimensional biomarker data sets together, thus enabling these models to be used on modern epidemiological data sets. The project will take advantage of our recent work in high-dimensional statistical modelling of pairs of molecular biomarker data sets, and extend our advances to the more complex structural models for analysing several biomarker sets together.

The methods will be developed with reference to case studies from the North Finnish Birth Cohort, whose Principal Investigator is a co-Investigator on this project. We will also benefit from collaborations with the Airwave Health Monitoring Study and the European Prospective Investigation into Cancer and Nutrition cohort, both of which are hosted at Imperial. The project requires extensive interdisciplinary work, combining expertise in statistics, epidemiology, genetics and computation. In view of their complementary skills and access to databases, the team of investigators is uniquely placed to successfully achieve these objectives.

Technical Summary

In recent years, new methods for biological measurements using sophisticated technologies (genomics, epigenomics, transcriptomics, proteomics, metabolomics) have enabled the simultaneous measurement of thousands of potential molecular biomarkers of disease. Many biomarkers reflect environmental exposures, including lifestyle, occupational and dietary factors, and thus serve to study in a comprehensive way the interaction between genes and environment in relation to disease outcomes.

Structural equation models (SEM) provide an ideal framework for extending current individual analyses of "omics" data sets to deal with integration of multiple omics data sets with clinical outcomes and environmental and lifestyle risk factors. Our aim is to combine SEM with Bayesian variable selection priors, in order to select small sets of biomarkers most relevant to the disease pathways of interest.
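For orientation, one generic way of writing such a model is sketched below. The notation is an assumption made for illustration and is not taken from the project: y_{ik} is variable k for individual i, pa(k) denotes the variables that the structural model places upstream of k, x_{il} are candidate biomarkers, and the spike-and-slab prior with inclusion indicators gamma_{lk} stands in for whichever variable selection prior is actually used.

```latex
% Illustrative notation only; all symbols are assumptions, not the project's model.
\begin{align*}
  y_{ik} &= \sum_{j \in \mathrm{pa}(k)} \lambda_{jk}\, y_{ij}
            + \sum_{l=1}^{p} \beta_{lk}\, x_{il} + \varepsilon_{ik},
         & \varepsilon_{ik} &\sim \mathcal{N}(0, \sigma_k^2), \\
  \beta_{lk} \mid \gamma_{lk} &\sim \gamma_{lk}\, \mathcal{N}(0, \tau^2)
            + (1 - \gamma_{lk})\, \delta_0,
         & \gamma_{lk} \mid \omega &\sim \mathrm{Bernoulli}(\omega).
\end{align*}
```

Under a prior of this kind, the posterior probability that gamma_{lk} = 1 summarises the evidence that biomarker l is relevant to variable k, and a small prior inclusion probability omega favours small selected sets.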

The implementation of the models will build on our existing multivariate regression software, hence taking advantage of the years of work that have produced the ability to analyse very high-dimensional data, whilst robustly accounting for the multidimensional aspects of these data.

The methods will be developed with reference to case studies from the North Finnish Birth Cohort, whose Principal Investigator is a co-Investigator on this project. We will also benefit from collaborations with the Airwave Health Monitoring Study and the European Prospective Investigation into Cancer and Nutrition cohort, both of which are hosted at Imperial.

Planned Impact

Epidemiology and medicine are currently taking advantage of high-throughput technologies to produce vast amounts of complex, interrelated data. Cohort studies enable life-course analysis of environmental, lifestyle, occupational and dietary effects on normal biological functioning and on disease development. The measurements of thousands of molecular biomarkers at multiple life stages should lead to new discoveries in disease mechanisms, and crucially, the effect of alterable environmental risk factors. This has the potential to impact the nation's health through the improved understanding of mechanisms of disease development and normal biological function.

Our project will impact on the fields of epidemiology and medicine by allowing large-scale data integration, combining expert knowledge with powerful model selection methods. The input from the biologist co-investigators in the software design process will be vital in making it accessible to users in epidemiology, sociology and medicine.

One of the main goals of genetic studies of complex traits is to flag pathways relevant to disease that could reveal novel therapeutic targets. With the proposed new methodology and accompanying software we will be able to shed light on the biological mechanisms by which genetic variation could influence final phenotypes. Moreover, we will uncover the identity of the gene(s) affected by the susceptibility variant(s) at each genetic region, together with functional mechanisms that are informative for identifying disease pathways and can have a large impact on disease prevention.

An important way in which our research impacts the wider community is through the use of the software by epidemiologists and data analysts working in the public, private and voluntary sectors. Hence a crucial pathway to impact is the provision of relevant, accurate and user-friendly software. The use of the tools will be greatly increased by making available documentation that is clear and written in a manner directed at the users.

There is a lack of software tools for data integration which can work for the very large data sets currently being collected. Therefore our work can have a significant impact as one of the first available tools for large-scale data integration of multiple biomarker data sets with environmental risk factors and disease outcomes.
 
Title Bayes SEM with variable selection 
Description New Bayesian structural equation model with variable selection on covariates and covariance matrix. 
Type Of Material Computer model/algorithm 
Year Produced 2018 
Provided To Others? No  
Impact Currently in use for life-course epidemiological analysis of cohort data in healthy ageing project. Software will be made available in the future. 
 
Description Collaboration with University of Oslo 
Organisation University of Oslo
Department Institute of Basic Medical Sciences
Country Norway 
Sector Academic/University 
PI Contribution Methods development, software development. Hosted visiting doctoral student from University of Oslo for 6 month visit.
Collaborator Contribution Methods development, software development. Provision of large-scale cancer data set for case study.
Impact 1 paper submitted to Journal of Statistical Software; 1 manuscript in preparation
Start Year 2018
 
Description DynaHealth 
Organisation Imperial College London
Department Department of Epidemiology and Biostatistics
Country United Kingdom 
Sector Academic/University 
PI Contribution Statistical advice, models and analysis for life-course epidemiology, genetic epidemiology.
Collaborator Contribution Consortium of epidemiological cohorts. Epidemiological and genetic models, data collection and analysis.
Impact (1) Felix, JF et al. (2016), Genome-wide association analysis identifies three new susceptibility loci for childhood body mass index. Human Molecular Genetics, 25 (2). pp. 389 - 403. doi: 10.1093/hmg/ddv472 (2) Van der Valk et al. (2015), A novel common variant in DCST2 is associated with length in early life and height in adulthood. Human Molecular Genetics, 24 (4). pp. 1155 - 68.
Start Year 2015
 
Description Mendelian Randomization 
Organisation Imperial College London
Country United Kingdom 
Sector Academic/University 
PI Contribution Methods development, software development. Development of multivariate methods and software suitable for extensions to Mendelian Randomization. This collaboration grew directly out of the methodological work done on the MRC Methods grant.
Collaborator Contribution Methods development, software development. Expertise in Mendelian Randomization.
Impact Manuscript in progress
Start Year 2019
 
Description Mendelian Randomization 
Organisation University of Cambridge
Department MRC Biostatistics Unit
Country United Kingdom 
Sector Academic/University 
PI Contribution Methods development, software development. Development of multivariate methods and software suitable for extensions to Mendelian Randomization. This collaboration grew directly out of the methodological work done on the MRC Methods grant.
Collaborator Contribution Methods development, software development. Expertise in Mendelian Randomization.
Impact Manuscript in progress
Start Year 2019
 
Title Bayes SUR 
Description R package for Bayesian multivariate regression models with sparse variable selection and sparse covariance selection. Incorporates a range of options for priors for variable selection and covariance selection. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Currently being used by three different collaborating groups. 
URL https://cran.r-project.org/web/packages/BayesSUR/index.html
 
Title Bayesian Structural Equation Model Regression Software 
Description Software fits a Bayesian Structural Equation Model with variable selection. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Collaborating with DynaHealth project, working with colleagues at Imperial College London on applying software in life-course epidemiology application. 
URL https://www.dynahealth.eu/software