Methods and tools for structural models integrating multiple high-throughput omics data sets in genetic epidemiology

Lead Research Organisation: Brunel University London

Department Name: Mathematics

Abstract

In recent years, new methods for biological measurements using sophisticated technologies have enabled the simultaneous measurement of thousands of potential molecular biomarkers of disease. These biomarkers range from genetic variants, which are fixed for each person throughout their life, through gene products such as proteins, which are produced dynamically and vary with time and across different cells in the body, to small-molecule metabolites, which more closely reflect the processes involved in both normal functioning and disease development. Many biomarkers reflect environmental exposures, including lifestyle, occupational and dietary factors, and thus allow a comprehensive study of the interaction between genes and environment in relation to disease outcomes.

The complexity and size of these data sets make their analysis difficult. Limitations of traditional multivariate statistical methods have meant that the majority of existing analyses rely on univariate methods, which consider each type of biomarker, and in fact each particular molecule, separately. This means that important information on how different molecules co-vary is lost. Producing robust statistical tools capable of analysing these large-scale data sets coherently is important to ensure the best exploitation of these expensive data.

In our project we propose to use structural equation models, which are able to model the relations between several different types of biomarkers and disease pathways in a single model. Traditionally these models have either been used on very small data sets, numbering tens of variables, or on sets of hundreds of variables but all of the same type. We propose to develop structural models which are capable of analysing multiple high-dimensional biomarker data sets together, thus enabling these models to be used on modern epidemiological data sets. The project will take advantage of our recent work in high-dimensional statistical modelling of pairs of molecular biomarker data sets, and extend our advances to the more complex structural models for analysing several biomarker sets together.

The methods will be developed with reference to case studies from the North Finnish Birth Cohort, whose Principal Investigator is a co-Investigator on this project. We will also benefit from collaborations with the Airwave Health Monitoring Study and the European Prospective Investigation into Cancer and Nutrition cohort, both of which are hosted at Imperial. The project requires extensive interdisciplinary work, combining expertise in statistics, epidemiology, genetics and computation. In view of their complementary skills and access to databases, the team of investigators is uniquely placed to successfully achieve these objectives.

Technical Summary

In recent years, new methods for biological measurements using sophisticated technologies (genomics, epigenomics, transcriptomics, proteomics, metabolomics) have enabled the simultaneous measurement of thousands of potential molecular biomarkers of disease. Many biomarkers reflect environmental exposures, including lifestyle, occupational and dietary factors, and thus serve to study in a comprehensive way the interaction between genes and environment in relation to disease outcomes.

Structural equation models (SEM) provide an ideal framework for extending current individual analyses of "omics" data sets to deal with integration of multiple omics data sets with clinical outcomes and environmental and lifestyle risk factors. Our aim is to combine SEM with Bayesian variable selection priors, in order to select small sets of biomarkers most relevant to the disease pathways of interest.
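
As a rough illustration of what combining an SEM with a variable selection prior can look like, the sketch below writes a generic path-model SEM with a spike-and-slab prior on the covariate effects. It is a textbook-style form given under stated assumptions, not the project's actual specification: the symbols are all introduced here for illustration. Here y_i is a q-vector of outcomes and intermediate biomarkers for individual i, x_i is a p-vector of candidate predictors, B is a q x q matrix with zero diagonal encoding directed paths among the y variables, Gamma is the q x p matrix of covariate effects, gamma_jk is a binary inclusion indicator, and delta_0 is a point mass at zero.

```latex
% Generic path-model SEM with a spike-and-slab selection prior.
% All symbols are illustrative assumptions, not the project's specification.
\begin{align}
  y_i &= B\,y_i + \Gamma\,x_i + \varepsilon_i,
      \qquad \varepsilon_i \sim \mathcal{N}(0, \Psi), \quad i = 1,\dots,n, \\
  \Gamma_{jk} \mid \gamma_{jk} &\sim \gamma_{jk}\,\mathcal{N}(0, \tau^2)
      + (1 - \gamma_{jk})\,\delta_0, \\
  \gamma_{jk} \mid \omega &\sim \mathrm{Bernoulli}(\omega),
      \qquad \omega \sim \mathrm{Beta}(a, b).
\end{align}
```

Under a prior of this kind, the posterior probability that gamma_jk = 1 ranks predictor k for response j, which is one way a small set of relevant biomarkers could be selected.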

The implementation of the models will build on our existing multivariate regression software, taking advantage of the years of work that have produced the ability to analyse very high-dimensional data whilst robustly accounting for their multi-dimensional aspects.

The methods will be developed with reference to case studies from the North Finnish Birth Cohort, whose Principal Investigator is a co-Investigator on this project. We will also benefit from collaborations with the Airwave Health Monitoring Study and the European Prospective Investigation into Cancer and Nutrition cohort, both of which are hosted at Imperial.

Planned Impact

Epidemiology and medicine are currently taking advantage of high-throughput technologies to produce vast amounts of complex, interrelated data. Cohort studies enable life-course analysis of environmental, lifestyle, occupational and dietary effects on normal biological functioning and on disease development. The measurements of thousands of molecular biomarkers at multiple life stages should lead to new discoveries in disease mechanisms, and crucially, the effect of alterable environmental risk factors. This has the potential to impact the nation's health through the improved understanding of mechanisms of disease development and normal biological function.

Our project will impact on the fields of epidemiology and medicine by allowing large-scale data integration, combining expert knowledge with powerful model selection methods. The input from the biologist co-investigators in the software design process will be vital in making it accessible to users in epidemiology, sociology and medicine.

One of the main goals of genetic studies of complex traits is to flag pathways relevant to disease that could reveal novel therapeutic targets. With the proposed new methodology and accompanying software we will be able to shed light on the biological mechanisms by which genetic variation could influence final phenotypes. Moreover, we aim to uncover the identity of the gene(s) affected by the susceptibility variant(s) at each genetic region, and to identify informative functional mechanisms in disease pathways that can have a large impact on disease prevention.

An important way in which our research impacts the wider community is through the use of the software by epidemiologists and data analysts working in the public, private and voluntary sectors. Hence a crucial pathway to impact is the provision of relevant, accurate and user-friendly software. Uptake of the tools will be greatly increased by providing clear documentation written for these users.

There is a lack of software tools for data integration which can work for the very large data sets currently being collected. Therefore our work can have a significant impact as one of the first available tools for large-scale data integration of multiple biomarker data sets with environmental risk factors and disease outcomes.

Funded Value:

£381,453

Funded Period:

Jan 16 - May 18

Funder:

MRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

MR/M013138/1

Principal Investigator:

Alexandra Lewin

Health Category:

Unclassified

Organisations

People	ORCID iD
Alexandra Lewin (Principal Investigator)
Leonardo Bottolo (Co-Investigator)
Sylvia Richardson (Co-Investigator)
Marjo-Riitta Jarvelin (Co-Investigator)

Publications

Auvinen J (2021) Systematic evaluation of the association between hemoglobin levels and metabolic profile implicates beneficial effects of hypoxia. in Science advances

Wong AYS (2020) The association between partner bereavement and melanoma: cohort studies in the U.K. and Denmark. in The British journal of dermatology

Kanya L (2019) The criterion validity of willingness to pay methods: A systematic review and meta-analysis of the evidence. in Social science & medicine (1982)

Anokye N (2018) The effectiveness and cost-effectiveness of a complex community sport intervention to increase physical activity: an interrupted time series design. in BMJ open

Chen J (2021) The trans-ancestral genomic architecture of glycemic traits. in Nature genetics

Lowry E (2019) Understanding the complexity of glycaemic health: systematic bio-psychosocial modelling of fasting glucose in middle-age adults; a DynaHEALTH study. in International journal of obesity (2005)

Parmar P (2020) Understanding the cumulative risk of maternal prenatal biopsychosocial factors on birth weight: a DynaHEALTH study on two birth cohorts. in Journal of epidemiology and community health

Research Databases and Models
Collaboration
Software and Technical Products


Title	Bayes SEM with variable selection
Description	New Bayesian structural equation model with variable selection on covariates and covariance matrix.
Type Of Material	Computer model/algorithm
Year Produced	2018
Provided To Others?	No
Impact	Currently in use for life-course epidemiological analysis of cohort data in a healthy ageing project. Software will be made available in the future.


Description	Collaboration with University of Oslo
Organisation	University of Oslo
Department	Institute of Basic Medical Sciences
Country	Norway
Sector	Academic/University
PI Contribution	Methods development, software development. Hosted a visiting doctoral student from the University of Oslo for a 6-month visit.
Collaborator Contribution	Methods development, software development. Provision of large-scale cancer data set for case study.
Impact	1 paper submitted to Journal of Statistical Software; 1 manuscript in preparation.
Start Year	2018


Description	DynaHealth
Organisation	Imperial College London
Department	Department of Epidemiology and Biostatistics
Country	United Kingdom
Sector	Academic/University
PI Contribution	Statistical advice, models and analysis for life-course epidemiology, genetic epidemiology.
Collaborator Contribution	Consortium of epidemiological cohorts. Epidemiological and genetic models, data collection and analysis.
Impact	Felix, JF et al. (2016), Genome-wide association analysis identifies three new susceptibility loci for childhood body mass index. Human Molecular Genetics, 25(2), pp. 389-403. doi: 10.1093/hmg/ddv472; Van der Valk et al. (2015), A novel common variant in DCST2 is associated with length in early life and height in adulthood. Human Molecular Genetics, 24(4), pp. 1155-68.
Start Year	2015


Description	Mendelian Randomization
Organisation	Imperial College London
Country	United Kingdom
Sector	Academic/University
PI Contribution	Methods development, software development. Development of multivariate methods and software suitable for extensions to Mendelian Randomization. This collaboration grew directly out of the methodological work done on the MRC Methods grant.
Collaborator Contribution	Methods development, software development. Expertise in Mendelian Randomization.
Impact	Manuscript in progress
Start Year	2019


Description	Mendelian Randomization
Organisation	University of Cambridge
Department	MRC Biostatistics Unit
Country	United Kingdom
Sector	Academic/University
PI Contribution	Methods development, software development. Development of multivariate methods and software suitable for extensions to Mendelian Randomization. This collaboration grew directly out of the methodological work done on the MRC Methods grant.
Collaborator Contribution	Methods development, software development. Expertise in Mendelian Randomization.
Impact	Manuscript in progress
Start Year	2019


Title	Bayes SUR
Description	R package for Bayesian multivariate regression models with sparse variable selection and sparse covariance selection. Incorporates a range of options for priors for variable selection and covariance selection.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	Currently being used by three different collaborating groups.
URL	https://cran.r-project.org/web/packages/BayesSUR/index.html
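
To make the model class in the description above concrete, the following is a minimal Python sketch of a multivariate regression with sparse coefficients and a sparse residual covariance. It does not use the BayesSUR package or reproduce its API or its Bayesian selection priors: the function names, dimensions and parameter values are hypothetical, and plain least squares stands in for the actual estimation method. It only shows the structure such a model targets, namely a few non-zero predictor effects and residuals that are correlated across responses.

```python
import numpy as np


def simulate_sparse_multivariate_regression(n=500, p=10, seed=0):
    """Simulate Y = X B + E with sparse B and correlated residuals E.

    Hypothetical toy setup: 3 responses, p predictors, and only one
    active predictor per response.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))

    # Sparse coefficient matrix (p predictors x 3 responses).
    B = np.zeros((p, 3))
    B[0, 0] = 1.5
    B[1, 1] = -2.0
    B[2, 2] = 1.0

    # Sparse residual covariance: responses 0 and 1 are correlated,
    # response 2 is independent of both.
    Sigma = np.array([[1.0, 0.8, 0.0],
                      [0.8, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
    E = rng.multivariate_normal(np.zeros(3), Sigma, size=n)

    Y = X @ B + E
    return X, Y, B, Sigma


def fit_least_squares(X, Y):
    """Least-squares fit of all responses, plus residual covariance estimate."""
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ B_hat
    # Unbiased estimate of the residual covariance across responses.
    Sigma_hat = resid.T @ resid / (X.shape[0] - X.shape[1])
    return B_hat, Sigma_hat


X, Y, B, Sigma = simulate_sparse_multivariate_regression()
B_hat, Sigma_hat = fit_least_squares(X, Y)
print(np.round(B_hat[:3], 2))   # estimated effects of the three active predictors
print(np.round(Sigma_hat, 2))   # estimated residual covariance
```

Least squares estimates every coefficient and every covariance entry as non-zero. A sparse selection approach would instead aim to recover which entries of B and of the residual dependence structure are truly non-zero, which is the information lost when each response is analysed separately.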


Title	Bayesian Structural Equation Model Regression Software
Description	Software fits a Bayesian Structural Equation Model with variable selection.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	Collaborating with the DynaHealth project and colleagues at Imperial College London to apply the software in life-course epidemiology.
URL	https://www.dynahealth.eu/software