Novel statistical methods for transcriptomic imputation to enhance understanding of causal mechanisms underlying human diseases

Lead Research Organisation: University of Manchester
Department Name: School of Biological Sciences

Abstract

Genome-wide association studies (GWAS) have been successful in identifying chromosomal regions (loci) containing genetic variants that contribute to many complex human traits and common diseases, including those that impose a major public health burden, such as cancers, diabetes and arthritis. Association signals for many complex traits predominantly localise to regions that influence disease by modulating gene expression (i.e. the process by which DNA is converted into a functional gene product). Gene expression profiles (referred to as the transcriptome) may vary across tissues and cell types. However, studies of the relationships between gene expression and complex traits have been restricted to small samples because of the cost and limited availability of relevant tissues. Consequently, there has been limited progress in identifying the causal genes in GWAS regions and in understanding the biological processes through which genetic variants affect disease pathophysiology, which has hindered the translation of these findings into the clinic through targeted drug development.

One increasingly utilised approach to understand molecular pathways underlying human disease is through integrated analysis of genetic variation and transcriptomic data resources from large-scale tissue-based molecular profiling initiatives. For example, the Genotype-Tissue Expression Project has generated high-density genome-wide genotyping and gene expression across a wide range of tissues, and has made these data publicly available. One primary finding of these investigations has been the identification of expression quantitative trait loci (eQTL) that link genetic variation to the regulation of gene expression in diverse tissues. Methods have thus been developed that aim to detect association of complex traits with gene expression by: (i) building tissue-specific multi-eQTL models in these molecular profiling resources; and (ii) using these models to predict (or "impute") the transcriptome into GWAS data (based on individual-level genotypes or association summary statistics). However, existing transcriptome imputation methods typically: (i) consider each cell type separately, and do not take advantage of the observed correlations in gene expression between cell types driven by cross-tissue eQTLs; and/or (ii) do not account for eQTL model uncertainty (i.e. many different genetic variants may regulate gene expression), resulting in potential for false positive findings.
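The two-step imputation workflow described above can be illustrated with a minimal sketch on simulated data. This is not the method proposed here or any published tool: the ridge penalty, the function names (`fit_eqtl_weights`, `impute_expression`, `run_demo`), the sample sizes and the effect sizes are all assumptions chosen for illustration, and only the individual-level genotype case is shown.

```python
import numpy as np


def fit_eqtl_weights(ref_genotypes, ref_expression, ridge_penalty=1.0):
    """Step (i): fit a multi-SNP expression model in a reference panel.

    Uses ridge regression as a simple stand-in for a multi-eQTL model
    (an assumption made for this sketch).
    """
    snp_means = ref_genotypes.mean(axis=0)
    X = ref_genotypes - snp_means
    y = ref_expression - ref_expression.mean()
    penalty = ridge_penalty * np.eye(X.shape[1])
    weights = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    return snp_means, weights


def impute_expression(genotypes, snp_means, weights):
    """Step (ii): predict genetically regulated expression from genotypes."""
    return (genotypes - snp_means) @ weights


def run_demo(seed=0):
    """Simulate a reference panel and a GWAS cohort, then impute and test."""
    rng = np.random.default_rng(seed)
    n_ref, n_gwas, n_snps = 300, 2000, 20  # hypothetical sizes
    true_effects = np.zeros(n_snps)
    true_effects[[2, 7, 11]] = [0.8, -0.6, 0.5]  # three simulated eQTLs

    def simulate(n):
        genotypes = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)
        expression = genotypes @ true_effects + rng.normal(0.0, 0.5, size=n)
        return genotypes, expression

    ref_genotypes, ref_expression = simulate(n_ref)
    gwas_genotypes, gwas_expression = simulate(n_gwas)
    # The simulated trait depends on expression, which is unobserved in a real GWAS.
    trait = 0.5 * gwas_expression + rng.normal(0.0, 1.0, size=n_gwas)

    snp_means, weights = fit_eqtl_weights(ref_genotypes, ref_expression)
    imputed = impute_expression(gwas_genotypes, snp_means, weights)
    return {
        "weights": weights,
        "imputation_r": np.corrcoef(imputed, gwas_expression)[0, 1],
        "trait_r": np.corrcoef(imputed, trait)[0, 1],
    }
```

Calling `run_demo()` returns the fitted SNP weights together with the correlation between imputed and simulated expression and the correlation between imputed expression and the trait. A single-tissue model of this kind ignores both cross-tissue correlation and uncertainty about which variants are eQTLs, which are the two limitations noted above.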

The aim of this proposal is to develop novel statistical methods for transcriptomic imputation into GWAS to address these limitations by: (i) harnessing multi-tissue expression to build eQTL models that better predict gene expression than those that consider each cell type separately; and (ii) using computationally efficient Bayesian statistical methods that appropriately allow for uncertainty in the eQTL model, reducing the potential for "over-fitting". The methodology will be implemented in user-friendly software that will be made freely available to the wider research community. The methodology and software will be utilised to create a repository of multi-tissue gene expression imputed into 500,000 participants from the UK Biobank for whom GWAS data are already available. These imputed transcriptomic profiles will be tested for association with rheumatoid arthritis and other musculoskeletal diseases, cardiovascular disease, cancer and diabetes, revealing novel causal genes and improving understanding of the molecular mechanisms and relevant cell types underlying disease biology. The repository will also be returned to UK Biobank for archiving and distribution to approved researchers to identify causal genes for any trait of interest available in the resource. These analyses will have enhanced potential for translation of GWAS findings by identifying drug targets for up- or down-regulation of causal genes for which expression is associated with risk of disease.

Technical Summary

Integrated analysis of genetic variation and transcriptomic data resources from large-scale tissue-based molecular profiling initiatives has improved understanding of the biological mechanisms controlling the regulation of gene expression across diverse tissues through the identification of expression quantitative trait loci (eQTL). Methods have been developed to detect association of complex traits with gene expression by: (i) building tissue-specific multi-eQTL models in these molecular profiling resources; and (ii) using these models to "impute" the transcriptome into genome-wide association studies (GWAS). However, existing transcriptome imputation methods typically: (i) do not take advantage of the observed correlations in expression between cell types driven by cross-tissue eQTLs; and/or (ii) do not account for eQTL model uncertainty, resulting in potential for over-fitting and increased false positive error rates. The overall aim of this proposal is to develop novel statistical methodologies for transcriptomic imputation into GWAS to address these limitations by harnessing multi-tissue expression to detect causal genes for complex human traits in a Bayesian Markov chain Monte Carlo framework to allow for model uncertainty. The methodology will be implemented in user-friendly and computationally efficient software, which will be applicable to individual-level genotype data or association summary statistics. The methodology will be applied to create a repository of imputed cross-tissue transcriptomic profiles across genes for all participants in the UK Biobank, which will be tested for association with rheumatoid arthritis and other musculoskeletal diseases, cardiovascular disease, cancer and diabetes. The repository will also be returned to UK Biobank for archiving and distribution to approved researchers to identify causal genes for any trait of interest available in the resource.
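The idea of allowing for eQTL model uncertainty can be illustrated with a small Bayesian model averaging sketch. It is not the proposed methodology: in place of Markov chain Monte Carlo, it enumerates every model containing at most two SNPs and approximates posterior model probabilities with the Bayesian information criterion under a uniform model prior. The function names (`bma_eqtl_weights`, `demo_bma`) and all simulation settings are hypothetical.

```python
import itertools

import numpy as np


def bma_eqtl_weights(genotypes, expression, max_model_size=2):
    """Average SNP weights over candidate eQTL models.

    Posterior model probabilities are approximated as proportional to
    exp(-BIC / 2), assuming a uniform prior over the enumerated models.
    """
    n, p = genotypes.shape
    X = genotypes - genotypes.mean(axis=0)
    y = expression - expression.mean()

    models, bics, coefs = [], [], []
    for k in range(max_model_size + 1):
        for subset in itertools.combinations(range(p), k):
            beta = np.zeros(p)
            if k > 0:
                cols = list(subset)
                # Ordinary least squares restricted to the SNPs in this model.
                beta[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = float(np.sum((y - X @ beta) ** 2))
            bics.append(n * np.log(rss / n) + k * np.log(n))
            models.append(subset)
            coefs.append(beta)

    bics = np.array(bics)
    posterior = np.exp(-0.5 * (bics - bics.min()))  # shift for numerical stability
    posterior /= posterior.sum()

    averaged_weights = posterior @ np.array(coefs)
    inclusion = np.array(
        [sum(posterior[i] for i, m in enumerate(models) if j in m) for j in range(p)]
    )
    return averaged_weights, inclusion, posterior


def demo_bma(seed=1):
    """Simulate six SNPs, of which only SNP 2 regulates expression."""
    rng = np.random.default_rng(seed)
    genotypes = rng.binomial(2, 0.3, size=(300, 6)).astype(float)
    expression = 0.7 * genotypes[:, 2] + rng.normal(0.0, 0.7, size=300)
    return bma_eqtl_weights(genotypes, expression)
```

`demo_bma()` returns the model-averaged SNP weights, the posterior inclusion probability of each SNP and the posterior probability of each enumerated model. Predictions made with the averaged weights spread support across plausible models rather than committing to a single best-fitting one. Enumeration is feasible only for a handful of SNPs, which is why sampling-based approaches such as MCMC are used for realistic numbers of variants.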
