Scalable causal gene network inference via genetic node ordering

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute

Abstract

The aim of this proposal is to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype. To achieve this we will: (i) develop a novel statistical method for reconstructing causal gene networks based on total genetic node ordering; (ii) implement the method in unique, ultra-fast software for genome-scale causal network reconstruction; (iii) validate the method in silico using benchmark datasets from human and pig.

Genetic differences between individuals cause variation in phenotypes. This principle underpins genome-wide association studies (GWAS), which map the genetic architecture of complex traits by measuring genetic variation on a genome-wide scale across many individuals. A major challenge in GWAS is to understand the molecular mechanisms that explain the statistical association between quantitative trait loci (QTLs) and phenotypes. Because the majority of QTLs lie in non-coding genomic regions and presumably play a gene-regulatory role, it is hypothesized that genetic variation affects the status of molecular networks of interacting genes, proteins and metabolites, which collectively control physiological phenotypes. Since comprehensive, experimentally verified, cell-type-specific networks of molecular biological interactions are lacking, statistical and computational methods which reconstruct causal trait-associated networks from omics data are essential to study the impact of genetic variation on gene regulatory networks.

Causal gene networks consist of directed interactions between genes and are usually modelled as Bayesian networks, which assume that the expression level of a gene is normally distributed around a linear combination of the expression levels of its causal regulators and that no gene can affect its own expression, either directly or indirectly via an extended cycle of interactions. Current state-of-the-art algorithms for learning the structure and parameters of a Bayesian network from experimental data rely on local optimization, where a model is improved one edge at a time. Such algorithms are feasible for systems of a few hundred genes, but modern sequencing technologies measure the abundance of orders of magnitude more RNA molecules, and increased sample sizes mean that ever more of those are detected as variable across individuals. Developing a scalable method to reconstruct causal gene networks from whole-genome genotype and transcriptome data measured across many individuals is therefore an open problem of outstanding interest.
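
The model assumptions described in this paragraph can be written compactly as follows. This is a sketch in notation chosen here for illustration, not taken from the proposal: x_i is the expression level of gene i, Pa(i) its set of causal regulators (parents), beta_ij the regression weights and sigma_i^2 the residual variance.

```latex
x_i \mid x_{\mathrm{Pa}(i)} \sim \mathcal{N}\Big( \sum_{j \in \mathrm{Pa}(i)} \beta_{ij}\, x_j ,\; \sigma_i^2 \Big),
\qquad
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\big( x_i \mid x_{\mathrm{Pa}(i)} \big),
\qquad
\text{where the graph with edges } j \to i,\ j \in \mathrm{Pa}(i), \text{ contains no directed cycle.}
```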

Statistical theory permits one exception to the intractability of the large-scale causal network inference problem: if there exists a total ordering of the nodes in the network, such that the parents of any node can be found among the nodes ranked before it, then the problem reduces to a set of independent, tractable optimization problems, one for each node. In genetics, pairs of gene expression traits can be causally ordered using genotype data. This is based on the principle of Mendelian randomization, which states that because genotypes of unlinked SNPs are inherited independently, if gene A is causal for gene B, then the association between the expression of gene B and an eQTL of gene A must be mediated by (that is, vanish after conditioning on) the expression of gene A. Here we propose to use graph-theoretical concepts to derive a total causal ordering of nodes based on pairwise Mendelian randomization tests. We will then use penalized linear regression to reconstruct a sparse maximum-likelihood Bayesian causal gene network from the inferred total genetic node ordering. Preliminary results support the hypothesis that this method will lead to a dramatic reduction in computational cost, a higher model likelihood score and better biological validation, compared to current methods based on local optimization techniques.
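
One way to picture the step from pairwise orderings to a total ordering is the minimal Python sketch below. It is an illustrative heuristic, not the method developed in this project: pairwise scores (assumed here to be confidences that one gene is causal for another, for instance from pairwise tests) are visited from strongest to weakest, an edge is kept only if it does not close a directed cycle, and the resulting acyclic graph is sorted topologically. The function name `total_order_from_pairwise` and all scores are made up for the example.

```python
from collections import defaultdict, deque


def total_order_from_pairwise(nodes, scores):
    """Derive a total node ordering from pairwise causal scores.

    scores maps (a, b) to a confidence that a is causal for b (hypothetical
    input). Edges are accepted greedily by descending score unless they would
    create a directed cycle; the accepted graph is then topologically sorted.
    """
    children = defaultdict(set)

    def reaches(src, dst):
        # Depth-first search over accepted edges: can dst be reached from src?
        stack, seen = [src], set()
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            if u in seen:
                continue
            seen.add(u)
            stack.extend(children[u])
        return False

    for (a, b), _score in sorted(scores.items(), key=lambda kv: -kv[1]):
        # Adding a -> b closes a cycle exactly when a is already reachable from b.
        if a != b and not reaches(b, a):
            children[a].add(b)

    # Kahn's algorithm: repeatedly emit nodes with no remaining incoming edge.
    indegree = {n: 0 for n in nodes}
    for a in nodes:
        for b in children[a]:
            indegree[b] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(children[u]):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order


# Made-up scores for four genes; the weak edges C -> A and B -> A contradict
# stronger evidence and are dropped because they would close a cycle.
toy_scores = {
    ("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "D"): 0.6,
    ("D", "C"): 0.5, ("C", "A"): 0.3, ("B", "A"): 0.2,
}
order = total_order_from_pairwise(["A", "B", "C", "D"], toy_scores)
print(order)  # ['A', 'B', 'D', 'C']
```

In this toy case every accepted edge points from an earlier to a later node in the returned ordering, which is the property the per-node regressions rely on.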

Technical Summary

Genome-wide association studies have uncovered the genetic architecture of numerous complex traits in model organisms, crops, livestock species and humans. A major challenge now is to understand the molecular mechanisms that explain genetic associations. Because the majority of trait-associated loci lie in non-coding genomic regions, it is hypothesized that they play a gene regulatory role and that genetic variation affects the status of molecular networks of interacting genes, proteins and metabolites, which collectively control physiological phenotypes. The aim of this proposal is to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype. To achieve this we will develop a novel statistical method for reconstructing causal gene networks based on total genetic node ordering, implement the method in unique, ultra-fast software for genome-scale causal network reconstruction, and validate the method in silico using benchmark datasets from human and pig. The proposed method will be based on pairwise Mendelian randomization tests to establish the most likely causal direction between two correlated gene expression traits, graph-theoretical concepts to derive a total causal ordering of nodes based on pairwise orderings, and penalized linear regression to reconstruct a sparse maximum-likelihood Bayesian causal gene network from the inferred total genetic node ordering.
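
The final step, penalized regression given a node ordering, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the project's implementation: a plain coordinate-descent Lasso stands in for the penalized regression, the names `lasso_cd` and `parents_given_order` are hypothetical, and the five-sample expression values are invented so that C depends on A only through B. Each gene is regressed on the genes ranked before it, and predecessors with a non-zero coefficient are kept as parents.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(columns, y, lam, n_sweeps=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    columns is a list of predictor columns (each a list of n values).
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    resid = list(y)  # residual y - X b, with b = 0 initially
    scale = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            if scale[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes its own term.
            rho = sum(c * r for c, r in zip(col, resid)) / n + scale[j] * coefs[j]
            new = soft_threshold(rho, lam) / scale[j]
            delta = new - coefs[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                coefs[j] = new
    return coefs


def parents_given_order(data, order, lam):
    """Regress each node on its predecessors in the ordering; keep non-zero ones."""
    def center(v):
        mean = sum(v) / len(v)
        return [x - mean for x in v]

    centered = {name: center(values) for name, values in data.items()}
    parents = {}
    for k, node in enumerate(order):
        preds = order[:k]
        coefs = lasso_cd([centered[p] for p in preds], centered[node], lam) if preds else []
        parents[node] = {p: c for p, c in zip(preds, coefs) if c != 0.0}
    return parents


# Invented toy data: B = A + noise, C = B + noise (chain A -> B -> C).
data = {
    "A": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "B": [-1.0, -1.5, -1.0, 0.5, 3.0],
    "C": [-1.5, -0.5, -1.0, -0.5, 3.5],
}
parents = parents_given_order(data, ["A", "B", "C"], lam=0.1)
print(parents)
```

On this toy data the sparsity penalty keeps only B as a parent of C, although A and C are also correlated. Because each node's regression uses only its predecessors, the regressions are independent of each other and could be run in parallel.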

Planned Impact

This project proposes to develop a novel method and software tool to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype.

The academic impact of the project will extend well beyond the immediate professional circle of the applicants and will reach all researchers who perform systems genetics studies to understand the fundamental molecular mechanisms that connect genetic variation to phenotypic variation.

Researchers at private commercial companies in the biotechnological and pharmaceutical sectors also have a strong interest in the research described in this proposal. They often face the challenge that candidate disease genes reported by genome-wide association studies are not directly druggable. The ability to reconstruct causal gene networks to generate hypotheses on causal upstream regulators of lead candidate genes and the potential downstream side-effects of affecting them via existing or novel drugs is essential in modern drug target discovery research.

Researchers at both commercial and academic organizations will benefit from this project through the availability of a novel software tool to reconstruct causal gene networks, applicable to the size of contemporary datasets and packaged in a user-friendly toolbox that will integrate seamlessly with existing data analysis pipelines for the R and Matlab statistical computing environments.

The applicants are committed to an open access policy for all software developed during this project. Under the conditions of the GNU General Public License (GPL), anyone will be allowed to use and distribute the developed software. No active commercialisation through licensing of the software as a for-profit product is therefore planned. The applicants strongly believe that both the academic and private research sectors will benefit most from open software development. Although this is unlikely to lead to the creation of a new commercialisable product, the scientific knowledge gained from developing and benchmarking the novel software will be exploited. The Roslin Institute is committed to knowledge exchange, and commercial companies can benefit from the knowledge gained in this project through consultancy agreements with the applicants. The PI, with support from Edinburgh Research and Innovation, has already entered such an agreement with the SME Clinical Gene Networks AB (CGN), to oversee the reconstruction of gene networks surrounding identified genomic risk loci for cardiovascular disease.

An important impact of this project will concern the training of a highly skilled postdoctoral research associate for academic or non-academic professions alike. There is currently a great demand for computational scientists to assist in the analysis of "big data" in academic and non-academic life science organisations, but few computational scientists possess the necessary experience of working with molecular biological data. Through working on this project and performing the benchmark analyses on human and pig test datasets, the postdoctoral research associate will be trained in biological data analysis and at the end of the project will be well prepared for a cross-disciplinary research career.

Publications

Description Understanding how genetic variation between individuals determines variation in observable traits or disease risk is one of the core aims of genetics. It is known that genetic variation often affects gene regulatory DNA elements and directly causes variation in expression of nearby genes. This effect in turn cascades down to other genes via the complex pathways and gene interaction networks that ultimately govern how cells operate in an ever-changing environment. In theory, when genetic variation and gene expression levels are measured simultaneously in a large number of individuals, the causal effects of genes on each other can be inferred using statistical models similar to those used in randomized controlled trials. We developed a novel method and ultra-fast software, Findr, which, unlike existing methods, takes into account the complex but unknown network context when predicting causality between specific gene pairs. Findr's predictions have a significantly higher overlap with known gene networks compared to existing methods, using both simulated and real data. Findr is also nearly a million times faster, and hence the only software in its class that can handle modern datasets where the expression levels of tens of thousands of genes are simultaneously measured in hundreds to thousands of individuals.
Exploitation Route Researchers who generate large-scale genotype and transcriptome, proteome and/or metabolome data, and wish to draw biologically meaningful inferences on the causal regulatory relationships between molecular abundance traits to understand the mechanisms by which genetic variation affects phenotypic variation, are found in many areas of biotechnology and biomedicine, and across academia, industry and the health sector. To facilitate uptake of our methods by other research groups, irrespective of their background, we have implemented them in a software package Findr, which is available without any restrictions at https://github.com/lingfeiwang/findr
Sectors Agriculture, Food and Drink; Healthcare; Pharmaceuticals and Medical Biotechnology

URL https://doi.org/10.1371/journal.pcbi.1005703
 
Description Communication and Engagement: The main outcome of this project was the development of a software tool "Findr" for the statistical inference of causal gene regulatory interactions from large-scale omics data. This tool has been publicly released, and is available and promoted from a dedicated website: http://github.com/lingfeiwang/findr. The results of the project have been presented at major international conferences on computational biology and statistical genetics: European Conference on Computational Biology 2016, RECOMB/ISCB Conference on Regulatory Genomics and Systems Biology 2016, Mendelian Randomization Conference 2017, RECOMB/ISCB Conference on Regulatory Genomics and Systems Biology 2017. Leading researchers in the field of network inference met at a workshop "Network Inference: New Methods and New Data" (2016), co-organized by the PI, to share the latest results and discuss future challenges in the area of this project.
Knowledge Exchange: Under the conditions of the GNU General Public License (GPL), anyone is allowed to use and distribute the developed software. No active commercialisation through licensing of the software as a for-profit product is therefore planned.
Researcher Training: At the start of the project, the postdoctoral research associate (PDRA) appointed on this project had no prior experience in the field of computational biology (PhD in Theoretical Physics). Throughout the project, the PDRA has attended training courses in bioinformatics, scientific writing and personal skills development, both locally at the University of Edinburgh and nationally (e.g. "In Silico Systems Biology EMBL-EBI-Wellcome Trust Course 2016"), as well as international conferences in the UK, EU and USA. He has also received invaluable informal training through daily interactions with computational and experimental life science researchers at The Roslin Institute and The University of Edinburgh, and has been able to build a personal research portfolio (3 first-author journal publications [1 published, 2 under review] and 2 first-author book chapters [1 published, 1 in press]). This has led to him being offered a long-term (5 years) postdoctoral researcher position at the Broad Institute, one of the leading research institutes in human genomics and computational biology worldwide.
First Year Of Impact 2016
Sector Education
Impact Types Economic

 
Description BHF Centre of Research Excellence Edinburgh
Amount £20,983 (GBP)
Organisation British Heart Foundation (BHF) 
Sector Charity/Non Profit
Country United Kingdom of Great Britain & Northern Ireland (UK)
Start 06/2017 
End 11/2017
 
Description MRC Precision Medicine DTP
Amount £90,000 (GBP)
Organisation MRC Doctoral Training Program 
Sector Public
Country United Kingdom of Great Britain & Northern Ireland (UK)
Start 09/2017 
End 03/2021
 
Description R01 to Bjorkegren, Johan M.
Amount $290,914 (USD)
Funding ID 1R01HL125863 
Organisation National Heart, Lung, and Blood Institute (NHLBI) 
Sector Public
Country United States of America
Start 09/2015 
End 04/2019
 
Description STARNET 
Organisation Icahn School of Medicine at Mount Sinai
Department Icahn Institute for Genomics and Multiscale Biology
Country United States of America 
Sector Academic/University 
PI Contribution This collaboration is to apply the software developed in the project "Scalable causal gene network inference via genetic node ordering" to the STARNET data consisting of genotype and multi-tissue gene expression data from 600 human individuals. My research team contributes and runs the network inference software.
Collaborator Contribution Prof Johan Bjorkegren contributes access to the STARNET data (genotype data from 600 individuals and more than 3500 RNA-seq profiles) and expertise in cardiovascular biology to interpret the inferred gene networks.
Impact Network reconstruction of pilot study data (~100 individuals) has been published (Talukdar et al, Cell Systems 2016, doi:10.1016/j.cels.2016.02.002). Identification of expression-associated DNA variants in the full dataset (~600 individuals) has been published (Franzen et al, Science 2016, doi:10.1126/science.aad6970). Reconstruction of causal genotype-gene expression-phenotype networks is in progress.
Start Year 2015
 
Description UGent - IBCN 
Organisation University of Ghent
Department Internet Based Communication Networks and Services Research Group
Country Belgium, Kingdom of 
Sector Academic/University 
PI Contribution This collaboration is to develop graph theoretical and statistical machine learning algorithms to reconstruct causal gene networks from multiple omics data. My research team contributes expertise in statistical machine learning and in-depth knowledge of omics data.
Collaborator Contribution The partner contributes general expertise in graph theoretical algorithms and staff time (Dr Pieter Audenaert) to implement specific algorithms.
Impact Joint graph theoretical work related to this collaboration has been published (Melckenbeek et al, PLOS One 2016, doi:10.1371/journal.pone.0147078). Specific work related to the project "Scalable causal gene network inference via genetic node ordering" has resulted in functions included in the Findr software (https://github.com/lingfeiwang/findr). Further joint publications are in preparation.
Start Year 2015
 
Title Findr - Fast Inference of Networks of Directed Regulations 
Description Findr is a fast and scalable software for causal inference and gene network reconstruction from genome-transcriptome variation data 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Findr is ultra-fast software for the statistical inference of causal gene regulatory interactions between tens of thousands of genes from large-scale omics data. In comparison to existing state-of-the-art solutions, Findr predicts real biological interactions more accurately, while being nearly a million times faster. A paper describing Findr has been published in PLOS Computational Biology. 
URL https://github.com/lingfeiwang/findr
 
Title Lassopv - Nonparametric p-value estimation in Lasso penalized regression 
Description Lassopv estimates p-values for the probability that a predictor occurs by chance in Lasso penalized regression. Unlike existing methods, p-values obtained by lassopv are comparable across multiple regressions with different response variables and with different, unequally sized sets of predictors. As such, lassopv allows accurate control of the false discovery rate of individual edges in Bayesian gene networks. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact A manuscript describing lassopv has been submitted for publication. 
URL https://github.com/lingfeiwang/lassopv