Scalable causal gene network inference via genetic node ordering

Lead Research Organisation: University of Edinburgh

Department Name: The Roslin Institute

Abstract

The aim of this proposal is to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype. To achieve this we will: (i) develop a novel statistical method for reconstructing causal gene networks based on total genetic node ordering; (ii) implement the method in unique, ultra-fast software for genome-scale causal network reconstruction; (iii) validate the method in silico using benchmark datasets from human and pig.

Genetic differences between individuals cause variation in phenotypes. This principle underpins genome-wide association studies (GWAS), which map the genetic architecture of complex traits by measuring genetic variation on a genome-wide scale across many individuals. A major challenge in GWAS is to understand the molecular mechanisms that explain the statistical association between quantitative trait loci (QTLs) and phenotypes. Because the majority of QTLs lie in non-coding genomic regions and presumably play a gene-regulatory role, it is hypothesized that genetic variation affects the status of molecular networks of interacting genes, proteins and metabolites, which collectively control physiological phenotypes. Since comprehensive, experimentally verified, cell-type-specific networks of molecular biological interactions are lacking, statistical and computational methods which reconstruct causal trait-associated networks from omics data are essential to study the impact of genetic variation on gene regulatory networks.

Causal gene networks consist of directed interactions between genes and are usually modelled as Bayesian networks, which assume that the expression level of a gene is normally distributed around a linear combination of the expression levels of its causal regulators and that no gene can affect its own expression, either directly or indirectly via an extended cycle of interactions. Current state-of-the-art algorithms for learning the structure and parameters of a Bayesian network from experimental data rely on local optimization, where a model is improved one edge at a time. Such algorithms are feasible for systems of a few hundred genes, but modern sequencing technologies measure the abundance of orders of magnitude more RNA molecules, and increased sample sizes mean that ever more of those are detected as variable across individuals. Developing a scalable method to reconstruct causal gene networks from whole-genome genotype and transcriptome data measured across many individuals is therefore an open problem of outstanding interest.
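
The model assumptions described above can be written compactly as follows. The notation is ours and is introduced only for illustration: x_i is the expression level of gene i, Pa(i) its set of causal regulators in the directed acyclic graph G, \beta_{ij} the regression weights and \sigma_i^2 the residual variance.

```latex
x_i = \sum_{j \in \mathrm{Pa}(i)} \beta_{ij}\, x_j + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2),
\qquad i = 1, \dots, n,

p(x_1, \dots, x_n \mid G) = \prod_{i=1}^{n} p\bigl(x_i \mid x_{\mathrm{Pa}(i)}\bigr),
\qquad G \text{ acyclic.}
```

The acyclicity constraint on G is what couples the choice of parent sets across genes and makes structure learning hard at genome scale.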

Statistical theory permits one exception to the intractability of the large-scale causal network inference problem: if there exists a total ordering of the nodes in the network, such that the parents of any node can be found among the nodes ranked before it, then the problem reduces to a set of independent, tractable optimization problems, one for each node. In genetics, pairs of gene expression traits can be causally ordered using genotype data. This is based on the principle of Mendelian randomization, which states that because genotypes of unlinked SNPs are inherited independently, if gene A is causal for gene B, then any association between the expression of gene B and an eQTL of gene A must be mediated by the expression of gene A. Here we propose to use graph-theoretical concepts to derive a total causal ordering of nodes based on pairwise Mendelian randomization tests. We will then use penalized linear regression to reconstruct a sparse maximum-likelihood Bayesian causal gene network from the inferred total genetic node ordering. Preliminary results support the hypothesis that this method will lead to a dramatic reduction in computational cost, a higher model likelihood score and better biological validation, compared to current methods based on local optimization techniques.
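
One way to turn pairwise causal orderings into a total ordering is sketched below. This is a simple greedy heuristic chosen for illustration, not necessarily the graph-theoretical procedure developed in the project: edges are accepted from strongest to weakest pairwise evidence, any edge that would close a cycle is skipped, and the resulting acyclic graph is topologically sorted. The function name, the score dictionary and the example values are all hypothetical.

```python
from collections import defaultdict
import heapq


def total_order_from_pairwise(scores):
    """Derive a total node ordering from pairwise causal scores.

    scores maps (a, b) -> weight, read as evidence that a is causal for b.
    Greedy heuristic (an assumption of this sketch): accept edges from
    strongest to weakest, skip any edge that would close a cycle, then
    topologically sort the resulting directed acyclic graph.
    """
    nodes = sorted({n for edge in scores for n in edge})
    children = defaultdict(set)

    def reaches(src, dst):
        # Depth-first search: is dst reachable from src in the current graph?
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(children[node])
        return False

    # Descending weight; ties broken by edge name so the result is deterministic.
    for (a, b), _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if a != b and not reaches(b, a):
            children[a].add(b)

    # Kahn's algorithm; the heap makes unordered nodes come out alphabetically.
    indegree = {n: 0 for n in nodes}
    for a in nodes:
        for b in children[a]:
            indegree[b] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for child in sorted(children[node]):
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)
    return order


# Hypothetical pairwise scores, e.g. from Mendelian randomization tests
# (values invented for illustration only).
example_scores = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("C", "A"): 0.3,  # conflicts with A -> B -> C, so it is skipped
    ("A", "D"): 0.6,
}
order = total_order_from_pairwise(example_scores)
```

Every gene's candidate parents are then simply the genes ranked before it in `order`, which is what makes the per-gene regression problems independent.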

Technical Summary

Genome-wide association studies have uncovered the genetic architecture of numerous complex traits in model organisms, crops, livestock species and humans. A major challenge now is to understand the molecular mechanisms that explain genetic associations. Because the majority of trait-associated loci lie in non-coding genomic regions, it is hypothesized that they play a gene regulatory role and that genetic variation affects the status of molecular networks of interacting genes, proteins and metabolites, which collectively control physiological phenotypes. The aim of this proposal is to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype. To achieve this we will develop a novel statistical method for reconstructing causal gene networks based on total genetic node ordering, implement the method in unique, ultra-fast software for genome-scale causal network reconstruction, and validate the method in silico using benchmark datasets from human and pig. The proposed method will be based on pairwise Mendelian randomization tests to establish the most likely causal direction between two correlated gene expression traits, graph-theoretical concepts to derive a total causal ordering of nodes based on pairwise orderings, and penalized linear regression to reconstruct a sparse maximum-likelihood Bayesian causal gene network from the inferred total genetic node ordering.
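
The final step, sparse regression of each gene on the genes ranked before it, might look like the following minimal sketch. It uses a plain coordinate-descent lasso as a stand-in for whichever penalized regression the project adopted; the function names, the penalty value and the toy expression data are assumptions made for illustration, and the node ordering is taken as given.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, target, penalty, sweeps=200):
    """Minimise (1/2n)*||target - sum_j b_j*columns[j]||^2 + penalty*sum_j |b_j|."""
    n = len(target)
    coefs = [0.0] * len(columns)
    residual = list(target)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue
            # Covariance of column j with the residual that excludes its own fit.
            rho = sum(c * (r + coefs[j] * c) for c, r in zip(col, residual)) / n
            new = soft_threshold(rho, penalty) / norm
            if new != coefs[j]:
                delta = new - coefs[j]
                residual = [r - delta * c for c, r in zip(col, residual)]
                coefs[j] = new
    return coefs


def fit_network(expression, order, penalty):
    """Regress each gene on the genes ranked before it; keep non-zero edges."""
    edges = {}
    for rank, gene in enumerate(order):
        parents = order[:rank]
        if not parents:
            continue
        coefs = lasso_coordinate_descent(
            [expression[p] for p in parents], expression[gene], penalty
        )
        for parent, coef in zip(parents, coefs):
            if coef != 0.0:
                edges[(parent, gene)] = coef
    return edges


# Toy, mean-centred expression data (invented): B is exactly 2*A, while C is
# orthogonal to both A and B and should therefore receive no parents.
expression = {
    "A": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "B": [-4.0, -2.0, 0.0, 2.0, 4.0],
    "C": [1.0, -2.0, 2.0, -2.0, 1.0],
}
edges = fit_network(expression, order=["A", "B", "C"], penalty=0.1)
```

In this toy example the only recovered edge is A -> B, with a coefficient slightly below 2 because the L1 penalty shrinks estimates towards zero. Because each gene's regression only sees genes earlier in the ordering, the fitted network is acyclic by construction and the regressions can be run in parallel.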

Planned Impact

This project proposes to develop a novel method and software tool to reconstruct causal, global and high-quality gene networks from large-scale omics data to understand how the genotype determines the phenotype.

The academic impact of the project will extend well beyond the immediate professional circle of the applicants and includes all researchers who perform systems genetics studies to understand the fundamental molecular mechanisms that connect genetic variation to phenotypic variation.

Researchers at private commercial companies in the biotechnological and pharmaceutical sectors also have a strong interest in the research described in this proposal. They often face the challenge that candidate disease genes reported by genome-wide association studies are not directly druggable. The ability to reconstruct causal gene networks to generate hypotheses on causal upstream regulators of lead candidate genes and the potential downstream side-effects of affecting them via existing or novel drugs is essential in modern drug target discovery research.

Researchers at both commercial and academic organizations will benefit from this project through the availability of a novel software tool to reconstruct causal gene networks, applicable to the size of contemporary datasets and packaged in a user-friendly toolbox that will integrate seamlessly with existing data analysis pipelines for the R and Matlab statistical computing environments.

The applicants are committed to an open access policy for all software developed during this project. Under the conditions of the GNU General Public License (GPL), anyone will be allowed to use and distribute the developed software. No active commercialisation through licensing of the software as a for-profit product is therefore planned. The applicants strongly believe that both the academic and private research sectors will benefit most from open software development. Although this is unlikely to lead to the creation of a new commercialisable product, the scientific knowledge gained from developing and benchmarking the novel software will be exploited. The Roslin Institute is committed to knowledge exchange, and commercial companies can benefit from the knowledge gained in this project through consultancy agreements with the applicants. The PI, with support from Edinburgh Research and Innovation, has already entered such an agreement with the SME Clinical Gene Networks AB (CGN), to oversee the reconstruction of gene networks surrounding identified genomic risk loci for cardiovascular disease.

An important impact of this project will concern the training of a highly skilled postdoctoral research associate for academic or non-academic professions alike. There is currently a great demand for computational scientists to assist in the analysis of "big data" in academic and non-academic life science organisations, but few computational scientists possess the necessary experience of working with molecular biological data. Through working on this project and performing the benchmark analyses on human and pig test datasets, the postdoctoral research associate will be trained in biological data analysis and at the end of the project will be well prepared for a cross-disciplinary research career.

Funded Value:

£148,694

Funded Period:

Aug 15 - Feb 17

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/M020053/1

Principal Investigator:

Tom Michoel

Research Subject:

Genetics & development (28%)

Omic sciences & technologies (14%)

Tools, technologies & methods (56%)

Research Topic:

Bioinformatics (42%)

Gene action & regulation (28%)

Genomics (7%)

Tools for the biosciences (14%)

Transcriptomics (7%)

Organisations

People	ORCID iD
Tom Michoel (Principal Investigator)	http://orcid.org/0000-0003-4749-4725
Albert Tenesa (Co-Investigator)

Publications

Banks CJ (2016) Functional transcription factor target discovery via compendia of binding and expression profiles. in Scientific reports

Crawford AA (2021) Variation in the SERPINA6/SERPINA1 locus alters morning plasma cortisol, hepatic corticosteroid binding globulin expression, gene expression in peripheral tissues, and risk of cardiovascular disease. in Journal of human genetics

Erola P (2020) Model-based clustering of multi-tissue gene expression data in Bioinformatics

Erola P (2019) Learning Differential Module Networks Across Multiple Experimental Conditions. in Methods in molecular biology (Clifton, N.J.)

Koplev S (2022) A mechanistic framework for cardiometabolic and coronary artery diseases. in Nature cardiovascular research

Lingfei Wang (2019) Figure S1 from Accurate wisdom of the crowd from unsupervised dimension reduction

Lingfei Wang (2019) Figure S8 from Accurate wisdom of the crowd from unsupervised dimension reduction

Lingfei Wang (2019) Figure S5 from Accurate wisdom of the crowd from unsupervised dimension reduction

Lingfei Wang (2019) Figure S3 from Accurate wisdom of the crowd from unsupervised dimension reduction

Lingfei Wang (2019) Figure S7 from Accurate wisdom of the crowd from unsupervised dimension reduction

Key Findings
Impact Summary
Further Funding
Collaboration
Software and Technical Products


Description	Understanding how genetic variation between individuals determines variation in observable traits or disease risk is one of the core aims of genetics. It is known that genetic variation often affects gene regulatory DNA elements and directly causes variation in expression of nearby genes. This effect in turn cascades down to other genes via the complex pathways and gene interaction networks that ultimately govern how cells operate in an ever changing environment. In theory, when genetic variation and gene expression levels are measured simultaneously in a large number of individuals, the causal effects of genes on each other can be inferred using statistical models similar to those used in randomized controlled trials. We developed a novel method and ultra-fast software, Findr, which, unlike existing methods, takes into account the complex but unknown network context when predicting causality between specific gene pairs. Findr's predictions have a significantly higher overlap with known gene networks compared to existing methods, using both simulated and real data. Findr is also nearly a million times faster, and hence the only software in its class that can handle modern datasets where the expression levels of tens of thousands of genes are simultaneously measured in hundreds to thousands of individuals.
Exploitation Route	Researchers who generate large-scale genotype and transcriptome, proteome and/or metabolome data, and wish to draw biologically meaningful inferences on the causal regulatory relationships between molecular abundance traits to understand the mechanisms by which genetic variation affects phenotypic variation, are found in many areas of biotechnology and biomedicine, and across academia, industry and the health sector. To facilitate uptake of our methods by other research groups, irrespective of their background, we have implemented them in a software package Findr, which is available without any restrictions at https://github.com/lingfeiwang/findr
Sectors	Agriculture, Food and Drink; Healthcare; Pharmaceuticals and Medical Biotechnology
URL	https://doi.org/10.1371/journal.pcbi.1005703 https://doi.org/10.3389/fgene.2019.01196


Description	Communication and Engagement: The main outcome of this project was the development of a software tool "Findr" for the statistical inference of causal gene regulatory interactions from large-scale omics data. This tool has been publicly released, and is available and promoted from a dedicated website: http://github.com/lingfeiwang/findr. The results of the project have been presented at major international conferences on computational biology and statistical genetics: European Conference on Computational Biology 2016, RECOMB/ISCB Conference on Regulatory Genomics and Systems Biology 2016, Mendelian Randomization Conference 2017, RECOMB/ISCB Conference on Regulatory Genomics and Systems Biology 2017. Leading researchers in the field of network inference met at a workshop "Network Inference: New Methods and New Data" (2016), co-organized by the PI, to share the latest results and discuss future challenges in the area of this project. Knowledge Exchange: Under the conditions of the GNU General Public License (GPL), anyone is allowed to use and distribute the developed software. No active commercialisation through licensing of the software as a for-profit product is therefore planned. Researcher Training: At the start of the project, the postdoctoral research associate (PDRA) appointed on this project had no prior experience in the field of computational biology (PhD in Theoretical Physics). Throughout the project, the PDRA has attended training courses in bioinformatics, scientific writing and personal skills development, both locally at the University of Edinburgh and nationally (e.g. "In Silico Systems Biology EMBL-EBI-Wellcome Trust Course 2016"), as well as international conferences in the UK, EU and USA. He has also received invaluable informal training through daily interactions with computational and experimental life science researchers at The Roslin Institute and The University of Edinburgh, and has been able to build a personal research portfolio (3 first-author journal publications [1 published, 2 under review] and 2 first-author book chapters [1 published, 1 in press]). This has led to him being offered a long-term (5 years) postdoctoral researcher position at the Broad Institute, one of the leading research institutes in human genomics and computational biology worldwide.
First Year Of Impact	2016
Sector	Education
Impact Types	Economic


Description	BHF Centre of Research Excellence Edinburgh
Amount	£20,983 (GBP)
Organisation	British Heart Foundation (BHF)
Sector	Charity/Non Profit
Country	United Kingdom
Start	05/2017
End	11/2017


Description	Intelligent systems for personalized and precise risk prediction and diagnosis of non-communicable diseases
Amount	kr 12,800,000 (NOK)
Funding ID	312045
Organisation	Research Council of Norway
Sector	Public
Country	Norway
Start	01/2021
End	12/2024


Description	MRC Precision Medicine DTP
Amount	£90,000 (GBP)
Organisation	MRC Doctoral Training Program
Sector	Academic/University
Country	United Kingdom
Start	08/2017
End	03/2021


Description	R01 to Bjorkegren, Johan M.
Amount	$3,000,000 (USD)
Funding ID	1R01HL125863
Organisation	National Institutes of Health (NIH)
Department	National Heart, Lung, and Blood Institute (NHLBI)
Sector	Public
Country	United States
Start	08/2015
End	04/2019


Description	STARNET
Organisation	Icahn School of Medicine at Mount Sinai
Department	Icahn Institute for Genomics and Multiscale Biology
Country	United States
Sector	Academic/University
PI Contribution	This collaboration is to apply the software developed in the project "Scalable causal gene network inference via genetic node ordering" to the STARNET data consisting of genotype and multi-tissue gene expression data from 600 human individuals. My research team contributes and runs the network inference software.
Collaborator Contribution	Prof Johan Bjorkegren contributes access to the STARNET data (genotype data from 600 individuals and more than 3500 RNA-seq profiles) and expertise in cardiovascular biology to interpret the inferred gene networks.
Impact	Network reconstruction of pilot study data (~100 individuals) has been published (Talukdar et al, Cell Systems 2016, doi:10.1016/j.cels.2016.02.002). Identification of expression-associated DNA variants in the full dataset (~600 individuals) has been published (Franzen et al, Science 2016, doi:10.1126/science.aad6970). Reconstruction of causal genotype-gene expression-phenotype networks is in progress.
Start Year	2015


Description	UGent - IBCN
Organisation	University of Ghent
Department	Internet Based Communication Networks and Services Research Group
Country	Belgium
Sector	Academic/University
PI Contribution	This collaboration is to develop graph theoretical and statistical machine learning algorithms to reconstruct causal gene networks from multiple omics data. My research team contributes expertise in statistical machine learning and in-depth knowledge of omics data.
Collaborator Contribution	The partner contributes general expertise in graph theoretical algorithms and staff time (Dr Pieter Audenaert) to implement specific algorithms.
Impact	Joint graph theoretical work related to this collaboration has been published (Melckenbeek et al, PLOS One 2016, 10.1371/journal.pone.0147078). Specific work related to the project "Scalable causal gene network inference via genetic node ordering" has resulted in functions included in the Findr software (https://github.com/lingfeiwang/findr). Further joint publications are in preparation.
Start Year	2015


Title	Bayonet
Description	Matlab implementation of an analytic solution and stationary phase approximation for the Bayesian lasso and elastic net.
Type Of Technology	Software
Year Produced	2017
Open Source License?	Yes
Impact	Selected paper at Neural Information Processing Systems Conference (Montreal, Dec 2018)
URL	http://papers.nips.cc/paper/7542-analytic-solution-and-stationary-phase-approximation-for-the-bayesi...


Title	Findr - Fast Inference of Networks of Directed Regulations
Description	Findr is a fast and scalable software for causal inference and gene network reconstruction from genome-transcriptome variation data
Type Of Technology	Software
Year Produced	2015
Open Source License?	Yes
Impact	Findr is an ultra-fast software tool for the statistical inference of causal gene regulatory interactions between tens of thousands of genes from large-scale omics data. In comparison to existing state-of-the-art solutions, Findr predicts real biological interactions more accurately, while being nearly a million times faster. A paper describing Findr has been published in PLOS Computational Biology.
URL	https://github.com/lingfeiwang/findr


Title	Lassopv - Nonparametric p-value estimation in Lasso penalized regression
Description	Lassopv estimates p-values for the probability that a predictor occurs by chance in Lasso penalized regression. Unlike existing methods, p-values obtained by lassopv are comparable across multiple regressions with different response variables and different and unequally sized sets of predictors. As such, lassopv allows accurate control of the false discovery rate of individual edges in Bayesian gene networks.
Type Of Technology	Software
Year Produced	2017
Open Source License?	Yes
Impact	A manuscript describing lassopv has been submitted for publication.
URL	https://github.com/lingfeiwang/lassopv