Advancing Bayesian network algorithms for inferring gene regulation using an integrative computational-biological approach in a yeast model system

Lead Research Organisation: University of St Andrews
Department Name: Biology


Recently it has become possible to collect large amounts of data in biology, for example by measuring the expression level of every gene in yeast. This abundance of data has spurred the development of computational tools to analyse it. Together, such data and tools enable us to look at biology at a broader level than previously possible: we can examine a large number of interacting elements, instead of doing directed experiments on only a few, and so investigate how the entire system behaves. One area of such work uses computational algorithms to reveal gene regulatory networks. Gene regulation is when a protein--known as a regulator--binds to the DNA near a gene and affects how that gene is expressed, either increasing or decreasing the amount of RNA produced. This RNA is then used to make the protein product of the gene, so the binding of the regulator near the gene ultimately affects the amount of protein the gene makes. The regulator is also a protein, and thus was itself produced by a gene making RNA, which in turn made protein. In fact, the regulator could have a regulator of its own. A gene regulatory network is the network formed by proteins that are regulators for other proteins, which either perform some function in the cell or are regulators for yet more proteins. Even though a regulatory network consists of steps going from genes to RNA and from RNA to protein, current algorithms use data from only RNA, not proteins. This is mostly because RNA measurement is easier, and thus RNA data is readily available. However, protein measurement is improving, and it may be important to consider the RNA to protein transition, as regulation could occur at this step too. Here, we propose to improve algorithms that reveal gene regulatory networks by including protein data.
Additionally, much other information is available that might help us work out the gene regulatory network: locations where regulators have been found to bind to DNA, which genes lie near DNA sequences to which regulators are known to bind, which proteins bind to each other, and which genes changed expression when another gene was manipulated. We will add all of these pieces of information into the algorithm, in an effort to take maximal advantage of the available information to accurately predict gene regulatory networks. But making an algorithm that ought to work is not the whole story--we also have to test it. We will test the algorithms we develop in two ways. First, we will use a simulation: we make up a gene regulatory network, sample data from it as if we were doing a biological experiment (but in the computer), and then see whether the algorithm can recover the network we made. This step shows us where we got things right, when the algorithm finds the correct network, and where we got things wrong, when the algorithm makes mistakes. We can then work on fixing the algorithm so that it makes fewer mistakes. Second, we will take the algorithm we have tested in the simulator and made as good as we can, and apply it to data taken from yeast in a biological laboratory. The algorithm will output a network showing what it predicts the gene regulatory network to be, based on the data. We will then pick pieces of this network, such as a regulator and gene pair, to test in our own yeast experiments. These tests will tell us whether the algorithm is making accurate predictions. This type of validation, while important, is rarely performed, because the people who make the algorithms are usually not the people who do the biology. Thus, the proposed research meets this often-missed need. The ultimate goal of this research is to produce an algorithm that does a good job of predicting gene regulatory networks.
Once we have this algorithm, future research can use it to measure gene regulatory networks and study their features. In particular, we plan to use the algorithm produced here to study the evolution of gene regulatory networks in future projects.

Technical Summary

The advent of large amounts of biological data has spurred much computational research into analysing this data to understand biology on a systems level. However, as computation and biology are often performed by separate groups, there is little interplay between computational development and biological experimentation; this leads to computational tools whose biological validity is unknown. I propose to rectify this issue by integrating biological experimentation with the computational development task. The proposed research concentrates on developing algorithms for revealing gene regulatory networks. Variations in gene regulation are responsible for tissue differences, developmental change, and some disease states such as cancer, and have been suggested to be a main substrate for evolutionary change. Thus, algorithms capable of accurately revealing gene regulatory networks could have impact in many areas of biology. Current algorithms for the genetic network inference task generally consider data from only RNA expression, not protein expression. However, translational regulation may be an important feature of gene regulation; thus, I propose to develop a Bayesian network inference algorithm which can model both transcriptional and translational regulation using RNA and protein data. I will additionally incorporate the ability to use other sources of information, such as location data from ChIP-chip experiments, in the network inference task. The algorithm will be developed iteratively, alongside tests in a simulation framework and biological intervention experiments in the yeast S. cerevisiae. Simulation tests will enable characterisation of the algorithm's performance across a range of situations and reveal areas to target for improvement. Biological manipulation, in which the level of putative regulators is adjusted by incorporating inducible promoters into the genome and putative targets are then measured, will enable biological verification of algorithm performance.
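
The simulation test described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a made-up four-gene network, linear-Gaussian expression levels, a known ordering of regulators before targets, and an exhaustive search over parent sets scored with BIC. All gene names, weights and noise levels are hypothetical.

```python
import math
import random
from itertools import combinations

# Hypothetical ground-truth network (regulator -> target); all names are made up.
GENES = ["R1", "G1", "G2", "G3"]  # listed so that regulators precede their targets
TRUE_EDGES = {("R1", "G1"), ("R1", "G2"), ("G2", "G3")}


def simulate(n_samples, rng, weight=1.5, noise=0.5):
    """Sample expression levels: each gene is a noisy linear function of its regulators."""
    data = []
    for _ in range(n_samples):
        row = {}
        for gene in GENES:
            parents = [p for (p, target) in TRUE_EDGES if target == gene]
            signal = weight * sum(row[p] for p in parents)
            row[gene] = signal + rng.gauss(0.0, noise if parents else 1.0)
        data.append(row)
    return data


def residual_variance(y, columns):
    """Fit y on the given columns plus an intercept; return the mean squared residual."""
    n = len(y)
    X = [[1.0] + [column[i] for column in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, X[i])) for i in range(n)]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n)) / n


def bic(data, target, parents):
    """BIC of a linear-Gaussian node, up to terms that do not depend on the parent set."""
    n = len(data)
    y = [row[target] for row in data]
    columns = [[row[p] for row in data] for p in parents]
    return -0.5 * n * math.log(residual_variance(y, columns)) - 0.5 * len(parents) * math.log(n)


def learn_edges(data):
    """For each gene, pick the best-scoring parent set among the genes that precede it."""
    edges = set()
    for index, gene in enumerate(GENES):
        candidates = GENES[:index]
        parent_sets = (s for size in range(len(candidates) + 1)
                       for s in combinations(candidates, size))
        best = max(parent_sets, key=lambda s: bic(data, gene, s))
        edges.update((parent, gene) for parent in best)
    return edges


def precision_recall(predicted, truth):
    """Fraction of predicted edges that are true, and fraction of true edges recovered."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


rng = random.Random(0)
data = simulate(300, rng)
predicted = learn_edges(data)
precision, recall = precision_recall(predicted, TRUE_EDGES)
print(sorted(predicted), precision, recall)
```

Because the true network is known, any missing or spurious edge in `predicted` can be traced back to a specific weakness of the scoring or search, which is the role the simulation framework plays in the proposal.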
Description

Novel methodology:
We developed a novel methodology for predicting protein expression from mRNA data, incorporating ribosome density, ribosome occupancy, codon usage, gene copy number, and mRNA free folding energy alongside mRNA measurements in a generalised linear model to learn a predictive model for protein expression. Two key assumptions underlie this methodology: (1) additional, unknown factors relating mRNA to protein will be similar for proteins involved in the same biological process, and (2) it is possible to learn this relationship using protein and mRNA levels collected from "control" conditions that are the same across multiple experiments. We then built separate generalised linear models for individual functional groups of proteins (e.g., pathways), using multiple mRNA experiments plus the other information listed above to predict protein expression in the "control" condition.
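
As a rough illustration of fitting one model per functional group, the sketch below uses the simplest generalised linear model (Gaussian family, identity link, i.e. ordinary least squares) on log-scale values. The family and link, the two pathway names, the choice of two predictors and all numbers are assumptions made for the example; the data are synthetic and not taken from the project.

```python
import random


def fit_linear(X, y):
    """Ordinary least squares with an intercept (Gaussian GLM, identity link)."""
    n = len(y)
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(rows[i][r] * rows[i][s] for i in range(n)) for s in range(k)]
         + [sum(rows[i][r] * y[i] for i in range(n))] for r in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[r][k] / A[r][r] for r in range(k)]


def predict(beta, x):
    """Predicted log protein level for one gene's predictor values."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Synthetic toy data: (pathway, [log mRNA, codon-usage feature], log protein).
# Coefficients are [intercept, log mRNA, codon usage] and are invented for illustration.
TRUE_COEFS = {"pathway_A": [2.0, 0.8, 0.5], "pathway_B": [1.0, 1.2, -0.3]}
rng = random.Random(1)
genes = []
for pathway, (b0, b1, b2) in TRUE_COEFS.items():
    for _ in range(20):
        log_mrna = rng.uniform(0.0, 5.0)
        codon = rng.uniform(0.0, 1.0)
        log_protein = b0 + b1 * log_mrna + b2 * codon + rng.gauss(0.0, 0.05)
        genes.append((pathway, [log_mrna, codon], log_protein))

# One separate model per functional group, as in assumption (1) above.
models = {}
for pathway in TRUE_COEFS:
    X = [x for p, x, _ in genes if p == pathway]
    y = [v for p, _, v in genes if p == pathway]
    models[pathway] = fit_linear(X, y)

for pathway, beta in models.items():
    print(pathway, [round(b, 2) for b in beta])
```

Fitting each group separately lets the unknown factors relating mRNA to protein differ between groups while being shared within a group; a real application would add the remaining predictors (ribosome density and occupancy, gene copy number, folding energy) as extra columns.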

We applied this methodology to budding yeast, S. cerevisiae (below). To apply this methodology to another system, the following are required: (1) the additional genetic information about each gene (e.g., ribosome density and occupancy, etc.), (2) categorisation of genes into functional processes (e.g., KEGG pathways), (3) at least one, and preferably more, high-throughput quantitative protein expression datasets taken under the relevant "control" conditions, and (4) multiple high-throughput quantitative mRNA expression datasets containing the same "control" conditions.

Predictive models:
We developed predictive models for protein expression from mRNA expression for 38 KEGG pathways (all pathways with enough proteins represented in the protein expression datasets for learning) in budding yeast, S. cerevisiae.

Evolved synthetic constructs:
We evolved a single ancestral synthetic construct of budding yeast, S. cerevisiae (with a GFP-tagged osmotic stress response protein, TPS2), under mild osmotic stress produced by liquid media containing 0.3M NaCl. The construct was evolved in 16 replicates, 8 each in shaking and static culture, for 90-150 days (600-1000 generations), with stocks saved every 3 days (20 generations), resulting in 480+ "snapshots" of evolution.
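
The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative.

```python
replicates = 16
sampling_interval_days = 3
days_range = (90, 150)             # shortest and longest evolution runs
generations_range = (600, 1000)    # generations reached in those runs

# Generations elapsed between consecutive saved stocks, for both ends of the range.
generations_per_interval = [
    generations * sampling_interval_days / days
    for days, generations in zip(days_range, generations_range)
]

# Stocks saved per replicate, and across all replicates, for both ends of the range.
snapshots_per_replicate = [days // sampling_interval_days for days in days_range]
total_snapshots = [replicates * count for count in snapshots_per_replicate]

print(generations_per_interval)   # 20 generations per 3-day interval at both ends
print(snapshots_per_replicate)    # 30 to 50 stocks per replicate
print(total_snapshots)            # 480 if every replicate ran 90 days, 800 if all ran 150
```

The lower bound of 16 x 30 = 480 matches the "480+" quoted above; the exact total depends on how long each replicate was run.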
Exploitation Route

St Andrews iGEM Team:
The existence of the lab, and this project's turn towards synthetic biology, enabled the formation of an iGEM (International Genetically Engineered Machine) team at St Andrews. The team began in 2010, obtaining a gold medal at the competition, and has continued for the subsequent three years. The 2011 team obtained a gold medal and an honourable mention for the Human Practices award. The 2012 team obtained a gold medal and won the Best Natural Biobrick award. The team provided valuable supervision experience for the PDRA and GRA employed on the grant, as well as for PhD students in the School of Biology.
Sectors: Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

Description

A collaboration was developed with Definiens AG in Germany, regarding their image processing software for automatic measurement of cellular morphology. We collaborated with them to develop protocols for measuring appropriate features of cells and used these measurements in Bayesian network analysis. The PI has engaged in many areas of science outreach during the course of this grant.

Interactions with industry/media/other:
The PI has been a member of SciTalk, science outreach to writers, since 2008. She consulted with Scottish Enterprise in 2010 regarding synthetic biology in Scotland and attended a "synthetic biology scoping event" in 2011, considering ways of building synthetic biology expertise in Scotland and interactions with industry. She was interviewed by a Wellcome Trust journalist for an article and a blog entry, and also supervised the St Andrews iGEM Team's writing of a guest Wellcome Trust blog entry and contributed photographs.

Exhibition/public event activities:
The PI assisted in St Andrews' National Science and Engineering Week exhibition in 2010. Further public events were carried out through supervision of the St Andrews iGEM Team: a poster presentation at the Edinburgh Science Festival's Synthetic Biology Debate (2010), an interactive display at St Andrews' National Science and Engineering Week (2011), and co-organisation and production, with Dundee's iGEM Team, of a synthetic biology debate at the World Schools Debating Championship (2011).
First Year Of Impact: 2008
Sector: Education, Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types: Cultural, Societal, Economic