Micro-quantitative and macro-qualitative gene network models from ChIP-sequencing and microarray data.

Lead Research Organisation: Brunel University London
Department Name: Information Systems Computing and Maths

Abstract

It is a key aim in biology to discover the major causes of a disease at the genetic level, as this can lead to the development of a cure for that disease. Research places particular emphasis on the study of proteins and, amongst these, on transcription factors. These are key to the regulation of a cell: they activate or repress the genes encoding other proteins and in so doing regulate their transcription rate. This is the rate at which genes are transcribed, the first step towards producing the proteins that perform the various distinct functions in a cell. In this proposal, we look closely at two specific proteins, p300 and CBP, whose incorrect functioning has been linked to a number of major diseases, including several types of cancer. In particular, it is of interest to discover differences in the roles played by p300 and CBP in the regulatory process. These two proteins are known to be transcription coactivators, and thus activate a number of other proteins. The first aim of this proposal is to detect exactly which proteins in the biological system are activated by p300 and CBP, either jointly or separately. To achieve this aim, we will use ChIP-sequencing data, produced by recent deep-sequencing technology. This technology offers clear advantages over traditional approaches, such as microarray experiments, in terms of accuracy, depth and speed, so we expect these data to yield more accurate results. As the first deep-sequencing datasets are becoming available, it is now timely to develop appropriate statistical models to analyse them. In particular, an objective of this proposal is to develop a statistical model for ChIP-sequencing data in which more than one transcription factor is profiled. The challenge is to include in the model the effect of the two different antibodies used in the experimental set-up for the two transcription factors.
As a result of this analysis, a number of targets will be detected as regulated by both or by just one of the two transcription factors. As a second step of the proposal, we aim to integrate the ChIP-sequencing analysis with time-course microarray data on the same system. A dedicated statistical model of the gene expression data will give further insight into the regulatory mechanisms of p300 and CBP, their activity and the kinetics of regulation of their target genes. Finally, we aim to extend the small refined regulatory network consisting of p300/CBP and their target genes into a larger-scale network, by exploring possible interactions between any component in the network and other transcription factors in the biological system. As a final result, the proposal will produce a small, refined statistical analysis of the set of genes regulated by the two transcription coactivators, as well as a global view of the regulatory network of the whole biological system. This will shed light, at both the local and the global level, on the regulatory mechanism of these two proteins and potentially lead to advances in the cure of the diseases associated with their malfunctioning. The work in this proposal will be important to biologists, who will be able to gain insight into the specific biological problem under study; to mathematicians and statisticians interested in the advances made in the statistical methodology; and to the wider community, which would benefit from any knowledge gained on extremely important health-related regulatory mechanisms.

Technical Summary

In this proposal, we aim to uncover the regulatory mechanisms of the two transcriptional coactivators p300 and CBP. Despite high levels of homology, these two transcription factors (TFs) have both been found to be indispensable during embryogenesis, and their incorrect functioning has been linked to a number of diseases. In particular, it is of interest to discover differences in the roles played by p300 and CBP in the regulatory process. To achieve this aim, we will develop dedicated statistical models for a diverse set of data. Firstly, we will identify the target genes of the two TFs from ChIP-sequencing data. Thanks to recent advances in deep-sequencing technology, these data can now provide accurate measures of the binding affinity of a TF throughout the genome. The statistical model for these data will include the effect of the two different antibodies used for the two TFs, in order to best detect the truly differentially bound sites. As a result of this analysis, a number of targets will be detected as regulated by both or by just one of the TFs. The resulting network motif will be further investigated using a Michaelis-Menten statistical model of regulation, estimated from available gene expression data on the same system and extended from our previous research to the case where multiple TFs are present. The statistical model will allow us to discover a possible competing role of the two TFs during transcription, to obtain estimates of the TF activities and to quantify the kinetics of regulation of their target genes. Finally, we wish to extend this quantified motif into a larger-scale network, by exploring possible interactions between any component in the motif and other TFs in the biological system. As a final result, the proposal will produce a small, refined statistical analysis of the set of genes regulated by the two TFs, as well as a global view of the regulatory network of the whole biological system.
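
As an illustration only, a Michaelis-Menten model of regulation with two TFs could take the form sketched below. The proposal does not state the model's exact form, so everything here is an assumption made for the sketch: the competitive-binding production term, the parameter names (basal, s1, s2, k1, k2, decay), the function names and the simple Euler integration. It covers only the deterministic kinetics, not the statistical estimation from expression data.

```python
# Hypothetical sketch: mRNA dynamics of one target gene regulated by two TFs
# that compete for the same binding site (an assumed form, not the proposal's).
#
#   d(mrna)/dt = basal + (s1*x1/k1 + s2*x2/k2) / (1 + x1/k1 + x2/k2) - decay*mrna
#
# With x2 = 0 this reduces to the single-TF Michaelis-Menten term s1*x1/(k1 + x1).


def production(tf1, tf2, basal, s1, s2, k1, k2):
    """Production rate given the two TF activities tf1 and tf2."""
    occ1 = tf1 / k1  # relative occupancy weight of TF 1
    occ2 = tf2 / k2  # relative occupancy weight of TF 2
    return basal + (s1 * occ1 + s2 * occ2) / (1.0 + occ1 + occ2)


def simulate(tf1_fn, tf2_fn, params, decay, mrna0=0.0, t_end=10.0, dt=0.01):
    """Integrate the mRNA level with forward Euler steps.

    tf1_fn and tf2_fn map a time point to a TF activity.
    Returns the list of mRNA values at t = 0, dt, 2*dt, ..., t_end.
    """
    n_steps = int(round(t_end / dt))
    mrna = mrna0
    trajectory = [mrna]
    for i in range(n_steps):
        t = i * dt
        rate = production(tf1_fn(t), tf2_fn(t), **params) - decay * mrna
        mrna += dt * rate
        trajectory.append(mrna)
    return trajectory


if __name__ == "__main__":
    # Illustrative parameter values only; they are not estimates from any data.
    params = dict(basal=0.1, s1=2.0, s2=0.5, k1=1.0, k2=1.0)
    alone = simulate(lambda t: 1.0, lambda t: 0.0, params, decay=0.5)
    both = simulate(lambda t: 1.0, lambda t: 1.0, params, decay=0.5)
    print("final mRNA, TF 1 alone:", alone[-1])
    print("final mRNA, both TFs:  ", both[-1])
```

In this assumed form, a second TF with a lower sensitivity (s2 < s1) reduces the production rate when it is present, because it occupies sites that the stronger activator would otherwise bind. This is one simple way a competing role of the two TFs could appear in the kinetics.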

Planned Impact

The functioning of the two transcriptional coactivators, p300 and CBP, is associated with extremely important health-related problems. As such, any advance towards the understanding of the genetic regulatory mechanism of these proteins will be extremely beneficial. The work in this proposal will be important both to biologists, who will be able to gain insight into the specific biological system investigated, and to mathematicians and statisticians interested in the general methodology developed in the proposal. Clinicians working on similar health-related problems from a clinical perspective will also benefit from the research and the potential findings of this proposal. In particular, there is growing interest in methods that combine genomic information and its statistical analysis with clinical information on patient data. The pharmaceutical industry is also a potential beneficiary of this project, as advances in understanding the genetic regulatory mechanism of p300 and CBP can point towards the discovery of new drugs for the health problems associated with the malfunctioning of these two proteins. Society as a whole will benefit from the findings of this proposal, as these can potentially lead to advances in the cure of major diseases. On an individual level, the PDRA will benefit from this proposal, as they will be employed on the project for two years. In particular, they will learn to work independently as a researcher, they will have the chance to learn and progress in an extremely challenging and exciting research area, and they will gain experience of working in an interdisciplinary area by collaborating with biologists. As the PDRA will become an expert in the research area of this proposal, this in turn will have an impact on society. A number of activities will be carried out as part of our impact plan to engage and communicate with our potential beneficiaries:
1. We aim to disseminate the findings of our research by attending conferences, holding seminars, creating newsletters for our beneficiaries and publishing papers in high-impact journals.
2. We will make the code available on the PI's webpage, so that other statistical and biological groups can easily access and reproduce our methodology as well as compare it with existing ones. We will develop an R package collecting the code developed during the project, to further enable open access to our methodology.
3. We aim to develop a stronger link with clinicians working on similar biological and health-related problems, so that in the future an integrative approach can be developed between the genomic and the medical information.
 
Description: We have developed advanced methods for the analysis of ChIP-seq data. These methods allow the detection of broad regions of enrichment and account for the spatial dependencies in the genomic data.
Exploitation Route: We plan to extend the software (an R package) with additional methods and to write a journal paper dedicated to the software, so that it can be made as accessible as possible to biologists.
Sectors: Healthcare; Pharmaceuticals and Medical Biotechnology

Description: The R package that was developed as part of the award was subsequently used to analyse data from a well-known pharmaceutical company. The paper was published in the journal Neurocomputing.
First Year Of Impact: 2015
Sector: Pharmaceuticals and Medical Biotechnology
Impact Types: Societal