Managing the Data Explosion in Post-Genomic Biology with Fast Bayesian Computational Methods

Lead Research Organisation: University of Cambridge
Department Name: Engineering

Abstract

Rapid technological advances in molecular biology are providing an unprecedented opportunity to investigate the basic processes of life. This 'post-genomic' phase of molecular biology has resulted in an explosion of typically high-dimensional structured data from new technologies for transcriptomics (microarrays), proteomics and metabolomics. Such data require novel mathematical, statistical and computational methods for their interpretation and analysis. This proposal focuses on developing such methods, using novel approaches from the fields of machine learning and nonparametric Bayesian statistics. The project involves a close collaboration between scientists with expertise in machine learning and statistics, bioinformatics and molecular biology. The new software tools will be developed in the context of real-world scientific problems, such as: elucidating signalling networks in plant stress responses; metabolic regulation in the bacterial genus Streptomyces, whose members are major producers of antibiotics; and delineating the molecular mechanisms contributing to mitochondrial dysfunction in obesity and diabetes. The scientific goal of the project is to apply these novel methods to modelling bioinformatics data, but the methods developed will be broadly applicable across a number of fields.

Publications

Orbanz P (2011) Projective limit random probabilities on Polish spaces in Electronic Journal of Statistics

Orbanz P (2009) Construction of nonparametric Bayesian models from parametric Bayes equations in Advances in Neural Information Processing Systems 22 - Proceedings of the 2009 Conference

Savage RS (2010) Discovering transcriptional modules by Bayesian data integration in Bioinformatics (Oxford, England)

Williamson S (2010) Dependent Indian buffet processes in Journal of Machine Learning Research

 
Description We identified six key computational and scientific challenges, which we addressed in this project: (1) developing fast algorithms and software tools for Bayesian hierarchical clustering; (2) novel algorithms for clustering time-series data; (3) new non-parametric models for finding overlapping clusters; (4) new non-parametric models for context-dependent clustering; (5) developing an integrated software toolkit implementing the algorithms in (1), (2), (3) and (4); and (6) closed-loop modelling, hypothesis generation, and experimentation on the biological pathways discovered.
Exploitation Route Although the immediate scientific goal of the project is to apply these novel methods to modelling bioinformatics data, the methods developed will be broadly applicable across many disciplines. Examples include clustering stocks with different price dynamics in finance, clustering regions with different growth patterns in economics, and signal processing applications. We therefore anticipate that academic researchers, and ultimately industrial and commercial concerns, in these fields will be long-term beneficiaries of this research.
Sectors Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology

URL http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/
 
Description The immediate beneficiaries of the project have been our experimental collaborators at Warwick, who have already generated extensive datasets from microarray analysis of gene expression time series and of the effects of a variety of knockout mutants, experimental treatments or clinical conditions on gene expression patterns. The wider beneficiaries of the research have been the community of molecular biology researchers who have used our software in high-throughput data analysis. To ensure that the outputs of our EPSRC-supported research are widely disseminated, versions of our code have been released as open-source MATLAB code or through the R/Bioconductor environment. Although the immediate scientific goal of the project has been to apply these novel methods to modelling bioinformatics data, the methods developed will be broadly applicable across many disciplines. Examples include clustering stocks with different price dynamics in finance, clustering regions with different growth patterns in economics, and signal processing applications. We therefore anticipate that academic researchers, and ultimately industrial and commercial concerns, in these fields will be long-term beneficiaries of this research.
First Year Of Impact 2012
Sector Agriculture, Food and Drink; Healthcare
Impact Types Societal; Economic

 
Description EPSRC
Amount £289,422 (GBP)
Funding ID EP/I026827/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start  
 
Description EPSRC
Amount £1,158,512 (GBP)
Funding ID EP/I036575/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start  
 
Description Medical Research Council
Amount £436,500 (GBP)
Funding ID MRC Biostatistics Fellowship 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start