Advanced Bayesian Computation for Cross-Disciplinary Research

Lead Research Organisation: University of Cambridge
Department Name: Engineering

Abstract

We live in an era of abundant data. Rapid technological advances, such as the internet, have made it possible to collect, store and share large amounts of information more easily than ever before. The availability of large amounts of data has had a major impact on society, commerce, and the sciences.

Data plays a particularly important role in the sciences. Data is what you get from conducting experiments, and data is what you use to test scientific theories. In recent years, the amount of data collected and generated in the sciences has grown tremendously. We need better tools to model this data, so that we can understand and test theories and make scientific predictions.

Our proposal focuses on advanced statistical tools for modelling data. It is important that the models are based on probability and statistics, because any model of real-world phenomena has to represent the uncertainty that arises from incomplete information and noisy measurements. Probability theory provides a coherent mathematical language for expressing uncertainty in models. Our proposal develops models based on Bayesian statistics, which was known as "inverse probability" until the early 20th century, and which refers to the application of probability theory to learn unknown quantities from observable data. Bayesian statistics can also be used to compare multiple models (i.e. hypotheses) given the data, and thus can play a fundamental role in scientific hypothesis testing.
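For reference (standard results stated for the reader, not part of the proposal text itself), the two relationships underlying this paragraph are Bayes' rule for learning an unknown quantity from data, and the posterior odds used to compare two candidate models:

```latex
% Bayes' rule: updating beliefs about an unknown quantity \theta given data D
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}

% Comparing two models (hypotheses) M_1 and M_2 on the same data D
\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
  = \frac{P(D \mid M_1)}{P(D \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}
```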

We will develop new computational tools for Bayesian modelling, ensuring that the models are flexible enough to capture the complexity of real-world phenomena and scalable enough to deal with very large data sets. We will also develop new methods for deciding which data to collect and which experiments to perform, which can greatly reduce the cost of scientific inquiry. We will make use of the latest advances in computer hardware, in the form of massively parallel graphics processing units (GPUs), to speed up modelling of scientific data.
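As a toy illustration of the experiment-selection idea (a minimal sketch under simple Gaussian assumptions, not the methodology developed in this project), each candidate measurement can be scored by how much it is expected to reduce uncertainty about an unknown parameter:

```python
# Toy Bayesian experimental design: measurements y = theta * x + noise,
# with a Gaussian prior on theta. We pick the design point x that yields
# the largest expected reduction in posterior uncertainty about theta.
import numpy as np

prior_var = 4.0                                  # prior variance of theta
noise_var = 1.0                                  # measurement noise variance
candidate_x = np.array([0.1, 0.5, 1.0, 2.0])     # possible experiments

# With Gaussian prior and noise, the posterior variance after one measurement
# at x is available in closed form and does not depend on the observed y,
# so the expected information gain can be computed exactly.
posterior_var = 1.0 / (1.0 / prior_var + candidate_x**2 / noise_var)
info_gain = 0.5 * np.log(prior_var / posterior_var)   # entropy reduction (nats)

best = candidate_x[np.argmax(info_gain)]
print(f"most informative experiment: x = {best}")
```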

This proposal is truly cross-disciplinary in that we do not focus on a single scientific discipline. In fact, we have assembled a team whose expertise spans Bayesian modelling across the physical, biological and social sciences. We will create modelling tools for better astronomical surveying of the skies so that we can understand the composition of our universe;
we will create tools for analysing gene and protein data so that we can better understand biological phenomena and design drug therapies; and we will develop powerful methods for modelling and predicting economic and financial data which we hope will reduce risk in financial markets.

Surprisingly, these diverse areas of the sciences---astronomy, biology and economics---can come together through a unified set of computational and statistical modelling tools. Our advances will benefit not just these areas but many other areas of science based on data-intensive modelling.

Planned Impact

The project will have an impact on a number of end-users of statistical modelling techniques in biomedicine, agriculture, computational science, quantitative finance, and other data intensive fields.

Over the last two decades, and particularly with the advent of whole genome sequencing, computational methods have become central to modern molecular biology. They impact not only the progress of academic research, but also applied areas such as pharmaceutical and agricultural chemical discovery, and the discovery of biomarkers in clinical
contexts or for crop breeding programs. Improved bioinformatic techniques, then, have both an ethical impact, by potentially reducing the numbers of costly toxicological or animal experiments, and potential economic and health benefits, such as identifying desirable agronomic traits and enabling personalized genomic medicine.

In terms of industrial interfaces, the investigators have had extensive interactions with industry throughout their careers.
Ghahramani has had funded industrial collaborations in internet search and computer systems (Microsoft, Google), finance (FX Concepts), and telecommunications (Datapath, NTT), along with consultancies at GlaxoWellcome and other companies. A number of Ghahramani's former students and postdocs have moved on to positions in leading industrial labs (Citadel, Yahoo!, Google, Microsoft). Ghahramani also has experience in the commercial exploitation of research having founded a start-up company (Xyggy) which is developing novel approaches to Bayesian search. Given the intense interest from and aggressive recruitment by industry, advances in statistical machine learning are clearly of great benefit to a number of industrial sectors, most notably information technology, financial firms, and telecommunications companies.

Wild has spent part of his career in the pharmaceutical, biotechnology and bioinformatics software industries, with Allelix
Biopharmaceuticals, Oxford Molecular and GlaxoWellcome. Prior to joining the Warwick Systems Biology Centre he was a founding faculty member of the Keck Graduate Institute of Applied Life Sciences (KGI), where he played a leading role in establishing a unique Master's programme combining training in computational and systems biology and bioengineering with aspects of management, pharmaceutical development and bioscience business awareness. The pharmaceutical, biotechnology and bioinformatics fields will also greatly benefit from novel and scalable tools for statistical modelling.

We will investigate potential commercial exploitation of our research outputs where appropriate, either via collaborative efforts with our partners or through a potential spin-off. We have strong links to and support from our institutions' IP and enterprise teams (e.g. Cambridge Enterprise and Sussex Research and Enterprise Division). However, while we are sensitive to the possibilities and benefits to society of short-term commercialisation, it is not the primary driver of the research programme proposed.

Finally, a significant component of the impact of our project will come from the training of PhD and postdoctoral staff in advanced computational statistical methods. These highly skilled researchers are often recruited by industry or may even develop their own start-up companies using their invaluable skills for exploiting the information revolution. Although it is hard to predict this impact, it is clear that the skills gained in this research area have great practical relevance outside of academia.

Publications


Palla K. (2012) A nonparametric variable clustering model in Advances in Neural Information Processing Systems

Osborne M.A. (2012) Active learning of model evidence using Bayesian quadrature in Advances in Neural Information Processing Systems

Lloyd J.R. (2014) Automatic construction and natural-language description of nonparametric regression models in Proceedings of the National Conference on Artificial Intelligence

Mohamed S. (2012) Bayesian and L1 approaches for sparse unsupervised learning in Proceedings of the 29th International Conference on Machine Learning, ICML 2012

Kirk P (2012) Bayesian correlated clustering to integrate multiple datasets. in Bioinformatics (Oxford, England)

 
Description We live in an era of abundant data. Rapid technological advances, such as the internet, have made it possible to collect, store and share large amounts of information more easily than ever before. The availability of large amounts of data has had a major impact on society, commerce, and the sciences.

Data plays a particularly important role in the sciences. Data is what you get from conducting experiments, and data is what you use to test scientific theories. In recent years, the amount of data collected and generated in the sciences has grown tremendously.

In this project we have developed better tools to model this data, so that we can understand and test theories and make scientific predictions.
We developed new computational tools for Bayesian modelling, ensuring that the models are flexible enough to capture the complexity of real-world phenomena and scalable enough to deal with very large data sets. We also developed new methods for deciding which data to collect and which experiments to perform, which can greatly reduce the cost of scientific inquiry. We have explored the use of the latest advances in computer hardware, in the form of massively parallel graphics processing units (GPUs), to speed up modelling of scientific data.
Our team has expertise spanning Bayesian modelling across the physical, biological and social sciences. Our work develops modelling tools for better astronomical surveying of the skies so that we can understand the composition of our universe; we have also created tools for analysing gene and protein data so that we can better understand biological phenomena and design drug therapies; and we are developing powerful methods for modelling and predicting economic and financial data which we hope will reduce risk in financial markets.
Surprisingly, these diverse areas of the sciences---astronomy, biology and economics---can come together through a unified set of computational and statistical modelling tools. Our advances will benefit not just these areas but many other areas of science based on data-intensive modelling.
Exploitation Route There are many possible applications of the methodologies we have developed outside of the scientific areas outlined in our proposal.
Sectors Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare

 
Description The work in this proposal developed novel theory and algorithmic tools for Bayesian modelling by exploring interdisciplinary approaches from bioinformatics, astronomy, and financial econometrics. The work had impacts on each of those fields as well as on many other areas of machine learning. We developed new scalable methods for learning from Big Data and new flexible nonparametric models that can be used to learn realistic models of complex data. Our methods are used by network scientists and by plant scientists to understand biological phenomena. Our work in Bayesian time series is also of use for understanding high-frequency financial data. Since the end of this grant, the field of machine learning has evolved considerably, with a new emphasis on deep learning methods, while Bayesian nonparametric methods have waned in relative importance. Nonetheless, Bayesian methods continue to be a cornerstone methodology in astronomy and for handling uncertainty in time series. Moreover, several papers (and software artifacts) arising from this grant, on topics such as Gaussian processes, probabilistic machine learning, and the Automatic Statistician, have over 400 citations and have influenced the trajectory of subsequent research.
First Year Of Impact 2014
Sector Agriculture, Food and Drink; Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology
Impact Types Societal, Economic

 
Description Big Data Capital Grant
Amount £1,500,000 (GBP)
Funding ID BB/M018431/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 01/2015 
End 03/2017
 
Title Causal structure identification 
Description Implements the methods described in: Penfold & Wild (2011), How to infer gene networks from expression profiles, revisited, Interface Focus 1(6):857-870; and Penfold et al. (2012), Nonparametric Bayesian inference for perturbed and orthologous gene regulatory networks, Bioinformatics 28:i233-i241 
Type Of Technology Software 
Year Produced 2014 
Impact See above 
URL http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software
 
Title Compositional kernel search for Gaussian processes 
Description This is the software that accompanies the paper: James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July 2014. (An illustrative sketch of the approach follows below.) 
Type Of Technology Software 
Year Produced 2014 
Impact The code is one of the major components in the Automatic Statistician project: http://www.automaticstatistician.com/ 
URL http://www.github.com/jamesrobertlloyd/gpss-research
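For readers unfamiliar with the approach, the following is a minimal sketch of the idea behind compositional kernel search, written with scikit-learn rather than the gpss-research code itself: candidate kernel structures composed from a small grammar of base kernels are each fitted to the data and ranked by the log marginal likelihood of the resulting Gaussian process.

```python
# Minimal sketch (not the gpss-research implementation): score a few
# composite kernel structures by GP log marginal likelihood and keep the best.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, DotProduct, WhiteKernel

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 60)[:, None]
y = np.sin(X[:, 0]) + 0.1 * X[:, 0] + 0.1 * rng.standard_normal(60)

# Candidate structures built by adding and multiplying base kernels.
candidates = {
    "smooth":            RBF() + WhiteKernel(),
    "periodic":          ExpSineSquared() + WhiteKernel(),
    "linear + periodic": DotProduct() + ExpSineSquared() + WhiteKernel(),
    "smooth * periodic": RBF() * ExpSineSquared() + WhiteKernel(),
}

scores = {}
for name, kernel in candidates.items():
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    scores[name] = gp.log_marginal_likelihood_value_   # evidence for this structure

best = max(scores, key=scores.get)
print("selected structure:", best)
```

The full system searches the space of kernel structures rather than scoring a fixed list, and then translates the selected structure into a natural-language description of the data.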
 
Title GPflow. A Gaussian process library for TensorFlow. 
Description GPflow is a package for building Gaussian process models in Python, using Google's TensorFlow. It was written by James Hensman and Alexander G. de G. Matthews. (A brief usage sketch follows below.) 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact GPflow is publicly available. The authors James Hensman and Alexander G. de G. Matthews both use it in their own research, and other third-party users have downloaded the software and use it in their research. GPflow allows a Bayesian nonparametric model, namely the Gaussian process, to be implemented on multiple distributed GPUs. 
URL https://github.com/GPflow/GPflow
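A minimal usage sketch is shown below, written against the GPflow 2.x API (which has changed since the 2016 release described above, so exact names may differ by version):

```python
# Fit a GP regression model with GPflow 2.x (API names may differ by version).
import numpy as np
import gpflow

X = np.random.rand(100, 1)
Y = np.sin(6 * X) + 0.1 * np.random.randn(100, 1)

kernel = gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

# Maximise the marginal likelihood with the bundled SciPy optimiser.
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Posterior mean and variance at new inputs; the computation runs through
# TensorFlow, so it can be placed on a GPU.
mean, var = model.predict_f(np.linspace(0, 1, 200)[:, None])
```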
 
Title MDI-GPU 
Description Accelerating integrative modelling for genomic scale data using GP-GPU computing 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact publication 
URL http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/
 
Title OSCI 
Description Inferring Orthologous Gene Regulatory Networks Using Interspecies Data Fusion 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact publication 
URL http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/