Advanced Bayesian Computation for Cross-Disciplinary Research

Lead Research Organisation: University of Cambridge
Department Name: Engineering

Abstract

We live in an era of abundant data. Rapid technological advances, such as the internet, have made it possible to collect, store and share large amounts of information more easily than ever before. The availability of large amounts of data has had a major impact on society, commerce, and the sciences.

Data plays a particularly important role in the sciences. Data is what you get from conducting experiments, and data is what you use to test scientific theories. In recent years, the amount of data collected and generated in the sciences has grown tremendously. We need better tools to model this data, so that we can understand and test theories and make scientific predictions.

Our proposal focuses on advanced statistical tools for modelling data. It is important that the models are based on probability and statistics, because any model of real-world phenomena has to represent the uncertainty that arises from incomplete information and noisy measurements. Probability theory provides a coherent mathematical language for expressing uncertainty in models. Our proposal develops models based on Bayesian statistics, which was known as "inverse probability" until the early 20th century, and which refers to the application of probability theory to learn unknown quantities from observable data. Bayesian statistics can also be used to compare multiple models (i.e. hypotheses) given the data, and thus can play a fundamental role in scientific hypothesis testing.
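For reference (standard results stated for the reader, not part of the proposal text itself), the two relationships underlying this paragraph are Bayes' rule for learning an unknown quantity from data, and the posterior odds used to compare two candidate models:

```latex
% Bayes' rule: updating beliefs about an unknown quantity \theta given data D
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}

% Comparing two models (hypotheses) M_1 and M_2 on the same data D
\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
  = \frac{P(D \mid M_1)}{P(D \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}
```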

We will develop new computational tools for Bayesian modelling, ensuring that the models are flexible enough to capture the complexity of real-world phenomena and scalable enough to deal with very large data sets. We will also develop new methods for deciding which data to collect and which experiments to perform, which can greatly reduce the cost of scientific inquiry. We will make use of the latest advances in computer hardware, in the form of massively parallel graphics processing units (GPUs), to speed up modelling of scientific data.
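As a toy illustration of the experiment-selection idea (a minimal sketch under simple Gaussian assumptions, not the methodology developed in this project), each candidate measurement can be scored by how much it is expected to reduce uncertainty about an unknown parameter:

```python
# Toy Bayesian experimental design: measurements y = theta * x + noise,
# with a Gaussian prior on theta. We pick the design point x that yields
# the largest expected reduction in posterior uncertainty about theta.
import numpy as np

prior_var = 4.0                                  # prior variance of theta
noise_var = 1.0                                  # measurement noise variance
candidate_x = np.array([0.1, 0.5, 1.0, 2.0])     # possible experiments

# With Gaussian prior and noise, the posterior variance after one measurement
# at x is available in closed form and does not depend on the observed y,
# so the expected information gain can be computed exactly.
posterior_var = 1.0 / (1.0 / prior_var + candidate_x**2 / noise_var)
info_gain = 0.5 * np.log(prior_var / posterior_var)   # entropy reduction (nats)

best = candidate_x[np.argmax(info_gain)]
print(f"most informative experiment: x = {best}")
```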

This proposal is truly cross-disciplinary in that we do not focus on a single scientific discipline. In fact, we have assembled a team whose expertise spans Bayesian modelling across the physical, biological and social sciences. We will create modelling tools for better astronomical surveying of the skies so that we can understand the composition of our universe;
we will create tools for analysing gene and protein data so that we can better understand biological phenomena and design drug therapies; and we will develop powerful methods for modelling and predicting economic and financial data which we hope will reduce risk in financial markets.

Surprisingly, these diverse areas of the sciences---astronomy, biology and economics---can come together through a unified set of computational and statistical modelling tools. Our advances will benefit not just these areas but many other areas of science based on data-intensive modelling.

Planned Impact

The project will have an impact on a number of end-users of statistical modelling techniques in biomedicine, agriculture, computational science, quantitative finance, and other data intensive fields.

Over the last two decades, and particularly with the advent of whole genome sequencing, computational methods have become central to modern molecular biology. They impact not only the progress of academic research, but also applied areas such as pharmaceutical and agricultural chemical discovery, and the discovery of biomarkers in clinical
contexts or for crop breeding programs. Improved bioinformatic techniques, then, have both an ethical impact, by potentially reducing the numbers of costly toxicological or animal experiments, and potential economic and health benefits, such as identifying desirable agronomic traits and enabling personalized genomic medicine.

In terms of industrial interfaces, the investigators have had extensive interactions with industry throughout their careers.
Ghahramani has had funded industrial collaborations in internet search and computer systems (Microsoft, Google), finance (FX Concepts), and telecommunications (Datapath, NTT), along with consultancies at GlaxoWellcome and other companies. A number of Ghahramani's former students and postdocs have moved on to positions in leading industrial labs (Citadel, Yahoo!, Google, Microsoft). Ghahramani also has experience in the commercial exploitation of research having founded a start-up company (Xyggy) which is developing novel approaches to Bayesian search. Given the intense interest from and aggressive recruitment by industry, advances in statistical machine learning are clearly of great benefit to a number of industrial sectors, most notably information technology, financial firms, and telecommunications companies.

Wild has spent part of his career in the pharmaceutical, biotechnology and bioinformatics software industries, with Allelix
Biopharmaceuticals, Oxford Molecular and GlaxoWellcome. Prior to joining the Warwick Systems Biology Centre he was a founding faculty member of the Keck Graduate Institute of Applied Life Sciences (KGI), where he played a leading role in establishing a unique Master's programme combining training in computational and systems biology and bioengineering with aspects of management, pharmaceutical development and bioscience business awareness. The pharmaceutical, biotechnology and bioinformatics fields will also greatly benefit from novel and scalable tools for statistical modelling.

We will investigate potential commercial exploitation of our research outputs where appropriate, either via collaborative efforts with our partners or through a potential spin-off. We have strong links to and support from our institutions' IP and enterprise teams (e.g. Cambridge Enterprise and Sussex Research and Enterprise Division). However, while we are sensitive to the possibilities and benefits to society of short-term commercialisation, it is not the primary driver of the research programme proposed.

Finally, a significant component of the impact of our project will come from the training of PhD and postdoctoral staff in advanced computational statistical methods. These highly skilled researchers are often recruited by industry or may even develop their own start-up companies using their invaluable skills for exploiting the information revolution. Although it is hard to predict this impact, it is clear that the skills gained in this research area have great practical relevance outside of academia.

Publications


Palla K. (2012) A nonparametric variable clustering model in Advances in Neural Information Processing Systems

Osborne M.A. (2012) Active learning of model evidence using Bayesian quadrature in Advances in Neural Information Processing Systems

Lloyd J.R. (2014) Automatic construction and natural-language description of nonparametric regression models in Proceedings of the National Conference on Artificial Intelligence

Mohamed S. (2012) Bayesian and L1 approaches for sparse unsupervised learning in Proceedings of the 29th International Conference on Machine Learning, ICML 2012

Kirk P (2012) Bayesian correlated clustering to integrate multiple datasets. in Bioinformatics (Oxford, England)

 
Description We live in an era of abundant data. Rapid technological advances, such as the internet, have made it possible to collect, store and share large amounts of information more easily than ever before. The availability of large amounts of data has had a major impact on society, commerce, and the sciences.

Data plays a particularly important role in the sciences. Data is what you get from conducting experiments, and data is what you use to test scientific theories. In recent years, the amount of data collected and generated in the sciences has grown tremendously.

In this project we have developed better tools to model this data, so that we can understand and test theories and make scientific predictions.
We developed new computational tools for Bayesian modelling, ensuring that the models are flexible enough to capture the complexity of real-world phenomena and scalable enough to deal with very large data sets. We also developed new methods for deciding which data to collect and which experiments to perform, which can greatly reduce the cost of scientific inquiry. We have explored the use of the latest advances in computer hardware, in the form of massively parallel graphics processing units (GPUs), to speed up modelling of scientific data.
Our team has expertise spanning Bayesian modelling across the physical, biological and social sciences. Our work develops modelling tools for better astronomical surveying of the skies so that we can understand the composition of our universe; we have also created tools for analysing gene and protein data so that we can better understand biological phenomena and design drug therapies; and we are developing powerful methods for modelling and predicting economic and financial data which we hope will reduce risk in financial markets.
Surprisingly, these diverse areas of the sciences---astronomy, biology and economics---can come together through a unified set of computational and statistical modelling tools. Our advances will benefit not just these areas but many other areas of science based on data-intensive modelling.
Exploitation Route There are many possible applications of the methodologies we have developed outside of the scientific areas outlined in our proposal.
Sectors Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare

 
Description The work in this proposal developed novel theory and algorithmic tools for Bayesian modelling by exploring interdisciplinary approaches from bioinformatics, astronomy, and financial econometrics. The work had impacts on each of those fields as well as on many other areas of machine learning. We developed new scalable methods for learning from Big Data and new flexible nonparametric models that can be used to learn realistic models of complex data. Our methods are used by network scientists and by plant scientists to understand biological phenomena. Our work in Bayesian time series is also of use for understanding high-frequency financial data. Since the end of this grant, the field of machine learning has evolved considerably, with a new emphasis on deep learning methods, while Bayesian nonparametric methods have waned in relative importance. Nonetheless, Bayesian methods continue to be a cornerstone methodology in astronomy and for handling uncertainty in time series. Moreover, several papers (and software artifacts) arising from this grant, on topics such as Gaussian processes, probabilistic machine learning, and the Automatic Statistician, have over 400 citations and have influenced the trajectory of subsequent research.
First Year Of Impact 2014
Sector Agriculture, Food and Drink; Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology
Impact Types Societal, Economic

 
Description Big Data Capital Grant
Amount £1,500,000 (GBP)
Funding ID BB/M018431/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 01/2015 
End 03/2017
 
Title Causal structure identification 
Description Implements the methods described in: Penfold & Wild (2011), How to infer gene networks from expression profiles, revisited, Interface Focus 1(6):857-870; and Penfold et al. (2012), Nonparametric Bayesian inference for perturbed and orthologous gene regulatory networks, Bioinformatics 28:i233-i241 
Type Of Technology Software 
Year Produced 2014 
Impact See above 
URL http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software
 
Title Compositional kernel search for Gaussian processes 
Description This is the software that accompanies the paper: James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July 2014. (An illustrative sketch of the approach follows below.) 
Type Of Technology Software 
Year Produced 2014 
Impact The code is one of the major components in the Automatic Statistician project: http://www.automaticstatistician.com/ 
URL http://www.github.com/jamesrobertlloyd/gpss-research
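For readers unfamiliar with the approach, the following is a minimal sketch of the idea behind compositional kernel search, written with scikit-learn rather than the gpss-research code itself: candidate kernel structures composed from a small grammar of base kernels are each fitted to the data and ranked by the log marginal likelihood of the resulting Gaussian process.

```python
# Minimal sketch (not the gpss-research implementation): score a few
# composite kernel structures by GP log marginal likelihood and keep the best.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, DotProduct, WhiteKernel

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 60)[:, None]
y = np.sin(X[:, 0]) + 0.1 * X[:, 0] + 0.1 * rng.standard_normal(60)

# Candidate structures built by adding and multiplying base kernels.
candidates = {
    "smooth":            RBF() + WhiteKernel(),
    "periodic":          ExpSineSquared() + WhiteKernel(),
    "linear + periodic": DotProduct() + ExpSineSquared() + WhiteKernel(),
    "smooth * periodic": RBF() * ExpSineSquared() + WhiteKernel(),
}

scores = {}
for name, kernel in candidates.items():
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    scores[name] = gp.log_marginal_likelihood_value_   # evidence for this structure

best = max(scores, key=scores.get)
print("selected structure:", best)
```

The full system searches the space of kernel structures rather than scoring a fixed list, and then translates the selected structure into a natural-language description of the data.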
 
Title GPflow. A Gaussian process library for TensorFlow. 
Description GPflow is a package for building Gaussian process models in Python, using Google's TensorFlow. It was written by James Hensman and Alexander G. de G. Matthews. (A brief usage sketch follows below.) 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact GPflow is publicly available. The authors James Hensman and Alexander G. de G. Matthews both use it in their own research, and other third-party users have downloaded the software and use it in their research. GPflow allows a Bayesian nonparametric model, namely the Gaussian process, to be implemented on multiple distributed GPUs. 
URL https://github.com/GPflow/GPflow
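A minimal usage sketch is shown below, written against the GPflow 2.x API (which has changed since the 2016 release described above, so exact names may differ by version):

```python
# Fit a GP regression model with GPflow 2.x (API names may differ by version).
import numpy as np
import gpflow

X = np.random.rand(100, 1)
Y = np.sin(6 * X) + 0.1 * np.random.randn(100, 1)

kernel = gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

# Maximise the marginal likelihood with the bundled SciPy optimiser.
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Posterior mean and variance at new inputs; the computation runs through
# TensorFlow, so it can be placed on a GPU.
mean, var = model.predict_f(np.linspace(0, 1, 200)[:, None])
```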
 
Title MDI-GPU 
Description Accelerating integrative modelling for genomic scale data using GP-GPU computing 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact publication 
URL http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/
 
Title OSCI 
Description Inferring Orthologous Gene Regulatory Networks Using Interspecies Data Fusion 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact publication 
URL http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/