Advanced Bayesian Computation for Cross-Disciplinary Research
Lead Research Organisation:
University of Cambridge
Department Name: Engineering
Abstract
We live in an era of abundant data. Rapid technological advances, such as the internet, have made it possible to collect, store and share large amounts of information more easily than ever before. The availability of large amounts of data has had a major impact on society, commerce, and the sciences.
Data plays a particularly important role in the sciences. Data is what you get from conducting experiments, and data is what you use to test scientific theories. In recent years, the amount of data collected and generated in the sciences has grown tremendously. We need better tools to model this data, so that we can understand and test theories and make scientific predictions.
Our proposal focuses on advanced statistical tools for modelling data. It is important that the models are based on probability and statistics, because any model of real-world phenomena has to represent the uncertainty arising from incomplete information and noisy measurements. Probability theory provides a coherent mathematical language for expressing uncertainty in models. Our proposal develops models based on Bayesian statistics, which was known as ``inverse probability'' until the early 20th century and refers to the application of probability theory to learning unknown quantities from observed data. Bayesian statistics can also be used to compare multiple models (i.e. hypotheses) given the data, and thus can play a fundamental role in scientific hypothesis testing.
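To make the idea concrete, here is a minimal sketch of Bayesian updating using a conjugate Beta-Binomial model; the model, numbers, and function names are illustrative only and are not taken from the proposal:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior combined with
    Binomial data yields a Beta(alpha + successes, beta + failures)
    posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Uniform prior Beta(1, 1) over an unknown coin bias.
a, b = 1.0, 1.0
# Observe 7 heads and 3 tails; the posterior is Beta(8, 4).
a, b = beta_binomial_update(a, b, successes=7, failures=3)
print(beta_mean(a, b))  # posterior mean = 8/12 ≈ 0.667
```

The same logic (prior plus likelihood yields posterior) underlies all the models in the proposal, though the real models are far richer than this toy example.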
We will develop new computational tools for Bayesian modelling, ensuring that the models are flexible enough to capture the complexity of real-world phenomena and scalable enough to deal with very large data sets. We will also develop new methods for deciding which data to collect and which experiments to perform, which can greatly reduce the cost of scientific inquiry. We will make use of the latest advances in computer hardware, in the form of massively parallel graphics processing units (GPUs) to speed up modelling of scientific data.
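As a hedged illustration of the "which experiment to perform" idea, the sketch below scores candidate experiments by their expected information gain over a discrete set of hypotheses. The hypotheses, outcome probabilities, and names are invented for illustration and are not from the proposal:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_info_gain(prior, likelihoods):
    """likelihoods[h][y] = P(outcome y | hypothesis h, experiment).
    Returns prior entropy minus expected posterior entropy."""
    n_outcomes = len(likelihoods[0])
    gain = entropy(prior)
    for y in range(n_outcomes):
        # Marginal probability of observing outcome y.
        p_y = sum(prior[h] * likelihoods[h][y] for h in range(len(prior)))
        if p_y == 0:
            continue
        # Posterior over hypotheses after observing y (Bayes' rule).
        post = [prior[h] * likelihoods[h][y] / p_y for h in range(len(prior))]
        gain -= p_y * entropy(post)
    return gain

# Two hypotheses, uniform prior; experiment A separates them well, B poorly.
prior = [0.5, 0.5]
exp_A = [[0.9, 0.1], [0.1, 0.9]]   # informative experiment
exp_B = [[0.6, 0.4], [0.4, 0.6]]   # weakly informative experiment
print(expected_info_gain(prior, exp_A) > expected_info_gain(prior, exp_B))  # True
```

Choosing the experiment with the highest expected information gain is one standard formalisation of "deciding which data to collect"; the proposal's methods address the much harder continuous, large-scale versions of this problem.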
This proposal is truly cross-disciplinary in that we do not focus on a single scientific discipline. In fact, we have assembled a team whose expertise spans Bayesian modelling across the physical, biological and social sciences. We will create modelling tools for better astronomical surveying of the skies so that we can understand the composition of our universe;
we will create tools for analysing gene and protein data so that we can better understand biological phenomena and design drug therapies; and we will develop powerful methods for modelling and predicting economic and financial data which we hope will reduce risk in financial markets.
Surprisingly, these diverse areas of the sciences---astronomy, biology and economics---can come together through a unified set of computational and statistical modelling tools. Our advances will benefit not just these areas but many other areas of science based on data-intensive modelling.
Planned Impact
The project will have an impact on a number of end-users of statistical modelling techniques in biomedicine, agriculture, computational science, quantitative finance, and other data intensive fields.
Over the last two decades, and particularly with the advent of whole genome sequencing, computational methods have become central to modern molecular biology. They impact not only the progress of academic research, but also applied areas such as pharmaceutical and agricultural chemical discovery, and the discovery of biomarkers in clinical
contexts or for crop breeding programs. Improved bioinformatic techniques, then, have both an ethical impact, by potentially reducing the numbers of costly toxicological or animal experiments, and potential economic and health benefits, such as identifying desirable agronomic traits and enabling personalized genomic medicine.
In terms of industrial interfaces, the investigators have had extensive interactions with industry throughout their careers.
Ghahramani has had funded industrial collaborations in internet search and computer systems (Microsoft, Google), finance (FX Concepts), and telecommunications (Datapath, NTT), along with consultancies at GlaxoWellcome and other companies. A number of Ghahramani's former students and postdocs have moved on to positions in leading industrial labs (Citadel, Yahoo!, Google, Microsoft). Ghahramani also has experience in the commercial exploitation of research having founded a start-up company (Xyggy) which is developing novel approaches to Bayesian search. Given the intense interest from and aggressive recruitment by industry, advances in statistical machine learning are clearly of great benefit to a number of industrial sectors, most notably information technology, financial firms, and telecommunications companies.
Wild has spent part of his career in the pharmaceutical, biotechnology and bioinformatics software industries, with Allelix
Biopharmaceuticals, Oxford Molecular and GlaxoWellcome. Prior to joining Warwick Systems Biology Centre he was a founding faculty member of the Keck Graduate Institute of Applied Life Sciences (KGI), where he played a leading role in founding a unique Masters programme which combined training in computational and systems biology and bioengineering with aspects of management, pharmaceutical development and bioscience business awareness. The pharmaceutical, biotechnology and bioinformatics fields will also greatly benefit from novel and scalable tools for statistical modelling.
We will investigate commercial exploitation of our research outputs where appropriate, either via collaborative efforts with our partners or through a potential spin-off. We have strong links to and support from our institutions' IP and enterprise teams (e.g. Cambridge Enterprise and Sussex Research and Enterprise Division). However, although we are sensitive to the possibilities and benefits to society of short-term commercialisation, this is not the primary driver of the research programme proposed.
Finally, a significant component of the impact of our project will come from the training of PhD and postdoctoral staff in advanced computational statistical methods. These highly skilled researchers are often recruited by industry or may even develop their own start-up companies using their invaluable skills for exploiting the information revolution. Although it is hard to predict this impact, it is clear that the skills gained in this research area have great practical relevance outside of academia.
Organisations
Publications
Alvarez-Fernandez R
(2021)
Time-series transcriptomics reveals a BBX32-directed control of acclimation to high light in mature Arabidopsis leaves.
in The Plant Journal: for cell and molecular biology
Bechtold U
(2016)
Time-Series Transcriptomics Reveals That AGAMOUS-LIKE22 Affects Primary Metabolism and Developmental Processes in Drought-Stressed Arabidopsis
in The Plant Cell
Bratières S
(2015)
GPstruct: Bayesian Structured Prediction Using Gaussian Processes.
in IEEE transactions on pattern analysis and machine intelligence
Calliess, J-P
(2016)
Bayesian Lipschitz Constant Estimation and Quadrature
Chen Y.
(2016)
Scalable discrete sampling as a multi-armed bandit problem
in 33rd International Conference on Machine Learning, ICML 2016
Darkins R
(2013)
Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm.
in PloS one
De G. Matthews A.G.
(2016)
On sparse variational methods and the Kullback-Leibler divergence between stochastic processes
in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016
Duvenaud D.
(2013)
Structure discovery in nonparametric regression through compositional kernel search
in 30th International Conference on Machine Learning, ICML 2013
Description | We live in an era of abundant data. Rapid technological advances, such as the internet, have made it possible to collect, store and share large amounts of information more easily than ever before. The availability of large amounts of data has had a major impact on society, commerce, and the sciences. Data plays a particularly important role in the sciences: data is what we obtain from experiments, and data is what we use to test scientific theories. In recent years, the amount of data collected and generated in the sciences has grown tremendously. In this project we developed better tools to model this data, so that we can understand and test theories and make scientific predictions. We developed new computational tools for Bayesian modelling, ensuring that the models are flexible enough to capture the complexity of real-world phenomena and scalable enough to deal with very large data sets. We also developed new methods for deciding which data to collect and which experiments to perform, which can greatly reduce the cost of scientific inquiry. We explored the use of the latest advances in computer hardware, in the form of massively parallel graphics processing units (GPUs), to speed up modelling of scientific data. Our team has expertise spanning Bayesian modelling across the physical, biological and social sciences. Our work developed modelling tools for better astronomical surveying of the skies so that we can understand the composition of our universe; we also created tools for analysing gene and protein data so that we can better understand biological phenomena and design drug therapies; and we developed powerful methods for modelling and predicting economic and financial data which we hope will reduce risk in financial markets. Surprisingly, these diverse areas of the sciences---astronomy, biology and economics---can come together through a unified set of computational and statistical modelling tools. 
Our advances will benefit not just these areas but many other areas of science based on data-intensive modelling. |
Exploitation Route | There are many possible applications of the methodologies we have developed outside of the scientific areas outlined in our proposal. |
Sectors | Digital/Communication/Information Technologies (including Software); Financial Services and Management Consultancy; Healthcare |
Description | The work in this proposal developed novel theory and algorithmic tools for Bayesian modelling by exploring interdisciplinary approaches from bioinformatics, astronomy, and financial econometrics. The work had impacts on each of those fields as well as on many other areas of machine learning. We developed new scalable methods for learning from Big Data problems and new flexible nonparametric models that can be used to learn realistic models of complex data. Our methods are used by network scientists and by plant scientists to understand biological phenomena. Our work in Bayesian time series is also of use for understanding high-frequency financial data. Since the end of this grant, the field of machine learning has evolved considerably, with a new emphasis on deep learning methods, while Bayesian nonparametric methods have waned in relative importance. Nonetheless, Bayesian methods continue to be a cornerstone methodology in astronomy and for handling uncertainty in time series. Moreover, several papers (and software artifacts) arising from this grant, on topics such as Gaussian processes, probabilistic machine learning, and the automatic statistician, have over 400 citations and have influenced the trajectory of subsequent research. |
First Year Of Impact | 2014 |
Sector | Agriculture, Food and Drink; Financial Services and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology |
Impact Types | Societal; Economic |
Description | Big Data Capital Grant |
Amount | £1,500,000 (GBP) |
Funding ID | BB/M018431/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 01/2015 |
End | 03/2017 |
Title | Causal structure identification |
Description | Implements the methods described in: Penfold & Wild, 2011. How to infer gene networks from expression profiles, revisited. Interface Focus 1(6):857-870 Penfold et al., 2012. Nonparametric Bayesian inference for perturbed and orthologous gene regulatory networks. Bioinformatics 28:i233-i241 |
Type Of Technology | Software |
Year Produced | 2014 |
Impact | See above |
URL | http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software |
Title | Compositional kernel search for Gaussian processes |
Description | This is the software that accompanies the paper: James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July 2014. |
Type Of Technology | Software |
Year Produced | 2014 |
Impact | The code is one of the major components in the Automatic Statistician project: http://www.automaticstatistician.com/ |
URL | http://www.github.com/jamesrobertlloyd/gpss-research |
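For readers unfamiliar with the approach, here is a minimal illustrative sketch (not the gpss-research code itself) of one greedy round of compositional kernel search: candidate kernel structures built from sums and products of base kernels are scored by the Gaussian process log marginal likelihood, and the best-scoring structure is kept. All names, hyperparameters, and data below are invented for illustration:

```python
import numpy as np

def rbf(X1, X2):
    """Squared-exponential base kernel (unit lengthscale/variance)."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2)

def linear(X1, X2):
    """Linear base kernel."""
    return X1[:, None] * X2[None, :]

def log_marginal_likelihood(K, y, noise=0.1):
    """GP log marginal likelihood of y under kernel matrix K."""
    Ky = K + noise * np.eye(len(y))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

# Toy data: a linear trend plus a smooth wiggle.
x = np.linspace(0, 4, 30)
y = 0.5 * x + np.sin(3 * x)

# One greedy round: score each candidate kernel structure on the data.
candidates = {
    "RBF": rbf(x, x),
    "LIN": linear(x, x),
    "LIN + RBF": linear(x, x) + rbf(x, x),
    "LIN * RBF": linear(x, x) * rbf(x, x),
}
scores = {name: log_marginal_likelihood(K, y) for name, K in candidates.items()}
best = max(scores, key=scores.get)
```

The real system iterates this search over a grammar of kernel expressions and then generates a natural-language report describing the discovered structure.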
Title | GPflow. A Gaussian process library for TensorFlow. |
Description | GPflow is a package for building Gaussian process models in python, using Google's TensorFlow. It was written by James Hensman and Alexander G. de G. Matthews. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | GPflow is publicly available. The authors, James Hensman and Alexander G. de G. Matthews, both use it in their research papers. Third-party users have also downloaded the software and use it in their research. GPflow allows a Bayesian nonparametric model, namely the Gaussian process, to be implemented on multiple distributed GPUs. |
URL | https://github.com/GPflow/GPflow |
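To illustrate the computation that a Gaussian process library such as GPflow performs, here is a small self-contained numpy sketch of exact GP regression (this is not GPflow's API; the kernel, noise level, and data are illustrative assumptions):

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xs, noise=0.1):
    """Exact GP regression posterior mean and variance at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))   # noisy training covariance
    Ks = rbf(X, Xs)                          # train/test cross-covariance
    Kss = rbf(Xs, Xs)                        # test covariance
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss - v.T @ v)
    return mean, var

X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X[:, 0])
Xs = np.array([[1.5]])
mean, var = gp_posterior(X, y, Xs)  # conditioning shrinks the variance below 1
```

Libraries like GPflow implement this same posterior (and its sparse variational approximations) with automatic differentiation and GPU support, which is what makes the approach scale.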
Title | MDI-GPU |
Description | Accelerating integrative modelling for genomic scale data using GP-GPU computing |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | publication |
URL | http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/ |
Title | OSCI |
Description | Inferring Orthologous Gene Regulatory Networks Using Interspecies Data Fusion |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | publication |
URL | http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/ |