Bayesian Nonparametric Methods for Aggregated and Multivariate Outputs

Lead Research Organisation: Imperial College London
Department Name: Mathematics

Abstract

This project investigates two types of problems that arise when labelled data are scarce and expensive to obtain, as is common across the environmental and social sciences. We aim to develop novel methods that tackle these situations using flexible proxy models that encode prior beliefs and provide interpretable uncertainty quantification. The project falls within the EPSRC Mathematical Sciences research area and is partly funded by, and carried out in collaboration with, Cervest Limited, an artificial intelligence start-up focusing on Earth Science AI, and Imperial College London. This collaboration between industry and academia gives our research access to a wide array of industrial Earth observation datasets, while giving our industrial partner access to novel methodologies for their own work.

The first part, on aggregated outputs, addresses the situation where quantities are only observed, or must be averaged, over large groups of individuals or geographical areas. An important application is computing the average treatment effect of administering a pharmaceutical or policy intervention. When labelled data are scarce, this problem becomes even harder. For instance, how do we model crop yields across a large geographical region when we only know the yield of the region as a whole? The sketch below illustrates the core idea.
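As a concrete illustration, here is a minimal toy sketch (all settings are illustrative assumptions, not the project's eventual method) of how a Gaussian process can be conditioned on region-level averages alone: because averaging is a linear operation, the posterior over individual locations remains Gaussian with closed-form mean and covariance.

```python
# Toy sketch: GP regression from aggregated (bag-averaged) labels only.
import numpy as np

def rbf(a, b, lengthscale=0.5, variance=1.0):
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)                  # individual locations
f = np.sin(2 * np.pi * x)                  # hypothetical latent "yield" surface
bags = np.array_split(np.arange(60), 6)    # 6 regions of 10 locations each

# A is the averaging operator: row b averages f over the locations in bag b.
A = np.zeros((len(bags), len(x)))
for b, idx in enumerate(bags):
    A[b, idx] = 1.0 / len(idx)
y_bag = A @ f + 0.05 * rng.standard_normal(len(bags))   # region-level labels only

# GP posterior at individual locations given the aggregated observations:
# cov(y_bag) = A K A^T + noise,  cov(f, y_bag) = K A^T.
K = rbf(x, x)
S = A @ K @ A.T + 0.05 ** 2 * np.eye(len(bags))
post_mean = K @ A.T @ np.linalg.solve(S, y_bag)
post_cov = K - K @ A.T @ np.linalg.solve(S, A @ K)
print(post_mean[:5], np.sqrt(np.diag(post_cov))[:5])    # point-level estimates
```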

The second part of the project involves modelling multiple quantities, such as precipitation and temperature, jointly in a way that exploits their inter-dependence. Again, when labelled data are scarce, modelling multiple quantities jointly allows additional signal to be extracted, as the sketch below illustrates.
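To make the idea of exploiting inter-dependence concrete, the following toy sketch (the kernel, lengthscale and coupling matrix are illustrative assumptions) draws correlated samples from a two-output Gaussian process prior built with the intrinsic coregionalisation model, where a shared input kernel is coupled through an output-covariance matrix.

```python
# Toy sketch: a two-output GP prior via intrinsic coregionalisation.
import numpy as np

def rbf(a, b, lengthscale=0.2):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.linspace(0, 1, 50)
Kx = rbf(x, x)                    # shared input kernel
B = np.array([[1.0, 0.8],
              [0.8, 1.0]])        # output-covariance: strong inter-dependence

# Joint covariance over both outputs: Kron(B, Kx) is 100x100.
K = np.kron(B, Kx) + 1e-8 * np.eye(2 * len(x))
rng = np.random.default_rng(1)
sample = rng.multivariate_normal(np.zeros(2 * len(x)), K)
temp, precip = sample[:len(x)], sample[len(x):]   # correlated output draws
print(np.corrcoef(temp, precip)[0, 1])
```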

To capture complex interactions between covariates and outputs, nonparametric methods, which allow for infinitely many model parameters, such as Gaussian processes (GPs), provide a flexible way of encoding prior beliefs, and there is a rich literature on using GPs in label-scarce, feature-rich settings (Law et al., 2018; Hamelijnck et al., 2019). GPs encode prior beliefs through normal distributions and provide uncertainty quantification, which is highly desirable in risk-sensitive applications. Recently, tree-based models (Chipman et al., 2010; Lakshminarayanan et al., 2016), in which the prior partitions individuals into subgroups or space into subregions, have attracted interest in the machine learning community, yielding results highly competitive with GPs. Like GPs, tree-based models are flexible nonparametric models that provide uncertainty quantification, but the properties of tree-based priors have yet to be fully exploited in more complex applications. We will develop novel nonparametric methodologies to address the aims of this project.
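As a rough illustration of the uncertainty quantification tree ensembles can offer (a plain random forest, not BART or a Mondrian forest, so only a crude stand-in for the posterior uncertainty of Bayesian tree models), the spread of per-tree predictions widens where data are scarce:

```python
# Rough sketch: per-tree disagreement as a proxy for predictive uncertainty.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))                  # scarce labelled data
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(40)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
Xs = np.linspace(0, 1, 5)[:, None]
per_tree = np.stack([t.predict(Xs) for t in forest.estimators_])
print(per_tree.mean(axis=0))   # ensemble prediction
print(per_tree.std(axis=0))    # disagreement between trees ~ uncertainty
```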

We will first develop novel nonparametric modelling approaches for applications involving aggregated outputs. We will then develop flexible models for multiple outputs, with broad applications in the environmental sciences in mind.

References:
Chipman, H.A., George, E.I. and McCulloch, R.E., 2010. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), pp.266-298.

Hamelijnck, O., Damoulas, T., Wang, K. and Girolami, M., 2019. Multi-resolution multi-task Gaussian processes. In Advances in Neural Information Processing Systems (pp. 14025-14035).

Lakshminarayanan, B., Roy, D.M. and Teh, Y.W., 2016. Mondrian forests for large-scale regression when uncertainty matters. In Artificial Intelligence and Statistics (pp. 1478-1487).

Law, H.C., Sejdinovic, D., Cameron, E., Lucas, T., Flaxman, S., Battle, K. and Fukumizu, K., 2018. Variational learning on aggregate outputs with Gaussian processes. In Advances in Neural Information Processing Systems (pp. 6081-6091).

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their career. There are a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/S023151/1                                      01/04/2019   30/09/2027
2283505             Studentship    EP/S023151/1   01/10/2019   30/09/2023   Harrison Zhu
 
Description

Our main goals are to build statistical models that can help us make predictions and explain how those predictions were made. Both are important for decision-making, since decision-makers want to make accurate decisions while also accounting for potential risks. We are mainly interested in two types of real-world problems: (1) data labels that only come as aggregated quantities, such as crop yields over a region; (2) data labels with multiple variables, e.g. temperature and precipitation, or the pixels of a 32x32 image.

For (1), we developed a novel Gaussian process-based statistical model for aggregated data labels that also accounts for the multi-resolution nature of potential features (such as precipitation and temperature), which helps us make more accurate predictions. We also developed a novel probabilistic integration method that computes integrals (which are themselves aggregated quantities) while quantifying the uncertainty of the estimates.
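For intuition, the sketch below shows the standard Bayesian quadrature construction that probabilistic integration builds on (a toy example with an assumed RBF lengthscale, not the method developed in this project): a GP prior on the integrand induces a Gaussian posterior over its integral on [0, 1], with closed-form mean and variance.

```python
# Toy Bayesian quadrature sketch: a GP on the integrand gives a Gaussian
# posterior over the integral itself.
import numpy as np
from scipy.special import erf

ell = 0.3                                   # assumed RBF lengthscale

def k(a, b):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def kernel_mean(a):                         # z_i = int_0^1 k(x, a_i) dx
    return ell * np.sqrt(np.pi / 2) * (erf((1 - a) / (np.sqrt(2) * ell))
                                       + erf(a / (np.sqrt(2) * ell)))

f = lambda x: np.sin(3 * x) + x ** 2        # toy integrand
xn = np.linspace(0, 1, 8)                   # a handful of evaluations
K = k(xn, xn) + 1e-10 * np.eye(len(xn))
z = kernel_mean(xn)
w = np.linalg.solve(K, z)                   # quadrature weights

post_mean = w @ f(xn)                       # posterior mean of the integral
kk = ell * np.sqrt(2 * np.pi) * erf(1 / (np.sqrt(2) * ell)) \
     + 2 * ell ** 2 * (np.exp(-0.5 / ell ** 2) - 1)   # int int k dx dx'
post_var = kk - z @ w                       # posterior variance
print(post_mean, post_var)                  # estimate and its uncertainty
```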

For (2), we developed a probabilistic deep learning model that can efficiently model very high-dimensional temporal and spatiotemporal datasets, such as videos and climate data. The method outperforms many existing benchmark models in both predictive performance and computational scalability, and can also aid interpretability.

We showed in our publications that these methods can be used in a variety of real-world applications, such as agriculture, climate modelling and video data modelling.
Exploitation Route

All these accomplishments have corresponding open-source publications/working papers and code that the research community can extend in the future. We have made sure that these outcomes are highly accessible to the general public. Given that many months of the award remain, we will continue to complete further contributions to (1) and (2). Examples include modelling multivariate outputs that exhibit jump behaviour via stochastic differential equations, and modelling many high-dimensional datasets more efficiently via meta-learning.
Sectors

Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment