Turing AI Fellowship: Probabilistic Algorithms for Scalable and Computable Approaches to Learning (PASCAL)

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics

Abstract

We are living in an unprecedented age where vast quantities of our personal data are continually recorded and analysed, for example, our travel patterns, shopping habits and fitness routines. Our daily lives are now tied into this evolving loop of data collection, leading to data-based automated decisions that make recommendations and optimise our routines. There is tremendous economic and societal value in understanding this deluge of unstructured, disparate data streams. A key challenge in Artificial Intelligence (AI) research is to extract meaningful value from these data sources to make decisions that can be trusted and understood to improve society.

The PASCAL research programme is focused on developing an end-to-end framework, from data to decisions, that naturally accounts for data uncertainty and provides transparent and interpretable decision-making tools. The algorithms developed throughout this research project will be generally-applicable in a wide range of application domains and appropriate for modern computer hardware infrastructure. All of the research and associated algorithms will be widely available through high-quality open-source software that will ensure the widest possible uptake of this research within the international AI research community.

PASCAL will focus on two primary application areas, cybersecurity and transportation, which will stimulate and motivate this research and ensure widespread impact within these sectors. To drive the impact and uptake of this research within these sectors, we will work closely with committed strategic partners: GCHQ, the Heilbronn Institute of Mathematical Research, Transport Research Laboratory, the University of Washington and the Alan Turing Institute.

Cybersecurity - The proliferation of computers and mobile technology over the last few decades has led to an exponential increase in recorded data. Much of this data is personally, economically and nationally sensitive and protecting it is a key priority for any government or large organisation. Threats to data security exist on a global scale and identifying potential threats requires cybersecurity experts to evaluate and extract critical intelligence from complex and evolving data sources. Modelling and understanding the intricate patterns between these data sources requires complex mathematical models. The PASCAL programme will develop new algorithms that maintain the richness of these mathematical models and use them to provide interpretable and transparent decision recommendations.

Autonomous vehicles (AV) - The transition to AVs will be the most significant global change in transportation of the past century. The economic benefit and successful implementation of this technology within the UK requires a thorough understanding of the risks posed by driverless vehicles and what new procedures are required to ensure human safety. Through PASCAL, we will develop a framework to artificially generate realistic traffic scenarios to test AVs under a wide range of road conditions and create criteria to safely accredit AVs in the UK.

Publications

Turnbull K (2023) Sequential estimation of temporally evolving latent space network models in Computational Statistics & Data Analysis

Nemeth C (2021) Stochastic Gradient Markov Chain Monte Carlo in Journal of the American Statistical Association

Fairbrother J (2022) GaussianProcesses.jl : A Nonparametric Bayes Package for the Julia Language in Journal of Statistical Software

Coullon J (2021) Ensemble sampler for infinite-dimensional inverse problems in Statistics and Computing

 
Description The focus of this grant has been to develop probabilistic approaches to machine learning which can accurately capture real-world uncertainties. Through collaboration with project partners, namely TRL, Shell, and Tesco, this work is being developed to address challenges within these companies. For example, in the case of Shell, our work is being developed to track methane emissions from Shell facilities. Our work with Tesco is developing a new optimisation scheme which will automatically adjust price discounts in stores. The overarching theme of these strands of work is to develop fast and computationally scalable approaches to probabilistic modelling that preserve uncertainty quantification in order to make robust real-world decisions.
Exploitation Route The publications that are currently being developed will be widely available through open-source licenses. Additionally, software that implements these techniques is already in development and will allow users from other sectors to utilise this work.
Sectors Environment, Retail, Security and Diplomacy, Transport

 
Description Bayesian inverse modelling and data assimilation of atmospheric emissions.
Amount £120,000 (GBP)
Funding ID 2605180 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2021 
End 09/2025
 
Description Scalable Monte Carlo in the General Big Data Setting.
Amount £120,000 (GBP)
Funding ID 1949442 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 10/2017 
End 05/2022
 
Description Bayesian inverse modelling and data assimilation of atmospheric emissions 
Organisation Shell International Petroleum
Department Shell UK Ltd
Country United Kingdom 
Sector Private 
PI Contribution Our research focuses on locating the source(s) and quantifying the emission rate(s) of anthropogenic greenhouse gases, with a focus on methane. To do so, we are modelling gas dispersion in the atmosphere and implementing probabilistic inversion for source characterisation. We are predicting spatio-temporal gas dispersion using Gaussian plume models and computational fluid dynamics models based on the Navier-Stokes equations, and assessing their computational cost and accuracy under different atmospheric conditions. Additionally, we are developing novel methodologies involving gradient-based MCMC algorithms and Gaussian processes to perform efficient probabilistic inversion, which identifies source location(s) from gas concentration measurements. Due to the high-dimensional nature of the problem, MCMC inversion is computationally expensive. Hence, this research aims to create models which are computationally fast and widely applicable, including for live tracking of emissions by drones or satellites.
Collaborator Contribution Shell is providing data and domain expertise in modelling air currents from their Statistics team. The team at Shell is dedicating a significant amount of staff time to meet with the Lancaster University research team, meeting at least once per week. Additionally, they are financially supporting opportunities for visits to Shell HQ.
Impact No significant outputs to report at this time.
Start Year 2022
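
As an illustration of the Gaussian plume modelling and inversion described in the PI Contribution above, the following is a minimal sketch and not the project's code. It assumes a single point source at a known location, a steady wind along the x-axis, and plume widths that grow linearly with downwind distance (the coefficients a_y and a_z are hypothetical). Because concentration is linear in the emission rate, a flat prior with independent Gaussian measurement noise gives a posterior mean for the rate that coincides with the least-squares estimate.

```python
import math

def gaussian_plume(x, y, z, rate, wind_speed, source_height, a_y=0.08, a_z=0.06):
    """Steady-state concentration at (x, y, z) downwind of a point source.

    x is the downwind distance, y the crosswind offset and z the height.
    The linear spread sigma = a * x is an illustrative assumption.
    """
    if x <= 0:
        return 0.0  # this model gives no concentration upwind of the source
    sigma_y = a_y * x
    sigma_z = a_z * x
    crosswind = math.exp(-y ** 2 / (2 * sigma_y ** 2))
    # The second term is an image source that models reflection at the ground.
    vertical = (math.exp(-(z - source_height) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + source_height) ** 2 / (2 * sigma_z ** 2)))
    return rate / (2 * math.pi * wind_speed * sigma_y * sigma_z) * crosswind * vertical

def estimate_rate(sensors, readings, wind_speed, source_height):
    """Least-squares emission rate for a source at a known location."""
    unit = [gaussian_plume(x, y, z, 1.0, wind_speed, source_height)
            for x, y, z in sensors]
    return sum(c * g for c, g in zip(readings, unit)) / sum(g * g for g in unit)

# Hypothetical sensor positions and noise-free synthetic readings.
sensors = [(100.0, 0.0, 2.0), (200.0, 10.0, 2.0), (300.0, -20.0, 2.0)]
true_rate = 0.5
readings = [gaussian_plume(x, y, z, true_rate, 3.0, 10.0) for x, y, z in sensors]
estimated_rate = estimate_rate(sensors, readings, wind_speed=3.0, source_height=10.0)
```

Locating an unknown source makes the problem nonlinear in the source coordinates, which is where sampling-based inversion such as MCMC becomes relevant.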
 
Description Collaboration with Dr Leah South 
Organisation Queensland University of Technology (QUT)
Country Australia 
Sector Academic/University 
PI Contribution My team and I are meeting with Dr South on a weekly basis to prepare a research project for academic publication.
Collaborator Contribution Dr Leah South is advising our project on the application of Stein's method within the context of stochastic gradient MCMC.
Impact Currently in development
Start Year 2020
 
Description Data subsampling for scalable inference 
Organisation Stanford University
Country United States 
Sector Academic/University 
PI Contribution Monte Carlo methods are often required to produce exact inference and to evaluate models in the Bayesian setting. These algorithms are widely implemented by scientists and industrial practitioners, due to their versatility and strong theoretical properties. Unfortunately, standard Monte Carlo algorithms are ill-suited for conducting inference on large datasets. This is because they require an evaluation over the full dataset at each iteration, leading to a computational cost that increases (at the very least) proportionally with the data size. These issues have prompted considerable interest amongst the machine learning and statistics communities in developing Bayesian inference methods which scale easily with the size of the data. The project has developed new scalable Markov chain Monte Carlo (MCMC) algorithms based on stochastic gradient MCMC. In particular, we have developed new techniques for modelling temporally-varying data and new ways to optimally subsample data, which lead to lower-variance stochastic gradient estimates.
Collaborator Contribution This project has been in collaboration with Prof Emily Fox (formerly of the University of Washington). Prof Fox is a world leader in statistical machine learning and her expertise has been invaluable in the development of scalable MCMC techniques in the temporally-evolving setting.
Impact Two publications were produced as a result of this collaboration. One paper has been accepted for publication in AISTATS and a second publication is currently under review.
Start Year 2018
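
To make the idea of data subsampling concrete, here is a minimal sketch of stochastic gradient Langevin dynamics (SGLD), a standard stochastic gradient MCMC algorithm, with uniform minibatches. It is not the optimal subsampling scheme developed in the project. The model (Gaussian observations with unit variance and a Gaussian prior on the mean), the step size and the function name are illustrative assumptions.

```python
import math
import random

def sgld_gaussian_mean(data, batch_size, step, n_iters, prior_sd=10.0, seed=0):
    """SGLD for the mean of N(theta, 1) data with a N(0, prior_sd^2) prior."""
    rng = random.Random(seed)
    n_total = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_iters):
        batch = rng.sample(data, batch_size)  # uniform subsample, no replacement
        grad_prior = -theta / prior_sd ** 2
        # Rescale the minibatch gradient so it is unbiased for the full-data gradient.
        grad_lik = (n_total / batch_size) * sum(x - theta for x in batch)
        theta += 0.5 * step * (grad_prior + grad_lik) + rng.gauss(0.0, math.sqrt(step))
        samples.append(theta)
    return samples

# Synthetic data; each iteration touches 50 of the 1000 observations.
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
samples = sgld_gaussian_mean(data, batch_size=50, step=1e-4, n_iters=5000)

kept = samples[1000:]  # discard burn-in
estimate = sum(kept) / len(kept)
posterior_mean = sum(data) / (len(data) + 1 / 10.0 ** 2)  # exact answer for this model
```

With a fixed step size the chain targets only an approximation to the posterior, and the noise in the minibatch gradient inflates its spread, which is one motivation for lower-variance gradient estimates.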
 
Description Diffusion-based Deep Generative Models for Assessing Safety in Autonomous Vehicles 
Organisation Transport Research Laboratory Ltd (TRL)
Country United Kingdom 
Sector Private 
PI Contribution This project is working towards developing new deep generative models based on diffusion models. This is a recent and growing field of machine learning, where the goal is to learn a probability distribution from a finite set of samples. This classical problem in statistics has been studied for many decades, but until recently efficient learning of high-dimensional distributions remained impossible in practice. Recent advances in the field of deep generative modelling aim to learn the unknown data-generating distribution using neural network models to generate fake, yet realistic-looking data, such as images and videos, and compare the output to real datasets. The goal of this project is to use ideas from deep generative modelling to create sufficiently complex road scenarios that can be used within autonomous vehicle simulators, such as CARLA.
Collaborator Contribution TRL is providing data and domain expertise to this project. In particular, TRL has access to road accident data which we are using to train our machine-learning models. The team at TRL is also providing domain expertise, particularly with regard to the CARLA simulation software, and assisting our team with implementation challenges.
Impact This project is still in the early stages of development and there are no outcomes to report yet.
Start Year 2022
 
Description Optimising In-Store Price Reductions 
Organisation Tesco Plc
Country United Kingdom 
Sector Private 
PI Contribution Demand for a product does not remain constant throughout its lifetime. As time progresses, a product is deemed less desirable by customers due to factors such as declining quality or the release of newer, improved products. Retailers often wish to maximise revenue, and keeping prices constant while demand is decreasing is unlikely to achieve this. This project looks at pricing strategies for products towards the end of their saleable lifetime, known as markdowns. This project focusses on in-store markdown pricing of a vast array of product types, which requires adaptable solutions. Our current methods use a two-stage approach: first predicting the demand for products and then using this to find the optimal price(s) for the remaining sales period. We are using novel methods for predicting demand and optimising within markdowns and are interested in considering a holistic approach where the uncertainty of demand is taken into account within the optimisation routine.
Collaborator Contribution Tesco has provided data and IT equipment from their stores which has allowed us to develop probabilistic models of the product demand for multiple products. Tesco has also been very actively engaged in directing this project with regular meetings with Tesco staff and site visits to their HQ to support further discussions.
Impact Publications are currently in progress and we are working towards implementing new techniques within Tesco's systems.
Start Year 2021
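
The two-stage approach described in the PI Contribution above can be sketched as follows. This is an illustrative toy, not the method used with Tesco: the log-linear demand model, the grid of candidate prices and both function names are assumptions, and the sketch uses a point prediction of demand rather than accounting for its uncertainty.

```python
import math

def fit_log_linear_demand(prices, sales):
    """Stage 1: least-squares fit of log(sales) = alpha - beta * price."""
    n = len(prices)
    logs = [math.log(s) for s in sales]
    mean_p = sum(prices) / n
    mean_l = sum(logs) / n
    slope = (sum((p - mean_p) * (l - mean_l) for p, l in zip(prices, logs))
             / sum((p - mean_p) ** 2 for p in prices))
    alpha = mean_l - slope * mean_p
    return alpha, -slope

def best_markdown_price(alpha, beta, stock, candidates):
    """Stage 2: pick the candidate price with the highest predicted revenue."""
    def revenue(price):
        demand = math.exp(alpha - beta * price)
        return price * min(demand, stock)  # cannot sell more than the remaining stock
    return max(candidates, key=revenue)

# Synthetic history generated exactly from the assumed demand model.
prices = [1.0, 2.0, 3.0, 4.0]
sales = [math.exp(5.0 - p) for p in prices]
alpha, beta = fit_log_linear_demand(prices, sales)

price_ample_stock = best_markdown_price(alpha, beta, stock=1000, candidates=[0.5, 1.0, 1.5, 2.0])
price_low_stock = best_markdown_price(alpha, beta, stock=20, candidates=[0.5, 1.0, 1.5, 2.0, 2.5])
```

When stock is scarce, the best price rises because every remaining unit can still be sold at the higher price. A holistic approach would replace the point prediction in stage 2 with an expectation over the demand distribution.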
 
Description Statistical analysis of multiple interaction data 
Organisation Elsevier
Department Elsevier UK
Country United Kingdom 
Sector Private 
PI Contribution Elsevier provides various online services and tools for researchers, such as Mendeley and ScienceDirect, and is interested in the problem of user segmentation: understanding who its users are and how they interact with its platforms. Our goal is to develop novel methodologies to assist with this task. Of particular interest is the analysis of clickstream data, which contains information regarding visits of users to Elsevier webpages. The data has two key properties that we are leveraging. Namely, it is both intermittent and bursty, with cascades of clicks in quick succession followed by periods of inactivity. This has provided a means to interpret this as network data. Using the intermittent and bursty properties of these data, we are able to partition a single user's data into a sequence of paths over webpages. This represents an instance of a so-called interaction network, where one observes interactions amongst entities over time (here entities=webpages and interactions=paths). This differs subtly from the case where relations amongst entities are observed explicitly, such as in traditional social network data, and has led to recent work in the literature on new models.
Collaborator Contribution Elsevier has provided data and domain expertise that has assisted in our analysis. We have had regular meetings with Elsevier staff and visits to their offices. These interactions have been invaluable to making progress on this project.
Impact Two publications on this work are currently in submission.
Start Year 2019
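
One simple way to partition a user's clickstream into paths, as described in the PI Contribution above, is to start a new path whenever the gap between consecutive clicks exceeds a threshold. The sketch below assumes this gap rule and a 30-minute threshold. The source does not state how the partitioning is actually done, so both are hypothetical.

```python
def split_into_paths(clicks, max_gap):
    """Split time-ordered (timestamp, page) clicks into paths over webpages.

    A new path starts whenever more than max_gap seconds pass between clicks.
    """
    paths = []
    last_time = None
    for time, page in clicks:
        if last_time is None or time - last_time > max_gap:
            paths.append([])  # a period of inactivity ends the previous burst
        paths[-1].append(page)
        last_time = time
    return paths

# Hypothetical clickstream for one user: timestamps in seconds, page labels.
clicks = [(0, "home"), (5, "search"), (12, "article"),
          (4000, "home"), (4003, "library")]
paths = split_into_paths(clicks, max_gap=1800)
```

Each resulting path is one interaction among webpages, so a user's history becomes a sequence of such interactions.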
 
Description Statistical network modelling for populations of networks 
Organisation Elsevier
Department Elsevier UK
Country United Kingdom 
Sector Private 
PI Contribution Developing a tool to cluster researchers who use Elsevier's platforms.
Collaborator Contribution Elsevier has provided data and technical expertise which has allowed us to make methodological developments on this project.
Impact Ongoing
Start Year 2019
 
Title GaussianProcesses.jl 
Description Gaussian processes are a family of stochastic processes which provide a flexible nonparametric tool for modelling data. A Gaussian Process places a prior over functions, and can be described as an infinite dimensional generalisation of a multivariate Normal distribution. Moreover, the joint distribution of any finite collection of points is a multivariate Normal. This process can be fully characterised by its mean and covariance functions, where the mean of any point in the process is described by the mean function and the covariance between any two observations is specified by the kernel. Given a set of observed real-valued points over a space, the Gaussian Process is used to make inference on the values at the remaining points in the space. This package allows the user to fit exact Gaussian process models when the observations are Gaussian distributed about the latent function. In the case where the observations are non-Gaussian, the posterior distribution of the latent function is intractable. The package allows for Monte Carlo sampling from the posterior. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact This software is widely used by the Julia community. 
URL https://github.com/STOR-i/GaussianProcesses.jl
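
The exact Gaussian-observation case described above can be illustrated with a short, dependency-free sketch. It is written in Python for illustration and does not use or reproduce the API of GaussianProcesses.jl. The zero mean function, the squared-exponential kernel with unit hyperparameters and all function names are assumptions.

```python
import math

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-(a - b) ** 2 / (2 * lengthscale ** 2))

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def solve_cholesky(lower, rhs):
    """Solve (L L^T) x = rhs by forward and then backward substitution."""
    n = len(rhs)
    z = [0.0] * n
    for i in range(n):
        z[i] = (rhs[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise_var=1e-2):
    """Posterior mean and variance of the latent function at x_new."""
    gram = [[se_kernel(a, b) + (noise_var if i == j else 0.0)
             for j, b in enumerate(x_train)]
            for i, a in enumerate(x_train)]
    lower = cholesky(gram)
    k_star = [se_kernel(a, x_new) for a in x_train]
    weights = solve_cholesky(lower, y_train)
    mean = sum(k * w for k, w in zip(k_star, weights))
    reduction = solve_cholesky(lower, k_star)
    var = se_kernel(x_new, x_new) - sum(k * r for k, r in zip(k_star, reduction))
    return mean, var

x_train = [-1.0, 0.0, 1.0]
y_train = [math.sin(x) for x in x_train]
mean_near, var_near = gp_predict(x_train, y_train, 1.0, noise_var=1e-6)
mean_far, var_far = gp_predict(x_train, y_train, 10.0, noise_var=1e-6)
```

Near the observations the posterior variance shrinks towards the noise level, while far from them the prediction reverts to the prior mean and variance.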
 
Title SGMCMC R package 
Description This software implements a host of stochastic gradient MCMC algorithms for fast Bayesian inference. This software has been developed for the R language and is built upon the Google TensorFlow library. Utilising the efficient computation of TensorFlow and, in particular, the automatic differentiation tools available through TensorFlow, this software is the first R package which provides a simple user interface for statisticians to use gradient-based MCMC algorithms, without requiring the gradients to be hand-coded.
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software has only recently been released and is yet to achieve its full potential. However, several papers have already cited this software in their work, indicating that it is being used within the community. 
URL https://github.com/STOR-i/sgmcmc
 
Title SGMCMCJax 
Description The software provides a toolbox of algorithms for stochastic gradient Markov chain Monte Carlo (MCMC). The package builds on the JAX library to offer users automatic differentiation tools that can be used to create gradient-based MCMC samplers.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact The software has been used in publications and is part of a new book on probabilistic machine learning written by Kevin Murphy. 
URL https://github.com/jeremiecoullon/SGMCMCJax
 
Description Presentation at the Royal Statistical Society 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Gave a presentation at an RSS workshop on Bayesian computation for Stein's method.
Year(s) Of Engagement Activity 2021
URL https://rss.org.uk/training-events/events/events-2021/sections/rss-applied-probability-and-computati...
 
Description Presentation to SecondMind 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Industry/Business
Results and Impact This presentation showed how Bayesian inference algorithms can be applied without learning rates. This new class of algorithms is highly efficient and removes the need for users to hand-tune the learning rate parameters. The presentation led to an interesting discussion with the audience members on the extensions of this approach. The presentation was given to the machine learning team at SecondMind.
Year(s) Of Engagement Activity 2023
URL https://www.secondmind.ai/labs/seminars/
 
Description Talk at Imperial College London 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact This talk was given to approximately 30 people at a seminar for the Statistics group at Imperial College. The talk covered stochastic gradient MCMC methods and how standard methods are inefficient without utilising control variate approaches.
Year(s) Of Engagement Activity 2022
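
The variance reduction that control variates bring to stochastic gradient estimates can be sketched as below. This is an illustration under stated assumptions rather than the material presented in the talk: the model is a one-parameter logistic regression on synthetic data, the reference point theta_hat stands in for an approximate posterior mode, and all function names are hypothetical.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_i(theta, x, y):
    """Gradient of the log-likelihood of one logistic-regression observation."""
    return (y - sigmoid(theta * x)) * x

def plain_estimate(theta, data, batch):
    """Standard minibatch estimate of the full-data gradient."""
    return len(data) / len(batch) * sum(grad_i(theta, *data[i]) for i in batch)

def control_variate_estimate(theta, theta_hat, full_grad_hat, data, batch):
    """Full gradient at theta_hat plus a minibatch estimate of the difference."""
    correction = sum(grad_i(theta, *data[i]) - grad_i(theta_hat, *data[i])
                     for i in batch)
    return full_grad_hat + len(data) / len(batch) * correction

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
data = []
for _ in range(2000):
    x = rng.gauss(0.0, 1.0)
    y = 1 if rng.random() < sigmoid(1.5 * x) else 0
    data.append((x, y))

theta_hat = 1.5  # assumed reference point, computed once before sampling
theta = 1.55     # current sampler state, close to the reference point
full_grad_hat = sum(grad_i(theta_hat, x, y) for x, y in data)

plain, with_cv = [], []
for _ in range(500):
    batch = rng.sample(range(len(data)), 20)
    plain.append(plain_estimate(theta, data, batch))
    with_cv.append(control_variate_estimate(theta, theta_hat, full_grad_hat, data, batch))

var_plain = variance(plain)
var_cv = variance(with_cv)
```

Both estimators are unbiased for the full-data gradient, but the control variate version has lower variance when theta is close to theta_hat, at the cost of one full pass over the data to compute the gradient at the reference point.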
 
Description Talk at Leeds University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Research talk to the Mathematics department at the University of Leeds. The focus of the talk was on data science for environmental science challenges and, in particular, on how the environmental data scientists at Leeds could collaborate further with Lancaster University.
Year(s) Of Engagement Activity 2022
 
Description Talk at the conference of the International Society for Bayesian Analysis
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This talk covered the recent developments in scalable Markov chain Monte Carlo and many of the pitfalls that exist with current methods. The audience was international and mostly university academics.
Year(s) Of Engagement Activity 2022