Turing AI Fellowship: Probabilistic Algorithms for Scalable and Computable Approaches to Learning (PASCAL)

Lead Research Organisation: Lancaster University

Department Name: Mathematics and Statistics

Abstract

We are living in an unprecedented age where vast quantities of our personal data are continually recorded and analysed, for example, our travel patterns, shopping habits and fitness routines. Our daily lives are now tied into this evolving loop of data collection, leading to data-based automated decisions that can make recommendations and optimise our routines. There is tremendous economic and societal value in understanding this deluge of unstructured, disparate data streams. A key challenge in Artificial Intelligence (AI) research is to extract meaningful value from these data sources to make decisions that can be trusted and understood to improve society.

The PASCAL research programme is focused on developing an end-to-end framework, from data to decisions, that naturally accounts for data uncertainty and provides transparent and interpretable decision-making tools. The algorithms developed throughout this research project will be generally applicable across a wide range of application domains and appropriate for modern computer hardware infrastructure. All of the research and associated algorithms will be widely available through high-quality open-source software, which will ensure the widest possible uptake of this research within the international AI research community.

PASCAL will focus on two primary application areas, cybersecurity and transportation, which will stimulate and motivate this research and ensure widespread impact within these sectors. To drive the impact and uptake of this research within these sectors, we will work closely with committed strategic partners: GCHQ, the Heilbronn Institute for Mathematical Research, Transport Research Laboratory, the University of Washington and the Alan Turing Institute.

Cybersecurity - The proliferation of computers and mobile technology over the last few decades has led to an exponential increase in recorded data. Much of this data is personally, economically and nationally sensitive, and protecting it is a key priority for any government or large organisation. Threats to data security exist on a global scale and identifying potential threats requires cybersecurity experts to evaluate and extract critical intelligence from complex and evolving data sources. Modelling and understanding the intricate patterns between these data sources requires complex mathematical models. The PASCAL programme will develop new algorithms that maintain the richness of these mathematical models and use them to provide interpretable and transparent decision recommendations.

Autonomous vehicles (AV) - The transition to AVs will be the most significant global change in transportation in the past century. Realising the economic benefit and successful implementation of this technology within the UK requires a thorough understanding of the risks posed by driverless vehicles and of the new procedures required to ensure human safety. Through PASCAL, we will develop a framework to artificially generate realistic traffic scenarios to test AVs under a wide range of road conditions and create criteria to safely accredit AVs in the UK.

Funded Value:

£1,097,294

Funded Period:

Jan 21 - Dec 25

Funder:

EPSRC

Project Status:

Active

Project Category:

Fellowship

Project Reference:

EP/V022636/1

Principal Investigator:

Christopher Nemeth

Research Subject:

Info. & commun. Technol. (55%)

Mathematical sciences (45%)

Research Topic:

Artificial Intelligence (55%)

Statistics & Appl. Probability (45%)

Organisations

People	ORCID iD
Christopher Nemeth (Principal Investigator / Fellow)	http://orcid.org/0000-0002-9084-3866

Publications

Vyner C (2022) SwISS: A scalable Markov chain Monte Carlo divide-and-conquer strategy in Stat

Vyner C (2023) SwISS: A scalable Markov chain Monte Carlo divide-and-conquer strategy in Stat

Turnbull K (2023) Latent Space Modeling of Hypergraph Data in Journal of the American Statistical Association

Turnbull K (2023) Sequential estimation of temporally evolving latent space network models in Computational Statistics & Data Analysis

South L (2022) Semi-exact control functionals from Sard's method in Biometrika

Shu Q (2023) Characterising the ice sheet surface in Northeast Greenland using Sentinel-1 SAR data in Journal of Glaciology

Sharrock L (2023) Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates in Proceedings of Machine Learning Research

Sharrock L (2022) Sequential Neural Score Estimation: Likelihood-Free Inference with Conditional Score Based Diffusion Models

Sharrock L (2023) Online parameter estimation for the McKean-Vlasov stochastic differential equation in Stochastic Processes and their Applications

Sharrock L (2023) Tuning-Free Maximum Likelihood Training of Latent Variable Models via Coin Betting

Sharrock L (2023) Learning Rate Free Sampling in Constrained Domains

Sarti D (2023) Bayesian additive regression trees for genotype by environment interaction models in The Annals of Applied Statistics

Putcha S (2023) Preferential Subsampling for Stochastic Gradient Langevin Dynamics in Proceedings of Machine Learning Research

Oyebamiji O (2023) Multivariate sensitivity analysis for a large-scale climate impact and adaptation model in Journal of the Royal Statistical Society Series C: Applied Statistics

Nemeth C (2021) Stochastic Gradient Markov Chain Monte Carlo in Journal of the American Statistical Association

Mimnagh N (2023) Modelling Insect Populations in Agricultural Landscapes

Mimnagh N (2022) Bayesian multi-species N-mixture models for unmarked animal communities in Environmental and Ecological Statistics

Fairbrother J (2022) GaussianProcesses.jl : A Nonparametric Bayes Package for the Julia Language in Journal of Statistical Software

Coullon J (2022) Markov chain Monte Carlo for a hyperbolic Bayesian inverse problem in traffic flow modeling in Data-Centric Engineering

Coullon J (2023) Efficient and generalizable tuning strategies for stochastic gradient MCMC in Statistics and Computing

Coullon J (2022) SGMCMCJax: a lightweight JAX library for stochastic gradient Markov chain Monte Carlo algorithms in Journal of Open Source Software

Coullon J (2021) Ensemble sampler for infinite-dimensional inverse problems in Statistics and Computing

Cabezas A (2023) Transport Elliptical Slice Sampling in Proceedings of Machine Learning Research

Aicher C (2023) Stochastic Gradient MCMC for Nonlinear State Space Models in Bayesian Analysis

Key Findings
Further Funding
Collaboration
Software and Technical Products
Engagement Activities


Description	The focus of this grant has been to develop probabilistic approaches to machine learning which can accurately capture real-world uncertainties. Through collaboration with project partners, namely TRL, Shell, and Tesco, this work is being developed to address challenges within these companies. For example, in the case of Shell, our work is being developed to track methane emissions from Shell facilities. Our work with Tesco is developing a new optimisation scheme which will automatically adjust price discounts in stores. The overarching theme of these strands of work is to develop fast and computationally scalable approaches to probabilistic modelling which preserve uncertainty quantification in order to make robust real-world decisions.
Exploitation Route	The publications that are currently being developed will be widely available through open-source licenses. Additionally, software that implements these techniques is already in development and will allow users from other sectors to utilise this work.
Sectors	Environment,Retail,Security and Diplomacy,Transport


Description	Bayesian inverse modelling and data assimilation of atmospheric emissions.
Amount	£120,000 (GBP)
Funding ID	2605180
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	09/2021
End	09/2025


Description	Scalable Monte Carlo in the General Big Data Setting.
Amount	£120,000 (GBP)
Funding ID	1949442
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	10/2017
End	05/2022


Description	Bayesian inverse modelling and data assimilation of atmospheric emissions
Organisation	Shell International Petroleum
Department	Shell UK Ltd
Country	United Kingdom
Sector	Private
PI Contribution	Our research focuses on locating the source(s) and quantifying the emission rate(s) of anthropogenic greenhouse gases, with a focus on methane. To do so, we are modelling gas dispersion in the atmosphere and implementing probabilistic inversion for source characterisation. We are predicting spatio-temporal gas dispersion using Gaussian plume and other models from computational fluid dynamics based on Navier-Stokes equations and assessing their computational cost and accuracy under different atmospheric conditions. Additionally, we are developing novel methodologies involving gradient-based MCMC algorithms and Gaussian Processes to perform efficient probabilistic inversion, which identifies source location(s) based on gas concentration measurements. Due to the high-dimensional nature of the problem, MCMC inversion is computationally expensive. Hence, this research is undertaken with the aim of creating models which are computationally fast and applicable, including for live tracking of emissions by drones or satellites.
Collaborator Contribution	Shell is providing data and domain expertise in modelling air currents from their Statistics team. The team at Shell is dedicating a significant amount of staff time to meet with the Lancaster University research team, meeting at least once per week. Additionally, they are financially supporting opportunities for visits to Shell HQ.
Impact	No significant outputs to report at this time.
Start Year	2022


Description	Collaboration with Dr Leah South
Organisation	Queensland University of Technology (QUT)
Country	Australia
Sector	Academic/University
PI Contribution	My team and I are meeting with Dr South on a weekly basis to prepare a research project for academic publication.
Collaborator Contribution	Dr Leah South is advising our project on the application of Stein's method within the context of stochastic gradient MCMC.
Impact	Currently in development
Start Year	2020


Description	Data subsampling for scalable inference
Organisation	Stanford University
Country	United States
Sector	Academic/University
PI Contribution	Monte Carlo methods are often required to produce exact inference and to evaluate models in the Bayesian setting. These algorithms are widely implemented by scientists and industrial practitioners, due to their versatility and strong theoretical properties. Unfortunately, standard Monte Carlo algorithms are ill-suited for conducting inference on large datasets. This is because they require complete evaluations of the full data at each iteration, leading to a computational cost that increases (at the very least) proportionally with the data size. These issues have prompted considerable interest amongst the machine learning and statistics communities in developing Bayesian inference methods which can scale easily in relation to the size of the data. The project has developed new scalable Markov chain Monte Carlo (MCMC) algorithms based on stochastic gradient MCMC. In particular, we have developed new techniques for modelling temporally-varying data and new ways to subsample data optimally, which lead to lower-variance stochastic gradient estimates.
Collaborator Contribution	This project has been in collaboration with Prof Emily Fox (formerly of the University of Washington). Prof Fox is a world leader in statistical machine learning and her expertise has been invaluable in the development of scalable MCMC techniques in the temporally-evolving setting.
Impact	Two publications were produced as a result of this collaboration. One paper has been accepted for publication in AISTATS and a second publication is currently under review.
Start Year	2018
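
The subsampling idea described in the PI Contribution above can be illustrated with stochastic gradient Langevin dynamics (SGLD), one standard stochastic gradient MCMC algorithm. The following is only a minimal sketch under stated assumptions, not the project's own method: the toy model (Gaussian data with unit variance and a Normal(0, 10^2) prior on the mean), the uniform minibatch sampling, the function name and all default values are illustrative choices.

```python
import math
import random


def sgld_gaussian_mean(data, n_iters=5000, batch_size=10, step=1e-4, seed=0):
    """SGLD samples for the mean of N(theta, 1) data under a N(0, 10^2) prior.

    Toy model chosen for illustration only (an assumption, not from the project).
    """
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_iters):
        # Uniform subsample without replacement: only batch_size points are touched.
        batch = rng.sample(data, batch_size)
        # Unbiased estimate of the full-data log-posterior gradient:
        # prior gradient + (n / batch_size) * sum of minibatch likelihood gradients.
        grad_prior = -theta / 100.0
        grad_lik = (n / batch_size) * sum(x - theta for x in batch)
        grad = grad_prior + grad_lik
        # Langevin update: half a step along the gradient plus N(0, step) noise.
        theta += 0.5 * step * grad + math.sqrt(step) * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples
```

Each iteration costs O(batch_size) rather than O(n). The price is extra variance from the noisy gradient, which is the motivation for the lower-variance subsampling schemes mentioned above.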


Description	Diffusion-based Deep Generative Models for Assessing Safety in Autonomous Vehicles
Organisation	Transport Research Laboratory Ltd (TRL)
Country	United Kingdom
Sector	Private
PI Contribution	This project is working towards developing new deep generative models based on diffusion models. This is a recent and growing field of machine learning, where the goal is to learn a probability distribution from a finite set of samples. This classical problem in statistics has been studied for many decades, but until recently efficient learning of high-dimensional distributions remained impossible in practice. Recent advances in the field of deep generative modelling aim to learn the unknown data-generating distribution using neural network models to generate fake, yet realistic-looking data, such as images and videos, and compare the output to real datasets. The goal of this project is to use ideas from deep generative modelling to create sufficiently complex road scenarios that can be used within autonomous vehicle simulators, such as CARLA.
Collaborator Contribution	TRL is providing data and domain expertise to this project. In particular, TRL has access to road accident data which we are using to train our machine-learning models. The team at TRL is also providing domain expertise, particularly with regard to the CARLA simulation software, and assisting our team with implementation challenges.
Impact	This project is still in the early stages of development and there are no outcomes to report yet.
Start Year	2022


Description	Optimising In-Store Price Reductions
Organisation	Tesco Plc
Country	United Kingdom
Sector	Private
PI Contribution	Demand for a product does not remain consistent throughout its lifetime. As time progresses, a product is deemed less desirable by customers due to factors such as declining quality or newer, improved products being released. We often wish to maximise revenue, and keeping prices consistent while demand is decreasing is not likely to achieve this. This project looks at pricing strategies for products towards the end of their saleable lifetime, known as markdowns. This project focusses on in-store markdown pricing of a vast array of product types, which requires adaptable solutions. Our current methods use a two-stage approach: first predicting the demand for products and then using this to find the optimal price(s) for the remaining sales period. We are using novel methods for predicting demand and optimising within markdowns and are interested in considering a holistic approach where the uncertainty of demand is taken into account within the optimisation routine.
Collaborator Contribution	Tesco has provided data and IT equipment from their stores which has allowed us to develop probabilistic models of the product demand for multiple products. Tesco has also been very actively engaged in directing this project with regular meetings with Tesco staff and site visits to their HQ to support further discussions.
Impact	Publications are currently in progress and we are working towards implementing new techniques within Tesco's systems.
Start Year	2021
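
The two-stage approach described in the PI Contribution above (predict demand, then optimise price) can be sketched as follows. This is an illustrative sketch only: the log-linear demand curve, the grid search over candidate prices, and every function and parameter name here are assumptions made for the example, not the models used in the project.

```python
import math


def fit_log_linear_demand(prices, sales):
    """Stage 1: least-squares fit of log(sales) = log(a) - b * price.

    Assumes all sales figures are strictly positive.
    """
    n = len(prices)
    logs = [math.log(s) for s in sales]
    mean_p = sum(prices) / n
    mean_l = sum(logs) / n
    slope = (sum((p - mean_p) * (l - mean_l) for p, l in zip(prices, logs))
             / sum((p - mean_p) ** 2 for p in prices))
    log_a = mean_l - slope * mean_p
    return math.exp(log_a), -slope  # (a, b) in demand(p) = a * exp(-b * p)


def best_markdown_price(a, b, stock, periods, candidate_prices):
    """Stage 2: pick the candidate price with the highest predicted revenue."""
    def revenue(price):
        # Predicted sales cannot exceed the remaining stock.
        units = min(stock, periods * a * math.exp(-b * price))
        return price * units
    return max(candidate_prices, key=revenue)
```

Stage 2 here treats the fitted demand curve as exact. The holistic approach mentioned above would instead carry the uncertainty in the demand estimate into the optimisation, for example by averaging revenue over plausible demand parameters.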


Description	Statistical analysis of multiple interaction data
Organisation	Elsevier
Department	Elsevier UK
Country	United Kingdom
Sector	Private
PI Contribution	Elsevier provides various online services and tools for researchers, such as Mendeley and ScienceDirect, and is interested in the problem of user segmentation - understanding who their users are and how they interact with their platforms. Our goal is to develop novel methodologies to assist with this task. Of particular interest is the analysis of clickstream data, which contains information regarding visits of users to Elsevier webpages. The data has two key properties that we are leveraging. Namely, it is both intermittent and bursty, with cascades of clicks in quick succession followed by periods of inactivity. This has provided a means to interpret this as network data. Using the intermittent and bursty properties of these data, we are able to partition a single user's data into a sequence of paths over webpages. This represents an instance of a so-called interaction network, where one observes interactions amongst entities over time (here entities=webpages and interactions=paths). This differs subtly from the case where relations amongst entities are observed explicitly, such as in traditional social network data, and has led to recent work in the literature on new models.
Collaborator Contribution	Elsevier has provided data and domain expertise that has assisted in our analysis. We have had regular meetings with Elsevier staff and visits to their offices. These interactions have been invaluable to making progress on this project.
Impact	Two publications on this work are currently in submission.
Start Year	2019
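
The partitioning of a single user's clickstream into paths, described in the PI Contribution above, can be sketched with a simple inactivity-gap rule. This is a minimal illustration under assumptions: the record does not say how the partition is actually made, and the (timestamp, page) input format, the 30-minute threshold and the function name are all hypothetical.

```python
def split_into_paths(clicks, max_gap_seconds=1800):
    """Split one user's clicks into paths (bursts of activity).

    clicks: list of (timestamp_seconds, page) tuples for a single user.
    A new path starts whenever the gap since the previous click exceeds
    max_gap_seconds (a hypothetical threshold, not taken from the project).
    """
    paths = []
    previous_time = None
    for timestamp, page in sorted(clicks):
        if previous_time is None or timestamp - previous_time > max_gap_seconds:
            paths.append([])  # a period of inactivity closes the previous path
        paths[-1].append(page)
        previous_time = timestamp
    return paths
```

Each returned path is an ordered list of webpages, so a user's history becomes a sequence of interactions (paths) amongst entities (webpages), matching the interaction-network view described above.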


Description	Statistical network modelling for populations of networks
Organisation	Elsevier
Department	Elsevier UK
Country	United Kingdom
Sector	Private
PI Contribution	Developing a tool to cluster researchers who use Elsevier's platforms.
Collaborator Contribution	Elsevier has provided data and technical expertise which has allowed us to make methodological developments on this project.
Impact	Ongoing
Start Year	2019


Title	GaussianProcesses.jl
Description	Gaussian processes are a family of stochastic processes which provide a flexible nonparametric tool for modelling data. A Gaussian Process places a prior over functions, and can be described as an infinite dimensional generalisation of a multivariate Normal distribution. Moreover, the joint distribution of any finite collection of points is a multivariate Normal. This process can be fully characterised by its mean and covariance functions, where the mean of any point in the process is described by the mean function and the covariance between any two observations is specified by the kernel. Given a set of observed real-valued points over a space, the Gaussian Process is used to make inference on the values at the remaining points in the space. This package allows the user to fit exact Gaussian process models when the observations are Gaussian distributed about the latent function. In the case where the observations are non-Gaussian, the posterior distribution of the latent function is intractable. The package allows for Monte Carlo sampling from the posterior.
Type Of Technology	Software
Year Produced	2015
Open Source License?	Yes
Impact	This software is widely used by the Julia community.
URL	https://github.com/STOR-i/GaussianProcesses.jl
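
The exact Gaussian process regression described above (Gaussian observations about a latent function) has a closed-form posterior, sketched below in plain Python. This is not the GaussianProcesses.jl API: the squared-exponential kernel, the zero mean function, the dense linear solve and all function names are assumptions chosen to keep the example self-contained.

```python
import math


def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and variance of the latent function at one test input.

    Zero-mean GP prior with an RBF kernel and Gaussian observation noise.
    """
    K = [[rbf_kernel(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    k_star = [rbf_kernel(x_test, a) for a in x_train]
    alpha = solve(K, y_train)   # (K + noise * I)^{-1} y
    v = solve(K, k_star)        # (K + noise * I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf_kernel(x_test, x_test) - sum(k * w for k, w in zip(k_star, v))
    return mean, var
```

Far from the training inputs the posterior reverts to the prior (mean 0, variance equal to the kernel variance), while near them the variance shrinks. For non-Gaussian observations no such closed form exists, which is why the package falls back on Monte Carlo sampling.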


Title	SGMCMC R package
Description	This software implements a host of stochastic gradient MCMC algorithms for fast Bayesian inference. This software has been developed for the R language and is built upon the Google TensorFlow library. Utilising the efficient computation of TensorFlow, and in particular the automatic differentiation tools available through TensorFlow, this software is the first R package which provides a simple user interface for statisticians to use gradient-based MCMC algorithms without requiring the gradients to be hand-coded.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	The software has only recently been released and is yet to achieve its full potential. However, several papers have already cited this software in their work, indicating that it is being used within the community.
URL	https://github.com/STOR-i/sgmcmc


Title	SGMCMCJax
Description	The software provides a toolbox of algorithms for stochastic gradient Markov chain Monte Carlo (MCMC). The package builds on the JAX library to offer users automatic differentiation tools that can be used to create gradient-based MCMC samplers.
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	The software has been used in publications and is part of a new book on probabilistic machine learning written by Kevin Murphy.
URL	https://github.com/jeremiecoullon/SGMCMCJax


Description	Presentation at the Royal Statistical Society
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Gave a presentation at an RSS workshop on Bayesian computation for Stein's method.
Year(s) Of Engagement Activity	2021
URL	https://rss.org.uk/training-events/events/events-2021/sections/rss-applied-probability-and-computati...


Description	Presentation to SecondMind
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Industry/Business
Results and Impact	This presentation showed how Bayesian inference algorithms can be applied without learning rates. This new class of algorithms is highly efficient and removes the need for users to hand-tune the learning rate parameters. The presentation led to an interesting discussion with the audience members on the extensions of this approach. The presentation was given to the machine learning team at SecondMind.
Year(s) Of Engagement Activity	2023
URL	https://www.secondmind.ai/labs/seminars/


Description	Talk at Imperial College London
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	This talk was given to approximately 30 people at a seminar for the Statistics group at Imperial College. The talk covered stochastic gradient MCMC methods and how standard methods are inefficient without utilising control variate approaches.
Year(s) Of Engagement Activity	2022


Description	Talk at Leeds University
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Postgraduate students
Results and Impact	Research talk to the Mathematics department at the University of Leeds. The focus of the talk was on data science for environmental science challenges. In particular, it covered how the environmental data scientists at Leeds could collaborate further with Lancaster University.
Year(s) Of Engagement Activity	2022


Description	Talk at the conference of the International Society for Bayesian Analysis
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This talk covered the recent developments in scalable Markov chain Monte Carlo and many of the pitfalls that exist with current methods. The audience was international and mostly university academics.
Year(s) Of Engagement Activity	2022
