Turing AI Fellowship: Probabilistic Algorithms for Scalable and Computable Approaches to Learning (PASCAL)
Lead Research Organisation:
Lancaster University
Department Name: Mathematics and Statistics
Abstract
We are living in an unprecedented age where vast quantities of our personal data are continually recorded and analysed, for example, our travel patterns, shopping habits and fitness routines. Our daily lives are now tied into this evolving loop of data collection, leading to data-based automated decisions that can make recommendations and optimise our routines. There is tremendous economic and societal value in understanding this deluge of unstructured, disparate data streams. A key challenge in Artificial Intelligence (AI) research is to extract meaningful value from these data sources to make decisions that can be trusted and understood to improve society.
The PASCAL research programme is focused on developing an end-to-end framework, from data to decisions, that naturally accounts for data uncertainty and provides transparent and interpretable decision-making tools. The algorithms developed throughout this research project will be generally applicable across a wide range of application domains and appropriate for modern computer hardware infrastructure. All of the research and associated algorithms will be made widely available through high-quality open-source software, which will ensure the widest possible uptake of this research within the international AI research community.
PASCAL will focus on two primary application areas, cybersecurity and transportation, which will stimulate and motivate this research and ensure widespread impact within these sectors. To drive the impact and uptake of this research within these sectors, we will work closely with committed strategic partners: GCHQ, the Heilbronn Institute for Mathematical Research, Transport Research Laboratory, the University of Washington and the Alan Turing Institute.
Cybersecurity - The proliferation of computers and mobile technology over the last few decades has led to an exponential increase in recorded data. Much of this data is personally, economically and nationally sensitive, and protecting it is a key priority for any government or large organisation. Threats to data security exist on a global scale, and identifying potential threats requires cybersecurity experts to evaluate and extract critical intelligence from complex and evolving data sources. Modelling and understanding the intricate patterns between these data sources requires complex mathematical models. The PASCAL programme will develop new algorithms that maintain the richness of these mathematical models and use them to provide interpretable and transparent decision recommendations.
Autonomous vehicles (AV) - The transition to AVs will be the most significant global change in transportation in the past century. The economic benefit and successful implementation of this technology within the UK require a thorough understanding of the risks posed by driverless vehicles and of the new procedures needed to ensure human safety. Through PASCAL, we will develop a framework to artificially generate realistic traffic scenarios to test AVs under a wide range of road conditions and create criteria to safely accredit AVs in the UK.
Organisations
- Lancaster University (Fellow, Lead Research Organisation)
- Elsevier (Collaboration)
- Queensland University of Technology (QUT) (Collaboration)
- Transport Research Laboratory Ltd (TRL) (Collaboration)
- Stanford University (Collaboration)
- Shell International Petroleum (Collaboration)
- Tesco (United Kingdom) (Collaboration)
- Transport Research Laboratory (United Kingdom) (Project Partner)
- Government Communications Headquarters (Project Partner)
- Heilbronn Institute for Mathematical Research (Project Partner)
- University of Washington (Project Partner)
Publications
Vyner C
(2022)
SwISS: A scalable Markov chain Monte Carlo divide-and-conquer strategy
in Stat
Vyner C
(2023)
SwISS: A scalable Markov chain Monte Carlo divide-and-conquer strategy
in Stat
Turnbull K
(2023)
Latent Space Modeling of Hypergraph Data
in Journal of the American Statistical Association
Turnbull K
(2023)
Sequential estimation of temporally evolving latent space network models
in Computational Statistics & Data Analysis
South L
(2022)
Semi-exact control functionals from Sard's method
in Biometrika
Shu Q
(2023)
Characterising the ice sheet surface in Northeast Greenland using Sentinel-1 SAR data
in Journal of Glaciology
Sharrock L.
(2023)
Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates
in Proceedings of Machine Learning Research
Sharrock L
(2023)
Online parameter estimation for the McKean-Vlasov stochastic differential equation
in Stochastic Processes and their Applications
Sharrock L
(2023)
Learning Rate Free Sampling in Constrained Domains
Sarti D
(2023)
Bayesian additive regression trees for genotype by environment interaction models
in The Annals of Applied Statistics
Putcha S.
(2023)
Preferential Subsampling for Stochastic Gradient Langevin Dynamics
in Proceedings of Machine Learning Research
Oyebamiji O
(2023)
Multivariate sensitivity analysis for a large-scale climate impact and adaptation model
in Journal of the Royal Statistical Society Series C: Applied Statistics
Nemeth C
(2021)
Stochastic Gradient Markov Chain Monte Carlo
in Journal of the American Statistical Association
Mimnagh N
(2023)
Modelling Insect Populations in Agricultural Landscapes
Mimnagh N
(2022)
Bayesian multi-species N-mixture models for unmarked animal communities
in Environmental and Ecological Statistics
Fairbrother J
(2022)
GaussianProcesses.jl : A Nonparametric Bayes Package for the Julia Language
in Journal of Statistical Software
Coullon J
(2022)
Markov chain Monte Carlo for a hyperbolic Bayesian inverse problem in traffic flow modeling
in Data-Centric Engineering
Coullon J
(2023)
Efficient and generalizable tuning strategies for stochastic gradient MCMC
in Statistics and Computing
Coullon J
(2022)
SGMCMCJax: a lightweight JAX library for stochastic gradient Markov chain Monte Carlo algorithms
in Journal of Open Source Software
Coullon J
(2021)
Ensemble sampler for infinite-dimensional inverse problems
in Statistics and Computing
Cabezas A.
(2023)
Transport Elliptical Slice Sampling
in Proceedings of Machine Learning Research
Aicher C
(2023)
Stochastic Gradient MCMC for Nonlinear State Space Models
in Bayesian Analysis
Description | The focus of this grant has been to develop probabilistic approaches to machine learning which can accurately capture real-world uncertainties. Through collaboration with project partners, namely TRL, Shell, and Tesco, this work is being developed to address challenges within these companies. For example, our work with Shell is being developed to track methane emissions from Shell facilities, and our work with Tesco is developing a new optimisation scheme which will automatically adjust price discounts in stores. The overarching theme of these strands of work is to develop fast and computationally scalable approaches to probabilistic modelling that preserve uncertainty quantification in order to make robust real-world decisions. |
Exploitation Route | The publications that are currently being developed will be widely available through open-source licenses. Additionally, software that implements these techniques is already in development and will allow users from other sectors to utilise this work. |
Sectors | Environment,Retail,Security and Diplomacy,Transport |
Description | Bayesian inverse modelling and data assimilation of atmospheric emissions. |
Amount | £120,000 (GBP) |
Funding ID | 2605180 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 09/2021 |
End | 09/2025 |
Description | Scalable Monte Carlo in the General Big Data Setting. |
Amount | £120,000 (GBP) |
Funding ID | 1949442 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 10/2017 |
End | 05/2022 |
Description | Bayesian inverse modelling and data assimilation of atmospheric emissions |
Organisation | Shell International Petroleum |
Department | Shell UK Ltd |
Country | United Kingdom |
Sector | Private |
PI Contribution | Our research focuses on locating the source(s) and quantifying the emission rate(s) of anthropogenic greenhouse gases, with a focus on methane. To do so, we are modelling gas dispersion in the atmosphere and implementing probabilistic inversion for source characterisation. We are predicting spatio-temporal gas dispersion using Gaussian plume models and other computational fluid dynamics models based on the Navier-Stokes equations, and assessing their computational cost and accuracy under different atmospheric conditions. Additionally, we are developing novel methodologies involving gradient-based MCMC algorithms and Gaussian processes to perform efficient probabilistic inversion, which identifies source location(s) from gas concentration measurements. Due to the high-dimensional nature of the problem, MCMC inversion is computationally expensive. Hence, this research aims to create models that are computationally fast and practically applicable, including for live tracking of emissions by drones or satellites. |
Collaborator Contribution | Shell is providing data and domain expertise in modelling air currents from their Statistics team. The team at Shell is dedicating a significant amount of staff time to meet with the Lancaster University research team, meeting at least once per week. Additionally, they are financially supporting opportunities for visits to Shell HQ. |
Impact | No significant outputs to report at this time. |
Start Year | 2022 |
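The Gaussian plume model named in the PI contribution above can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the linear dispersion coefficients (`a_y`, `a_z`), the wind speed, the release height and the sensor layout are all assumptions chosen for illustration. Because the modelled concentration is linear in the emission rate, a single rate can be recovered by least squares. That simple estimator stands in here for the gradient-based MCMC inversion described above; under Gaussian measurement noise and a flat prior it coincides with the posterior mean.

```python
import math


def plume_concentration(rate, x, y, z, wind_speed=5.0, release_height=2.0,
                        a_y=0.08, a_z=0.06):
    """Steady-state Gaussian plume concentration at (x, y, z).

    The source sits at the origin and the wind blows along +x.
    All default parameter values are illustrative assumptions.
    """
    if x <= 0:
        return 0.0  # nothing is transported upwind of the source
    sigma_y = a_y * x  # assumed linear growth of the plume width
    sigma_z = a_z * x  # assumed linear growth of the plume depth
    lateral = math.exp(-y ** 2 / (2 * sigma_y ** 2))
    # The second term is the usual ground-reflection image source.
    vertical = (math.exp(-(z - release_height) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + release_height) ** 2 / (2 * sigma_z ** 2)))
    return rate / (2 * math.pi * wind_speed * sigma_y * sigma_z) * lateral * vertical


def estimate_emission_rate(sensors, readings):
    """Least-squares emission rate from sensor readings.

    Uses the fact that concentration = rate * (concentration at unit rate).
    """
    unit = [plume_concentration(1.0, *s) for s in sensors]
    numerator = sum(u * r for u, r in zip(unit, readings))
    denominator = sum(u * u for u in unit)
    return numerator / denominator


# Hypothetical sensor positions (metres) and noise-free synthetic readings.
sensors = [(50.0, 0.0, 1.0), (100.0, 10.0, 1.0), (200.0, -20.0, 2.0)]
readings = [plume_concentration(3.0, *s) for s in sensors]
estimated_rate = estimate_emission_rate(sensors, readings)
```

Locating the source as well as its rate makes the problem nonlinear, which is where sampling-based inversion becomes relevant.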
Description | Collaboration with Dr Leah South |
Organisation | Queensland University of Technology (QUT) |
Country | Australia |
Sector | Academic/University |
PI Contribution | My team and I are meeting with Dr South on a weekly basis to prepare a research project for academic publication. |
Collaborator Contribution | Dr Leah South is advising our project on the application of Stein's method within the context of stochastic gradient MCMC |
Impact | Currently in development |
Start Year | 2020 |
Description | Data subsampling for scalable inference |
Organisation | Stanford University |
Country | United States |
Sector | Academic/University |
PI Contribution | Monte Carlo methods are often required to produce exact inference and to evaluate models in the Bayesian setting. These algorithms are widely implemented by scientists and industrial practitioners, due to their versatility and strong theoretical properties. Unfortunately, standard Monte Carlo algorithms are ill-suited for conducting inference on large datasets. This is because they require complete evaluations of the full data at each iteration, leading to a computational cost that increases (at the very least) proportionally with the data size. These issues have prompted considerable interest amongst the machine learning and statistics communities to develop Bayesian inference methods which can scale easily in relation to the size of the data. The project has developed new scalable Markov chain Monte Carlo (MCMC) algorithms based on stochastic gradient MCMC. In particular, we have developed new techniques for modelling temporally-varying data and new ways to optimally subsample data which leads to lower variance stochastic gradient estimates. |
Collaborator Contribution | This project has been in collaboration with Prof Emily Fox (formerly of the University of Washington). Prof Fox is a world leader in statistical machine learning and her expertise has been invaluable in the development of scalable MCMC techniques in the temporally-evolving setting. |
Impact | Two publications were produced as a result of this collaboration. One paper has been accepted for publication in AISTATS and a second publication is currently under review. |
Start Year | 2018 |
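To make the idea of stochastic gradient MCMC concrete, the sketch below runs stochastic gradient Langevin dynamics (SGLD) on a toy problem: inferring the mean of Gaussian data with known unit variance. It is an illustration only, not the algorithms developed in this collaboration. The function name, step size, batch size and prior are assumptions, and the minibatches are drawn uniformly at random rather than with the optimised subsampling schemes described above.

```python
import math
import random


def sgld_gaussian_mean(data, n_steps=2000, batch_size=50, step_size=1e-4,
                       prior_sd=10.0, seed=0):
    """SGLD samples for the mean of N(theta, 1) data with a N(0, prior_sd^2) prior."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        # Uniform subsample: each step touches batch_size points, not all n.
        batch = rng.sample(data, batch_size)
        grad_prior = -theta / prior_sd ** 2
        # Rescale the minibatch gradient so it is unbiased for the full-data gradient.
        grad_likelihood = (n / batch_size) * sum(x - theta for x in batch)
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta += 0.5 * step_size * (grad_prior + grad_likelihood) + noise
        samples.append(theta)
    return samples


# Synthetic data with true mean 2.0 (illustrative values).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
draws = sgld_gaussian_mean(data)
kept = draws[500:]  # discard early iterations as burn-in
posterior_mean = sum(kept) / len(kept)
```

The per-iteration cost depends on `batch_size` rather than on the size of the dataset, which is the source of the scalability discussed above. The price is extra variance in the gradient estimate, which is what better subsampling schemes aim to reduce.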
Description | Diffusion-based Deep Generative Models for Assessing Safety in Autonomous Vehicles |
Organisation | Transport Research Laboratory Ltd (TRL) |
Country | United Kingdom |
Sector | Private |
PI Contribution | This project is working towards developing new deep generative models based on diffusion models. This is a recent and growing field of machine learning, where the goal is to learn a probability distribution from a finite set of samples. This classical problem in statistics has been studied for many decades, but until recently efficient learning of high-dimensional distributions remained impossible in practice. Recent advances in the field of deep generative modelling aim to learn the unknown data-generating distribution using neural network models to generate fake, yet realistic-looking data, such as images and videos, and compare the output to real datasets. The goal of this project is to use ideas from deep generative modelling to create sufficiently complex road scenarios that can be used within autonomous vehicle simulators, such as CARLA. |
Collaborator Contribution | TRL is providing data and domain expertise to this project. In particular, TRL has access to road accident data which we are using to train our machine-learning models. The team at TRL is also providing domain expertise, particularly with regard to the CARLA simulation software, and assisting our team with implementation challenges. |
Impact | This project is still in the early stages of development and there are no outcomes to report yet. |
Start Year | 2022 |
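As background to the diffusion models mentioned above, the sketch below shows the forward (noising) half of a standard denoising-diffusion construction, in which data are gradually blended with Gaussian noise according to a variance schedule. The record does not say which diffusion formulation the project uses, so the linear schedule, its end points and the function names are assumptions. The learned reverse process, which is where the neural network enters, is omitted.

```python
import math
import random


def linear_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar[t] = prod_{s<=t} (1 - beta_s)."""
    alpha_bar = []
    running = 1.0
    for t in range(n_steps):
        beta = beta_start + (beta_end - beta_start) * t / (n_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_sample(x0, t, alpha_bar, rng):
    """Draw x_t given x_0: sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    signal_scale = math.sqrt(alpha_bar[t])
    noise_sd = math.sqrt(1.0 - alpha_bar[t])
    return [signal_scale * value + noise_sd * rng.gauss(0.0, 1.0) for value in x0]


schedule = linear_alpha_bar()
rng = random.Random(0)
x0 = [1.0, -2.0]  # stand-in for a data point, e.g. scenario features
lightly_noised = noise_sample(x0, 0, schedule, rng)
heavily_noised = noise_sample(x0, len(schedule) - 1, schedule, rng)
```

At the final step almost no signal remains, so generation amounts to starting from pure noise and running a learned reversal of these steps.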
Description | Optimising In-Store Price Reductions |
Organisation | Tesco Plc |
Country | United Kingdom |
Sector | Private |
PI Contribution | Demand for a product does not remain constant throughout its lifetime. As time progresses, a product is deemed less desirable by customers due to factors such as declining quality or the release of newer, improved products. We often wish to maximise revenue, and keeping prices constant while demand is decreasing is unlikely to achieve this. This project looks at pricing strategies for products towards the end of their saleable lifetime, known as markdowns. It focusses on in-store markdown pricing across a vast array of product types, which requires adaptable solutions. Our current methods use a two-stage approach: first predicting the demand for products and then using this prediction to find the optimal price(s) for the remaining sales period. We are using novel methods for predicting demand and optimising within markdowns, and are interested in a holistic approach where the uncertainty of demand is taken into account within the optimisation routine. |
Collaborator Contribution | Tesco has provided data and IT equipment from their stores which has allowed us to develop probabilistic models of the product demand for multiple products. Tesco has also been very actively engaged in directing this project with regular meetings with Tesco staff and site visits to their HQ to support further discussions. |
Impact | Publications are currently in progress and we are working towards implementing new techniques within Tesco's systems. |
Start Year | 2021 |
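The two-stage approach described above (predict demand, then optimise price) can be sketched as follows. Everything here is hypothetical: the exponential demand curve, its parameters, the stock levels and the function names are assumptions for illustration and do not describe the models used with Tesco. Demand uncertainty enters by averaging revenue over several plausible demand parameter pairs, for example draws from a fitted probabilistic model.

```python
import math


def expected_revenue(price, demand_samples, stock):
    """Average revenue over sampled demand curves, capped by available stock.

    Each sample is (scale, sensitivity) for demand = scale * exp(-sensitivity * price).
    """
    total = 0.0
    for scale, sensitivity in demand_samples:
        demand = scale * math.exp(-sensitivity * price)
        total += price * min(demand, stock)  # cannot sell more than is in stock
    return total / len(demand_samples)


def best_markdown_price(candidate_prices, demand_samples, stock):
    """Grid search for the candidate price with the highest expected revenue."""
    return max(candidate_prices,
               key=lambda price: expected_revenue(price, demand_samples, stock))


candidate_prices = [i / 10 for i in range(5, 51)]  # 0.5 to 5.0 in steps of 0.1
single_curve = [(100.0, 0.5)]                      # one point estimate of demand
uncertain_curves = [(90.0, 0.45), (100.0, 0.5), (110.0, 0.6)]

price_plenty_of_stock = best_markdown_price(candidate_prices, single_curve, stock=1000)
price_limited_stock = best_markdown_price(candidate_prices, single_curve, stock=20)
price_under_uncertainty = best_markdown_price(candidate_prices, uncertain_curves, stock=20)
```

With limited stock the chosen price rises, because lowering the price cannot sell more units than are on the shelf. Passing several demand samples instead of a single estimate is one simple way to let uncertainty influence the decision.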
Description | Statistical analysis of multiple interaction data |
Organisation | Elsevier |
Department | Elsevier UK |
Country | United Kingdom |
Sector | Private |
PI Contribution | Elsevier provides various online services and tools for researchers, such as Mendeley and ScienceDirect, and is interested in the problem of user segmentation: understanding who its users are and how they interact with its platforms. Our goal is to develop novel methodologies to assist with this task. Of particular interest is the analysis of clickstream data, which contains information regarding visits of users to Elsevier webpages. The data has two key properties that we are leveraging. Namely, it is both intermittent and bursty, with cascades of clicks in quick succession followed by periods of inactivity. These properties provide a means to interpret the clickstream as network data: using them, we are able to partition a single user's data into a sequence of paths over webpages. This represents an instance of a so-called interaction network, where one observes interactions amongst entities over time (here entities=webpages and interactions=paths). This differs subtly from the case where relations amongst entities are observed explicitly, such as in traditional social network data, and has led to recent work in the literature on new models. |
Collaborator Contribution | Elsevier has provided data and domain expertise that has assisted in our analysis. We have had regular meetings with Elsevier staff and visits to their offices. These interactions have been invaluable to making progress on this project. |
Impact | Two publications on this work are currently in submission |
Start Year | 2019 |
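The partitioning of a user's clickstream into paths, as described above, can be illustrated with a simple rule: start a new path whenever the gap between consecutive clicks exceeds a threshold. The threshold rule, the function name and the example data are assumptions for illustration; the record does not state how the project defines path boundaries.

```python
def split_into_paths(clicks, max_gap):
    """Split (timestamp, page) clicks into paths separated by long inactive gaps.

    A new path begins when the time since the previous click exceeds max_gap.
    """
    paths = []
    current = []
    last_time = None
    for time, page in sorted(clicks):  # process clicks in time order
        if last_time is not None and time - last_time > max_gap:
            paths.append(current)
            current = []
        current.append(page)
        last_time = time
    if current:
        paths.append(current)
    return paths


# Hypothetical clicks for one user: (seconds since first visit, page name).
clicks = [(0, "home"), (5, "search"), (9, "article"),
          (400, "home"), (404, "article")]
paths = split_into_paths(clicks, max_gap=60)
```

Each resulting path is one interaction in the interaction-network view, with webpages as the entities.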
Description | Statistical network modelling for populations of networks |
Organisation | Elsevier |
Department | Elsevier UK |
Country | United Kingdom |
Sector | Private |
PI Contribution | Developing a tool to cluster researchers who use Elsevier's platforms. |
Collaborator Contribution | Elsevier has provided data and technical expertise which has allowed us to make methodological developments on this project. |
Impact | Ongoing |
Start Year | 2019 |
Title | GaussianProcesses.jl |
Description | Gaussian processes are a family of stochastic processes which provide a flexible nonparametric tool for modelling data. A Gaussian Process places a prior over functions, and can be described as an infinite dimensional generalisation of a multivariate Normal distribution. Moreover, the joint distribution of any finite collection of points is a multivariate Normal. This process can be fully characterised by its mean and covariance functions, where the mean of any point in the process is described by the mean function and the covariance between any two observations is specified by the kernel. Given a set of observed real-valued points over a space, the Gaussian Process is used to make inference on the values at the remaining points in the space. This package allows the user to fit exact Gaussian process models when the observations are Gaussian distributed about the latent function. In the case where the observations are non-Gaussian, the posterior distribution of the latent function is intractable. The package allows for Monte Carlo sampling from the posterior. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | This software is widely used by the Julia community. |
URL | https://github.com/STOR-i/GaussianProcesses.jl |
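The exact Gaussian process regression described above has a closed-form posterior, which the following sketch computes. It is written in plain Python for illustration and does not use or mirror the GaussianProcesses.jl API. The zero prior mean, the squared-exponential kernel, its hyperparameters and all function names are assumptions.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def cholesky_solve(lower, rhs):
    """Solve (L L^T) x = rhs by forward then backward substitution."""
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


def gp_posterior(x_train, y_train, x_new, noise_var=1e-6):
    """Posterior mean and variance of the latent function at x_new (zero prior mean)."""
    n = len(x_train)
    gram = [[sq_exp_kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    lower = cholesky(gram)
    k_star = [sq_exp_kernel(x, x_new) for x in x_train]
    alpha = cholesky_solve(lower, y_train)  # (K + noise I)^{-1} y
    beta = cholesky_solve(lower, k_star)    # (K + noise I)^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = sq_exp_kernel(x_new, x_new) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var


x_train, y_train = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
mean_at_data, var_at_data = gp_posterior(x_train, y_train, 1.0)
mean_far_away, var_far_away = gp_posterior(x_train, y_train, 10.0)
```

Near an observation the posterior variance collapses towards the noise level, while far from the data it returns to the prior variance. This closed form is only available for Gaussian observations, which is why the non-Gaussian case needs sampling.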
Title | SGMCMC R package |
Description | This software implements a host of stochastic gradient MCMC algorithms for fast Bayesian inference. It has been developed for the R language and is built upon the Google TensorFlow library. By utilising the efficient computation of TensorFlow, and in particular its automatic differentiation tools, this software is the first R package to provide a simple user interface for statisticians to use gradient-based MCMC algorithms without requiring the gradients to be hand-coded. |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | The software has only recently been released and is yet to achieve its full potential. However, several papers have already cited this software in their work, indicating that it is being used within the community. |
URL | https://github.com/STOR-i/sgmcmc |
Title | SGMCMCJax |
Description | The software provides a toolbox of algorithms for stochastic gradient Markov chain Monte Carlo (MCMC). The package builds on the Jax library to offer users automatic differentiation tools that can be used to create gradient-based MCMC samplers. |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | The software has been used in publications and is part of a new book on probabilistic machine learning written by Kevin Murphy. |
URL | https://github.com/jeremiecoullon/SGMCMCJax |
Description | Presentation at the Royal Statistical Society |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Gave a presentation at an RSS workshop on Bayesian computation for Stein's method. |
Year(s) Of Engagement Activity | 2021 |
URL | https://rss.org.uk/training-events/events/events-2021/sections/rss-applied-probability-and-computati... |
Description | Presentation to SecondMind |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Industry/Business |
Results and Impact | This presentation showed how Bayesian inference algorithms can be applied without learning rates. This new class of algorithms is highly efficient and removes the need for users to hand-tune the learning rate parameters. The presentation led to an interesting discussion with the audience members on the extensions of this approach. The presentation was given to the machine learning team at SecondMind. |
Year(s) Of Engagement Activity | 2023 |
URL | https://www.secondmind.ai/labs/seminars/ |
Description | Talk at Imperial College London |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | This talk was given to approximately 30 people at a seminar for the Statistics group at Imperial College. The talk covered stochastic gradient MCMC methods and how standard methods are inefficient without utilising control variate approaches. |
Year(s) Of Engagement Activity | 2022 |
Description | Talk at Leeds University |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Postgraduate students |
Results and Impact | Research talk to the Mathematics department at the University of Leeds. The focus of the talk was on data science for environmental science challenges and, in particular, on how the environmental data scientists at Leeds could collaborate further with Lancaster University. |
Year(s) Of Engagement Activity | 2022 |
Description | Talk at the conference of the International Society for Bayesian Analysis |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This talk covered the recent developments in scalable Markov chain Monte Carlo and many of the pitfalls that exist with current methods. The audience was international and mostly university academics. |
Year(s) Of Engagement Activity | 2022 |