Turing AI Fellowship: Probabilistic Algorithms for Scalable and Computable Approaches to Learning (PASCAL)

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics

Abstract

We are living in an unprecedented age where vast quantities of our personal data are continually recorded and analysed, for example, our travel patterns, shopping habits and fitness routines. Our daily lives are now tied into this evolving loop of data collection, leading to data-based automated decisions that can make recommendations and optimise our routines. There is tremendous economic and societal value in understanding this deluge of unstructured, disparate data streams. A key challenge in Artificial Intelligence (AI) research is to extract meaningful value from these data sources to make decisions that can be trusted and understood to improve society.

The PASCAL research programme is focused on developing an end-to-end framework, from data to decisions, that naturally accounts for data uncertainty and provides transparent and interpretable decision-making tools. The algorithms developed throughout this research project will be generally applicable across a wide range of application domains and appropriate for modern computer hardware infrastructure. All of the research and associated algorithms will be widely available through high-quality open-source software that will ensure the widest possible uptake of this research within the international AI research community.

PASCAL will focus on two primary application areas, cybersecurity and transportation, which will stimulate and motivate this research and ensure widespread impact within these sectors. To drive the impact and uptake of this research within these sectors, we will work closely with committed strategic partners: GCHQ, the Heilbronn Institute for Mathematical Research, Transport Research Laboratory, the University of Washington and the Alan Turing Institute.

Cybersecurity - The proliferation of computers and mobile technology over the last few decades has led to an exponential increase in recorded data. Much of this data is personally, economically and nationally sensitive and protecting it is a key priority for any government or large organisation. Threats to data security exist on a global scale and identifying potential threats requires cybersecurity experts to evaluate and extract critical intelligence from complex and evolving data sources. Modelling and understanding the intricate patterns between these data sources requires complex mathematical models. The PASCAL programme will develop new algorithms that maintain the richness of these mathematical models and use them to provide interpretable and transparent decision recommendations.

Autonomous vehicles (AV) - The transition to AVs will be the most significant global change in transportation in the past century. The economic benefit and successful implementation of this technology within the UK requires a thorough understanding of the risks posed by driverless vehicles and what new procedures are required to ensure human safety. Through PASCAL, we will develop a framework to artificially generate realistic traffic scenarios to test AVs under a wide range of road conditions and create criteria to safely accredit AVs in the UK.

Publications

Aicher C (2023) Stochastic Gradient MCMC for Nonlinear State Space Models in Bayesian Analysis

Cabezas A (2023) Transport Elliptical Slice Sampling in Proceedings of Machine Learning Research

Coullon J (2021) Ensemble sampler for infinite-dimensional inverse problems in Statistics and Computing

Fairbrother J (2022) GaussianProcesses.jl : A Nonparametric Bayes Package for the Julia Language in Journal of Statistical Software

Mimnagh N (2022) Bayesian multi-species N-mixture models for unmarked animal communities in Environmental and Ecological Statistics

Nemeth C (2021) Stochastic Gradient Markov Chain Monte Carlo in Journal of the American Statistical Association

 
Description The focus of this grant has been to develop probabilistic approaches to machine learning which can accurately capture real-world uncertainties. Through collaboration with project partners, namely TRL, Shell, Microsoft and Tesco, this work is being developed to address challenges within these companies. For example, in the case of Shell, our work is being developed to track methane emissions from Shell facilities. Our work with Tesco is developing a new optimisation scheme which will automatically adjust price discounts in stores. Our collaboration with Microsoft Research is focused on developing new probabilistic algorithms for entity linkage with large language models. The overarching theme of these strands of work is to develop fast and computationally scalable approaches to probabilistic modelling that preserve uncertainty quantification in order to make robust real-world decisions.
Exploitation Route The publications that are currently being developed will be widely available through open-source licenses. Additionally, software that implements these techniques is already in development and will allow users from other sectors to utilise this work.
Sectors Environment

Retail

Security and Diplomacy

Transport

 
Description Bayesian inverse modelling and data assimilation of atmospheric emissions.
Amount £120,000 (GBP)
Funding ID 2605180 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 08/2021 
End 09/2025
 
Description Scalable Monte Carlo in the General Big Data Setting.
Amount £120,000 (GBP)
Funding ID 1949442 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2017 
End 05/2022
 
Description Alan Turing Institute (ATI) collaboration 
Organisation Alan Turing Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution I worked with PI Chris and collaborators on writing, methods development and theory for a paper that has now been submitted to a top-tier statistical journal.
Collaborator Contribution The specific collaborators from the Alan Turing Institute involved in this project include Chris Oates, Toni Karvonen and Mark Girolami. Their contributions were in funding two research visits to the Alan Turing Institute and jointly working on writing the manuscript, methods development and theory with myself and PI Chris.
Impact This collaboration has resulted in a computational statistics (single discipline) research paper submitted to a top-tier journal in the area.
Start Year 2018
 
Description Bayesian inverse modelling and data assimilation of atmospheric emissions 
Organisation Shell International Petroleum
Department Shell UK Ltd
Country United Kingdom 
Sector Private 
PI Contribution Our research focuses on locating the source(s) and quantifying the emission rate(s) of anthropogenic greenhouse gases, with a focus on methane. To do so, we are modelling gas dispersion in the atmosphere and implementing probabilistic inversion for source characterisation. We are predicting spatio-temporal gas dispersion using Gaussian plume models and other computational fluid dynamics models based on the Navier-Stokes equations, and assessing their computational cost and accuracy under different atmospheric conditions. Additionally, we are developing novel methodologies involving gradient-based MCMC algorithms and Gaussian processes to perform efficient probabilistic inversion, which identifies source location(s) from gas concentration measurements. Due to the high-dimensional nature of the problem, MCMC inversion is computationally expensive. Hence, this research aims to create models which are computationally fast and widely applicable, including for live tracking of emissions by drones or satellites.
Collaborator Contribution Shell is providing data and domain expertise in modelling air currents from their Statistics team. The team at Shell is dedicating a significant amount of staff time to meet with the Lancaster University research team, meeting at least once per week. Additionally, they are financially supporting opportunities for visits to Shell HQ.
Impact No significant outputs to report at this time.
Start Year 2022
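
The Gaussian plume model mentioned in the PI Contribution above can be illustrated with a short sketch. This is the standard textbook steady-state formula with ground reflection, not the project's own implementation. The function name, the linear spread laws sigma = a * x and the coefficient values are illustrative assumptions; in practice the spread depends on atmospheric stability.

```python
import math


def gaussian_plume(x, y, z, rate, wind_speed, source_height, a_y=0.08, a_z=0.06):
    """Steady-state concentration at (x, y, z) for a point source at the origin.

    x is the downwind distance, y the crosswind offset and z the height
    above ground. The linear spread laws sigma = a * x and the default
    coefficients are illustrative assumptions, not calibrated values.
    """
    if x <= 0:
        return 0.0  # this simple model has no concentration upwind of the source
    sigma_y = a_y * x
    sigma_z = a_z * x
    crosswind = math.exp(-y ** 2 / (2 * sigma_y ** 2))
    # The second exponential is an image source that models ground reflection.
    vertical = (math.exp(-(z - source_height) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + source_height) ** 2 / (2 * sigma_z ** 2)))
    normaliser = 2 * math.pi * wind_speed * sigma_y * sigma_z
    return rate / normaliser * crosswind * vertical
```

The predicted concentration is linear in the emission rate, so the emission rate is the easy part of the inversion. The source location enters nonlinearly, which is one reason sampling-based inversion can become expensive.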
 
Description Collaboration with Dr Leah South 
Organisation Queensland University of Technology (QUT)
Country Australia 
Sector Academic/University 
PI Contribution My team and I are meeting with Dr South on a weekly basis to prepare a research project for academic publication.
Collaborator Contribution Dr Leah South is advising our project on the application of Stein's method within the context of stochastic gradient MCMC.
Impact Currently in development
Start Year 2020
 
Description Data subsampling for scalable inference 
Organisation Stanford University
Country United States 
Sector Academic/University 
PI Contribution Monte Carlo methods are often required to produce exact inference and to evaluate models in the Bayesian setting. These algorithms are widely implemented by scientists and industrial practitioners, due to their versatility and strong theoretical properties. Unfortunately, standard Monte Carlo algorithms are ill-suited for conducting inference on large datasets. This is because they require an evaluation over the full dataset at each iteration, leading to a computational cost that increases (at the very least) proportionally with the data size. These issues have prompted considerable interest amongst the machine learning and statistics communities in developing Bayesian inference methods which scale easily with the size of the data. The project has developed new scalable Markov chain Monte Carlo (MCMC) algorithms based on stochastic gradient MCMC. In particular, we have developed new techniques for modelling temporally-varying data and new ways to optimally subsample data which lead to lower-variance stochastic gradient estimates.
Collaborator Contribution This project has been in collaboration with Prof Emily Fox (formerly of the University of Washington). Prof Fox is a world leader in statistical machine learning and her expertise has been invaluable in the development of scalable MCMC techniques in the temporally-evolving setting.
Impact Two publications were produced as a result of this collaboration. One paper has been accepted for publication in AISTATS and a second publication is currently under review.
Start Year 2018
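
As a rough illustration of the data-subsampling idea described in the PI Contribution above, the sketch below runs stochastic gradient Langevin dynamics (SGLD), a standard stochastic gradient MCMC algorithm, on a toy model. It is not one of the new algorithms developed in this project. The model (Gaussian observations with unit variance and a Gaussian prior on the mean), the function name and all tuning values are assumptions chosen for illustration.

```python
import math
import random


def sgld_gaussian_mean(data, n_iters=5000, batch_size=10, step_size=1e-4,
                       prior_var=100.0, seed=0):
    """Sample the mean of a unit-variance Gaussian with SGLD.

    Each iteration touches only `batch_size` observations rather than the
    full dataset, so the per-iteration cost does not grow with len(data).
    """
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_iters):
        batch = rng.sample(data, batch_size)
        # Unbiased estimate of the log-posterior gradient: the minibatch
        # likelihood term is rescaled by n / batch_size.
        grad_prior = -theta / prior_var
        grad_lik = (n / batch_size) * sum(y - theta for y in batch)
        grad = grad_prior + grad_lik
        # Langevin update: half a gradient step plus injected Gaussian noise.
        theta += 0.5 * step_size * grad + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples
```

With uniform subsampling, as here, the minibatch gradient adds noise on top of the injected noise, which inflates the spread of the samples. Lower-variance gradient estimates reduce that effect.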
 
Description Diffusion-based Deep Generative Models for Assessing Safety in Autonomous Vehicles 
Organisation Transport Research Laboratory Ltd (TRL)
Country United Kingdom 
Sector Private 
PI Contribution This project is working towards developing new deep generative models based on diffusion models. This is a recent and growing field of machine learning, where the goal is to learn a probability distribution from a finite set of samples. This classical problem in statistics has been studied for many decades, but until recently efficient learning of high-dimensional distributions remained impossible in practice. Recent advances in the field of deep generative modelling aim to learn the unknown data-generating distribution using neural network models to generate fake, yet realistic-looking data, such as images and videos, and compare the output to real datasets. The goal of this project is to use ideas from deep generative modelling to create sufficiently complex road scenarios that can be used within autonomous vehicle simulators, such as CARLA.
Collaborator Contribution TRL is providing data and domain expertise to this project. In particular, TRL has access to road accident data which we are using to train our machine-learning models. The team at TRL is also providing domain expertise, particularly with regard to the CARLA simulation software, and assisting our team with implementation challenges.
Impact This project is still in the early stages of development and there are no outcomes to report yet.
Start Year 2022
 
Description Optimising In-Store Price Reductions 
Organisation Tesco Plc
Country United Kingdom 
Sector Private 
PI Contribution Demand for a product does not remain constant throughout its lifetime. As time progresses, a product is deemed less desirable by customers due to factors such as declining quality or the release of newer, improved products. We often wish to maximise revenue, and keeping prices constant while demand is decreasing is unlikely to achieve this. This project looks at pricing strategies for products towards the end of their saleable lifetime, known as markdowns. This project focusses on in-store markdown pricing across a vast array of product types, which requires adaptable solutions. Our current methods use a two-stage approach: first predicting the demand for products and then using this prediction to find the optimal price(s) for the remaining sales period. We are using novel methods for predicting demand and optimising markdowns, and are interested in a holistic approach where the uncertainty in demand is taken into account within the optimisation routine.
Collaborator Contribution Tesco has provided data and IT equipment from their stores which has allowed us to develop probabilistic models of the product demand for multiple products. Tesco has also been very actively engaged in directing this project with regular meetings with Tesco staff and site visits to their HQ to support further discussions.
Impact Publications are currently in progress and we are working towards implementing new techniques within Tesco's systems.
Start Year 2021
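
The two-stage approach described in the PI Contribution above (predict demand, then optimise price) can be sketched as follows. This is a toy illustration only: the constant-elasticity demand curve, the function names and the grid search over candidate prices are hypothetical choices, not the methods used in the project.

```python
def expected_demand(price, base_demand, elasticity, reference_price):
    """Stage 1 (hypothetical model): constant-elasticity demand curve.

    Demand equals base_demand at reference_price and grows as the price drops.
    """
    return base_demand * (price / reference_price) ** (-elasticity)


def best_markdown_price(candidate_prices, stock, base_demand, elasticity,
                        reference_price):
    """Stage 2: choose the candidate price with the highest expected revenue.

    Sales are capped by the remaining stock, so very deep discounts stop
    paying off once the predicted demand exceeds what is left on the shelf.
    """
    best_price, best_revenue = None, float("-inf")
    for price in candidate_prices:
        demand = expected_demand(price, base_demand, elasticity, reference_price)
        revenue = price * min(stock, demand)
        if revenue > best_revenue:
            best_price, best_revenue = price, revenue
    return best_price, best_revenue
```

This sketch treats the demand prediction as exact. A holistic approach of the kind mentioned above would instead carry the uncertainty in the demand forecast into the optimisation step.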
 
Description Probabilistic Linking in AI-Powered Knowledge Bases 
Organisation Microsoft Research
Department Microsoft Research Cambridge
Country United Kingdom 
Sector Private 
PI Contribution We are working with researchers at Microsoft to develop new algorithms for entity linkage within the context of large language models.
Collaborator Contribution Microsoft Research is funding a PhD student, based at Lancaster University, to work on entity linkage within large language models.
Impact Not currently
Start Year 2023
 
Description Statistical analysis of multiple interaction data 
Organisation Elsevier
Department Elsevier UK
Country United Kingdom 
Sector Private 
PI Contribution Elsevier provides various online services and tools for researchers, such as Mendeley and ScienceDirect, and is interested in the problem of user segmentation: understanding who their users are and how they interact with their platforms. Our goal is to develop novel methodologies to assist with this task. Of particular interest is the analysis of clickstream data, which contains information regarding visits of users to Elsevier webpages. The data has two key properties that we are leveraging. Namely, it is both intermittent and bursty, with cascades of clicks in quick succession followed by periods of inactivity. This has provided a means to interpret it as network data. Using the intermittent and bursty properties of these data, we are able to partition a single user's data into a sequence of paths over webpages. This represents an instance of a so-called interaction network, where one observes interactions amongst entities over time (here entities=webpages and interactions=paths). This differs subtly from the case where relations amongst entities are observed explicitly, such as in traditional social network data, and has led to recent work in the literature on new models.
Collaborator Contribution Elsevier has provided data and domain expertise that has assisted in our analysis. We have had regular meetings with Elsevier staff and visits to their offices. These interactions have been invaluable to making progress on this project.
Impact Two publications on this work are currently in submission
Start Year 2019
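
The partitioning step described in the PI Contribution above, where a single user's bursty clickstream is split into a sequence of paths over webpages, can be sketched with a simple gap rule. The function name, the input format and the 30-minute inactivity threshold are assumptions for illustration. The source does not state how the project defines the boundaries between paths.

```python
def split_into_paths(clicks, max_gap=1800.0):
    """Split one user's clicks into paths separated by periods of inactivity.

    `clicks` is a list of (timestamp_in_seconds, page) tuples. A new path
    starts whenever the gap since the previous click exceeds `max_gap`.
    """
    paths = []
    current = []
    last_time = None
    for time, page in sorted(clicks):
        if last_time is not None and time - last_time > max_gap:
            paths.append(current)
            current = []
        current.append(page)
        last_time = time
    if current:
        paths.append(current)
    return paths
```

Each returned path is an ordered list of pages, so a user's history becomes a sequence of paths, matching the interaction-network view in which webpages are the entities and paths are the interactions.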
 
Description Statistical network modelling for populations of networks 
Organisation Elsevier
Department Elsevier UK
Country United Kingdom 
Sector Private 
PI Contribution Developing a tool to cluster researchers who use Elsevier's platforms.
Collaborator Contribution Elsevier has provided data and technical expertise which has allowed us to make methodological developments on this project.
Impact Ongoing
Start Year 2019
 
Title BlackJAX 
Description BlackJAX bridges the gap between "one liner" frameworks and modular, customizable libraries. Users can import the library and interact with robust, well-tested and performant samplers with a few lines of code. These samplers are aimed at PPL developers, or people who have a logpdf and just need a sampler that works. But the true strength of BlackJAX lies in its internals and how they can be used to experiment quickly on existing or new sampling schemes. This lower level exposes the building blocks of inference algorithms (integrators, proposals, momentum generators, etc.) and makes it easy to combine them to build new algorithms. It provides an opportunity to accelerate research on sampling algorithms by providing robust, performant and reusable code. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Ongoing 
URL https://blackjax-devs.github.io/blackjax/
 
Title GaussianProcesses.jl 
Description Gaussian processes are a family of stochastic processes which provide a flexible nonparametric tool for modelling data. A Gaussian Process places a prior over functions, and can be described as an infinite dimensional generalisation of a multivariate Normal distribution. Moreover, the joint distribution of any finite collection of points is a multivariate Normal. This process can be fully characterised by its mean and covariance functions, where the mean of any point in the process is described by the mean function and the covariance between any two observations is specified by the kernel. Given a set of observed real-valued points over a space, the Gaussian Process is used to make inference on the values at the remaining points in the space. This package allows the user to fit exact Gaussian process models when the observations are Gaussian distributed about the latent function. In the case where the observations are non-Gaussian, the posterior distribution of the latent function is intractable. The package allows for Monte Carlo sampling from the posterior. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact This software is widely used by the Julia community. 
URL https://github.com/STOR-i/GaussianProcesses.jl
 
Title SGMCMC R package 
Description This software implements a host of stochastic gradient MCMC algorithms for fast Bayesian inference. This software has been developed for the R language and is built upon the Google TensorFlow library. Utilising the efficient computation of TensorFlow, and in particular the automatic differentiation tools available through TensorFlow, this software is the first R package which provides a simple user interface for statisticians to use gradient-based MCMC algorithms without requiring the gradients to be hand-coded. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software has only recently been released and is yet to achieve its full potential. However, several papers have already cited this software in their work, indicating that it is being used within the community. 
URL https://github.com/STOR-i/sgmcmc
 
Title SGMCMCJax 
Description The software provides a toolbox of algorithms for stochastic gradient Markov chain Monte Carlo (MCMC). The package builds on the Jax library to offer users automatic differentiation tools that can be used to create gradient-based MCMC samplers. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact The software has been used in publications and is part of a new book on probabilistic machine learning written by Kevin Murphy. 
URL https://github.com/jeremiecoullon/SGMCMCJax
 
Description Panel discussion participant at the Royal Statistical Society conference 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact I took part in a panel discussion at the Royal Statistical Society's International Conference in Harrogate. The panel discussion was on the topic "Evaluating AI: How data science and statistics can shape the UK's AI strategy"
Year(s) Of Engagement Activity 2023
URL https://www.youtube.com/watch?v=7aZrkQIComM
 
Description Presentation at Bayes Comp 2023 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This talk was given at Bayes Comp 2023, the biennial conference of the Bayesian Computation Section of the International Society for Bayesian Analysis. The talk discussed new gradient-based sampling algorithms which can be applied without learning rates.
Year(s) Of Engagement Activity 2023
URL https://bayescomp2023.com/programme
 
Description Presentation at Bayes on the Beach 2024 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This talk was given at Bayes on the Beach 2024, a biennial conference organised by the Bayesian Research & Applications Group (BRAG) in collaboration with Queensland University of Technology Centre for Data Science, the Australasian chapter of the International Society for Bayesian Analysis, and the Statistical Society of Australia. The talk discussed new gradient-based sampling algorithms which can be applied without learning rates.
Year(s) Of Engagement Activity 2024
URL https://research.qut.edu.au/qutcds/bayes-onthe-beach/bayes-on-the-beach-program/
 
Description Presentation at Massachusetts Institute of Technology 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This talk was given in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT CSAIL). The talk discussed new gradient-based sampling algorithms which can be applied without learning rates.
Year(s) Of Engagement Activity 2023
 
Description Presentation at Maynooth University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This presentation was given at Maynooth University as part of a lecture on advanced statistical methods and how they can be applied in the finance industry to automate decision processes.
Year(s) Of Engagement Activity 2023
 
Description Presentation at University College London 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This poster presentation was given at a workshop on 'Distance Based Methods for Machine Learning', held at University College London. The presentation discussed new gradient-based sampling algorithms which can be applied without learning rates.
Year(s) Of Engagement Activity 2023
URL https://dbmml.github.io/
 
Description Presentation at University of British Columbia 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This talk was given to students and faculty in the Department of Statistics at the University of British Columbia. The talk discussed new gradient-based sampling algorithms which can be applied without learning rates.
Year(s) Of Engagement Activity 2023
 
Description Presentation at University of California, Berkeley 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This talk was given in the High Dimensional Scientific Computing seminar series at the University of California, Berkeley. The talk discussed new gradient-based sampling algorithms which can be applied without learning rates.
Year(s) Of Engagement Activity 2023
 
Description Presentation at University of Oxford 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact This talk was given in the Computational Statistics and Machine Learning seminar series at the University of Oxford. The talk discussed new gradient-based sampling algorithms which can be applied without learning rates.
Year(s) Of Engagement Activity 2023
URL https://github.com/oxcsml/ML_bazaar/wiki/Seminar
 
Description Presentation at the Bayes4Health/CoSinES Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact I gave a research presentation at the University of Oxford to the Bayes4Health and CoSinES programme grant researchers. The talk was focused on coin sampling as an alternative to popular step-size-dependent Monte Carlo sampling algorithms.
Year(s) Of Engagement Activity 2023
 
Description Presentation at the Royal Statistical Society 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Gave a presentation at an RSS workshop on Bayesian computation for Stein's method.
Year(s) Of Engagement Activity 2021
URL https://rss.org.uk/training-events/events/events-2021/sections/rss-applied-probability-and-computati...
 
Description Presentation at the Turing AI fellows research retreat 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact I gave a presentation at the Natural History Museum on Probabilistic AI and the importance of taking a probabilistic approach to the development of AI systems. The other EPSRC-funded Turing AI fellows were in attendance, with some additional invited AI experts.
Year(s) Of Engagement Activity 2023
 
Description Presentation at the University of Edinburgh 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact I gave a talk to approximately 40 members of the maths department on the topic of learning-rate-free sampling algorithms. The talk initiated an interesting debate amongst the attendees about the importance of correctly setting the learning rate for many machine learning algorithms.
Year(s) Of Engagement Activity 2023
 
Description Presentation at the University of Oslo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact This presentation was given to researchers at the Norwegian Research Council-funded programme, Integreat. The focus of the talk was to highlight the importance of not using off-the-shelf machine learning tools and the need for careful consideration of how ML is used on real-world problems.
Year(s) Of Engagement Activity 2024
URL https://www.integreat.no/events/public-events/tuesday-seminars/2024/0305.html
 
Description Presentation at the Young Irish Statistical Association and International Biometric Society meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This presentation was given to postgraduates and young researchers at a joint event of the Young Irish Statistical Association (Y-ISA) and the International Biometric Society. The talk was about how novel extensions of Bayesian tree-based methods can be used to estimate genotype-by-environment (GxE) interactions in plant genetics.
Year(s) Of Engagement Activity 2023
URL https://young-istat.github.io/events/posts/4th_Y-ISA_meeting_abstracts.pdf
 
Description Presentation to SecondMind 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Industry/Business
Results and Impact This presentation showed how Bayesian inference algorithms can be applied without learning rates. This new class of algorithms is highly efficient and removes the need for users to hand-tune the learning rate parameters. The presentation led to an interesting discussion with the audience members on the extensions of this approach. The presentation was given to the machine learning team at SecondMind.
Year(s) Of Engagement Activity 2023
URL https://www.secondmind.ai/labs/seminars/
 
Description Satellite event for the Royal Institution's Christmas Lecture 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Lancaster University hosted a local viewing of the Royal Institution's Christmas Lecture on AI, which was given by Prof Michael Wooldridge. I led a session at the university for school children on AI and the mathematics of AI.
Year(s) Of Engagement Activity 2023
 
Description Talk at Imperial College London 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact This talk was given to approximately 30 people at a seminar for the Statistics group at Imperial College. The talk covered stochastic gradient MCMC methods and how standard methods are inefficient without utilising control variate approaches.
Year(s) Of Engagement Activity 2022
 
Description Talk at Leeds University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Research talk to the Mathematics department at the University of Leeds. The focus of the talk was on data science for environmental science challenges and, in particular, on how the environmental data scientists at Leeds could collaborate further with Lancaster University.
Year(s) Of Engagement Activity 2022
 
Description Talk at the conference of the International Society for Bayesian Analysis 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This talk covered the recent developments in scalable Markov chain Monte Carlo and many of the pitfalls that exist with current methods. The audience was international and mostly university academics.
Year(s) Of Engagement Activity 2022