Bayesian Learning for Sparse High-Dimensional Data

Lead Research Organisation: University of Liverpool
Department Name: Electrical Engineering and Electronics

Abstract

This project focuses on understanding uncertainty in machine learning models trained on limited datasets. In many problems the number of data points is small relative to the number of features. Typical solutions assume independence of features or use dimensionality reduction to learn a maximum likelihood projection of the data. For small data sets, learnt models depend critically on the particular data points used. The project will investigate whether Bayesian methods can be used to characterise the uncertainty of estimated parameters efficiently when developing machine learning models for sensor signal time series.

Much recent progress in machine learning has relied on the availability of large datasets, which allow the development of complex models. However, many problems in defence and security do not have access to such data, either because they require the use of less widely studied sensors (such as sonar) or because they relate to adversaries, who strive to limit data about their activities. Most published models rely on point estimates of parameters, obtained through methods such as maximum likelihood estimation or stochastic gradient descent. When this type of model is applied in situations with limited data, the uncertainty associated with the parameter estimates is usually not taken into account, either when integrating machine learning models into wider systems or when assessing performance to predict how the model might behave in operational scenarios. Even when other approaches to dealing with limited datasets are used, such as transfer learning, uncertainty characterisation remains important, as there is often a mismatch between the distributions of the pre-training and training datasets.
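As a purely illustrative aside, the difference between a point estimate and a Bayesian characterisation of uncertainty can be seen in a toy conjugate model. The sketch below is not one of the project's algorithms: it assumes a single unknown mean, Gaussian noise with known variance, and a Gaussian prior, and the function name `gaussian_mean_posterior` and the data values are made up for the example.

```python
def gaussian_mean_posterior(data, noise_var, prior_mean=0.0, prior_var=1.0):
    """Closed-form posterior N(mean, var) for an unknown mean.

    Assumes (for illustration only) Gaussian observations with known
    variance `noise_var` and a Gaussian prior N(prior_mean, prior_var).
    """
    n = len(data)
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var


# Hypothetical measurements: a small dataset, and a larger one built by
# repeating the same values so that only the sample size changes.
small = [1.2, 0.7, 1.9]
large = small * 20

# The maximum likelihood point estimate is the sample mean in both cases,
# so it carries no information about how much data supports it.
mle_small = sum(small) / len(small)
mle_large = sum(large) / len(large)

# The posterior variance does reflect the amount of data.
mean_small, var_small = gaussian_mean_posterior(small, noise_var=1.0)
mean_large, var_large = gaussian_mean_posterior(large, noise_var=1.0)

print(f"point estimate, n={len(small)}: {mle_small:.3f}")
print(f"posterior, n={len(small)}: mean={mean_small:.3f}, var={var_small:.4f}")
print(f"posterior, n={len(large)}: mean={mean_large:.3f}, var={var_large:.4f}")
```

The two point estimates are essentially identical, whereas the posterior variance is much larger for the three-point dataset. It is this kind of information that a point estimate discards.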

This project aims to investigate to what extent Bayesian methods can be used to characterise the uncertainty of estimated parameters when dealing with sparse but potentially high-dimensional data sets, and how this can be implemented in a distributed computing setting. The expected outcome of the project is the development of suitable Bayesian algorithms, along with a software implementation, and an analysis of algorithm performance on relevant datasets.

The research will start with a literature review of appropriate approaches, which could include Variational Bayesian methods, Markov Chain Monte Carlo (MCMC), Sequential Monte Carlo (SMC), Approximate Bayesian Computation (ABC), and other approximate methods. Consideration will be given to the computational feasibility of the algorithms, including the extent to which computation can be distributed across multiple processors or virtual machines in a cloud infrastructure, and to the transparency (confidence) and performance improvements the various approaches could provide. Suitable innovative techniques will be developed, assessed, and compared against baseline approaches. Bayesian Neural Networks (BNNs) will also be investigated, with implementations based on techniques such as SMC and MCMC, amongst others. The algorithms will be applied to a number of sponsor-supplied datasets, such as sonar sensor or electrical device measurement time series. The research will determine the extent to which the uncertainty representation accommodates operational data that may not have the same distribution as the training data. Based on discussions with the sponsor and an analysis of the results, industrially relevant scenarios where the algorithms can be used will be identified.
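To make one of the named method families concrete, the following is a minimal sketch of random-walk Metropolis, one of the simplest MCMC algorithms. It is an assumption-laden toy, not the project's implementation: the target is the posterior of a single unknown mean under a standard Gaussian prior with unit-variance Gaussian noise, and the data values, step size, and function names are hypothetical.

```python
import math
import random
import statistics


def log_posterior(theta, data, noise_var=1.0, prior_var=1.0):
    """Unnormalised log posterior: Gaussian prior times Gaussian likelihood."""
    log_prior = -0.5 * theta ** 2 / prior_var
    log_lik = -0.5 * sum((x - theta) ** 2 for x in data) / noise_var
    return log_prior + log_lik


def metropolis(data, n_samples=20000, step=0.8, seed=0):
    """Random-walk Metropolis sampler for the one-dimensional toy posterior."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, data)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


data = [1.2, 0.7, 1.9]  # hypothetical measurements
draws = metropolis(data)[2000:]  # discard the first 2000 draws as burn-in

post_mean = statistics.fmean(draws)
post_std = statistics.pstdev(draws)
print(f"posterior mean ~ {post_mean:.2f}, posterior std ~ {post_std:.2f}")
```

For this toy model the exact posterior is Gaussian with mean 0.95 and standard deviation 0.5, so the sampler's output can be checked against a known answer. In the high-dimensional models the project targets no such closed form is generally available, which is why computational cost and the scope for distributing the computation matter.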

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S023445/1                                   01/04/2019  30/09/2027
2889818            Studentship   EP/S023445/1  01/10/2023  30/09/2027  Daniel Sumler