Deep Probabilistic Models for Making Sense of Unstructured Data

Lead Research Organisation: University of Sheffield
Department Name: Neurosciences

Abstract

The future information infrastructure will be characterized by massive streaming sets of distributed data-sources. These data will challenge classical statistical and machine learning methodologies both from a computational and a theoretical perspective. This proposal investigates a flexible class of models for learning and inference in the context of these challenges. We will develop learning infrastructures that are powerful, flexible and 'privacy aware' with a user-centric focus. These learning infrastructures will be developed in the context of particular application challenges, including mental health, the developing world and personal information management. These applications are inspired by collaborations with citizenme, the NewMind Network for Mental Health Technology Research and Makerere University in Kampala, Uganda.

Planned Impact

We will follow the principles of open data science to ensure the impact of our work is felt as strongly as possible in industry, health and for the developing world. These principles are as follows:

- Make new analysis methodologies available as widely and rapidly as possible with as few conditions on their use as possible.
- Educate our commercial, scientific and medical partners in the use of these latest methodologies.
- Act to achieve a balance between data sharing for societal benefit and the right of an individual to own their data.

In practice we will carry out these ideas by making our software freely available under BSD licenses (see e.g. Sheffield's GPy software, 65 watchers and 159 stars on GitHub), by running a programme of focussed summer schools, for example the Gaussian Process Summer Schools (http://gpss.cc) with editions across Europe, Australia, South America and Africa, and by participating in other summer schools and tutorials.

Our entire project is a close collaboration with industrial partners, clinical partners and researchers in developing countries. These collaborators have representatives on our advisory group which will steer our research to ensure it feeds directly into our users' needs.
 
Description 1. Differential Privacy (DP) for Gaussian Processes (GPs).
There are several components to this research. In brief, differential privacy is a method for querying a database in such a way that individual rows of the dataset cannot be inferred from the statistics released. Gaussian processes are a powerful, widely used probabilistic modelling framework for solving regression and classification problems.
Our first result is to apply the work of Hall et al. 2013, allowing us to make DP GP predictions by adding a scaled sample from the GP's prior covariance distribution. This made the training data's output (y) values private.
We next improved on this by using inducing inputs (a method usually used to allow GPs to scale) to reduce the sensitivity of the predictions to perturbations in the training data. An earlier draft of this work is available on arXiv [1].
Next we found that the noise samples' covariances could be made more efficient by considering the gradient of the test points with respect to the training points. We derived a method for finding the optimum DP-noise covariance matrix, which we might call the 'cloaking matrix', for a given set of test and training inputs.
We have published the first paper on this topic in AISTATS, focusing on the methods described above. We are also about to submit a separate, more extensive journal paper covering:
- Hyperparameter optimisation in the DP framework.
- An improved cloaking method, using either inducing inputs or a non-stationary kernel.
We are investigating whether making the inputs private using a similar methodology is practical.
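As a rough illustration of the first result above (adding a scaled sample from the GP prior covariance to the predictions), the following Python sketch may help. It is not the project's implementation: the function names, the RBF kernel, the fixed hyperparameters and the `sensitivity` argument are assumptions made here. The noise scale uses the form sensitivity * c(delta) / epsilon with c(delta) = sqrt(2 log(2/delta)), as recalled from Hall et al. 2013, and a valid sensitivity bound is assumed to be supplied by the caller.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def dp_gp_predict(x_train, y_train, x_test, epsilon, delta, sensitivity,
                  noise_var=0.1, rng=None):
    """Return (private mean, non-private mean, noise scale).

    `sensitivity` is a hypothetical, externally supplied bound on how far
    the posterior mean can move when one training output (y) changes.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = rbf(x_test, x_train)
    mean = K_star @ np.linalg.solve(K, y_train)

    # Assumed noise scale: sensitivity * c(delta) / epsilon
    c = np.sqrt(2.0 * np.log(2.0 / delta))
    scale = sensitivity * c / epsilon

    # Draw one sample from the prior covariance at the test inputs
    # (a small jitter keeps the matrix numerically well behaved).
    K_tt = rbf(x_test, x_test) + 1e-8 * np.eye(len(x_test))
    sample = rng.multivariate_normal(np.zeros(len(x_test)), K_tt)
    return mean + scale * sample, mean, scale

x_train = np.linspace(0.0, 5.0, 8)
y_train = np.sin(x_train)
x_test = np.linspace(0.0, 5.0, 4)
private_mean, mean, scale = dp_gp_predict(
    x_train, y_train, x_test, epsilon=1.0, delta=0.01, sensitivity=0.5,
    rng=np.random.default_rng(0))
```

Only the output (y) values are protected in this sketch; the inputs are treated as public, matching the first result described above.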

2. Integral Kernel
As part of the above development, to allow comparison with current DP methods, we considered how to make predictions from histogram data (such data is currently easy to apply DP to using the standard "histogram mechanism"). We built on the group's previous work in latent force models, and devised a GP kernel to allow predictions to be made from histogram data. For example, if one knew that in a census area there were 8 people aged 0-20, 25 aged 20-40, 13 aged 41-100 and 25 aged 30-60, and one wanted to predict the density of people aged 23, this kernel would allow such a prediction to be made. A GPy kernel was written with associated unit tests [2].
The method has been extended to include reporting predictions of integrals (as well as densities).
An additional approximation method for non-hyper-cuboid regions has been developed.
Experiments have been performed on data from several domains to demonstrate its efficacy.
A final draft for a paper submission has been produced, and alternative methods are being incorporated for comparison.
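The census example above can be sketched in one dimension as follows. This is a minimal illustration, not the GPy kernel referenced in [2]: it assumes a latent density f with kernel k(u, v) = var * exp(-(u - v)^2 / l^2), treats each histogram count as a noisy observation of the integral of f over its bin, and uses closed-form integrals of that kernel. The function names, lengthscale, variance and noise level are all hypothetical choices.

```python
import math
import numpy as np

def _g(z):
    # Helper whose second derivative is 2 * exp(-z^2)
    return z * math.sqrt(math.pi) * math.erf(z) + math.exp(-z * z)

def k_FF(s, t, s2, t2, l=10.0, var=1.0):
    """Covariance between the integrals of f over [s, t] and [s2, t2]."""
    return var * 0.5 * l ** 2 * (_g((t - s2) / l) + _g((s - t2) / l)
                                 - _g((t - t2) / l) - _g((s - s2) / l))

def k_Ff(s, t, x, l=10.0, var=1.0):
    """Covariance between the integral of f over [s, t] and f(x)."""
    return var * 0.5 * math.sqrt(math.pi) * l * (
        math.erf((t - x) / l) - math.erf((s - x) / l))

def predict_density(bins, counts, x, l=10.0, var=1.0, noise_var=1.0):
    """GP posterior mean of the density f(x) given histogram counts."""
    n = len(bins)
    K = np.array([[k_FF(*bins[i], *bins[j], l, var) for j in range(n)]
                  for i in range(n)])
    K += noise_var * np.eye(n)  # assumed observation noise on each count
    k_star = np.array([k_Ff(s, t, x, l, var) for s, t in bins])
    return k_star @ np.linalg.solve(K, np.asarray(counts, dtype=float))

# Bins and counts from the census example (overlapping bins are allowed).
bins = [(0, 20), (20, 40), (41, 100), (30, 60)]
counts = [8, 25, 13, 25]
density_at_23 = predict_density(bins, counts, 23.0)  # people per year of age
```

The prediction depends strongly on the assumed lengthscale and noise level, which in practice would be chosen by hyperparameter optimisation rather than fixed by hand.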

3. Time Series clustering methods
As part of our collaboration with Sheffield Teaching Hospitals and the Sheffield Institute for Translational Neuroscience, we have been analysing a dataset of 1500 patients with motor neurone disease (MND), which includes amyotrophic lateral sclerosis (ALS). As part of this analysis we aim to cluster the patients based on their symptom data. At irregular intervals the patients are assessed by a clinician, who records ten metrics regarding their capabilities (covering mobility, eating, breathing, etc). Each time series has a poorly defined start and finish, which suggests we need to align the data prior to or during clustering. We developed a kernel for GPy which allows inputs to be shifted (in multiple dimensions, across multiple groups) to maximise the log-likelihood of the data, as part of the hyperparameter optimisation step of GPy. For the clustering itself we use a greedy algorithm in which we first consider the log-likelihood of a model for each of the patients separately and then for each pair. We greedily merge these clusters/patients until the log-likelihood cannot be further improved. We are further improving this by (approximately) integrating over the hyperparameter values to reduce the risk of overfitting.
This work is in progress, although the offset kernel is already available [3].
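The greedy clustering step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: it uses a plain RBF kernel with fixed hyperparameters, a single metric per patient, and no input alignment (the offset kernel is omitted), and all function names are hypothetical. Each cluster is modelled as noisy observations of one shared latent function, and two clusters are merged only while doing so increases the total GP log marginal likelihood.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def log_marginal(x, y, noise_var=0.01):
    """GP log marginal likelihood of the pooled observations (x, y)."""
    K = rbf(x, x) + noise_var * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(x) * np.log(2.0 * np.pi))

def cluster_ll(members, series):
    """Log-likelihood of one cluster: its patients share a latent function."""
    x = np.concatenate([series[i][0] for i in members])
    y = np.concatenate([series[i][1] for i in members])
    return log_marginal(x, y)

def greedy_cluster(series):
    clusters = [[i] for i in range(len(series))]
    lls = [cluster_ll(c, series) for c in clusters]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged_ll = cluster_ll(clusters[a] + clusters[b], series)
                gain = merged_ll - lls[a] - lls[b]
                if best is None or gain > best[0]:
                    best = (gain, a, b, merged_ll)
        gain, a, b, merged_ll = best
        if gain <= 0:  # no merge improves the total log-likelihood
            break
        clusters[a] = clusters[a] + clusters[b]
        lls[a] = merged_ll
        del clusters[b], lls[b]
    return clusters

# Synthetic stand-in for patient data: irregular times, two trajectories.
rng = np.random.default_rng(0)

def make_patient(sign):
    x = np.sort(rng.uniform(0.0, 5.0, 10))
    return x, sign * np.sin(x) + 0.1 * rng.normal(size=10)

series = [make_patient(1), make_patient(1), make_patient(-1), make_patient(-1)]
clusters = greedy_cluster(series)
```

Fixing the hyperparameters keeps the sketch short; the text above instead optimises them (including the input shifts) and is moving towards approximately integrating over them.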

4. Adversarial samples for GPs and GP LVMs
Through collaboration with Kathrin Grosse and David Pfaff at CISPA, Germany, we are investigating adversarial samples for Gaussian processes. We have produced example samples for the MNIST dataset with the GP LVM and a standard GP classifier (using the Laplace approximation).
A method for finding a bound on the vulnerability of a GP classifier has been produced.
We are attempting to improve on this bound and allow it to work in higher dimensions.
A paper has been submitted covering the above methods for producing adversarial samples, and empirically exploring the potential of using the uncertainty estimates of a GP's prediction to detect adversarial attacks.

5. Developing world datasets
As part of the grant, we are working with Makerere University, Kampala, to provide support in collecting new datasets, in the field of urban informatics.
Last year we trialled our crowd-sourced crashmap data transcription tool. This took images from police road traffic collision record books and transcribed them. A custom-made transcription platform was required to allow us to ensure columns containing sensitive information remained private. A presentation [4] about this work was given at the Data Science Africa 2015 conference in Kampala and at the Workshop on 'Big and Open Data for International Development' in Manchester.
Mike Smith had previously worked on monitoring air quality in Kampala using low cost equipment [5]. He has since tested several different sensors (equipment funded by a separate grant from the University of Sheffield) and is currently collaborating with Engineer Bainomugisha, Makerere on the deployment of these units. The sensors are being installed on motorbike taxis (known as boda-bodas) to monitor PM2.5 pollution across the city. The project will integrate data from these and static sensors being mounted at the university. In particular the development of a low-cost component to remove the effect of humidity is ongoing in Sheffield as part of this project, to ensure future data quality from the field.
Vehicle collision data: This project is currently on pause as the in-country collaborator is not available to provide further datasets.
The air pollution network:
The network itself has been expanded to 8 sensors (mobile and static) after several iterations of hardware design. This has been funded by a USAID grant awarded to Engineer Bainomugisha and Michael Smith.
Work is continuing to improve data quality. In particular Dr. Smith will be returning to Kampala this year to provide calibration and validation measurements across the city.
Data from the US Embassy's new sensor has been incorporated into the model.
An extended abstract to the Advances in Data Science 2018 workshop has been accepted that investigates the efficacy of coregionalising the low-quality sensor data with the precision data from the US embassy to improve prediction accuracy.
Working with the new Urban Observatory we will have access to additional hardware, and will be extending the model to apply to Sheffield air pollution data. In particular we will be able to incorporate more accurate weather data.
We are also working with ARM, who are producing a front-end visualisation of the predictions.

6. Dialysis Analysis
In a collaboration with James Fotheringham, a consultant nephrologist, we are working on a hierarchical model aimed at predicting various patient-specific variables, extrapolating 2-4 days into the future. The data is from a cohort of 10,000 patients receiving dialysis across Europe. The models being investigated are as follows:
For comparison - a simple GP regression model will be fitted to each variable for each patient.
A coregionalised individual patient model - this looks at the relationship between the variables recorded and attempts to improve the prediction accuracy by using the correlations between the variables. We are currently working on applying a version of SGD to select the hyperparameters in this heavily parameterised model.
A hierarchical error model. It was hypothesised that changes in one variable predict changes in other variables in the immediate run-up to a hospitalisation event. It was anticipated that this structure in the data could be used to predict the error in the predictions at the individual model level, and thus allow improvements driven by a population-level model of the errors from the individuals. However, it was found that there was insufficient structure in the data for this to achieve more than a small decrease in RMSE (under roughly 2%).
A population model of mean values is a simple but potentially effective way of improving the predictions in the initial period of a patient's time series. Similarly, the hyperparameters could be selected from previous patient data, rather than by training on potentially insufficient data from an individual patient. These are experiments we are currently conducting.
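To illustrate the coregionalised patient model listed above, the sketch below uses an intrinsic coregionalisation kernel, in which the covariance between variable i at time t and variable j at time t' is B[i, j] * k(t, t') with B = W W^T + diag(kappa). The source does not state which coregionalisation form is used, so this choice, the function names, the toy data and the fixed hyperparameters are all assumptions; hyperparameter selection by SGD is not shown.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def icm_kernel(t1, out1, t2, out2, W, kappa, lengthscale=1.0):
    """Covariance B[out1, out2] * k(t1, t2), with B = W W^T + diag(kappa)."""
    B = W @ W.T + np.diag(kappa)
    return B[np.ix_(out1, out2)] * rbf(t1, t2, lengthscale)

# Toy single-patient data: variable 0 observed on days 0-5,
# variable 1 observed only on days 0-3.
t0 = np.arange(6.0)
t1 = np.arange(4.0)
t_obs = np.concatenate([t0, t1])
out_obs = np.array([0] * 6 + [1] * 4)
y_obs = np.concatenate([np.sin(t0), 0.8 * np.sin(t1)])

W = np.array([[1.0], [0.8]])    # hypothetical shared-structure weights
kappa = np.array([0.05, 0.05])  # hypothetical per-variable variance

K = icm_kernel(t_obs, out_obs, t_obs, out_obs, W, kappa, lengthscale=2.0)
K += 0.01 * np.eye(len(t_obs))  # assumed observation noise

# Predict variable 1 on day 5, two days beyond its last observation,
# borrowing strength from the later observations of variable 0.
k_star = icm_kernel(np.array([5.0]), np.array([1]), t_obs, out_obs,
                    W, kappa, lengthscale=2.0)
pred = k_star @ np.linalg.solve(K, y_obs)
```

The number of entries in W and kappa grows with the number of recorded variables, which is one reason such a model has many hyperparameters to select.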
We will be presenting a poster on this topic at the Bayes Comp 2018 conference in Barcelona in March 2018.
An initial codebase to analyse this was rewritten to provide a clear and reliable framework, including unit testing with simulated patient datasets. The DASK EC2 system has been debugged and adapted to work with Amazon EC2 and allow the patient models to be computed in parallel.

[1] http://arxiv.org/abs/1606.00720
[2] https://github.com/SheffieldML/GPy/blob/devel/GPy/kern/src/multidimensional_integral_limits.py
[3] https://github.com/lionfish0/clustering/blob/master/Offset%20GP%20Regression%20Model%20Demonstration.ipynb
[4] https://drive.google.com/file/d/0B-14eY3gwnmGV253WEhsaTFiMms/view
[5] http://www.michaeltsmith.org.uk/other/airpol3.pdf
Exploitation Route Differential Privacy (DP) for Gaussian Processes (GPs): There are many ways this could be extended academically: applying the methods of concentrated DP, adjusting the lengthscale across the space to balance privacy and precision, exploring different kernels, and applying methods for hyperparameter optimisation. In practice, we hope to see this method used to anonymise data from clinical research and future census data.

Integral Kernel: In many practical cases we need estimates from histogram data (a very common method for data aggregation). This provides a simple, practical and principled method for calculating these estimates. It is also useful if one wants to apply differential privacy using a histogram query. Finally, the method is already being used at CitizenMe to estimate future audience sizes based on answers to previous questions.

Developing world datasets: At last year's Data Science Africa conference we were approached by the new ambulance service in Kampala, who plan to use the crashmap dataset to help decide the optimum placement of their ambulance stations.
Sectors Aerospace, Defence and Marine,Communities and Social Services/Policy,Digital/Communication/Information Technologies (including Software),Environment,Financial Services, and Management Consultancy,Healthcare,Security and Diplomacy,Other

 
Description There are two key non-academic uses of this work:
- The Integral Kernel method is being used at CitizenMe to estimate future audience sizes based on answers to previous questions. Previously it wasn't clear how one would perform this estimate correctly. CitizenMe, in collaboration with this project, developed a method to perform inference across users' devices, to avoid privacy-compromising aggregation of their data.
- The ambulance service of Kampala uses the crash map dataset to help decide the optimum placement of their ambulance stations.
Additionally, we won funding from QR GCRF and support from the Urban Flows Observatory at Sheffield to deploy 50 additional sensors. We developed a kernel to produce a new calibration method for these sensors. We also won QR GCRF funding to support the deployment and initial development of the novel calibration kernel.
Sector Digital/Communication/Information Technologies (including Software),Healthcare,Security and Diplomacy
Impact Types Societal,Economic