Scalable and Exact Data Science for Security and Location-based Data

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics

Abstract

Incredible technological advances in data collection and storage have created a world in which we are constantly generating data. From supermarket loyalty cards and social media posts to healthcare records and credit card transactions, a digital footprint exists for every aspect of our lives. The ability of data science to analyse and act upon these complex and varied data sources has the potential to improve and revolutionise our lives in a myriad of ways, for example, through the development of driverless cars and personalised medicine.

The great challenge of data science lies in the trade-off between the speed and accuracy with which large volumes of data can be analysed and acted upon within complex data environments. Extracting deeper knowledge from data requires increasingly sophisticated mathematical models. However, applying such models introduces significant computational constraints, forcing data scientists to rely upon simpler models or approximate inference tools.

In collaboration with strategic partners, this project will bring together industry experts to investigate new approaches to data science driven by fundamental challenges in modelling and analysing large-scale spatial and security data. The data and issues within this domain are highly significant to modern society as they cover, for example, issues pertaining to fraud detection and computer hacking, as well as understanding and predicting human behaviour within a Smart City environment.

Novel mathematical advances in computational statistics and machine learning will be developed to produce scalable techniques for applying sophisticated mathematical models to large-scale heterogeneous and structured data sources. A key component of this project is reproducibility through the creation of open-source software. These tools will allow data scientists to implement research outcomes to extract key features from complex data and make decisions with high accuracy under uncertainty.

Planned Impact

This research agenda is designed to address the significant topical challenges of modern data science which impede its applicability within complex data environments. Through close engagement with industrial stakeholders, this research will produce a transformative approach to analysing large-scale heterogeneous data in the areas of spatio-temporal modelling and security & defence applications.

This project is supported by an impressive array of committed partners: Prowler.io, The Heilbronn Institute of Mathematical Research (HIMR) and The Alan Turing Institute (ATI), who provide significant expertise in the areas of security and spatio-temporal modelling. Through an integrative research programme with the project partners, key research outcomes will be tested and deployed on the data and systems owned by these partners, providing real-world verification of the applicability of research outputs.

Through the co-design and implementation of research objectives with project partners, the scalable data science tools created under this fellowship will contribute to the knowledge economy of the UK by enabling researchers and practitioners to apply complex mathematical models to data sources whose dimensionality was previously prohibitive. Key engagements with HIMR will support the application of this research to address imperative national security challenges.

Open-source software will be developed from the research outcomes. This will extend the impact of this work beyond the academic community, providing tools that end-users can freely apply to a wide variety of data sources beyond the security and spatio-temporal domains. This software will become part of the core toolbox for both public and private sector organisations seeking to fit complex models to large data.

Publications

An Z (2019) Accelerating Bayesian Synthetic Likelihood With the Graphical Lasso in Journal of Computational and Graphical Statistics

Baker J (2018) Control variates for stochastic gradient MCMC in Statistics and Computing

Baker J (2019) sgmcmc: An R Package for Stochastic Gradient Markov Chain Monte Carlo in Journal of Statistical Software

Nemeth C (2021) Stochastic Gradient Markov Chain Monte Carlo in Journal of the American Statistical Association

Verjans V (2020) Bayesian calibration of firn densification models in The Cryosphere

 
Description Detecting soil degradation and restoration through a novel coupled sensor and machine learning framework
Amount £811,651 (GBP)
Funding ID NE/T012307/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 01/2020 
End 12/2022
 
Description Explainable AI for UK agricultural land use decision-making
Amount £43,151 (GBP)
Funding ID NE/T004002/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 07/2019 
End 07/2020
 
Description Alan Turing Institute (ATI) collaboration 
Organisation Alan Turing Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution I worked with the PI, Chris Nemeth, and collaborators on writing, methods development and theory for a paper that has now been submitted to a top-tier statistical journal.
Collaborator Contribution The collaborators from the Alan Turing Institute involved in this project include Chris Oates, Toni Karvonen and Mark Girolami. They funded two research visits to the Alan Turing Institute and worked jointly with me and the PI, Chris Nemeth, on writing the manuscript, methods development and theory.
Impact This collaboration has resulted in a computational statistics (single discipline) research paper submitted to a top-tier journal in the area.
Start Year 2018
 
Description Methods for multimodal sampling 
Organisation PROWLER.io
Country United Kingdom 
Sector Private 
PI Contribution Developed a new algorithm for sampling from multimodal posterior distributions
Collaborator Contribution Provided new insights and developed software
Impact A paper on this work was published in a top AI conference - http://papers.nips.cc/paper/8683-pseudo-extended-markov-chain-monte-carlo
Start Year 2018
 
Title SGMCMC R package 
Description This software implements a host of stochastic gradient MCMC algorithms for fast Bayesian inference. It has been developed for the R language and is built upon the Google TensorFlow library. By utilising the efficient computation of TensorFlow, and in particular its automatic differentiation tools, this software is the first R package to provide a simple user interface for statisticians to use gradient-based MCMC algorithms without requiring the gradients to be hand-coded.
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software has only recently been released and is yet to achieve its full potential. However, several papers have already cited this software in their work, indicating that it is being used within the community. 
URL https://github.com/STOR-i/sgmcmc
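
As an illustration of the kind of algorithm the package covers, the following is a minimal sketch of stochastic gradient Langevin dynamics (SGLD), one stochastic gradient MCMC method. It is written in plain NumPy and is not the sgmcmc package's interface. The model (Gaussian data with unit variance and a Gaussian prior on the mean), the function name `sgld_gaussian_mean` and all parameter values are assumptions chosen for illustration. The gradient is hand-coded here only because the model is trivial; the package described above obtains gradients by automatic differentiation instead.

```python
import numpy as np


def sgld_gaussian_mean(data, n_iter=5000, batch_size=32, step_size=1e-4,
                       prior_var=10.0, seed=0):
    """Illustrative SGLD sampler for the mean of N(theta, 1) data.

    Assumed model (hypothetical, for illustration only):
        theta ~ N(0, prior_var),  x_i | theta ~ N(theta, 1).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size
    theta = 0.0
    samples = np.empty(n_iter)
    for t in range(n_iter):
        # Draw a minibatch without replacement.
        batch = rng.choice(data, size=batch_size, replace=False)
        # Gradient of the log prior.
        grad_log_prior = -theta / prior_var
        # Unbiased minibatch estimate of the log-likelihood gradient.
        grad_log_lik = (n / batch_size) * np.sum(batch - theta)
        # Langevin update: half-step along the gradient plus Gaussian noise.
        theta += (0.5 * step_size * (grad_log_prior + grad_log_lik)
                  + np.sqrt(step_size) * rng.normal())
        samples[t] = theta
    return samples
```

With a fixed step size the chain only approximates the posterior, so the step size trades bias against mixing speed; discarding an initial burn-in portion of `samples` before summarising is the usual practice.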
 
Title ZVCV R package 
Description This R package can be used to implement gradient-based variance reduction techniques, including a method that we developed as part of the grant. The package is on the main R package repository (CRAN) and on GitHub, with the updated version on GitHub to be sent to CRAN in the next month or so. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This is the only R package for gradient-based variance reduction techniques that I'm aware of. It has been downloaded over 3500 times. 
URL https://github.com/LeahPrice/ZVCV
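
To indicate what gradient-based variance reduction involves, the following is a minimal sketch of a first-order zero-variance control variate estimator in plain NumPy. It is not the ZVCV package's interface, and the function name `zv_cv_first_order` is hypothetical. The sketch relies on the fact that the score, the gradient of the log target density, has expectation zero under the target, so regressing the integrand on the score and keeping the intercept gives an estimate of the expectation with reduced variance.

```python
import numpy as np


def zv_cv_first_order(f_vals, scores):
    """First-order control variate estimate of E[f(X)] (illustrative sketch).

    f_vals : shape (n,), integrand evaluated at samples from the target.
    scores : shape (n,) or (n, d), gradient of the log target density
             evaluated at the same samples.
    """
    f_vals = np.asarray(f_vals, dtype=float)
    scores = np.asarray(scores, dtype=float).reshape(len(f_vals), -1)
    # Design matrix: intercept column plus the score components.
    design = np.column_stack([np.ones(len(f_vals)), scores])
    # Ordinary least squares; the fitted intercept is the estimate.
    coef, *_ = np.linalg.lstsq(design, f_vals, rcond=None)
    return coef[0]
```

For a Gaussian target and the integrand f(x) = x, the integrand is exactly linear in the score, so this estimator recovers the mean with no Monte Carlo error; for other integrands the reduction is partial.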
 
Description Alan Turing Institute reading group presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other audiences
Results and Impact I gave a tutorial on a parametric alternative to approximate Bayesian computation to the reading group. This sparked more interest in the approach and its theoretical properties.
Year(s) Of Engagement Activity 2019
 
Description Invited talk at Bayes4Health & CoSInES workshop 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact My talk on derivative-based variance reduction sparked a discussion with the first author of a Journal of the Royal Statistical Society read paper, which led to us submitting a comment on how methods from the talk could be used in their novel application.
Year(s) Of Engagement Activity 2019
 
Description Poster presentation at BayesComp2020 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I presented a poster on derivative-based variance reduction. I spoke with several people about the work, including one person who was interested in applying the proposed methods to his high-dimensional application.
Year(s) Of Engagement Activity 2020
URL http://users.stat.ufl.edu/~jhobert/BayesComp2020/Conf_Website/
 
Description STOR-i conference 2019
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Presented a talk at the annual conference for my CDT. This was targeted at current students and partners of the CDT. The conference was designed to provide an opportunity for those involved to get a sense of the wide variety of research being undertaken at my CDT.
Year(s) Of Engagement Activity 2019
 
Description Seminar (University of Manchester) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Seminar for the University of Manchester statistics, quantification of uncertainty, inverse problems and data science group. Debate and discussion on multimodal methods for Markov chain Monte Carlo algorithms.
Year(s) Of Engagement Activity 2019
 
Description Seminar (University of Oslo) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact A seminar on scalable Markov chain Monte Carlo algorithms was given to the Statistics Department at the University of Oslo. Several interesting discussions and a new collaboration stemmed from this talk.
Year(s) Of Engagement Activity 2018
 
Description Seminar (University of Oxford) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact A seminar was given to the University of Oxford Statistics Department on pseudo-extended MCMC methods. Many interesting questions and discussions followed on from this meeting.
Year(s) Of Engagement Activity 2019
 
Description Seminar, Bocconi University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact The PI gave a seminar at Bocconi University, Milan, Italy. The seminar was attended by approximately 30 people, including academic staff, PhD students and research associates. The talk covered research outputs from this grant pertaining to scalable Monte Carlo inference. Several members of the audience were interested in learning more about this work and incorporating these techniques in their research.
Year(s) Of Engagement Activity 2018
URL http://didattica.unibocconi.eu/eventi/event.php?IdPag=5575&dip=55&id=5735&IdFld=265&See=
 
Description Talk at CMStatistics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I spoke in an invited session about a novel sequential Monte Carlo method. I received several questions, and one participant working in applied areas was interested in applying this method to their application.
Year(s) Of Engagement Activity 2019
URL http://cmstatistics.org/RegistrationsV2/CFE2019/viewSubmission.php?in=401&token=30n7nssqr596820r98p4...
 
Description Talk at Monte Carlo methods (MCM) 2019 conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I gave a talk in an invited session about a novel sequential Monte Carlo method, which sparked interest from the developer of a new sampler in applying their method within ours.
Year(s) Of Engagement Activity 2019
URL http://www.mcm2019.unsw.edu.au/FinalProgram-rotated.pdf
 
Description Talk at an RSS workshop (Reading University) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Chris Nemeth was invited to give a talk on MCMC methods at the Royal Statistical Society Reading local group event on Bayesian computation.
Year(s) Of Engagement Activity 2018
 
Description University of New South Wales seminar 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact I gave a seminar to the UNSW statistics department as well as postdocs from across Australia (who were attending for an annual retreat) on derivative-based control variates. After the seminar, I had extensive discussions about the research with a few researchers who were working in related areas.
Year(s) Of Engagement Activity 2019
URL http://www.maths.unsw.edu.au/seminars/archive/annual/2019?page=8
 
Description University of Warwick seminar 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other audiences
Results and Impact I was invited to give a seminar at the University of Warwick, where I spoke about gradient-based control variates. Several people were interested in discussing the methods in more detail after the talk.
Year(s) Of Engagement Activity 2019
URL http://warwick.ac.uk/fac/sci/statistics/news/algorithms-seminars/2018-19/