Scalable and Exact Data Science for Security and Location-based Data

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics

Abstract

Incredible technological advances in data collection and storage have created a world in which we are constantly generating data. From supermarket loyalty cards and social media posts to healthcare records and credit card transactions, a digital footprint exists for every aspect of our lives. The ability of data science to analyse and act upon these complex and varied data sources has the potential to improve and revolutionise our lives in a myriad of ways, for example, through the development of driverless cars and personalised medicine.

The great challenge of data science lies in the trade-off between the speed and accuracy with which large volumes of data can be analysed and acted upon within complex data environments. Extracting deeper knowledge from data requires increasingly sophisticated mathematical models. However, applying such models introduces significant computational constraints, forcing data scientists to rely upon simpler models or approximate inference tools.

In collaboration with strategic partners, this project will bring together industry experts to investigate new approaches to data science driven by fundamental challenges in modelling and analysing large-scale spatial and security data. The data and problems within this domain are highly significant to modern society: they include, for example, fraud detection and computer hacking, as well as understanding and predicting human behaviour within a Smart City environment.

Novel mathematical advances in computational statistics and machine learning will be developed to produce scalable techniques for applying sophisticated mathematical models to large-scale heterogeneous and structured data sources. A key component of this project is reproducibility through the creation of open-source software. These tools will allow data scientists to implement research outcomes to extract key features from complex data and make decisions with high accuracy under uncertainty.

Planned Impact

This research agenda is designed to address the significant topical challenges of modern data science which impede its applicability within complex data environments. Through close engagement with industrial stakeholders, this research will produce a transformative approach to analysing large-scale heterogeneous data in the areas of spatio-temporal modelling and security & defence applications.

This project is supported by an impressive array of committed partners: Prowler.io, The Heilbronn Institute of Mathematical Research (HIMR) and The Alan Turing Institute (ATI), who provide significant expertise in the areas of security and spatio-temporal modelling. Through an integrative research programme with the project partners, key research outcomes will be tested and deployed on the data and systems owned by these partners, providing real-world verification of the applicability of research outputs.

Through the co-design and implementation of research objectives with project partners, the scalable data science tools created under this fellowship will contribute to the knowledge economy of the UK by enabling researchers and practitioners to apply complex mathematical models to data sources whose dimensionality previously made such models prohibitively expensive. Key engagements with HIMR will support the application of this research to address imperative national security challenges.

Open-source software will be developed from the research outcomes. This will extend the impact of the work beyond the academic community, providing tools that end-users can freely apply to a wide variety of data sources beyond the security and spatio-temporal domains. This software is intended to become part of the core toolbox for both public and private sector organisations seeking to fit complex models to large data.

Publications

 
Title SGMCMC R package 
Description This software implements a range of stochastic gradient MCMC algorithms for fast Bayesian inference. It has been developed for the R language and is built upon the Google TensorFlow library. By using TensorFlow's efficient computation, and in particular its automatic differentiation tools, this software is the first R package to provide a simple user interface for statisticians to use gradient-based MCMC algorithms without requiring the gradients to be hand-coded. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact The software has only recently been released and is yet to achieve its full potential. However, several papers have already cited this software in their work, indicating that it is being used within the community. 
URL https://github.com/STOR-i/sgmcmc
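
As an illustration of the kind of algorithm described above, the following is a minimal sketch of stochastic gradient Langevin dynamics (SGLD), one well-known stochastic gradient MCMC method. It is written in Python with NumPy and is not the sgmcmc package's interface: the function name, arguments, and the model (a Gaussian mean with unit observation variance and a zero-mean Gaussian prior) are assumptions chosen for illustration. The gradient here is coded by hand, which is exactly the step that automatic differentiation is meant to remove.

```python
import numpy as np


def sgld_gaussian_mean(data, n_iters=5000, batch_size=50,
                       step_size=1e-4, prior_var=100.0, seed=0):
    """Hypothetical SGLD sampler for the mean of unit-variance Gaussian data.

    Assumed model (illustrative only):
        theta ~ N(0, prior_var),   x_i | theta ~ N(theta, 1).
    Returns an array of n_iters draws of theta.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    theta = 0.0
    samples = np.empty(n_iters)

    for t in range(n_iters):
        # Subsample a mini-batch instead of touching the full data set.
        batch = rng.choice(data, size=batch_size, replace=False)

        # Gradient of the log prior, d/dtheta log N(theta; 0, prior_var).
        grad_prior = -theta / prior_var

        # Unbiased estimate of the full-data log-likelihood gradient:
        # rescale the mini-batch gradient by n / batch_size.
        grad_lik = (n / batch_size) * np.sum(batch - theta)

        # Langevin update: half-step along the gradient estimate plus
        # Gaussian noise with variance equal to the step size.
        theta = (theta
                 + 0.5 * step_size * (grad_prior + grad_lik)
                 + np.sqrt(step_size) * rng.normal())
        samples[t] = theta

    return samples
```

Calling `sgld_gaussian_mean` on a few thousand simulated observations and discarding an initial burn-in period should give draws centred near the sample mean. With a fixed step size the draws only approximate the posterior, and the mini-batch noise inflates their spread somewhat.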
 
Description Seminar, Bocconi University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact The PI gave a seminar at Bocconi University, Milan, Italy. The seminar was attended by approximately 30 people, including academic staff, PhD students and research associates. The talk covered research outputs from this grant pertaining to scalable Monte Carlo inference. Several members of the audience were interested in learning more about this work and incorporating these techniques in their research.
Year(s) Of Engagement Activity 2018
URL http://didattica.unibocconi.eu/eventi/event.php?IdPag=5575&dip=55&id=5735&IdFld=265&See=