StatScale: Statistical Scalability for Streaming Data

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics

Abstract

We live in the age of data. Technology is transforming our ability to collect and store data on unprecedented scales. From the use of Oyster card data to improve London's transport network, to the Square Kilometre Array astrophysics project that has the potential to transform our understanding of the universe, Big Data can inform and enrich many aspects of our lives. Due to the widespread use of sensor-based systems in everyday life, with even smartphones having sensors that can monitor location and activity level, much of the explosion of data is in the form of data streams: data from one or more related sources that arrive over time. It has even been estimated that there will be over 30 billion devices collecting data streams by 2020.

The important role of Statistics within "Big Data" and data streams has been clear for some time. However, the current tendency has been to focus purely on algorithmic scalability: for example, developing versions of existing statistical algorithms that scale better with the amount of data. Such an approach ignores the fact that fundamentally new issues often arise when dealing with data sets of this magnitude, and that highly innovative solutions are required.

Model error is one such issue. Many statistical approaches are based on the use of mathematical models for data. These models are only approximations of the real data-generating mechanisms. In traditional applications, this model error is usually small compared with the inherent sampling variability of the data, and can be overlooked. However, there is an increasing realisation that model error can dominate in Big Data applications. Understanding the impact of model error, and developing robust methods that have excellent statistical properties even in the presence of model error, are major challenges.

A second issue is that many current statistical approaches are not computationally feasible for Big Data. In practice we will often need to use less efficient statistical methods that are computationally faster, or require less computer memory. This introduces a statistical-computational trade-off that is unique to Big Data, leading to many open theoretical questions, and important practical problems.

The strategic vision for this programme grant is to investigate and develop an integrated approach to tackling these and other fundamental statistical challenges. To do this we will focus in particular on analysing data streams. An important issue with this type of data is detecting changes in the structure of the data over time. This will be an early area of focus for the programme, as it has been identified as one of seven key problem areas for Big Data. Moreover, it is an area in which our research will lead to practically important breakthroughs. Our philosophy is to tackle the methodological, theoretical and computational aspects of these statistical problems together, an approach that is only possible through the programme grant scheme. Such a broad perspective is essential to achieve the substantive fundamental advances in statistics envisaged, and to ensure our new methods are sufficiently robust and efficient to be widely adopted by academics, industry and society more generally.
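The changepoint-detection problem highlighted above can be illustrated with a minimal sketch: a textbook one-sided CUSUM statistic that flags an upward shift in the mean of a stream. This is a standard classical method with illustrative threshold and simulated data, not one of the programme's own algorithms.

```python
import numpy as np

def cusum_mean_change(x, threshold=5.0):
    """Return the first index at which a one-sided CUSUM statistic for an
    upward mean shift exceeds `threshold`, or None if no change is flagged.
    Assumes (for illustration only) pre-change mean 0 and unit variance."""
    s = 0.0
    for t, xt in enumerate(x):
        # drift term 0.5 = half the mean shift we aim to detect
        s = max(0.0, s + xt - 0.5)
        if s > threshold:
            return t
    return None

rng = np.random.default_rng(1)
# 100 observations with mean 0, then a change to mean 1 at t = 100
stream = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])
print(cusum_mean_change(stream))  # typically flags shortly after t = 100
```

In a genuine streaming setting the statistic would be updated one observation at a time as data arrive, which is exactly the regime the programme targets.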

Planned Impact

Who will benefit?
This proposal will benefit a variety of different stakeholders including:
(a) A wide range of industries, including collaborating industrial partners and those organisations that handle large volumes of data [e.g. the NHS, Transport Agencies, Energy companies etc.];
(b) Society more generally through the application of this research;
(c) The academic research community, particularly in disciplines that underpin and relate to the data sciences;
(d) Project personnel: PDRAs and PhD students.

How will they benefit?
New techniques: (a, b, c)
The research undertaken will develop a number of exciting new statistical techniques that will be disseminated to our partners and user communities. Our methods will make principled analysis of very large datasets feasible, faster and more accurate, resulting in more efficient and cost-effective use of precious resources. These benefits will flow through the economy and society via a number of mechanisms, including more efficient use of resources (e.g. better management of oil fields via improved processing of well-operation data); improved productivity (e.g. more timely detection of, and intervention in, faults on telecommunications networks); and benefits to society more generally (e.g. statistical methods capable of analysing eHealth data streams to support monitoring of vulnerable elderly people living independently). To enable this, we specifically include resource for a Research Software Engineer to make available high-quality, documented, open-source code for others to use.

Targeted Knowledge Exchange: (a)
Significant further benefit will accrue to beneficiary group (a) through their partnership on this project. At this stage, we benefit from the support of several leading organisations in the energy, health and telecommunications sectors. They have expressed enthusiastic support for this programme's vision and have provided valuable insight and advice as we have developed this proposal.
For example, through dialogue with this community, the idea of short-term secondment visits via a partnership programme has developed. PDRAs will spend periods of time at partner locations, developing case studies that demonstrate the utility of the developed methods on data-rich products and systems. Partners are also keen to work with us to develop successful knowledge-exchange mechanisms, and representatives from this community will sit on the Advisory and Impact Board.

Generic Knowledge Exchange: (c)
We will develop methods that are of considerable interest to the academic community, both in Statistics and in other fields. As well as the traditional routes of journal publications, workshops and conferences, the programme will develop open-source R software that embodies our techniques; this will benefit the academic community and beyond. Further, we will work with our advisory group to share our techniques with a wider audience where appropriate, through an academic partnership programme that will facilitate research retreats, academic exchanges and similar activities.

Developing good people: (b,d)
The programme will develop highly skilled researchers in a statistical field of high strategic importance. Project personnel will benefit from a supportive training, research and development environment, and will be given the opportunity to create new techniques and see them employed in productive and worthwhile settings. They will therefore be ideally positioned to seek future employment in a field or industry that enables them to make a strong contribution to society.

Contributing to the future supply of people: (all)
This proposal will secure an increase in the number and quality of researchers in statistics in an area of historic shortage. In particular, with the advent of sensor-based industrial systems, developing future research leaders capable of underpinning the UK's competitive advantage in this area is crucial.

Publications

Aston, J. A. D. (2014) Efficiency of change point tests in high dimensional settings, arXiv e-prints

Bardwell, L. (2018) Most Recent Changepoint Detection in Panel Data, Technometrics

Hocking, T. D. (2017) A log-linear time algorithm for constrained changepoint detection, arXiv e-prints

Fearnhead, P. (2018) Detecting Changes in Slope With an L0 Penalty, Journal of Computational and Graphical Statistics

Fearnhead, P. (2018) Changepoint Detection in the Presence of Outliers, Journal of the American Statistical Association

 
Description So far we have attracted interest from our industrial partners in the use of our developed methods. At the time of writing, not all results have been published, so the full extent of the impact of this grant will not be known for some time.
First Year Of Impact 2019
Sector Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Energy; Environment
Impact Types Societal, Economic

 
Description Isaac Newton Programme on Statistical Scalability
Amount £180,000 (GBP)
Funding ID Statistical Scalability 
Organisation Isaac Newton Institute for Mathematical Sciences 
Sector Academic/University
Country United Kingdom
Start 01/2018 
End 06/2018
 
Description Methodologically Enhanced Virtual Labs for Early Warning of Significant or Catastrophic Change in Ecosystems: Changepoints for a Changing Planet
Amount £203,419 (GBP)
Funding ID NE/T006102/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 11/2019 
End 11/2020
 
Description Next Generation Converged Digital infrastructure (NG-CDI)
Amount £5,000,000 (GBP)
Funding ID EP/R004935/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 11/2017 
End 10/2022
 
Title anomaly 
Description An implementation of CAPA (Collective And Point Anomaly) for the detection of anomalies in time series data. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact An implementation of CAPA (Collective And Point Anomaly) for the detection of anomalies in time series data. 
URL https://cran.r-project.org/web/packages/anomaly/index.html
 
Title GRPtests 
Description Methodology for testing nonlinearity in the conditional mean function in low- or high-dimensional generalized linear models, and the significance of (potentially large) groups of predictors. Details of the algorithms can be found in the paper by Jankova, Shah, Buehlmann and Samworth (2019). 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Too early to say 
URL https://CRAN.R-project.org/package=GRPtests
 
Title GeneralisedCovarianceMeasure: Test for Conditional Independence Based on the Generalized Covariance Measure (GCM) 
Description A statistical hypothesis test for conditional independence. It performs nonlinear regressions on the conditioning variable and then tests for a vanishing covariance between the resulting residuals. It can be applied to both univariate random variables and multivariate random vectors. Details of the method can be found in Rajen D. Shah and Jonas Peters (2018). 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Too early to say 
URL https://CRAN.R-project.org/package=GeneralisedCovarianceMeasure
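The procedure described above (regress on the conditioning variable, then test for vanishing covariance between the residuals) can be sketched as follows. The package itself is in R; this is a simplified, univariate Python illustration using random-forest regressions, not the package's implementation, and the data and regression choice are purely illustrative.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor

def gcm_style_test(x, y, z, seed=0):
    """Simplified sketch of a Generalised-Covariance-Measure-style test:
    regress x on z and y on z, then test whether the product of the
    residuals has mean zero (as it should under X independent of Y given Z)."""
    # note: in-sample residuals overfit; the real method relies on regression
    # estimates with suitable theoretical guarantees
    rx = x - RandomForestRegressor(random_state=seed).fit(z, x).predict(z)
    ry = y - RandomForestRegressor(random_state=seed).fit(z, y).predict(z)
    r = rx * ry                                  # products of residuals
    t = np.sqrt(len(r)) * r.mean() / r.std()     # normalised test statistic
    return 2 * stats.norm.sf(abs(t))             # two-sided p-value

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
x = z[:, 0] + 0.1 * rng.normal(size=500)
y = z[:, 0] + 0.1 * rng.normal(size=500)   # x and y are independent given z
print(gcm_style_test(x, y, z))
```

Here x and y are strongly correlated marginally but independent given z, which is exactly the situation a conditional independence test must distinguish from genuine dependence.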
 
Title R package called IndepTest 
Description An R package implementing MINT, an independence test proposed in Berrett and Samworth (2017). 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Too early to say. 
URL https://cran.r-project.org/web/packages/IndepTest/index.html