StatScale: Statistical Scalability for Streaming Data

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics

Abstract

We live in the age of data. Technology is transforming our ability to collect and store data on unprecedented scales. From the use of Oyster card data to improve London's transport network, to the Square Kilometre Array astrophysics project that has the potential to transform our understanding of the universe, Big Data can inform and enrich many aspects of our lives. Due to the widespread use of sensor-based systems in everyday life, with even smartphones having sensors that can monitor location and activity level, much of the explosion of data is in the form of data streams: data from one or more related sources that arrive over time. It has even been estimated that there will be over 30 billion devices collecting data streams by 2020.

The important role of Statistics within "Big Data" and data streams has been clear for some time. However, the current tendency has been to focus purely on algorithmic scalability, such as how to develop versions of existing statistical algorithms that scale better with the amount of data. Such an approach ignores the fact that fundamentally new issues often arise when dealing with data sets of this magnitude, and that highly innovative solutions are required.

Model error is one such issue. Many statistical approaches are based on the use of mathematical models for data. These models are only approximations of the real data-generating mechanisms. In traditional applications, this model error is usually small compared with the inherent sampling variability of the data, and can be overlooked. However, there is an increasing realisation that model error can dominate in Big Data applications. Understanding the impact of model error, and developing robust methods that have excellent statistical properties even in the presence of model error, are major challenges.

A second issue is that many current statistical approaches are not computationally feasible for Big Data. In practice we will often need to use less efficient statistical methods that are computationally faster, or require less computer memory. This introduces a statistical-computational trade-off that is unique to Big Data, leading to many open theoretical questions, and important practical problems.

The strategic vision for this programme grant is to investigate and develop an integrated approach to tackling these and other fundamental statistical challenges. In order to do this we will focus in particular on analysing data streams. An important issue with this type of data is detecting changes in the structure of the data over time. This will be an early area of focus for the programme, as it has been identified as one of seven key problem areas for Big Data. Moreover it is an area in which our research will lead to practically important breakthroughs. Our philosophy is to tackle methodological, theoretical and computational aspects of these statistical problems together, an approach that is only possible through the programme grant scheme. Such a broad perspective is essential to achieve the substantive fundamental advances in statistics envisaged, and to ensure our new methods are sufficiently robust and efficient to be widely adopted by academics, industry and society more generally.

Planned Impact

Who will benefit?
This proposal will benefit a variety of different stakeholders including:
(a) A wide range of industries, including collaborating industrial partners and those organisations that handle large volumes of data [e.g. the NHS, Transport Agencies, Energy companies etc.];
(b) Society more generally through the application of this research;
(c) The academic research community, particularly in disciplines that underpin and relate to the data sciences;
(d) Project personnel: PDRAs and PhD students.

How will they benefit?
New techniques: (a, b, c)
The research undertaken will develop a number of exciting new statistical techniques that will be disseminated to our partners and user communities. Our methods will result in more efficient and cost-effective ways of marshalling precious resources, by making principled analysis of very large datasets either i) feasible and/or ii) faster and more accurate. These benefits will flow through the economy and society via a number of different mechanisms. These might include more efficient use of resources (e.g. better management of oil fields via improved processing of well operation data); improved productivity (e.g. via more timely management of, and intervention on, faults on telecommunications networks); and benefits to society more generally (via the development of statistical methods capable of analysing eHealth data streams, e.g. to support monitoring of vulnerable elderly people living independently). To enable this, we specifically include resource for a Research Software Engineer to make available high-quality, documented open source code for others to use.

Targeted Knowledge Exchange: (a)
Significant further benefit will accrue to beneficiary group (a) through their partnership on this project. At this stage, we benefit from the support of several leading organisations in the energy, health and telecommunications sectors. They have expressed enthusiastic support for this programme's vision and have provided valuable insight and advice as we have developed this proposal.
For example, through dialogue with this community the idea of short-term secondment visits via a partnership programme has developed. PDRAs will spend periods of time at partner locations developing case studies that demonstrate the utility of the developed methods on data-rich products and systems. Our partners are also keen to work with us to develop successful knowledge exchange mechanisms. Representatives from this community will also sit on the Advisory and Impact Board.

Generic Knowledge Exchange: (c)
We will develop methods that are of considerable interest to the academic community both in Statistics and other fields. As well as the traditional routes of journal publication, workshops and conferences, the programme will develop open source R software that embodies our techniques: these will benefit the academic community and beyond. Further, we will work with our advisory group to share our techniques with a wider audience where appropriate, through an academic partnership programme that will facilitate research retreats, academic exchanges etc.

Developing good people: (b,d)
The programme will develop highly skilled researchers in a statistical field of high strategic importance. Project personnel will benefit from a supportive training, research and development environment, with the opportunity to create new techniques and see them employed in a productive and worthwhile setting. They will therefore be ideally positioned to seek future employment in a field/industry that enables them to make a strong contribution to society.

Contributing to the future supply of people: (all)
This proposal will secure an increase in the number and quality of researchers in statistics in an area of historic shortage. In particular, with the advent of sensor-based industrial systems, the need for developing future research leaders capable of underpinning the UK's competitive advantage in this area is crucial.

Publications

Agarwal G (2023) Semiparametric detection of changepoints in location, scale, and copula in Statistical Analysis and Data Mining: The ASA Data Science Journal

Bardwell L (2018) Most Recent Changepoint Detection in Panel Data in Technometrics

Berrett TB (2021) USP: an independence test that improves on Pearson's chi-squared and the G-test in Proceedings. Mathematical, physical, and engineering sciences

Berrett TB (2021) Optimal rates for independence testing via U-statistic permutation tests in Annals of Statistics

Chen Y (2022) High-Dimensional, Multiscale Online Changepoint Detection in Journal of the Royal Statistical Society Series B: Statistical Methodology

Chen Y (2023) Inference in High-Dimensional Online Changepoint Detection in Journal of the American Statistical Association

 
Description To date, the methods developed have predominantly attracted interest from our industrial partners. At the time of writing, not all results have been published, so the full extent of the impact of this grant will not be known for some time. However, we highlight two examples of impact arising from the programme thus far. The first relates to the anomaly detection work reported by Fisch et al. (2018), which is already being used by BT to provide data-driven insights that help operate and maintain the UK's digital infrastructure. The second is that other methods, such as those reported by Jewell et al. (2020), have been used by the Allen Institute for Brain Science as they develop understanding of how the human brain works.
First Year Of Impact 2019
Sector Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Energy; Environment; Healthcare
Impact Types Societal, Economic

 
Description Isaac Newton Programme on Statistical Scalability
Amount £180,000 (GBP)
Funding ID Statistical Scalability 
Organisation Isaac Newton Institute for Mathematical Sciences 
Sector Academic/University
Country United Kingdom
Start 01/2018 
End 06/2018
 
Description Methodologically Enhanced Virtual Labs for Early Warning of Significant or Catastrophic Change in Ecosystems: Changepoints for a Changing Planet
Amount £203,419 (GBP)
Funding ID NE/T006102/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 11/2019 
End 11/2020
 
Description Next Generation Converged Digital infrastructure (NG-CDI)
Amount £5,000,000 (GBP)
Funding ID EP/R004935/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 11/2017 
End 10/2022
 
Title Anomaly 
Description An implementation of CAPA (Collective And Point Anomaly) for the detection of anomalies in time series data. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact An implementation of CAPA (Collective And Point Anomaly) for the detection of anomalies in time series data. 
URL https://cran.r-project.org/web/packages/anomaly/index.html
 
Title BayesProject: Fast Projection Direction for Multivariate Changepoint Detection 
Description Implementations in 'cpp' of the BayesProject algorithm (see G. Hahn, P. Fearnhead, I.A. Eckley (2020)), which implements a fast approach to compute a projection direction for multivariate changepoint detection, as well as the sum-cusum and max-cusum methods, and a wild binary segmentation wrapper for all algorithms.
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This is open source software, and we are unaware of any notable impacts. 
URL https://doi.org/10.1007%2Fs11222-020-09966-2
 
Title CatReg: Solution Paths for Linear and Logistic Regression Models with SCOPE Penalty 
Description Computes solutions for regularised linear and logistic regression models with high-dimensional categorical covariates. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Too early to say 
URL https://CRAN.R-project.org/package=CatReg
 
Title ChangepointInference 
Description Software to implement post-selection inference method for change points from Jewell, S., Fearnhead, P., & Witten, D. (Accepted/In press). Testing for a Change in Mean After Changepoint Detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact None 
URL https://arxiv.org/abs/1910.04291
 
Title DeCAFS: Detecting Changes in Autocorrelated and Fluctuating Signals 
Description Detects abrupt changes in time series, modelling local fluctuations as a random walk process and autocorrelated noise as an AR(1) process. See Romano, G., Rigaill, G., Runge, V., Fearnhead, P. (2020).
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This is open-source software, and we are currently unaware of any notable impacts.
URL https://arxiv.org/abs/2005.01379
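
To make the data model in the DeCAFS description concrete, the following is a minimal Python sketch that simulates a series of the kind described: a mean that drifts as a random walk, occasional abrupt changes, and AR(1) noise on top. It is an illustration only and is not taken from the DeCAFS package; the function name `simulate_signal` and all of its parameters are hypothetical.

```python
import random

def simulate_signal(n, changes, sd_walk, phi, sd_noise, seed=0):
    """Simulate y_t = mu_t + eps_t for t = 0, ..., n - 1.

    mu_t  : random walk with step standard deviation sd_walk, plus an
            abrupt jump of size changes[t] at each time t in `changes`.
    eps_t : AR(1) noise, eps_t = phi * eps_{t-1} + innovation.
    All names here are illustrative assumptions, not the package API.
    """
    rng = random.Random(seed)
    mu, eps = 0.0, 0.0
    ys = []
    for t in range(n):
        # Local fluctuation (random walk step) plus any abrupt change at t.
        mu += rng.gauss(0.0, sd_walk) + changes.get(t, 0.0)
        # Autocorrelated noise.
        eps = phi * eps + rng.gauss(0.0, sd_noise)
        ys.append(mu + eps)
    return ys

# Example: 200 observations with abrupt changes at t = 80 and t = 150.
series = simulate_signal(200, {80: 4.0, 150: -3.0}, sd_walk=0.05, phi=0.6, sd_noise=1.0)
```

Setting `sd_walk` and `sd_noise` to zero reduces the simulated series to a piecewise-constant step function, which is a quick way to check where the abrupt changes fall.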
 
Title Functional Online CUSUM 
Description Implements the Functional Online CUSUM method described in "Fast Online Changepoint Detection via Functional Pruning CUSUM Statistics" by Gaetano Romano, Idris Eckley, Paul Fearnhead and Guillem Rigaill (arXiv:2110.08205).
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet, though British Telecom has shown interest in the method.
URL https://arxiv.org/abs/2110.08205
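
For readers unfamiliar with online CUSUM statistics, the sketch below shows the classical Page CUSUM recursion for detecting an upward change in mean in a stream. It is a simplified illustration and not the functional pruning algorithm implemented by this software: it assumes unit-variance Gaussian observations with pre-change mean zero and a post-change mean fixed in advance. The function name `page_cusum` and its parameters are hypothetical.

```python
def page_cusum(stream, post_change_mean, threshold):
    """Classical one-sided Page CUSUM for a change in mean.

    Assumes observations are N(0, 1) before the change and
    N(post_change_mean, 1) after it (an illustrative assumption).
    Returns the 0-based index at which the statistic first reaches
    `threshold`, or None if no change is flagged.
    """
    stat = 0.0
    for t, x in enumerate(stream):
        # Add the log-likelihood ratio of the new observation and
        # reset to zero whenever the running sum goes negative.
        stat = max(0.0, stat + post_change_mean * (x - post_change_mean / 2.0))
        if stat >= threshold:
            return t
    return None

# Example: the mean jumps from 0 to 2 at index 20.
alarm = page_cusum([0.0] * 20 + [2.0] * 10, post_change_mean=1.0, threshold=5.0)
```

In this noise-free example the statistic grows by 1.5 per post-change observation, so the alarm is raised at index 23. Each update uses constant time and memory, which is the property that makes CUSUM-type statistics attractive for streaming data.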
 
Title GRPtests 
Description Methodology for testing nonlinearity in the conditional mean function in low- or high-dimensional generalized linear models, and the significance of (potentially large) groups of predictors. Details on the algorithms can be found in the paper by Jankova, Shah, Buehlmann and Samworth (2019).
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Too early to say 
URL https://CRAN.R-project.org/package=GRPtests
 
Title GeneralisedCovarianceMeasure: Test for Conditional Independence Based on the Generalized Covariance Measure (GCM) 
Description A statistical hypothesis test for conditional independence. It performs nonlinear regressions on the conditioning variable and then tests for a vanishing covariance between the resulting residuals. It can be applied to both univariate random variables and multivariate random vectors. Details of the method can be found in Rajen D. Shah and Jonas Peters (2018).
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Used by A.P. Moller Maersk in testing whether structural causal models relating to pricing can be falsified. 
URL https://CRAN.R-project.org/package=GeneralisedCovarianceMeasure
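
The regress-then-test-residual-covariance idea in the description above can be sketched in a few lines of Python for the univariate case. This is an illustration only and is not the R package's interface: for brevity it assumes simple linear least-squares regressions in place of the nonlinear regressions the description mentions, and the names `gcm_test` and `_residuals` are hypothetical.

```python
import math

def _residuals(y, z):
    """Residuals from a least-squares fit of y on z with an intercept.

    A linear fit is an assumption made for brevity; any sufficiently
    accurate regression method could be substituted here.
    """
    n = len(y)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    slope = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y)) / szz
    return [yi - y_bar - slope * (zi - z_bar) for zi, yi in zip(z, y)]

def gcm_test(x, y, z):
    """Test whether x and y are conditionally independent given z.

    Returns (statistic, two-sided p-value), where the statistic is the
    normalised mean of the products of the two sets of residuals and
    the p-value uses a standard normal reference distribution.
    """
    rx = _residuals(x, z)
    ry = _residuals(y, z)
    prods = [a * b for a, b in zip(rx, ry)]
    n = len(prods)
    mean = sum(prods) / n
    var = sum(p * p for p in prods) / n - mean ** 2
    stat = math.sqrt(n) * mean / math.sqrt(var)
    # Two-sided normal p-value: 2 * (1 - Phi(|stat|)) = erfc(|stat| / sqrt(2)).
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value
```

A small p-value suggests that the residual covariance is not zero, which is evidence against conditional independence. The validity of such a test depends on the regressions being accurate enough, which is why the linear fit here should be read as a placeholder.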
 
Title IndepTest 
Description R package for independence testing 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/IndepTest/index.html
 
Title InspectChangepoint 
Description R package for high-dimensional changepoint estimation. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/InspectChangepoint/index.html
 
Title LogConcComp 
Description GitHub Python code for computing the log-concave maximum likelihood estimator
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://github.com/wenyuC94/LogConcComp
 
Title MCARtest: Optimal Nonparametric Testing of Missing Completely at Random 
Description R package 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/MCARtest/index.html
 
Title MissInspect 
Description GitHub R functions for changepoint estimation with heterogeneous missingness
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://github.com/wangtengyao/MissInspect
 
Title R package called IndepTest 
Description An R package to implement an independence test called MINT, proposed in Berrett and Samworth (2017) 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Too early to say. 
URL https://cran.r-project.org/web/packages/IndepTest/index.html
 
Title R package: CROPS
Description Implementation of the CROPS wrapper for changepoint methods. The CROPS algorithm is described in Haynes, Kaylea, Idris A. Eckley, and Paul Fearnhead. "Computationally efficient changepoint detection for a range of penalties." Journal of Computational and Graphical Statistics 26.1 (2017): 134-143. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact None 
URL https://cran.r-project.org/web/packages/crops/index.html
 
Title RobKF: Innovative and/or Additive Outlier Robust Kalman Filtering 
Description Implements a series of robust Kalman filtering approaches. It implements the additive outlier robust filters of Ruckdeschel et al. (2014) and Agamennoni et al. (2018), the innovative outlier robust filter of Ruckdeschel et al. (2014), as well as the innovative and additive outlier robust filter of Fisch et al. (2020).
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This is open source software, and we are unaware of any notable impacts. 
URL https://arxiv.org/abs/2007.03238
 
Title SPCAvRP 
Description R package for sparse PCA 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/SPCAvRP/index.html
 
Title Sshaped 
Description R package for fitting S-shaped functions 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/Sshaped/index.html
 
Title USP 
Description R package for independence testing 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/USP/index.html
 
Title gfpop: Graph-Constrained Functional Pruning Optimal Partitioning 
Description Penalized parametric change-point detection by functional pruning dynamic programming algorithm. The successive means are constrained using a graph structure with edges of types null, up, down, std or abs. To each edge we can associate some additional properties: a minimal gap size, a penalty, some robust parameters (K,a). The user can also constrain the inferred means to lie between some minimal and maximal values. Data is modeled by a quadratic cost with possible use of a robust loss, biweight and Huber (see edge parameters K and a). Other losses are also available with log-linear representation or a log-log representation. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This is open source software and we are unaware of any notable impacts. 
URL https://arxiv.org/abs/2002.03646
 
Title ghcm: Functional Conditional Independence Testing with the GHCM 
Description A statistical hypothesis test for conditional independence. Given residuals from a sufficiently powerful regression, it tests whether the covariance of the residuals is vanishing. It can be applied to both discretely-observed functional data and multivariate data. Details of the method can be found in Anton Rask Lundborg, Rajen D. Shah and Jonas Peters (2020).
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact Too early to say. 
URL https://CRAN.R-project.org/package=ghcm
 
Title ocd 
Description R package for online changepoint detection 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/ocd/index.html
 
Title ocd_CI 
Description R functions on GitHub for online changepoint detection.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://github.com/yudongchen88/ocd_CI
 
Title primePCA 
Description R package on CRAN for high-dimensional PCA with heterogeneous missingness 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/primePCA/index.html