Was that change real? Quantifying uncertainty for change points
Lead Research Organisation:
London School of Economics and Political Science
Department Name: Statistics
Abstract
Detecting changes in data is currently one of the most active areas of statistics. In many applications the interest is in segmenting the data into regions with the same statistical properties, either as a way to model data flexibly, to help with downstream analysis, or to ensure predictions are based only on relevant data. In other applications the main interest lies in detecting when changes have occurred, as they indicate features of interest: potential failures of machinery, security breaches, or the presence of genomic features such as copy number variations.
To date, most research in this area has focused on developing methods for detecting changes: algorithms that take data as input and output a best guess as to whether relevant changes have occurred, and if so, how many there have been and when they occurred. A comparatively neglected problem is assessing how confident we are that a specific change has occurred in a given part of the data.
In many applications, quantifying the uncertainty around whether a change has occurred is of paramount importance. For example, if we are monitoring a large communication network, and changes indicate potential faults, it is helpful to know how confident we are that there is a fault at any given point in the network, so that we can prioritise the limited resources available for investigating and repairing faults. When analysing calcium imaging data on neuronal activity, where changes correspond to times at which a neuron fires, it is helpful to know how certain we are that a neuron fired at each time point, so as to improve downstream analysis of the data.
A naive approach to this problem is to first detect changes and then apply standard statistical tests for their presence. But this approach is flawed, as it uses the data twice: first to decide where to test, and then to perform the test. We can overcome this using sample-splitting ideas, where we use half the data to detect a change and the other half to perform the test. But such methods lose power, for example because only part of the data is used to detect changes.
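The sample-splitting idea described above can be sketched as follows. This is an illustrative toy example (not the methodology developed in this project): the odd-indexed observations are used to locate the most likely mean change via a CUSUM statistic, and the held-out even-indexed observations are used to compute a two-sample z-statistic at that location, so the detection and testing steps never use the same data points. The function name and the unit-variance assumption are ours, for illustration only.

```python
import math

def split_half_change_test(x):
    """Toy sample-splitting test for a single change in mean.

    One half of the data (odd indices) locates the candidate change;
    the other half (even indices) tests for it, avoiding the
    double use of data that invalidates the naive approach.
    Assumes (for simplicity) unit-variance observations.
    """
    detect = x[::2]   # half used only for detection
    test = x[1::2]    # held-out half used only for testing
    n = len(detect)

    # Detection step: maximise the absolute CUSUM statistic.
    best_k, best_stat = 1, 0.0
    total = sum(detect)
    left = 0.0
    for k in range(1, n):
        left += detect[k - 1]
        stat = abs(left / k - (total - left) / (n - k)) * math.sqrt(k * (n - k) / n)
        if stat > best_stat:
            best_k, best_stat = k, stat

    # Testing step: two-sample z-statistic on the held-out half,
    # split at the location found on the detection half.
    m = len(test)
    k = min(best_k, m - 1)
    mean_left = sum(test[:k]) / k
    mean_right = sum(test[k:]) / (m - k)
    z = (mean_right - mean_left) / math.sqrt(1.0 / k + 1.0 / (m - k))
    return k, z

# Example: a clear mean shift halfway through a noiseless sequence.
data = [0.0] * 50 + [3.0] * 50
k, z = split_half_change_test(data)
```

The loss of power mentioned above is visible here: both the detection and the testing steps operate on only half the observations, so a small change that the full data set could detect may be missed.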
This proposal will develop statistically valid approaches to quantifying uncertainty that are more powerful than sample-splitting approaches. These approaches are based on two complementary ideas: (i) performing inference prior to detection; and (ii) developing tests for a change that account for earlier detection steps. The output will be a new general toolbox for change points, encompassing both new general statistical methods and their implementation within software packages.
People
Piotr Fryzlewicz (Principal Investigator)
Publications
Anastasiou A (2022) Cross-covariance isolate detect: A new change-point method for estimating dynamic functional connectivity, in Medical Image Analysis
Anastasiou A (2021) Detecting multiple generalized change-points by isolating single ones, in Metrika
Fryzlewicz P (2023) Narrowest Significance Pursuit: Inference for Multiple Change-Points in Linear Models, in Journal of the American Statistical Association
Fryzlewicz P (2024) Robust Narrowest Significance Pursuit: Inference for Multiple Change-Points in the Median, in Journal of Business & Economic Statistics
Li J (2024) Automatic change-point detection in time series via deep learning, in Journal of the Royal Statistical Society Series B: Statistical Methodology
Li Y (2022) Detection of Multiple Structural Breaks in Large Covariance Matrices, in Journal of Business & Economic Statistics
Maeng H (2023) Detecting linear trend changes in data sequences, in Statistical Papers
Title | Narrowest Significance Pursuit: Inference for Multiple Change-Points in Linear Models |
Description | We propose Narrowest Significance Pursuit (NSP), a general and flexible methodology for automatically detecting localized regions in data sequences which each must contain a change-point (understood as an abrupt change in the parameters of an underlying linear model), at a prescribed global significance level. NSP works with a wide range of distributional assumptions on the errors, and guarantees important stochastic bounds which directly yield exact desired coverage probabilities, regardless of the form or number of the regressors. In contrast to the widely studied "post-selection inference" approach, NSP paves the way for the concept of "post-inference selection." An implementation is available in the R package nsp. Supplementary materials for this article are available online. |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/dataset/Narrowest_Significance_Pursuit_inference_for_multiple_ch... |
Title | nsp: Inference for Multiple Change-Points in Linear Models |
Description | Implementation of Narrowest Significance Pursuit, a general and flexible methodology for automatically detecting localised regions in data sequences which each must contain a change-point (understood as an abrupt change in the parameters of an underlying linear model), at a prescribed global significance level. Narrowest Significance Pursuit works with a wide range of distributional assumptions on the errors, and yields exact desired finite-sample coverage probabilities, regardless of the form or number of the covariates. |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | n/a |
URL | https://CRAN.R-project.org/package=nsp |