StatScale: Statistical Scalability for Streaming Data
Lead Research Organisation:
Lancaster University
Department Name: Mathematics and Statistics
Abstract
We live in the age of data. Technology is transforming our ability to collect and store data on unprecedented scales. From the use of Oyster card data to improve London's transport network, to the Square Kilometre Array astrophysics project that has the potential to transform our understanding of the universe, Big Data can inform and enrich many aspects of our lives. Due to the widespread use of sensor-based systems in everyday life, with even smartphones having sensors that can monitor location and activity level, much of the explosion of data is in the form of data streams: data from one or more related sources that arrive over time. It has even been estimated that there will be over 30 billion devices collecting data streams by 2020.
The important role of Statistics within "Big Data" and data streams has been clear for some time. However, the current tendency has been to focus purely on algorithmic scalability: for example, how to develop versions of existing statistical algorithms that scale better with the amount of data. Such an approach ignores the fact that fundamentally new issues often arise when dealing with data sets of this magnitude, and highly innovative solutions are required.
Model error is one such issue. Many statistical approaches are based on the use of mathematical models for data. These models are only approximations of the real data-generating mechanisms. In traditional applications, this model error is usually small compared with the inherent sampling variability of the data, and can be overlooked. However, there is an increasing realisation that model error can dominate in Big Data applications. Understanding the impact of model error, and developing robust methods that have excellent statistical properties even in the presence of model error, are major challenges.
A second issue is that many current statistical approaches are not computationally feasible for Big Data. In practice we will often need to use less efficient statistical methods that are computationally faster, or require less computer memory. This introduces a statistical-computational trade-off that is unique to Big Data, leading to many open theoretical questions, and important practical problems.
The strategic vision for this programme grant is to investigate and develop an integrated approach to tackling these and other fundamental statistical challenges. In order to do this we will focus in particular on analysing data streams. An important issue with this type of data is detecting changes in the structure of the data over time. This will be an early area of focus for the programme, as it has been identified as one of seven key problem areas for Big Data. Moreover, it is an area in which our research will lead to practically important breakthroughs. Our philosophy is to tackle methodological, theoretical and computational aspects of these statistical problems together, an approach that is only possible through the programme grant scheme. Such a broad perspective is essential to achieve the substantive fundamental advances in statistics envisaged, and to ensure our new methods are sufficiently robust and efficient to be widely adopted by academics, industry and society more generally.
Planned Impact
Who will benefit?
This proposal will benefit a variety of different stakeholders including:
(a) A wide range of industries, including collaborating industrial partners and those organisations that handle large volumes of data [e.g. the NHS, Transport Agencies, Energy companies etc.];
(b) Society more generally through the application of this research;
(c) The academic research community, particularly in disciplines that underpin and relate to the data sciences;
(d) Project personnel: PDRAs and PhD students.
How will they benefit?
New techniques: (a, b, c)
The research undertaken will develop a number of exciting new statistical techniques that will be disseminated to our partners and user communities. Our methods will result in more efficient and cost-effective ways of marshalling precious resources, by making principled analysis of very large datasets (i) feasible and (ii) faster and more accurate. These benefits will flow through the economy and society via a number of different mechanisms, including more efficient use of resources (e.g. better management of oil fields via improved processing of well operation data); improved productivity (e.g. more timely management of, and intervention on, faults on telecommunications networks); and wider societal benefit (e.g. the development of statistical methods capable of analysing eHealth data streams to support monitoring of vulnerable elderly people living independently). To enable this, we specifically include resource for a Research Software Engineer to make available high-quality, documented open-source code for others to use.
Targeted Knowledge Exchange: (a)
Significant further benefit will accrue to beneficiary group (a) through their partnership on this project. At this stage, we benefit from the support of several leading organisations in the energy, health and telecommunications sectors. They have expressed enthusiastic support for this programme's vision and have provided valuable insight and advice as we have developed this proposal.
For example, through dialogue with this community the idea of short-term secondment visits via a partnership programme has emerged. PDRAs will spend periods of time at partner locations developing case studies that demonstrate the utility of the developed methods on data-rich products and systems. Partners are also keen to work with us to develop successful knowledge exchange mechanisms, and representatives from this community will sit on the Advisory and Impact Board.
Generic Knowledge Exchange: (c)
We will develop methods that are of considerable interest to the academic community, both in Statistics and in other fields. As well as the traditional routes of journal publication, workshops and conferences, the programme will develop open-source R software that embodies our techniques: this will benefit the academic community and beyond. Further, we will work with our advisory group to share our techniques with a wider audience where appropriate, through an academic partnership programme that will facilitate research retreats, academic exchanges and similar activities.
Developing good people: (b,d)
The programme will develop highly skilled researchers in a statistical field of high strategic importance. Project personnel will benefit from a supportive training, research and development environment, and will be given the opportunity to create new techniques and see them employed in a productive and worthwhile setting. They will therefore be ideally positioned to seek future employment in a field or industry that enables them to make a strong contribution to society.
Contributing to the future supply of people: (all)
This proposal will secure an increase in the number and quality of researchers in statistics in an area of historic shortage. In particular, with the advent of sensor-based industrial systems, developing future research leaders capable of underpinning the UK's competitive advantage in this area is crucial.
Publications
Zhu Z
(2022)
High-dimensional principal component analysis with heterogeneous missingness.
in Journal of the Royal Statistical Society. Series B, Statistical methodology
Zheng C
(2022)
Consistency of a range of penalised cost approaches for detecting multiple changepoints
in Electronic Journal of Statistics
Xu, M.
(2021)
High-dimensional nonparametric density estimation via symmetry and shape constraints
in Annals of Statistics
Wilson R
(2021)
A wavelet-based approach for imputation in nonstationary multivariate time series
in Statistics and Computing
Ward Kes
(2023)
A Constant-per-Iteration Likelihood Ratio Test for Online Changepoint Detection for Exponential Family Models
in arXiv e-prints
Ward Kes
(2022)
Poisson-FOCuS: An efficient online method for detecting count bursts with application to gamma ray burst detection
in arXiv e-prints
Description | The StatScale Programme was conceived to help catalyse research in the broad area of scalable statistical methods for streaming data. To achieve this, StatScale's team brought together diverse research strengths across the range of statistical research activity to develop the next generation of tools required to realise this ambitious vision. As a consequence, StatScale has made major contributions in three main areas: · Changepoint methods · Conditional Independence Testing · Model Misspecification At StatScale's outset each of these areas represented a new challenge where the development of novel methods and tools was meaningful and useful. Major progress has been made in all three, providing a suite of important publications describing new statistical tools and their implementation in software form for researchers and practitioners alike. Moreover, both the research and the associated community-building events supported by the programme have helped to stimulate activity within each of these areas internationally. This is particularly clear in the area of changepoint methods, a statistical topic of growing importance to a number of other research and application areas. Finally, the StatScale programme also developed a number of postdocs who have gone on to academic positions at a range of leading universities. |
Exploitation Route | The research from this programme may be taken forward in a number of ways. For example, within the Statistical research community, we hope that StatScale's legacy will be to have catalysed and sustained activity in each of the three main areas for several years to come. Our understanding is that the methods, and associated software, developed are also being explored in a number of other disciplines: from computer science and digital networking, to astrophysics. As the tools and methods become increasingly shared, we envisage the breadth of areas benefiting from this research growing. |
Sectors | Aerospace, Defence and Marine; Agriculture, Food and Drink; Chemicals; Construction; Digital/Communication/Information Technologies (including Software); Electronics; Energy; Environment; Financial Services and Management Consultancy; Healthcare; Government, Democracy and Justice; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology; Retail; Security and Diplomacy |
Description | To date, the methods developed have predominantly attracted interest from our industrial partners. At the time of writing, not all results have been published, so the full extent of the impact of this grant will not be known for some time. However, we highlight two examples of impact arising from the programme thus far. The first relates to the anomaly detection work reported by Fisch et al. (2022), which is already being used by BT to provide data-driven insights that help operate and maintain the UK's digital infrastructure. Other methods, such as those reported by Jewell et al. (2020), have been used by the Allen Institute for Brain Science as they develop understanding of how the human brain works. |
First Year Of Impact | 2019 |
Sector | Aerospace, Defence and Marine; Agriculture, Food and Drink; Construction; Digital/Communication/Information Technologies (including Software); Energy; Environment; Financial Services and Management Consultancy; Healthcare; Government, Democracy and Justice; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology; Transport |
Impact Types | Societal Economic |
Description | Isaac Newton Programme on Statistical Scalability |
Amount | £180,000 (GBP) |
Funding ID | Statistical Scalability |
Organisation | Isaac Newton Institute for Mathematical Sciences |
Sector | Academic/University |
Country | United Kingdom |
Start | 01/2018 |
End | 06/2018 |
Description | Methodologically Enhanced Virtual Labs for Early Warning of Significant or Catastrophic Change in Ecosystems: Changepoints for a Changing Planet |
Amount | £203,419 (GBP) |
Funding ID | NE/T006102/1 |
Organisation | Natural Environment Research Council |
Sector | Public |
Country | United Kingdom |
Start | 11/2019 |
End | 11/2021 |
Description | Next Generation Converged Digital infrastructure (NG-CDI) |
Amount | £5,000,000 (GBP) |
Funding ID | EP/R004935/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 11/2017 |
End | 10/2022 |
Title | Detecting Changes in Slope With an L0 Penalty |
Description | While there are many approaches to detecting changes in mean for a univariate time series, the problem of detecting multiple changes in slope has comparatively been ignored. Part of the reason for this is that detecting changes in slope is much more challenging: simple binary segmentation procedures do not work for this problem, while existing dynamic programming methods that work for the change in mean problem cannot be used for detecting changes in slope. We present a novel dynamic programming approach, CPOP, for finding the "best" continuous piecewise linear fit to data under a criterion that measures fit to data using the residual sum of squares, but penalizes complexity based on an L0 penalty on changes in slope. We prove that detecting changes in this manner can lead to consistent estimation of the number of changepoints, and show empirically that using an L0 penalty is more reliable at estimating changepoint locations than using an L1 penalty. Empirically CPOP has good computational properties, and can analyze a time series with 10,000 observations and 100 changes in a few minutes. Our method is used to analyze data on the motion of bacteria, and provides better and more parsimonious fits than two competing approaches. Supplementary material for this article is available online. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/Detecting_changes_in_slope_with_an_i_L_i_sub_0_sub_penalty/69870... |
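The L0-penalised cost framework that CPOP builds on can be illustrated with a much simpler change-in-mean analogue. The sketch below is for illustration only: it is not the CPOP algorithm (which fits continuous piecewise-linear signals and uses pruning for speed) but the classical optimal-partitioning dynamic program, minimising the residual sum of squares of a piecewise-constant fit plus a fixed penalty per changepoint. All function and variable names are illustrative.

```python
import numpy as np

def op_changepoints(y, beta):
    """Optimal partitioning: minimise RSS of a piecewise-constant fit
    plus an L0 penalty of `beta` per changepoint; return the estimated
    changepoint locations."""
    n = len(y)
    # Prefix sums give each segment's RSS about its own mean in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def seg_cost(i, j):  # RSS of y[i:j] (0-indexed, j exclusive)
        m = j - i
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / m

    F = np.full(n + 1, np.inf)   # F[t] = best penalised cost of y[:t]
    F[0] = -beta                 # so the first segment incurs no penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        costs = [F[s] + seg_cost(s, t) + beta for s in range(t)]
        last[t] = int(np.argmin(costs))
        F[t] = costs[last[t]]
    # Backtrack through the stored segment starts to recover changepoints.
    cps, t = [], n
    while t > 0:
        if last[t] > 0:
            cps.append(last[t])
        t = last[t]
    return sorted(cps)

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
print(op_changepoints(y, beta=3 * np.log(len(y))))  # one change, near index 100
```

This naive search is O(n^2); methods such as PELT and CPOP prune the candidate set of previous changepoints to achieve close-to-linear runtime in practice.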
Title | Inference in High-Dimensional Online Changepoint Detection |
Description | We introduce and study two new inferential challenges associated with the sequential detection of change in a high-dimensional mean vector. First, we seek a confidence interval for the changepoint, and second, we estimate the set of indices of coordinates in which the mean changes. We propose an online algorithm that produces an interval with guaranteed nominal coverage, and whose length is, with high probability, of the same order as the average detection delay, up to a logarithmic factor. The corresponding support estimate enjoys control of both false negatives and false positives. Simulations confirm the effectiveness of our methodology, and we also illustrate its applicability on the U.S. excess deaths data from 2017 to 2020. The supplementary material, which contains the proofs of our theoretical results, is available online. |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/dataset/Inference_in_High-dimensional_Online_Changepoint_Detecti... |
Title | Subset Multivariate Collective and Point Anomaly Detection |
Description | In recent years, there has been a growing interest in identifying anomalous structure within multivariate data sequences. We consider the problem of detecting collective anomalies, corresponding to intervals where one, or more, of the data sequences behaves anomalously. We first develop a test for a single collective anomaly that has power to simultaneously detect anomalies that are either rare, that is affecting few data sequences, or common. We then show how to detect multiple anomalies in a way that is computationally efficient but avoids the approximations inherent in binary segmentation-like approaches. This approach is shown to consistently estimate the number and location of the collective anomalies, a property that has not previously been shown for competing methods. Our approach can be made robust to point anomalies and can allow for the anomalies to be imperfectly aligned. We show the practical usefulness of allowing for imperfect alignments through a resulting increase in power to detect regions of copy number variation. Supplemental files for this article are available online. |
Type Of Material | Database/Collection of data |
Year Produced | 2021 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/dataset/Subset_Multivariate_Collective_and_Point_Anomaly_Detecti... |
Title | Anomaly |
Description | An implementation of CAPA (Collective And Point Anomaly) for the detection of anomalies in time series data. |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | An implementation of CAPA (Collective And Point Anomaly) for the detection of anomalies in time series data. |
URL | https://cran.r-project.org/web/packages/anomaly/index.html |
Title | BayesProject: Fast Projection Direction for Multivariate Changepoint Detection |
Description | An implementation in C++ of the BayesProject algorithm (see G. Hahn, P. Fearnhead and I.A. Eckley, 2020), which computes a fast projection direction for multivariate changepoint detection, together with the sum-cusum and max-cusum methods and a wild binary segmentation wrapper for all algorithms. |
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | This is open source software, and we are unaware of any notable impacts. |
URL | https://doi.org/10.1007%2Fs11222-020-09966-2 |
Title | CatReg: Solution Paths for Linear and Logistic Regression Models with SCOPE Penalty |
Description | Computes solutions for regularised linear and logistic regression models with high-dimensional categorical covariates. |
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | Too early to say |
URL | https://CRAN.R-project.org/package=CatReg |
Title | ChangepointInference |
Description | Software to implement post-selection inference method for change points from Jewell, S., Fearnhead, P., & Witten, D. (Accepted/In press). Testing for a Change in Mean After Changepoint Detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | None |
URL | https://arxiv.org/abs/1910.04291 |
Title | DeCAFS: Detecting Changes in Autocorrelated and Fluctuating Signals |
Description | Detects abrupt changes in time series, modelling local fluctuations as a random walk and autocorrelated noise as an AR(1) process. See Romano, G., Rigaill, G., Runge, V. and Fearnhead, P. (2020).
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | This is open-source software, we are currently unaware of any notable impacts. |
URL | https://arxiv.org/abs/2005.01379 |
Title | Functional Online CUSUM |
Description | Implements the Functional Online CUSUM method of "Fast Online Changepoint Detection via Functional Pruning CUSUM Statistics" by Gaetano Romano, Idris Eckley, Paul Fearnhead and Guillem Rigaill (arXiv:2110.08205).
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | None as yet, though British Telecom has shown interest in the method.
URL | https://arxiv.org/abs/2110.08205 |
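The package above implements the functional-pruning CUSUM of the cited paper. For reference, the idea that method accelerates can be illustrated with a plain two-sided CUSUM for a mean shift in a standardised stream; this is a minimal sketch, and the function name and tuning values below are illustrative assumptions, not taken from the package.

```python
import numpy as np

def cusum_detect(x, drift=1.0, threshold=5.0):
    """Two-sided CUSUM for a shift in the mean of a (pre-standardised) stream.

    Returns the first index at which either statistic exceeds the threshold,
    or None. `drift` is the usual allowance parameter k; both defaults are
    illustrative choices, not values from the package.
    """
    s_pos = s_neg = 0.0
    for t, xt in enumerate(x):
        s_pos = max(0.0, s_pos + xt - drift)   # evidence of an upward shift
        s_neg = max(0.0, s_neg - xt - drift)   # evidence of a downward shift
        if s_pos > threshold or s_neg > threshold:
            return t
    return None

rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 100)])
print(cusum_detect(stream))  # flags an index shortly after the true change at t = 500
```

Unlike this fixed-drift recursion, the functional-pruning approach of the paper effectively considers all post-change means at once while keeping the per-observation cost low.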
Title | GRPtests |
Description | Methodology for testing nonlinearity in the conditional mean function in low- or high-dimensional generalized linear models, and the significance of (potentially large) groups of predictors. Details on the algorithms can be found in the paper by Jankova, Shah, Buehlmann and Samworth (2019).
Type Of Technology | Software |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | Too early to say |
URL | https://CRAN.R-project.org/package=GRPtests |
Title | GeneralisedCovarianceMeasure: Test for Conditional Independence Based on the Generalized Covariance Measure (GCM) |
Description | A statistical hypothesis test for conditional independence. It performs nonlinear regressions on the conditioning variable and then tests for a vanishing covariance between the resulting residuals. It can be applied to both univariate random variables and multivariate random vectors. Details of the method can be found in Rajen D. Shah and Jonas Peters (2018).
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | Used by A.P. Moller Maersk in testing whether structural causal models relating to pricing can be falsified. |
URL | https://CRAN.R-project.org/package=GeneralisedCovarianceMeasure |
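The description above spells out the GCM recipe: regress each variable on the conditioning variable, then test whether the covariance of the residuals vanishes. A toy Python sketch of that recipe follows; it is not the package's R interface, and the polynomial regression, function name and defaults are stand-ins for the flexible nonlinear regressions the method assumes.

```python
import numpy as np
from math import erfc, sqrt

def gcm_test(x, y, z, degree=3):
    """Toy Generalised Covariance Measure test of X independent of Y given Z
    (all univariate). Polynomial regression stands in for the flexible
    regressions the method requires; all names here are illustrative."""
    rx = x - np.polyval(np.polyfit(z, x, degree), z)  # residuals of X on Z
    ry = y - np.polyval(np.polyfit(z, y, degree), z)  # residuals of Y on Z
    r = rx * ry
    stat = sqrt(len(r)) * r.mean() / r.std()          # approx. N(0,1) under H0
    return stat, erfc(abs(stat) / sqrt(2.0))          # two-sided p-value

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
x = np.sin(z) + 0.5 * rng.normal(size=1000)
y = z**2 + 0.5 * rng.normal(size=1000)     # X, Y depend only on Z: H0 true
stat0, p0 = gcm_test(x, y, z)
y_dep = x + 0.5 * rng.normal(size=1000)    # Y depends on X directly: H0 false
stat1, p1 = gcm_test(x, y_dep, z)
```

The key point the sketch shows is that the test's validity rests on the regressions estimating the conditional means well enough; the CRAN package supports far more capable regression engines than a fixed-degree polynomial.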
Title | IndepTest |
Description | R package for independence testing |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/IndepTest/index.html |
Title | InspectChangepoint |
Description | R package for high-dimensional changepoint estimation. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/InspectChangepoint/index.html |
Title | LogConcComp |
Description | Python code on GitHub for computing the log-concave maximum likelihood estimator.
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://github.com/wenyuC94/LogConcComp |
Title | MCARtest: Optimal Nonparametric Testing of Missing Completely at Random |
Description | R package |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/MCARtest/index.html |
Title | MissInspect |
Description | R functions on GitHub for changepoint estimation with heterogeneous missingness.
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://github.com/wangtengyao/MissInspect |
Title | R package: CROPS
Description | Implementation of the CROPS wrapper for changepoint methods. The CROPS algorithm is described in Kaylea Haynes, Idris A. Eckley and Paul Fearnhead, "Computationally efficient changepoint detection for a range of penalties", Journal of Computational and Graphical Statistics 26(1) (2017): 134-143.
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | None |
URL | https://cran.r-project.org/web/packages/crops/index.html |
Title | RobKF: Innovative and/or Additive Outlier Robust Kalman Filtering |
Description | Implements a series of robust Kalman filtering approaches: the additive outlier robust filters of Ruckdeschel et al. (2014) and Agamennoni et al. (2018), the innovative outlier robust filter of Ruckdeschel et al. (2014), and the innovative and additive outlier robust filter of Fisch et al. (2020).
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | This is open source software, and we are unaware of any notable impacts. |
URL | https://arxiv.org/abs/2007.03238 |
Title | SPCAvRP |
Description | R package for sparse PCA |
Type Of Technology | Software |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/SPCAvRP/index.html |
Title | Sshaped |
Description | R package for fitting S-shaped functions |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/Sshaped/index.html |
Title | USP |
Description | R package for independence testing |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/USP/index.html |
Title | gfpop: Graph-Constrained Functional Pruning Optimal Partitioning |
Description | Penalized parametric changepoint detection via a functional pruning dynamic programming algorithm. The successive means are constrained using a graph structure with edges of types null, up, down, std or abs. Each edge can carry additional properties: a minimal gap size, a penalty, and robust parameters (K, a). The user can also constrain the inferred means to lie between minimal and maximal values. Data are modelled by a quadratic cost, with optional robust losses, biweight and Huber (see edge parameters K and a); other losses are also available with a log-linear or log-log representation.
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | This is open source software and we are unaware of any notable impacts. |
URL | https://arxiv.org/abs/2002.03646 |
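The functional-pruning dynamic programme in gfpop generalises the classical penalised optimal-partitioning recursion F(t) = min over s < t of F(s) + C(x_{s+1:t}) + beta. A minimal, unconstrained version of that recursion (Gaussian cost, O(n^2), no graph constraints and no pruning, both of which gfpop adds) can be sketched as follows; all names are illustrative, not the package's API.

```python
import numpy as np

def optimal_partitioning(x, beta):
    """Plain penalised optimal partitioning for changes in mean (Gaussian
    cost). Quadratic-time sketch of the recursion that functional pruning
    accelerates; returns the estimated changepoint locations."""
    n = len(x)
    S1 = np.concatenate([[0.0], np.cumsum(x)])       # prefix sums
    S2 = np.concatenate([[0.0], np.cumsum(x**2)])    # prefix sums of squares
    F = np.full(n + 1, np.inf)
    F[0] = -beta                                     # first segment pays no penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        s = np.arange(t)
        # Gaussian segment cost sum (x_i - mean)^2 over x_{s+1..t}
        cost = (S2[t] - S2[s]) - (S1[t] - S1[s])**2 / (t - s)
        cand = F[s] + cost + beta
        last[t] = np.argmin(cand)
        F[t] = cand[last[t]]
    cps, t = [], n                                   # backtrack changepoints
    while t > 0:
        t = int(last[t])
        if t > 0:
            cps.append(t)
    return sorted(cps)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(optimal_partitioning(data, beta=2 * np.log(len(data))))  # near t = 100
```

gfpop replaces the inner minimisation over s with a pruned functional representation, and restricts transitions between segment means to those allowed by the user's graph.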
Title | ghcm: Functional Conditional Independence Testing with the GHCM |
Description | A statistical hypothesis test for conditional independence. Given residuals from a sufficiently powerful regression, it tests whether the covariance of the residuals vanishes. It can be applied to both discretely observed functional data and multivariate data. Details of the method can be found in Anton Rask Lundborg, Rajen D. Shah and Jonas Peters (2020).
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | Too early to say. |
URL | https://CRAN.R-project.org/package=ghcm |
Title | ocd |
Description | R package for online changepoint detection |
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/ocd/index.html |
Title | ocd_CI |
Description | R functions on github for online changepoint detection. |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://github.com/yudongchen88/ocd_CI |
Title | primePCA |
Description | R package on CRAN for high-dimensional PCA with heterogeneous missingness |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | None as yet. |
URL | https://cran.r-project.org/web/packages/primePCA/index.html |