Modelling Vast Time Series: Sparsity and Segmentation

Lead Research Organisation: London School of Economics and Political Science

Department Name: Statistics

Abstract

In the modern information age, the availability of large or vast time series data brings both opportunities and challenges to time series analysts. The demand for modelling and forecasting high-dimensional time series arises from various practical problems such as panel studies of economic, social and natural phenomena (such as weather), financial market analysis and communications engineering. We propose two new approaches for analyzing high-dimensional time series data when the dimension is as large as, or even greater than, the length of the observed time series.

The first approach is to fit the data with sparse vector autoregressive (VAR) models. For applications in which the components are ordered, we will further explore the sparsity due to a band structure. Note that we impose sparsity or banding directly on the coefficient matrices in VAR models. Hence, the relevant inference methods and the associated theory are different from those for the estimation of large covariance matrices.
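
As a rough illustration of banding imposed directly on a VAR coefficient matrix, the sketch below fits a VAR(1) model y_t = A y_{t-1} + e_t by ordinary least squares, one row at a time, using only the lagged components inside the band. Everything here is an assumption made for illustration (the lag order of 1, the known bandwidth, plain least squares, and the function names); it is not the estimator developed in the project.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular; no rank check is performed.
    """
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = aug[r][m] - sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = s / aug[r][r]
    return x


def fit_banded_var1(y, bandwidth):
    """Row-wise least squares for y_t = A y_{t-1} + e_t.

    Entries A[i][j] with |i - j| > bandwidth are fixed at zero, so row i
    only regresses on the lagged components inside its band.
    """
    n, p = len(y), len(y[0])
    a_hat = [[0.0] * p for _ in range(p)]
    for i in range(p):
        band = [j for j in range(p) if abs(i - j) <= bandwidth]
        # Normal equations restricted to the lagged components in the band.
        gram = [[sum(y[t - 1][j] * y[t - 1][l] for t in range(1, n))
                 for l in band] for j in band]
        rhs = [sum(y[t - 1][j] * y[t][i] for t in range(1, n)) for j in band]
        for j, coef in zip(band, solve_linear_system(gram, rhs)):
            a_hat[i][j] = coef
    return a_hat


def simulate_var1(a, n, seed=0):
    """Simulate n observations from a VAR(1) with standard normal noise."""
    rng = random.Random(seed)
    p = len(a)
    y = [[0.0] * p]
    for _ in range(n - 1):
        prev = y[-1]
        y.append([sum(a[i][j] * prev[j] for j in range(p)) + rng.gauss(0.0, 1.0)
                  for i in range(p)])
    return y
```

Calling `fit_banded_var1(simulate_var1(a, 2000), 1)` on a stable tridiagonal matrix `a` returns an estimate whose entries outside the band are exactly zero. Each row then involves at most 2 x bandwidth + 1 unknowns, regardless of the dimension.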

Our second approach is segmentation via transformation. We seek a contemporaneous linear transformation such that the transformed time series is divided into several sub-vectors, and those sub-vectors are both contemporaneously and serially uncorrelated. Therefore, they can be modelled separately.
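
In symbols, the segmentation described above can be written roughly as follows. The notation is assumed here for illustration and is not taken from the proposal: y_t is the observed p-dimensional series, B is the contemporaneous linear transformation, and x_t^{(1)}, ..., x_t^{(q)} are the q sub-vectors of the transformed series.

```latex
\mathbf{x}_t = \mathbf{B}\,\mathbf{y}_t, \qquad
\mathbf{x}_t =
\begin{pmatrix}
  \mathbf{x}_t^{(1)} \\ \vdots \\ \mathbf{x}_t^{(q)}
\end{pmatrix}, \qquad
\operatorname{Cov}\bigl(\mathbf{x}_t^{(i)},\, \mathbf{x}_s^{(j)}\bigr) = \mathbf{0}
\quad \text{for all } t, s \text{ and all } i \neq j .
```

The case t = s is the contemporaneous requirement and the case t ≠ s is the serial one; together they say that every autocovariance matrix of x_t is block-diagonal with the same block pattern.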

The challenges of our proposal are two-fold. First, we need to develop the statistical inference methods and the associated theory for identifying the sparse structure and for fitting sparse VAR models with large dimensions. Let p denote the dimension of the time series and n the length of the observed series. We aim to reduce the number of model parameters from the order of the square of p to the order of p, and to develop valid inference methods when log(p) = o(n). Secondly, we need to identify the linear transformation that reveals the latent segmentation structure, i.e. the block-diagonal autocovariance structure, when such a structure exists.
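
To make the reduction from order p squared to order p concrete, the short sketch below counts the free coefficients in a full and in a banded VAR coefficient matrix. The function names and the `bandwidth` parameter are illustrative assumptions; the count is simply the number of entries (i, j) with |i - j| <= bandwidth.

```python
def full_var_parameter_count(p, lag_order=1):
    """Coefficients in an unrestricted VAR: one p x p matrix per lag."""
    return lag_order * p * p


def banded_var_parameter_count(p, bandwidth, lag_order=1):
    """Coefficients in a banded VAR with A[i][j] = 0 when |i - j| > bandwidth."""
    k = min(bandwidth, p - 1)  # a band wider than the matrix is the full matrix
    per_lag = p * (2 * k + 1) - k * (k + 1)
    return lag_order * per_lag
```

For p = 1000 and a bandwidth of 2, the banded count is 4,994 against 1,000,000 for the unrestricted model, so for a fixed bandwidth the number of parameters grows linearly in p.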

High-dimensional data analysis (i.e. 'big data') has been one of the most vibrant research areas in statistics over the last decade. Most work to date concentrates on linear regression with a large number of candidate regressors (i.e. the so-called 'large p small n' paradigm). Another stream of research concerns inference for large covariance matrices. Though bearing a similar banner, the problems addressed in the proposal are different, as we deal with high-dimensional time series and we need to estimate large transformation or coefficient matrices that are not positive semi-definite. We aim for simple and effective inference methods that can be implemented on ordinary PCs for data with dimensions in the order of thousands.

Planned Impact

The proposed research is on the modelling of vast time series data which has direct or indirect impact in various applications. Broadly, the impact can reach two main kinds of non-academic audience.

1. Public and private industry sectors for which analysis and forecasting of multiple time series are relevant. This includes, among others, the analysis and forecasting of market demands, consumption patterns, product sales, internet traffic flows and electricity demands, as well as portfolio allocation and risk management.

2. Public service and government offices for which analysis and forecasting of multiple time series impact service and policy decisions. This includes, among others, weather monitoring and forecasting, environmental protection, wildlife monitoring, epidemic tracking and Google trends.

The PI has been providing research-led consultancy services to various organizations, with which close and long-term collaborations have been established. These collaborations have already borne fruit in the form of joint research papers with the users, directly addressing practical problems that matter to the relevant organizations; see references [5] & [9] in "Case for Support". For example, since 2010 the PI has been working with EDF on their business forecasting of electricity demand in each of the 30-minute intervals over the next 24 hours. This, in its simplest form, amounts to forecasting a 48-dimensional time series. The dimension increases when we separate different consumer groups; for example, industrial and domestic demands, though related, present very different patterns. Since early 2012 the PI has been working with Barclays Bank to provide statistical advice on back-testing, which is an important part of credit risk management under "Basel III" -- a global mandatory banking regulation.

The work carried out on behalf of both EDF and Barclays involves modelling and forecasting vast and high-dimensional time series data and poses challenges that cannot be resolved using standard statistical methods. The proposed research, if successful, will have direct impact on solving the practical problems at both EDF and Barclays, as well as those encountered in other such scenarios.

Funded Value:

£392,909

Funded Period:

Mar 14 - Apr 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/L01226X/1

Principal Investigator:

Qiwei Yao

Research Subject:

Mathematical sciences (100%)

Research Topic:

Statistics & Appl. Probability (100%)

Organisations

London School of Economics and Political Science (Lead Research Organisation)

People	ORCID iD
Qiwei Yao (Principal Investigator)

Publications

Chang J (2017) Testing for high-dimensional white noise using maximum cross-correlations in Biometrika

Chang J (2018) Principal component analysis for second-order stationary vector time series in The Annals of Statistics

Chang J (2020) Estimation of Subgraph Densities in Noisy Networks in Journal of the American Statistical Association

Chang J (2015) High dimensional stochastic regression with latent factors, endogeneity and nonlinearity in Journal of Econometrics

Chang J (2018) Confidence regions for entries of a large precision matrix in Journal of Econometrics

Cho H (2015) Modeling and Stochastic Learning for Forecasting in High Dimensions

Dou B (2016) Generalized Yule-Walker estimation for spatio-temporal models with unknown diagonal coefficients in Journal of Econometrics

Gao W (2014) Estimation for Dynamic and Static Panel Probit Models with Large Individual Effects

Gao W (2016) Estimation for Dynamic and Static Panel Probit Models with Large Individual Effects in Journal of Time Series Analysis

Gao Z (2018) Banded Spatio-Temporal Autoregressions

Key Findings

Description	We have accomplished the two major scientific goals of this EPSRC-funded research project. 1. The newly developed principal component analysis for multiple time series (PCA4TS) transforms a vector time series into several lower-dimensional subseries that are uncorrelated with each other both contemporaneously and serially. It overcomes the failure of standard PCA to control serial dependence in time series data. As a preliminary analysis for vector time series, this new PCA improves future prediction and increases inference efficiency. Unlike PCA for independent data, there is no guarantee that the required linear transformation exists. When it does not, the proposed method provides an approximate segmentation that ignores small correlations, which are often of little practical use. The method is also applicable to segmenting multiple volatility processes. The methodology is implemented in the R package PCA4TS, available from the CRAN project. 2. We have established inference methods and theory for high-dimensional banded autoregressive models, which provide a new alternative for sparse modelling of large time series data.
Exploitation Route	The proposed method can be used as a preliminary step in analysing any multivariate time series data, as an effective way to reduce dimensionality, leading to more efficient statistical inference and more accurate future prediction.
Sectors	Agriculture, Food and Drink; Communities and Social Services/Policy; Creative Economy; Digital/Communication/Information Technologies (including Software); Energy; Environment; Financial Services, and Management Consultancy; Government, Democracy and Justice; Manufacturing, including Industrial Biotechnology; Retail
URL	http://stats.lse.ac.uk/q.yao/qyao.links/publications.html
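
The idea of an approximate segmentation that ignores small correlations can be sketched as a grouping step: components of an already transformed series are placed in the same group whenever some lagged cross-correlation between them is large. This is only a minimal sketch under stated assumptions. The fixed `threshold`, the `max_lag` parameter and all function names are hypothetical, the eigenanalysis that produces the transformation is omitted, and this is not the rule implemented in the PCA4TS package.

```python
import math


def cross_correlation(x, y, lag):
    """Sample correlation between x[t + lag] and y[t], for lag >= 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    scale_x = math.sqrt(sum((v - mean_x) ** 2 for v in x))
    scale_y = math.sqrt(sum((v - mean_y) ** 2 for v in y))
    cov = sum((x[t + lag] - mean_x) * (y[t] - mean_y) for t in range(n - lag))
    return cov / (scale_x * scale_y)


def max_abs_cross_correlation(x, y, max_lag):
    """Largest absolute cross-correlation over lags -max_lag, ..., max_lag."""
    return max(
        max(abs(cross_correlation(x, y, k)), abs(cross_correlation(y, x, k)))
        for k in range(max_lag + 1)
    )


def segment_components(series, max_lag, threshold):
    """Group component series linked by a cross-correlation above threshold.

    `series` is a list of equally long component series. Groups are the
    connected components of the graph whose edges join strongly correlated
    pairs; they are returned as sorted lists of component indices.
    """
    p = len(series)
    parent = list(range(p))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if max_abs_cross_correlation(series[i], series[j], max_lag) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group can then be modelled separately. Correlations below the threshold are treated as zero, which is what makes the segmentation approximate.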


Impact Summary

Description	The PI has continued the collaboration with EDF, begun in 2010, on forecasting daily electricity loads. The proposed curve regression method has been used for forecasting daily electricity demand curves. In 2012-2014, the PI also played a role in constructing the counterparty credit risk backtesting methodology at Barclays Bank, which is a mandated requirement under the Basel III global regulatory standard. The newly proposed procedure for estimating extreme quantiles has been employed in daily operations at Barclays since September 2013.
First Year Of Impact	2013
Sector	Digital/Communication/Information Technologies (including Software); Energy; Financial Services, and Management Consultancy; Retail
Impact Types	Societal, Economic


Software and Technical Products

Title	HDTSA
Description	An R package, available from the CRAN project, for statistical inference on high-dimensional time series, including factor modelling, principal component analysis for vector and matrix time series, and inference for unit roots and cointegration.
Type Of Technology	Software
Year Produced	2023
Open Source License?	Yes
Impact	The software is publicly available through the CRAN project.
URL	https://cran.r-project.org/package=HDTSA


Title	HDtest: High Dimensional Hypothesis Testing for Mean Vectors, Covariance Matrices, and White Noise of Vector Time Series
Description	This is an R package implementing the test for high-dimensional white noise proposed in Chang, Yao and Zhou (2017) Testing for high-dimensional white noise using maximum cross-correlations. Biometrika, 104, 111-127.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	It is available in the open source CRAN project.
URL	https://cran.r-project.org/web/packages/HDtest/index.html


Title	PCA4TS
Description	This is an open-source R project: it seeks a contemporaneous linear transformation for a multivariate time series such that the transformed series is segmented into several lower-dimensional subseries, and those subseries are uncorrelated with each other both contemporaneously and serially.
Type Of Technology	Software
Year Produced	2015
Open Source License?	Yes
Impact	Still early days.
URL	https://cran.r-project.org/web/packages/PCA4TS/index.html


Title	clr
Description	This is an R package for curve linear regression, as developed in Haeran Cho, Yannig Goude, Xavier Brossat, Qiwei Yao (2013). Modeling and Forecasting Daily Electricity Load Curves: A Hybrid Approach. Journal of the American Statistical Association, Vol.108, 7-13. It is available from the CRAN project.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	Since the publication of Cho et al (2013), there have been quite a few requests for the software which implements the curve linear regression methods. The development of this R package is to cater for the demand.
URL	https://cran.r-project.org/web/packages/clr/index.html
