Modelling Vast Time Series: Sparsity and Segmentation

Lead Research Organisation: London School of Economics and Political Science
Department Name: Statistics

Abstract

In this modern information age, the availability of large or vast time series data brings both opportunities and challenges to time series analysts. The demand for modelling and forecasting high-dimensional time series arises from various practical problems, such as panel studies of economic, social and natural phenomena (for example, weather), financial market analysis and communications engineering. We propose two new approaches for analysing high-dimensional time series data when the dimension is as large as, or even greater than, the length of the observed time series.

The first approach is to fit the data with sparse vector autoregressive (VAR) models. For applications in which the components are ordered, we will further explore the sparsity due to a band structure. Note that we impose sparsity or banding directly on the coefficient matrices in VAR models. Hence, the relevant inference methods and the associated theory are different from those for the estimation of large covariance matrices.
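
As an illustration of banding imposed directly on a coefficient matrix, the following is a minimal sketch, not the project's own estimator. It assumes a VAR(1) model, a known bandwidth and plain row-wise least squares; the function name `fit_banded_var1` and the simulated data are hypothetical.

```python
import numpy as np


def fit_banded_var1(x, bandwidth):
    """Least-squares fit of x_t = A x_{t-1} + e_t with A[i, j] = 0 for |i - j| > bandwidth.

    x is an (n, p) array whose rows are consecutive observations.
    """
    _, p = x.shape
    past, present = x[:-1], x[1:]
    a_hat = np.zeros((p, p))
    for i in range(p):
        # Only the neighbours inside the band enter the regression for component i.
        lo, hi = max(0, i - bandwidth), min(p, i + bandwidth + 1)
        coef, *_ = np.linalg.lstsq(past[:, lo:hi], present[:, i], rcond=None)
        a_hat[i, lo:hi] = coef
    return a_hat


# Simulated example (assumed setup): a stable tridiagonal VAR(1), p = 20, n = 400.
rng = np.random.default_rng(0)
p, n, k = 20, 400, 1
idx = np.arange(p)
distance = np.abs(idx[:, None] - idx[None, :])
a_true = np.where(distance == 0, 0.3, 0.15) * (distance <= k)
x = np.zeros((n, p))
for t in range(1, n):
    x[t] = a_true @ x[t - 1] + rng.standard_normal(p)

a_hat = fit_banded_var1(x, k)
max_error = np.max(np.abs(a_hat - a_true))
```

Each row of the coefficient matrix is estimated from at most 2k + 1 regressors, so the regression for one component never involves all p lagged components.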

Our second approach is segmentation via transformation. We seek a contemporaneous linear transformation such that the transformed time series is divided into several sub-vectors that are both contemporaneously and serially uncorrelated with each other. The sub-vectors can therefore be modelled separately.
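
One possible way to look for such a transformation is sketched below. It is an assumption made for illustration rather than a description of the project's algorithm: the series is first whitened, and the whitened series is then rotated by the eigenvectors of a matrix that accumulates lagged autocovariances. The function name `segmentation_transform`, the choice `max_lag=5` and the simulated data are hypothetical, and the final step of grouping the transformed components into sub-vectors is omitted.

```python
import numpy as np


def segmentation_transform(x, max_lag=5):
    """Return (b, z) where z_t = b x_t has identity sample covariance.

    x is an (n, p) array whose rows are consecutive observations.
    """
    n, p = x.shape
    xc = x - x.mean(axis=0)
    cov0 = xc.T @ xc / n
    evals, evecs = np.linalg.eigh(cov0)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric cov0^{-1/2}
    y = xc @ inv_sqrt  # whitened series
    # Accumulate S_k S_k^T over lags; the lag-0 term is the identity after whitening.
    w = np.eye(p)
    for k in range(1, max_lag + 1):
        s_k = y[k:].T @ y[:-k] / n
        w += s_k @ s_k.T
    _, gamma = np.linalg.eigh(w)
    b = gamma.T @ inv_sqrt
    return b, xc @ b.T


# Simulated example (assumed setup): three independent AR(1) series, linearly mixed.
rng = np.random.default_rng(1)
n = 2000
phi = np.array([0.9, -0.6, 0.1])
z = np.zeros((n, 3))
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.standard_normal(3)
x = z @ rng.standard_normal((3, 3)).T

b, z_hat = segmentation_transform(x, max_lag=5)
lag1 = z_hat[1:].T @ z_hat[:-1] / n  # lag-1 cross-covariances of the transformed series
```

Inspecting the off-diagonal entries of `lag1`, and of the corresponding matrices at other lags, indicates which transformed components can be treated as separate groups.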

The challenges of our proposal are twofold. First, we need to develop the statistical inference methods and the associated theory for identifying the sparse structure and for fitting sparse VAR models with large dimensions. Let p denote the dimension of the time series and n the length of the observed series. We aim to reduce the number of model parameters from the order of the square of p to the order of p, and to develop valid inference methods when log(p) = o(n). Secondly, we need to find the linear transformation that reveals the latent segmentation structure, i.e. the block-diagonal autocovariance structure, when such a structure exists.
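
To make the reduction from order p squared to order p concrete, the short sketch below counts the free coefficients of a single VAR(1) coefficient matrix with and without banding. The helper `var1_param_counts` and the values p = 1000 and bandwidth 2 are chosen purely for illustration.

```python
def var1_param_counts(p, bandwidth):
    """Free coefficients in a full and in a banded p x p VAR(1) coefficient matrix."""
    full = p * p
    banded = sum(
        min(p, i + bandwidth + 1) - max(0, i - bandwidth) for i in range(p)
    )
    return full, banded


full, banded = var1_param_counts(1000, 2)
print(full, banded)
```

With a fixed bandwidth k, the banded count is (2k + 1)p - k(k + 1), which grows linearly in p.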

High-dimensional data analysis (i.e. 'big data') has been one of the most vibrant research areas in statistics over the last decade. Most work to date concentrates on linear regression with a large number of candidate regressors (i.e. the so-called 'large p small n' paradigm). Another stream of research concerns the inference of large covariance matrices. Though bearing a similar banner, the problems addressed in the proposal are different, as we deal with high-dimensional time series and we need to estimate large transformation or coefficient matrices that are not positive semi-definite. We aim for simple and effective inference methods that can be implemented on ordinary PCs for data with dimensions in the order of thousands.

Planned Impact

The proposed research is on the modelling of vast time series data which has direct or indirect impact in various applications. Broadly, the impact can reach two main kinds of non-academic audience.

1. Public and private industry sectors for which analysis and forecasting of multiple time series are relevant. This includes, among others, the analysis and forecasting of market demand, consumption patterns, product sales, internet traffic flows and electricity demand, as well as portfolio allocation and risk management.

2. Public service and government offices for which analysis and forecasting of multiple time series impact service and policy decisions. This includes, among others, weather monitoring and forecasting, environmental protection, wildlife monitoring, epidemic tracking and Google Trends.

The PI has been providing research-led consultancy services to various organizations, with which close and long-term collaboration has been established. This collaboration has already borne fruit in the form of joint research papers with the users, directly addressing the practical problems that matter to the relevant organizations; see references [5] & [9] in "Case for Support". For example, since 2010 the PI has been working with EDF on their business forecasting of electricity demand in each of the 30-minute intervals in the next 24 hours. This, in its simplest form, is a forecasting problem for a 48-dimensional time series. The dimension will increase when we separate different consumer groups; for example, industrial and domestic demands, though related, present very different patterns. Since early 2012 the PI has been working with Barclays Bank to provide statistical advice on back-testing, which is an important part of credit risk management under "Basel III" -- a global mandatory banking regulation.

The work carried out on behalf of both EDF and Barclays involves modelling and forecasting vast and high-dimensional time series data and poses challenges that cannot be resolved using standard statistical methods. The proposed research, if successful, will have direct impact on solving the practical problems at both EDF and Barclays, as well as those encountered in other such scenarios.

Publications

 
Description We have accomplished the two major scientific goals of this EPSRC-funded research project.

1. The newly developed principal component analysis for multiple time series (PCA4TS) transforms a vector time series into several lower-dimensional subseries that are uncorrelated with each other both contemporaneously and serially. It overcomes the failure of standard PCA to account for serial dependence in time series data. Used as a preliminary analysis for vector time series, the new PCA improves prediction and increases inference efficiency. Unlike PCA for independent data, there is no guarantee that the required linear transformation exists. When it does not, the proposed method provides an approximate segmentation that ignores small correlations, which are often of little practical use. The method is also applicable to the segmentation of multiple volatility processes. The methodology is implemented in the R package PCA4TS, available from CRAN.

2. We have established the inference methods and theory for high-dimensional and banded autoregressive models, which provide a new alternative for sparse modelling of large time series data.
Exploitation Route The proposed method can be used as a preliminary step in analysing any multivariate time series data, as an effective way to reduce dimensionality, leading to more efficient statistical inference and more accurate future prediction.
Sectors Agriculture, Food and Drink,Communities and Social Services/Policy,Creative Economy,Digital/Communication/Information Technologies (including Software),Energy,Environment,Financial Services, and Management Consultancy,Government, Democracy and Justice,Manufacturing, including Industrial Biotechnology,Retail

URL http://stats.lse.ac.uk/q.yao/qyao.links/publications.html
 
Description The PI has continued the collaboration with EDF, begun in 2010, on forecasting daily electricity loads. The proposed curve regression method has been used for forecasting daily electricity demand curves. In 2012-2014, the PI also played a role in constructing the counterparty credit risk backtesting methodology at Barclays Bank, which is a mandated requirement under the Basel III global regulatory standard. The newly proposed procedure for estimating extreme quantiles has been employed in daily operations at Barclays since September 2013.
First Year Of Impact 2013
Sector Digital/Communication/Information Technologies (including Software),Energy,Financial Services, and Management Consultancy,Retail
Impact Types Societal,Economic

 
Title HDtest: High Dimensional Hypothesis Testing for Mean Vectors, Covariance Matrices, and White Noise of Vector Time Series 
Description This is an R package implementing the test for high-dimensional white noise proposed in Chang, Yao and Zhou (2017) Testing for high-dimensional white noise using maximum cross-correlations. Biometrika, 104, 111-127. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact It is available in the open source CRAN project. 
URL https://cran.r-project.org/web/packages/HDtest/index.html
 
Title PCA4TS 
Description This is an open-source R project. It seeks a contemporaneous linear transformation for a multivariate time series such that the transformed series is segmented into several lower-dimensional subseries that are uncorrelated with each other both contemporaneously and serially.
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Still early days. 
URL https://cran.r-project.org/web/packages/PCA4TS/index.html