Statistical Network Analysis: Model Selection, Differential Privacy, and Dynamic Structures

Lead Research Organisation: London School of Economics & Pol Sci
Department Name: Statistics

Abstract

In this proposal we tackle some challenging problems in the following three aspects of statistical network analysis.

1. Jittered resampling for selecting network models

The first and arguably the most important step in statistical modelling is to choose an appropriate model for a given data set. While there exist many data-driven model-selection methods in statistics in general, including those based on data reuse (e.g., bootstrap resampling, cross-validation), their application to network data is problematic. It therefore remains common to choose a network model subjectively. The major difficulty in reusing network data is to mimic the underlying probability mechanisms. The few existing attempts include cross-validation under some specific settings. We propose a new 'bootstrap jittering' or 'jittered resampling' method for selecting an appropriate network model. The method does not impose any specific forms or conditions, and therefore provides a generic tool for network model selection.

2. Edge differential privacy for network data

In network data individuals are typically represented by nodes and their inter-relationships are represented by edges. Network data therefore often contain sensitive personal information. On the other hand, the information of interest in the data should be preserved. Hence the primary concern for data privacy is twofold: (a) to release only a sanitized version of the original network data to protect privacy, and (b) to ensure that the sanitized data preserve the information of interest, so that analysis based on the sanitized data is still meaningful. This is a vibrant research area, as data privacy becomes ever more sensitive and important with the abundance of personal information available in digital format, though the contribution from statistics is still at a preliminary stage. We will adopt the so-called dyadwise randomized response approach. While such a scheme is differentially private, the properties of inference based on the released data are largely unknown. Our initial investigation reveals some attractive features of this approach, suggesting more efficient statistical inference than that based on other data release mechanisms. We will further develop this scheme to handle networks with additional node features/attributes (e.g., social networks with additional information on age, gender, hobby, occupation, etc.).
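To make the dyadwise randomized response idea concrete, here is a minimal sketch in Python (standard library only). It is an illustration under assumptions, not the project's actual procedure: each dyad of an undirected network is flipped independently with probability 1/(1 + e^epsilon), the textbook randomized response rate for a privacy budget epsilon, and the edge density is then debiased from the released data. The function names, the simulated network and the parameter values are all hypothetical.

```python
import math
import random


def randomized_response(adj, epsilon, rng):
    """Release a noisy copy of an undirected adjacency matrix.

    Assumption for this sketch: every dyad is flipped independently
    with probability 1 / (1 + e^epsilon).
    """
    n = len(adj)
    flip_prob = 1.0 / (1.0 + math.exp(epsilon))
    released = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() < flip_prob:
                bit = 1 - bit  # flip the edge indicator
            released[i][j] = released[j][i] = bit
    return released, flip_prob


def edge_density(adj):
    """Fraction of dyads that carry an edge."""
    n = len(adj)
    total = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def debiased_density(released, flip_prob):
    """Correct the released density for the known flipping rate.

    E[released bit] = flip_prob + (1 - 2 * flip_prob) * true bit,
    so the true density is recovered by inverting this linear map.
    """
    return (edge_density(released) - flip_prob) / (1.0 - 2.0 * flip_prob)


def demo():
    """Simulate a sparse random graph, sanitize it, and compare densities."""
    rng = random.Random(0)
    n, true_p, epsilon = 200, 0.1, 1.0  # hypothetical settings
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = int(rng.random() < true_p)
    released, flip_prob = randomized_response(adj, epsilon, rng)
    return (edge_density(adj), edge_density(released),
            debiased_density(released, flip_prob))
```

Calling `demo()` returns the true density, the density of the released network, and the debiased estimate. The released density is pulled towards 1/2 by the flipping, which is why any inference on the sanitized data has to account for the release mechanism explicitly.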

3. Modelling and forecasting dynamic networks

Most existing statistical inference methods for networks are confined to static network data, though a substantial proportion of real networks are dynamic in nature. Understanding and being able to forecast the changes over time are of immense importance for, e.g., monitoring anomalies in internet traffic networks, predicting demand and setting prices in electricity supply networks, managing natural resources using environmental readings in sensor networks, and understanding how news and opinion propagate in online social networks. Unfortunately the development of the foundations for dynamic networks is still in its infancy, and the available modelling and inference tools are sparse. Most available techniques for dealing with dynamic changes of networks are based on evolution analysis of snapshot networks over time, without really modelling the changes dynamically. Although this reflects the fact that most networks change slowly over time, it does not provide any insight into the dynamics underlying the changes and is almost powerless for prediction, for which it is essential to build appropriate stochastic models that capture dynamic dependence and dynamic changes explicitly. Combining recent developments in tensor decomposition and factor-driven dimension reduction with efficient time series tools such as exponential smoothing and Kalman filters, we will take on this challenge to build some new dynamic models.
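As one illustration of what modelling edge dynamics explicitly can look like, the sketch below simulates a simple autoregressive-type edge process and estimates its parameters from the snapshots. The model form is an assumption made for this example rather than something specified in the text above: each dyad independently gains an edge with probability alpha, loses it with probability beta, and otherwise keeps its previous state. All function names and parameter values are hypothetical.

```python
import random


def simulate_ar_network(n, steps, alpha, beta, rng):
    """Simulate snapshots of an undirected network with AR(1)-type edges.

    Assumed model: at each step a dyad is set to 1 with probability alpha,
    set to 0 with probability beta, and left unchanged otherwise.
    """
    pi = alpha / (alpha + beta)  # stationary edge probability
    dyads = [(i, j) for i in range(n) for j in range(i + 1, n)]
    state = {d: int(rng.random() < pi) for d in dyads}
    snapshots = [dict(state)]
    for _ in range(steps):
        for d in dyads:
            u = rng.random()
            if u < alpha:
                state[d] = 1
            elif u < alpha + beta:
                state[d] = 0
            # otherwise the dyad keeps its previous state
        snapshots.append(dict(state))
    return snapshots


def estimate_rates(snapshots):
    """Estimate (alpha, beta) from observed transition frequencies."""
    born = at_zero = died = at_one = 0
    for prev, curr in zip(snapshots, snapshots[1:]):
        for d, x in prev.items():
            if x == 0:
                at_zero += 1
                born += curr[d]       # 0 -> 1 transition
            else:
                at_one += 1
                died += 1 - curr[d]   # 1 -> 0 transition
    return born / at_zero, died / at_one


def forecast_edge_prob(x_now, alpha, beta):
    """One-step-ahead edge probability given the current state x_now."""
    return alpha + (1.0 - alpha - beta) * x_now
```

For instance, `estimate_rates(simulate_ar_network(30, 200, 0.1, 0.2, random.Random(1)))` recovers the two rates from the simulated snapshots, and `forecast_edge_prob` turns them into a one-step forecast for any dyad. Under this assumed model, a network that changes slowly corresponds to small alpha and beta, so slow change and explicit dynamics are compatible.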

Publications

 
Description 1. We provide a simple and explicit autoregressive type framework to model and forecast dynamic changes of network data. It facilitates simple and efficient statistical inference and model diagnostic checking. The framework can serve as a basic building block to accommodate various stylized features observed in real network data.
2. A standing challenge in data privacy is the trade-off between the level of privacy and the efficiency of statistical inference. We adopt the so-called dyadwise randomized response approach. While such a scheme is differentially private, the estimation for network parameters shows an interesting phase transition. We further devise a novel adaptive bootstrap procedure to construct uniform inference across different phases.
Exploitation Route Issues around data privacy attract ever-increasing attention in society. Network data privacy is perceived to be especially difficult due to its special structure and often binary nature. Our work in this direction will add to statisticians' contribution in this important area. Dynamic modelling and forecasting for network flows directly links time series with the new developments and challenges associated with big data, which is very much needed. Hence the potential beneficiaries include both theoretical and applied network data analysts in statistics and in other disciplines such as computer science, social networks, network communication, energy distribution and forecasting, genetic linkages, economics and finance.
Sectors Communities and Social Services/Policy

Creative Economy

Energy

Environment

Financial Services, and Management Consultancy

Manufacturing, including Industrial Biotechnology

Security and Diplomacy

URL https://stats.lse.ac.uk/q.yao/qyao.links/publicationsAll.html
 
Title Factor Modeling for Clustering High-Dimensional Time Series 
Description We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Factor_Modelling_for_Clustering_High-dimensional_Time_Se...
 
Title Factor Modelling for Clustering High-dimensional Time Series 
Description We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao (2012). 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Factor_Modelling_for_Clustering_High-dimensional_Time_Se...
 
Title Simultaneous Decorrelation of Matrix Time Series 
Description We propose a contemporaneous bilinear transformation for a p × q matrix time series to alleviate the difficulties in modeling and forecasting matrix time series when p and/or q are large. The resulting transformed matrix assumes a block structure consisting of several small matrices, and those small matrix series are uncorrelated across all times. Hence, an overall parsimonious model is achieved by modeling each of those small matrix series separately without the loss of information on the linear dynamics. Such a parsimonious model often has better forecasting performance, even when the underlying true dynamics deviates from the assumed uncorrelated block structure after transformation. The uniform convergence rates of the estimated transformation are derived, which vindicate an important virtue of the proposed bilinear transformation, that is, it is technically equivalent to the decorrelation of a vector time series of dimension max(p, q) instead of p × q. The proposed method is illustrated numerically via both simulated and real data examples. Supplementary materials for this article are available online. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Simultaneous_Decorrelation_of_Matrix_Time_Series_/216417...
 
Title Simultaneous Decorrelation of Matrix Time Series* 
Description We propose a contemporaneous bilinear transformation for a p × q matrix time series to alleviate the difficulties in modeling and forecasting matrix time series when p and/or q are large. The resulting transformed matrix assumes a block structure consisting of several small matrices, and those small matrix series are uncorrelated across all times. Hence an overall parsimonious model is achieved by modelling each of those small matrix series separately without the loss of information on the linear dynamics. Such a parsimonious model often has better forecasting performance, even when the underlying true dynamics deviates from the assumed uncorrelated block structure after transformation. The uniform convergence rates of the estimated transformation are derived, which vindicate an important virtue of the proposed bilinear transformation, i.e. it is technically equivalent to the decorrelation of a vector time series of dimension max(p, q) instead of p × q. The proposed method is illustrated numerically via both simulated and real data examples. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://tandf.figshare.com/articles/dataset/Simultaneous_Decorrelation_of_Matrix_Time_Series_/216417...
 
Description EDF in Paris 
Organisation EDF Energy
Department EDF Innovation and Research
Country France 
Sector Private 
PI Contribution We and the EDF team led by Dr Y Goude have worked together to develop the curve regression methodology and time series PCA to tackle the new challenges in load forecasting arising from the increase in renewable energy and the development of small distributed production units, as well as from changing consumption behaviour due to plug-in (hybrid) electric vehicles, heat pumps and personal storage capacities.
Collaborator Contribution We and the EDF team have worked together to develop the curve linear regression methodology and have tested it with EDF data.
Impact Four publications, and one software package in R.
Start Year 2010
 
Title HDTSA 
Description An R package, available from CRAN, for statistical inference on high-dimensional time series, including factor modelling, principal component analysis for vector and matrix time series, and inference for unit roots and cointegration. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact The software is publicly available through the CRAN project. 
URL https://cran.r-project.org/package=HDTSA
 
Description Invited talk at the conference "Recent Advances in Statistics and Data Science" at Rutgers 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact Conference on Recent Advances in Statistics and Data Science with a Celebration of Professors Regina Liu and Cun-Hui Zhang's Special Birthdays
Year(s) Of Engagement Activity 2023
URL https://statistics.rutgers.edu/news-events/conferences/684-conference-on-recent-advances-in-statisti...
 
Description Invited talk at 2023 IMS International Conference on Statistics and Data Science, Lisbon 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact The objective of ICSDS is to bring together researchers in statistics and data science from academia, industry, and government in a stimulating setting to exchange ideas on the developments of modern statistics, machine learning, and broadly defined theory, methods, and applications in data science.
Year(s) Of Engagement Activity 2023
URL https://www.icsds2023.com/
 
Description Invited talk at Conference on "Statistical Foundations of Data Science and Applications" in Princeton 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact The conference was in honour of Professor Jianqing Fan's 60th birthday and was attended by over 300 academics, students and people working in industry.
Year(s) Of Engagement Activity 2023
URL https://fan60.princeton.edu/
 
Description Invited talk at the 2023 Kansas Econometrics Workshop, Kansas 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact This workshop is part of a yearly series focusing on recent developments in econometric theory and methodology, as well as applications in economics, finance and other applied fields such as data science and statistics. The main purpose of the econometrics workshop series at KU is to promote methodological and theoretical research and applications in modern econometrics, statistics and data science, and to provide a forum for researchers, including Ph.D. students, to come together and interact through discussions and presentations.
Year(s) Of Engagement Activity 2023
URL https://econometrics.ku.edu/
 
Description Invited talk at The OMI Machine Learning in Financial Econometrics, Oxford Man Institute 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact The workshop is devoted to the dissemination of cutting-edge ideas in economics and the financial industry using machine learning tools.
Year(s) Of Engagement Activity 2023
URL https://web.cvent.com/event/78dec7d3-ee2d-4ddb-b14d-b05e782bb209/summary