Statistical Network Analysis: Model Selection, Differential Privacy, and Dynamic Structures
Lead Research Organisation:
London School of Economics & Pol Sci
Department Name: Statistics
Abstract
In this proposal we tackle some challenging problems in the following three aspects of statistical network analysis.
1. Jittered resampling for selecting network models
The first, and arguably the most important, step in statistical modelling is to choose an appropriate model for a given data set. While statistics offers many data-driven model-selection methods, including those based on data reuse (e.g., bootstrap resampling and cross-validation), their application to network data is problematic, so it remains common to choose a network model subjectively. The major difficulty in reusing network data is to mimic the underlying probability mechanism. The few existing attempts include cross-validation under some specific settings. We propose a new 'bootstrap jittering' or 'jittered resampling' method for selecting an appropriate network model. The method does not impose any specific model forms or conditions, and therefore provides a generic tool for network model selection.
2. Edge differential privacy for network data
In network data, individuals are typically represented by nodes and their inter-relationships by edges, so network data often contain sensitive personal information. At the same time, the information of interest in the data should be preserved. Hence the primary concern for data privacy is two-fold: (a) only a sanitized version of the original network data should be released, to protect privacy, and (b) the sanitized data should preserve the information of interest, so that analysis based on them is still meaningful. This is a vibrant research area, as data privacy becomes ever more sensitive and important with the abundance of personal information available in digital form, though the contribution from statistics is still at a preliminary stage. We will adopt the so-called dyadwise randomized response approach. While such a scheme is differentially private, the properties of inference based on the released data are largely unknown. Our initial investigation reveals some attractive features of this approach, suggesting more efficient statistical inference than that based on other data release mechanisms. We will further develop this scheme to handle networks with additional node features/attributes (e.g., social networks with additional information on age, gender, hobby, occupation, etc.).
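As an illustration only, the following Python sketch shows one common form of dyadwise randomized response for an undirected binary network: each dyad is flipped independently with probability 1/(1 + e^ε), which gives ε-edge differential privacy for ε > 0. The function names, the choice of flip probability and the edge-density correction are assumptions made for this sketch; they are not taken from the proposal and do not represent the inference procedures the project develops.

```python
import math
import random


def flip_probability(epsilon: float) -> float:
    """Flip probability under which randomized response is epsilon-edge-DP."""
    return 1.0 / (1.0 + math.exp(epsilon))


def release_network(adj, epsilon, rng):
    """Return a sanitized copy of an undirected 0/1 adjacency matrix.

    Every dyad (i, j) with i < j is flipped independently with the
    probability above; the diagonal stays zero and symmetry is kept.
    """
    n = len(adj)
    p = flip_probability(epsilon)
    released = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() < p:
                bit = 1 - bit
            released[i][j] = released[j][i] = bit
    return released


def edge_density(adj):
    """Proportion of dyads that carry an edge."""
    n = len(adj)
    edges = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
    return edges / (n * (n - 1) / 2)


def debiased_density(released, epsilon):
    """Correct the released density for the known flipping noise.

    If a is the true density, the released density has mean
    p + a * (1 - 2p), so we invert that map (requires epsilon > 0).
    """
    p = flip_probability(epsilon)
    return (edge_density(released) - p) / (1.0 - 2.0 * p)


if __name__ == "__main__":
    rng = random.Random(2024)
    n = 200
    # Hypothetical sparse network with edge probability 0.1.
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < 0.1:
                adj[i][j] = adj[j][i] = 1
    released = release_network(adj, epsilon=1.0, rng=rng)
    print("true density     :", edge_density(adj))
    print("released density :", edge_density(released))
    print("debiased estimate:", debiased_density(released, epsilon=1.0))
```

The raw released density is biased towards 1/2, which is why even a simple summary needs a correction that uses the known noise level; this is the kind of trade-off between privacy and inferential efficiency that the paragraph above refers to.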
3. Modelling and forecasting dynamic networks
Most existing statistical inference methods for networks are confined to static network data, though a substantial proportion of real networks are dynamic in nature. Understanding and being able to forecast changes over time is of immense importance for, e.g., monitoring anomalies in internet traffic networks, predicting demand and setting prices in electricity supply networks, managing natural resources using environmental readings from sensor networks, and understanding how news and opinion propagate in online social networks. Unfortunately, the development of the foundations for dynamic networks is still in its infancy, and the available modelling and inference tools are sparse. Most available techniques for dealing with dynamic changes are based on the evolution analysis of snapshot networks over time, without really modelling the changes dynamically. Although this reflects the fact that most networks change slowly over time, it does not provide any insight into the dynamics underlying the changes and is almost powerless for prediction, for which it is essential to build appropriate stochastic models that capture dynamic dependence and dynamic changes explicitly. Combining recent developments in tensor decomposition and factor-driven dimension reduction with efficient time series tools such as exponential smoothing and Kalman filters, we will take on this challenge and build new dynamic models.
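To make one of the named time series tools concrete, the sketch below applies simple exponential smoothing, entry by entry, to a sequence of adjacency-matrix snapshots and returns the final level as a one-step-ahead forecast of each edge probability. This is a deliberately naive baseline written for illustration: the function name, the initialisation at the first snapshot and the smoothing weight `alpha` are assumptions of this sketch, and it is not the dynamic model the proposal sets out to build.

```python
def smooth_edge_probabilities(snapshots, alpha):
    """Entrywise simple exponential smoothing of adjacency snapshots.

    snapshots: list of n x n 0/1 matrices ordered in time.
    alpha: smoothing weight in (0, 1]; larger values track recent
           snapshots more closely.
    Returns the final level, which is the one-step-ahead forecast of
    each edge indicator under simple exponential smoothing.
    """
    if not snapshots:
        raise ValueError("at least one snapshot is required")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(snapshots[0])
    # Initialise the level at the first observed snapshot.
    level = [[float(x) for x in row] for row in snapshots[0]]
    for snap in snapshots[1:]:
        for i in range(n):
            for j in range(n):
                level[i][j] = alpha * snap[i][j] + (1.0 - alpha) * level[i][j]
    return level


if __name__ == "__main__":
    # Three hypothetical snapshots of a two-node network.
    snapshots = [
        [[0, 1], [1, 0]],
        [[0, 0], [0, 0]],
        [[0, 1], [1, 0]],
    ]
    forecast = smooth_edge_probabilities(snapshots, alpha=0.5)
    print(forecast)
```

Such a forecast treats every dyad separately and ignores dependence across edges and over time, which is the limitation of snapshot-based analysis that explicit stochastic models are meant to overcome.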
People |
ORCID iD |
Qiwei Yao (Principal Investigator) |
Publications

Zhou Y
(2023)
Testing for the Markov property in time series via deep conditional generative learning.
in Journal of the Royal Statistical Society. Series B, Statistical methodology


Zhang B
(2021)
Factor Modelling for Clustering High-dimensional Time Series

Zhang B
(2023)
Factor Modeling for Clustering High-Dimensional Time Series
in Journal of the American Statistical Association

Xu X
(2021)
Day-ahead probabilistic forecasting for French half-hourly electricity loads and quantiles for curve-to-curve regression
in Applied Energy

Jiang, B.
(2023)
A two-way heterogeneity model for dynamic networks

Jiang Binyan
(2023)
Autoregressive Networks
in Journal of Machine Learning Research

Han Y
(2021)
Simultaneous Decorrelation of Matrix Time Series

Han Y
(2023)
Simultaneous Decorrelation of Matrix Time Series
in Journal of the American Statistical Association

Chang, J.
(2024)
Edge differentially private estimation in the β-model via jittering and method of moments
in The Annals of Statistics
Description | 1. We provide a simple and explicit autoregressive type framework to model and forecast dynamic changes of network data. It facilitates simple and efficient statistical inference and model diagnostic checking. The framework can serve as a basic building block to accommodate various stylized features observed in real network data. 2. A standing challenge in data privacy is the trade-off between the level of privacy and the efficiency of statistical inference. We adopt the so-called dyadwise randomized response approach. While such a scheme is differentially private, the estimation for network parameters shows an interesting phase transition. We further devise a novel adaptive bootstrap procedure to construct uniform inference across different phases. |
Exploitation Route | Issues around data privacy attract ever-increasing attention in society. Network data privacy is perceived to be especially difficult due to its special structure and often binary nature. Our work in this direction will strengthen the contribution of statisticians to this important area. Dynamic modelling and forecasting for network flows directly links time series with the new developments and challenges associated with big data, which is very much needed. Hence the potential beneficiaries include both theoretical and applied network data analysts in statistics and other disciplines such as computer science, social networks, network communication, energy distribution and forecasting, genetic linkages, economics and finance. |
Sectors | Communities and Social Services/Policy; Creative Economy; Energy; Environment; Financial Services and Management Consultancy; Manufacturing including Industrial Biotechnology; Security and Diplomacy |
URL | https://stats.lse.ac.uk/q.yao/qyao.links/publicationsAll.html |
Title | Factor Modeling for Clustering High-Dimensional Time Series |
Description | We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online. |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/dataset/Factor_Modelling_for_Clustering_High-dimensional_Time_Se... |
Title | Factor Modelling for Clustering High-dimensional Time Series |
Description | We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao (2012). |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/dataset/Factor_Modelling_for_Clustering_High-dimensional_Time_Se... |
Title | Simultaneous Decorrelation of Matrix Time Series |
Description | We propose a contemporaneous bilinear transformation for a p × q matrix time series to alleviate the difficulties in modeling and forecasting matrix time series when p and/or q are large. The resulting transformed matrix assumes a block structure consisting of several small matrices, and those small matrix series are uncorrelated across all times. Hence, an overall parsimonious model is achieved by modeling each of those small matrix series separately without the loss of information on the linear dynamics. Such a parsimonious model often has better forecasting performance, even when the underlying true dynamics deviates from the assumed uncorrelated block structure after transformation. The uniform convergence rates of the estimated transformation are derived, which vindicate an important virtue of the proposed bilinear transformation, that is, it is technically equivalent to the decorrelation of a vector time series of dimension max(p, q) instead of p × q. The proposed method is illustrated numerically via both simulated and real data examples. Supplementary materials for this article are available online. |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/dataset/Simultaneous_Decorrelation_of_Matrix_Time_Series_/216417... |
Title | Simultaneous Decorrelation of Matrix Time Series |
Description | We propose a contemporaneous bilinear transformation for a p × q matrix time series to alleviate the difficulties in modeling and forecasting matrix time series when p and/or q are large. The resulting transformed matrix assumes a block structure consisting of several small matrices, and those small matrix series are uncorrelated across all times. Hence an overall parsimonious model is achieved by modelling each of those small matrix series separately without the loss of information on the linear dynamics. Such a parsimonious model often has better forecasting performance, even when the underlying true dynamics deviates from the assumed uncorrelated block structure after transformation. The uniform convergence rates of the estimated transformation are derived, which vindicate an important virtue of the proposed bilinear transformation, i.e. it is technically equivalent to the decorrelation of a vector time series of dimension max(p, q) instead of p × q. The proposed method is illustrated numerically via both simulated and real data examples. |
Type Of Material | Database/Collection of data |
Year Produced | 2022 |
Provided To Others? | Yes |
URL | https://tandf.figshare.com/articles/dataset/Simultaneous_Decorrelation_of_Matrix_Time_Series_/216417... |
Description | EDF in Paris |
Organisation | EDF Energy |
Department | EDF Innovation and Research |
Country | France |
Sector | Private |
PI Contribution | We and the EDF team led by Dr Y Goude have worked together to develop the curve regression methodology and time series PCA to tackle the new challenges in forecasting loads that arise from the increase in renewable energy and the development of small distributed production units, as well as from changing consumption behaviour due to plug-in (hybrid) electric vehicles, heat pumps and personal storage capacities. |
Collaborator Contribution | We and the EDF team have worked together to develop the curve linear regression methodology and have tested it with EDF data. |
Impact | Four publications, and one software package in R. |
Start Year | 2010 |
Title | HDTSA |
Description | An R package, available from CRAN, for statistical inference on high-dimensional time series, covering factor modelling, principal component analysis for vector and matrix time series, and inference for unit roots and cointegration. |
Type Of Technology | Software |
Year Produced | 2023 |
Open Source License? | Yes |
Impact | The software is publicly available through CRAN. |
URL | https://cran.r-project.org/package=HDTSA |
Description | An invited talk at Conference on "Recent Advances in Statistics and Data Science" in Rutgers |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Study participants or study members |
Results and Impact | Conference on Recent Advances in Statistics and Data Science with a Celebration of Professors Regina Liu and Cun-Hui Zhang's Special Birthdays |
Year(s) Of Engagement Activity | 2023 |
URL | https://statistics.rutgers.edu/news-events/conferences/684-conference-on-recent-advances-in-statisti... |
Description | Invited talk at 2023 IMS International Conference on Statistics and Data Science, Lisbon |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Study participants or study members |
Results and Impact | The objective of ICSDS is to bring together researchers in statistics and data science from academia, industry, and government in a stimulating setting to exchange ideas on the developments of modern statistics, machine learning, and broadly defined theory, methods, and applications in data science. |
Year(s) Of Engagement Activity | 2023 |
URL | https://www.icsds2023.com/ |
Description | Invited talk at Conference on "Statistical Foundations of Data Science and Applications" in Princeton |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Study participants or study members |
Results and Impact | The conference was held in honour of Professor Jianqing Fan's 60th birthday and was attended by over 300 academics, students and people working in industry. |
Year(s) Of Engagement Activity | 2023 |
URL | https://fan60.princeton.edu/ |
Description | Invited talk at the 2023 Kansas Econometrics Workshop, Kansas |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Study participants or study members |
Results and Impact | This workshop is part of a yearly series focusing on recent developments in econometric theory and methodology, as well as applications in economics, finance and other applied fields such as data science and statistics. The main purpose of the econometrics workshop series at KU is to promote methodological and theoretical research and applications in modern econometrics, statistics and data science, and to provide a forum for researchers, including Ph.D. students, to come together and interact through social discussions and presentations. |
Year(s) Of Engagement Activity | 2023 |
URL | https://econometrics.ku.edu/ |
Description | Invited talk at The OMI Machine Learning in Financial Econometrics, Oxford Man Institute |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Study participants or study members |
Results and Impact | The workshop is devoted to the dissemination of cutting-edge ideas in economics and the financial industry using machine learning tools. |
Year(s) Of Engagement Activity | 2023 |
URL | https://web.cvent.com/event/78dec7d3-ee2d-4ddb-b14d-b05e782bb209/summary |