Statistical Network Analysis: Model Selection, Differential Privacy, and Dynamic Structures

Lead Research Organisation: London School of Economics & Pol Sci

Department Name: Statistics

Abstract

In this proposal we tackle some challenging problems in the following three aspects of statistical network analysis.

1. Jittered resampling for selecting network models

The first and arguably the most important step in statistical modelling is to choose an appropriate model for a given data set. While there exist many data-driven model-selection methods in statistics in general, including those based on data reuse (e.g., bootstrap resampling, cross-validation), their application to network data is problematic. It therefore remains common to choose a network model subjectively. The major difficulty in reusing network data is to mimic the underlying probability mechanisms. The few existing attempts include cross-validation under some specific settings. We propose a new 'bootstrap jittering' or 'jittered resampling' method for selecting an appropriate network model. The method does not impose any specific forms or conditions, and therefore provides a generic tool for network model selection.

2. Edge differential privacy for network data

In network data, individuals are typically represented by nodes and their inter-relationships by edges. Network data therefore often contain sensitive personal information. On the other hand, the information of interest in the data should be preserved. Hence the primary concern for data privacy is twofold: (a) to release only a sanitized version of the original network data to protect privacy, and (b) to ensure that the sanitized data preserve the information of interest, so that analysis based on the sanitized data is still meaningful. This is now a vibrant research area, as data privacy becomes ever more sensitive and important with the abundance of personal information available in digital format, though the contribution from statistics is still at a preliminary stage. We will adopt the so-called dyadwise randomized response approach. While such a scheme is differentially private, statistical inference based on the released data remains largely unexplored. Our initial investigation reveals some attractive features of this approach, suggesting more efficient statistical inference than that based on other data release mechanisms. We will further develop this scheme to handle networks with additional node features/attributes (e.g., social networks with additional information on age, gender, hobby, occupation, etc.).
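As a rough illustration of what a dyadwise randomized response release can look like, the sketch below flips each dyad of an undirected 0/1 adjacency matrix independently. It assumes the textbook flip probability 1/(1 + e^ε), for which the likelihood ratio of any single released dyad is bounded by e^ε. The function names and the simple debiased edge-density estimator are illustrative assumptions, not the estimators developed in the project.

```python
import numpy as np


def flip_probability(epsilon: float) -> float:
    """Flip probability under which one randomized dyad is epsilon-differentially private."""
    return 1.0 / (1.0 + np.exp(epsilon))


def release_network(adj: np.ndarray, epsilon: float, rng: np.random.Generator) -> np.ndarray:
    """Return a sanitized copy of an undirected, loop-free 0/1 adjacency matrix.

    Each dyad (i, j), i < j, is flipped independently with the same probability.
    """
    n = adj.shape[0]
    p = flip_probability(epsilon)
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)  # one entry per dyad
    flips = (rng.random((n, n)) < p) & upper
    flips = flips | flips.T                            # keep the release symmetric
    return np.where(flips, 1 - adj, adj)


def debiased_density(released: np.ndarray, epsilon: float) -> float:
    """Estimate the original edge density from the released network.

    Uses E[released dyad] = p + (1 - 2p) * density, which requires epsilon > 0.
    """
    p = flip_probability(epsilon)
    iu = np.triu_indices(released.shape[0], k=1)
    observed = released[iu].mean()
    return (observed - p) / (1.0 - 2.0 * p)
```

A smaller ε means more flips and stronger privacy, at the cost of a noisier debiased estimate, since the correction divides by 1 - 2p, which shrinks towards zero as ε does.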

3. Modelling and forecasting dynamic networks

Most existing statistical inference methods for networks are confined to static network data, though a substantial proportion of real networks are dynamic in nature. Understanding and being able to forecast changes over time is of immense importance for, e.g., monitoring anomalies in internet traffic networks, predicting demand and setting prices in electricity supply networks, managing natural resources using environmental readings from sensor networks, and understanding how news and opinion propagate in online social networks. Unfortunately, the development of the foundations for dynamic networks is still in its infancy, and the available modelling and inference tools are sparse. Most available techniques for handling dynamic changes analyse the evolution of snapshot networks over time without really modelling the changes dynamically. Although this reflects the fact that most networks change slowly over time, it does not provide any insight into the dynamics underlying the changes and is almost powerless for prediction, for which it is essential to build appropriate stochastic models that capture dynamic dependence and dynamic changes explicitly. Combining recent developments in tensor decomposition and factor-driven dimension reduction with efficient time series tools such as exponential smoothing and Kalman filters, we will take on this challenge to build some new dynamic models.
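To make "modelling the changes dynamically" concrete, here is a minimal sketch under a strong simplifying assumption: every dyad evolves independently as a two-state Markov chain with a common formation probability `alpha` and dissolution probability `beta`. This is only one elementary example of an explicit stochastic model for edge dynamics. The function names and the homogeneity assumption are hypothetical and are not taken from the proposal.

```python
import numpy as np


def simulate_edges(n_nodes, n_steps, alpha, beta, rng):
    """Simulate snapshots of an undirected network, one row per time point.

    Each row holds the upper-triangular dyads as a 0/1 vector.
    alpha = P(edge forms | absent at t-1), beta = P(edge dissolves | present at t-1).
    """
    n_dyads = n_nodes * (n_nodes - 1) // 2
    pi = alpha / (alpha + beta)                      # stationary edge probability
    state = (rng.random(n_dyads) < pi).astype(int)   # start from stationarity
    path = np.empty((n_steps, n_dyads), dtype=int)
    for t in range(n_steps):
        u = rng.random(n_dyads)
        form = (state == 0) & (u < alpha)
        dissolve = (state == 1) & (u < beta)
        state = np.where(form, 1, np.where(dissolve, 0, state))
        path[t] = state
    return path


def estimate_rates(path):
    """Estimate alpha and beta by the observed transition frequencies."""
    prev, curr = path[:-1], path[1:]
    alpha_hat = ((prev == 0) & (curr == 1)).sum() / (prev == 0).sum()
    beta_hat = ((prev == 1) & (curr == 0)).sum() / (prev == 1).sum()
    return alpha_hat, beta_hat


def forecast_edge_probability(last, alpha_hat, beta_hat, horizon=1):
    """h-step-ahead probability that each dyad is an edge, given the last snapshot."""
    pi = alpha_hat / (alpha_hat + beta_hat)
    return pi + (last - pi) * (1.0 - alpha_hat - beta_hat) ** horizon
```

Unlike a comparison of successive snapshots, a model of this kind yields forecasts directly: the predicted edge probability starts from the last observed state and decays geometrically towards the stationary level as the horizon grows.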

Funded Value:

£501,155

Funded Period:

Jun 21 - Aug 24

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/V007556/1

Principal Investigator:

Qiwei Yao

Research Subject:

Mathematical sciences (100%)

Research Topic:

Statistics & Appl. Probability (100%)

Organisations

People	ORCID iD
Qiwei Yao (Principal Investigator)

Publications

Zhou Y (2023) Testing for the Markov property in time series via deep conditional generative learning. in Journal of the Royal Statistical Society. Series B, Statistical methodology

Zhou Y (2023) Testing for the Markov Property in Time Series via Deep Conditional Generative Learning

Zhang B (2021) Factor Modelling for Clustering High-dimensional Time Series

Zhang B (2023) Factor Modeling for Clustering High-Dimensional Time Series in Journal of the American Statistical Association

Xu X (2021) Day-ahead probabilistic forecasting for French half-hourly electricity loads and quantiles for curve-to-curve regression in Applied Energy

Jiang, B. (2023) A two-way heterogeneity model for dynamic networks

Jiang Binyan (2023) Autoregressive Networks in Journal of Machine Learning Research

Han Y (2021) Simultaneous Decorrelation of Matrix Time Series

Han Y (2023) Simultaneous Decorrelation of Matrix Time Series in Journal of the American Statistical Association

Chang, J. (2024) Edge differentially private estimation in the β-model via jittering and method of moments in The Annals of Statistics

Key Findings
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Description	1. We provide a simple and explicit autoregressive type framework to model and forecast dynamic changes of network data. It facilitates simple and efficient statistical inference and model diagnostic checking. The framework can serve as a basic building block to accommodate various stylized features observed in real network data. 2. A standing challenge in data privacy is the trade-off between the level of privacy and the efficiency of statistical inference. We adopt the so-called dyadwise randomized response approach. While such a scheme is differentially private, the estimation for network parameters shows an interesting phase transition. We further devise a novel adaptive bootstrap procedure to construct uniform inference across different phases.
Exploitation Route	Issues around data privacy attract ever-increasing attention in society. Network data privacy is perceived to be especially difficult due to the special structure and often binary nature of network data. Our work in this direction adds to statisticians' contributions in this important area. Dynamic modelling and forecasting for network flows directly links time series with the new developments and challenges associated with big data, which is very much needed. Hence the potential beneficiaries include both theoretical and applied network data analysts in statistics and other disciplines such as computer science, social networks, network communication, energy distribution and forecasting, genetic linkages, economics and finance.
Sectors	Communities and Social Services/Policy; Creative Economy; Energy; Environment; Financial Services and Management Consultancy; Manufacturing including Industrial Biotechnology; Security and Diplomacy
URL	https://stats.lse.ac.uk/q.yao/qyao.links/publicationsAll.html


Title	Factor Modeling for Clustering High-Dimensional Time Series
Description	We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao. Supplementary materials for this article are available online.
Type Of Material	Database/Collection of data
Year Produced	2023
Provided To Others?	Yes
URL	https://tandf.figshare.com/articles/dataset/Factor_Modelling_for_Clustering_High-dimensional_Time_Se...


Title	Factor Modelling for Clustering High-dimensional Time Series
Description	We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data as well as a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao (2012).
Type Of Material	Database/Collection of data
Year Produced	2023
Provided To Others?	Yes
URL	https://tandf.figshare.com/articles/dataset/Factor_Modelling_for_Clustering_High-dimensional_Time_Se...


Title	Simultaneous Decorrelation of Matrix Time Series
Description	We propose a contemporaneous bilinear transformation for a p × q matrix time series to alleviate the difficulties in modeling and forecasting matrix time series when p and/or q are large. The resulting transformed matrix assumes a block structure consisting of several small matrices, and those small matrix series are uncorrelated across all times. Hence, an overall parsimonious model is achieved by modeling each of those small matrix series separately without the loss of information on the linear dynamics. Such a parsimonious model often has better forecasting performance, even when the underlying true dynamics deviates from the assumed uncorrelated block structure after transformation. The uniform convergence rates of the estimated transformation are derived, which vindicate an important virtue of the proposed bilinear transformation, that is, it is technically equivalent to the decorrelation of a vector time series of dimension max(p, q) instead of p × q. The proposed method is illustrated numerically via both simulated and real data examples. Supplementary materials for this article are available online.
Type Of Material	Database/Collection of data
Year Produced	2023
Provided To Others?	Yes
URL	https://tandf.figshare.com/articles/dataset/Simultaneous_Decorrelation_of_Matrix_Time_Series_/216417...


Title	Simultaneous Decorrelation of Matrix Time Series*
Description	We propose a contemporaneous bilinear transformation for a p × q matrix time series to alleviate the difficulties in modeling and forecasting matrix time series when p and/or q are large. The resulting transformed matrix assumes a block structure consisting of several small matrices, and those small matrix series are uncorrelated across all times. Hence an overall parsimonious model is achieved by modelling each of those small matrix series separately without the loss of information on the linear dynamics. Such a parsimonious model often has better forecasting performance, even when the underlying true dynamics deviates from the assumed uncorrelated block structure after transformation. The uniform convergence rates of the estimated transformation are derived, which vindicate an important virtue of the proposed bilinear transformation, i.e. it is technically equivalent to the decorrelation of a vector time series of dimension max(p, q) instead of p × q. The proposed method is illustrated numerically via both simulated and real data examples.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://tandf.figshare.com/articles/dataset/Simultaneous_Decorrelation_of_Matrix_Time_Series_/216417...


Description	EDF in Paris
Organisation	EDF Energy
Department	EDF Innovation and Research
Country	France
Sector	Private
PI Contribution	We and the EDF team led by Dr Y Goude have worked together to develop the curve regression methodology and time series PCA to tackle the new challenges in forecasting loads that arise from the increase in renewable energy and the development of small distributed production units, as well as the changing consumption behaviour due to plug-in (hybrid) electric vehicles, heat pumps and personal storage capacities.
Collaborator Contribution	We and the EDF team have worked together to develop the curve linear regression methodology and have tested it with EDF data.
Impact	Four publications, and one software package in R.
Start Year	2010


Title	HDTSA
Description	An R package, available on CRAN, for statistical inference on high-dimensional time series, covering factor modelling, principal component analysis for vector and matrix time series, and inference for unit roots and cointegration.
Type Of Technology	Software
Year Produced	2023
Open Source License?	Yes
Impact	The software is publicly available through CRAN.
URL	https://cran.r-project.org/package=HDTSA


Description	Invited talk at the Conference on "Recent Advances in Statistics and Data Science" at Rutgers
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Study participants or study members
Results and Impact	Conference on Recent Advances in Statistics and Data Science with a Celebration of Professors Regina Liu and Cun-Hui Zhang's Special Birthdays
Year(s) Of Engagement Activity	2023
URL	https://statistics.rutgers.edu/news-events/conferences/684-conference-on-recent-advances-in-statisti...


Description	Invited talk at 2023 IMS International Conference on Statistics and Data Science, Lisbon
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Study participants or study members
Results and Impact	The objective of ICSDS is to bring together researchers in statistics and data science from academia, industry, and government in a stimulating setting to exchange ideas on the developments of modern statistics, machine learning, and broadly defined theory, methods, and applications in data science.
Year(s) Of Engagement Activity	2023
URL	https://www.icsds2023.com/


Description	Invited talk at Conference on "Statistical Foundations of Data Science and Applications" in Princeton
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Study participants or study members
Results and Impact	The conference was in honour of Professor Jianqing Fan's 60th birthday and was attended by over 300 academics, students and people working in industry.
Year(s) Of Engagement Activity	2023
URL	https://fan60.princeton.edu/


Description	Invited talk at the 2023 Kansas Econometrics Workshop, Kansas
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Study participants or study members
Results and Impact	This workshop is part of a yearly series focusing on recent developments in econometric theory and methodology, as well as applications in economics, finance and other applied fields such as data science and statistics. The main purpose of the econometrics workshop series at KU is to promote methodological and theoretical research and applications in modern econometrics, statistics and data science, and to provide a forum for researchers, including Ph.D. students, to come together and interact through social discussions and presentations.
Year(s) Of Engagement Activity	2023
URL	https://econometrics.ku.edu/


Description	Invited talk at The OMI Machine Learning in Financial Econometrics, Oxford Man Institute
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Study participants or study members
Results and Impact	The workshop is devoted to the dissemination of cutting-edge ideas in economics and the financial industry using machine learning tools.
Year(s) Of Engagement Activity	2023
URL	https://web.cvent.com/event/78dec7d3-ee2d-4ddb-b14d-b05e782bb209/summary
