Statistical Network Analysis: Model Selection, Differential Privacy, and Dynamic Structures

Lead Research Organisation: London School of Economics and Political Science
Department Name: Statistics

Abstract

In this proposal we tackle some challenging problems in the following three aspects of statistical network analysis.

1. Jittered resampling for selecting network models

The first and arguably the most important step in statistical modelling is to choose an appropriate model for a given data set. While statistics offers many data-driven model-selection methods, including those based on data reuse (e.g., bootstrap resampling and cross-validation), their application to network data is problematic, so it remains common to choose a network model subjectively. The major difficulty in reusing network data is to mimic the underlying probability mechanism. The few existing attempts include cross-validation under some specific settings. We propose a new 'bootstrap jittering' or 'jittered resampling' method for selecting an appropriate network model. The method does not impose any specific model form or conditions, and therefore provides a generic tool for network model selection.

2. Edge differential privacy for network data

In network data, individuals are typically represented by nodes and their inter-relationships by edges, so network data often contain sensitive personal information. On the other hand, the information of interest in the data should be preserved. Hence the primary concern for data privacy is twofold: (a) to release only a sanitized version of the original network data to protect privacy, and (b) to ensure that the sanitized data preserve the information of interest, so that analysis based on the sanitized data is still meaningful. This is now a vibrant research area, as data privacy becomes ever more sensitive and important given the abundance of personal information available in digital form, though the contribution from statistics is still at a preliminary stage. We will adopt the so-called dyadwise randomized response approach. While such a scheme is differentially private, statistical inference based on the released data remains largely unexplored. Our initial investigation reveals some attractive features of this approach, suggesting more efficient statistical inference than is possible under other data release mechanisms. We will further develop this scheme to handle networks with additional node features/attributes (e.g., social networks with additional information on age, gender, hobby, occupation, etc.).
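To make the idea of a dyadwise randomized response concrete, the following is a minimal Python sketch of the textbook form of the mechanism, not the specific scheme developed in this project. It assumes an undirected network stored as a symmetric 0/1 adjacency matrix, flips each dyad independently with probability 1/(1 + exp(epsilon)), which is the standard choice giving epsilon-edge differential privacy for epsilon > 0, and then debiases the observed edge density. The function names and the list-of-lists representation are illustrative assumptions.

```python
import math
import random


def flip_probability(epsilon):
    """Flip probability p with (1 - p) / p = exp(epsilon)."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomized_response(adj, epsilon, rng):
    """Release a sanitized copy of a symmetric 0/1 adjacency matrix.

    Each dyad (i, j) with i < j is flipped independently with
    probability p; the diagonal is kept at zero (no self-loops).
    """
    n = len(adj)
    p = flip_probability(epsilon)
    released = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() < p:
                bit = 1 - bit
            released[i][j] = released[j][i] = bit
    return released


def debiased_density(released, epsilon):
    """Unbiased estimate of the original edge density (needs epsilon > 0).

    The expected released density is p + d * (1 - 2p) for true density d,
    so the observed density is shifted and rescaled accordingly.
    """
    n = len(released)
    p = flip_probability(epsilon)
    n_dyads = n * (n - 1) / 2
    observed = sum(
        released[i][j] for i in range(n) for j in range(i + 1, n)
    ) / n_dyads
    return (observed - p) / (1.0 - 2.0 * p)
```

A smaller epsilon gives stronger privacy but a flip probability closer to 1/2, which inflates the variance of the debiased estimate; this trade-off is what makes inference on the released data a statistical question rather than a purely computational one.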

3. Modelling and forecasting dynamic networks

Most existing statistical inference methods for networks are confined to static network data, though a substantial proportion of real networks are dynamic in nature. Understanding and being able to forecast changes over time is of immense importance for, e.g., monitoring anomalies in internet traffic networks, predicting demand and setting prices in electricity supply networks, managing natural resources using environmental readings from sensor networks, and understanding how news and opinion propagate in online social networks. Unfortunately, the development of the foundations for dynamic networks is still in its infancy, and the available modelling and inference tools are scarce. For handling dynamic changes in networks, most available techniques are based on the evolution analysis of snapshot networks over time, without really modelling the changes dynamically. Although this reflects the fact that most networks change slowly over time, it does not provide any insight into the dynamics underlying the changes and is almost powerless for prediction, for which it is essential to build appropriate stochastic models that capture dynamic dependence and dynamic changes explicitly. Combining recent developments in tensor decomposition and factor-driven dimension reduction with efficient time series tools such as exponential smoothing and Kalman filters, we will take on this challenge and build new dynamic models.
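As a point of reference for the time series tools mentioned above, the sketch below applies simple exponential smoothing to a sequence of adjacency-matrix snapshots to produce a one-step-ahead forecast of each edge probability. It is only a naive baseline under stated assumptions: each dyad is smoothed independently with a common, user-chosen weight alpha, and no tensor or factor structure is used, so it is not one of the dynamic models this project aims to build. The function name and the list-of-lists representation are illustrative.

```python
def exponential_smoothing_forecast(snapshots, alpha):
    """One-step-ahead edge-probability forecast from 0/1 snapshots.

    snapshots: list of n x n adjacency matrices, oldest first.
    alpha: smoothing weight in (0, 1]; larger values track recent
           snapshots more closely.
    """
    if not snapshots:
        raise ValueError("at least one snapshot is required")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(snapshots[0])
    # Initialise the smoothed level with the first snapshot.
    level = [[float(x) for x in row] for row in snapshots[0]]
    for snap in snapshots[1:]:
        for i in range(n):
            for j in range(n):
                level[i][j] = alpha * snap[i][j] + (1.0 - alpha) * level[i][j]
    return level
```

Because every dyad is treated separately, a forecast like this cannot borrow strength across nodes or capture dependence between edges, which is the gap that explicit stochastic models for dynamic networks are meant to fill.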

Publications

Chang J (2023) Modelling matrix time series via a tensor CP-decomposition in Journal of the Royal Statistical Society Series B: Statistical Methodology

Han Y (2023) Simultaneous Decorrelation of Matrix Time Series in Journal of the American Statistical Association

Zhang B (2023) Factor Modeling for Clustering High-Dimensional Time Series in Journal of the American Statistical Association

Zhou Y (2023) Testing for the Markov property in time series via deep conditional generative learning in Journal of the Royal Statistical Society Series B: Statistical Methodology

 
Description: Dr Y Goude and his team, EDF in Paris
Organisation: EDF Energy
Department: EDF Innovation and Research
Country: France
Sector: Private
PI Contribution: We and the EDF team have worked together to develop the curve linear regression methodology and have tested it with EDF data. We are now working on the new challenges in load forecasting that arise from the increase in renewable energy and the development of small distributed production units, as well as from changing consumption behaviour due to plug-in (hybrid) electric vehicles, heat pumps and personal storage capacities.
Collaborator Contribution: We and the EDF team have worked together to develop the curve linear regression methodology and have tested it with EDF data.
Impact: Three publications, one technical report, and one software package in R.
Start Year: 2010