Latent Factor Representations of Dynamic Networks in Cyber-Security

Lead Research Organisation: Imperial College London

Department Name: Mathematics

Abstract

Statistical and data science techniques have an important role to play in the next generation of cyber-security defences. Inside a typical enterprise computer network, a number of high-volume data sources are available which could enable the discovery and prevention of cyber-attacks and other nefarious network activity. To improve the security of computer networks in government, industry and academia, both in the UK and worldwide, there is a requirement for developing statistical, probability model-based techniques for identifying the most subtle intrusion attempts using these data sources. The advantage of statistical approaches is their ability to learn, from historical data, complex patterns of normal computer and network behaviour, so that anomalies can be detected which would not stand out otherwise.

The aim of this project is to build strong statistical models for understanding and predicting the existence of edges indicating pairs of internet protocol addresses which connect with one another, so that unusual new connections can be identified and calibrated. This first requires the use of any open sources of information available for internet domain ranges, country-to-country connections, and the protocols, ports and services typically favoured by different addresses. Second, we would like to use statistical inference, based on historical connections which have been observed, to understand some of the latent underlying structure of the internet; in contrast to the open sources above, these are aspects which cannot be measured but which determine the propensity for particular edges to be formed.

Complex, flexible latent factor models can be most effectively constructed through Bayesian nonparametrics, and there is a growing literature on the application of these techniques to dynamic network problems to build upon. However, Bayesian nonparametric methods are computationally burdensome, and will need adaptation to be appropriately applied to the high dimension and frequency of real cyber-security data collected within an enterprise computer network. Part of this process will be effective screening and triage of these high data volumes, identifying interesting or informative sections or partitions of network traffic data.

As a research topic, statistical cyber-security lies within the EPSRC growth area of Statistics and Applied Probability, and relates to the themes of Global Uncertainties and Digital Economy. The potential impact of research to improve cyber-security ranges from strengthening both national security and the position of the UK as a digital economy through to the societal impact of safeguarding civil liberties, as access to the internet is increasingly considered an emerging human right.

Student:

Francesco Sanna Passino

Period of Study:

Sep 17 - Oct 20

Funder:

EPSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

1943891

Research Topic:

Unclassified

Organisations

Imperial College London (Lead Research Organisation)

People	ORCID iD
Nicholas Heard (Primary Supervisor)
Francesco Sanna Passino (Student)

Publications

Sanna Passino F (2020) Bayesian estimation of the latent dimension and communities in stochastic blockmodels in Statistics and Computing

Sanna Passino F (2022) Spectral Clustering on Spherical Coordinates Under the Degree-Corrected Stochastic Blockmodel in Technometrics

Sanna Passino F (2019) Modelling dynamic network evolution as a Pitman-Yor process in Foundations of Data Science

Sanna Passino F (2022) Graph link prediction in computer networks using Poisson matrix factorisation in The Annals of Applied Statistics

Sanna Passino F (2020) Classification of periodic arrivals in event time data for filtering computer network traffic in Statistics and Computing

Sanna Passino F (2021) Link prediction in dynamic networks using random dot product graphs in Data Mining and Knowledge Discovery

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/N509486/1			30/09/2016	30/03/2022
1943891	Studentship	EP/N509486/1	30/09/2017	31/10/2020	Francesco Sanna Passino

Key Findings

Description	Novel statistical methods for separating human and automated activity in computer network data have been proposed. These methods classify whether observed activity was generated by a human user or automatically. This has important implications for computer security, since the two types of activity have different characteristics and must be modelled separately. In particular, statistical mixture modelling techniques are used for classification, based on a transformation of the arrival times obtained by estimating the dominant periodicity of the automated component using Fisher's g-test. The proposed model correctly distinguishes events that might be due to the presence of a human at the machine from those generated automatically.
New statistical models for discovering latent group structure within computer networks have also been developed. Discovering communities within a computer network is important to better understand the normal behaviour of the network, and subsequently identify intrusions as deviations from the learned normal behaviour. In particular, the proposed models tackle the problem of simultaneous model selection for dimensionality and complexity in stochastic blockmodels and degree-corrected stochastic blockmodels, the most commonly used models for identifying groups of nodes within a network. The proposed methodology represents a statistically principled technique to estimate groups and communities within a network.
Furthermore, a novel model for the evolution of computer networks has been proposed, based on the Pitman-Yor process. In computer networks, new links and connections appear frequently, and it is therefore necessary to construct models that are simple enough to cope with the large volume of data available, but complex enough to capture the underlying patterns. The Pitman-Yor process appropriately models the probability of new connections, and its simplicity makes it scalable to the large networks common in practical applications. The anomaly scores for each event obtained using the Pitman-Yor models are then combined using statistical meta-analysis methods, giving scores that represent the total risk associated with each machine within the network. The model shows good performance in identifying compromised machines in a real-world enterprise computer network. The results have been published in a statistics journal.
Finally, part of this work has explored statistical link prediction methods within large computer networks. Link prediction in computer networks is the task of assigning anomaly scores to future observed connections, in order to evaluate whether a machine has been compromised. In particular, two methods have been analysed: Poisson matrix factorisation and random dot product graphs. Statistical frameworks to include additional information about the nodes, for example geographic location or type of machine, have been proposed. The models have also been extended to a dynamic setting, allowing the anomaly scores to be estimated and adapted over time while accounting for seasonal patterns.
Exploitation Route	The novel models proposed in this project represent contributions towards a unified statistical model for cyber-security applications, and could be used as generic building blocks in more complex models that include additional sources of information which might be available to the user or to the enterprise. The proposed techniques could be implemented within real-world enterprises, complementing signature-based methods and improving the security of the organisation's cyber-systems. In particular, the proposed models could help automate the anomaly detection process, so that it need not be constantly supervised by humans. This is a particularly important and relevant challenge, since real-world enterprises collect and process large amounts of data which they must protect.
Sectors	Digital/Communication/Information Technologies (including Software); Security and Diplomacy

Software and Technical Products

Title	dcsbm: Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel
Description	The software in this repository implements the methodology and reproduces the results and simulations in Sanna Passino, F., Heard, N. A., and Rubin-Delanchy, P. (2021) "Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel", Technometrics (to appear).
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	The software makes the proposed statistical methodology accessible to a wider audience through an easy-to-use Python script.
URL	https://github.com/fraspass/dcsbm
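
As a rough illustration of the idea named in the title, and not the code in the repository above, the following sketch converts the rows of a d-dimensional node embedding into d - 1 angles. The function name `to_spherical` and the particular angle convention are assumptions made for this example; the convention used in the paper may differ. Because the angles depend only on the direction of each row, rescaling a node's embedding by a positive constant, which is the effect of degree correction, leaves them unchanged.

```python
import numpy as np

def to_spherical(X):
    """Map each row of an (n, d) array, d >= 2, to d - 1 angles.

    Convention (an assumption for this sketch):
    theta_i = arccos(x_i / ||x_{i:d}||) for i = 1..d-2,
    and the last angle is atan2(x_d, x_{d-1}).
    Angles that are undefined (zero tail norm) are set to 0.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # tail_norm[:, i] holds the Euclidean norm of X[:, i:]
    tail_norm = np.sqrt(np.cumsum(X[:, ::-1] ** 2, axis=1))[:, ::-1]
    head, denom = X[:, :-2], tail_norm[:, :-2]
    # Safe division: where the denominator is 0 the ratio defaults to 1 (angle 0)
    ratio = np.divide(head, denom, out=np.ones_like(head), where=denom > 0)
    theta = np.empty((n, d - 1))
    theta[:, :-1] = np.arccos(np.clip(ratio, -1.0, 1.0))
    theta[:, -1] = np.arctan2(X[:, -1], X[:, -2])
    return theta
```

The resulting angles could then be passed to any standard clustering routine; that step is left out here.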


Title	human_activity: Mixture models for separating human and automated activity on a single edge within a computer network
Description	The software implements the methodology presented in Sanna Passino, F. and Heard, N. A., "Classification of periodic arrivals in event time data for filtering computer network traffic", Statistics and Computing 30(5), 1241-1254 (2020).
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	The software makes the proposed statistical methodology accessible to a wider audience through an easy-to-use Python script.
URL	https://github.com/fraspass/human_activity
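
The periodicity step mentioned in the Key Findings relies on Fisher's g-test. The sketch below is a generic version of that test, not the repository's implementation: it assumes the event times have already been binned into evenly spaced counts, and the function name `fisher_g_test` is hypothetical. The statistic g is the largest periodogram ordinate divided by the sum of all ordinates, and the p-value uses Fisher's exact formula under a Gaussian white-noise null.

```python
import math
import numpy as np

def fisher_g_test(counts):
    """Return (g, p_value, period_in_bins) for an evenly binned, non-constant series."""
    x = np.asarray(counts, dtype=float)
    n = x.size
    x = x - x.mean()
    m = (n - 1) // 2                       # number of Fourier frequencies used
    power = np.abs(np.fft.rfft(x))[1:m + 1] ** 2
    k = int(np.argmax(power))              # index of the dominant frequency
    g = power[k] / power.sum()
    # Exact null tail probability: sum_j (-1)^(j-1) C(m, j) (1 - j g)^(m-1)
    p = 0.0
    for j in range(1, min(m, int(1.0 / g)) + 1):
        base = 1.0 - j * g
        if base <= 0.0:
            break
        log_term = (math.lgamma(m + 1) - math.lgamma(j + 1)
                    - math.lgamma(m - j + 1) + (m - 1) * math.log(base))
        p += (-1) ** (j - 1) * math.exp(log_term)
    p = min(max(p, 0.0), 1.0)              # guard against rounding error
    period = n / (k + 1)                   # dominant period, in bins
    return g, p, period
```

A small p-value suggests a dominant periodicity, and `period` gives its estimated length in units of the bin width.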


Title	pitman_yor: Modelling dynamic network evolution as a Pitman-Yor process
Description	The software implements the methodology in Sanna Passino, F. and Heard, N. A., "Modelling dynamic network evolution as a Pitman-Yor process", Foundations of Data Science, 2019, 1(3):293-306.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	The software makes the proposed statistical methodology accessible to a wider audience through an easy-to-use Python script.
URL	https://github.com/fraspass/pitman_yor
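
To make the scoring idea concrete, here is a minimal sketch of the standard Pitman-Yor predictive rule, followed by one way of combining per-event p-values. It is not the repository code. The function names are hypothetical, all unseen values are lumped into a single "new" outcome (a simplification), and Fisher's method is used only as one example of a meta-analysis combiner; the paper may use a different one.

```python
import math

def py_predictive(counts, alpha, d):
    """Predictive probabilities for the next observation.

    counts: mapping value -> number of times seen so far.
    alpha: strength parameter (alpha > -d); d: discount (0 <= d < 1).
    Returns (prob_new, {value: prob}).
    """
    n = sum(counts.values())
    k = len(counts)
    if n == 0:
        return 1.0, {}
    prob_new = (alpha + d * k) / (alpha + n)
    prob_seen = {v: (c - d) / (alpha + n) for v, c in counts.items()}
    return prob_new, prob_seen

def py_p_value(counts, observed, alpha, d):
    """Discrete p-value: total probability of outcomes no more likely than `observed`."""
    prob_new, prob_seen = py_predictive(counts, alpha, d)
    p_obs = prob_seen.get(observed, prob_new)
    outcomes = list(prob_seen.values()) + [prob_new]
    return min(1.0, sum(p for p in outcomes if p <= p_obs + 1e-12))

def fisher_combine(p_values):
    """Fisher's method for k independent p-values.

    -2 * sum(log p) is chi-squared with 2k degrees of freedom under the null;
    for even degrees of freedom the survival function has a closed form.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)   # half of the test statistic
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For example, with `counts = {"a": 3, "b": 1}`, `alpha = 1.0` and `d = 0.5`, a previously unseen source has predictive probability (1 + 0.5 * 2) / (1 + 4) = 0.4.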


Title	sbm: Bayesian estimation of the latent dimension and communities in stochastic blockmodels
Description	The software implements the methodologies presented in Sanna Passino, F. and Heard, N. A. (2020) "Bayesian estimation of the latent dimension and communities in stochastic blockmodels", Statistics and Computing, 30(5), 1291-1307.
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	The software makes the proposed statistical methodology accessible to a wider audience through an easy-to-use Python script.
URL	https://github.com/fraspass/sbm
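
For readers unfamiliar with the term "latent dimension", the sketch below shows adjacency spectral embedding, a common way of obtaining latent positions for the nodes of a graph. It is a generic illustration under the assumption of a symmetric adjacency matrix, not the repository's code, and the function name is hypothetical. Here the dimension `d` is supplied by the caller, whereas the work described above treats it as unknown and estimates it together with the communities.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed the nodes of a symmetric adjacency matrix A into d dimensions.

    Keeps the d eigenvalues of largest magnitude and returns the matching
    eigenvectors scaled by the square root of the absolute eigenvalues.
    """
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(eigvals))[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))
```

When the retained eigenvalues are non-negative, inner products between rows of the embedding approximate the corresponding entries of `A`, which is how a random dot product graph scores a potential link.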
