Tractable inference for statistical network models with local dependence

Lead Research Organisation: University of Reading
Department Name: Mathematics and Statistics

Abstract

This project concerns the analysis of data from systems consisting of large numbers of linked objects. Such data is commonly found in a wide range of applications in science. The links may arise through the existence of networks, which contain explicit connections between objects (such as links between websites), or simply through weaker associations (e.g. we expect that in most images, most pixels will be of a similar colour to nearby pixels).

Examples of such data arise in many different fields. Applications in physics (magnetism), biology (genetics, protein design, neural models), computer science (artificial intelligence, computer vision), social science (social networks), economics (the network effect) and engineering (target tracking) all share common underlying characteristics. In many cases, the application can be reduced to understanding the process by which a network is generated and to using this knowledge to predict future events. For example, one can envisage learning, from previous examples of contact between known terrorists, the patterns of communication that mark them out as such. Knowledge of these patterns may then be used to identify terrorist cells from communication networks. Making such an inference in the presence of uncertainty (which is always the case when analysing real data) is a problem that lies in the realm of statistics. In order to obtain accurate results, the use of a technique known as Bayesian inference is usually necessary. Bayesian inference is usually implemented by means of an iterative algorithm known as a Monte Carlo method.
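As a toy illustration of the kind of Monte Carlo method referred to above (and not code from the project), the following sketch implements a minimal random-walk Metropolis-Hastings sampler; the target distribution and all numerical values are invented for the example.

```python
# Minimal random-walk Metropolis-Hastings sampler: an illustrative sketch
# only. It draws samples from a posterior known only up to a constant.
import numpy as np

def log_target(theta):
    # Hypothetical unnormalised log-posterior: a standard normal prior
    # combined with a Gaussian likelihood centred at 1.0.
    return -0.5 * theta**2 - 0.5 * (theta - 1.0)**2

def metropolis_hastings(n_iters=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, samples = 0.0, np.empty(n_iters)
    for i in range(n_iters):
        proposal = theta + step * rng.standard_normal()
        # Accept the move with probability min(1, target ratio).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples[i] = theta
    return samples

print(metropolis_hastings().mean())  # approx. 0.5 for this toy target
```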

However, using Monte Carlo methods to perform inference in models of large networks can be extremely computationally expensive: the algorithm can run for days, weeks or months without producing a useful result. This means that in practice these methods cannot be used, so that large networks cannot be analysed using "model-based" statistical techniques. This project is concerned with the development of algorithms that are much more computationally efficient, with the aim of reducing computational time to more manageable levels. These algorithms build on some of the most recent developments in statistics and machine learning.

Planned Impact

This project will develop software (based on statistical methodology) that will be used by other researchers. Therefore its pathway to impact is indirect, in that it is reliant on working with researchers whose application areas can lead to economic and societal impact. The academic impact of the work is a crucial component in the path to other impacts.

Despite this indirect pathway, the potential for impact from this research is significant. Algorithms such as the ones developed in this project, when implemented in freely available software packages, represent the gateway to making sense of network data.

The algorithms that will be developed by the project will help describe how a network is divided into communities, and the relative effect of different local interactions within and between these communities in giving rise to the global structure of the network. The analysis of large networks is central to, for example: studying the organisation of the human brain (a topic central to understanding conditions such as Alzheimer's disease, autism or attention deficit hyperactivity disorder); understanding the behaviour of consumers; and in understanding how people form their opinions. Analysing "Big Data" such as this is seen as a major opportunity in the UK, where it has been recognised to be of national economic and scientific importance (it is one of the UK Government's "8 Great Technologies" of the future that will help drive economic growth). This project lies in the growing EPSRC Research Area of "Statistics and Applied Probability", but obvious potential impact outside of this field lies in the following themes:

- "Digital Economy", in aiding understanding of how complex interactions influence society;
- "Global Uncertainties", in security applications such as locating a terrorist cell in a communication network;
- "Healthcare Technologies", in helping understand and diagnose diseases such as dementia.

The statistical work in the project also has the potential to lead to impacts in other areas. Again, these are indirect, but are potentially significant. The methodology underlying the algorithms is quite general, and this type of approach may also find use in applications as diverse as weather and climate analysis, genomics and computer vision. The understanding of how the statistical methods used in this project perform is useful to statisticians who work in different areas of application.

The project is ideal for a young researcher to work on, since they will gain skills in two growth areas of national importance: firstly, in the analysis of network data; secondly, in the fields of statistics, machine learning and data science, the subjects at the heart of the "Big Data Revolution".

Publications


- Everitt R (2020) Delayed Acceptance ABC-SMC, Journal of Computational and Graphical Statistics
- Everitt R. G. (2017) Marginal sequential Monte Carlo for doubly intractable models, arXiv e-prints
- Everitt R. G. (2017) Bootstrapped synthetic likelihood, arXiv e-prints
- Everitt R. G. (2017) Delayed acceptance ABC-SMC, arXiv e-prints

 
Description This project focussed on statistical models for network data. The objectives of the award were to develop improved inference methods for state-of-the-art models applied to real data.

There were two main strands of work described in the proposal:

- making exact inference techniques computationally feasible;
- developing faster approximate methods.

Both objectives were met.


The models used in the project have two characteristics: they model a network as being made up of a number of different clusters (or communities); and they model what the network looks like within the clusters.
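As an illustration of this two-level structure, the following sketch simulates a small network from a stochastic block model, one of the cluster-based models named later in this report; the number of nodes, the cluster assignments and the edge probabilities are arbitrary values chosen for the example.

```python
# Illustrative sketch: simulate a network from a stochastic block model.
# Nodes are first assigned to clusters; edges then appear with a
# probability that depends only on the clusters of their endpoints.
import numpy as np

rng = np.random.default_rng(1)
n_nodes, n_clusters = 30, 3

z = rng.integers(n_clusters, size=n_nodes)  # cluster label of each node

# Edge probabilities: denser within clusters than between them.
P = np.full((n_clusters, n_clusters), 0.05)
np.fill_diagonal(P, 0.4)

# Sample a symmetric adjacency matrix edge by edge.
A = np.zeros((n_nodes, n_nodes), dtype=int)
for i in range(n_nodes):
    for j in range(i + 1, n_nodes):
        A[i, j] = A[j, i] = int(rng.uniform() < P[z[i], z[j]])
```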

The work was performed in three stages: a first stage in which new techniques were developed for inferring how the network behaves within clusters; a second stage in which the focus was inferring the clustering; and a final stage that applied the techniques to real data.


First stage: within each cluster

Three new methods for within-cluster inference were developed. Each method had a reduced computational cost compared to existing techniques - in some cases they were an order of magnitude faster.

Together, these three new methods are one of the main achievements of the award.


Second stage: finding the clustering

This project found that the techniques outlined in the proposal were applicable to a wider range of network models than first thought. Two new approaches to inferring the clustering (also known as "community detection") were developed: one is fast and approximate, the other is slower, but more accurate.

These two new approaches to inference (used on three popular statistical network models) together form the second major achievement of the award.


Third stage: code and applications

Work on the grant has focussed on two applications: to neuroscience, and to genetics. In both situations the data is structured as a network, and the contribution of the work on the award has been to infer, and account for, the clustering structure of the network data.

The algorithms developed under the project are publicly available on GitHub.
Exploitation Route The improved techniques developed under the grant have formed the basis of a collaboration with the Centre for Environment, Fisheries and Aquaculture Science (Cefas), on inferring fish stocks to aid policy making. This work is ongoing, and has led to two successful applications for funding from NERC under the Landscape Decisions Programme (total funding £340,000).

In addition, the algorithms and expertise developed under the award are likely to also find application in a collaboration with a consumer analytics company. This work is being funded by Innovate UK (with a contribution from EPSRC).
Sectors Aerospace, Defence and Marine; Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Energy; Environment; Healthcare; Security and Diplomacy; Transport

 
Description The improved techniques developed under the grant have formed the basis of a collaboration with the Centre for Environment, Fisheries and Aquaculture Science (Cefas), on inferring fish stocks to aid policy making on fishery management. This work has so far involved examining models for sea bass, which has suffered a dramatic decline in recent years. The work is ongoing, and has led to two successful applications for funding from NERC under the Landscape Decisions Programme (total funding £340,000). In addition, the algorithms and expertise developed under the award are likely to also find application in a collaboration with a consumer analytics company. This work is being funded by Innovate UK (with a contribution from EPSRC).
First Year Of Impact 2018
Sector Environment
Impact Types Policy & public services

 
Description Knowledge Transfer Partnership Emerging & Enabling
Amount £116,755 (GBP)
Funding ID Project number 511432 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 04/2019 
End 03/2021
 
Description Quantifying uncertainty in the predictions of complex process-based models
Amount £42,873 (GBP)
Funding ID NE/T004010/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 10/2019 
End 09/2020
 
Title Bootstrapped synthetic likelihood 
Description Synthetic likelihood (SL) is a useful alternative to approximate Bayesian computation for parameter inference in models whose likelihood is not available to evaluate pointwise. Suppose that the data being modelled consists of N data points. The computational cost of SL can be high, since at each parameter it requires running the simulator M times, giving a cost of O(MN) for each parameter visited. Bootstrapped SL uses a bootstrap of a single simulation to approximate the standard SL, with a cost of O(N) for each parameter visited. A bag of little bootstraps can be used to reduce the cost further, in which case only a sub-simulation of size n < N is required.
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None yet. 
URL https://arxiv.org/abs/1711.05825
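As a hedged sketch of the idea (with a toy simulator and summary statistics invented for the example, not the paper's implementation), the code below calls the simulator once per parameter and uses bootstrap resamples of that single simulation in place of the M independent simulator runs of standard SL.

```python
# Sketch of bootstrapped synthetic likelihood (BSL) on a toy model.
# Standard SL would run the simulator M times per parameter; here the
# simulator runs once, and M bootstrap resamples of that simulation are
# used to estimate the mean and covariance of the summary statistics.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def simulate(theta, N):
    # Hypothetical simulator: N iid draws from N(theta, 1).
    return theta + rng.standard_normal(N)

def summaries(y):
    return np.array([y.mean(), y.var()])

def bootstrapped_log_sl(theta, s_obs, N=500, M=200):
    y = simulate(theta, N)  # the single simulator call for this parameter
    boot = np.array([summaries(rng.choice(y, size=N, replace=True))
                     for _ in range(M)])
    mu, cov = boot.mean(axis=0), np.cov(boot, rowvar=False)
    return multivariate_normal.logpdf(s_obs, mean=mu, cov=cov)

s_obs = summaries(simulate(1.0, 500))  # stand-in for observed data
print(bootstrapped_log_sl(1.0, s_obs))
```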
 
Title Delayed acceptance ABC-SMC 
Description Approximate Bayesian computation (ABC) is the method of choice for inferring the parameters of statistical models that are defined by a simulator. The approach originates in genetics, but has also found application in a number of different fields, including the study of infectious diseases and the analysis of network data. ABC can be infeasible when the simulator is computationally expensive, since often the simulator needs to be run a large number of times. The method we have developed is useful in cases where an alternative computationally cheap simulator is available. In these cases, our method uses simulations from the cheap simulator to automatically rule out "bad" regions of parameter space, sometimes reducing the computational cost of ABC by orders of magnitude. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None so far. 
URL https://arxiv.org/abs/1708.02230
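The sketch below illustrates the screening idea in a simple ABC rejection sampler; the cheap and expensive simulators, prior and tolerances are toy stand-ins, and the full ABC-SMC algorithm of the paper is not reproduced. Note that the cheap tolerance must be loose enough that good parameters are rarely discarded at the first stage.

```python
# Sketch of delayed-acceptance screening in ABC rejection sampling.
# A cheap simulator filters out clearly poor parameters, so the
# expensive simulator is only run for promising proposals.
import numpy as np

rng = np.random.default_rng(3)
y_obs = 1.0  # stand-in observed summary statistic

def cheap_sim(theta):      # fast, rough simulator
    return theta + 0.5 * rng.standard_normal()

def expensive_sim(theta):  # slow, accurate simulator (imagine it is costly)
    return theta + 0.1 * rng.standard_normal()

def da_abc(n_samples, eps_cheap=1.0, eps=0.2):
    accepted, expensive_calls = [], 0
    while len(accepted) < n_samples:
        theta = rng.uniform(-5.0, 5.0)  # prior draw
        # Stage 1: cheap screen rules out "bad" regions of parameter space.
        if abs(cheap_sim(theta) - y_obs) > eps_cheap:
            continue
        # Stage 2: expensive simulator only for surviving proposals.
        expensive_calls += 1
        if abs(expensive_sim(theta) - y_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted), expensive_calls

samples, calls = da_abc(100)
print(samples.mean(), calls)  # far fewer expensive calls than prior draws
```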
 
Title Estimates of partition functions using a path estimator within SMC 
Description Bayesian inference for doubly intractable distributions (such as those found when modelling networks, among other applications) is difficult due to the presence of an intractable partition function in the likelihood. Sequential Monte Carlo samplers are a standard approach to Bayesian inference, but they are difficult to use on doubly intractable models. We propose to use an existing type of sampler, but to significantly improve its efficiency by reusing, in later stages of the algorithm, simulations generated in earlier stages. We describe a heuristic approach to automatically deciding which simulations to reuse. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None yet. 
URL https://arxiv.org/abs/1710.04382
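As a hedged sketch, the code below computes a path estimator of a ratio of partition functions in its simplest form: annealed importance sampling along a sequence of bridging distributions, on a toy Gaussian example where the true ratio is known to be 2. The paper's contribution (reusing simulations within an SMC sampler) is not reproduced here.

```python
# Sketch: estimate a ratio of partition functions Z1/Z0 by accumulating
# importance weights along a path of bridging distributions (annealed
# importance sampling), on a toy Gaussian example with a known answer.
import numpy as np

rng = np.random.default_rng(4)

def log_gamma0(x):  # unnormalised start: exp(-x^2/2), Z0 = sqrt(2*pi)
    return -0.5 * x**2

def log_gamma1(x):  # unnormalised target: exp(-x^2/8), Z1 = sqrt(8*pi)
    return -x**2 / 8.0

betas = np.linspace(0.0, 1.0, 50)  # the path from gamma0 to gamma1
n = 2000
x = rng.standard_normal(n)         # exact draws from the start distribution
log_w = np.zeros(n)

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Weight update: log gamma_b(x) - log gamma_{b_prev}(x).
    log_w += (b - b_prev) * (log_gamma1(x) - log_gamma0(x))
    # One Metropolis move leaving the current bridging distribution invariant.
    prop = x + rng.standard_normal(n)
    log_ratio = ((1 - b) * (log_gamma0(prop) - log_gamma0(x))
                 + b * (log_gamma1(prop) - log_gamma1(x)))
    accept = np.log(rng.uniform(size=n)) < log_ratio
    x = np.where(accept, prop, x)

print(np.exp(log_w).mean())  # estimates Z1/Z0 = 2
```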
 
Description Collaboration with Centre for Environment, Fisheries and Aquaculture Science on improved inference of fish stocks 
Organisation Centre For Environment, Fisheries And Aquaculture Science
Country United Kingdom 
Sector Public 
PI Contribution I have brought expertise in approximate Bayesian computation and particle MCMC, and new methods developed under UKRI funding, to this collaboration.
Collaborator Contribution The partners in this collaboration have brought expertise in modelling fish stocks, in interpreting results, and in providing a route to impact for the research (through input to policy development on fishery management).
Impact This collaboration is multi-disciplinary, involving me (Mathematics and Computational Statistics), members of the Ecology Group at the University of Reading, and experts in Ecology and Fishery Management at Cefas. Papers describing the work, on fitting a bioeconomic model of fish stocks to data, are currently in draft. The results of fitting this model will be used by Cefas as input into fishery management policy.
Start Year 2018
 
Title Particle Gibbs for statistical network models 
Description Three of the most important statistical models for networks (stochastic block models, latent space models, and locally dependent exponential random graph models) are based on an underlying cluster structure in the network. Inferring this clustering is a key challenge in fitting these models. The software implements an improved approach to clustering using particle Gibbs, which enables faster inference for these models. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact None yet. 
URL https://github.com/MarkBell1310/PGSMs
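For illustration, the sketch below implements the conditional SMC update at the core of particle Gibbs, on a toy Gaussian random-walk state-space model; the repository applies the same mechanism to cluster allocations in network models, which this example does not attempt to reproduce.

```python
# Sketch of particle Gibbs via conditional SMC on a toy state-space model:
# x_t = x_{t-1} + N(0,1), y_t = x_t + N(0,1). One particle is pinned to a
# reference trajectory, which makes the update a valid Gibbs kernel.
import numpy as np

rng = np.random.default_rng(5)
T, K = 25, 100  # time steps, particles

x_true = np.cumsum(rng.standard_normal(T))
y = x_true + rng.standard_normal(T)  # synthetic observations

def conditional_smc(y, x_ref):
    X = np.zeros((T, K))
    anc = np.zeros((T, K), dtype=int)
    X[0] = rng.standard_normal(K)  # draws from the prior at t = 0
    X[0, -1] = x_ref[0]            # pin the reference trajectory
    log_w = -0.5 * (y[0] - X[0])**2
    for t in range(1, T):
        w = np.exp(log_w - log_w.max())
        anc[t] = rng.choice(K, size=K, p=w / w.sum())  # resample ancestors
        X[t] = X[t - 1, anc[t]] + rng.standard_normal(K)
        anc[t, -1] = K - 1         # the reference keeps its own history
        X[t, -1] = x_ref[t]
        log_w = -0.5 * (y[t] - X[t])**2
    # Sample one trajectory and trace it back through the ancestry.
    w = np.exp(log_w - log_w.max())
    k = rng.choice(K, p=w / w.sum())
    traj = np.zeros(T)
    for t in range(T - 1, -1, -1):
        traj[t] = X[t, k]
        if t > 0:
            k = anc[t, k]
    return traj

x_ref = np.zeros(T)  # arbitrary initial reference trajectory
for _ in range(10):  # a few particle Gibbs sweeps
    x_ref = conditional_smc(y, x_ref)
```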