Tractable inference for statistical network models with local dependence

Lead Research Organisation: University of Reading
Department Name: Mathematics and Statistics

Abstract

This project concerns the analysis of data from systems consisting of large numbers of linked objects. Such data is commonly found in a wide range of applications in science. The links may arise through the existence of networks, which contain explicit connections between objects (such as links between websites), or simply through weaker associations (e.g. we expect that in most images, most pixels will be of a similar colour to nearby pixels).

Examples of such data arise in many different fields. Applications in physics (magnetism), biology (genetics, protein design, neural models), computer science (artificial intelligence, computer vision), social science (social networks), economics (the network effect) and engineering (target tracking) all share common underlying characteristics. In many cases, the application can be reduced to understanding the process by which a network is generated and to using this knowledge to predict future events. For example, one can envisage learning, from previous examples of contact between known terrorists, the patterns of communication that mark them out as such. Knowledge of these patterns may then be used to identify terrorist cells from communication networks. Making such an inference in the presence of uncertainty (which is always the case when analysing real data) is a problem that lies in the realm of statistics. In order to obtain accurate results, the use of a technique known as Bayesian inference is usually necessary. Bayesian inference is usually implemented by means of an iterative algorithm known as a Monte Carlo method.
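As a toy illustration of the kind of Monte Carlo method referred to above (and not code from the project), the following sketch implements a minimal random-walk Metropolis-Hastings sampler; the target distribution and all numerical values are invented for the example.

```python
# Minimal random-walk Metropolis-Hastings sampler: an illustrative sketch
# only. It draws samples from a posterior known only up to a constant.
import numpy as np

def log_target(theta):
    # Hypothetical unnormalised log-posterior: a standard normal prior
    # combined with a Gaussian likelihood centred at 1.0.
    return -0.5 * theta**2 - 0.5 * (theta - 1.0)**2

def metropolis_hastings(n_iters=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, samples = 0.0, np.empty(n_iters)
    for i in range(n_iters):
        proposal = theta + step * rng.standard_normal()
        # Accept the move with probability min(1, target ratio).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples[i] = theta
    return samples

print(metropolis_hastings().mean())  # approx. 0.5 for this toy target
```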

However, using Monte Carlo methods to perform inference in models of large networks can be extremely computationally expensive: the algorithm can run for days, weeks or months without producing a useful result. This means that in practice these methods cannot be used, so that large networks cannot be analysed using "model-based" statistical techniques. This project is concerned with the development of algorithms that are much more computationally efficient, with the aim of reducing computational time to more manageable levels. These algorithms build on some of the most recent developments in statistics and machine learning.

Planned Impact

This project will develop software (based on statistical methodology) that will be used by other researchers. Therefore its pathway to impact is indirect, in that it is reliant on working with researchers whose application areas can lead to economic and societal impact. The academic impact of the work is a crucial component in the path to other impacts.

Despite this indirect pathway, the potential for impact from this research is significant. Algorithms such as the ones developed in this project, when implemented in freely available software packages, represent the gateway to making sense of network data.

The algorithms that will be developed by the project will help describe how a network is divided into communities, and the relative effect of different local interactions within and between these communities in giving rise to the global structure of the network. The analysis of large networks is central to, for example: studying the organisation of the human brain (a topic central to understanding conditions such as Alzheimer's disease, autism or attention deficit hyperactivity disorder); understanding the behaviour of consumers; and in understanding how people form their opinions. Analysing "Big Data" such as this is seen as a major opportunity in the UK, where it has been recognised to be of national economic and scientific importance (it is one of the UK Government's "8 Great Technologies" of the future that will help drive economic growth). This project lies in the growing EPSRC Research Area of "Statistics and Applied Probability", but obvious potential impact outside of this field lies in the following themes:

- "Digital Economy", in aiding understanding of how complex interactions influence society;
- "Global Uncertainties", in security applications such as locating a terrorist cell in a communication network;
- "Healthcare Technologies", in helping understand and diagnose diseases such as dementia.

The statistical work in the project also has the potential to lead to impacts in other areas. Again, these are indirect, but are potentially significant. The methodology underlying the algorithms is quite general, and this type of approach may also find use in applications as diverse as weather and climate analysis, genomics and computer vision. The understanding of how the statistical methods used in this project perform is useful to statisticians who work in different areas of application.

The project is ideal for a young researcher to work on, since they will gain skills in two growth areas of national importance: firstly, in the analysis of network data; secondly, in the fields of statistics, machine learning and data science, the subjects at the heart of the "Big Data Revolution".

Publications


- Everitt R (2020) Delayed Acceptance ABC-SMC, Journal of Computational and Graphical Statistics
- Everitt R. G. (2017) Marginal sequential Monte Carlo for doubly intractable models, arXiv e-prints
- Everitt R. G. (2017) Bootstrapped synthetic likelihood, arXiv e-prints
- Everitt R. G. (2017) Delayed acceptance ABC-SMC, arXiv e-prints

 
Description This project focussed on statistical models for network data. The objectives of the award were to develop improved inference methods for state-of-the-art models applied to real data.

There were two main strands of work described in the proposal:

- making exact inference techniques computationally feasible;
- developing faster approximate methods.

Both objectives were met.


The models used in the project have two characteristics: they model a network as being made up of a number of different clusters (or communities); and they model what the network looks like within the clusters.
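As an illustration of this two-level structure, the following sketch simulates a small network from a stochastic block model, one of the cluster-based models named later in this report; the number of nodes, the cluster assignments and the edge probabilities are arbitrary values chosen for the example.

```python
# Illustrative sketch: simulate a network from a stochastic block model.
# Nodes are first assigned to clusters; edges then appear with a
# probability that depends only on the clusters of their endpoints.
import numpy as np

rng = np.random.default_rng(1)
n_nodes, n_clusters = 30, 3

z = rng.integers(n_clusters, size=n_nodes)  # cluster label of each node

# Edge probabilities: denser within clusters than between them.
P = np.full((n_clusters, n_clusters), 0.05)
np.fill_diagonal(P, 0.4)

# Sample a symmetric adjacency matrix edge by edge.
A = np.zeros((n_nodes, n_nodes), dtype=int)
for i in range(n_nodes):
    for j in range(i + 1, n_nodes):
        A[i, j] = A[j, i] = int(rng.uniform() < P[z[i], z[j]])
```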

The work was performed in three stages: a first stage in which new techniques were developed for inferring how the network behaves within clusters; a second stage in which the focus was inferring the clustering; and a final stage that applied the techniques to real data.


First stage: within each cluster

Three new methods for within-cluster inference were developed. Each method had a reduced computational cost compared to existing techniques - in some cases they were an order of magnitude faster.

Together, these three new methods are one of the main achievements of the award.


Second stage: finding the clustering

This project found that the techniques outlined in the proposal were applicable to a wider range of network models than first thought. Two new approaches to inferring the clustering (also known as "community detection") were developed: one is fast and approximate, the other is slower, but more accurate.

These two new approaches to inference (used on three popular statistical network models) together form the second major achievement of the award.


Third stage: code and applications

Work on the grant has focussed on two applications: to neuroscience, and to genetics. In both situations the data is structured as a network, and the contribution of the work on the award has been to infer, and account for, the clustering structure of the network data.

The algorithms developed under the project are publicly available on GitHub.
Exploitation Route The improved techniques developed under the grant have formed the basis of a collaboration with the Centre for Environment, Fisheries and Aquaculture Science (Cefas), on inferring fish stocks to aid policy making. This work is ongoing, and has led to two successful applications for funding from NERC under the Landscape Decisions Programme (total funding £340,000).

In addition, the algorithms and expertise developed under the award are likely to also find application in a collaboration with a consumer analytics company. This work is being funded by Innovate UK (with a contribution from EPSRC).
Sectors Aerospace, Defence and Marine; Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Energy; Environment; Healthcare; Security and Diplomacy; Transport

 
Description The improved techniques developed under the grant have formed the basis of a collaboration with the Centre for Environment, Fisheries and Aquaculture Science (Cefas), on inferring fish stocks to aid policy making on fishery management. This work has so far involved examining models for sea bass, which has suffered a dramatic decline in recent years. The work is ongoing, and has led to two successful applications for funding from NERC under the Landscape Decisions Programme (total funding £340,000). In addition, the algorithms and expertise developed under the award are likely to also find application in a collaboration with a consumer analytics company. This work is being funded by Innovate UK (with a contribution from EPSRC).
First Year Of Impact 2018
Sector Environment
Impact Types Policy & public services

 
Description Knowledge Transfer Partnership Emerging & Enabling
Amount £116,755 (GBP)
Funding ID Project number 511432 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 04/2019 
End 03/2021
 
Description Quantifying uncertainty in the predictions of complex process-based models
Amount £42,873 (GBP)
Funding ID NE/T004010/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 10/2019 
End 09/2020
 
Title Bootstrapped synthetic likelihood 
Description Synthetic likelihood (SL) is a useful alternative to approximate Bayesian computation for parameter inference in models whose likelihood is not available to evaluate pointwise. Suppose that the data being modelled consists of N data points. The computational cost of SL can be high, since at each parameter it requires running the simulator M times, giving a cost of O(MN) for each parameter visited. Bootstrapped SL uses a bootstrap of a single simulation to approximate the standard SL, with a cost of O(N) for each parameter visited. A bag of little bootstraps can be used to reduce the cost further, in which case only a sub-simulation of size n < N is required.
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None yet. 
URL https://arxiv.org/abs/1711.05825
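As a hedged sketch of the idea (with a toy simulator and summary statistics invented for the example, not the paper's implementation), the code below calls the simulator once per parameter and uses bootstrap resamples of that single simulation in place of the M independent simulator runs of standard SL.

```python
# Sketch of bootstrapped synthetic likelihood (BSL) on a toy model.
# Standard SL would run the simulator M times per parameter; here the
# simulator runs once, and M bootstrap resamples of that simulation are
# used to estimate the mean and covariance of the summary statistics.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def simulate(theta, N):
    # Hypothetical simulator: N iid draws from N(theta, 1).
    return theta + rng.standard_normal(N)

def summaries(y):
    return np.array([y.mean(), y.var()])

def bootstrapped_log_sl(theta, s_obs, N=500, M=200):
    y = simulate(theta, N)  # the single simulator call for this parameter
    boot = np.array([summaries(rng.choice(y, size=N, replace=True))
                     for _ in range(M)])
    mu, cov = boot.mean(axis=0), np.cov(boot, rowvar=False)
    return multivariate_normal.logpdf(s_obs, mean=mu, cov=cov)

s_obs = summaries(simulate(1.0, 500))  # stand-in for observed data
print(bootstrapped_log_sl(1.0, s_obs))
```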
 
Title Delayed acceptance ABC-SMC 
Description Approximate Bayesian computation (ABC) is the method of choice for inferring the parameters of statistical models that are defined by a simulator. The approach originates in genetics, but has also found application in a number of different fields, including the study of infectious diseases and the analysis of network data. ABC can be infeasible when the simulator is computationally expensive, since often the simulator needs to be run a large number of times. The method we have developed is useful in cases where an alternative computationally cheap simulator is available. In these cases, our method uses simulations from the cheap simulator to automatically rule out "bad" regions of parameter space, sometimes reducing the computational cost of ABC by orders of magnitude. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None so far. 
URL https://arxiv.org/abs/1708.02230
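The sketch below illustrates the screening idea in a simple ABC rejection sampler; the cheap and expensive simulators, prior and tolerances are toy stand-ins, and the full ABC-SMC algorithm of the paper is not reproduced. Note that the cheap tolerance must be loose enough that good parameters are rarely discarded at the first stage.

```python
# Sketch of delayed-acceptance screening in ABC rejection sampling.
# A cheap simulator filters out clearly poor parameters, so the
# expensive simulator is only run for promising proposals.
import numpy as np

rng = np.random.default_rng(3)
y_obs = 1.0  # stand-in observed summary statistic

def cheap_sim(theta):      # fast, rough simulator
    return theta + 0.5 * rng.standard_normal()

def expensive_sim(theta):  # slow, accurate simulator (imagine it is costly)
    return theta + 0.1 * rng.standard_normal()

def da_abc(n_samples, eps_cheap=1.0, eps=0.2):
    accepted, expensive_calls = [], 0
    while len(accepted) < n_samples:
        theta = rng.uniform(-5.0, 5.0)  # prior draw
        # Stage 1: cheap screen rules out "bad" regions of parameter space.
        if abs(cheap_sim(theta) - y_obs) > eps_cheap:
            continue
        # Stage 2: expensive simulator only for surviving proposals.
        expensive_calls += 1
        if abs(expensive_sim(theta) - y_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted), expensive_calls

samples, calls = da_abc(100)
print(samples.mean(), calls)  # far fewer expensive calls than prior draws
```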
 
Title Estimates of partition functions using a path estimator within SMC 
Description Bayesian inference for doubly intractable distributions (such as those found when modelling networks, among other applications) is difficult due to the presence of an intractable partition function in the likelihood. Sequential Monte Carlo samplers are a standard approach to Bayesian inference, but they are difficult to use on doubly intractable models. We propose to use an existing type of sampler, but to significantly improve its efficiency by reusing, in later stages of the algorithm, simulations generated in earlier stages. We describe a heuristic approach to automatically deciding which simulations to reuse. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None yet. 
URL https://arxiv.org/abs/1710.04382
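As a hedged sketch, the code below computes a path estimator of a ratio of partition functions in its simplest form: annealed importance sampling along a sequence of bridging distributions, on a toy Gaussian example where the true ratio is known to be 2. The paper's contribution (reusing simulations within an SMC sampler) is not reproduced here.

```python
# Sketch: estimate a ratio of partition functions Z1/Z0 by accumulating
# importance weights along a path of bridging distributions (annealed
# importance sampling), on a toy Gaussian example with a known answer.
import numpy as np

rng = np.random.default_rng(4)

def log_gamma0(x):  # unnormalised start: exp(-x^2/2), Z0 = sqrt(2*pi)
    return -0.5 * x**2

def log_gamma1(x):  # unnormalised target: exp(-x^2/8), Z1 = sqrt(8*pi)
    return -x**2 / 8.0

betas = np.linspace(0.0, 1.0, 50)  # the path from gamma0 to gamma1
n = 2000
x = rng.standard_normal(n)         # exact draws from the start distribution
log_w = np.zeros(n)

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Weight update: log gamma_b(x) - log gamma_{b_prev}(x).
    log_w += (b - b_prev) * (log_gamma1(x) - log_gamma0(x))
    # One Metropolis move leaving the current bridging distribution invariant.
    prop = x + rng.standard_normal(n)
    log_ratio = ((1 - b) * (log_gamma0(prop) - log_gamma0(x))
                 + b * (log_gamma1(prop) - log_gamma1(x)))
    accept = np.log(rng.uniform(size=n)) < log_ratio
    x = np.where(accept, prop, x)

print(np.exp(log_w).mean())  # estimates Z1/Z0 = 2
```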
 
Description Collaboration with Centre for Environment, Fisheries and Aquaculture Science on improved inference of fish stocks 
Organisation Centre For Environment, Fisheries And Aquaculture Science
Country United Kingdom 
Sector Public 
PI Contribution I have brought expertise in approximate Bayesian computation and particle MCMC, and new methods developed under UKRI funding, to this collaboration.
Collaborator Contribution The partners in this collaboration have brought expertise in modelling fish stocks, in interpreting results, and in providing a route to impact for the research (through input to policy development on fishery management).
Impact This collaboration is multi-disciplinary, involving me (Mathematics and Computational Statistics), members of the Ecology Group at the University of Reading, and experts in Ecology and Fishery Management at Cefas. Papers describing the work, on fitting a bioeconomic model of fish stocks to data, are currently in draft. The results of fitting this model will be used by Cefas as input into fishery management policy.
Start Year 2018
 
Title Particle Gibbs for statistical network models 
Description Three of the most important statistical models for networks (stochastic block models, latent space models, and locally dependent exponential random graph models) are based on an underlying cluster structure in the network. Inferring this clustering is a key challenge in fitting these models. The software implements an improved approach to clustering using particle Gibbs, which enables faster inference for these models. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact None yet. 
URL https://github.com/MarkBell1310/PGSMs
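For illustration, the sketch below implements the conditional SMC update at the core of particle Gibbs, on a toy Gaussian random-walk state-space model; the repository applies the same mechanism to cluster allocations in network models, which this example does not attempt to reproduce.

```python
# Sketch of particle Gibbs via conditional SMC on a toy state-space model:
# x_t = x_{t-1} + N(0,1), y_t = x_t + N(0,1). One particle is pinned to a
# reference trajectory, which makes the update a valid Gibbs kernel.
import numpy as np

rng = np.random.default_rng(5)
T, K = 25, 100  # time steps, particles

x_true = np.cumsum(rng.standard_normal(T))
y = x_true + rng.standard_normal(T)  # synthetic observations

def conditional_smc(y, x_ref):
    X = np.zeros((T, K))
    anc = np.zeros((T, K), dtype=int)
    X[0] = rng.standard_normal(K)  # draws from the prior at t = 0
    X[0, -1] = x_ref[0]            # pin the reference trajectory
    log_w = -0.5 * (y[0] - X[0])**2
    for t in range(1, T):
        w = np.exp(log_w - log_w.max())
        anc[t] = rng.choice(K, size=K, p=w / w.sum())  # resample ancestors
        X[t] = X[t - 1, anc[t]] + rng.standard_normal(K)
        anc[t, -1] = K - 1         # the reference keeps its own history
        X[t, -1] = x_ref[t]
        log_w = -0.5 * (y[t] - X[t])**2
    # Sample one trajectory and trace it back through the ancestry.
    w = np.exp(log_w - log_w.max())
    k = rng.choice(K, p=w / w.sum())
    traj = np.zeros(T)
    for t in range(T - 1, -1, -1):
        traj[t] = X[t, k]
        if t > 0:
            k = anc[t, k]
    return traj

x_ref = np.zeros(T)  # arbitrary initial reference trajectory
for _ in range(10):  # a few particle Gibbs sweeps
    x_ref = conditional_smc(y, x_ref)
```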