CoSInES (COmputational Statistical INference for Engineering and Security)
Lead Research Organisation:
University of Warwick
Department Name: Statistics
Abstract
There are tremendous demands for advanced statistical methodology to make scientific sense of the deluge of data emerging from the data revolution of the 21st Century. Huge challenges in modelling, computation, and statistical algorithms have been created by diverse and important questions in virtually every area of human activity. CoSInES will create a step change in the use of principled statistical methodology, motivated by and feeding into these challenges.
Much of our research will develop and study generic methods applicable across a wide range of applications. We will study high-dimensional statistical algorithms whose performance scales well to high dimensions and to big data sets. We will develop statistical theory to understand new complex models stimulated by applications. We will produce methodology tailored to specific computational hardware. We will study the statistical and algorithmic effects of mismatch between data and models. We shall also build methodology for statistical inference where privacy constraints mean that the data cannot be directly accessed.
CoSInES will also focus on two major application domains, which will form stimulating and challenging motivation for our research: Data-centric Engineering, and Defence and Security. To maximise the impact and speed of translation of our research in these areas, we will partner closely with the Alan Turing Institute, which is running large programmes in these areas funded respectively by the Lloyd's Register Foundation and GCHQ.
Data is providing a disruptive transformation that is revolutionising the engineering professions, with previously unimagined ways of designing, manufacturing, operating and maintaining engineering assets all the way through to their decommissioning. The Data-centric Engineering programme (DCE) at the Alan Turing Institute is leading the design and operation of the world's very first completely 3-D printed pedestrian bridge to be opened and operated in a major international city. Fibre-optic sensors embedded in the structure will provide continuous streams of data measuring the main structural properties of the bridge. DCE is developing unique opportunities to monitor and control the bridge via "digital twins", and this presents enormous challenges to existing applied mathematical and statistical modelling of these complex structures, where even the bulk material properties are unknown and certainly stochastic in their values. A new generation of numerical inferential methods is needed to support this progress.
Within the Defence and Security domain, there are many statistical challenges emerging from the need to process and communicate big and complex data sets, for example within the area of cyber-security. The virtual world has emerged as a dominant global marketplace within which the majority of organisations operate. This has motivated nefarious actors, from "bedroom hackers" to state-sponsored terrorists, to operate in this environment to further their economic or political ambitions. To counter this threat, it is necessary to produce a complete statistical representation of the environment, in the presence of missing data, significant temporal change, and an adversary willing to manipulate social and virtual systems in order to achieve their goals.
As a second example, to counter the threat of global terrorism, it is necessary for law-enforcement agencies within the UK to share data, whilst rigorously applying data protection laws to maintain individuals' privacy. It is therefore necessary to have mathematical guarantees over such data sharing arrangements, and to formulate statistical methodologies for the "penetration testing" of anonymised data.
Planned Impact
Academic impact of the project will be achieved by standard mechanisms: publication, software development, conference presentations, and highlighting activities on the project website. Academic beneficiaries of this reach will include statisticians working on theory and methodology as well as a wide range of application areas. Academics outside statistics will also benefit from the methodology and software created within the project.
Engineers will benefit from the research in Objective 7, which will create a principled statistical framework for Data-centric Engineering. In turn, the government, commercial companies and the public will benefit from improved reliability of engineering structures, and from the economies and improved productivity that follow from the improved scientific understanding accessed through our research. Research in this area will be rapidly disseminated to the Engineering community through the Turing Data-centric Engineering programme, through translational activities organised by CoSInES (such as our Impact and Innovation Showcase days), and through the bespoke software.
Through the research in Objective 8, government, commercial companies and the public will benefit from improved cyber-security and the extra security afforded through improved data-sharing efficiency of law-enforcement agencies. Through the Alan Turing Institute's Defence & Security Programme, the output of this research will directly impact the operational sectors of the UK's defence and security function, through the deployment of bespoke software, and the furthering of the statistical knowledge of the UK Government's intelligence analysts. We will also organise Impact and Innovation Showcase days focused in this area.
Publications
Virtanen S
(2021)
Spatio-Temporal Mixed Membership Models for Criminal Activity
in Journal of the Royal Statistical Society Series A: Statistics in Society
Crucinio F
(2024)
Solving a class of Fredholm integral equations of the first kind via Wasserstein gradient flows
in Stochastic Processes and their Applications
Hubbert S
(2022)
Sobolev Spaces, Kernels and Discrepancies over Hyperspheres
Roberts G
(2020)
Skew Brownian Motion and Complexity of the ALPS Algorithm
Roberts G
(2022)
Skew Brownian motion and complexity of the ALPS algorithm
in Journal of Applied Probability
Mider Marcin
(2019)
Simulating bridges using confluent diffusions
in arXiv e-prints
Jenkins P
(2019)
Simulating bridges using confluent diffusions
Brown S
(2021)
Simple conditions for convergence of sequential Monte Carlo genealogies with applications
in Electronic Journal of Probability
South L
(2022)
Semi-exact control functionals from Sard's method
in Biometrika
South L
(2020)
Semi-Exact Control Functionals From Sard's Method
Bierkens J
(2023)
Scaling of Piecewise Deterministic Monte Carlo for Anisotropic Targets
Niederer SA
(2021)
Scaling digital twins from the artisanal to the industrial
in Nature Computational Science
Papaspiliopoulos O
(2019)
Scalable inference for crossed random effects models
in Biometrika
Zanella G
(2018)
Scalable Importance Tempering and Bayesian Variable Selection
McKimm H
(2022)
Sampling using Adaptive Regenerative Processes
McKimm H
(2025)
Sampling using adaptive regenerative processes
in Bernoulli
Chevallier A
(2020)
Reversible Jump PDMP Samplers for Variable Selection
Chevallier A
(2022)
Reversible Jump PDMP Samplers for Variable Selection
in Journal of the American Statistical Association
Cornish R.
(2020)
Relaxing bijectivity constraints with continuously indexed normalising flows
in 37th International Conference on Machine Learning, ICML 2020
Wang A
(2021)
Regeneration-enriched Markov processes with application to Monte Carlo
in The Annals of Applied Probability
Robert C
(2021)
Rao-Blackwellization in the MCMC era
Robert C
(2021)
Rao-Blackwellisation in the Markov Chain Monte Carlo Era
in International Statistical Review
Shidani A
(2022)
Ranking In Generalized Linear Bandits
Deligiannidis G
(2021)
Randomized Hamiltonian Monte Carlo as scaling limit of the bouncy particle sampler and dimension-free convergence rates
in The Annals of Applied Probability
Pollock M
(2020)
Quasi-Stationary Monte Carlo and The Scale Algorithm
in Journal of the Royal Statistical Society Series B: Statistical Methodology
De Bortoli V.
(2020)
Quantitative propagation of Chaos for SGD in wide neural networks
in Advances in Neural Information Processing Systems
Phillips J
(2023)
Quantifying uncertainty in probabilistic volcanic ash hazard forecasts, with an application to weather pattern based wind field sampling
in Bulletin of Volcanology
Alenlöv J.
(2021)
Pseudo-marginal Hamiltonian Monte Carlo
in Journal of Machine Learning Research
Crucinio F
(2023)
Properties of marginal sequential Monte Carlo methods
in Statistics & Probability Letters
Crucinio F
(2023)
Properties of Marginal Sequential Monte Carlo Methods
Kuntz J
(2021)
Product-form estimators: exploiting independence to scale up Monte Carlo
in Statistics and Computing
Akyildiz Ö.D.
(2021)
Probabilistic Sequential Matrix Factorization
in Proceedings of Machine Learning Research
Bartels S
(2019)
Probabilistic linear solvers: a unifying view
in Statistics and Computing
Bartels S
(2018)
Probabilistic Linear Solvers: A Unifying View
Monterrubio-Gómez K
(2020)
Posterior inference for sparse hierarchical non-stationary models
in Computational Statistics & Data Analysis
Fearnhead P
(2018)
Piecewise Deterministic Markov Processes for Continuous-Time Monte Carlo
in Statistical Science
| Description | The focus of CoSInES was on the development of new robust and scalable methodologies for computational statistics, mostly within a Bayesian (or post-Bayesian) context. The project focused on theory and methodology as well as many application areas, including data-centric engineering and security. Some specific highlights include: 1. A strong strand of the grant concerned the synergies between diffusions and algorithms. Examples include the development of methodology and underpinning theory for new Hamiltonian MCMC methods based on kinetic Langevin diffusions, and of methodology and supporting theory for proximal MALA algorithms (which can be used when target densities lack differentiability). Moreover, we developed new diffusion-inspired stochastic gradient algorithms, and diffusion algorithms for exploring stochastic finite element posterior distributions. We devised Fusion algorithms based on coalescing diffusions for combining remotely assembled sub-posterior distributions, with applications to inference under privacy constraints. From a theoretical point of view, diffusion limits are used to explore the high-dimensional complexity properties of many algorithms (for instance PDMPs and simulated tempering algorithms for multi-modal distributions). 2. We also made major breakthroughs in the computational efficiency of generative modelling using diffusions, in a paper which received an Outstanding Paper Award at NeurIPS 2022. This work has had a substantial and rapid impact, for example in being used to discover new synthetic proteins. This work has been reported in the New York Times https://www.nytimes.com/2023/01/09/science/artificial-intelligence-proteins.html?searchResultPosition=1 . 3. A particular success of the grant has been in the area of non-reversible dynamics for MCMC. Some of the PIs were prominent in developing the first practically useful non-reversible algorithms, and CoSInES has continued this work. Early in the grant, Fearnhead and Roberts developed super-efficiency, where entire MCMC runs can be carried out with computational cost less than that required for one complete likelihood evaluation. Three separate theory papers pointed to extremely good high-dimensional scaling properties of PDMPs, and complementary work on the automatic Zig-Zag has overcome major implementational hurdles for these methods. Important methodological advances have been made in the development of the Boomerang Sampler and of PDMP methods for trans-dimensional spaces and for spaces without smooth densities. 4. We published a landmark paper in JRSSB on a new sampling algorithm for multi-modal distributions using non-reversible MCMC techniques. The methodology developed in this paper has been used to generate high-resolution images of the M87 black hole (https://iopscience.iop.org/article/10.3847/2041-8213/abe71d/pdf) and the first picture of the Milky Way monster (https://iopscience.iop.org/article/10.3847/2041-8213/ac6429). 5. A further focus has been on Bayesian inverse problems and their applications to inference for partial and ordinary differential equations. Our work on solving Fredholm equations of the first kind requires a particle approximation of the solution of a McKean-Vlasov stochastic differential equation associated with the Wasserstein gradient flow of a variational formulation of the problem. We also developed generic cutting-edge divide-and-conquer Sequential Monte Carlo methods which address the notoriously difficult problem of filtering in high dimensions. We also developed powerful deep learning techniques for quantifying uncertainty in differential equation models using probabilistic numerics. This work has been successfully applied to engineering problems using digital twins. 6. In the theoretical study of MCMC, we have made significant breakthroughs in stability theory for Markov chains with application to MCMC (both reversible and non-reversible). This work has extended classical probability results on Poincaré and isoperimetric inequalities in surprising ways, which can be applied effectively to provide new fundamental results for spectral gaps, mixing times and polynomial ergodicity for many of the currently popular MCMC methods. |
| Exploitation Route | There are existing applications of our work in many areas and potential for many more. In the near future we expect our work to have most impact in applied Bayesian statistics, data-centric engineering and possibly astronomy, where our work has already had a surprising impact. |
| Sectors | Aerospace, Defence and Marine; Agriculture, Food and Drink; Other |
| URL | https://www.cosines.org |
| Description | Impacts of the work of CoSInES have been broad. Since most of the work is theoretical in nature, most of the impact is currently academic. However, we highlight some areas of non-academic interest here. The paper V. De Bortoli, E. Mathieu, M. Hutchinson, J. Thornton, Y.W. Teh & A. Doucet, "Riemannian Score-Based Generative Modeling", NeurIPS 2022, received an Outstanding Paper Award (one of 13 from 8000 submissions). The paper makes a major breakthrough in the computational efficiency of generative modelling techniques in machine learning. This work has had a substantial and rapid impact, for example in being used to discover new synthetic proteins. This work has been reported in the New York Times https://www.nytimes.com/2023/01/09/science/artificial-intelligence-proteins.html?searchResultPosition=1 . Two major successes of the grant have been in the area of non-reversible dynamics for MCMC, and in MCMC for multi-modal distributions. Many contributions have emerged from these areas. Saifuddin Syed and Arnaud Doucet published "Non-reversible Parallel Tempering: A Scalable Highly Parallel MCMC Scheme", Journal of the Royal Statistical Society Series B, vol. 84, no. 2, pp. 321-350, 2022, a landmark paper on a new sampling algorithm for multi-modal distributions using non-reversible MCMC techniques. The methodology developed in this paper has been used to generate high-resolution images of the M87 black hole (https://iopscience.iop.org/article/10.3847/2041-8213/abe71d/pdf) and the first picture of the Milky Way monster (https://iopscience.iop.org/article/10.3847/2041-8213/ac6429). |
| First Year Of Impact | 2022 |
| Sector | Manufacturing, including Industrial Biotechnology |
| Impact Types | Societal |
| Description | Intractable likelihood: New challenges from modern applications (i-like) |
| Amount | £2,369,503 (GBP) |
| Funding ID | EP/K014463/1 |
| Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
| Sector | Public |
| Country | United Kingdom |
| Start | 01/2013 |
| End | 12/2017 |
| Title | Concave-Convex PDMP-based sampling |
| Description | Recently non-reversible samplers based on simulating piecewise deterministic Markov processes (PDMPs) have shown potential for efficient sampling in Bayesian inference problems. However, there remains a lack of guidance on how to best implement these algorithms. If implemented poorly, the computational costs of simulating event times can out-weigh the statistical efficiency of the non-reversible dynamics. Drawing on the adaptive rejection literature, we propose the concave-convex adaptive thinning approach for simulating a piecewise deterministic Markov process, which we call CC-PDMP. This approach provides a general guide for constructing bounds that may be used to facilitate PDMP-based sampling. A key advantage of this method is its additive structure - adding concave-convex decompositions yields a concave-convex decomposition. This makes the construction of bounds modular, as given a concave-convex decomposition for a class of likelihoods and a family of priors, they can be combined to construct bounds for the posterior. We show that constructing our bounds is simple and leads to computationally efficient thinning. Our approach is well suited to local PDMP simulation where conditional independence of the target can be exploited for potentially huge computational gains. We provide an R package and compare with existing approaches to simulating events in the PDMP literature. Supplementary material for this article is available online. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| URL | https://tandf.figshare.com/articles/dataset/Concave-Convex_PDMP-based_sampling/22773771/1 |
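The thinning idea described in the Concave-Convex PDMP entry above can be illustrated with a minimal Python sketch. It is not the ccpdmp package's implementation: the function names (`linear_bound`, `next_event_linear`, `first_event`) and the example rate are hypothetical, and the sketch assumes a non-negative event rate, written as a convex part plus a concave part, on a fixed window [0, tau]. The convex part is bounded above by its chord and the concave part by its tangent at 0, so the sum of the two bounds is a linear rate whose event times can be drawn by inversion and then thinned.

```python
import math
import random


def linear_bound(convex, concave, dconcave, tau):
    """Return (a, b) such that a + b*t >= convex(t) + concave(t) on [0, tau]."""
    chord_slope = (convex(tau) - convex(0.0)) / tau  # a chord lies above a convex function
    a = convex(0.0) + concave(0.0)
    b = chord_slope + dconcave(0.0)  # a tangent lies above a concave function
    return a, b


def next_event_linear(a, b, horizon, rng):
    """First event in [0, horizon) of a Poisson process with rate a + b*t, or None."""
    e = rng.expovariate(1.0)
    if e >= a * horizon + 0.5 * b * horizon ** 2:
        return None  # integrated rate over the window never reaches e
    if e == 0.0:
        return 0.0
    # Solve a*t + b*t^2/2 = e in a form that is stable for small b.
    root = math.sqrt(max(0.0, a * a + 2.0 * b * e))
    return 2.0 * e / (a + root)


def first_event(convex, concave, dconcave, tau, rng):
    """Thinning: first event in [0, tau) of the process with rate convex + concave."""
    a, b = linear_bound(convex, concave, dconcave, tau)
    t = 0.0
    while True:
        step = next_event_linear(a + b * t, b, tau - t, rng)
        if step is None:
            return None
        t += step
        # Accept the proposed time with probability rate(t) / bound(t).
        if rng.random() * (a + b * t) < convex(t) + concave(t):
            return t


# Hypothetical example rate: t^2 (convex) + sqrt(1 + t) (concave).
convex = lambda t: t * t
concave = lambda t: math.sqrt(1.0 + t)
dconcave = lambda t: 0.5 / math.sqrt(1.0 + t)

rng = random.Random(0)
print(linear_bound(convex, concave, dconcave, 2.0))
print(first_event(convex, concave, dconcave, 2.0, rng))
```

Because bounds of this kind add term by term, a bound for a likelihood rate and a bound for a prior rate could be summed in the same way to bound a posterior rate, which is the modularity the entry describes.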
| Title | Global Consensus Monte Carlo |
| Description | To conduct Bayesian inference with large datasets, it is often convenient or necessary to distribute the data across multiple machines. We consider a likelihood function expressed as a product of terms, each associated with a subset of the data. Inspired by global variable consensus optimization, we introduce an instrumental hierarchical model associating auxiliary statistical parameters with each term, which are conditionally independent given the top-level parameters. One of these top-level parameters controls the unconditional strength of association between the auxiliary parameters. This model leads to a distributed MCMC algorithm on an extended state space yielding approximations of posterior expectations. A trade-off between computational tractability and fidelity to the original model can be controlled by changing the association strength in the instrumental model. We further propose the use of an SMC sampler with a sequence of association strengths, allowing both the automatic determination of appropriate strengths and for a bias correction technique to be applied. In contrast to similar distributed Monte Carlo algorithms, this approach requires few distributional assumptions. The performance of the algorithms is illustrated with a number of simulated examples. Supplementary materials for this article are available online. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2020 |
| Provided To Others? | Yes |
| URL | https://tandf.figshare.com/articles/dataset/Global_Consensus_Monte_Carlo/12931061/1 |
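The instrumental model in the Global Consensus Monte Carlo entry above can be sketched on a toy problem. The Python example below is an illustration under stated assumptions, not the authors' code: it assumes a Gaussian mean model with known variance `sigma2`, a N(0, `prior_var`) prior on the top-level parameter `theta`, and a Gaussian link N(z_j; theta, `lam`) between `theta` and each auxiliary parameter z_j, where `lam` plays the role of the association strength. All names are hypothetical. Each block is summarised by its sample mean and size, and a Gibbs sampler alternates between local updates of the z_j and a global update of `theta`.

```python
import math
import random


def gcmc_gibbs(block_means, block_sizes, sigma2, lam, prior_var, n_iter, rng):
    """Gibbs sampler on (theta, z_1..z_b) for a toy Gaussian mean model."""
    b = len(block_means)
    theta = 0.0
    draws = []
    for _ in range(n_iter):
        # Local step: each z_j depends only on its own block and on theta,
        # so these updates could run on separate machines.
        z = []
        for ybar, n in zip(block_means, block_sizes):
            prec = n / sigma2 + 1.0 / lam
            mean = (n * ybar / sigma2 + theta / lam) / prec
            z.append(rng.gauss(mean, math.sqrt(1.0 / prec)))
        # Global step: theta given the auxiliary parameters and its prior.
        prec = 1.0 / prior_var + b / lam
        mean = (sum(z) / lam) / prec
        theta = rng.gauss(mean, math.sqrt(1.0 / prec))
        draws.append(theta)
    return draws


def instrumental_posterior(block_means, block_sizes, sigma2, lam, prior_var):
    """Exact mean and variance of theta under the toy instrumental model."""
    weights = [1.0 / (sigma2 / n + lam) for n in block_sizes]
    prec = 1.0 / prior_var + sum(weights)
    mean = sum(w * y for w, y in zip(weights, block_means)) / prec
    return mean, 1.0 / prec


# Hypothetical block summaries: four blocks of 50 observations each.
means = [0.9, 1.1, 1.3, 0.7]
sizes = [50, 50, 50, 50]
draws = gcmc_gibbs(means, sizes, 1.0, 0.1, 100.0, 20000, random.Random(1))
kept = draws[1000:]  # discard burn-in
print(sum(kept) / len(kept))
print(instrumental_posterior(means, sizes, 1.0, 0.1, 100.0))
```

In this toy model, shrinking `lam` brings the instrumental posterior closer to the posterior of the original model but ties `theta` and the z_j more tightly together, so the Gibbs chain mixes more slowly. This is one way to see the trade-off between tractability and fidelity that the entry mentions.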
| Description | CoSInES with the Turing Data-centric Engineering programme |
| Organisation | Alan Turing Institute |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | A postdoc is funded from the Turing DCE programme to work on the interface between CoSInES methodology and engineering problems. This part of the programme is managed by Mark Girolami. |
| Collaborator Contribution | This work has developed principled and scalable Bayesian methods for inference for differential equation models decomposed as finite element methods. It has also developed sequential Monte Carlo methods for these models. The work is being applied to a variety of problems in materials science, including investigating the properties of 3D-printed metal. |
| Impact | Akyildiz D, Duffin Connor, Sabanis Sotirios, Girolami Mark (2021). Statistical Finite Elements via Langevin Dynamics. arXiv e-prints, arXiv:2110.11131; Duffin Connor, Cripps Edward, Stemler Thomas, Girolami Mark (2021). Low-rank statistical finite elements for scalable model-data synthesis. arXiv e-prints, arXiv:2109.04757; Boustati Ayman, Akyildiz D, Damoulas Theodoros, Johansen Adam M. (2020). Generalized Bayesian Filtering via Sequential Monte Carlo. arXiv e-prints, arXiv:2002.09998 |
| Start Year | 2020 |
| Title | ccpdmp |
| Description | R package that implements adaptive concave-convex sampling for PDMP samplers. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Open Source License? | Yes |
| Impact | None |
| URL | https://github.com/matt-sutton/ccpdmp |
