Learn2Sim: Learning to steer computer simulators

Lead Research Organisation: University of Bristol
Department Name: Mathematics

Abstract

Computer simulators are a popular tool in physical, social and biological sciences. They are constructed using subject area knowledge of how the components of a system interact with each other. Often these interactions are modelled as being random, and involve the simulator code generating random numbers to decide the outcome. A simulator can be used to perform simulations under different scenarios in order to provide insights into the overall system behaviour.

One example application, and the focus of this project, is high energy physics experiments. Here the simulator calculates outcomes of the large number of random particle interactions which take place. Other important contemporary applications of large-scale simulators include: models of planetary climate; epidemiological models capturing both disease spread in a population and genetic changes in the virus or bacteria causing the disease; and financial and economic models made up of individual agents.

To be effective, simulators must be tuned to match observed data. This involves finding plausible values of their parameters: numerical quantities which control the behaviour of the simulator and whose exact values are unknown in advance. Indeed, determining these values may be the principal scientific goal. For instance, in a physics application, the parameters could include the values of unknown particle masses. The problem of learning parameter values is referred to as statistical inference.

Existing approaches to statistical inference for computer simulators mostly involve running simulations under candidate parameter values until close matches are found between outputs and observed data. However, for large-scale simulators the observed data is typically complex, so close matches are rare. Also, each simulator run is computationally expensive. Therefore existing statistical inference methods are limited to small-scale simulators. Another limitation is that they produce approximate results whose trustworthiness is hard to quantify.

The project will use a novel approach of learning how to "steer" the random components of computer simulators so that each run produces a close match to the observed data. The plausibility of particular parameter values can then be calculated using probability theory, based on how close a match they produce and how likely the required pattern of random behaviour would have been without steering. This is a challenging goal, as a simulator's random components are often large, complex and hard to model. To achieve it, the project will combine and extend exciting recent advances in both computational statistics and machine learning.
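
As a rough illustration of the idea (not the project's code), the sketch below steers the random input of a toy one-parameter simulator towards the observed data and then corrects for the steering with an importance weight: how likely the random draw would have been without steering, relative to how likely it was under steering. All names and numerical values are hypothetical.

```python
import math
import random

y_obs = 2.0  # observed data (illustrative)

def simulator(theta, u):
    # Toy simulator: one parameter plus one random input.
    return theta + u

def normal_logpdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)

def estimated_plausibility(theta, n=1000):
    total = 0.0
    for _ in range(n):
        # Steered draw: centred so the simulator output lands near the data.
        u = random.gauss(y_obs - theta, 0.5)
        x = simulator(theta, u)
        # Importance weight: probability of this random behaviour without
        # steering (standard normal) relative to the steered distribution.
        log_w = normal_logpdf(u, 0.0, 1.0) - normal_logpdf(u, y_obs - theta, 0.5)
        # Closeness of the match, here a Gaussian kernel around the data.
        log_match = normal_logpdf(x, y_obs, 0.1)
        total += math.exp(log_w + log_match)
    return total / n  # estimated plausibility (likelihood) of theta

print(estimated_plausibility(1.0), estimated_plausibility(3.0))
```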

To implement this approach in practice we will use a state-of-the-art "probabilistic programming language" developed by a project collaborator, which replaces the random number generation process of the simulator with the steered process. This allows simulator code to be reused without modification, a considerable benefit when there is detailed pre-existing simulator code. One planned output of the project is general-purpose software for our approach to statistical inference, in order to unlock the potential of fitting large-scale computer simulators to data. We also plan to apply the approach to a particular problem in high energy particle physics: exploring properties of the Higgs boson using the "Sherpa" simulator of tau lepton decay on Large Hadron Collider data.
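
The following sketch (hypothetical names, not the actual interface of any probabilistic programming language) illustrates the general pattern: the simulator only requests random numbers from an object passed to it, so steering and weight bookkeeping can be swapped in without touching the simulator code itself.

```python
import random

def simulator(theta, rng):
    # Pre-existing simulator code: it only asks the rng object for random
    # numbers, so the same code runs whether or not the draws are steered.
    u1 = rng.normal(0.0, 1.0)
    u2 = rng.normal(0.0, 1.0)
    return theta + u1 + 0.5 * u2

class SteeredRNG:
    """Draws from a shifted distribution and accumulates the log importance weight."""

    def __init__(self, shift):
        self.shift = shift
        self.log_weight = 0.0

    def normal(self, mean, sd):
        u = random.gauss(mean + self.shift, sd)  # steered draw
        # Normalising constants cancel because both densities share the same sd.
        log_p = -0.5 * ((u - mean) / sd) ** 2                # original distribution
        log_q = -0.5 * ((u - mean - self.shift) / sd) ** 2   # steered distribution
        self.log_weight += log_p - log_q
        return u

rng = SteeredRNG(shift=0.8)
x = simulator(theta=1.0, rng=rng)
print(x, rng.log_weight)
```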

Publications

Prangle D (2023) Distilling Importance Sampling for Likelihood Free Inference. Journal of Computational and Graphical Statistics.

 
Description Key finding 1:

An initial output of this ongoing research project is published work on a proof-of-concept algorithm implementing the proposed approach - "Distilling importance sampling" - which makes use of recent developments in statistics and machine learning. The algorithm performs well at calibrating example simulators modelling (1) queueing and (2) epidemic spread on a network.
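
To give a flavour of the published algorithm, the sketch below shows a heavily simplified distilling-importance-sampling loop: a Gaussian proposal stands in for the normalising flow used in the paper, the simulator is a toy one-parameter example, and the paper's tempering schedule and other refinements are omitted. It is a sketch under these assumptions, not the project's implementation.

```python
import torch

torch.manual_seed(0)

def simulator(theta, noise):
    # Toy stand-in simulator: one parameter plus one random input.
    return theta + noise

y_obs = torch.tensor(2.0)   # "observed data" (illustrative)
eps = 1.0                   # tolerance of the match kernel

# Learnable Gaussian proposal over (theta, noise), standing in for a normalising flow.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(300):
    with torch.no_grad():
        sigma = log_sigma.exp()
        z = mu + sigma * torch.randn(500, 2)        # samples from current proposal
        theta, noise = z[:, 0], z[:, 1]
        x = simulator(theta, noise)
        # Unnormalised log target: standard normal priors times a Gaussian
        # kernel measuring closeness of simulator output to the data.
        log_target = -0.5 * (theta ** 2 + noise ** 2) - 0.5 * ((x - y_obs) / eps) ** 2
        log_prop = (-0.5 * ((z - mu) / sigma) ** 2 - log_sigma).sum(dim=1)
        w = torch.softmax(log_target - log_prop, dim=0)   # normalised importance weights

    # "Distil" the weighted sample back into the proposal by maximising the
    # weighted log density of the proposal at the fixed sample locations.
    sigma = log_sigma.exp()
    log_q = (-0.5 * ((z - mu) / sigma) ** 2 - log_sigma).sum(dim=1)
    loss = -(w * log_q).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("fitted proposal mean:", mu.detach())
```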

Distilling importance sampling requires a large number of simulations. The project's overall goal is to apply our methods to the Ginkgo simulator for high energy physics experiments. Its simulations are much more time-consuming than those of the example simulators we have considered so far. Therefore, our ongoing work is to produce an improved algorithm which works efficiently with a small number of simulations.

Key finding 2:

We have uncovered difficulties in implementing our methods in the PyProb framework as planned: we need more flexible control of random number generation than it allows. We are now using the industry-standard PyTorch framework instead, and have a working prototype of the Ginkgo simulator code reimplemented in PyTorch.
Exploitation Route Our Distilling Importance Sampling algorithm is open source (see the GitHub link below) and can be used by other researchers to investigate simulators in many other sectors.
Sectors Digital/Communication/Information Technologies (including Software); Financial Services and Management Consultancy; Healthcare

URL https://github.com/dennisprangle/DistillingImportanceSampling