# Data-Driven Coarse-Graining using Space-Time Diffusion Maps

Lead Research Organisation:
University of Edinburgh

Department Name: Sch of Mathematics

### Abstract

Dynamical systems with many degrees of freedom arise in a wide range of applications, including large scale molecular dynamics, climate and weather studies, and electrical power networks. The challenge in simulation is normally to extract statistical information, for example the average propensity of a given state of the system or the average time that elapses between certain events. Simulation data is easy to generate but often poorly utilized. The goal of this project is the development of a data-driven method for the automatic detection of a simplified description of the system based on a set of collective variables which can be used within efficient statistical extraction procedures. These slowest degrees of freedom are typically the most important ones. The dynamics are characterised as fluctuations in the vicinity of a given state, punctuated by relatively rare events describing transitions between the states. Efficiently identifying collective variables is the crucial first step in the design of coarse-grained models, which can increase the accessible simulation timescale by many orders of magnitude. By automatically finding collective variables, we can greatly simplify rapid study and comparison of many systems. The research builds on the technique of diffusion maps, whereby the eigenfunctions of a diffusion operator are used to characterise the metastable (slowly changing) states of the system.
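To make the idea of characterising metastable states by eigenfunctions concrete, the following is a minimal sketch of the standard diffusion-map construction in Python with NumPy. It is an illustration under stated assumptions, not the space-time method developed in the project and not the interface of any software named in this record: the function names `two_state_samples` and `diffusion_map`, the synthetic two-cluster data, the kernel convention exp(-|x - y|^2 / epsilon) and the bandwidth value are all choices made here for the example.

```python
import numpy as np


def two_state_samples(n_per_state=100, seed=0):
    """Synthetic stand-in for simulation data: two well-separated 2D clusters."""
    rng = np.random.default_rng(seed)
    left = rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(n_per_state, 2))
    right = rng.normal(loc=[2.0, 0.0], scale=0.3, size=(n_per_state, 2))
    return np.vstack([left, right])


def diffusion_map(points, epsilon):
    """Eigenpairs of a kernel-based Markov matrix, sorted by decreasing eigenvalue."""
    # Pairwise squared Euclidean distances (clipped to avoid tiny negatives).
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)

    # Gaussian kernel; the bandwidth convention is an assumption of this sketch.
    kernel = np.exp(-sq_dists / epsilon)

    # Density normalisation (the "alpha = 1" variant), which reduces the
    # influence of non-uniform sampling on the resulting operator.
    density = kernel.sum(axis=1)
    kernel = kernel / np.outer(density, density)

    # The Markov matrix P = D^{-1} K shares its eigenvalues with the symmetric
    # matrix S = D^{-1/2} K D^{-1/2}, which can be handled by eigh.
    degree = kernel.sum(axis=1)
    symmetric = kernel / np.sqrt(np.outer(degree, degree))
    evals, evecs = np.linalg.eigh(symmetric)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Convert eigenvectors of S into right eigenvectors of P.
    psi = evecs / np.sqrt(degree)[:, None]
    return evals, psi


points = two_state_samples()
eigenvalues, eigenvectors = diffusion_map(points, epsilon=1.0)

# The first eigenvector is constant; the sign of the second one is used here
# as a crude indicator of which metastable state each sample belongs to.
state = eigenvectors[:, 1] > 0
print("leading eigenvalues:", eigenvalues[:3])
print("points per state:", np.count_nonzero(state), np.count_nonzero(~state))
```

In this toy setting the second eigenvector plays the role of a one-dimensional collective variable: it is nearly constant within each cluster and changes sign between them. Real trajectory data would call for a careful choice of bandwidth and, for large data sets, sparse kernels and iterative eigensolvers.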

The potential impact of automatic coarse-graining will be felt most profoundly in fields such as rational drug design, where it is necessary to select specific drug molecules for their properties in interaction with some target, e.g. a protein. Bio-molecular simulation depends on the use of very specialised and intensely developed simulation codes which are the products of many years of development and government investment. In order to accelerate the implementation and testing of novel algorithms in this important area, this project includes a detailed plan for software development within the EPSRC-funded MIST (Molecular Integrator Software Tools) platform. Testing of the software methodology will be conducted via collaborations with chemists and pharmaceutical chemists, including researchers at Rice University (Houston, Texas) and Memorial Sloan Kettering Cancer Research Center (New York).

### Planned Impact

This project has the potential to influence at least three important application fields: macromolecular simulation, geophysical fluid modelling, and the design of power networks. The latter two areas are increasingly important in the context of the evolving global climate, both for understanding and predicting changes in global (e.g. hurricane development) and local (e.g. tidal surge) weather behaviours and for the design of a stable power grid with the proliferation of power sources and changing power demand.

The macromolecular modelling area is the principal application developed in the project. In order to accelerate the development -> algorithm -> model -> simulation pipeline (which can, in some cases, evolve on a decadal timescale), this project will take advantage of the Molecular Integrator Software Tools (MIST) package developed within a previous EPSRC-funded research project. MIST facilitates a write-once-run-everywhere approach to algorithm design. Previously, MIST has been shown to allow the implementation of a continuous tempering strategy (the use of a defined temperature schedule) within AMBER, GROMACS and other molecular simulation codes. It is currently being enhanced to include algorithms for constraints and for adaptive thermostatic control. The current project will take MIST to the next level by establishing additional functionalities in the form of an analysis toolset with decision maker that will allow adaptive determination of collective variables as well as the selection of seeds for subsequent phase space exploration.

Benchmarking in an area such as macromolecular modelling is a nontrivial task, thus the project will include collaboration with well placed researchers from chemistry and pharma communities who can provide use cases (initial conditions and parameter values) which are relevant for large scale modelling. As an important demonstration, we will collaborate with John Chodera (Memorial Sloan Kettering Cancer Center, New York), who is engaged in using biomolecular simulation for drug design. The goal here is to apply our method to the problem of kinase inhibitor design for use in treating cancer and other diseases. The challenge is to accurately account for the complex large-scale conformational changes between active and various inactive states of the more than 500 human kinases. An efficient automated scheme for identifying the relevant collective variables for each kinase is vital to enabling the rapid computation of conformational free energies that play a critical role in inhibitor selectivity. By using well-studied kinase systems such as Abl and Src for which ~1 ms of aggregate simulation data is available and for which good collective variables are known, we can validate the approach before applying it to a large-scale survey of kinase conformational dynamics that will be performed on the Folding@home distributed computing platform.

### Publications

Shang, Xiaocheng (2017) *Assessing numerical methods for molecular and particle simulation*

Shang X (2017) *Assessing numerical methods for molecular and particle simulation.* in Soft Matter

Banisch R (2020) *Diffusion maps tailored to arbitrary non-degenerate Itô processes* in Applied and Computational Harmonic Analysis

Banisch R (2017) *Diffusion maps tailored to arbitrary non-degenerate Ito processes*

Leimkuhler B (2022) *Efficient Numerical Algorithms for the Generalized Langevin Equation* in SIAM Journal on Scientific Computing

Leimkuhler B (2017) *Ensemble preconditioning for Markov chain Monte Carlo simulation* in Statistics and Computing

Leimkuhler B (2020) *Hypocoercivity Properties of Adaptive Langevin Dynamics* in SIAM Journal on Applied Mathematics

Stoltz G (2018) *Langevin Dynamics With General Kinetic Energies* in Multiscale Modeling & Simulation

Description | Molecular models are typically described by energy landscapes which reflect the relative importance of different molecular structures. For example, a biomolecule may have two states - open and closed. In the open state the molecule may be able to bind to other molecules (such as a drug), while in the closed state the molecule can more easily diffuse through a cell membrane. Understanding the relative prevalence of different states is thus very important. The research in this project examined the use of unsupervised machine learning techniques (especially the technique of diffusion maps) to help explore rare motions of the molecule between different structural states. The learned information can then be used to guide further exploration. Work on this project included: trajectory generation and sampling procedures for exploring MD landscapes, development of automatic procedures for determining diffusion maps, and robust algorithms based on a quasi-stationary distribution, which allow exploration to be limited to a defined spatial domain. |

Exploitation Route | The code written by Dr Trstanova and her collaborators - pyDiffMap - is immediately usable by others and indeed is in widespread use. Algorithms such as infinite swap simulated tempering (ISST), which appear in the numerous papers written during this project, are all in the public domain and are being implemented in molecular simulation software and used by others. The MIST (Molecular Integration Simulation Toolkit) software has been further enhanced during this project (for example, it incorporates ISST), and this is also available in the public domain. A link between the molecular energy landscape and the loss landscape in supervised machine learning is being developed by B. Leimkuhler through his Turing Fellowship; some of the ideas of this project have been implemented already in TATi (Thermodynamic Analytics ToolkIt), a Python project based at the Turing Institute. A collaboration with DNV GL was funded by a spin-off EPSRC IAA award. This led to the development of a Python software package "acwind" which allowed automatic analysis of SCADA data from wind turbine farms. Algorithms were explored for training the networks based on ideas developed in this project. |

Sectors | Chemicals Pharmaceuticals and Medical Biotechnology |

Description | The research has had a direct impact on the company partner DNV GL, with which we collaborated under an add-on impact acceleration grant in 2018. They are currently further developing the machine learning techniques we initiated and developed in the project, including our acwind software package, and they are deploying this within the company. We will present a poster on our joint work with the company in 2019 at the Wind Europe Conference in Bilbao (the largest European meeting on this topic). We have also had a paper based on our work accepted by the meeting. The work culminated in the design of a software package "acwind" which is a direct outcome of this award and the secondary impact acceleration grant. The software package allows the automatic analysis of SCADA data from wind turbine farms using neural networks. An additional benefit of our work is the development of the TATi (Thermodynamic Analytics Toolkit) via collaboration with the Alan Turing Institute. This software, which is based on TensorFlow, allows the exploration of the loss landscape of neural networks (and other complicated models) using Langevin dynamics. The software could be used for uncertainty quantification in machine learning. An associated article in JMLR has documented this software. TATi is being used, for example, for coreset compression by a group from NPL [Kavya Jagan (National Physical Laboratory), Stephane Chretien (National Physical Laboratory)]. |

First Year Of Impact | 2019 |

Sector | Energy |

Impact Types | Economic |

Description | Alan Turing Institute Seed Funding |

Amount | £12,000 (GBP) |

Organisation | Alan Turing Institute |

Sector | Academic/University |

Country | United Kingdom |

Start | 08/2017 |

End | 03/2018 |

Description | EPSRC Impact Acceleration Grant (UOE) WIND TURBINES |

Amount | £27,861 (GBP) |

Funding ID | PIII015 |

Organisation | Engineering and Physical Sciences Research Council (EPSRC) |

Sector | Public |

Country | United Kingdom |

Start | 05/2018 |

End | 01/2019 |

Description | Rutherford Fellowship |

Amount | £80,000 (GBP) |

Organisation | Alan Turing Institute |

Sector | Academic/University |

Country | United Kingdom |

Start | 03/2018 |

End | 03/2019 |

Title | MIST: A simple and efficient molecular dynamics abstraction library for integrator development |

Description | We present MIST, the Molecular Integration Simulation Toolkit, a lightweight and efficient software library written in C++ which provides an abstract interface to common molecular dynamics codes, enabling rapid and portable development of new integration schemes for molecular dynamics. The initial release provides plug-in interfaces to NAMD-Lite, GROMACS and Amber, and includes several standard integration schemes, a constraint solver, temperature control using Langevin Dynamics, and two tempering schemes. We describe the architecture and functionality of the library and the C and Fortran APIs which can be used to interface additional MD codes to MIST. We show, for a range of test systems, that MIST introduces negligible overheads for serial, shared-memory parallel, and GPU-accelerated cases, except for Amber where the native integrators run directly on the GPU itself. As a demonstration of the capabilities of MIST, we describe a simulated tempering simulation used to study the free energy landscape of Alanine-12 in both vacuum and detailed solvent conditions. |

Type Of Material | Database/Collection of data |

Year Produced | 2018 |

Provided To Others? | Yes |

URL | https://data.mendeley.com/datasets/m2v3483r35 |

Title | MIST: A simple and efficient molecular dynamics abstraction library for integrator development |

Description | We present MIST, the Molecular Integration Simulation Toolkit, a lightweight and efficient software library written in C++ which provides an abstract interface to common molecular dynamics codes, enabling rapid and portable development of new integration schemes for molecular dynamics. The initial release provides plug-in interfaces to NAMD-Lite, GROMACS and Amber, and includes several standard integration schemes, a constraint solver, temperature control using Langevin Dynamics, and two tempering schemes. We describe the architecture and functionality of the library and the C and Fortran APIs which can be used to interface additional MD codes to MIST. We show, for a range of test systems, that MIST introduces negligible overheads for serial, shared-memory parallel, and GPU-accelerated cases, except for Amber where the native integrators run directly on the GPU itself. As a demonstration of the capabilities of MIST, we describe a simulated tempering simulation used to study the free energy landscape of Alanine-12 in both vacuum and detailed solvent conditions. |

Type Of Material | Database/Collection of data |

Year Produced | 2018 |

Provided To Others? | Yes |

URL | https://data.mendeley.com/datasets/m2v3483r35/1 |

Description | Ecole des Ponts ParisTech |

Organisation | École des ponts ParisTech |

Country | France |

Sector | Academic/University |

PI Contribution | We worked with Prof T. Lelievre at ParisTech to develop a new method blending diffusion maps with the quasistationary distribution to allow for efficient analysis of reaction coordinates in molecular and other systems. |

Collaborator Contribution | We helped to define the problem, create the theory and perform numerical experiments |

Impact | https://arxiv.org/abs/1901.06936 |

Start Year | 2018 |

Description | John Chodera/Memorial Sloan Kettering Cancer Centre |

Organisation | Memorial Sloan Kettering Cancer Center |

Country | United States |

Sector | Academic/University |

PI Contribution | Collaborated on numerical method analysis and error quantification. We provided theoretical support in the discussions. |

Collaborator Contribution | numerical implementations in software, testing |

Impact | Multidisciplinary, combining mathematics, chemistry, physics and biological sciences. Publication: Quantifying Configuration-Sampling Error in Langevin Simulations of Complex Molecular Systems, Josh Fass, David A. Sivak, Gavin E. Crooks, Kyle A. Beauchamp, Benedict Leimkuhler and John D. Chodera, Entropy 2018, 20(5), 318. |

Start Year | 2017 |

Title | Molecular Integration Simulation Toolkit |

Description | A library of integration routines for molecular dynamics that can be interfaced to a range of popular molecular dynamics codes for force evaluation (currently NAMD-Lite only). |

Type Of Technology | Software |

Year Produced | 2019 |

Open Source License? | Yes |

Impact | n/a |

URL | https://bitbucket.org/extasy-project/mist |

Title | Software Library For Diffusion Maps |

Description | Efficient software to implement diffusion maps, a kernel-based nonlinear clustering tool for large data sets. |

Type Of Technology | Software |

Year Produced | 2017 |

Open Source License? | Yes |

Impact | The software was used in the article Banisch R., Trstanova Z., Bittracher A., Klus S., Koltai P., Diffusion maps tailored to arbitrary non-degenerate Ito processes, under revision, arXiv:1710.03484 (not yet published). It is made available freely for use by the scientific community |

Title | Thermodynamic Analytics Toolkit (TATi) |

Description | TATi is a software suite written in Python based on TensorFlow's Python API. It brings advanced sampling methods to neural network training. Its tools allow the user to assess the topology of the loss manifold, which depends on the employed neural network and the dataset. Moreover, its simulation module makes applying existing Python sampling codes in the context of neural networks easy and straightforward. The goal of the software is to enable the user to analyze and adapt the network employed for a specific classification problem to best fit her or his needs. TATi has received financial support from a seed funding grant and through a Rutherford fellowship from the Alan Turing Institute in London (R-SIS-003, R-RUT-001) and EPSRC grant no. EP/P006175/1 (Data Driven Coarse Graining using Space-Time Diffusion Maps, B. Leimkuhler PI). Moreover, the development was aided by a Microsoft Azure Sponsorship (MS-AZR-0143P). |

Type Of Technology | Software |

Year Produced | 2019 |

Open Source License? | Yes |

Impact | none yet (too early) |

URL | https://alan-turing-institute.github.io/ThermodynamicAnalyticsToolkit/ |

Title | acwind |

Description | A Python library for automatic classification of SCADA wind energy analytics |

Type Of Technology | Software |

Year Produced | 2019 |

Open Source License? | Yes |

Impact | This software was used to underpin collaborations with partner DNV GL mentioned elsewhere. |

URL | https://github.com/acwind-lib/acwind |

Description | IMA Public Lecture: (re)-learning to simulate: a look at the new science of data-driven modelling |

Form Of Engagement Activity | A talk or presentation |

Part Of Official Scheme? | No |

Geographic Reach | Regional |

Primary Audience | Public/other audiences |

Results and Impact | I gave the IMA Public Lecture at ICMS in the Bayes Centre on the topic: (re)-learning to simulate: a look at the new science of data-driven modelling. The talk was very well attended and I was asked many questions. The audience included professional mathematicians as well as laypersons and students from high school to graduate school. Abstract of the talk: What is a mathematical model? It is a representation, a condensation, a simplification. For example, the solar system is reasonably modelled by Newton's laws of motion. From these equations, knowledge of the masses of the planets, and knowledge of the initial conditions, one can predict, to high accuracy, the positions of the planets for many millions of years. In other applications, the formulation of a tractable model is much more difficult. For example, we may not know the parameters of the model very well, and the model may be very sensitive to small changes in these. We may not even know the underlying mathematical relationships that would define a good model. Think of the stock market, or political elections. Even when we have a good model, it might not be useful if it requires excessive computation to find solutions. Thus it is often difficult to represent complex systems and to make useful quantitative predictions. In such cases it is necessary to find creative ways to explore the system of interest. Increasingly one seeks to use data (either from observations or generated by simulations) to better understand complex systems. In this talk I will discuss some of the challenges and opportunities of combining data, mathematical modelling, and scientific computing to address very challenging questions with potential importance for science, engineering and society. By embedding complex problems (and data sets) in a physical modelling framework it is sometimes possible to find new ways to understand them. I will discuss diverse examples ranging from molecular models to the analysis of wind farm performance to political gerrymandering. Slides of the talk are available here: http://kac.maths.ed.ac.uk/~bl/Data/Slides/IMA2019.pdf |

Year(s) Of Engagement Activity | 2019 |

URL | https://ima.org.uk/10790/re-learning-to-simulate-a-look-at-the-new-science-of-data-driven-computatio... |