Data-Driven Coarse-Graining using Space-Time Diffusion Maps

Lead Research Organisation: University of Edinburgh

Department Name: School of Mathematics

Abstract

Dynamical systems with many degrees of freedom arise in a wide range of applications, including large-scale molecular dynamics, climate and weather studies, and electrical power networks. The challenge in simulation is normally to extract statistical information, for example the average propensity of a given state of the system or the average time that elapses between certain events. Simulation data is easy to generate but often poorly utilized. The goal of this project is the development of a data-driven method for the automatic detection of a simplified description of the system based on a set of collective variables which can be used within efficient statistical extraction procedures. These collective variables capture the slowest degrees of freedom, which are typically the most important ones. The dynamics are characterised as fluctuations in the vicinity of a given state, punctuated by relatively rare events describing transitions between the states. Efficiently identifying collective variables is the crucial first step in the design of coarse-grained models, which can increase the accessible simulation timescale by many orders of magnitude. By automatically finding collective variables, we can greatly simplify rapid study and comparison of many systems. The research builds on the technique of diffusion maps, whereby the eigenfunctions of a diffusion operator are used to characterise the metastable (slowly changing) states of the system.
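
To make the diffusion-map idea concrete, the following is a minimal sketch, not the project's own implementation and not the API of any package named in this record. It builds a Gaussian kernel on sample points, applies a density normalisation, and takes the leading nontrivial eigenvectors of the resulting Markov matrix as candidate collective variables. The function name `diffusion_map`, the parameters `epsilon` and `n_coords`, the choice of normalisation, and the two-cluster test data are all assumptions made for illustration.

```python
import numpy as np

def diffusion_map(points, epsilon, n_coords=2):
    """Illustrative diffusion map (hypothetical helper, not a library API).

    Returns the leading nontrivial eigenvalues and the matching
    eigenvectors of a kernel-based Markov matrix.
    """
    # Pairwise squared Euclidean distances between all samples.
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Gaussian kernel with bandwidth epsilon (assumed convention).
    kernel = np.exp(-sq_dist / epsilon)

    # Density normalisation (assumption: the "alpha = 1" variant),
    # which reduces the influence of non-uniform sampling.
    q = kernel.sum(axis=1)
    kernel = kernel / np.outer(q, q)

    # The Markov matrix is P = D^{-1} K. We diagonalise the symmetric
    # matrix D^{-1/2} K D^{-1/2}, which has the same eigenvalues.
    d = kernel.sum(axis=1)
    sym = kernel / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(sym)

    # Sort eigenpairs from largest to smallest eigenvalue.
    order = np.argsort(evals)[::-1]
    evals = evals[order]
    evecs = evecs[:, order]

    # Convert back to right eigenvectors of P.
    psi = evecs / np.sqrt(d)[:, None]

    # Skip the trivial constant eigenvector (eigenvalue 1).
    return evals[1:n_coords + 1], psi[:, 1:n_coords + 1]

# Synthetic data standing in for two metastable states:
# two well-separated clusters in the plane.
rng = np.random.default_rng(0)
left = rng.normal(loc=(-1.0, 0.0), scale=0.2, size=(100, 2))
right = rng.normal(loc=(1.0, 0.0), scale=0.2, size=(100, 2))
points = np.vstack([left, right])

evals, coords = diffusion_map(points, epsilon=0.5, n_coords=2)

# The sign of the first nontrivial coordinate labels the two states.
state = coords[:, 0] > 0
```

In this toy setting the first nontrivial eigenvector is nearly constant on each cluster and changes sign between them, which is the sense in which such eigenfunctions characterise metastable states. An eigenvalue close to 1 indicates a slow process.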

The potential impact of automatic coarse-graining will be felt most profoundly in fields such as rational drug design, where it is necessary to select specific drug molecules for their properties in interaction with some target, e.g. a protein. Bio-molecular simulation depends on the use of very specialised and intensely developed simulation codes which are the products of many years of development and government investment. In order to accelerate the implementation and testing of novel algorithms in this important area, this project includes a detailed plan for software development within the EPSRC-funded MIST (Molecular Integrator Software Tools) platform. Testing of the software methodology will be conducted via collaborations with chemists and pharmaceutical chemists, including researchers at Rice University (Houston, Texas) and Memorial Sloan Kettering Cancer Center (New York).

Planned Impact

This project has the potential to influence at least three important application fields: macromolecular simulation, geophysical fluid modelling, and the design of power networks. The latter two areas are increasingly important in the context of the evolving global climate, both for understanding and predicting changes in global (e.g. hurricane development) and local (e.g. tidal surge) weather behaviours and for the design of a stable power grid with the proliferation of power sources and changing power demand.

The macromolecular modelling area is the principal application developed in the project. In order to accelerate the development -> algorithm -> model -> simulation pipeline (which can, in some cases, evolve on a decadal timescale), this project will take advantage of the Molecular Integrator Software Tools (MIST) package developed within a previous EPSRC-funded research project. MIST facilitates a write-once-run-everywhere approach to algorithm design. Previously, MIST has been shown to allow the implementation of a continuous tempering strategy (the use of a defined temperature schedule) within AMBER, GROMACS and other molecular simulation codes. It is currently being enhanced to include algorithms for constraints and for adaptive thermostatic control. The current project will take MIST to the next level by adding an analysis toolset with a decision maker that will allow adaptive determination of collective variables as well as the selection of seeds for subsequent phase space exploration.

Benchmarking in an area such as macromolecular modelling is a nontrivial task, so the project will include collaboration with well-placed researchers from the chemistry and pharma communities who can provide use cases (initial conditions and parameter values) which are relevant for large-scale modelling. As an important demonstration, we will collaborate with John Chodera (Memorial Sloan Kettering Cancer Center, New York), who is engaged in using biomolecular simulation for drug design. The goal here is to apply our method to the problem of kinase inhibitor design for use in treating cancer and other diseases. The challenge is to accurately account for the complex large-scale conformational changes between active and various inactive states of the more than 500 human kinases. An efficient automated scheme for identifying the relevant collective variables for each kinase is vital to enabling the rapid computation of conformational free energies that play a critical role in inhibitor selectivity. By using well-studied kinase systems such as Abl and Src, for which ~1 ms of aggregate simulation data is available and for which good collective variables are known, we can validate the approach before applying it to a large-scale survey of kinase conformational dynamics that will be performed on the Folding@home distributed computing platform.

Funded Value:

£304,822

Funded Period:

Jan 17 - Dec 19

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/P006175/1

Principal Investigator:

Benedict Leimkuhler

Research Subject:

Mathematical sciences (100%)

Research Topic:

Mathematical Analysis (25%)

Numerical Analysis (75%)

Organisations

People	ORCID iD
Benedict Leimkuhler (Principal Investigator)
Iain Bethune (Researcher)	http://orcid.org/0000-0003-0713-2084
Ralf Banisch (Researcher Co-Investigator)

Publications

Banisch R (2020) Diffusion maps tailored to arbitrary non-degenerate Itô processes in Applied and Computational Harmonic Analysis

Banisch R (2017) Diffusion maps tailored to arbitrary non-degenerate Ito processes

Bethune I (2019) MIST: A simple and efficient molecular dynamics abstraction library for integrator development in Computer Physics Communications

Fass J (2018) Quantifying Configuration-Sampling Error in Langevin Simulations of Complex Molecular Systems in Entropy (Basel, Switzerland)

Gkeka P (2020) Machine Learning Force Fields and Coarse-Grained Variables in Molecular Dynamics: Application to Materials and Biological Systems in Journal of Chemical Theory and Computation

Heber F (2019) TATi-Thermodynamic Analytics ToolkIt: TensorFlow-based software for posterior sampling in machine learning applications

Heber F. (2020) Posterior sampling strategies based on discretized stochastic differential equations for machine learning applications in Journal of Machine Learning Research

Leimkuhler B (2022) Efficient Numerical Algorithms for the Generalized Langevin Equation in SIAM Journal on Scientific Computing

Leimkuhler B (2019) Partitioned integrators for thermodynamic parameterization of neural networks

Key Findings
Impact Summary
Further Funding
Collaboration
Software and Technical Products
Engagement Activities


Description	Molecular models are typically described by energy landscapes which reflect the relative importance of different molecular structures. For example, a biomolecule may have two states - open and closed. In the open state the molecule may be able to bind to other molecules (such as a drug), while in the closed state the molecule can more easily diffuse through a cell membrane. Understanding the relative prevalence of different states is thus very important. The research in this project examined the use of unsupervised machine learning techniques (especially the technique of diffusion maps) to help explore rare motions of the molecule between different structural states. The learned information can then be used to guide further exploration. Work on this project included: trajectory generation and sampling procedures for exploring MD landscapes, development of automatic procedures for determining diffusion maps, and robust algorithms based on a quasi-stationary distribution, which allow exploration to be limited to a defined spatial domain.
Exploitation Route	The code written by Dr Trstanova and her collaborators - pyDiffMap - is immediately usable by others and indeed is in widespread use. Algorithms such as infinite swap simulated tempering (ISST) which appear in the numerous papers written during this project are all in the public domain and are being implemented in molecular simulation software and used by others. The MIST (Molecular Integration Simulation Toolkit) software has been further enhanced during this project (for example, it incorporates ISST), and this is also available in the public domain. A link between the molecular energy landscape and the loss landscape in supervised machine learning is being developed by B. Leimkuhler through his Turing Fellowship; some of the ideas of this project have been implemented already in TATi (Thermodynamic Analytics ToolkIt), a Python project based at the Turing Institute. A collaboration with DNV GL was funded by a spin-off EPSRC IAA award. This led to the development of a Python software package "acwind" which allowed automatic analysis of SCADA data from wind turbine farms. Algorithms were explored for training the networks based on ideas developed in this project.
Sectors	Chemicals,Pharmaceuticals and Medical Biotechnology


Description	The research has had a direct impact on the company partner DNV GL, with which we collaborated under an add-on impact acceleration grant in 2018. They are currently further developing the machine learning techniques we initiated and developed in the project, including our acwind software package, and they are deploying this within the company. We will present a poster on our joint work with the company in 2019 at the Wind Europe Conference in Bilbao (the largest European meeting on this topic). We have also had a paper based on our work accepted by the meeting. The work culminated in the design of a software package "acwind" which is a direct outcome of this award and the secondary impact acceleration grant. The software package allows the automatic analysis of SCADA data from wind turbine farms using neural networks. An additional benefit of our work is the development of TATi (Thermodynamic Analytics Toolkit) via collaboration with the Alan Turing Institute. This software, which is based on TensorFlow, allows the exploration of the loss landscape of neural networks (and other complicated models) using Langevin dynamics. The software could be used for uncertainty quantification in machine learning. An associated article in JMLR has documented this software. TATi is being used, for example, for coreset compression by a group from NPL [Kavya Jagan (National Physical Laboratory), Stephane Chretien (National Physical Laboratory)].
First Year Of Impact	2019
Sector	Energy
Impact Types	Economic


Description	Alan Turing Institute Seed Funding
Amount	£12,000 (GBP)
Organisation	Alan Turing Institute
Sector	Academic/University
Country	United Kingdom
Start	09/2017
End	03/2018


Description	EPSRC Impact Acceleration Grant (UOE) WIND TURBINES
Amount	£27,861 (GBP)
Funding ID	PIII015
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	06/2018
End	01/2019


Description	Rutherford Fellowship
Amount	£80,000 (GBP)
Organisation	Alan Turing Institute
Sector	Academic/University
Country	United Kingdom
Start	03/2018
End	03/2019


Description	Ecole des Ponts ParisTech
Organisation	École des ponts ParisTech
Country	France
Sector	Academic/University
PI Contribution	We worked with Prof T. Lelievre at ParisTech to develop a new method blending diffusion maps with the quasistationary distribution to allow for efficient analysis of reaction coordinates in molecular and other systems.
Collaborator Contribution	We helped to define the problem, create the theory and perform numerical experiments
Impact	https://arxiv.org/abs/1901.06936
Start Year	2018


Description	John Chodera/Memorial Sloan Kettering Cancer Center
Organisation	Memorial Sloan Kettering Cancer Center
Country	United States
Sector	Academic/University
PI Contribution	Collaborated on numerical method analysis and error quantification. We provided theoretical support in the discussions.
Collaborator Contribution	numerical implementations in software, testing
Impact	Multidisciplinary, combining mathematics, chemistry, physics and biological sciences. Publication: Quantifying Configuration-Sampling Error in Langevin Simulations of Complex Molecular Systems, Josh Fass, David A. Sivak, Gavin E. Crooks, Kyle A. Beauchamp, Benedict Leimkuhler and John D. Chodera, Entropy 2018, 20(5), 318.
Start Year	2017


Title	Molecular Integration Simulation Toolkit
Description	A library of integration routines for molecular dynamics that can be interfaced to a range of popular molecular dynamics codes for force evaluation (currently NAMD-Lite only).
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	n/a
URL	https://bitbucket.org/extasy-project/mist


Title	Software Library For Diffusion Maps
Description	Efficient software to implement diffusion maps, a kernel-based nonlinear clustering tool for large data sets.
Type Of Technology	Software
Year Produced	2017
Open Source License?	Yes
Impact	The software was used in the article Banisch R., Trstanova Z., Bittracher A., Klus S., Koltai P., Diffusion maps tailored to arbitrary non-degenerate Ito processes, under revision, arXiv:1710.03484 (not yet published). It is made available freely for use by the scientific community


Title	Thermodynamic Analytics Toolkit (TATi)
Description	TATi is a software suite written in Python based on TensorFlow's Python API. It brings advanced sampling methods to neural network training. Its tools allow the user to assess the topology of the loss manifold, which depends on the employed neural network and the dataset. Moreover, its simulation module makes applying existing Python sampling codes in the context of neural networks easy and straightforward. The goal of the software is to enable the user to analyze and adapt the network employed for a specific classification problem to best fit her or his needs. TATi has received financial support from a seed funding grant and through a Rutherford fellowship from the Alan Turing Institute in London (R-SIS-003, R-RUT-001) and EPSRC grant no. EP/P006175/1 (Data Driven Coarse Graining using Space-Time Diffusion Maps, B. Leimkuhler PI). Moreover, the development was aided by a Microsoft Azure Sponsorship (MS-AZR-0143P).
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	none yet (too early)
URL	https://alan-turing-institute.github.io/ThermodynamicAnalyticsToolkit/


Title	acwind
Description	A Python library for automatic classification of SCADA wind energy analytics
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	This software was used to underpin collaborations with partner DNV GL mentioned elsewhere.
URL	https://github.com/acwind-lib/acwind


Description	IMA Public Lecture: (re)-learning to simulate: a look at the new science of data-driven modelling
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	I gave the IMA Public Lecture at ICMS in the Bayes Centre on the topic: (re)-learning to simulate: a look at the new science of data-driven modelling. The talk was very well attended and I was asked many questions. The audience included professional mathematicians as well as laypersons and students from high school to graduate school. Abstract of the talk: What is a mathematical model? It is a representation, a condensation, a simplification. For example, the solar system is reasonably modelled by Newton's laws of motion. From these equations, knowledge of the masses of the planets, and knowledge of the initial conditions, one can predict, to high accuracy, the positions of the planets for many millions of years. In other applications, the formulation of a tractable model is much more difficult. For example, we may not know the parameters of the model very well, and the model may be very sensitive to small changes in these. We may not even know the underlying mathematical relationships that would define a good model. Think of the stock market, or political elections. Even when we have a good model, it might not be useful if it requires excessive computation to find solutions. Thus it is often difficult to represent complex systems and to make useful quantitative predictions. In such cases it is necessary to find creative ways to explore the system of interest. Increasingly one seeks to use data (either from observations or generated by simulations) to better understand complex systems. In this talk I will discuss some of the challenges and opportunities of combining data, mathematical modelling, and scientific computing to address very challenging questions with potential importance for science, engineering and society. By embedding complex problems (and data sets) in a physical modelling framework it is sometimes possible to find new ways to understand them.
I will discuss diverse examples ranging from molecular models to the analysis of wind farm performance to political gerrymandering. Slides of the talk are available here: http://kac.maths.ed.ac.uk/~bl/Data/Slides/IMA2019.pdf
Year(s) Of Engagement Activity	2019
URL	https://ima.org.uk/10790/re-learning-to-simulate-a-look-at-the-new-science-of-data-driven-computatio...
