Simulating catalysis: Multiscale embedding of machine learning potentials

Lead Research Organisation: University of Bristol
Department Name: Biochemistry

Abstract

In recent decades, computer simulations have become an essential part of the molecular scientist's toolbox. However, as with any computational method, molecular simulations require a compromise between speed and precision. The most precise techniques apply the principles of quantum mechanics (QM) to molecular systems and can accurately describe processes involving changes in electronic structure, such as the breaking and forming of chemical bonds. However, they require tremendous computer resources and are prohibitively costly even for systems containing only a few hundred atoms. At the other extreme are highly simplified "Molecular Mechanics" (MM) methods that ignore the quantum nature of molecules and instead describe the atoms as charged "balls" of a certain size, connected by springs that represent the chemical bonds.
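The "charged balls connected with springs" picture can be made concrete with a minimal sketch. It assumes harmonic springs for the bonds and a plain Coulomb term between non-bonded atoms; real MM force fields also include angle, torsion and van der Waals (atom size) terms, which are left out here. The function name `mm_energy`, the bond tuple layout and the unit choice (kcal/mol, Angstrom, elementary charges) are assumptions made for illustration only.

```python
import math

# Coulomb constant in kcal*Angstrom/(mol*e^2), a value commonly used in force fields.
COULOMB_K = 332.0637


def mm_energy(coords, charges, bonds):
    """Toy MM energy: harmonic bonds plus Coulomb between non-bonded pairs.

    coords  : list of (x, y, z) positions in Angstrom
    charges : list of partial charges in units of e
    bonds   : list of (i, j, k, r0) with force constant k and rest length r0
    """
    bonded = {frozenset((i, j)) for i, j, _, _ in bonds}

    # Springs: 0.5 * k * (r - r0)^2 for every bond.
    e_bond = sum(
        0.5 * k * (math.dist(coords[i], coords[j]) - r0) ** 2
        for i, j, k, r0 in bonds
    )

    # Charged balls: Coulomb interaction between atoms that are not bonded.
    e_coul = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if frozenset((i, j)) in bonded:
                continue
            r = math.dist(coords[i], coords[j])
            e_coul += COULOMB_K * charges[i] * charges[j] / r

    return e_bond + e_coul


# A bond stretched by 0.2 Angstrom with k = 100 kcal/(mol*Angstrom^2).
stretched = mm_energy([(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)], [0.0, 0.0], [(0, 1, 100.0, 1.0)])
```

Because the bond list is fixed in advance, a model of this kind has no way to describe a bond breaking or forming, which is the limitation the abstract goes on to discuss.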

The core limitation of MM is its inability to describe the breaking and forming of chemical bonds, which makes it unsuitable for simulating chemical reactions. This drawback motivated the invention of combined "multiscale" models that rely on precise but expensive QM calculations to describe the part of the simulation system where the chemical reaction takes place, while treating the rest of the system with an efficient MM method. This "Quantum Mechanics/Molecular Mechanics" (QM/MM) approach, honoured by the Nobel Prize in Chemistry in 2013, is now the state-of-the-art simulation technique for reactions in complex environments, such as those happening inside living organisms. Such simulations are important for understanding and designing catalysts, which increase the rate of chemical reactions (and can thereby reduce the amount of energy and resources required to produce molecules). However, QM/MM calculations are still only as fast as the QM method used, which dramatically limits the precision and timescale of the simulations.

A completely different approach is to employ techniques from the rapidly evolving field of machine learning (ML) and construct a method that can learn and then predict the outcome of a QM calculation. Once properly trained, an ML model can provide results of QM quality, but several orders of magnitude faster. However, ML models are still significantly slower than MM ones. Therefore, a multiscale "ML/MM" model would still offer huge savings of computer time compared to pure ML simulations. Unfortunately, existing ML training schemes are only suitable for calculations in the gas phase and cannot take into account the presence of an MM environment.

The goal of the proposed research project is to develop a novel multiscale embedding approach that will allow the use of ML models as part of an ML/MM scheme. This will enable molecular simulations of unprecedented precision on processes with high complexity without limiting the detailed exploration of molecular conformations. To achieve this goal, we will take advantage of recent advances in machine learning and in the understanding of intermolecular interactions to develop a specialised ML workflow that predicts the interaction energy between the molecule described by ML and the MM environment. The workflow will be implemented as an open, publicly available software package that allows users to train ML/MM models and run ML/MM molecular dynamics simulations of complex chemical processes, such as catalysed reactions. We expect this package to be readily adopted by a wide community of computational chemists working on enzymatic reactions, homo/heterogeneous catalysis and, more generally, processes in condensed phases, aided by specific training materials and workshops that we will provide. This will allow, for example, the development of efficient computational workflows to understand and help design catalysts for more environmentally friendly production of desired molecules.

Publications

 
Description Enzyme reactions can be effectively modelled using combined quantum mechanics/molecular mechanics (QM/MM) simulations. These multiscale simulations are able to capture the energy barrier of reactions, including the catalytic effect of the enzyme environment. This means that when different enzymes that can catalyze the same reaction are considered, the simulations can capture the difference in energy barrier between them. We have previously shown that QM/MM simulations can therefore be used as an assay for enzyme activity, and, for specific enzyme reactions, the computational time required can be reduced more than 100-fold by optimization of simulation protocols. Although this now allows for screening the activity of tens of enzymes within days (with limited computational cost), similar assays cannot be routinely applied to other enzyme reactions. A key reason for this is the longstanding challenge of the trade-off between speed and accuracy in quantum mechanical calculations (which are the bottleneck in QM/MM simulations). Density functional theory (DFT) methods that allow accurate energies and transition state descriptions are too slow for running QM/MM molecular dynamics simulations at the speed required. Semi-empirical QM methods, such as AM1, DFTB or GFN2-xTB, are significantly faster, but may not capture the mechanism and reaction barrier with sufficient accuracy. It is now possible to train machine-learned potentials (MLPs) for a variety of organic reactions, in principle allowing energy calculations with DFT accuracy at a computational cost close to that of highly efficient MM potentials.
Applying such MLPs in a multiscale "ML/MM" simulation of an enzyme reaction, with the MLP only being employed for the reactive part of the system, has the benefit that an MLP needs to be trained only once for each chemical reaction (with the difference between enzymes captured in the MM region), and offers significant savings of computer time compared to pure ML simulations of whole enzymes. However, standard approaches for embedding a QM region into an MM environment to capture catalysis (electrostatic embedding: polarization of the QM atoms by the 'point charges' from the MM part) are not applicable to MLPs. As there is no such thing as "electronic density" in an MLP, it is not clear how to properly incorporate the effect of the environment to capture enzyme catalysis, and thus how to bring the benefits of MLPs to enzyme activity assays.
In this work, we have developed an efficient and generic computational scheme that allows electrostatic embedding to be properly included in multiscale ML/MM simulations: electrostatic machine-learned embedding (EMLE). We also demonstrated our EMLE-engine software implementation for running ML/MM MD simulations stably and efficiently. Essentially, EMLE-engine decouples the QM and MM parts of a hybrid QM/MM system and provides the total energy as a sum of the MM energy, the QM energy in vacuo (which can be replaced by a suitable ML potential), and the interaction energy between the two regions, obtained through the EMLE model (Figure 1b). We have already shown that this approach can achieve essentially the same accuracy as a high-level DFT/MM description of conformational free energy landscapes. However, the real benefit of EMLE will be seen for enzyme reactions, where capturing electrostatic stabilization of high-energy species is absolutely crucial to predict enzyme catalysis and thus activity. We have already demonstrated the principle of ML/MM enzyme reaction modelling for a simple biocatalyst of interest to industry: the natural Diels-Alderase AbyU. To capture free energy barriers, established umbrella sampling protocols are used. Based on our current implementation of EMLE-engine with OpenMM, we achieve a 285x speed-up compared to DFT, with 16x less computational resource (16 CPUs for DFT vs. 1 for ML/MM). By making use of a single GPU for ML/MM (not possible for DFT), this becomes a 1000x speed-up. In addition, we have demonstrated that for enzyme reactions with a highly polarized transition state, EMLE can be retrained and then, coupled with an ML potential for the in vacuo reaction, accurately capture the catalytic effect of the enzyme (compared to solution).
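The energy decomposition described above (MM energy, plus in vacuo energy of the reactive region, plus an interaction term) can be illustrated with a small sketch. This is not the emle-engine API: `total_energy` and its three callables are hypothetical placeholders, and the numbers in the example are arbitrary stand-ins chosen only to show how the three terms are combined.

```python
def total_energy(region_xyz, env_xyz, env_charges, e_mm, e_vacuo, e_interaction):
    """Sum the three decoupled terms of an ML/MM energy.

    e_mm          : callable giving the MM energy of the environment
    e_vacuo       : callable giving the in vacuo energy of the reactive region
                    (a QM method, or an ML potential trained to reproduce it)
    e_interaction : callable giving the region/environment interaction energy
                    (the role played by the embedding model)
    """
    return (
        e_mm(env_xyz)
        + e_vacuo(region_xyz)
        + e_interaction(region_xyz, env_xyz, env_charges)
    )


# Arbitrary stand-in energy functions, for illustration only.
example = total_energy(
    region_xyz=[(0.0, 0.0, 0.0)],
    env_xyz=[(3.0, 0.0, 0.0)],
    env_charges=[-0.8],
    e_mm=lambda env: -10.0,
    e_vacuo=lambda region: -100.0,
    e_interaction=lambda region, env, q: -2.5,
)
```

Because the in vacuo term only sees the reactive region, swapping a QM method for an ML potential leaves the other two terms untouched, which is what allows one trained potential to be reused across different enzyme environments.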
For enzyme reactions, the speed of ML/MM simulations combined with the accuracy of our EMLE-embedding should therefore allow a step-change in enzyme activity screening: accurate reaction barriers can be obtained at least 1000x faster than is possible currently.
Although the main focus of the work was to develop new, efficient methods for simulating reactions in (bio)catalysts, we have also started to explore different applications of the EMLE-embedding scheme, such as a more accurate description of protein-ligand interactions that could be used for simulation-based prediction of drug-target binding affinities. This is in collaboration with other academic groups and industry.
Exploitation Route The developed EMLE methodology and code will make routine ML/MM simulations of enzyme reactions possible, so that accurate (relative) reaction barriers can be obtained for hundreds of enzyme variants within days. This, in turn, can allow the use of these EMLE-based multiscale simulations for enzyme design (e.g. ranking of alternative designs). Similarly, the EMLE embedding will allow other accurate multiscale simulations, such as simulating the dynamics of small molecules (e.g. drugs) binding to targets. As part of the project, we have also extended the capabilities of popular open-source software (OpenMM), so that it is suitable for additional applications in multiscale modelling. Our modular code further offers examples to software developers of how to efficiently integrate new methodologies within the Sire and OpenMM frameworks.
Sectors Chemicals

Digital/Communication/Information Technologies (including Software)

Manufacturing, including Industrial Biotechnology

Pharmaceuticals and Medical Biotechnology

 
Description Our model and findings have been used by companies performing Computer-Aided Drug Design, which have been testing out the use of the method and software for their prediction pipelines.
First Year Of Impact 2023
Sector Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title emle-engine 
Description A simple, versatile interface to run multiscale ML/MM simulations with electrostatic embedding of machine learning potentials using an ORCA-like interface. 
Type Of Material Technology assay or reagent 
Year Produced 2024 
Provided To Others? Yes  
Impact Initial interest from other research groups and industry. 
URL https://github.com/chemle/emle-engine
 
Title Electrostatic embedding scheme for Machine Learning potentials 
Description Training data were generated from the QM7 data set, consisting of 7165 molecules with up to 7 heavy atoms (C, N, O, and S, in addition to H). For each molecule, the density and molecular dipolar polarizability were obtained at the B3LYP/cc-pVTZ level of theory without reoptimizing the structures. The record covers the training procedure, together with the properties and parameters required by the embedding scheme, as trained/optimized for ground-state neutral compounds containing H, C, N, O, and S. 
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? Yes  
Impact None yet. 
URL https://github.com/emedio/embedding
 
Description Collaboration with Newcastle University 
Organisation Newcastle University
Country United Kingdom 
PI Contribution Sharing of models and experience.
Collaborator Contribution Sharing knowledge and code to aid with QM training data generation.
Impact Successful application for Isambard-AI "Co-Design" project.
Start Year 2024
 
Description Collaboration with University of Edinburgh 
Organisation University of Edinburgh
Department School of Chemistry
Country United Kingdom 
Sector Academic/University 
PI Contribution Sharing of models, code, algorithms, knowledge and insights.
Collaborator Contribution Testing (and co-development) of code, sharing of knowledge and insights.
Impact Testing and use of code for new research areas (calculation of binding free energies)
Start Year 2024
 
Description University of Valencia 
Organisation University of Valencia
Country Spain 
Sector Academic/University 
PI Contribution My research team and I have regular meetings with the team at the University of Valencia (mostly every 2 weeks). We collaborate intensively, coordinating efforts directly related to implementing methods and applying electrostatic embedding of machine learning potentials. This also involves the sharing of data and code.
Collaborator Contribution The team at the University of Valencia meets with the Bristol team regularly (see above). We share data and code.
Impact Output 1: https://doi.org/10.26434/chemrxiv-2023-6rng3-v2 (preprint, under review) Output 2: https://github.com/chemle/emle-engine
Start Year 2023
 
Title EMLE-engine 
Description A simple interface to allow electrostatic embedding of machine learning potentials using an ORCA-like interface. An example sander (AmberTools) implementation is provided. This works by reusing the existing interface between sander and ORCA, meaning that no modifications to sander are needed. emle-engine supports electrostatic, non-polarisable, and MM embedding. Here, non-polarisable embedding uses the EMLE model to predict charges for the QM region, but ignores the induced component of the potential. MM embedding allows the user to specify fixed MM charges for the QM atoms, with induction once again disabled. The use of different embedding schemes provides a useful reference for determining the benefit of using electrostatic embedding for a given system. 
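The difference between the three embedding schemes can be sketched with a deliberately simplified toy model. It assumes point charges on the QM atoms and independent isotropic atomic polarizabilities, with the induction energy taken as -1/2 * alpha * |E|^2 per atom; this is a stand-in chosen for brevity, not the EMLE model itself. The mode strings, function names and argument names are hypothetical and are not taken from the emle-engine interface. Atomic units are assumed throughout.

```python
import math


def static_energy(qm_xyz, qm_q, mm_xyz, mm_q):
    """Coulomb energy between QM-region charges and MM point charges."""
    return sum(
        qi * qj / math.dist(ri, rj)
        for ri, qi in zip(qm_xyz, qm_q)
        for rj, qj in zip(mm_xyz, mm_q)
    )


def induction_energy(qm_xyz, qm_alpha, mm_xyz, mm_q):
    """Toy induction term: -1/2 * alpha * |E|^2 for each QM atom."""
    energy = 0.0
    for ri, alpha in zip(qm_xyz, qm_alpha):
        field = [0.0, 0.0, 0.0]
        for rj, qj in zip(mm_xyz, mm_q):
            d = math.dist(ri, rj)
            for k in range(3):
                # Electric field at ri from a point charge qj located at rj.
                field[k] += qj * (ri[k] - rj[k]) / d**3
        energy -= 0.5 * alpha * sum(f * f for f in field)
    return energy


def embedding_energy(mode, qm_xyz, mm_xyz, mm_q, predicted_q, predicted_alpha, fixed_q=None):
    """Interaction energy under one of three illustrative embedding modes."""
    if mode == "electrostatic":
        # Predicted charges plus the induced (polarisation) component.
        return (static_energy(qm_xyz, predicted_q, mm_xyz, mm_q)
                + induction_energy(qm_xyz, predicted_alpha, mm_xyz, mm_q))
    if mode == "nonpol":
        # Predicted charges, induction ignored.
        return static_energy(qm_xyz, predicted_q, mm_xyz, mm_q)
    if mode == "mm":
        # User-supplied fixed charges, induction disabled.
        if fixed_q is None:
            raise ValueError("mode 'mm' needs fixed_q")
        return static_energy(qm_xyz, fixed_q, mm_xyz, mm_q)
    raise ValueError(f"unknown mode: {mode}")


# One QM atom at the origin, one MM charge of -1 at a distance of 2.
args = ([(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)], [-1.0], [0.5], [2.0])
full = embedding_energy("electrostatic", *args)
no_induction = embedding_energy("nonpol", *args)
fixed = embedding_energy("mm", *args, fixed_q=[1.0])
```

Comparing `full` with `no_induction` for the same geometry isolates the contribution of the induced component, which is the kind of reference comparison the description suggests.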
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact No direct impacts yet. 
URL https://github.com/chemle/emle-engine
 
Description Training workshop simulation tools 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact 25 participants were trained in the developed simulation tools. The main audience was postgraduate and postdoctoral researchers from around the UK, and there were also some industrial participants. The workshop prompted further discussion on the use of the tools, including new collaboration opportunities.
Year(s) Of Engagement Activity 2025