Turing AI Fellowship: Machine Learning for Molecular Design
Lead Research Organisation:
University of Cambridge
Department Name: Engineering
Abstract
Many existing challenges, from personalized health care to energy production and storage, require the design and manufacture of new molecules. However, identifying new molecules with desired properties is difficult and time-consuming. We aim to accelerate this process by exploiting advances in data availability, computing power, and AI.
We will create generative models of molecules that operate by placing atoms in 3D space. These are more realistic and can produce better predictions than alternative approaches based on molecular graphs. Our models will guarantee upfront that the generated molecules are synthetically accessible. This will be achieved by mirroring real-world processes for molecule generation, in which reactants are first selected and then combined into more complex molecules via chemical reactions. Additionally, our methods will be reliable, by accounting for uncertainty in parameter estimation, and data-efficient, by jointly learning from different data sources.
Our contributions will have a broad impact on materials science, leading to more effective flow batteries, solar cell components, and organic light-emitting diodes. We will also help accelerate the drug discovery process, leading to more economical and effective drugs that can significantly improve the health and lifestyle of millions.
Organisations
Publications
Allingham J. U.
(2024)
A Generative Model of Symmetry Transformations
Antoran J
(2024)
Uncertainty Estimation for Computed Tomography with a Linearised Deep Image Prior
in Transactions on Machine Learning Research
Campbell A.
(2021)
A Gradient Based Strategy for Hamiltonian Monte Carlo Hyperparameter Optimization
in Proceedings of Machine Learning Research
Chen W.
(2024)
Leveraging Task Structures for Improved Identifiability in Neural Network Representations
in Transactions on Machine Learning Research
Chen W.
(2024)
Diffusive Gibbs Sampling
| Description | We have discovered a new way to train Gaussian process models to make predictions on massive datasets. Gaussian processes are widely used for predicting molecular properties from data, but these models are limited by their scalability. We have created new methods that allow us to use such techniques with massive datasets, something that was not possible before. We have also created new generative models of molecules based on normalizing flows. Our models operate in spatial coordinates and are equivariant, which means that they capture the natural invariances of molecules to rotations and translations. Our models allow us to sample molecular configurations very quickly and can also eliminate bias in the generated samples using importance sampling. In a collaboration with my PhD student Laurence Midgley, we have developed new methods based on generative AI for the simulation of molecular configurations. Our techniques were the first able to simulate, with almost perfect accuracy, the 3D atomic coordinates of the molecule alanine dipeptide. My research in this area has led to the founding of the startup Angstrom AI. In a collaboration with my postdoc Surkiti Singh, we have developed new meta-learning methods for predicting the selectivity of specific chemical reactions. These methods are very data-efficient and can make accurate predictions with small training sets. They show very high gains in the low-data regime, which is the most common scenario in the early stages of projects related to chemical reactions. Our work has recently been accepted for publication in the journal Nature Communications (in press). |
| Exploitation Route | Our scalable Gaussian process methods can be used by others in large-scale prediction problems, for example by pharmaceutical companies or any company interested in obtaining accurate estimates of uncertainty in its predictions. Our normalizing flows will allow researchers to generate more accurate models of how molecules move in space and could be used for molecular simulations by pharmaceutical companies. My startup Angstrom AI is currently working on new methods for molecular simulation based on generative AI. Our techniques will enable the calculation of free energy differences, binding conformations & hydration sites, with ab initio accuracy, orders of magnitude faster than traditional simulations. The transition-metal-catalyzed asymmetric hydrogenation of olefins is a key transformation with great utility in various industrial applications. The field has been dominated by noble metal catalysts, such as iridium and rhodium; reactions with the earth-abundant metal cobalt have become more common only in recent years. Our data-efficient meta-learning methods have shown very high gains in predicting selectivity in the asymmetric hydrogenation of olefins. They will help in the early stages of projects, where very little data is available, and can speed up the search for new catalytic processes that use metals like cobalt, which are less expensive than alternatives like iridium and rhodium. |
| Sectors | Chemicals, Digital/Communication/Information Technologies (including Software), Pharmaceuticals and Medical Biotechnology |
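The description above mentions removing bias from flow samples with importance sampling. A minimal sketch of that idea follows, using self-normalised importance sampling on a 1D toy problem. Everything in it is an illustrative assumption rather than the project's code: a wide Gaussian stands in for the trained flow, the target is a unit-variance Gaussian centred at 1, and the function names are hypothetical.

```python
import numpy as np


def log_target_unnorm(x):
    """Unnormalised log-density of the toy target: N(1, 1) up to a constant."""
    return -0.5 * (x - 1.0) ** 2


def log_proposal(x, scale=2.0):
    """Normalised log-density of the proposal N(0, scale^2).

    This Gaussian is a stand-in (assumption) for a trained flow, which
    likewise provides both samples and exact log-densities.
    """
    return -0.5 * (x / scale) ** 2 - np.log(scale * np.sqrt(2.0 * np.pi))


def snis_mean(n_samples=100_000, seed=0, scale=2.0):
    """Estimate the target mean from proposal samples via reweighting."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, scale, size=n_samples)

    # Importance weights are target / proposal, computed in log space.
    log_w = log_target_unnorm(x) - log_proposal(x, scale)
    log_w -= log_w.max()          # stabilise the exponential
    w = np.exp(log_w)
    w /= w.sum()                  # self-normalise: unknown constants cancel

    estimate = float(np.sum(w * x))
    # Effective sample size as a fraction of the number of samples drawn.
    ess_fraction = float(1.0 / (n_samples * np.sum(w ** 2)))
    return estimate, ess_fraction


estimate, ess_fraction = snis_mean()
print(f"reweighted mean: {estimate:.3f}, ESS fraction: {ess_fraction:.2f}")
```

The unweighted proposal samples have mean 0, while the reweighted estimate recovers the target mean of 1. The effective sample size indicates how much of the sampling budget survives the reweighting; a proposal that matches the target more closely yields a higher value.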
| Description | I have co-founded a startup, Angstrom AI, which is developing new methods for molecular simulation based on generative AI. Angstrom AI's technology will enable the computation of free energy differences, binding conformations & hydration sites, with ab initio accuracy, orders of magnitude faster than traditional simulations. This technology is expected to help pharmaceutical companies reduce the costs of designing new drugs. |
| First Year Of Impact | 2024 |
| Sector | Chemicals, Digital/Communication/Information Technologies (including Software), Healthcare |
| Impact Types | Societal, Economic |
| Description | AI HUB IN GENERATIVE MODELS |
| Amount | £10,250,181 (GBP) |
| Funding ID | EP/Y028805/1 |
| Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
| Sector | Public |
| Country | United Kingdom |
| Start | 02/2024 |
| End | 01/2029 |
| Title | Alanine dipeptide in an implicit solvent at 300K |
| Description | This dataset was introduced in the article Midgley, et al.: Flow Annealed Importance Sampling Bootstrap, 2022. It contains samples from the Boltzmann distribution of alanine dipeptide in an implicit solvent, generated with a Replica Exchange Molecular Dynamics (REMD) simulation. The ff96 force field with an OBC GBSA implicit solvent was used. The REMD simulation uses 21 replicas, starting at a temperature of 300 K and increasing in increments of 50 K. Replicas are exchanged every 200 iterations, and the state at each multiple of 1000 time steps is used as a sample. Many of these simulations were run in parallel with different seeds. We let the system equilibrate for \(2\times10^5\) iterations and subsequently run the simulation for \(2\times10^6\) iterations. The data is split into a training set of \(10^6\) samples, a validation set of \(10^6\) samples, and a test set of \(10^7\) samples. The data is provided as raw \((x,y,z)\)-coordinates, stored as *.h5 files, and transformed to internal coordinates, stored as *.pt files. More details about the data and how to use it are given in our GitHub repository and paper. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2022 |
| Provided To Others? | Yes |
| URL | https://zenodo.org/record/6993124 |
| Title | Alanine dipeptide in an implicit solvent at 300K |
| Description | This dataset was introduced in the article Midgley, et al.: Flow Annealed Importance Sampling Bootstrap, 2022. It contains samples from the Boltzmann distribution of alanine dipeptide in an implicit solvent, generated with a Replica Exchange Molecular Dynamics (REMD) simulation. The ff96 force field with an OBC GBSA implicit solvent was used. The REMD simulation uses 21 replicas, starting at a temperature of 300 K and increasing in increments of 50 K. Replicas are exchanged every 200 iterations, and the state at each multiple of 1000 time steps is used as a sample. Many of these simulations were run in parallel with different seeds. We let the system equilibrate for \(2\times10^5\) iterations and subsequently run the simulation for \(2\times10^6\) iterations. The data is split into a training set of \(10^6\) samples, a validation set of \(10^6\) samples, and a test set of \(10^7\) samples. The data is provided as raw \((x,y,z)\)-coordinates, stored as *.h5 files, and transformed to internal coordinates, stored as *.pt files. More details about the data and how to use it are given in our GitHub repository and paper. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2022 |
| Provided To Others? | Yes |
| URL | https://zenodo.org/record/6993123 |
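The replica setup described in the dataset records can be made concrete with a short sketch. The temperature ladder follows directly from the description (21 replicas, starting at 300 K, in steps of 50 K). The swap rule shown is the standard Metropolis criterion for replica exchange, included as an assumption about how exchanges are accepted; the source does not state the rule, and the function names and the choice of kJ/mol energy units are hypothetical.

```python
import math

# Boltzmann constant in kJ/(mol K); the energy unit is an assumption.
K_B = 0.0083144626


def temperature_ladder(n_replicas=21, t_start=300.0, increment=50.0):
    """Temperatures in kelvin for each replica, from coldest to hottest."""
    return [t_start + i * increment for i in range(n_replicas)]


def swap_acceptance(e_i, e_j, t_i, t_j):
    """Standard Metropolis probability of swapping two replica states.

    e_i, e_j: potential energies of the states currently held at
    temperatures t_i and t_j. Returns
    min(1, exp((beta_i - beta_j) * (e_i - e_j))).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    log_ratio = (beta_i - beta_j) * (e_i - e_j)
    # Branch instead of calling exp on large positive values.
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


ladder = temperature_ladder()
print(f"{len(ladder)} replicas from {ladder[0]:.0f} K to {ladder[-1]:.0f} K")
print(f"example swap probability: {swap_acceptance(-60.0, -50.0, 300.0, 350.0):.3f}")
```

With these settings the hottest replica runs at 1300 K. Only the 300 K replica samples the distribution named in the dataset title; the hotter replicas exist to help the simulation cross energy barriers.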
| Description | Collaboration with Yarin Gal's group at Oxford |
| Organisation | University of Oxford |
| Department | Department of Computer Science |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | I collaborated with Yarin Gal and the PhD student Pascal Notin. This collaboration resulted in a paper published at Advances in Neural Information Processing Systems (NeurIPS) 2021, one of the leading conferences in machine learning. In this work, we describe a method to increase the robustness of deep generative models of molecules by taking into account the uncertainty in the decoding process. I participated in regular meetings, providing guidance to Pascal Notin and helping with the writing of the paper. We are currently collaborating on another paper. |
| Collaborator Contribution | Pascal Notin's supervisor, Yarin Gal, provided additional guidance to Pascal and collaborated on the writing of the paper. Pascal ran the experiments and wrote the paper. |
| Impact | Notin P., Hernández-Lobato J. M. and Gal Y. Improving black-box optimization in VAE latent space using decoder uncertainty, In NeurIPS 2021. |
| Start Year | 2021 |
| Company Name | Angstrom AI, Inc. |
| Description | Angstrom AI is a startup that I co-founded with two of my PhD students (it is not a spin-out). It aims to accelerate molecular simulation using generative AI, computing free energy differences, binding conformations & hydration sites with ab initio accuracy, orders of magnitude faster than traditional simulations. |
| Year Established | 2024 |
| Impact | We have received 4 million USD in funding. |
| Website | https://www.angstrom-ai.com/ |
| Description | Directorship of ELLIS Research Program for Molecule Discovery |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | I was appointed director of the ELLIS Research Program on Molecule Discovery |
| Year(s) Of Engagement Activity | 2024 |
| URL | https://ellis.eu/programs/machine-learning-for-molecule-discovery |
| Description | Established an ELLIS research program on Machine Learning for Molecule Discovery |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | The range of ELLIS research programs (https://ellis.eu/programs) is expanding further: the proposal 'Machine Learning for Molecule Discovery' has been accepted as a new ELLIS program. It will push the scientific boundaries of its area by fostering exchange and research collaborations among outstanding researchers in Europe. Jose Miguel Hernandez Lobato will be one of the directors of this program. ELLIS Program 'Machine Learning for Molecule Discovery': Discovering new molecules with desired functions or activities is crucial for human well-being by providing new medicines, securing the world's food supply via agrochemicals, or enabling a sustainable energy conversion and storage to counter or mitigate climate change. However, the discovery of new molecules or molecular materials that are optimized for a particular purpose can often take up to a decade and is highly cost-intensive. Machine-learning (ML) methods can accelerate molecular discovery, which is of considerable importance generally, but especially in light of the COVID-19 crisis and future pandemics. To reach this goal of speeding up the discovery of new functional molecules, the new ELLIS program 'Machine Learning for Molecule Discovery' aims to establish a dialogue between domain experts and ML researchers to ensure that ML positively impacts real-world scenarios. The program's objectives are to advance computational molecular science by improving molecular representations, molecular modeling, property prediction, generative modeling for molecules and molecular optimization, and chemical synthesis through ML methods. The program intends to connect researchers from ELLIS units such as Cambridge, Linz, and Berlin, as well as from academia, pharmaceutical and technology companies. The direct exchange among experts and open discussions about research results are a crucial aspect for advancing science.
"Molecules are in the center of almost all natural sciences, from chemistry and material science over physics to molecular biology. Similarly, almost all sub-fields of machine learning yielded applications for molecules, for example: Geometric Deep Learning finds molecules a rich field for equi- and invariances, Deep Learning architectures predict molecular properties or generate molecules, and deep reinforcement learning helps with planning chemical synthesis routes. By advancing machine learning, we will speed up molecule discovery, and ELLIS has the key role in this to connect machine learning researchers with molecular sciences and industry. First steps in this direction have been made through the ELLIS ML4Molecules workshops in 2021 and 2022 with together over 1300 registered participants", says Günter Klambauer, ELLIS Scholar, one of the Coordinators of this program and Associate Professor for Artificial Intelligence in Life Sciences at Johannes Kepler University Linz. The program proposal was also coordinated by Jose Miguel Hernandez Lobato (University of Cambridge, ELLIS unit Cambridge) and Nadine Schneider (Novartis). Jose Miguel Hernandez Lobato will be one of the directors of the program. |
| Year(s) Of Engagement Activity | 2022 |
| URL | https://ellis.eu/news/ellis-research-programs-two-new-proposals-accepted |
| Description | Organization of ELLIS workshop on Molecule Discovery |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | Discovering new molecules with desired functions or activities is crucial for human well-being by providing new medicines, securing the world's food supply via agrochemicals, or enabling a sustainable energy conversion and storage to counter or mitigate climate change. However, the discovery of new molecules or molecular materials that are optimized for a particular purpose can often take up to a decade and is highly cost-intensive. Machine-learning (ML) methods can accelerate molecular discovery, which is of considerable importance generally, but especially in light of the COVID-19 crisis and future pandemics. To reach this goal of speeding up the discovery of new functional molecules, it is necessary to establish a dialogue between domain experts and ML researchers to ensure that ML positively impacts real-world scenarios. The importance of this field has also been acknowledged by Stanford's 2021 Artificial Intelligence Index Report, which states that "Drugs, Cancer, Molecular, Drug Discovery" received the greatest amount of private AI investment in 2020, with more than USD 13.8 billion, 4.5 times higher than in 2019. This workshop brought together the expertise of excellent researchers in the field of ML and its applications to molecular discovery. The workshop had more than 40 poster presentations and more than 20 invited talks. Jose Miguel Hernandez Lobato was a panelist in the workshop. |
| Year(s) Of Engagement Activity | 2021 |
| URL | https://moleculediscovery.github.io/workshop2021/ |
| Description | Organization of ELLIS workshop on molecule discovery |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | Big successes of machine learning (ML) for molecules have been achieved recently, e.g. the accurate prediction of protein 3D structure (Jumper 2021; Thornton, 2021), discovery of novel antibiotics (Stokes, 2020; Das, 2021), or chemical synthesis planning (Segler, 2018). These successes make molecular machine learning one of the prime candidates to tackle the climate-, energy- and pandemic-related crises that we are facing. Nevertheless, there are still major challenges, and substantial criticism has been voiced of current methods, which are based mostly on deep learning (Marcus, 2018). Deep learning (DL) methods are data hungry, have limited knowledge transfer capabilities, do not quickly adapt to changing tasks or distributions, insufficiently incorporate world and prior knowledge, and cannot inherently distinguish causation from correlation (Bengio, 2021; Chollet, 2019; Marcus, 2018; Schölkopf, 2019). Furthermore, the current models are usually not composable, in the sense that sub-components or different modules cannot be combined in a new way. With these characteristics, the machine learning systems currently employed for molecules are a form of narrow artificial intelligence (AI) (Chollet, 2019; Hochreiter, 2022). The above-mentioned drawbacks hold in particular for molecular machine learning, such as activity and property prediction, generative modeling (Yang, 2017; Bender, 2021; Fan, 2022), chemical reactivity and synthesis (Segler, 2018; Seidl, 2022), and molecular modeling (Bereau, 2013) and representation learning. Therefore, this workshop focuses on exposing the current limitations of machine learning methods for molecules by critically assessing them, either theoretically or in applied and industrial settings. The methods contributed to this workshop can focus on architectures that are robust against domain shifts, such as new biotechnologies or types of molecules.
The proposed methods can also focus on quickly adapting to newly acquired data with potentially expensive biotechnologies, concretely few- and zero-shot learning methods. A further theme of the workshop is methods that lead to new levels of abstraction in molecule representations, such that broader generalization capabilities are enabled. A potential step in this direction is machine learning methods for creating relevant physical abstractions, e.g. for improving molecular dynamics simulations or force fields. Advancing machine learning for molecules also means that these new systems should be able to interact with humans and transfer knowledge between them and the system, which is covered by the workshop theme on interpretability and explainability methods. The workshop also includes considerations and methodologies that allow for modularity or compositionality of architectures for molecular machine learning. |
| Year(s) Of Engagement Activity | 2022 |
| URL | https://moleculediscovery.github.io/workshop2022/ |
| Description | Organization of NeurIPS workshop on Deep Generative Models and Downstream Applications |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | In a highly interactive format, this workshop outlined the current frontiers of practical applications and methodological contributions in deep generative models. We aimed to use this workshop as an opportunity to establish a common language across diverse communities, to actively discuss new research problems, and to collect relevant benchmark tasks against which novel data modeling methods can be evaluated. The program was a collection of invited talks, alongside contributed posters. A panel discussion provided the different perspectives and experiences of influential researchers and also engaged participants in open conversation. We had 40 poster presentations and 10 invited talks. |
| Year(s) Of Engagement Activity | 2021 |
| URL | https://dgms-and-applications.github.io/2021/#:~:text=The%202021%20NeurIPS%20Workshop%20on,world%20p... |
