Navigating Chemical Space with Natural Language Processing and Deep Learning

Lead Research Organisation: University of Greenwich
Department Name: Pharm., Chem. & Environmental Sci., FES

Abstract

Natural language processing (NLP) lies at the intersection of linguistics and computer science and aims to process and analyse human language, typically provided as written text. NLP is now strongly focused on the use of machine learning for challenging tasks, with some revolutionary algorithms having been developed in the last few years. These algorithms now underpin a wide range of real-life applications, such as ChatGPT, virtual assistants and automatic text completion when we write emails.

Innovative research ideas often come from integrating techniques and concepts across disciplines. For this discipline-hopping grant, we would like to explore how Transformer models, a ground-breaking deep learning architecture developed by Google in 2017 that fuels the majority of current cutting-edge research in NLP, can be adapted to solve research challenges in chemistry.

Chemical structures are usually three-dimensional. However, they are also often converted into character sequences, called SMILES (Simplified Molecular Input Line Entry System). SMILES has a simple vocabulary of chemical elements and bond symbols and a few grammatical rules governing how they are arranged. Owing to this direct analogy with text sequences, SMILES makes it possible to apply NLP algorithms to chemical structures in much the same way as they are applied to text.
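
As a brief illustration of this analogy (a minimal sketch assuming the open-source RDKit toolkit; the molecule and tokenizer pattern below are purely illustrative), the same SMILES string can be parsed as a chemical structure and split into tokens like a sentence:

    from rdkit import Chem   # open-source cheminformatics toolkit
    import re

    smiles = "CC(=O)Oc1ccccc1C(=O)O"          # aspirin written as a SMILES string
    mol = Chem.MolFromSmiles(smiles)          # parsed as a chemical structure
    print(mol.GetNumAtoms())                  # 13 heavy atoms

    # The same string can be treated as a sentence and split into tokens,
    # much as an NLP tokenizer splits text into words (pattern is illustrative).
    tokens = re.findall(r"Cl|Br|\[[^\]]+\]|.", smiles)
    print(tokens)                             # ['C', 'C', '(', '=', 'O', ')', ...]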

For the proposed research, Dr Pang, a chemist, will work with Dr Vulic, an NLP and machine learning expert, to get up to speed with the latest developments in the field of NLP and to examine their further applicability in her domain of expertise.

We will explore and utilise a concept which is now pervasive in machine learning and NLP, termed transfer learning, in which 1) large general-purpose models are pretrained, and 2) those general models are then fine-tuned (i.e., specialised) for specific tasks and applications where labelled data are expensive to create (they require expert knowledge and complex annotation protocols) and are therefore inherently scarce.

Specifically, we will pretrain Transformer models to learn a latent representation of the chemical space defined by tens of millions of SMILES. This learned latent representation can then be used during fine-tuning to predict molecular properties for a given chemical structure. The advantage of this type of approach is that the resulting machine learning models rely less on labelled data (molecules with experimentally determined properties), which are time-consuming or even impossible to generate in chemistry given the associated cost and experimental challenges. We will aim to make the Transformer models more computationally efficient and accurate using two recent machine learning techniques, sentence encoding and contrastive learning. We hope that this new molecular representation can complement existing molecular representation methods and provide an alternative approach to evaluating molecular structures against their properties, which underpins many research and development tasks in the chemical and pharmaceutical industries.
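
A minimal sketch of this transfer-learning idea, assuming a SMILES-pretrained Transformer encoder from the Hugging Face hub (the checkpoint name, molecules and property values below are illustrative), and using a frozen encoder with a simple regression head as a stand-in for full fine-tuning:

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import Ridge

    # Illustrative checkpoint: any Transformer encoder pretrained on SMILES would do.
    checkpoint = "seyonec/ChemBERTa-zinc-base-v1"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoder = AutoModel.from_pretrained(checkpoint)

    def embed(smiles_list):
        # Mean-pool the encoder's last hidden states into one vector per molecule.
        batch = tokenizer(smiles_list, padding=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state      # (batch, tokens, dim)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

    # Hypothetical labelled set: a handful of molecules with a measured property.
    train_smiles = ["CCO", "CCCCO", "c1ccccc1O", "CC(=O)O"]
    train_values = [0.59, 0.88, 1.46, 0.17]                  # illustrative numbers only

    model = Ridge().fit(embed(train_smiles), train_values)
    print(model.predict(embed(["CCCO"])))                    # predicted property value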
 
Description The purpose of the award was for the PI, Dr Pang, a chemist, to work with Dr Vulic, an NLP and machine learning expert, to get up to speed with the latest developments in the field of large language models (LLMs) and to examine their further applicability in her domain of expertise, chemistry. One year into the project, Dr Pang has upskilled significantly: she now has a broad understanding of the latest developments in LLMs through developing code, models and datasets, and is establishing herself as an AI-in-chemistry researcher. She has presented at several high-profile AI in chemistry conferences, and the research from the award was featured in The Engineer magazine.

In terms of new knowledge generated from the award, the grant team has explored a concept which is pervasive in machine learning and NLP, termed transfer learning, in which 1) large general-purpose models are pretrained, and 2) those general models are then fine-tuned (i.e., specialised) for specific tasks and applications. We demonstrate that this approach can be used effectively for the prediction of solubility and of organic chemistry reactions in a computationally efficient way: the fine-tuned LLMs require only a few hours to train on a consumer-grade GPU and possess broad capabilities for a variety of reaction prediction tasks, such as predicting reagents and products. This method represents the state of the art in both GPU-computing efficiency and model versatility and adaptability among applications of AI to chemistry research. We have made the code and dataset openly available on GitHub.
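
As an illustration of how such a fine-tuned sequence-to-sequence model is used at inference time (a sketch only; the checkpoint path and reaction below are hypothetical rather than the project's released model), the reactants go in as a SMILES "sentence" and the predicted product comes out as text:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Hypothetical checkpoint path: substitute the fine-tuned model of interest.
    checkpoint = "path/to/finetuned-reaction-t5"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    reactants = "CC(=O)O.OCC"                     # acetic acid + ethanol (illustrative)
    inputs = tokenizer(reactants, return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=5, max_length=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))   # e.g. "CCOC(=O)C"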

The method has only been published for 6 months but has already attracted interest from researchers in the pharmaceutical industry. The grant team has been working to broaden its impact and to encourage others to adopt it through conferences and industry engagement events.
Exploitation Route Our research, in its broad scope, aims to transform organic synthesis using advanced AI-driven reaction prediction models. At the end of the award, we will produce open-source chemical reaction datasets and proof-of-principle AI models to help other users adapt this approach to novel and specific reactions.

In terms of how the outcomes of this funding can be taken forward: the fine-tuned large language models (LLMs) can be refined with larger datasets and more advanced fine-tuning algorithms. LLMs, with their rich understanding of reactions, can be combined with Bayesian reaction optimisation models, which focus on the process and are reaction agnostic. The models we developed can be adapted and taken forward by both academic and industry users, significantly reducing the cost and time needed to plan and optimise organic synthesis.
Sectors Chemicals

Pharmaceuticals and Medical Biotechnology

 
Title Fine-Tuning T5-Style Language Models for Organic Reaction Prediction 
Description This method effectively specialises large language models for organic reaction prediction through a process called fine-tuning. The fine-tuned LLMs require only a few hours to train on a consumer-grade GPU and possess broad capabilities for a variety of reaction prediction tasks, such as predicting reagents and products. This method represents the state of the art in both GPU-computing efficiency and model versatility and adaptability among applications of AI to chemistry research.
Type Of Material Improvements to research infrastructure 
Year Produced 2024 
Provided To Others? Yes  
Impact The method has only been published for 6 months but has already attracted interest from researchers in the pharmaceutical industry. The grant team has been working to broaden its impact and to encourage others to adopt it through conferences and industry engagement events.
URL https://github.com/cambridgeltl/chem-encdec
 
Title Predicting Hansen Solubility Parameters Using Molecular Embeddings 
Description This research tool is a set of Python code written in Jupyter notebooks. The code uses the Morgan fingerprint approach and two Natural Language Processing-based molecular embedding models to predict Hansen Solubility Parameters (HSPs), which are widely used in the chemical and pharmaceutical industries. The method is fast and flexible and achieves prediction accuracy on a par with other, more computationally demanding methods (a minimal illustrative sketch of the fingerprint-based approach appears after this entry).
Type Of Material Improvements to research infrastructure 
Year Produced 2024 
Provided To Others? Yes  
Impact The code and the paper that accompanies it have received 6 citations and 3 stars on GitHub in the 14 months since publication.
URL https://github.com/jiayunpang/hsp_embedding
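
A minimal, generic sketch of the fingerprint-based baseline referred to above, assuming RDKit and scikit-learn are installed (the molecules and property values are illustrative, approximated from standard Hansen tables, and are not taken from the repository):

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestRegressor

    def morgan_fp(smiles, radius=2, n_bits=2048):
        # Encode a molecule as a fixed-length Morgan fingerprint bit vector.
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        return np.array(list(fp))

    # Illustrative training data: SMILES paired with the dispersion HSP (MPa^0.5).
    train_smiles = ["CCO", "CCCCCC", "c1ccccc1", "CC(C)=O"]
    train_delta_d = [15.8, 14.9, 18.4, 15.5]

    X = np.stack([morgan_fp(s) for s in train_smiles])
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, train_delta_d)
    print(model.predict(morgan_fp("CCCCO").reshape(1, -1)))   # predicted dispersion HSP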
 
Description Meeting with researchers from the data science team and the discovery high throughput chemistry group 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Industry/Business
Results and Impact The PI and Co-I of the award attended a discussion meeting with three researchers from GSK's data science and discovery high-throughput chemistry teams. The discussion covered AI algorithms, data representation and industry challenges in the research area. The code and models from the EPSRC award were shared with the industry researchers to encourage testing.
Year(s) Of Engagement Activity 2024
 
Description Seminar to the Data Science & Modelling team at AstraZeneca (Macclesfield UK and Gothenburg Sweden) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact A talk was given to 20-30 researchers from the Data Science and Modelling team at AstraZeneca, from both the Macclesfield (UK) and Gothenburg (Sweden) sites. The talk focused on key AI concepts and the main progress of the research from the EPSRC award. The resulting code and models from the EPSRC award were shared with the industry researchers. This was followed by discussion of possible collaboration.
Year(s) Of Engagement Activity 2025