Understanding the Reliability and Transferability of Machine Learning Methods Used in High Throughput Reaction Discovery and Optimisation

Lead Research Organisation: University of Strathclyde
Department Name: Pure and Applied Chemistry


Efficient organic synthesis enables pharmaceutical products to be produced in a scalable, robust, safe, and cost-effective way; significant resource is expended to optimise the yield and purity of the products obtained. Design of Experiments (DoE) provides a structured, logical way to determine optimal reaction conditions and test their robustness. However, the resource requirements of DoE increase rapidly as more factors are added, and the treatment of discontinuous factors (e.g. solvent, ligand) is difficult. Smarter, faster, and complementary ways to optimise reactions will allow the delivery of target molecules more quickly and at lower cost. This project combines reaction screening with principal components analysis (PCA) of substrate and ligand parameters and with supervised machine learning techniques (multiple linear regression/random forest classification). The goal is to derive data-driven reaction understanding from the kind of reaction screening that is routine within organisations that rely on the synthesis of fine chemicals. This understanding will refine the chemical space to be explored during a subsequent DoE process. The focus is not on replacing the synthetic chemist, but on using modern data analysis techniques to expedite their work and reduce the amount of time required to achieve the optimum conditions.
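
As a rough illustration of the workflow described above, the following Python sketch chains descriptor scaling, PCA, and a random forest classifier using scikit-learn. It is a minimal sketch under stated assumptions, not the project's actual method: the descriptor matrix and outcome labels are randomly generated placeholders, and the number of descriptors, the number of principal components, and the forest size are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (NOT project results): 46 hypothetical reactions,
# each described by 12 made-up substrate/ligand descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(46, 12))

# Hypothetical binary outcome (1 = reaction "worked", 0 = "failed"),
# generated from two of the descriptors plus noise.
noise = rng.normal(scale=0.5, size=46)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Scale the descriptors, compress them with PCA, then classify.
# n_components=3 and n_estimators=200 are assumptions for illustration.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# 5-fold cross-validation gives a less optimistic accuracy estimate
# than scoring on the training data, which matters for small datasets.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"Mean cross-validated accuracy: {mean_accuracy:.2f}")
```

Placing the scaler and PCA inside the pipeline means they are refitted on each training fold, so no information from the held-out reactions leaks into the model.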

We will focus on C-H borylation reactions because of the utility of the resulting products. We will study established iridium-catalysed Hartwig borylation reactions initially; subsequent, more ambitious work will be conducted using cheaper and more readily available ruthenium, for which only a limited number of C-H borylation reactions of pyridines and imines have been reported. C-H borylation reactions have a number of drawbacks that limit their use in industry: (i) we have some understanding of how some methods behave with different substrates, such as heterocycles, but known examples do not cover all substrates that might arise during synthetic campaigns such as those undertaken at GSK; (ii) metal loadings can be very high, particularly for emerging methods that use ruthenium, which has cost implications both in terms of the catalyst required and the purification of the resulting products; and (iii) reaction conditions can be harsh, requiring high temperatures for extended periods, which is environmentally unfriendly and costly, and may lead to side reactions. This project will develop innovative new ways to optimise reactions and compare them to established methods (e.g. DoE). Our overall aim is to reduce the number of experiments required to arrive at the optimum set of conditions.
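
To make the experiment-count concern concrete, the Python sketch below enumerates a full factorial design, the simplest DoE layout, in which every combination of factor levels is run once. The factor names and levels are hypothetical and chosen only for illustration; practical DoE studies often use fractional or other reduced designs rather than the full factorial shown here.

```python
from itertools import product


def full_factorial(levels_per_factor):
    """Return every combination of factor levels as a list of dicts."""
    names = list(levels_per_factor)
    combos = product(*levels_per_factor.values())
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical factors and levels, for illustration only.
factors = {
    "temperature_C": [60, 80, 100],
    "catalyst_loading_mol_percent": [1, 5],
    "solvent": ["THF", "MeCN", "toluene"],
    "ligand": ["ligand_A", "ligand_B"],
}

design = full_factorial(factors)
print(len(design))  # 3 * 2 * 3 * 2 = 36 runs

# Adding one more three-level factor triples the number of runs.
factors_extended = dict(factors, reaction_time_h=[4, 16, 24])
design_extended = full_factorial(factors_extended)
print(len(design_extended))  # 36 * 3 = 108 runs
```

The run count is the product of the number of levels of each factor, so narrowing the set of solvents or ligands before the DoE stage reduces the experimental burden multiplicatively.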



Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/V519777/1                                   30/09/2020  29/09/2026
2484402            Studentship   EP/V519777/1  01/01/2021  31/12/2024  Hayley Russell
Description I collected a dataset of 46 borylation reactions using high-throughput chemistry techniques. This will be used in the remaining project time to train a machine learning model to predict reaction outcomes from molecular descriptors of the substrates. Machine learning work has begun, with an initial accuracy of 66%, which can be improved with further adjustments to the model type and hyperparameters.
Exploitation Route Significant learnings about how to collect appropriate data for machine learning using high-throughput chemistry equipment will influence the design of future studies.
Sectors Chemicals; Pharmaceuticals and Medical Biotechnology

Description Placement at University of British Columbia 
Organisation University of British Columbia
Country Canada 
Sector Academic/University 
PI Contribution TBC (currently underway)
Collaborator Contribution TBC (currently underway)
Impact TBC (currently underway)
Start Year 2023