Machine Learning for Catalysis

Lead Research Organisation: University of Oxford


This project falls within the EPSRC Synthetic Organic Chemistry research area.
Machine Learning (ML), with its ability to predict and analyse trends that previous methods have failed to quantify, has become the preeminent modern method for the analysis of large data sets and has revolutionised data science.
However, the application of these methods within chemistry has been limited, because both the complexity of the systems and the small size of the available data sets pose non-trivial problems. A fundamental requirement for ML programs to work is a data set that is both large and a representative sample of chemical space: a model trained on data that fails these criteria will tend to over-fit and so not be of use. Recently, the use of ML methods within chemistry has been expanded by Sigman and Doyle, who have used algorithms to predict reactivity. Using the data sets from the Gouverneur group's work on Hydrogen Bonding Phase Transfer Catalysts, and expanding upon the ML methods used in GT-Predict to predict glycosylation, reported by the Davis group, we aim to develop novel ML methods that can be utilised alongside other analysis techniques. We envisage that ML methods could be utilised in the understanding of catalytic reactions to develop better catalytic manifolds for key substrates. A variety of catalytic manifolds, from chiral small molecules to enzymes, will be explored.
With the data sets from research within the Gouverneur Group, we wish to develop an ML program that would both predict the enantioselectivity of a catalyst and facilitate its redesign to a more efficient structure. Fluorination of proteins and biological structures is the application of interest for the work. In this case, the limitations of representing protein structure in vector form need to be established before the representation is applied to a fluorination setting.
Previously, further advances to the GT-Predict model gave rise to a new model, GT-Predict-II, which used the whole amino acid (AA) sequence of the protein to generate the input of the program. Further study is now needed to use the model to predict which changes within the AA sequence would lead to a change in the substrate selectivity of the enzyme. Validating this model would address a fundamental challenge facing molecular biology today: how changes in enzyme structure can explain changes in activity.
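As a purely illustrative sketch of what turning a whole amino acid sequence into a model input can look like, the Python below one-hot encodes a sequence into a fixed-length vector and lists the positions at which two variants differ. It is not the featurisation used by GT-Predict or GT-Predict-II, which is not described here; the function names, the padding scheme and the example sequences are all hypothetical.

```python
# Hypothetical illustration only: NOT the GT-Predict-II featurisation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, length):
    """Encode a sequence as a flat list of length * 20 floats.

    Sequences shorter than `length` are zero-padded, longer ones are
    truncated, and non-standard residues are left as all-zero rows.
    """
    width = len(AMINO_ACIDS)
    vector = [0.0] * (length * width)
    for pos, residue in enumerate(sequence.upper()[:length]):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            vector[pos * width + idx] = 1.0
    return vector


def differing_positions(seq_a, seq_b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Made-up three-residue sequences, used only to show the calls.
wild_type = "MKV"
variant = "MRV"
features = one_hot_encode(wild_type, length=5)
mutated = differing_positions(wild_type, variant)
```

A fixed-length encoding of this kind is simple to implement and inspect, but it discards three-dimensional structure, which is one example of the representational limitations the project sets out to examine.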
Using the knowledge obtained from the modelling of both families of enzymatic catalysts, along with that of small molecule chemistry, we aim to use the combined systems to build a model capable of understanding the fluorination of proteins along with the mechanisms of the Fluorinase enzyme. With a method to represent these structures in a vector format, the project will focus on the ability to apply ML techniques to a range of new synthetically relevant systems. With such wide applicability, the models must be simple so as to be easy to implement and understand. This consideration will allow ML techniques to become a toolkit that chemists can easily use alongside other traditional methods of chemical analysis.

Planned Impact

This programme is focused on a new cohort-driven approach to the training of next-generation doctoral scientists in the practice of novel and efficient chemical synthesis coupled with an in-depth appreciation of its application to biology and medicine.

This collaborative academic-industrial SBM CDT will have long-term benefit to the chemical industry, including the pharmaceutical, agrochemical and fine chemical sectors. These industries will benefit through: (i) the potential to employ individuals trained with broad and relevant scientific and transferable skills; (ii) new approaches to the investigation of complex biological and medical problems through novel chemistry; and (iii) better and more efficient synthetic methods.

We will link the work of DSTL, and our pharmaceutical and agrochemical partners (GSK, UCB, Vertex, Evotec, Eisai, AstraZeneca, Syngenta, Novartis, Takeda, Sumitomo and Pfizer) through a common theme of synthesis training. The design and synthesis of new compounds is essential for disease treatment and prevention, and for maintaining food security. Synthesis contributes significantly to UK tax revenue and results in sustained employment across a number of sectors. Employers in the finance, law, health, academic, analytical, government, and teaching professions, for example, also recognise the value of the translational skill-sets possessed by synthesis postgraduates, which this programme will provide.

The SBM CDT training programme will adopt an IP-free model to enable completely free exchange of information, know-how and specific expertise between students and supervisors on different projects and across different industrial companies. This will lead to better knowledge creation through unfettered access to information from all academics, partners and students involved in the project. By focussing on basic science, we will engender genuine collaboration leading to enabling technology that will be of use across a wide range of industries.

We will train the next generation of multidisciplinary synthetic chemists with an appreciation of the impact of synthesis in biology and medicine. Their unconstrained view of synthesis will aid in new scientific discoveries leading to new products which, with appropriate inward investment, can lead to the formation of new companies and new UK employment.

We will, in part through an alliance with the Defence Science and Technology Laboratory, engage with policy-makers to influence future policy issues involving chemistry, such as food security and the rise of antibiotic resistance (both of which are relevant to our programme and are important for society as a whole).

Outreach and public engagement will be a key aspect of our programme, and all students in the proposed SBM CDT will take part in at least one outreach activity. Typical activities include open days in the Chemistry Department through the 'Outreach Alchemists', engagement with the Oxfordshire Science Festival, and participation in the various other activities already in place through the public engagement programme of the Department of Chemistry.

The research output of the students will be disseminated via high-impact international publications and lectures; these will be of value to other academics in relevant fields and in the development of further research funding applications. Outreach activities and research output will also be advertised on a website dedicated to the proposed SBM programme.

