Bringing AI to structure-based compound optimisation by leveraging high-throughput X-ray crystallography in a structured data framework

Lead Research Organisation: University of Oxford
Department Name: Oxford Chemistry


In the proposed research project I would build computational tools and analyses that help to improve the efficiency of drug discovery through enhanced analysis of protein-ligand interactions.
The continuing influx of genetic information has led to an explosion in the number of putative macromolecular disease targets, including proteins. Small molecules (<900 Da) can bind to and then modulate the activity of those protein targets. Small molecules can thus be used as drugs to treat diseases and as tools to reveal linkages between potential targets and disease. Tools and drugs must bind strongly to their protein of interest (potency). Most small molecule drugs and tools do so through non-covalent interactions such as hydrogen bonds, electrostatic and hydrophobic interactions (protein-ligand interactions). The quantitative understanding of such interactions remains poor and so the automated design of small molecules with optimised interactions is currently not possible. The current state of the art in small molecule optimisation involves multiple time-consuming and expensive cycles of subjective human-driven design, chemical synthesis and experimental testing. For each potent small molecule this typically takes years and costs millions of pounds, often ending in expensive failure.
A major reason for the lack of understanding of protein-ligand interactions and routes to optimising them is that high-quality, systematic data has until now been the preserve of specialised industry groups (and very expensive to generate). The XChem collaboration between Diamond Light Source and the Structural Genomics Consortium (SGC) Oxford enables medium-throughput generation of such structural data for the first time. Over two years, XChem has generated thousands of high-quality 3D protein-ligand structures on more than 30 biomolecule targets. Crucially, it is now conceivable to generate systematic datasets (e.g. exploring the effect of small chemical alterations on binding and the role of solvation) at atomistic resolution.
In this fellowship, I will build such a systematic dataset on five protein targets involving hundreds of novel experimentally determined protein-ligand structures. I will do so by combining novel computational tools with the breakthrough XChem facility for high-throughput protein-ligand X-ray crystallography. Specifically, I will build on the small molecule Astex Graph Database that connects experimental XChem hits with easily synthesised molecules provided by vendors and collaborators. These connections will be used to design future experiments that explore protein-ligand binding systematically but in a feasible manner. I will then combine the experimental protein-ligand interaction data and computational energetics methods with the small molecule data in this Graph Database. Finally, I will use this comprehensive and connected Graph Database to design automated routes for compound optimisation using structural data.

Planned Impact

National Importance: I aim in the course of this project to develop two things. First, I will establish a world-first systematic and invaluable dataset of protein-ligand interactions. Second, I will develop novel computational tools that will automate aspects of small molecule design, improving the productivity of this sector. The fellowship will make direct use of my interdisciplinary background (Chemistry, Statistics and X-ray crystallography) and existing industrial collaborations to drive experimentally validated computational compound design.
This fellowship will apply EPSRC funded research in Physics, Chemistry and Statistics to life-science problems (Technology Touching Life). This work falls under the Industrial Strategy areas "High Productivity Services through Specialised Artificial Intelligence" and "interface with biotechnology and biological sciences" along with "Robotics and artificial intelligence systems" and "new approaches to data science". I aim to tackle both the Healthy Nation and Productive Nation aspects of the EPSRC's delivery plan: the former by helping to develop treatments and validated disease targets in areas of unmet need (e.g. Dementia, AMR), and the latter by improving the productivity of the UK pharmaceutical industry and thus protecting valuable life-sciences jobs in the UK.

Industrial route to impact: The industrial target of this work is all companies engaged in structure-based drug design (most major pharma and some SMEs). Work will be carried out in continued collaboration with GSK computational chemists (Darren Green, Director of Computational Chemistry). I will also use the fellowship to work with collaborators at BenevolentAI (Nathan Brown, Head of Cheminformatics), UCB (Jiye Shi, Global Head of CADD) and Roche (Jerome Hert, Section Head CADD). By engaging with end-users I can ensure the work is broadly applicable. Ultimately I would envisage such collaborators as customers for any software products generated by this work. Accordingly, I will seek the advice of Oxford University Innovation regarding licensing and IP for this work.

Academic route to impact: The groups who use my framework and data will benefit greatly from my work. These include both AI and Molecular Dynamics groups, who do not currently have access to such systematic data. To maximise this impact, I will generate open-source, modular code that other parties can readily re-use, and I will continue to work closely with leading open-source projects such as RDKit. I will openly release the data in queryable databases with web interfaces.
Biology groups will also benefit from the method, through improved development of chemical probes. I will work with the ~30 academic biology groups who use XChem every year as a focus for tool development. They will also help maximise my methods' exposure by publishing their own experimental results.
Finally, chemistry groups will benefit from improved understanding of which chemistry methods are needed to better explore protein-ligand interactions.

Public Engagement: I will continue to train future generations of scientists and non-scientists through public engagement and training events. I will also release my data and findings openly through the Extreme Open Science Initiative, currently being developed at the SGC, aimed at scientists and non-scientists alike. Finally, I will present the important work of SBDD at open days for the SGC and University of Oxford. I will engage with the MPLS Public Engagement Facilitator (Michaela Livingstone) to help develop my public engagement activities.

