Exploiting Differentiable Programming Models For Protein Structure Prediction And Modelling

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Computer Science

Abstract

Proteins are molecules present in every cell that carry out essential biological processes. These molecules are essentially strings of simpler chemicals, called amino acids, and these strings are able to self-assemble into a unique 3-D structure as soon as the protein is made by the cell's protein-making machinery (called ribosomes). It is this unique structure that determines the function of the protein (i.e. what it does in the cell and how it does it). By shining X-rays on crystallised proteins, scientists can determine their structure by looking at how the rays reflect off the layers of atoms that make up the crystal. However, this process can take many months or even years of effort. With hundreds of thousands of proteins for which the native structure is unknown, it is not surprising that scientists are keen to find clever shortcuts to working out the structure of proteins. We, like many other scientists, have been trying to decipher the so-called protein folding "code", i.e. trying to work out the rules which govern how the protein finds its unique structure, and then trying to program a computer with these rules to allow scientists to quickly "predict" what the structure of their protein of interest might be.

Although the shape or "fold" of a single protein is an important piece of information, it is arguably even more useful to determine which proteins interact with a given protein of interest, and the geometry of these so-called protein complexes, i.e. groups of proteins which have evolved to stick together in a very specific way. Good examples of such complexes are found in many areas of biology and medicine. For example, a number of different protein complexes play a crucial role in controlling how blood clots. In general, protein-protein complexes underlie our whole understanding of how cells and organisms operate as "systems", a field known as "systems biology". Unfortunately, experimentally studying the structure of a protein complex is even more difficult than studying the structure of a single protein, and so scientists have an urgent need for better computational tools to allow them to predict which proteins could interact and the likely overall shape of the complex that they form.

In this project, we propose to exploit some recent breakthroughs in computing and artificial intelligence to allow us to deduce which parts of proteins are likely to interact and the structures of the complexes that they form when they do. In a nutshell, we start by looking for pairs of amino acids that appear to change in synchrony when we look at the different versions of the proteins found in different organisms, i.e. we look for cases where a change in one amino acid always seems to occur when we see another amino acid changing. These linked changes are called "correlated mutations" and when we find them, we can be reasonably sure that the two amino acids have evolved to be close together in 3-D space in the final folded form of the protein. If we find enough correlated mutations, we can even go as far as predicting the complete structure of the protein and, we hope, as far as predicting the structure of a protein-protein complex in a similar way. To do this we will use a new approach to writing computer software called "differentiable programming". This means that our computer programs are treated like mathematical formulae which can be improved by applying basic rules of calculus. In this way, the accuracy of our methods can be automatically improved as more data is obtained to optimize the algorithms.
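
As a rough illustration of the "correlated mutations" idea described above, the sketch below scores every pair of alignment columns by mutual information and ranks them. This is a minimal sketch under stated assumptions, not the method used in this project: raw mutual information is a simple stand-in that does not separate direct from indirect correlations, and the toy alignment and all function names (`column`, `mutual_information`, `rank_column_pairs`) are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in count_ab.items():
        p_joint = count / n
        p_independent = (count_a[a] / n) * (count_b[b] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


def rank_column_pairs(msa):
    """Score every pair of columns and sort from most to least co-varying."""
    length = len(msa[0])
    scores = [
        ((i, j), mutual_information(column(msa, i), column(msa, j)))
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up toy alignment: positions 1 and 3 always change together
# (K pairs with E, R pairs with D); position 4 varies independently.
TOY_MSA = [
    "AKGEL",
    "AKGEL",
    "ARGDL",
    "ARGDL",
    "AKGEV",
    "ARGDV",
]

if __name__ == "__main__":
    for (i, j), score in rank_column_pairs(TOY_MSA)[:3]:
        print(f"columns {i} and {j}: {score:.3f} bits")
```

In this toy case the pair of positions 1 and 3 ranks first, which is the kind of signal that would be read as a hint that the two residues sit close together in the folded structure.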

Technical Summary

This proposal is aimed at extending some of the exciting recent developments in protein structure prediction and modelling, where deep learning is applied to sequence data, in the form of multiple sequence alignments, in order to extract structural constraints to guide protein modelling and simulation. Following on from the latest widely reported developments surrounding DeepMind's AlphaFold, the key idea here is to make use of similar concepts, primarily the highly attractive idea of combining differentiable programming and end-to-end model optimization (Differentiable Molecular Simulation, DMS), to tackle next-level problems in protein structure prediction and modelling. By seeing how far these general concepts can be extended to the harder and perhaps less well-defined problems in protein structure, such as modelling protein-protein interactions and natively disordered protein domains, for which data is not so abundant, we will not only gain a deeper understanding of the effectiveness (and limitations) of DMS across a wider range of protein modelling tasks, but also develop new practical and useful tools to allow biologists to computationally analyse more of the unknown proteome space than is currently possible.
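
To make the idea of optimizing a simulation end to end more concrete, here is a toy sketch of differentiating through a simulation. It is an illustration under loud assumptions, not the project's simulator: the "molecule" is a single particle on a one-dimensional spring, the learnable parameter is the spring's rest position, and derivatives are carried by a hand-written forward-mode dual number class (`Dual`), whereas deep learning frameworks normally use reverse-mode automatic differentiation. All names and constants are hypothetical.

```python
class Dual:
    """A value together with its derivative w.r.t. one chosen parameter."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        """Wrap plain numbers as constants (derivative zero)."""
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def simulate(rest_position, start=0.0, stiffness=1.0, dt=0.1, steps=50):
    """Overdamped motion of one particle pulled towards rest_position."""
    x = start
    for _ in range(steps):
        force = -stiffness * (x - rest_position)
        x = x + dt * force
    return x


def loss(rest_position, target):
    """Squared distance between the simulated end point and the target."""
    diff = simulate(rest_position) - target
    return diff * diff


def fit_rest_position(target, initial=0.0, learning_rate=0.25, iterations=100):
    """Gradient descent on the parameter, differentiating through the run."""
    r = initial
    for _ in range(iterations):
        out = loss(Dual(r, 1.0), target)  # seed d(r)/d(r) = 1
        r -= learning_rate * out.deriv
    return r


if __name__ == "__main__":
    fitted = fit_rest_position(target=1.5)
    print(f"fitted rest position: {fitted:.4f}")
    print(f"simulated end point:  {simulate(fitted):.4f}")
```

Because every arithmetic step of `simulate` propagates a derivative, the whole simulation behaves like one differentiable formula, so the parameter can be tuned directly against the mismatch between the simulated and the desired outcome.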

Funded Value:

£406,419

Funded Period:

Jul 22 - Jul 25

Funder:

BBSRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

BB/W008556/1

Principal Investigator:

David Jones

Research Subject:

Tools, technologies & methods (96%)

Research Topic:

Bioinformatics (24%)

Theoretical biology (48%)

eScience (24%)

Organisations

UNIVERSITY COLLEGE LONDON (Lead Research Organisation)

People	ORCID iD
David Jones (Principal Investigator)

Publications


Kandathil SM (2023) Machine learning methods for predicting protein structure from single sequences. in Current opinion in structural biology

Lau A (2023) Merizo: a rapid and accurate domain segmentation method using invariant point attention

Lau AM (2024) Exploring structural diversity across the protein universe with The Encyclopedia of Domains. in Science (New York, N.Y.)

Lau AM (2023) Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. in Nature communications

Waman VP (2025) CATH v4.4: major expansion of CATH by experimental and predicted structural data. in Nucleic acids research

Key Findings
Research Databases and Models


Description	Our research focused on understanding how different parts of proteins interact, which is crucial for their function in the body. Initially, we used a physics-based simulation approach, but as the project progressed, we switched to using AI, specifically transformer-based language models (similar to those used in tools like ChatGPT) and diffusion models. This shift allowed us to predict how protein segments interact more accurately and efficiently. By analyzing large datasets of protein structures, our models identified key patterns in how proteins fold and bind together, even in complex cases where traditional methods struggled. This breakthrough could help scientists better understand diseases, design new drugs, and develop better treatments by predicting how proteins behave at a molecular level.
Exploitation Route	The outcomes of this research provide a strong foundation for future advancements in protein structure prediction, drug discovery, and biotechnology. The transformer-based models developed in this project can be further refined and integrated into existing protein modeling tools, enabling researchers to better predict how proteins fold and interact, which is critical for understanding diseases and designing targeted therapies. Pharmaceutical companies and biotech firms could use these models to accelerate drug development by simulating protein interactions with potential drug compounds, reducing the need for costly and time-consuming lab experiments. Additionally, the insights gained into domain-domain interactions and disordered protein regions could aid in studying conditions like neurodegenerative diseases and cancer, where protein misfolding plays a key role. Open access to the methods and tools developed in this project would allow the broader scientific community to build on these findings, improving protein engineering techniques, enhancing synthetic biology applications, and expanding our understanding of the proteome. Although we shifted towards transformer-based language models for protein interaction predictions, the original simulator remains available for use. Researchers interested in differentiable molecular simulation (DMS) can still access and build upon the framework developed in this project. The code has been improved and is now compatible with a wider range of computing hardware, with much easier installation scripts. We still believe that the DMS approach may prove valuable for specific applications, such as studying protein dynamics, refining structural models, or simulating disordered regions where physics-based approaches offer unique insights. This ensures that both AI-driven and simulation-based methodologies remain accessible, providing flexibility for future research and applications.
Sectors	Pharmaceuticals and Medical Biotechnology


Title	Foldclass databases for protein structural domains in CATH and TED
Description	This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3. Foldclass and Merizo-search use two formats for databases. The default format uses a PyTorch tensor and a pickled list of Python tuples to store the data. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library. The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, the database should be stored on the fastest storage hardware available to you. IMPORTANT: We recommend going into each folder and downloading the files individually; if you attempt to download a whole folder in one go, it will be delivered as a zip file which then needs to be decompressed. This is particularly an issue when downloading the TED database, as you will need roughly twice the storage space compared with downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://rdr.ucl.ac.uk/articles/dataset/Foldclass_databases_for_protein_structural_domains_in_CATH_an...
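
The storage guidance in the database description above can be turned into a quick pre-download check. The sketch below is only an illustration and is not the convenience script from the project's GitHub repository: the function names are hypothetical, the sizes are the approximate figures quoted in the description, the factor of two for zipped folder downloads follows the "roughly twice" estimate given there, and treating "GB" as decimal gigabytes is an assumption.

```python
import shutil

GB = 10**9  # assumption: the quoted sizes are decimal gigabytes

# Approximate on-disk sizes quoted in the database description.
DATABASE_SIZE_GB = {"cath": 1.4, "ted": 885}


def required_bytes(name, as_zip=False):
    """Space needed for a database; zipped folder downloads need about twice as much."""
    size = DATABASE_SIZE_GB[name] * GB
    return round(size * 2) if as_zip else round(size)


def has_room(path, name, as_zip=False, free_bytes=None):
    """Check whether the filesystem holding `path` can take the download.

    `free_bytes` can be passed explicitly (e.g. for testing); otherwise
    the free space is queried with shutil.disk_usage.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes(name, as_zip)


if __name__ == "__main__":
    for name in DATABASE_SIZE_GB:
        for as_zip in (False, True):
            mode = "zipped folder" if as_zip else "individual files"
            verdict = "ok" if has_room(".", name, as_zip) else "not enough space"
            print(f"{name} ({mode}): {verdict}")
```

Running the check against the intended download directory before starting is cheap, and it makes the difference between the two download routes explicit for the much larger TED database.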
