Exploiting Differentiable Programming Models For Protein Structure Prediction And Modelling

Lead Research Organisation: University College London
Department Name: Computer Science

Abstract

Proteins are molecules present in every cell that carry out essential biological processes. These molecules are essentially strings of simpler chemicals called amino acids, and these strings are able to self-assemble into a unique 3-D structure as soon as the protein is made by the cell's protein-making machinery (called ribosomes). It is this unique structure that determines the function of the protein (i.e. what it does in the cell and how it does it). By shining X-rays on crystallised proteins, scientists can determine their structure by looking at how the rays scatter off the layers of atoms that make up the crystal. However, this process can take many months or even years of effort. With hundreds of thousands of proteins for which the native structure is unknown, it is not surprising that scientists are keen to find clever shortcuts to working out the structure of proteins. We, like many other scientists, have been trying to decipher the so-called protein folding "code", i.e. trying to work out the rules that govern how a protein finds its unique structure, and then trying to program a computer with these rules so that scientists can quickly "predict" what the structure of their protein of interest might be.

Although the shape or "fold" of a single protein is an important piece of information, it is arguably even more useful to determine which proteins interact with a given protein of interest, and the geometry of the resulting protein complexes, i.e. groups of proteins which have evolved to stick together in a very specific way. Good examples of such complexes are found in many areas of biology and medicine. For example, a number of different protein complexes play a crucial role in controlling how blood clots. In general, protein-protein complexes underlie our whole understanding of how cells and organisms operate as "systems", a field known as "systems biology". Unfortunately, experimentally studying the structure of a protein complex is even more difficult than studying the structure of a single protein, and so scientists have an urgent need for better computational tools to allow them to predict which proteins could interact and the likely overall shape of the complex that they form.

In this project, we propose to exploit some recent breakthroughs in computing and artificial intelligence to allow us to deduce which parts of proteins are likely to interact and the structures of the complexes that they form when they do. In a nutshell, we start by looking for pairs of amino acids that appear to change in synchrony when we compare the different versions of the proteins found in different organisms, i.e. we look for cases where a change in one amino acid always seems to occur when we see another amino acid changing. These linked changes are called "correlated mutations" and when we find them, we can be reasonably sure that the two amino acids have evolved to be close together in 3-D space in the final folded form of the protein. If we find enough correlated mutations, we can even go as far as predicting the complete structure of the protein, and we hope to predict the structure of a protein-protein complex in a similar way. To do this we will use a new approach to writing computer software called "differentiable programming". This means that our computer programs are treated like mathematical formulae which can be improved by applying basic rules of calculus. In this way, the accuracy of our methods can be automatically improved as more data is obtained to optimize the algorithms.
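
As a rough illustration of the correlated-mutation idea, the sketch below scores every pair of columns in a small alignment by their mutual information, a simple statistic that is high when two positions change together. This is not the method proposed in the project, which is not specified here; mutual information is used only as an easy-to-follow stand-in, and the function names and the six-sequence toy alignment are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def rank_covarying_pairs(alignment):
    """Return ((i, j), score) for all column pairs, highest score first."""
    columns = list(zip(*alignment))  # transpose: one tuple per position
    scores = [
        ((i, j), mutual_information(columns[i], columns[j]))
        for i, j in combinations(range(len(columns)), 2)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy alignment (made up): each row is the same protein from a different
# organism. Positions 1 and 3 always change together (K with E, R with D),
# while position 2 varies independently of both.
toy_alignment = [
    "AKLEG",
    "AKLEG",
    "ARLDG",
    "ARLDG",
    "AKIEG",
    "ARIDG",
]

ranked = rank_covarying_pairs(toy_alignment)
for (i, j), score in ranked[:3]:
    print(f"positions {i} and {j}: {score:.3f} bits")
```

In this toy case the pair of positions that always mutate together receives the highest score, which is the kind of signal that would then be read as evidence of spatial proximity. Real alignments contain thousands of sequences and need corrections for shared ancestry and indirect couplings, which this sketch ignores.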

Technical Summary

This proposal aims to extend some of the exciting recent developments in protein structure prediction and modelling, where deep learning is applied to sequence data, in the form of multiple sequence alignments, in order to extract structural constraints to guide protein modelling and simulation. Following on from the latest widely reported developments surrounding DeepMind's AlphaFold, the key idea here is to make use of similar concepts, primarily the highly attractive idea of combining differentiable programming and end-to-end model optimization (Differentiable Molecular Simulation, DMS), to tackle next-level problems in protein structure prediction and modelling. We will test how far these general concepts can be extended to the harder and perhaps less well-defined problems in protein structure, such as modelling protein-protein interactions and natively disordered protein domains, for which data is not so abundant. In doing so, we will not only gain a deeper understanding of the effectiveness (and limitations) of DMS across a wider range of protein modelling tasks, but also develop new practical and useful tools to allow biologists to computationally analyse more of the unknown proteome space than is currently possible.
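
To make the idea of end-to-end optimization through a simulation concrete, the following minimal sketch differentiates a loss through every step of a toy simulation and uses that gradient to tune a model parameter. Everything in it is an assumption made for illustration: the "simulation" is a single harmonic spring relaxing in one dimension, the `Dual` class is a hand-written forward-mode automatic differentiation helper standing in for a real framework, and the 3.8 target is only an example value (roughly the distance in angstroms between consecutive C-alpha atoms). It does not represent the project's actual models or software.

```python
class Dual:
    """A number that carries its derivative w.r.t. one chosen parameter."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def simulate(rest_length, start=1.0, stiffness=1.0, step=0.1, n_steps=20):
    """Relax the distance between two particles joined by a harmonic spring."""
    distance = start
    for _ in range(n_steps):
        force = -stiffness * (distance - rest_length)
        distance = distance + step * force
    return distance


def loss_and_gradient(rest_length, target):
    """Squared error of the simulated distance, and d(loss)/d(rest_length)."""
    final = simulate(Dual(rest_length, 1.0))  # seed derivative of 1
    error = final - target
    loss = error * error
    return loss.value, loss.deriv


def fit_rest_length(target, initial=1.0, learning_rate=0.5, n_iterations=50):
    """Gradient descent on the spring parameter, through the simulation."""
    rest_length = initial
    for _ in range(n_iterations):
        _, gradient = loss_and_gradient(rest_length, target)
        rest_length -= learning_rate * gradient
    return rest_length


target_distance = 3.8  # illustrative target value
fitted = fit_rest_length(target_distance)
print("fitted rest length:", fitted)
print("simulated distance:", simulate(fitted))
```

The point of the sketch is that the parameter is adjusted so that the outcome of the whole simulation matches the target, rather than being set by hand. In practice an automatic differentiation library would replace the hand-written `Dual` class, and the simulated system would have many particles and many parameters.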
