Developing new methods to enable amino acid co-evolution algorithms to be applied to protein-protein interaction prediction

Lead Research Organisation: University College London
Department Name: Computer Science

Abstract

Proteins are molecules present in every cell that carry out essential biological processes. These molecules are essentially strings of simpler chemicals, called amino acids, and these strings are able to self-assemble into a unique 3-D structure as soon as the protein is made by the cell's protein-making machinery (called ribosomes). It is this unique structure that determines the function of the protein (i.e. what it does in the cell and how it does it). By shining X-rays on crystallised proteins, scientists can determine their structure by looking at how the rays reflect off the layers of atoms that make up the crystal. However, this process can take many months or even years of effort. With hundreds of thousands of proteins for which the native structure is unknown, it is not surprising that scientists are keen to find clever shortcuts to working out the structure of proteins. We, like many other scientists, have been trying to decipher the so-called protein folding "code", i.e. trying to work out the rules which govern how the protein finds its unique structure, and then trying to program a computer with these rules to allow scientists to quickly "predict" what the structure of their protein of interest might be.

Although the shape or "fold" of a single protein is an important piece of information, it is arguably even more useful to determine which proteins interact with a given protein of interest, and the geometry of these so-called protein complexes, i.e. groups of proteins which have evolved to stick together in a very specific way. Good examples of such complexes are found in many areas of biology and medicine. For example, a number of different protein complexes play a crucial role in controlling how blood clots. In general, protein-protein complexes underlie our whole understanding of how cells and organisms operate as "systems" - which is a field known as "systems biology". Unfortunately, experimentally studying the structure of a protein complex is even more difficult than studying the structure of a single protein, and so scientists have an urgent need for better computational tools to allow them to predict which proteins could interact and the likely overall shape of the complex that they form.

In this project, we propose to exploit some recent breakthroughs in understanding how protein sequences evolve to allow us to deduce which pairs of proteins might interact and the structures of the complexes that they form. In a nutshell, we look for pairs of residues that appear to change in synchrony when we look at the different versions of the proteins found in different organisms, i.e. we look for cases where a change in one amino acid always seems to occur when we see another amino acid changing. These linked changes are called "correlated mutations" and when we find them, we can be reasonably sure that the two amino acids have evolved to be close together in 3-D space in the final folded form of the protein. If we find enough correlated mutations, we can even go as far as predicting the complete structure of the protein and, we hope, the structure of a protein-protein complex in a similar way.
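
As a purely illustrative sketch of what "changing in synchrony" means, the Python below measures the mutual information between two columns of a small, made-up alignment (one row per organism). This is a deliberately simple stand-in: PSICOV itself uses sparse inverse covariance estimation rather than raw mutual information, and the function names and toy sequences here are assumptions introduced only for this example.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy alignment: the same protein in six organisms.
toy_alignment = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "GKVE",
    "GRVD",
]

linked = mutual_information(column(toy_alignment, 1), column(toy_alignment, 3))
unlinked = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
```

In this toy data the second and fourth positions always change together (K with E, R with D), so `linked` is 1 bit, whereas the first and second positions vary independently and `unlinked` is 0. Real alignments need thousands of sequences and corrections for phylogenetic bias and indirect coupling before such a signal can be trusted.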

Technical Summary

We propose to develop new algorithms to extend our recent highly successful work (PSICOV) on identifying co-evolving sites in large multiple sequence alignments to the problem of predicting interacting sites in separate proteins. This will allow us to produce a Web-based tool which can predict novel protein-protein interactions, and also to generate 3-D models of any putative complexes where templates can be found for the individual subunits or domains.

The main challenge addressed in the proposal is the problem of how to compute amino acid covariation between two separate alignments, which is the main bottleneck in successfully applying covariance methods to predicting protein-protein interactions. Although in theory there is absolutely no obstacle to extending covariance methods to separate alignments, the practical obstacles are substantial. The only way to get sufficiently accurate covariance data is to ensure that there is accurate species and orthology equivalence between each row in the two alignments. In other words, the phylogenetic origin of sequence N in Alignment 1 must be identical to that of sequence N in Alignment 2, and so on for all sequences. Unfortunately, there are always differing numbers of homologues in the alignments due to incomplete genomes, difficulty in assigning orthologues and so forth. To solve this we propose a two-stage process. First, we will label pairs of sequences between the two alignments where equivalence can be decided from data bank annotations. We will then extend this core alignment by maximising the overall mutual information score between the two alignments, to maximise the likelihood of observing covarying sites between the two proteins.
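
The two-stage idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm to be developed: each alignment is assumed to be a list of hypothetical `(species, sequence)` tuples, "equivalence from annotation" is approximated by a species occurring exactly once in both alignments, and the extension step is a simple greedy choice that adds, for each ambiguous species, the candidate pair giving the highest summed inter-protein mutual information.

```python
from collections import Counter, defaultdict
from itertools import product
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    mi = 0.0
    for (a, b), n_ab in Counter(zip(col_a, col_b)).items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def inter_mi_score(pairs):
    """Sum MI over every (column of protein 1, column of protein 2)."""
    cols1 = list(zip(*(s1 for s1, _ in pairs)))
    cols2 = list(zip(*(s2 for _, s2 in pairs)))
    return sum(mutual_information(c1, c2) for c1, c2 in product(cols1, cols2))


def group_by_species(alignment):
    groups = defaultdict(list)
    for species, seq in alignment:
        groups[species].append(seq)
    return groups


def pair_alignments(aln1, aln2):
    """Return (core, paired): the unambiguous core and its greedy extension."""
    g1, g2 = group_by_species(aln1), group_by_species(aln2)
    shared = sorted(set(g1) & set(g2))
    unambiguous = [sp for sp in shared if len(g1[sp]) == 1 and len(g2[sp]) == 1]
    # Stage 1: one sequence per species in both alignments.
    core = [(g1[sp][0], g2[sp][0]) for sp in unambiguous]
    # Stage 2: for each ambiguous species, keep the candidate pair that
    # maximises the overall inter-alignment MI score (greedy assumption).
    paired = list(core)
    for sp in shared:
        if sp in unambiguous:
            continue
        best = max(product(g1[sp], g2[sp]),
                   key=lambda cand: inter_mi_score(paired + [cand]))
        paired.append(best)
    return core, paired


# Hypothetical toy data: "fly" has two candidate partners in alignment 2,
# and "worm" is missing from alignment 1 so it cannot be paired at all.
aln1 = [("human", "AK"), ("mouse", "AK"), ("yeast", "GR"),
        ("ecoli", "GR"), ("fly", "AK")]
aln2 = [("human", "LE"), ("mouse", "LE"), ("yeast", "VD"),
        ("ecoli", "VD"), ("fly", "LE"), ("fly", "VD"), ("worm", "LE")]

core, paired = pair_alignments(aln1, aln2)
```

Here the four unambiguous species form the core, and the fly sequence `AK` is matched with `LE` rather than `VD` because that choice preserves the covariation already seen in the core. Raw mutual information is biased upwards for small samples, so a practical implementation would need a corrected score and a less greedy search.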

Once the core algorithms have been implemented, a Web-based tool will be released to allow users to construct large accurate paired alignments, and to use these alignments to predict interaction maps between proteins and to carry out contact-constrained rigid-body protein-protein docking.
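
To illustrate what "contact-constrained" docking could mean in its simplest form, the sketch below scores candidate rigid-body placements of one subunit by counting how many predicted inter-protein contacts end up within a distance cutoff. Everything here is an assumption made for illustration: the function names, the 8 Å cutoff, the restriction to rotations about a single axis, and the toy coordinates. A real docking protocol would also search full 3-D rotations and penalise steric clashes.

```python
from itertools import product
from math import cos, dist, radians, sin


def rotate_z_and_shift(coords, angle_deg, shift):
    """Rotate coordinates about the z axis, then translate them."""
    a = radians(angle_deg)
    ca, sa = cos(a), sin(a)
    dx, dy, dz = shift
    return [(ca * x - sa * y + dx, sa * x + ca * y + dy, z + dz)
            for x, y, z in coords]


def satisfied_contacts(coords1, coords2, contacts, cutoff=8.0):
    """Count predicted contacts (i in protein 1, j in protein 2) within cutoff."""
    return sum(1 for i, j in contacts
               if dist(coords1[i], coords2[j]) <= cutoff)


def best_pose(coords1, coords2, contacts, angles, shifts):
    """Return the (angle, shift) placement of protein 2 satisfying most contacts."""
    return max(
        product(angles, shifts),
        key=lambda pose: satisfied_contacts(
            coords1, rotate_z_and_shift(coords2, *pose), contacts),
    )


# Hypothetical toy subunits: two residues each, coordinates in Angstroms.
protein1 = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
protein2 = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
predicted_contacts = [(0, 0), (1, 1)]  # assumed output of a covariation method

pose = best_pose(protein1, protein2, predicted_contacts,
                 angles=[0, 180],
                 shifts=[(0.0, 6.0, 0.0), (30.0, 0.0, 0.0)])
```

In this toy case the unrotated placement 6 Å away satisfies both predicted contacts, so it is preferred over the flipped or distant placements.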

Planned Impact

SUMMARY OF PROJECT

This proposal is to build a Web-based tool to allow bioscientists to merge multiple sequence alignments for different protein families and, from these data, to predict novel protein-protein interactions and to dock proteins together.

COMMUNITY IMPACT

Predicting protein-protein interactions is a key component in understanding how biological systems work at a molecular level. Every biological network or reaction pathway involves interactions between proteins. Being able to determine which proteins in a system interact, and to intervene in these interactions, could have wide implications in a variety of BBSRC areas involving systems biology in the broadest sense. A few examples are as follows:

Food security - Increasingly the sequences of plants, agricultural pests and agents of disease are the focus of genome sequencing and structural studies. Interactions between plant proteins and pathogen proteins are key to many aspects of this research area. As our methodology only requires sequence data, this should allow novel leveraging of high throughput sequence data in the food security theme area.

Bio-energy and bio-industry - The manipulation of individual molecules and pathways will yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Advance knowledge of structural information relating to protein-protein complexes can be used to suggest the critical changes needed to alter function.

Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Almost every conceivable drug-protein interaction involves protein complexes, rather than individual protein chains. The focus of this project on building novel tools in this area will thus be especially beneficial.

POLICY MAKERS AND THE LAY PUBLIC

This project can serve as an excellent example to policy makers and the lay public of the high impact that computational biology projects can achieve relative to their low costs. For example, by looking at citation data it is easy to show how many different experimental projects, in a wide variety of areas, critically depend on the availability of computational tools similar to the ones outlined in this proposal. This project could also help underline the importance of the internet and "Big Data" in future government policy making.

Publications

Description:
1. A new Web-based predictor for amino acid covariation-based contact prediction has been developed. This server implements a new method called MetaPSICOV, which combines different statistical models of amino acid covariation (PSICOV, EVfold, CCMpred) using a large neural network. The network also makes use of other data such as secondary structure information and amino acid sequence profiles to achieve state-of-the-art performance.

2. Later work has resulted in a new tool for automatically generating merged alignments between two protein families. This allows an adaptation of our MetaPSICOV method (MetaPSICOV-PP) to be applied to the problem of protein-protein interaction prediction. A paper describing this work is in preparation.
Exploitation Route: The software to implement MetaPSICOV is freely available in open source form, and so can easily be used or adapted by other users. The Web server makes the methodology available to any scientist with just a Web browser.
Sectors: Pharmaceuticals and Medical Biotechnology

 
Description: The software developed in this project has recently been added to our general workbench of tools that can be used by both academics and commercial users. So far we are seeing around 40 MetaPSICOV jobs per month from users in the commercial sector (judged by IP address/domain name). The training and production of skilled research staff with appropriate transferable skills is probably the other most significant delivered item of impact from this project. Dr Tanya Singh has moved on to an industry position (IBM Research in Manchester) and thus is directly contributing to the economy thanks to training received during this grant (her first postdoctoral post).
First Year Of Impact: 2016
Sector: Pharmaceuticals and Medical Biotechnology
Impact Types: Economic

 
Description: Horizon 2020
Amount: €2,433,679 (EUR)
Funding ID: 695558
Organisation: European Research Council (ERC)
Sector: Public
Country: Belgium
Start: 11/2016
End: 10/2021