Computational identification of protein-protein interactions

Lead Research Organisation: University of Manchester
Department Name: Life Sciences


Proteins are extremely important biological molecules. In addition to numerous vital structural roles, they are responsible for the majority of active biochemical functions and molecular processes within living cells. Nearly all proteins work as components of a biological system by binding other molecules, and most function in concert with others, as 'molecular machines' or in elegant 'production lines', such as signalling pathways, to carry out complex biological functions. These protein interactions are also important in combating foreign proteins, such as those from a viral infection. Approximately 60% of proteins take part in some kind of protein assembly or 'complex'. These protein complexes play a role in the majority of cellular processes, and modern biology is now able to build the connecting parts list of cellular protein interactions via genomic and post-genomic science. However, in the majority of cases, we don't understand how the various proteins recognise their specific partners. What we do know is that in order to form complexes, individual proteins must make contact with ('bind') a limited number of specific partners. It is the rules that control this 'specificity' for binding that we propose to investigate. Binding in complexes is the result of specific contacts in the context of proteins' three-dimensional structures. We propose to determine the key regions for binding (termed 'interfaces') and to distinguish them from non-binding regions. The strength of inferred interactions within the interface regions may help determine which amino acids are most important for binding. To achieve our goal of computationally identifying protein binding interfaces, we propose to develop sophisticated computational methods that describe how evolution at interfaces differs from that occurring at non-interacting sites on proteins. These models will look for correlations in evolution at specific sites.
We will examine sequence data taken from a range of interacting and non-interacting proteins to develop a sophisticated and rigorous model to explain this evolutionary process. By iteratively improving and simplifying this substitution model we will progressively improve our ability to discriminate between interacting and non-interacting positions, enabling us to better identify both interacting proteins and the specific interfaces by which they interact. The resultant model will provide a powerful new computational tool for studying biological systems, which until now has been lacking in the field. By using phylogenetic methods that are founded on established statistical methodology, we will bring a new degree of rigour to this type of analysis and make the best possible use of information held within our sequences. We will apply the tool to investigate interaction networks in yeast, identifying new potential interactions and errors in experimental methods. We will work with experimental collaborators to confirm these computational inferences and further improve our models.

Technical Summary

We propose to study the fundamental biological processes of specificity of binding in protein-protein interactions; these processes underpin systems biology and cellular interaction networks. We will deliver a new computational tool for testing for the presence of an interaction between a pair of protein sequences by examining their sequence alignments. We will develop a sophisticated computational approach, grounded in the statistical models used in phylogenetics, to identify residues involved in protein-protein interactions. Recent research has shown that coevolution across protein interfaces is a powerful predictor of protein-protein interactions when applied at the residue level, but not over whole sequences. However, relatively little research has been performed on site-specific approaches for inferring interactions, and there are plenty of opportunities for improving their power. As has been found in nearly all previous phylogenetic research, a more realistic substitution model, coupled with careful development of heuristics, will substantially improve inference. We will develop an intermolecular coevolution model based on empirical observations about what occurs at protein binding interfaces. Relaxing the assumption of independence of evolution at different sites will allow us to account accurately for coevolution and to create a predictive model specifically applicable at the residue level. We will combine our new methodology with appropriate heuristics to investigate the set of interactions occurring in yeast, and compare it to what has been inferred by experimental high-throughput approaches. Our results will be used to formulate hypotheses that will be tested in the laboratory by our collaborators.

Planned Impact

The proposed research is basic science and its outcomes would deliver a powerful new computational tool for inferring protein binding sites and protein-protein interactions. In common with all basic science, the end beneficiaries outside the academic sector cannot be easily predicted. However, the relevance of the proposal to current BBSRC strategies suggests that our proposal lies in an area that is likely to see substantial growth in the near future. Our proposed research could result in a crucial tool in systems biology, structural biology, and comparative genomics. Research from these fields is already being picked up by the private sector for use in the biotechnology and pharmaceutical sectors, and has potential for adding value to commercial items conceived or created in the UK. The main benefits for the public and third sector are the intellectual insight gained and the potential for applications of our tools to help guide future developments and policy. Our Impact Plan details how we will communicate the ideas of our research to a wide and appropriate base of people. Between them, SW and SL have extensive experience of disseminating their work through standard academic channels, such as presentation at conferences and publication of research papers, and by working with the private sector (Pfizer), the popular media, and the internet.
Description Recently-developed methods to study amino acid covariation within protein sequences have led to a resurgence of interest in covariation, with the promise of wide application to problems as diverse as de novo protein structure prediction, analysis of protein complexes, and protein design.

We investigated the adequacy of the critical assumption of covariation methods: That measures of covariation capture correlated changes, which occur as a consequence of molecular coevolution.

Methods that search for covariation between sites in a tree-independent manner tend to be less computationally demanding than explicit coevolutionary models. However, because most covariation methodologies are tree-independent and do not include an explicit model of sequence change, it is very difficult to assess whether their results are evolutionarily sound. Specifically, it is difficult to know whether the covariation observed within sequence alignments arises from coevolution or through some other mechanism.
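
As an illustration of what a tree-independent covariation measure looks like, the following minimal sketch computes mutual information between two alignment columns. Mutual information is used here only as a familiar example of such a measure; it is not necessarily one of the methods assessed in this project. The function names and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Return the residues found at one alignment position."""
    return [seq[index] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    The sequences are treated as independent samples: no tree is used.
    """
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


# Hypothetical toy alignment: positions 0 and 1 vary together,
# position 2 is invariant.
toy_alignment = ["AKG", "AKG", "AKG", "AKG", "DRG", "DRG", "DRG", "DRG"]

mi_01 = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
mi_02 = mutual_information(column(toy_alignment, 0), column(toy_alignment, 2))
print(mi_01, mi_02)
```

Because the calculation never consults a phylogeny, the score for positions 0 and 1 is the same whether the pattern arose from many correlated changes or from a single change per site inherited by four descendants.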

Our results provide several lines of evidence that undermine the assumption that the covariation signals widely detected are, in fact, due to molecular coevolution. We showed that covariation can occur both as a consequence of correlated changes resulting from molecular coevolution and as the result of rare independent changes at conserved sites. By using real data sets to examine patterns of change on evolutionary trees, we found that the signal detected by covariation methods tends to arise from a small number of independent changes at highly conserved sites rather than the correlated changes expected from molecular coevolution.

Computational tools for identifying such residues would be valuable for a range of structural and functional studies. Covariation methods are widely assumed to provide approximate measures of coevolution and they have been used successfully to identify physically close residues in a range of proteins. We cast doubt on the validity of that assumption through a theoretical and empirical framework. We show that a range of different coevolutionary and independent evolutionary scenarios are indistinguishable from one another based solely on the observation of covariation. Only by examining change in the context of the phylogenetic tree structure can one discriminate between coevolutionary double changes and groups of independently occurring single changes amplified by the structure of the phylogenetic tree.
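
The role of the tree can be illustrated with a small sketch that counts the minimum number of changes a site requires on a given phylogeny. It uses Fitch parsimony purely as a simple stand-in; the project itself describes statistical phylogenetic models, not parsimony. The two trees, the taxon names, and the site patterns are hypothetical.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes at one site on a rooted binary tree.

    `tree` is a nested tuple of leaf names; `states` maps leaf name -> residue.
    """
    def visit(node):
        if isinstance(node, str):  # leaf: its state set is the observed residue
            return {states[node]}, 0
        left, right = node
        set_l, cost_l = visit(left)
        set_r, cost_r = visit(right)
        common = set_l & set_r
        if common:  # children agree: no extra change needed here
            return common, cost_l + cost_r
        return set_l | set_r, cost_l + cost_r + 1  # disagreement costs one change

    return visit(tree)[1]


taxa = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
# Two perfectly covarying sites (the alignment columns are the same in both cases).
site_x = dict(zip(taxa, "AAAADDDD"))
site_y = dict(zip(taxa, "KKKKRRRR"))

# Hypothetical tree 1: the two residue patterns coincide with the two main clades.
clade_tree = ((("t1", "t2"), ("t3", "t4")), (("t5", "t6"), ("t7", "t8")))
# Hypothetical tree 2: the same patterns are scattered across the tree.
scattered_tree = ((("t1", "t5"), ("t2", "t6")), (("t3", "t7"), ("t4", "t8")))

changes_clade = (fitch_changes(clade_tree, site_x), fitch_changes(clade_tree, site_y))
changes_scattered = (
    fitch_changes(scattered_tree, site_x),
    fitch_changes(scattered_tree, site_y),
)
print(changes_clade, changes_scattered)
```

On the first tree a single change at each site explains the whole pattern, so the covariation could equally reflect two independent changes that happened to fall on the same branch and were then inherited by every descendant. On the second tree each site must change repeatedly, always alongside the other, which is harder to explain without correlated change. The alignment columns, and hence any tree-independent covariation score, are identical in both cases.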
Exploitation Route In protein-protein interactions, the interacting residues tend to be less conserved than residues in the core, especially for transient interactions. This suggests that unless prior knowledge is used for filtering out all the core residues, the interacting residues will be difficult to identify in some cases, and other computational approaches may be required to address these applications. In particular, we expect that covariation methods will be more successful in predicting obligate interactions in protein complexes than in identifying transient interactions, such as are found in signaling cascades. Recognizing that covariation methods tend to function by detecting slowly evolving sites in the core of proteins may also aid in their development and extension to other problems.

One possibility for improvement would be to explicitly adjust covariation measures to incorporate biophysical properties of residues, such as hydrophobicity, which may aid the identification of core residues. As covariation methods do not replicate contact maps based on hydrophobicity, both types of data might complement each other. An alternative, but complementary, approach would be to combine information from covariation and evolutionary rate to help identify buried and functional residues. Both of these approaches, however, are based on the idea of tuning covariation methods to better identify low rate sites in the core of proteins. More significant improvements may be possible through tree-based methods that directly try to measure molecular coevolution.
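
A minimal sketch of the kind of filter described above: it flags alignment columns that are both highly conserved and hydrophobic as candidate core sites. Column entropy is used as a crude proxy for evolutionary rate, hydrophobicity is taken from the Kyte-Doolittle hydropathy scale, and the thresholds, function names, and toy alignment are hypothetical choices for illustration only.

```python
from collections import Counter
from math import log2

# Kyte-Doolittle hydropathy scale (higher values are more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def column_entropy(col):
    """Shannon entropy (bits) of a column; low entropy means high conservation."""
    n = len(col)
    return -sum((c / n) * log2(c / n) for c in Counter(col).values())


def mean_hydropathy(col):
    """Average Kyte-Doolittle hydropathy of the residues in a column."""
    return sum(KYTE_DOOLITTLE[r] for r in col) / len(col)


def candidate_core_sites(alignment, max_entropy=0.9, min_hydropathy=1.5):
    """Indices of columns that are conserved and hydrophobic.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    sites = []
    for i in range(len(alignment[0])):
        col = [seq[i] for seq in alignment]
        if column_entropy(col) <= max_entropy and mean_hydropathy(col) >= min_hydropathy:
            sites.append(i)
    return sites


# Hypothetical ungapped toy alignment of eight sequences and four positions.
toy_alignment = ["LKVS", "LRVT", "LKIS", "LEVN", "IKVD", "LQVS", "LKVG", "LRVS"]
core_sites = candidate_core_sites(toy_alignment)
print(core_sites)
```

A filter of this kind could be applied to the site pairs returned by a covariation measure, either to prioritise likely core contacts or to set them aside when searching for less conserved interface residues.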
Sectors Manufacturing, including Industrial Biotechnology

Description We are applying the results from this project to understanding the effects of sequence variations in genetic disease, specifically congenital eye diseases. Where more than one variation is seen, we are using analysis of covariation to identify whether the variants may have compensating effects, either mitigating or exacerbating the effects of variants in patients. This is being done through St Mary's Hospital, Manchester.
First Year Of Impact 2015
Sector Healthcare
Impact Types Societal

Title Data from: ModelOMatic: fast and automated comparison between RY, nucleotide, amino acid, and codon substitution models 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes