Prediction of protein-protein interaction hot spots using a combination of physics and machine learning

Lead Research Organisation: University College London

Department Name: Computer Science

Abstract

Over the last few years, genome sequencing projects have provided a nearly complete list of the genes and proteins present in a cell. The challenge is now to understand how these molecular components interact to give rise to complex and highly interrelated biological processes and phenomena. The long-term goal is to reach a quantitative and predictive description of a biological system as a whole (e.g. a cell) grounded in molecular-level knowledge. This would offer an opportunity to study how the phenotype is generated from the genotype, for example in relation to genetic diseases. In this project we plan to investigate protein-protein interactions. Protein-protein interactions are fundamental to all biological processes, from signal transduction to gene regulation, from catalytic reactions to immune response, and more. To bridge the molecular and system levels, detailed knowledge of which proteins interact and how they interact is therefore essential. A full understanding of the functional relationship between proteins comes only from the three-dimensional (3D) structure of the complex, as this reveals the underlying molecular mechanism. However, determining the 3D structure of a protein complex experimentally presents considerable difficulties. There is therefore a need for accurate and reliable computational approaches that can tackle the so-called docking problem, i.e. the prediction of the complex conformation starting from the structures of its component proteins. Most docking procedures consider the full 3D structure of the complex and try to orient the individual proteins so as to optimize their shape and chemical complementarity. We propose instead to develop a computational method to predict which amino acids are in contact at a protein-protein interface. Several experiments have shown that protein interactions are critically dependent on just a few amino acids, or hot spots, at the binding interface.
If potential hot spots could be identified in isolated proteins, our ability to solve the docking problem would be significantly enhanced. We plan to combine and integrate the basic energetic determinants of hot-spot interactions (e.g. van der Waals potentials, hydrogen bonds, etc.) using state-of-the-art machine learning techniques (e.g. neural networks and support vector machines). Such a hybrid scheme is necessary because the problem is too complex to be solved purely from first principles: simplifications and approximations need to be introduced. Machine learning algorithms are extremely powerful at learning from known examples and generating empirical rules. They can therefore be used to complement and guide physical methods and extend the limits of their applicability.
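
The hybrid idea described above, physical energy terms used as inputs to a learned classifier, can be illustrated with a minimal sketch. It is not the project's actual method: it assumes each interface residue is summarised by two standardised, hypothetical features (a van der Waals energy and a hydrogen-bond count), uses invented toy values, and trains a plain linear support vector machine by subgradient descent on the hinge loss. All names and parameter values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# Hypothetical, standardised features per residue: (vdw_energy, hbond_count).
# The numbers are invented for illustration and are not project data.
TOY_FEATURES: List[Tuple[float, float]] = [
    (-1.0, 1.0), (-1.5, 0.5), (-0.8, 1.2),   # labelled as hot spots
    (1.0, -1.0), (0.5, -0.8), (1.2, -0.5),   # labelled as non-hot spots
]
TOY_LABELS: List[int] = [1, 1, 1, -1, -1, -1]  # +1 = hot spot, -1 = not


def train_linear_svm(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    reg: float = 0.01,            # L2 regularisation strength (assumed value)
    learning_rate: float = 0.1,   # step size (assumed value)
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Minimise the L2-regularised hinge loss by subgradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            # Regularisation step: shrink the weights slightly.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            # Hinge-loss step: only examples inside the margin contribute.
            if y * score < 1.0:
                weights = [w + learning_rate * y * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return +1 (predicted hot spot) or -1 (predicted non-hot spot)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0.0 else -1


WEIGHTS, BIAS = train_linear_svm(TOY_FEATURES, TOY_LABELS)
```

In this sketch the energy terms supply the physics, while the learned weights play the role of the empirical rule that decides how much each term matters.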

Technical Summary

Protein-protein interactions are central to most biological processes, from signal transduction to immune response. Understanding these functional associations requires knowledge of the three-dimensional (3D) structure of the complex, as this reveals the underlying molecular mechanism. However, determining the 3D structure of a protein complex experimentally presents considerable difficulties. There is therefore a need for accurate and reliable computational methods. Several experiments have shown that protein interactions are critically dependent on just a few residues, or hot spots, at the binding interface. Hot spots make a dominant contribution to the free energy of binding and, if mutated, can disrupt the interaction. In this project we aim to develop a computational method that can identify hot spot residues (and the contacts they form across the interface) in unbound proteins (i.e. without prior knowledge of the complex). This would significantly improve our ability to predict the overall structure of the complex (the so-called docking problem). We plan to combine and integrate the basic energetic terms that contribute to the stability of protein complexes (e.g. van der Waals potentials, hydrogen bonds, etc.) using state-of-the-art machine learning techniques. In the first part of the project, we will develop a method to predict hot spot residues at protein-protein interfaces when the structure of the complex is available. In the second part, we plan to systematically dock structural fragments of the two unbound proteins and test them for the presence of potential hot spots (using the classifier developed in the first part). Finally, we will combine different sources of information (energetic, evolutionary and structural) to predict a few important contacts across the interface of two proteins.
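
To make the notion of a hot spot concrete, the following sketch labels residues from the change in binding free energy measured on mutation and then scores a set of predictions against those labels. It rests on assumptions not stated in this record: the 2.0 kcal/mol cutoff is a commonly used convention rather than a value taken from the project, and the residue identifiers and energies are invented.

```python
from typing import Dict, Set, Tuple

# Assumed cutoff in kcal/mol; this record does not specify one.
HOT_SPOT_CUTOFF = 2.0

# Invented example: change in binding free energy when each residue is mutated.
TOY_DDG: Dict[str, float] = {
    "A:TRP104": 3.1,
    "A:ASP56": 2.4,
    "A:SER12": 0.3,
    "B:LYS88": 1.1,
    "B:ARG43": 4.0,
}


def label_hot_spots(ddg_by_residue: Dict[str, float],
                    cutoff: float = HOT_SPOT_CUTOFF) -> Set[str]:
    """Residues whose mutation costs at least `cutoff` in binding free energy."""
    return {res for res, ddg in ddg_by_residue.items() if ddg >= cutoff}


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Fraction of predictions that are correct, and fraction of hot spots found."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


ACTUAL = label_hot_spots(TOY_DDG)
# Hypothetical output of some classifier, used only to exercise the scoring.
PREDICTED = {"A:TRP104", "B:ARG43", "B:LYS88"}
PRECISION, RECALL = precision_recall(PREDICTED, ACTUAL)
```

With these toy values, two of the three predicted residues are true hot spots and two of the three true hot spots are recovered, so precision and recall are both 2/3.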

Funded Value:

£310,930

Funded Period:

Feb 07 - Jul 10

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/E017452/1

Principal Investigator:

David Jones

Research Subject:

Biomolecules & biochemistry (40%)

Tools, technologies & methods (20%)

Research Topic:

Protein expression (40%)

Theoretical biology (20%)

Organisations

University College London (Lead Research Organisation)

People

David Jones (Principal Investigator)
Massimiliano Pontil (Co-Investigator)
Stefano Lise (Researcher Co-Investigator)

Publications

Lise S (2009) Prediction of hot spot residues at protein-protein interfaces by combining machine learning and energy-based methods. In: BMC Bioinformatics

Lise S (2011) Predictions of hot spot residues at protein-protein interfaces using support vector machines. In: PLoS ONE

Key Findings

Description	Over the last few years, genome sequencing projects have provided a nearly complete list of the genes and proteins present in a cell. The challenge is now to understand how these molecular components interact to give rise to complex and highly interrelated biological processes and phenomena. The long-term goal is to reach a quantitative and predictive description of a biological system as a whole (e.g. a cell) grounded in molecular-level knowledge. This would offer an opportunity to study how the phenotype is generated from the genotype, for example in relation to genetic diseases. In this project we investigated protein-protein interactions. Interactions between proteins are fundamental to all biological processes, from signal transduction to gene regulation, from catalytic reactions to immune response, and more. To bridge the molecular and system levels, detailed knowledge of which proteins interact and how they interact is therefore essential. A full understanding of the functional relationship between proteins comes only from the three-dimensional (3D) structure of the complex, as this reveals the underlying molecular mechanism. However, determining the 3D structure of a protein complex experimentally presents considerable difficulties. There is therefore a need for accurate and reliable computational approaches that can tackle the so-called docking problem, i.e. the prediction of the complex conformation starting from the structures of its component proteins. Most docking procedures consider the full 3D structure of the complex and try to orient the individual proteins so as to optimize their shape and chemical complementarity. We instead developed a computational method to predict which amino acids are in contact at a protein-protein interface. Several experiments have shown that protein interactions are critically dependent on just a few amino acids, or hot spots, at the binding interface.
If potential hot spots can be identified in isolated proteins, our ability to solve the docking problem will be significantly enhanced. We combined and integrated the basic energetic determinants of hot-spot interactions (e.g. van der Waals potentials, hydrogen bonds, etc.) using state-of-the-art machine learning techniques (e.g. neural networks and support vector machines). Such a hybrid scheme is necessary because the problem is too complex to be solved purely from first principles: simplifications and approximations need to be introduced. Machine learning algorithms are extremely powerful at learning from known examples and generating empirical rules. They can therefore be used to complement and guide physical methods and extend the limits of their applicability.
Exploitation Route	After publication, the methods developed in this project were implemented as a publicly available web service so that other researchers can use them easily.
Sectors	Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
URL	http://bioinf.cs.ucl.ac.uk/structure/

Impact Summary

Description	The main non-academic impact from this grant has been the training of skilled scientific researchers. In this case the researcher, Dr Stefano Lise, has gone on to become Head of High-Throughput Bioinformatics at the Wellcome Trust Centre for Human Genetics. This will ultimately benefit the core mission of the Centre: advancing the understanding of genetically related medical conditions through multidisciplinary research.
First Year Of Impact	2010
Sector	Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types	Societal, Economic

Software and Technical Products

Title	HSPred
Description	A support vector machine (SVM)-based method to predict hot spot residues, given the structure of a protein complex.
Type Of Technology	Webtool/Application
Year Produced	2011
Impact	None to date.
URL	http://bioinf.cs.ucl.ac.uk/structure
