Random Forest Prediction of Protein-Ligand Binding Affinities

Lead Research Organisation: UNIVERSITY OF CAMBRIDGE

Department Name: Chemistry

Abstract

The binding affinity between a small molecule ligand and the protein with which it interacts is not easy to calculate. Indeed, its computational prediction remains one of the most important and difficult unsolved problems in computational biochemical science. Most medicines, and many other molecules in uses from agrochemicals to deodorants, are ligands that bind to proteins. The proteins may be from the human, or from a pathogenic or undesirable organism such as a bacterium. It would be very beneficial to be able to predict binding affinities using a computer, because the alternative experimental approach of making very many molecules and assaying them against the relevant protein or proteins is difficult, expensive and time-consuming. The computer calculates an estimated binding affinity using a mathematical formula known as a scoring function. The development of suitable scoring functions for ranking possible three dimensional protein-ligand interaction geometries, and especially for accurate prediction of protein-ligand binding affinities, remains a considerable challenge. The scoring function must capture all the important aspects of the interaction in order to give an accurate and reliable prediction of the binding affinity. In order to develop better scoring functions, we are looking to the fields of machine learning and informatics, and will require the known binding affinities and structures of numerous well-characterised protein-ligand complexes. Fortunately, many hundreds of protein-ligand complexes have both structures and binding affinities available. The method we will use is called Random Forest. The forest is a set of several hundred 'decision trees', each of which is basically a flow diagram. We will train them to learn patterns in the known properties of existing protein-ligand complexes, their binding affinities and their patterns of atom-atom interaction distances. 
However, the way in which we will generate the trees involves computer-simulated dice-rolling. This will ensure that they are all different, though based on the same underlying information. Each decision tree then makes a prediction of the unknown binding affinity. These predictions are averaged to give the final computed value. This averaging over many decision trees maximises the use of the information contained in the underlying data and produces results which are much more accurate than those of any one decision tree. Our models will be validated by using them to predict binding affinities of protein-ligand complexes that the algorithm has not seen before. This ensures that the computer is not simply learning the idiosyncrasies of the data on which it is being trained.
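
As a rough illustration of the dice-rolling and averaging described above, the following Python sketch builds a toy ensemble and checks it on withheld data. It is not the project's code: the data are synthetic, each "tree" is deliberately reduced to a single-question stump, and the names fit_stump, fit_forest and predict_forest are hypothetical.

```python
import random
import statistics


def fit_stump(rows, targets, rng):
    """Fit a one-question 'flow diagram' on one randomly chosen descriptor."""
    feature = rng.randrange(len(rows[0]))
    values = sorted({row[feature] for row in rows})
    overall = statistics.fmean(targets)
    if len(values) < 2:
        # Descriptor is constant in this sample, so no split is possible.
        return (feature, values[0], overall, overall)
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean = statistics.fmean(left)
        right_mean = statistics.fmean(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, (feature, threshold, left_mean, right_mean))
    return best[1]


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=200, seed=0):
    """Build many stumps, each from a random resample of the training data."""
    rng = random.Random(seed)  # fixed seed so the dice rolls are repeatable
    forest = []
    for _ in range(n_trees):
        # Dice-rolling step: draw complexes with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_targets = [targets[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_targets, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all stumps."""
    return statistics.fmean([predict_stump(s, row) for s in forest])


# Synthetic stand-in data: descriptor 0 drives the affinity, descriptor 1 does not.
rows = [[float(i), float((i * 7) % 10)] for i in range(20)]
affinities = [0.5 * row[0] + 2.0 for row in rows]

# Withhold every fifth complex so the model is judged on data it has not seen.
test_idx = [i for i in range(len(rows)) if i % 5 == 0]
train_idx = [i for i in range(len(rows)) if i % 5 != 0]
forest = fit_forest([rows[i] for i in train_idx],
                    [affinities[i] for i in train_idx])
errors = [abs(predict_forest(forest, rows[i]) - affinities[i]) for i in test_idx]
print("mean absolute error on withheld complexes:", statistics.fmean(errors))
```

A real Random Forest grows full trees and considers a random subset of descriptors at every split; the stump is used here only to keep the resampling and averaging steps visible.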

Technical Summary

Unlike knowledge-based methods, Random Forest affinity prediction will use binding affinities as well as 3D structures. We will take hundreds of protein-ligand complexes with binding affinities from our own PLD, and from the PDBbind, AffinDB, LPDB, Binding MOAD, BindingDB and KiBank databases. Most will form the training data, but we will withhold an external validation set. Random Forest is an ensemble of decision trees generated stochastically so that all are different, though based on the same underlying data. Random Forest can handle large numbers of descriptors even when some are uninformative, can measure the importance of each descriptor, and is robust against overfitting. For regression, the prediction is averaged over all the trees. Processing PDB structures, defining atom types and preparing histograms of atom type pairwise distance distributions are all handled by our existing BLEEP software. Predictive Random Forest models will be built using the randomForest package from the statistical suite R. Our descriptors will be counts of atom type pairs interacting in distance ranges, for example hydroxyl oxygen interacting with amide nitrogen at distances between 3.0 and 3.5 Å. We will use fewer than 40 atom types; their definitions can be revised during the project. The more data we have, the more specific we can make our descriptors, by adjusting atom type definitions and histogram bin sizes. We will build Random Forests with 500 trees using the training set. Performance in predicting out-of-bag data, that is, the data not selected to build a given tree, reflects a model's quality. We will measure the importance of individual descriptors by replacing them with random noise and recording the resultant drop in accuracy. We will also test our models on the independent external validation sets. We will build models for the overall diverse dataset of protein-ligand complexes and for specific families, like serine proteinases, aspartic proteinases and sugar-binding proteins.
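
The descriptor idea above can be sketched in a few lines of Python. This is an illustrative assumption rather than the BLEEP implementation: the atom type labels, the coordinates, the bin edges and the function name pair_count_descriptors are all hypothetical, and the sketch assumes that pairs are formed between one protein atom and one ligand atom.

```python
import math
from collections import Counter


def pair_count_descriptors(protein_atoms, ligand_atoms, bin_edges):
    """Count protein-ligand atom type pairs in each distance bin.

    Each atom is (atom_type, (x, y, z)) with coordinates in angstroms.
    bin_edges is an ascending list; a pair at distance d is counted in
    the bin [lo, hi) for which lo <= d < hi, and ignored otherwise.
    """
    counts = Counter()
    for p_type, p_xyz in protein_atoms:
        for l_type, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)
            for lo, hi in zip(bin_edges, bin_edges[1:]):
                if lo <= d < hi:
                    counts[(p_type, l_type, lo, hi)] += 1
                    break
    return counts


# Hypothetical toy complex: two protein atoms and two ligand atoms.
protein_atoms = [("amide_N", (0.0, 0.0, 0.0)), ("hydroxyl_O", (5.0, 0.0, 0.0))]
ligand_atoms = [("hydroxyl_O", (3.2, 0.0, 0.0)), ("aromatic_C", (0.0, 3.9, 0.0))]
bin_edges = [3.0, 3.5, 4.0]  # two bins: 3.0-3.5 and 3.5-4.0 angstroms

descriptors = pair_count_descriptors(protein_atoms, ligand_atoms, bin_edges)
for key, count in sorted(descriptors.items()):
    print(key, count)
```

Each distinct (protein type, ligand type, bin) key corresponds to one descriptor column, so narrower bins or finer atom type definitions lengthen the descriptor vector, which is why more data permits more specific descriptors.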

Funded Value:

£80,714

Funded Period:

Jan 09 - Dec 09

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/G000247/1

Principal Investigator:

John Mitchell

Research Subject:

Biomolecules & biochemistry (24%)

Cell biology (13%)

Tools, technologies & methods (24%)

Research Topic:

Bioinformatics (12%)

Catalysis & enzymology (12%)

Protein expression (12%)

Receptors (13%)

Tools for the biosciences (12%)

Organisations

UNIVERSITY OF CAMBRIDGE (Lead Research Organisation)

People	ORCID iD
John Mitchell (Principal Investigator)

Publications

Ballester PJ (2012) Hierarchical virtual screening for the discovery of new molecular scaffolds in antibacterial hit identification. in Journal of the Royal Society, Interface

Ballester PJ (2010) A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. in Bioinformatics (Oxford, England)

Ballester PJ (2011) Comments on "leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets": significance for the validation of scoring functions. in Journal of chemical information and modeling

Mitchell J (2011) Informatics, Machine Learning and Computational Medicinal Chemistry. in Future Medicinal Chemistry

Key Findings
Impact Summary
Software and Technical Products


Description	We have created a highly successful scoring function for predicting protein-ligand binding affinity, RF-Score. Our new scoring function both outperforms its leading rivals on the demanding PDBbind benchmark and incorporates new design principles. RF-Score embodies a novel approach to scoring functions that circumvents the need for problematic modelling assumptions, as the use of the Random Forest machine learning method avoids any requirement to assume a particular mathematical form. Our use of Random Forest allowed us to implicitly capture binding effects that are hard to model by any theory-based method. We showed RF-Score to be particularly effective as a re-scoring function and thus suitable for virtual screening and lead optimization purposes. It is very encouraging that the first version obtained a high correlation with measured binding affinities on a highly diverse test set. RF-Score's performance was shown to improve dramatically with training set size. Hence the anticipated future availability of ever more data is expected to lead to further improvements, which can be realised in forthcoming versions of RF-Score. Our success with RF-Score demonstrates that machine-learning-based scoring functions constitute an effective way to assimilate the fast-growing volume of high-quality structural and interaction data into affinity prediction, and functions of this kind are confidently expected to lead to increasingly accurate and general predictions of binding affinity. The code for RF-Score is freely available from us under a Creative Commons license. It is also available as supporting information to the Bioinformatics paper describing this work.
Exploitation Route	As part of docking suites. In application to DNA and RNA docking.
Sectors	Chemicals, Education, Manufacturing including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology
URL	http://chemistry.st-andrews.ac.uk/staff/jbom/group/RF-Score.html


Description	RF-Score is being made publicly available without charge under a Creative Commons license. It is available both directly from the authors and also as Supporting Information to the Bioinformatics paper describing RF-Score. We hope that it will prove especially useful to SMEs in areas like pharmaceuticals, biotechnology and food science, who lack the in-house resources of larger corporations. However, RF-Score will also be available without charge to large companies, and of course to academics. John Mitchell has given a number of presentations to the pharmaceutical industry, including invitations to speak at Improving Solubility (2009), ADMET (2009) & ADMET Europe (2010). He has one PhD student half funded by Unilever plc. He also has a collaboration with GlaxoSmithKline. Pedro Ballester has strong links with Pfizer, collaborating with them in testing his Ultrafast Shape Recognition (USR) software. Pedro also works closely with the drug discovery company InhibOx Ltd, who are primarily known for their Screensaver Lifesaver project using spare home PC capacity to conduct virtual screening against cancer-relevant targets. Dr Ballester has worked specifically with InhibOx on using USR in a virtual screening context.
First Year Of Impact	2010
Sector	Agriculture, Food and Drink, Chemicals, Education, Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Title	RF-Score
Description	RF-Score is a machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking.
Type Of Technology	Software
Year Produced	2010
Open Source License?	Yes
Impact	This virtual screening methodology was tested prospectively on two versions of an antibacterial target (type II dehydroquinase from Mycobacterium tuberculosis and Streptomyces coelicolor), for which HTS has not provided satisfactory results and consequently practically all known inhibitors are derivatives of the same core scaffold. Overall, our protocols identified 100 new inhibitors, with calculated Ki ranging from 4 to 250 µM (confirmed hit rates are 60% and 62% against each version of the target). Most importantly, over 50 new active molecular scaffolds were discovered that underscore the benefits that a wide application of prospectively validated in silico screening tools is likely to bring to antibacterial hit identification.
URL	http://chemistry.st-andrews.ac.uk/staff/jbom/group/RF-Score.html
