Protein Function Prediction using Machine Learning by an Enhanced Novel Support Vector Logic-based Approach

Lead Research Organisation: Imperial College London
Department Name: Life Sciences


Proteins are biological molecules that are the machinery of life, involved in numerous biological processes such as the breakdown of food to provide energy and the defence of a cell against disease. Proteins adopt complex three-dimensional (3D) structures, and the location of the atoms can be revealed experimentally. Knowledge of the 3D structure of a protein and its function often provides major insight into biological processes. In addition, this knowledge is of substantial benefit to the design of novel drugs. As a result of advances in biological research, particularly the sequencing of the genomes of humans, other animals and many bacteria, the scientific community is now determining or predicting the 3D structures of many proteins whose functions are not yet known. In addition, computational methods can predict the possible structure of a protein from its chemical formula (its sequence). This project is to develop a computer-based approach that takes a protein of experimentally determined or predicted structure and suggests its function. Protein function is determined by the spatial position of critical residues and the environment of these residues. We will use a computer algorithm to learn the rules from known examples of protein structures and their functions. In particular, the machine learning approach will combine logic reasoning with quantitative predictions from a support vector machine, using a novel method known as Support Vector Inductive Logic Programming (SVILP). SVILP has the benefit that logic rules are powerful in describing spatial relationships and can be readily understood. However, logic rules give only yes-or-no answers, so for quantitative prediction (e.g. confidence or rank) we feed the logic rules into a support vector machine. In this grant we will enhance this novel SVILP methodology. There will be two major results from the grant. First, we will have developed an enhanced method to assign function to protein structure, together with a web server for use by the community. Second, we will have developed an enhanced, robust version of SVILP, with its power benchmarked on a challenging application and in a form suitable for uptake by the community to apply to a wide range of problems.

Technical Summary

This proposal has two inter-related aims: 1) to develop a method to predict the function of a protein from its experimental or predicted structure using a novel machine learning method, support vector inductive logic programming (SVILP); 2) to enhance the prototype version of SVILP into a robust tool for use in protein function prediction and in a broad range of other application areas. To develop function prediction, we will use the Catalytic Site Atlas and eFsite (electrostatic-surface of Functional site). The first step is to predict which residues are functional. A pool of methods will be used: our in-house program PHUNCTIONER that identifies residues specific for function, graph theoretic measures, electrostatics, evolutionary and statistical propensities, spatial clustering and cleft geometry. We will use SVILP to learn rules to predict functional residues using the above as background knowledge. The second step is to learn 3D motifs specific to function, using SVILP to yield rules. We will develop a web server for dissemination of the methodology. We will interact with structural genomics projects to employ and test our method. To improve SVILP, we will consider four topics. (1) Feature selection will select a small number of rules that are highly effective and will be implemented using both filter and embedded methods. (2) Estimation of probabilistic parameters on ILP rules will use maximum a posteriori estimation to give different weights to rules. (3) Novel kernel functions will be designed that are efficient and effective for protein function modelling. We will prove the properties of symmetry and positive semi-definiteness that establish the validity of the developed functions as kernel functions. (4) A multi-class prediction method will be implemented that allows SVILP-based techniques to make robust and accurate multi-class predictions, based on schemes which weight the predictive contributions of individual rules and class predictors.
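The kernel properties named in topic (3) can be illustrated with a minimal sketch. It assumes, purely for illustration, a kernel defined as the summed weight of the logic rules that cover both examples; the rule names, example names and weights below are hypothetical and are not taken from the project. With non-negative weights such a kernel is symmetric and positive semi-definite by construction, because it is a weighted inner product of binary rule-coverage vectors. The numerical check at the end is only a sanity test on a small grid of vectors, not a proof.

```python
from itertools import product

# Hypothetical rule coverage: for each example, the set of logic rules that fire on it.
coverage = {
    "site_a": {"r1", "r2"},
    "site_b": {"r2", "r3"},
    "site_c": {"r1", "r2", "r3"},
}

# Hypothetical non-negative rule weights (for instance, from a MAP-style estimate).
weights = {"r1": 0.5, "r2": 0.3, "r3": 0.2}


def rule_kernel(x, y):
    """Sum the weights of the rules that cover both examples."""
    return sum(weights[r] for r in coverage[x] & coverage[y])


def gram_matrix(names):
    """Build the kernel (Gram) matrix over the named examples."""
    return [[rule_kernel(a, b) for b in names] for a in names]


def is_symmetric(m, tol=1e-12):
    """Check K[i][j] == K[j][i] up to a small tolerance."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


def quadratic_form(m, v):
    """Compute v^T K v; this is non-negative for every v when K is PSD."""
    n = len(m)
    return sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    names = sorted(coverage)
    K = gram_matrix(names)
    print("symmetric:", is_symmetric(K))
    # Sanity check on a small grid of test vectors (not a proof of PSD).
    grid = product((-1, 0, 1), repeat=len(names))
    print("quadratic forms non-negative:",
          all(quadratic_form(K, v) >= -1e-9 for v in grid))
```

A Gram matrix built this way could be passed to any support vector machine implementation that accepts a precomputed kernel. Setting a rule's weight to zero removes it from the kernel, which is one simple way to picture how rule selection and rule weighting interact.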


Description Logic-based methodologies can be used to assign protein function.
Exploitation Route Application to synthetic biology to modify protein function.
Sectors Agriculture, Food and Drink, Manufacturing, including Industrial Biotechnology

Description Developed a novel approach to predict protein function using enhanced machine learning
First Year Of Impact 2006
Sector Pharmaceuticals and Medical Biotechnology
Impact Types Economic

Description Syngenta Ltd, University Innovation Centre
Amount £1,500,000 (GBP)
Organisation Syngenta International AG 
Sector Public
Country Global
Start 10/2008 
End 09/2013
Description Nanjing-Imperial Machine Learning Centre 
Organisation Nanjing University (NJU)
Country China 
Sector Academic/University 
PI Contribution This is the outcome of two years collaboration with the University of Nanjing, and has involved multiple bilateral visits. Joint research on the development of techniques for integrating Statistical and Logical Machine Learning has led to an early report on a new technology called Logical Vision. Major conference and Journal submissions are in progress. Our contribution has been in providing expertise and research in Logic-Based Machine Learning.
Collaborator Contribution Nanjing University is China's top centre for research in Statistical Machine Learning. They have developed the code base for the LogVis system which was recently released on GitHub under a BSD open source license. Nanjing University has just agreed to fund the new centre at a level of £60K per year. This will provide funding for travel and RA time to support the ongoing collaboration.
Impact W-Z Dai, S.H. Muggleton, and Z-H Zhou. Logical Vision: Meta-interpretive learning for simple geometrical concepts. In Late Breaking Paper Proceedings of the 25th International Conference on Inductive Logic Programming, pages 1-16. CEUR, 2015.
Start Year 2015
Title Confunc 
Description A web server to predict protein function from sequence 
Type Of Technology Webtool/Application 
Year Produced 2008 
Impact Used by bioscience workers. Now incorporated into CombFunc 
Description Lecture - Art and Science 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Talk highlighted the link between structural biology and art.

Follow-up invitation to talk at a human/computer interaction conference
Year(s) Of Engagement Activity 2013
Description School lecture (London) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Talk to school children to spark interest in science

Requests for work experience
Year(s) Of Engagement Activity 2012