Protein Function Prediction using Machine Learning by an Enhanced Novel Support Vector Logic-based Approach
Lead Research Organisation:
Imperial College London
Department Name: Life Sciences
Abstract
Proteins are the biological molecules that form the machinery of life. They are involved in numerous biological processes, such as the breakdown of food to provide energy and the defence of a cell against disease. Proteins adopt complex three-dimensional (3D) structures, and the location of their atoms can be revealed experimentally. Knowledge of the 3D structure of a protein and its function often provides major insight into biological processes. In addition, this knowledge is of substantial benefit to the design of novel drugs. As a result of advances in biological research, particularly the sequencing of the genomes of humans, other animals and many bacteria, the scientific community is now determining or predicting the 3D structures of many proteins whose functions are not yet known. In addition, computational methods can predict the possible structure of a protein from its chemical formula (its sequence). This project will develop a computer-based approach that takes a protein of experimentally determined or predicted structure and suggests its function. Protein function is determined by the spatial position of critical residues and the environment of these residues. We will use a computer algorithm to learn the rules from known examples of protein structures and their functions. In particular, the machine learning approach will combine logic reasoning with quantitative predictions from a support vector machine, using a novel method known as Support Vector Inductive Logic Programming (SVILP). SVILP has the benefit that logic rules are powerful in describing spatial relationships and can be readily understood. However, logic rules give only yes-or-no answers, so for quantitative prediction (e.g. confidence or rank) we feed the logic rules into a support vector machine. In this grant we will enhance this novel SVILP methodology.
There will be two major results from the grant. First, we will have developed an enhanced method to assign function to protein structures, together with a web server for use by the community. Second, we will have developed an enhanced, robust version of SVILP, with its power benchmarked on a challenging application and in a form suitable for the community to apply to a wide range of problems.
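The idea of feeding yes-or-no logic rules into a support vector machine can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the residue fields, the three rules, and the rule weights are all hypothetical, and the kernel shown (a weighted count of rules that fire on both examples) is a simplified stand-in for whatever kernel SVILP actually uses.

```python
from typing import Any, Callable, Dict, List

# Hypothetical toy description of a residue; field names are illustrative only.
Residue = Dict[str, Any]
Rule = Callable[[Residue], bool]

# Hypothetical yes/no logic rules, standing in for rules an ILP system might learn.
RULES: List[Rule] = [
    lambda r: r["in_cleft"] and r["conserved"],
    lambda r: r["charge"] != 0,
    lambda r: r["conserved"] and r["charge"] < 0,
]

# Assumed non-negative weight per rule (for example, a prior over rules).
WEIGHTS: List[float] = [0.5, 0.25, 0.25]


def coverage(example: Residue) -> List[int]:
    """Return a 0/1 vector recording which rules fire on this example."""
    return [1 if rule(example) else 0 for rule in RULES]


def rule_kernel(a: Residue, b: Residue) -> float:
    """Weighted count of the rules that fire on both examples."""
    return sum(w * x * y for w, x, y in zip(WEIGHTS, coverage(a), coverage(b)))


def gram_matrix(examples: List[Residue]) -> List[List[float]]:
    """Pairwise kernel values, in the form a kernel machine would consume."""
    return [[rule_kernel(a, b) for b in examples] for a in examples]


# Made-up examples, used only to exercise the functions above.
EXAMPLES: List[Residue] = [
    {"in_cleft": True, "conserved": True, "charge": -1},
    {"in_cleft": True, "conserved": True, "charge": 0},
    {"in_cleft": False, "conserved": False, "charge": 1},
]

if __name__ == "__main__":
    for row in gram_matrix(EXAMPLES):
        print(row)
```

A Gram matrix of this kind could then be handed to any support vector machine implementation that accepts precomputed kernels, which is where the quantitative output (a confidence or rank) would come from.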
Technical Summary
This proposal has two inter-related aims: 1) to develop a method to predict the function of a protein from its experimental or predicted structure using a novel machine learning method - support vector inductive logic programming (SVILP); 2) to enhance the prototype version of SVILP into a robust tool for use in protein function prediction and in a broad range of other application areas. To develop function prediction, we will use the Catalytic Site Atlas and eFsite (electrostatic-surface of Functional site). The first step is to predict which residues are functional. A pool of methods will be used: our in-house program PHUNCTIONER, which identifies residues specific for function, together with graph theoretic measures, electrostatics, evolutionary and statistical propensities, spatial clustering and cleft geometry. We will use SVILP to learn rules that predict functional residues, using the above as background knowledge. The second step is to use SVILP to learn rules describing 3D motifs that are specific to a function. We will develop a web server for dissemination of the methodology. We will interact with structural genomics projects to employ and test our method. To improve SVILP, we will consider four topics. (1) Feature selection will select a small number of highly effective rules and will be implemented using both filter and embedded methods. (2) Estimation of probabilistic parameters on ILP rules will use maximum a posteriori estimation to give different weights to rules. (3) Novel kernel functions will be designed that are efficient and effective for protein function modelling. We will prove the properties of symmetry and positive semi-definiteness, which establish the validity of the developed functions as kernel functions. (4) A multi-class prediction method will be implemented that allows SVILP-based techniques to act as robust and accurate multi-class predictors, based on schemes that weight the predictive contributions of individual rules and class predictors.
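The kind of validity argument mentioned in topic (3) can be illustrated for the simplest case. The derivation below is a sketch under an assumption that is not stated in the proposal: that the kernel is a non-negatively weighted sum over rule indicators, where H is a set of rules, w_h is the weight of rule h, and phi_h(x) is 1 if rule h covers example x and 0 otherwise. All of these symbols are introduced here for illustration only.

```latex
% Assumed form of the kernel (illustrative, not taken from the proposal)
K(x, y) = \sum_{h \in H} w_h \, \phi_h(x) \, \phi_h(y),
\qquad w_h \ge 0, \quad \phi_h(x) \in \{0, 1\}

% Symmetry: each term is unchanged when x and y are swapped
K(x, y) = K(y, x)

% Positive semi-definiteness: for any examples x_1, \dots, x_n
% and any real coefficients c_1, \dots, c_n
\sum_{i=1}^{n} \sum_{j=1}^{n} c_i \, c_j \, K(x_i, x_j)
  = \sum_{h \in H} w_h \Bigl( \sum_{i=1}^{n} c_i \, \phi_h(x_i) \Bigr)^{2}
  \ge 0
```

Because every weight is non-negative and every bracketed term is squared, the quadratic form cannot be negative, so a kernel of this assumed form is valid. Kernels with a different structure would need their own argument.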
Publications
Reynolds CR (2018) EzMol: A Web Server Wizard for the Rapid Visualization and Image Production of Protein and Nucleic Acid Structures. in Journal of molecular biology
Lodhi H (2010) Multi-class Mode of Action Classification of Toxic Compounds Using Logic Based Kernel Methods. in Molecular informatics
Wass MN (2010) 3DLigandSite: predicting ligand-binding sites using similar structures. in Nucleic acids research
Kelley LA (2009) Discovering rules for protein-ligand specificity using support vector inductive logic programming. in Protein engineering, design & selection : PEDS
Description | Logic-based methodologies can be used to assign protein function. |
Exploitation Route | Application to synthetic biology to modify protein function. |
Sectors | Agriculture, Food and Drink; Manufacturing, including Industrial Biotechnology |
Description | Developed a novel approach to predict protein function using enhanced machine learning |
First Year Of Impact | 2006 |
Sector | Pharmaceuticals and Medical Biotechnology |
Impact Types | Economic |
Description | Syngenta Ltd, "University Innovation Centre" |
Amount | £1,500,000 (GBP) |
Organisation | Syngenta International AG |
Sector | Private |
Country | Switzerland |
Start | 09/2008 |
End | 09/2013 |
Description | Nanjing-Imperial Machine Learning Centre |
Organisation | Nanjing University (NJU) |
Country | China |
Sector | Academic/University |
PI Contribution | This is the outcome of two years' collaboration with the University of Nanjing, and has involved multiple bilateral visits. Joint research on the development of techniques for integrating Statistical and Logical Machine Learning has led to an early report on a new technology called Logical Vision. Major conference and journal submissions are in progress. Our contribution has been in providing expertise and research in Logic-Based Machine Learning. |
Collaborator Contribution | Nanjing University is China's top centre for research in Statistical Machine Learning. They have developed the code base for the LogVis system which was recently released on GitHub under a BSD open source license. Nanjing University has just agreed to fund the new centre at a level of £60K per year. This will provide funding for travel and RA time to support the ongoing collaboration. |
Impact | W-Z Dai, S.H. Muggleton, and Z-H Zhou. Logical Vision: Meta-interpretive learning for simple geometrical concepts. In Late Breaking Paper Proceedings of the 25th International Conference on Inductive Logic Programming, pages 1-16. CEUR, 2015. |
Start Year | 2015 |
Title | Confunc |
Description | A web server to predict protein function from sequence |
Type Of Technology | Webtool/Application |
Year Produced | 2008 |
Impact | Used by bioscience workers. Now incorporated into CombFunc |
URL | http://www.sbg.bio.ic.ac.uk/confunc/about.html |
Description | Lecture - Art and Science |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Talk highlighted the link between structural biology and art. Follow-up invitation to talk at a human/computer interaction conference. |
Year(s) Of Engagement Activity | 2013 |
Description | School lecture (London) |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Talk to school children to spark interest in science. Requests for work experience. |
Year(s) Of Engagement Activity | 2012 |