New Developments of Large-scale Automatic Protein Function Prediction using Graphical Learning Techniques

Lead Research Organisation: University College London
Department Name: Computer Science

Abstract

A large fraction of the cellular activities required for life are carried out by proteins, some of which have been extensively studied over the years. Knowing exactly what these molecules do, when, where and how has been instrumental for medical and biotechnological use. Unfortunately, the level of detail required for such advanced applications is available for only a tiny fraction of the proteins in a typical cell; for many others we have only some reasonable clues about their biological roles. Moreover, there is also a substantial portion that we can barely link to our understanding of biology, even though we are confident that they exist. In human cells, for instance, these represent approximately 40% of the proteins.

It is clearly very challenging to experimentally test all the proteins in order to describe their function at the finest level of detail. Computer programs can help narrow down the number of assays to run by leveraging known experimental data and the fact that some protein features can be used to recognise well-studied functional units. The underpinning algorithms have become more and more advanced over time, but a number of independent studies have shown that there is still a lot of room for improvement in this field. One clear bottleneck that hampers progress is that all current methods address the questions of what proteins do and in which context separately. However, there is clear evidence that proteins carry out molecular activities in specific cellular compartments and in concert with other biological partners.

The proposed project builds on successful previous work on protein function prediction to expand the scope and accuracy of our tools. These already make use of a wide array of heterogeneous experimental data stored in public databases, which can give information about the protein of interest in terms of its evolutionary relationships to other characterized proteins, as well as the proteins it physically interacts with or is co-regulated with. These diverse sources of information are then combined through some of the most popular machine-learning methods, which have been successfully applied in many other areas such as game-playing, speech recognition and e-mail spam filtering.

Here we seek to make better use of the information already included in our system, to introduce additional biological data types, as well as to explore new and smarter ways of combining them. We will exploit our expertise in providing reliable and user-friendly online tools for protein structure and function prediction so that the new programs and predictions can be easily used and analyzed by experimentalists for their own research with just a PC and a standard web browser.

Technical Summary

Surveys of public resources show that functional information is still completely missing for a considerable fraction of known proteins and is clearly incomplete for an even larger portion. Moreover, these estimates do not include metagenomics sequences, which pose even tougher challenges to existing functional annotation tools. Bioinformatics methods have long made use of very diverse data sources, alone or in combination, to predict protein function, with the understanding that different data types help elucidate complementary biological roles.

Recently community-wide initiatives have been launched to critically test existing approaches, to identify successful strategies and highlight bottlenecks that hamper progress. The first CAFA (Critical Assessment of Functional Annotations) experiment found that: (i) the most reliable predictions are based on extensive use of sequence similarities that are often combined with high-throughput data sources; and (ii) there is room for improvement in prediction accuracy and in the deployment of fast, fully automated predictors.

Here we propose to research ways to improve the integrative function prediction system that we tested at CAFA and that was ranked at or near the top across a range of benchmarks and evaluation metrics. In particular, we aim at: (i) making better use of existing sources of information, by studying how informative each data source can be relative to a functional category; (ii) adding gene expression profiles and protein-protein interaction network data to make more confident biological process assignments; (iii) exploring new ways of combining component predictions into a single unified probabilistic framework, employing graphical machine learning approaches; and (iv) delivering biologist-friendly Web tools to allow our work to be exploited by scientists working across the whole BBSRC remit.

Planned Impact

The immediate beneficiaries of this research are the broad community of bench biologists needing additional functional clues for proteins of interest. Both academic and industry scientists will benefit in a similar way as the results of this research will be available freely to all users. Commercial scientists with sensitive data will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the function of the 40% of functionally uncharacterised proteins in model organism genomes can have significant impact in a broad variety of areas e.g. drug, antibody and vaccine design, biochemical engineering, protein design and even nanotechnology.

Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and the products of these genes do and how the proteins interact can have wider implications in understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.

Publications

Description We first updated FFPred, a machine learning-based method that tackles challenging protein function annotation cases in which conventional approaches relying on sequence similarity provide little help. The tool exploits a library of Support Vector Machines to examine the complex relationships between individual protein functions and biophysical attributes describing secondary structure, transmembrane helices, intrinsically disordered regions, signal peptides and other motifs. We showed that this tool achieves state-of-the-art performance on test cases with no close homologs of known function. We also showed how useful this predictor can be in analysing the potential functional consequences of alternative splicing and in relating them to changes in the biochemical features of the corresponding gene products.
We then investigated the usefulness of deep neural networks (that include multiple hidden layers with hundreds of nodes each) to predict protein function in a more holistic way, that is by considering many different functional classes together rather than separately like FFPred does. In particular, we made use of multi-task deep neural networks, which try to leverage explicitly the commonalities and differences among individual function prediction tasks. Through stringent benchmarking experiments we showed that this approach can make more accurate predictions than FFPred does, and highlighted aspects for further improvement.
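The multi-task idea described above can be illustrated with a minimal sketch. It is not the project's implementation: the layer sizes, the tanh activation, the weights and the names `forward` and `multitask_loss` are all hypothetical choices made for illustration. It shows one hidden layer shared by every function class, one sigmoid output per class, and a loss that sums the per-class errors so that the shared layer receives a training signal from all tasks at once.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, shared_w, shared_b, head_w, head_b):
    """Shared hidden layer feeding one sigmoid output per function class."""
    # Hidden representation reused by every prediction task.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(shared_w, shared_b)]
    # One output per task, all reading the same hidden vector.
    return [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(head_w, head_b)]


def multitask_loss(probs, labels):
    """Sum of the binary cross-entropies of the individual tasks."""
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))


# Hypothetical toy input: 3 protein features, 2 hidden units, 3 function classes.
x = [0.5, -1.0, 2.0]
shared_w = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
shared_b = [0.0, 0.0]
head_w = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]
head_b = [0.0, 0.0, 0.0]

probs = forward(x, shared_w, shared_b, head_w, head_b)
loss = multitask_loss(probs, [1, 0, 0])
```

In a divide-and-conquer setting, by contrast, each function class would have its own independent model and no parameters would be shared.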
We also researched novel ways of using high-throughput data sources (such as gene expression profiles and protein-protein interactions) in making novel functional annotations. Recent advances in data mining and machine learning have led to effective methods to generate graph embeddings, i.e. to represent the nodes in the graph as points in a multi-dimensional space such that neighbouring nodes correspond to points close to each other. We trained a deep neural network with maxout units to predict many protein functions at once from such a data representation and showed that this approach produces more accurate predictions than the nearest-neighbour or divide-and-conquer (FFPred-like) strategies used by other researchers. These findings appear to be independent of the graph embedding technique used.
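For readers unfamiliar with maxout units, the following minimal sketch shows what one such layer computes: each unit evaluates several affine functions ("pieces") of its input and outputs the largest value. The function name `maxout_layer`, the two-dimensional embedding and the weights are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the trained network.

```python
def maxout_layer(x, weights, biases):
    """Forward pass of a maxout layer.

    weights[u][p] is the weight vector of piece p of unit u,
    biases[u][p] is the matching scalar bias.
    """
    outputs = []
    for unit_w, unit_b in zip(weights, biases):
        # Each piece is an affine function of the input vector.
        pieces = [sum(w * xi for w, xi in zip(piece_w, x)) + b
                  for piece_w, b in zip(unit_w, unit_b)]
        # The unit's activation is the maximum over its pieces.
        outputs.append(max(pieces))
    return outputs


# Hypothetical 2-d graph embedding of one protein.
embedding = [1.0, -2.0]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],    # unit 0: pieces x0 and x1
    [[-1.0, 0.0], [0.0, -1.0]],  # unit 1: pieces -x0 and -x1
]
biases = [[0.0, 0.0], [0.0, 0.0]]

activations = maxout_layer(embedding, weights, biases)
print(activations)  # [1.0, 2.0]
```

Unit 0 returns max(1.0, -2.0) and unit 1 returns max(-1.0, 2.0), so the activation function itself is learnt through the weights of the pieces rather than fixed in advance.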
We also explored better ways of combining the results from different predictors, which mine heterogeneous sources of biological information and tend to produce large numbers of partially overlapping functional assignments. We generated a compact representation of the Gene Ontology using graph embedding techniques, merged these data with the output scores from a number of function prediction methods, and learnt a classifier that distinguishes predictions that are compatible with known annotations from those that are not. Under stringent benchmarking conditions, we found that this approach produces a more accurate consensus than other heuristic methods do.
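The consensus step can be sketched roughly as follows, under loudly stated assumptions: the type of classifier is not specified here, so plain logistic regression is used as a stand-in, and the embeddings, scores, labels and function names (`make_features`, `train`, `predict`) are invented toy values. The sketch only illustrates the data flow: a Gene Ontology term embedding is concatenated with the scores that several predictors gave to a protein-term pair, and a classifier learns whether the pair is compatible with known annotations.

```python
import math


def make_features(term_embedding, predictor_scores):
    """Concatenate a GO term embedding with the component predictors' scores."""
    return list(term_embedding) + list(predictor_scores)


def predict(weights, bias, features):
    """Probability that the protein-term pair is a correct annotation."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, epochs=500, lr=0.5):
    """Stochastic gradient descent on the logistic loss (stand-in classifier)."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, y in zip(examples, labels):
            error = predict(weights, bias, features) - y
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Hypothetical toy data: 2-d term embeddings plus scores from two predictors.
examples = [
    make_features([0.2, 0.1], [0.9, 0.8]),
    make_features([0.2, 0.1], [0.1, 0.2]),
    make_features([-0.4, 0.3], [0.8, 0.7]),
    make_features([-0.4, 0.3], [0.2, 0.1]),
]
labels = [1, 0, 1, 0]  # 1 = compatible with known annotations

weights, bias = train(examples, labels)
confident = predict(weights, bias, make_features([0.2, 0.1], [0.85, 0.9]))
doubtful = predict(weights, bias, make_features([-0.4, 0.3], [0.15, 0.1]))
```

Including the term embedding in the feature vector lets the learnt combination depend on which part of the ontology a prediction falls in, which a fixed heuristic such as averaging the scores cannot do.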
Exploitation Route FFPred is available to the scientific community for download and online use through our group's user-friendly web-server. The tool is expected to help experimentalists narrow down the number of assays required to characterise their proteins of interest, especially when no close homologues of known function are available. The graphical visualisation of the biophysical attributes encoded in the input sequences allows users to generate testable hypotheses about how changes in sequence affect function, which is particularly relevant to understanding the functional consequences of alternative mRNA splicing. Furthermore, the underlying methodology has been successfully adapted to study the functions of the fruit fly interactome across different developmental stages.
The computational biology community will benefit from these studies in several different ways. Implementations of some of the function prediction methods developed in the course of the project are publicly available for download, so they can be integrated into third-party annotation pipelines. The results from our thorough benchmarking experiments are reported in the corresponding publications and form a useful basis for future efforts aimed at making further improvements. Our group is keen to build on these findings and develop more reliable tools that rank the lists of predicted functions for each individual protein more accurately.
Sectors Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology

URL http://bioinf.cs.ucl.ac.uk/web_servers/ffpred/ffpred_help
 
Title FFPRED3 
Description This server is designed to predict Gene Ontology Biological Process and Molecular Function terms for orphan and unannotated protein sequences. The prediction method has been optimised for performance on these 'difficult to annotate' targets using a protein feature based method that does not require prior identification of protein sequence homologues. The method is best suited to annotating proteins with general function classes rather than deciphering specific annotations of proteins with many homologues. 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact The web tool has been used 489 times in the last month - which is typical. We don't track user affiliation, but based on previous questionnaires we estimate that approximately 15-20% of the usage comes from non-academic internet domains. 
URL http://bioinf.cs.ucl.ac.uk/psipred/?ffpred=1