Developing a novel web-based tool for functional annotation of proteins

Lead Research Organisation: University College London
Department Name: Computer Science

Abstract

Every living cell within an organism contains thousands of different protein molecules. Although we know the biological function or role of most of these proteins, 40% or so of the proteins in a typical human cell, for example, have no known function, even though we are fairly certain that they do indeed have one. Some people refer to this set of unknown genes as the 'dark matter' within the genome, i.e. we know the genes are present but we simply do not know why they are there. Function can be described in many different ways, and given that the interactions of groups of proteins are perhaps the most important types of processes occurring within cells, describing the function of a protein by the interactions it makes with other proteins (i.e. in the form of networks) is clearly a good approach. Knowing how these protein molecules correctly bind together, or interact, can even help us to understand when something goes wrong within the machinery of a cell, for example during aging processes. This in turn will allow us to better understand disease or aging and will also perhaps help us to develop medicines to correct the faulty machinery. By taking data from many different experimental sources, all from publicly available databases, computers can help us to successfully predict the function of a protein based on, for example, which other proteins it interacts with and which genes are switched on in synchrony with the protein's own encoding gene. We can also look at component features of a protein, such as which parts might be embedded in the membranes surrounding the cell or what kind of overall shape the protein might have. This project seeks to develop new computer programs to analyse these sorts of data and thus help biologists deduce the functions of the many genes, and the proteins they encode, whose functions are currently unknown.
These programs and predictions will be made available via the World Wide Web, so that biologists can easily make use of our results for their own research work with just a PC and a standard web browser. This should greatly help with research into how cells work and how they might go wrong during disease or aging. Ultimately, such discoveries might even lead to the development of new drugs and treatments or even new industrial processes for synthesising useful chemicals.

Technical Summary

Since the 1980s, high-throughput sequencing technologies have produced over 100 billion base pairs of DNA sequence, cataloguing the genetic material of more than 1000 organisms. Genome sequences provide information not only on a complete set of genes and their precise locations in the chromosome, but also help to define the core proteome, i.e. the set of functional proteins that are the workhorse components of living cells. In this post-sequencing era, a detailed characterisation of a protein, its structural form, functional role and interactions with other molecules is the next key step in driving our understanding of cellular processes, along with responses to external stimuli or changes to the organism's environment. Ultimately this could also lead to new understanding of biological systems and related disease mechanisms. We propose here to build a web-based tool around our existing collection of publicly available data relevant to predicting protein function and to apply state-of-the-art machine learning techniques to the integration of these data. Thus we will extract novel functional annotations for a number of model organism genomes (including human, mouse and yeast). The range of data sources we use includes protein sequence features, genome-wide domain-based evolutionary information (e.g. domain fusions), publicly available transcriptomic data, microRNA and other regulatory binding sites, and both experimental and predicted protein-protein interaction maps. This analysis will rely on supercomputing facilities available at UCL in the form of the recently deployed Legion supercomputer, which will be further upgraded in 2010. Finally, we propose to build a novel user-driven protein classification tool, which will allow any biologist to compile his or her own protein classifier with no expertise needed in machine learning.

Planned Impact

The immediate beneficiaries of this research are the broad community of bench biologists needing additional functional clues for proteins of interest. Both academic and industry scientists will benefit in a similar way, as the results of this research will be freely available to all users. Commercial scientists with sensitive data will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the function of the 40% of functionally uncharacterised proteins in model organism genomes can have a significant impact in a broad variety of areas, e.g. drug, antibody and vaccine design, biochemical engineering, protein design and even nanotechnology. Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and their products do, and how the proteins interact, can have wider implications for understanding the workings of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.
 
Description Through the work funded by this grant, a new, updated and improved version of the FFPred tool was developed. FFPred is a computational tool that predicts the function of human and other eukaryotic proteins using a technique called Feature-based Function Prediction. To do this, FFPred uses machine learning tools called Support Vector Machines, which exploit the biologically relevant "features" that can be extracted from the sequence of a query protein, making no use of structural data and only limited use of comparisons between multiple sequences, or homology information.
As a consequence, a critical step for the proper functioning of FFPred is the training of its machine learning algorithms and tools. Much of the work funded by this grant was devoted to developing a series of procedures that allow FFPred's prediction tools to be trained in a mostly automated, reproducible way, starting from the basic input of public databases available at the time of training.
While doing this, a new version of FFPred itself was implemented. It utilises state-of-the-art third-party software (needed to extract some of the sequence "features" for the query protein) and biological databases, runs in a more streamlined way and gives an output that is simpler to interpret.
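
To illustrate what extracting "features" from a query sequence can look like, the following is a minimal Python sketch. It is not FFPred's code and does not reproduce FFPred's actual feature set: the function names (`sequence_features`, `to_vector`), the three summary features and the residue groupings are assumptions chosen for illustration only. It turns a protein sequence into a fixed-length numeric vector of the kind that could be passed to a classifier such as a Support Vector Machine.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative residue groupings (assumptions, not FFPred's definitions).
HYDROPHOBIC = set("AILMFVWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def sequence_features(sequence):
    """Return a dict of simple sequence-derived features.

    Non-standard letters (e.g. 'X') count towards the length but
    contribute to none of the per-residue fractions.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    counts = Counter(seq)

    # Amino acid composition: one fraction per standard residue.
    features = {f"frac_{aa}": counts[aa] / n for aa in AMINO_ACIDS}

    # A few global summary features.
    features["length"] = n
    features["frac_hydrophobic"] = sum(counts[aa] for aa in HYDROPHOBIC) / n
    positive = sum(counts[aa] for aa in POSITIVE)
    negative = sum(counts[aa] for aa in NEGATIVE)
    features["net_charge_per_residue"] = (positive - negative) / n
    return features


def to_vector(features):
    """Order the features by name so every protein yields the same layout."""
    return [features[name] for name in sorted(features)]


if __name__ == "__main__":
    # Hypothetical query sequence, used only to exercise the functions.
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    vector = to_vector(sequence_features(query))
    print(len(vector), "features extracted")
```

Because every protein is mapped to a vector with the same layout, sequences with no detectable homologues can still be compared and classified, which is the general idea behind a feature-based approach.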

Additionally, work was performed to make FFPred available to the community not only via the web interface (which was improved with a new layout for the results section), but also as a standalone program that can be downloaded as a package for Unix/Linux-based platforms, as documented in the corresponding publication.
Exploitation Route The improved version of FFPred can now be easily and freely accessed by researchers all over the world, both as a downloadable standalone software package and via a web interface. Also, as a result of work on this grant, FFPred's machine learning algorithms can now be easily retrained by other researchers in the group as more data on the functional characterisation of human proteins becomes available, which is expected in the near future.
Of even greater interest, FFPred could be expanded and adapted to other organisms, in order to contribute to discoveries and development in the bioinformatics analysis of eukaryotic proteomes in general. This idea is currently being actively tested in the group, and it seems to yield useful results at least for some other widely researched model organisms.

Furthermore, the implementation and ideas used by FFPred can suggest new ways of thinking about function prediction. For instance, the idea of feature-based sequence analysis is being implemented in a new tool for searching for functionally related proteins, which has been prototyped in our group for use by collaborators in experimental biology projects.
In particular, new components in this version of FFPred may be easily taken forward and influence other sectors of protein bioinformatics analysis. This is evident, for example, in the case of its new Cellular Component section, which predicts the sub-cellular localisation of query proteins and which showed great potential in recent benchmark testing in an international experiment (the second Critical Assessment of Function Annotation).
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

URL http://bioinf.cs.ucl.ac.uk/index.php?id=3329
 
Description Our publicly available computational tools are used by both academic and commercial users, and we estimate that around 15% of our 800 user jobs per day are from the commercial sector. Some aspects of this work have also resulted in a new 3-year collaborative project between UCL and Elsevier (contract signed in November 2014) to look at how information can be usefully mined from literature sources and used to help identify the functions of uncharacterised genes and proteins.
First Year Of Impact 2014
Sector Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title FFPRED 
Description This server is designed to predict Gene Ontology Biological Process and Molecular Function terms for orphan and unannotated protein sequences. The prediction method has been optimised for performance on these 'difficult to annotate' targets using a protein feature based method that does not require prior identification of protein sequence homologues. The method is best suited to annotating proteins with general function classes rather than deciphering specific annotations of proteins with many homologues. 
Type Of Technology Webtool/Application 
Year Produced 2013 
Impact Currently used by an estimated 10-20 commercial sector users per day and 80-100 academic users. 
URL http://bioinf.cs.ucl.ac.uk/web_servers