Tools for Ontology Annotation: dcGO

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

The massive amount of biological data, especially sequenced genomes and the protein sequences they encode, is too large for humans to cross-reference manually against the literature. The solution is to use ontologies (controlled vocabularies), so that computer systems can process and connect relationships in a form that is also readable by humans. The most widely used ontology is the Gene Ontology (GO), which is intended to describe protein functions.

The most used resource for GO annotation is UniProt, which provides protein sequences and functional annotation. GO annotations for proteins have two main types of evidence: experimental or curated annotations, and IEA (Inferred from Electronic Annotation). IEA comes mostly from an InterPro mapping. InterPro is an integrated database that combines protein domains (or, more precisely, signatures) from diverse sources with different definitions. In InterPro, annotations for domains are hand-curated via homology. Recently, we have pioneered a methodology in pursuit of a fully automated tool: domain-centric Gene Ontology (dcGO). dcGO is the only freely available, fully automated, general methodology for domain-based GO annotation. Thanks to its automated nature, dcGO is much more extensive than the InterPro hand curation. By comparison, we have shown that the quality of dcGO is at least as good as that of InterPro in terms of function predictions for proteins in UniProt. Therefore, extending dcGO to InterPro will have a considerable downstream impact on the coverage and quality of UniProt IEA, which scientists around the world routinely use as their primary source of annotation.

In addition to GO, we will also provide domain annotations using other biomedical ontologies, such as those describing diseases and phenotypes. For these kinds of annotations, dcGO is currently the only approach that can be extended in a straightforward manner. In CAFA, a community-based critical assessment of protein function prediction, dcGO has been demonstrated for use in automated function annotation. In the next CAFA, we will demonstrate that dcGO is also a suitable baseline for other ontologies, and thus for inclusion in InterPro and UniProt. To meet user requests for analysing sequences as a whole, we will also provide ontology enrichment analysis that allows users to understand which functions (and other relevant knowledge) are overrepresented in the sequences submitted. This user-driven tool will be implemented in a computationally efficient way; the time required will be on the order of seconds or minutes, rather than hours or days. In summary, the proposed research will be undertaken with a tight link to InterPro, UniProt, CAFA and end-users, and these collaborative connections will help translate our domain-centric solution into the industry standard for annotating and analysing genome sequences.
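As an illustration of what ontology enrichment analysis involves, the following is a minimal sketch using a one-sided hypergeometric test, a common choice for overrepresentation analysis. It is not taken from the dcGO implementation: the function names, the input layout (a mapping from ontology term to the set of sequence identifiers annotated with it) and the absence of multiple-testing correction are all assumptions made for the example.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n annotated, N drawn)."""
    upper = min(n, N)
    total = comb(M, N)
    # Sum the exact probabilities of observing k, k+1, ..., upper annotated items.
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def enrich(submitted, background, annotations):
    """Rank ontology terms by overrepresentation among the submitted sequences.

    annotations: dict mapping a term to the set of sequence ids carrying it
    (hypothetical layout, chosen for this sketch only).
    Returns (term, overlap, p_value) tuples sorted by ascending p-value.
    """
    background = set(background)
    submitted = set(submitted) & background
    results = []
    for term, members in annotations.items():
        members_bg = set(members) & background
        overlap = len(members_bg & submitted)
        if overlap == 0:
            continue  # a term absent from the submission cannot be enriched
        p = hypergeom_sf(overlap, len(background), len(members_bg), len(submitted))
        results.append((term, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Toy data: ten background sequences, two submitted, placeholder term names.
background = [f"s{i}" for i in range(1, 11)]
annotations = {
    "TERM:A": {"s1", "s2", "s3"},
    "TERM:B": {"s4", "s5", "s6", "s7", "s8", "s9", "s10"},
}
ranked = enrich({"s1", "s2"}, background, annotations)
```

In practice the p-values would normally be adjusted for multiple testing across all terms, which this sketch leaves out.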

Technical Summary

Proteins are of modular design, and domains or signatures are often the operational units responsible for protein functions. A domain-based approach to IEA (Inferred from Electronic Annotation) therefore makes more sense than homology based on whole sequences. IEA from InterPro mappings is the largest contributor to UniProt IEA, but InterPro2GO is currently entirely hand-curated. We propose to address this by working with InterPro. Offering a dcGO baseline can increase their coverage, speed up the curators' job and extend annotation to unannotated proteins. It also frees the curators' time to work on the quality that can only be added by hand, rather than chasing associations that can be automated via dcGO. dcGO is one of the top predictors in CAFA, and among them it is the only one that relies on neither complex algorithms (e.g. machine learning) nor data integration; instead it directly transfers annotations from domains to the proteins they reside in. This simplicity makes dcGO amenable to automation and to large-scale applications such as UniProt and InterPro. On their side, InterPro will send us their InterPro annotation of UniProt each month, and we will develop a pipeline that automatically returns the dcGO annotation. There will be a simple track or interface in which curators accept or reject annotations using checkboxes and which records their responses, so that each month, when they send us their updated UniProt InterProScan results, they also send the previous month's hand-curation decisions. Upon receiving this feedback on dcGO, we can adapt the method to produce annotation closer to what they want, in a loop that makes the curators' valuable time more and more effective. Our annotation will go far beyond what InterPro and UniProt are currently able to achieve in two principal ways: (1) 15 other ontologies, e.g. phenotype and disease, will be included; (2) supra-domains (pairs of domains or longer combinations), which are even more difficult than individual domains to curate manually, will be annotated.
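The direct transfer of annotations from domains and supra-domains to the proteins that contain them can be sketched as follows. This is an illustrative sketch only, not the dcGO pipeline: the function names, the lookup-table layout, the treatment of supra-domains as contiguous runs of the domain architecture, and the length cap `max_len` are all assumptions made for the example.

```python
def supra_domains(architecture, max_len=3):
    """Yield contiguous domain combinations of length 2..max_len.

    Assumption for this sketch: a supra-domain is a run of adjacent domains
    in the protein's domain architecture (ordered from N- to C-terminus).
    """
    n = len(architecture)
    for size in range(2, max_len + 1):
        for start in range(n - size + 1):
            yield tuple(architecture[start:start + size])


def annotate_protein(architecture, unit2terms, max_len=3):
    """Transfer ontology terms from domains and supra-domains to a protein.

    unit2terms: dict mapping a tuple of domain ids to a set of ontology terms
    (hypothetical layout). Returns a dict mapping each transferred term to
    the set of units that support it, so the evidence stays traceable.
    """
    units = [(d,) for d in architecture]
    units.extend(supra_domains(architecture, max_len))
    transferred = {}
    for unit in units:
        for term in unit2terms.get(unit, ()):
            transferred.setdefault(term, set()).add(unit)
    return transferred


# Toy data with placeholder domain and term identifiers.
unit2terms = {
    ("d1",): {"TERM:X"},
    ("d1", "d2"): {"TERM:Y"},
    ("d3", "d1"): {"TERM:Z"},  # order matters, so this never matches below
}
protein_terms = annotate_protein(["d1", "d2", "d3"], unit2terms)
```

Keeping the supporting units alongside each term is a design choice for the sketch: it would let a curator see which domain or supra-domain produced an annotation when accepting or rejecting it.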

Planned Impact

The UniProt database is the world's primary resource for protein annotation. This proposal will have a substantial impact on extending and improving the annotation provided by UniProt, and thus a wide-reaching impact on the scientific community worldwide. Any user of UniProt protein annotations will potentially benefit from this work. The reason is that the largest share of UniProt annotation is in the form of electronically inferred annotation (IEA), and the bulk of that, in turn, is supplied via InterPro using its mapping of GO terms to InterPro signatures. This project will improve the quality and coverage of the InterPro GO associations with signatures, and thus impact the downstream IEA in UniProt.

The impact is therefore not restricted to UniProt users but will also directly reach users of InterPro, which, as a portal to the data provided by its member databases, receives an enormous amount of web traffic and citations; this is indicative of the impact on the research community that improving the GO associations will have. Furthermore, the individual member databases (such as Pfam, another database with a very large worldwide audience) will benefit because each will be supplied individually with its GO associations. Pfam already relies on the InterPro GO mappings, so this work will also translate into an improvement for Pfam.

The research community currently has a blinkered focus in two respects that the work in this proposal will address. Firstly, it focuses on GO as the ontology; this work will open up annotation to 15 other biomedical ontologies, such as phenotype and disease, thus promoting these ontologies, their use amongst scientists and, importantly, their inclusion in resources such as UniProt and InterPro. Secondly, GO is specifically aimed at annotating whole proteins, whereas it is often the domains, or pairs or combinations of domains, that are the functional units responsible for a given activity or function. By basing these annotations on a domain-centric view of proteins, this work will also have a profound impact on the way in which people consider the functional annotation of proteins. In fact, dcGO has already had some success: the domain-centric view is discussed in the CAFA Nature Methods paper, and the next CAFA competition will include phenotype ontology prediction.

Publications

Description We have developed the dcGO resource, extending it as described and connecting it to all 80 million sequences in the SUPERFAMILY database. We also took part in the international CAFA competition; averaged across multiple assessment categories, we ranked among the top 3-5 of 126 methods from 56 research groups worldwide. Crucially, however, of all these top methods dcGO was the only one that is high-throughput and publicly available.
dcGO also covers 16 biomedical ontologies, far more than any other method. We also added web services and enrichment analysis tools.
Exploitation Route dcGO is a community resource and may be used by others for functional annotation of proteins, domains, architectures, genomes, etc.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Chemicals; Healthcare; Pharmaceuticals and Medical Biotechnology

 
Description dcGO 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution functional annotation
Collaborator Contribution integration with UniProt and InterPro
Impact Not yet
Start Year 2014