IDA2GO - Improving Domain Annotation and Representation within InterPro
Lead Research Organisation:
European Bioinformatics Institute
Department Name: Sequence Database Group
Abstract
Protein domains are discrete, stable structures within proteins. They typically form distinct operational units with responsibility for specific functions, such as binding a given molecule or catalyzing a specific step in an enzymatic reaction. To fully understand a protein's biological role, it is necessary to understand domain distribution, evolution and function.
The core concept of InterPro is that if two proteins look similar (structurally and/or at the sequence level), there is a strong possibility that they will have a similar or identical function. The similarities and differences between proteins that have the same function or structure can be modelled; InterPro calls the resultant predictive models "signatures". InterPro uses signatures from several different databases (each of which has a particular niche or biological focus) to predict information about proteins. InterPro integrates signatures that appear to represent the same protein family, domain or site. In addition, concise information about the signatures and the types of proteins they match is added, including terms from the Gene Ontology (GO), a controlled vocabulary that is used to describe biological functions, processes and the subcellular localisation of gene products in a standardised way.
InterPro regularly calculates the presence of domains in sequences from the UniProtKB protein knowledgebase. It makes this information available through websites and software tools. However, the manner in which these data are displayed and calculated is sub-optimal and can lead to confusion for the biologists attempting to use them. Similarly, because domains can be found in proteins which have quite different overall functions, it is difficult to accurately annotate individual domains with GO terms.
The IDA2GO project intends to improve the way that domains are represented and annotated within the InterPro database so that scientists are able to use these data to annotate genomes functionally, to discover novel domains and to better understand how proteins evolve.
Technical Summary
IDA2GO will improve the annotation and representation of domain information within InterPro.
The member databases which make up the InterPro resource each have their own biological focus and signature methodology. InterPro aims to provide a consensus view of their data, but achieving this for protein domain information is complex because each database defines domains in a different way. Whilst the definitions often overlap, there are many cases where they differ substantially. An imperfect, compromise solution (where some databases' definitions are favoured over others) is currently used to generate domain architectures on the InterPro website. This makes the data difficult to interpret, and sophisticated analysis (e.g. searching for proteins that contain a particular set of domains) is currently not possible. We intend to use graph theory to represent InterPro's domain architectures accurately. This would allow an in-depth analysis of the domain information contained within InterPro for arguably the first time. A user-friendly query interface that is tightly integrated into the existing InterPro website will also be produced.
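As an illustration only, the kind of query a graph-based representation makes possible can be sketched as follows. This is not the project's algorithm, which is not described here; it is a minimal sketch that assumes each protein has already been reduced to an ordered list of non-overlapping domains. The protein identifiers, the domain names and the helper functions (build_graph, with_all, adjacent, followed_by) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical input: protein ID -> domains in N- to C-terminal order.
PROTEINS = {
    "protA": ["SH3", "SH2", "kinase"],
    "protB": ["kinase", "SH3"],
    "protC": ["SH3", "kinase"],
    "protD": ["SH2"],
}


def build_graph(proteins):
    """Build a directed graph of domain adjacency.

    Returns (edges, members):
      edges   maps (upstream, downstream) pairs of directly adjacent
              domains to the set of proteins containing that pair;
      members maps each domain (node) to the proteins that contain it.
    """
    edges = defaultdict(set)
    members = defaultdict(set)
    for protein_id, domains in proteins.items():
        for domain in domains:
            members[domain].add(protein_id)
        for upstream, downstream in zip(domains, domains[1:]):
            edges[(upstream, downstream)].add(protein_id)
    return edges, members


def with_all(members, required):
    """Proteins that contain every domain in `required`, in any order."""
    sets = [members.get(domain, set()) for domain in required]
    return set.intersection(*sets) if sets else set()


def adjacent(edges, upstream, downstream):
    """Proteins in which `downstream` directly follows `upstream`."""
    return set(edges.get((upstream, downstream), set()))


def followed_by(proteins, first, second):
    """Proteins in which `first` occurs anywhere N-terminal to `second`."""
    hits = set()
    for protein_id, domains in proteins.items():
        if first in domains:
            # Search only the part of the protein after the earliest `first`.
            rest = domains[domains.index(first) + 1:]
            if second in rest:
                hits.add(protein_id)
    return hits
```

With this toy data, with_all(members, ["SH3", "kinase"]) selects the three proteins that contain both domains regardless of order, whereas followed_by(PROTEINS, "SH3", "kinase") keeps only those in which the SH3 domain lies N-terminal to the kinase domain.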
In addition, we will collaborate with the Gene Ontology (GO) consortium to improve the annotation of domains and domain architectures using the Gene Ontology. At present, annotation of InterPro's domains is relatively sparse (both in coverage and depth of annotation) compared with the annotation of protein families. This is due to the inherent difficulties in annotating domains, as they are frequently found in different functional contexts. We will mitigate this by manually mapping GO terms to domains with additional qualifiers describing how the domain contributes to the protein's function. We will also perform an automatic mapping of GO terms to the domain architectures produced in the first part of the project. Together, these approaches will greatly improve the coverage and utility of InterPro2GO.
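One possible form of the automatic mapping is a consensus rule: a GO term is assigned to a domain architecture only if a sufficient fraction of the annotated proteins sharing that architecture carry the term. The sketch below is an assumption for illustration and not the method used by the project; the function consensus_go, its min_fraction parameter, the protein identifiers and the example annotations are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical input: protein ID -> ordered domain architecture.
ARCHITECTURES = {
    "p1": ["SH3", "kinase"],
    "p2": ["SH3", "kinase"],
    "p3": ["kinase", "SH3"],
    "p4": ["SH3", "kinase"],  # no GO annotation available
}

# Hypothetical input: protein ID -> set of GO terms (illustrative only).
ANNOTATIONS = {
    "p1": {"GO:0004672", "GO:0005515"},
    "p2": {"GO:0004672"},
    "p3": {"GO:0005515"},
}


def consensus_go(architectures, annotations, min_fraction=1.0):
    """Map each architecture to GO terms shared by its annotated proteins.

    A term is kept when at least `min_fraction` of the annotated proteins
    with that architecture carry it. Unannotated proteins are ignored.
    """
    term_sets_by_arch = defaultdict(list)
    for protein_id, domains in architectures.items():
        if protein_id in annotations:
            term_sets_by_arch[tuple(domains)].append(annotations[protein_id])

    mapping = {}
    for arch, term_sets in term_sets_by_arch.items():
        # Each protein contributes a term at most once because it holds a set.
        counts = Counter(term for terms in term_sets for term in terms)
        total = len(term_sets)
        mapping[arch] = {
            term for term, count in counts.items()
            if count / total >= min_fraction
        }
    return mapping
```

A strict threshold (min_fraction=1.0) transfers only terms on which all annotated proteins agree; lowering it trades precision for coverage, which is the kind of choice that manual curation would need to review.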
Planned Impact
The InterPro database has a large number of users of both its website (~50,000 unique IPs served per month) and the InterProScan search software (21.3 million searches performed at EBI in 2011 alone). This userbase comprises both academic and commercial scientists with a range of research questions.
The biggest "traditional" usage of InterPro has been the high-throughput functional annotation of genome sequencing projects. InterPro has the benefit of a comprehensive set of protein signatures for predicting protein function and sequence features, as well as trusted annotations via the association of Gene Ontology terms. The IDA2GO project promises to benefit these users considerably: the domain architecture data that will be generated could be used in quality control for gene coding predictions and for transfer of annotation between orthologs. The new GO term associations for domains and domain architectures should increase the coverage of functional annotation of gene products.
The data within InterPro covers many areas of taxonomy and, as such, can be utilised by a wide range of biologists, including crop researchers, drug developers and microbiologists. For those researchers covering novel areas of biology (e.g. metagenomics), where proteins are not as thoroughly functionally characterised as in more established areas, domain information is of particular utility because even though the protein's particular function may be unknown, the presence of a well-understood set of individual domains can give insights into its potential role.
It is hoped that the project will also benefit evolutionary biologists, particularly those studying how domains shuffle and adapt over time and across species. Presenting domain architectures in a quick, easy-to-use graphical interface should allow better exploitation of the wealth of information held within InterPro.
People |
ORCID iD |
Sarah Hunter (Principal Investigator) |
Publications
Mitchell A (2015) The InterPro protein families database: the classification resource after 15 years. In Nucleic Acids Research
Description | Protein domains are discrete, stable structures within proteins. They represent fundamental elements of evolution. Domains typically form distinct operational units with responsibility for specific functions, such as binding a given molecule or catalyzing a specific step in an enzymatic reaction. As a consequence, the organisation of the domains within a protein (the "domain architecture") has a significant implication for protein function. To fully understand a protein's role, and to make sense of sequenced genomes and their evolution, it is essential that we understand domain distribution and function. This grant funded work to improve domain representation within InterPro. The major impacts have been: i) the development of a new algorithm, based on graph theory, that accurately represents the arrangement of protein domains in the InterPro database; and ii) the development of a new search tool to allow interactive querying of these domain architectures. The domain architecture tool allows users to search the InterPro database with a particular set of domains, and returns all of the domain organisations and associated proteins that match the query. This makes it easy to rapidly identify all of the different domain combinations, where one type of domain co-occurs with another, or a particular domain is followed by another (e.g., an SH3 domain is found C-terminal to a protein kinase domain, or vice versa), and to list the proteins that match each domain organisation. The tool has been made available via the InterPro website (http://www.ebi.ac.uk/interpro/search/domain-organisation). |
Exploitation Route | The tool may be used to gain insights into domain function and evolution. It allows users to search for proteins based on domain absence/presence and/or order. It could be used to investigate the occurrence of predicted domain architectures in selected proteomes. Furthermore, it could also be used to assess the likely function of proteins with particular domain architectures, through comparison to characterised proteins with the same domain arrangement. |
Sectors | Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology |
Description | The InterPro domain architecture tool was launched in beta relatively recently (July 2014) and so is still growing its user base. The tool is part of the InterPro database, which is widely used by the research community (and beyond), receiving millions of web hits per month. The associated InterProScan resource is the most widely used web service at the EBI, receiving almost 3 million sequence searches per day. A large user base, which includes the commercial pharmaceutical and biotechnology sectors, therefore exists for the tool, which has potential uses in understanding the domain architectures of proteins involved in agriculture, food manufacture and spoilage, epidemiology, elucidation of antibiotic resistance mechanisms and bioenergy production. We expect the tool's use to increase significantly following our publication describing its deployment (due out in January 2015). |
First Year Of Impact | 2014 |
Sector | Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology |
Title | InterPro |
Description | InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. We combine protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool |
Type Of Material | Database/Collection of data |
Provided To Others? | Yes |
Impact | All of the annotations provided by InterPro underpin the automatic annotation pipeline within the UniProt database. InterPro provides tens of millions of sequences to UniProt through the InterPro2GO pipeline. InterPro is the most widely used web service at EMBL-EBI, performing ~15,000,000 searches per month for users around the world. |
URL | http://www.ebi.ac.uk/interpro/ |
Title | InterPro Domain Architecture Search Tool |
Description | The InterPro Domain Architecture (IDA) tool allows users to search the InterPro database with a particular set of domains, and returns all of the domain architectures and associated proteins that match the query. This makes it easy to rapidly identify all of the different domain combinations where one type of domain co-occurs with another, or a particular domain is followed by another (e.g., an SH3 domain is found C-terminal to a protein kinase domain, or vice versa), and to list the proteins that match each domain architecture. |
Type Of Technology | Webtool/Application |
Year Produced | 2014 |
Impact | Since the tool was released in beta 6 months ago, it has received over 1,500 unique page views. We anticipate that the number of visitors will further increase, once the publication is released and awareness of the tool grows. |
URL | http://www.ebi.ac.uk/interpro/search/domain-organisation |