Embracing new technologies to streamline, improve and sustain InterPro and its contributing databases

Lead Research Organisation: The Wellcome Trust Sanger Institute
Department Name: Research Directorate

Abstract

New DNA sequencing technologies have led to a flood of new data in sequence databases being submitted by individual scientists, genome sequencing projects and metagenomics projects. These sequences enter the databases with little or no annotation, limiting their usefulness to the scientific community. This has inspired the development of new tools for automatic annotation of the encoded protein sequences. One of the most successful developments in this area has been in the production of so-called protein 'signatures', diagnostic methods that are able to characterise newly-determined sequences in terms of the protein families to which they belong and/or the structural or functional domains they contain. Protein signature approaches have been adopted by a number of databases, and ten of the top such resources are integrated into the InterPro database. InterPro, and its accompanying protein analysis software tool, InterProScan, is now one of the leading protein functional classification resources in the world. However, despite its success, InterPro and its partners are currently suffering from a lack of financial support. The level of funding required to maintain and improve a database of this size is often underestimated. The amount of incoming data is increasing exponentially, and databases now struggle to provide their data to the public in a timely way, while at the same time maintaining the necessary high standards of data quality. Moreover, as they become more popular, and user demands increase, these core databases endure mounting pressure not only to keep up with the expanding volume of data and growing community requirements, but also to be early adopters of newly emerging technologies. This proposal aims to resolve these issues by embracing new technologies to enhance and further develop InterPro and its source databases. It aims to streamline production processes both to provide more regular data releases and to better cope with increased volumes of data. 
With more formalised Consortium activities and coordination thereof, we will make more efficient use of resources and share tasks to ensure long-term sustainability of the databases. Specifically we aim to:
- Streamline data production procedures to enable a faster turn-around time for releasing the data;
- Develop and integrate new annotation tools and standards to make the rate-limiting annotation step quicker and easier, and share tasks, such as annotation, to remove redundancy in effort;
- Work closely together to improve quality-assurance procedures for protein matches;
- Coordinate the upgrade of InterProScan and other HMM-based databases to the latest HMMer version;
- Improve the InterProScan protein domain-finding software;
- Exploit new technologies for database linking and data exchange; and
- Extend the functionality of the Web interface to better meet the needs of the user community.
The planned improvements to InterProScan and the protein match procedures will improve the quality, as well as the speed of protein functional classification; streamlining the production processes will enable the databases to get new protein domains and families out to the public as soon as they become available. New technologies will facilitate easier linking between different databases, and will provide the public with access to data from different sources. They will also open the door to more complex analyses, by providing improved programmatic access to the data. In addition, these new processes and technologies will allow InterPro and its member databases to cope with the ever-increasing flood of new data and make it accessible to the public in more regular releases. Ultimately, these improvements will make InterPro and its partners easier and more efficient to maintain, paving the way to a more sustainable future and increasing their benefit and usefulness to the scientific community.

Technical Summary

InterPro is an integrated documentation resource for protein families, domains and functional sites, which unifies results from 10 major signature databases into a single resource. The integration process and domain/family annotation are done manually by biologists, ensuring high standards of data quality and consistency. The accompanying software, InterProScan, integrates the individual searching and post-processing algorithms into a single package. InterProScan data is supplemented with GO annotations using InterPro2GO mappings, making it a powerful protein functional classification tool. The data and tools are currently accessible for searching via a Web interface and for downloading from the FTP site. Although already used extensively by the scientific community, InterPro and its contributing databases have a number of internal and external limitations. Internally, they suffer from a lack of funding, which stunts the growth and further development of the databases. Externally, the core databases need to keep up with new technologies, provide links to new databases, and continually improve the interface and data accessibility for their users; currently, this is not being done. This project aims to streamline the current data-production procedures for InterPro and its member databases, improve coordination of activities to make better use of resources, and ensure that new technologies are embraced to drive the project into the future. These activities will enable the databases to provide new data to the public more rapidly, improve and speed up protein match production with InterProScan, and enhance data access through improved Web interfaces and Web services. The latter will provide much needed programmatic access to the data, which will facilitate more complex data analyses, and thus more efficient use of the wealth of scientific content held within the databases.

Publications

Finn RD (2010) The Pfam protein families database. in Nucleic acids research

Finn RD (2008) The Pfam protein families database. in Nucleic acids research

Hunter S (2009) InterPro: the integrative protein signature database. in Nucleic acids research

Mistry J (2013) The challenge of increasing Pfam coverage of the human proteome. in Database : the journal of biological databases and curation

Punta M (2012) The Pfam protein families database. in Nucleic acids research

 
Description One of the major impacts of this work has been the adoption of a new protein homology search algorithm, HMMER3. Importantly, adopting this algorithm has allowed protein family databases such as Pfam and InterPro to scale. At the time of writing the proposal, there were only a few million sequences in the underlying database; by the end of the grant, there were tens of millions, with no sign of the growth abating. Furthermore, the adoption of this algorithm has allowed the discovery of previously unknown relationships between protein families, for example a bacterial homologue of the Pleckstrin homology (PH) domain, which had previously only been identified in eukaryotic proteins.

Another major improvement has been the incorporation of the Jackhmmer program within the Pfam curation pipeline. This new tool iteratively searches sequences and has greatly enhanced both our ability to curate large families of proteins and our sequence coverage. During the past year, using Jackhmmer approaches, we have increased sequence coverage by 2.8%, which represents the largest yearly increase since 2002.

We have also exposed the data in both databases in new ways. The first, the adoption of Wikipedia, has allowed richer media to be used for the annotation of protein families, and the annotations are editable by everyone. As these are no longer tied to releases, we have found many cases where the information is more up-to-date and has more relevant citations. The second, the use of DAS and BioMarts, allows programmatic access to the Pfam/InterPro data and provides a way for power users to query across the entire database, something not previously available.
Exploitation Route The HMMER3 implementation by the Pfam database smoothed the way for other databases to follow (UK based: SUPERFAMILY and Gene3D; international: TIGRFAMs and PIRSF). This was a major step forwards in creating scalable and sustainable pipelines for delivery of protein domain and family information. The adoption of Wikipedia for protein family annotation also created a sustainable resource of great utility to others. The creation of the InterPro BioMart and multiple new DAS sources enabled improved access to protein family data.

The widespread use of this information is demonstrated by the 1.7 million searches per month achieved in 2010 and the exceptional level of citations to the publications describing Pfam and InterPro across a wide variety of biological, medical and biotechnological sectors that used the data created and made accessible via this work.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Healthcare, Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology

 
Description InterPro and Pfam are widely used resources in the research community and beyond, each receiving millions of web hits per month. Usage of InterPro's BioMart has increased year-on-year since the first release of the resource: for example, it received 6,656 visitors in 2009 and 10,276 visitors in 2010. InterProScan continues to be the most widely used tool/service at EBI, with an average of 1.4 million searches per month during the period of the grant. Since then, usage has grown to tens of millions of searches per month. Similarly, the Pfam search tool typically received 11,000 searches per month during the funding period. The services offered by Pfam and InterPro mean that there is no need for local replication of search tools, offering economic savings to the broad user base. InterPro continues to collaborate with the automatic annotation project within the UniProt consortium to improve data quality so that the automatic annotation rules generated are as sensitive and accurate as possible. Additionally, InterPro has been working closely with the HUGO Gene Nomenclature Committee (HGNC) to assist the naming of less well characterised genes using the information that InterPro predicts. Such naming ensures that consistent and appropriate names are used by everyone from informaticians to molecular biologists to clinicians. Both Pfam and InterPro are widely used in genome annotation. Many of the citations to this work are from genome projects or from tools that reuse this data for genomic analysis. Pfam was used in the annotation of the genome of the blood fluke Schistosoma mansoni, an organism that is responsible for the neglected tropical disease schistosomiasis, which affects 210 million people in 76 countries. The use of genome data has impacts in a wide variety of fields. For example, genomes of food crops and fruits can help increase yields.
Sequencing of human genome variation is now having impacts in genomic medicine and data from Pfam is being used to identify the function. Such work has direct longer term benefits to understanding the molecular function, disease mechanisms and host interactions. Such information will naturally improve human well-being, as well as supporting important areas such as food security, animal health and systems biology. The 2010 paper describing the Pfam database has now received 4109 citations, while the 2012 update paper has received 1652 citations. Similarly, the article describing the InterPro database has received 983 citations (citations according to Google Scholar). While many of these citations represent academic uses (informatics tools and research outcomes), protein family annotations are disseminated to a very broad audience, which includes academia and the commercial pharmaceutical and biotechnology sectors. Indeed, there are publications that use Pfam for drug discovery, e.g. mining the human gut microbiome for drug targets (1), performed by GlaxoSmithKline, or searching for novel antibacterial therapeutics (2). InterPro and Pfam are integrated into commercial pipelines, and while some of our knowledge of this comes from personal communication, there is published evidence of it (3).
References:
1. Collison M, Hirt RP, Wipat A, Nakjang S, Sanseau P, Brown JR. Data mining the human gut microbiota for therapeutic targets. Brief Bioinform. 2012 Nov;13(6):751-68. doi: 10.1093/bib/bbs002. Epub 2012 Mar 24. Review. PubMed PMID: 22445903.
2. Fahnoe KC, Flanagan ME, Gibson G, Shanmugasundaram V, Che Y, Tomaras AP. Non-traditional antibacterial screening approaches for the identification of novel inhibitors of the glyoxylate shunt in gram-negative pathogens. PLoS One. 2012;7(12):e51732. doi: 10.1371/journal.pone.0051732. Epub 2012 Dec 11. PubMed PMID: 23240059; PubMed Central PMCID: PMC3519852.
3. Brothwood J. 'Druggable and biopharmable genome annotation pipeline development'. MSc Thesis, Cranfield University, GlaxoSmithKline. http://www.openphacts.org/documents/publications/Brothwood_Jessica_Druggable%20and%20biopharmable%20genome%20annotation%20pipeline%20development_MSc%20Thesis_Cranfield%20University_GSK_September%202012.pdf
First Year Of Impact 2009
Sector Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Healthcare, Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology
Impact Types Societal,Economic

 
Description Biomedical Resources
Amount £1,154,000 (GBP)
Funding ID 108433/Z/15/Z 
Organisation Wellcome Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 09/2015 
End 08/2020
 
Title Pfam 
Description Protein Family database 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact The annotation of the millions of sequences that are generated by modern DNA sequencing technologies. 
URL http://pfam.xfam.org