Embracing new technologies to streamline, improve and sustain InterPro and its contributing databases

Lead Research Organisation: Wellcome Sanger Institute

Department Name: Research Directorate

Abstract

New DNA sequencing technologies have led to a flood of new data in sequence databases being submitted by individual scientists, genome sequencing projects and metagenomics projects. These sequences enter the databases with little or no annotation, limiting their usefulness to the scientific community. This has inspired the development of new tools for automatic annotation of the encoded protein sequences. One of the most successful developments in this area has been in the production of so-called protein 'signatures', diagnostic methods that are able to characterise newly-determined sequences in terms of the protein families to which they belong and/or the structural or functional domains they contain. Protein signature approaches have been adopted by a number of databases, and ten of the top such resources are integrated into the InterPro database. InterPro, and its accompanying protein analysis software tool, InterProScan, is now one of the leading protein functional classification resources in the world. However, despite its success, InterPro and its partners are currently suffering from a lack of financial support. The level of funding required to maintain and improve a database of this size is often underestimated. The amount of incoming data is increasing exponentially, and databases now struggle to provide their data to the public in a timely way, while at the same time maintaining the necessary high standards of data quality. Moreover, as they become more popular, and user demands increase, these core databases endure mounting pressure not only to keep up with the expanding volume of data and growing community requirements, but also to be early adopters of newly emerging technologies. This proposal aims to resolve these issues by embracing new technologies to enhance and further develop InterPro and its source databases. It aims to streamline production processes both to provide more regular data releases and to better cope with increased volumes of data. 
With more formalised Consortium activities and coordination thereof, we will make more efficient use of resources and share tasks to ensure long-term sustainability of the databases. Specifically, we aim to:
- Streamline data production procedures to enable a faster turn-around time for releasing the data;
- Develop and integrate new annotation tools and standards to make the rate-limiting annotation step quicker and easier, and share tasks, such as annotation, to remove redundancy in effort;
- Work closely together to improve quality-assurance procedures for protein matches;
- Coordinate the upgrade of InterProScan and other HMM-based databases to the latest HMMER version;
- Improve the InterProScan protein domain-finding software;
- Exploit new technologies for database linking and data exchange; and
- Extend the functionality of the Web interface to better meet the needs of the user community.
The planned improvements to InterProScan and the protein match procedures will improve the quality, as well as the speed, of protein functional classification; streamlining the production processes will enable the databases to get new protein domains and families out to the public as soon as they become available. New technologies will facilitate easier linking between different databases, and will provide the public with access to data from different sources. They will also open the door to more complex analyses, by providing improved programmatic access to the data. In addition, these new processes and technologies will allow InterPro and its member databases to cope with the ever-increasing flood of new data and make it accessible to the public in more regular releases. Ultimately, these improvements will make InterPro and its partners easier and more efficient to maintain, paving the way to a more sustainable future and increasing their benefit and usefulness to the scientific community.

Technical Summary

InterPro is an integrated documentation resource for protein families, domains and functional sites, which unifies results from 10 major signature databases into a single resource. The integration process and domain/family annotation are done manually by biologists, ensuring high standards of data quality and consistency. The accompanying software, InterProScan, integrates the individual searching and post-processing algorithms into a single package. InterProScan data is supplemented with GO annotations using InterPro2GO mappings, making it a powerful protein functional classification tool. The data and tools are currently accessible for searching via a Web interface and for downloading from the FTP site. Although already used extensively by the scientific community, InterPro and its contributing databases have a number of internal and external limitations. Internally, they suffer from a lack of funding, which stunts the growth and further development of the databases. Externally, the core databases need to keep up with new technologies, provide links to new databases, and continually improve the interface and data accessibility for their users; currently, this is not being done. This project aims to streamline the current data-production procedures for InterPro and its member databases, improve coordination of activities to make better use of resources, and ensure that new technologies are embraced to drive the project into the future. These activities will enable the databases to provide new data to the public more rapidly, improve and speed up protein match production with InterProScan, and enhance data access through improved Web interfaces and Web services. The latter will provide much-needed programmatic access to the data, which will facilitate more complex data analyses, and thus more efficient use of the wealth of scientific content held within the databases.
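
As a rough illustration of the integration idea described above, the following Python sketch collapses signature matches from several member databases into integrated entries and then adds GO terms through an entry-to-GO lookup. It is only a toy model under stated assumptions: the identifiers (SIG_A1, ENTRY_1, GO_TERM_X and so on), both lookup tables and the annotate function are hypothetical placeholders, not real accessions, real InterPro2GO mappings or InterProScan's actual implementation.

```python
# Hypothetical table: member-database signature -> integrated entry.
# All identifiers are placeholders invented for illustration.
SIGNATURE_TO_ENTRY = {
    "SIG_A1": "ENTRY_1",
    "SIG_B7": "ENTRY_1",  # two signatures may describe the same family
    "SIG_C3": "ENTRY_2",
}

# Hypothetical table in the spirit of an entry-to-GO mapping.
ENTRY_TO_GO = {
    "ENTRY_1": ["GO_TERM_X", "GO_TERM_Y"],
    "ENTRY_2": ["GO_TERM_Z"],
}


def annotate(matches):
    """Map each protein's signature matches to entries and GO terms.

    matches: dict of protein id -> list of matched signature ids.
    Signatures without an integrated entry are ignored.
    """
    result = {}
    for protein, signatures in matches.items():
        # Collapse redundant signatures into a unique, sorted list of entries.
        entries = sorted(
            {SIGNATURE_TO_ENTRY[s] for s in signatures if s in SIGNATURE_TO_ENTRY}
        )
        # Collect the GO terms attached to those entries, without duplicates.
        go_terms = sorted({t for e in entries for t in ENTRY_TO_GO.get(e, [])})
        result[protein] = {"entries": entries, "go_terms": go_terms}
    return result
```

Calling annotate({"P1": ["SIG_A1", "SIG_B7"]}) would report a single entry for P1, because both placeholder signatures point to the same integrated entry; this is the kind of redundancy removal that integration is meant to provide.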

Funded Value:

£307,341

Funded Period:

Jul 08 - Dec 12

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/F010435/1

Principal Investigator:

Alex Bateman

Research Subject:

Omic sciences & technologies (20%)

Tools, technologies & methods (20%)

Research Topic:

Bioinformatics (20%)

Proteomics (20%)

Organisations

Wellcome Sanger Institute (Lead Research Organisation)

People	ORCID iD
Alex Bateman (Principal Investigator)

Publications

Finn RD (2008) The Pfam protein families database. in Nucleic acids research

Finn RD (2010) The Pfam protein families database. in Nucleic acids research

Hunter S (2012) InterPro in 2011: new developments in the family and domain prediction database. in Nucleic acids research

Hunter S (2009) InterPro: the integrative protein signature database. in Nucleic acids research

Mistry J (2013) The challenge of increasing Pfam coverage of the human proteome. in Database : the journal of biological databases and curation

Mistry J (2013) Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. in Nucleic acids research

Punta M (2012) The Pfam protein families database. in Nucleic acids research

Key Findings
Impact Summary
Further Funding
Research Databases and Models


Description	One of the major impacts of this work has been the adoption of a new protein homology search algorithm, HMMER3. Adopting this algorithm has, importantly, allowed protein family databases such as Pfam and InterPro to scale. At the time of writing the proposal, there were only a few million sequences in the underlying database; by the end of the grant, there were tens of millions, with no sign of the growth abating. Furthermore, the adoption of this algorithm has allowed the discovery of previously unknown relationships between protein families, for example a bacterial homologue of the Pleckstrin homology (PH) domain, which had previously been identified only in eukaryotic proteins. Another major improvement has been the incorporation of the Jackhmmer program within the Pfam curation pipeline. This new tool searches sequences iteratively and has greatly enhanced both our ability to curate large families of proteins and our sequence coverage. During the past year, using Jackhmmer approaches, we have increased sequence coverage by 2.8%, which represents the largest yearly increase since 2002. We have also exposed the data in both databases in new ways. First, the adoption of Wikipedia has allowed richer media to be used for the annotation of protein families, and these annotations are editable by everyone. As they are no longer tied to releases, we have found many cases where the information is more up to date and has more relevant citations. Second, the use of DAS and BioMarts allows programmatic access to the Pfam/InterPro data and gives power users a way to query across the entire database, something not previously available.
Exploitation Route	The HMMER3 implementation by the Pfam database smoothed the way for other databases to follow (UK-based: SUPERFAMILY and Gene3D; international: TIGRfams and PIRSF). This was a major step forwards in creating scalable and sustainable pipelines for delivery of protein domain and family information, as was the adoption of Wikipedia for protein family annotation, which created a sustainable resource of great utility to others. The creation of the InterPro BioMart and multiple new DAS sources enabled improved access to protein family data. The widespread use of this information is demonstrated by the 1.7 million searches per month achieved in 2010 and by the exceptional level of citations to the publications describing Pfam and InterPro, which come from a wide variety of biological, medical and biotechnological sectors that used the data created and made accessible via this work.
Sectors	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology


Description	InterPro and Pfam are widely used resources in the research community and beyond, each receiving millions of web hits per month. Usage of InterPro's BioMart has increased year on year since the first release of the resource: for example, it received 6,656 visitors in 2009 and 10,276 visitors in 2010. InterProScan continues to be the most widely used tool/service at EBI, with an average of 1.4 million searches per month during the period of the grant. Since then, usage has grown to tens of millions of searches per month. Similarly, the Pfam search tool typically received 11,000 searches per month (during the funding period). The services offered by Pfam and InterPro mean that there is no need for local replication of search tools, offering economic savings to the broad user base. InterPro continues to collaborate with the automatic annotation project within the UniProt consortium to improve data quality so that the automatic annotation rules generated are as sensitive and accurate as possible. Additionally, InterPro has been working closely with the HUGO Gene Nomenclature Committee (HGNC) to assist the naming of less well characterised genes using the information that InterPro predicts. Such naming ensures that consistent and appropriate names are used by everyone from informaticians to molecular biologists to clinicians. Both Pfam and InterPro are widely used in genome annotation. Many of the citations to this work are from genome projects or from tools that reuse this data for genomic analysis. Pfam was used in the annotation of the genome of the blood fluke Schistosoma mansoni, an organism that is responsible for the neglected tropical disease schistosomiasis, which affects 210 million people in 76 countries. The use of genome data has impacts in a wide variety of fields. For example, genomes of food crops and fruits can help increase yields. 
Sequencing of human genome variation is now having impacts in genomic medicine and data from Pfam is being used to identify the function. Such work has direct longer-term benefits for understanding molecular function, disease mechanisms and host interactions. Such information will naturally improve human well-being, as well as supporting important areas such as food security, animal health and systems biology. The 2010 paper describing the Pfam database has now received 4109 citations, while the 2012 update paper has received 1652 citations. Similarly, the article describing the InterPro database has received 983 citations (citations according to Google Scholar). While many of these citations represent academic uses (informatics tools and research outcomes), protein family annotations are disseminated to a very broad audience, which includes academia and the commercial pharmaceutical and biotechnology sectors. Indeed, there are publications that use Pfam for drug discovery, e.g. mining the human gut microbiome for drug targets (1), performed by GlaxoSmithKline, or searching for novel antibacterial therapeutics (2). InterPro and Pfam are also integrated into commercial pipelines; while some of our knowledge of this comes from personal communication, there is also published evidence (3).
References:
1. Collison M, Hirt RP, Wipat A, Nakjang S, Sanseau P, Brown JR. Data mining the human gut microbiota for therapeutic targets. Brief Bioinform. 2012 Nov;13(6):751-68. doi: 10.1093/bib/bbs002. Epub 2012 Mar 24. Review. PubMed PMID: 22445903.
2. Fahnoe KC, Flanagan ME, Gibson G, Shanmugasundaram V, Che Y, Tomaras AP. Non-traditional antibacterial screening approaches for the identification of novel inhibitors of the glyoxylate shunt in gram-negative pathogens. PLoS One. 2012;7(12):e51732. doi: 10.1371/journal.pone.0051732. Epub 2012 Dec 11. PubMed PMID: 23240059; PubMed Central PMCID: PMC3519852.
3. Brothwood J. Druggable and biopharmable genome annotation pipeline development. MSc Thesis, Cranfield University, GlaxoSmithKline. http://www.openphacts.org/documents/publications/Brothwood_Jessica_Druggable%20and%20biopharmable%20genome%20annotation%20pipeline%20development_MSc%20Thesis_Cranfield%20University_GSK_September%202012.pdf
First Year Of Impact	2009
Sector	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
Impact Types	Societal,Economic


Description	Biomedical Resources
Amount	£1,154,000 (GBP)
Funding ID	108433/Z/15/Z
Organisation	Wellcome Trust
Sector	Charity/Non Profit
Country	United Kingdom
Start	09/2015
End	08/2020


Title	Pfam
Description	Protein Family database
Type Of Material	Database/Collection of data
Provided To Others?	Yes
Impact	The annotation of the millions of sequences that are generated by modern DNA sequencing technologies.
URL	http://pfam.xfam.org
