14 NSFBIO: Towards detailed and consistent function prediction from protein family databases

Lead Research Organisation: European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

Significance.

Thanks to continuing developments in DNA sequencing technology, we now know the exact genetic makeup ("genome") of thousands of different organisms, encoding millions of different proteins. But simply knowing the chemical specification (the "sequence") of these proteins is only a first step; the ultimate goal is to discover how genes and proteins function to support the diversity of life, and also how some of them can be used for commercial and biotechnology applications. This research project will expand the capability of scientists and their students to advance their analyses from sequences to functions, by bringing together multiple state-of-the-art approaches. Each of these approaches draws on both computational expertise (necessary to address a problem of this magnitude) and broad biological expertise.

Approach.

The general approach in this project is to classify proteins into families of related proteins, and, wherever possible, describe how each family relates to function. The ultimate goal is to assign the same specific function to all of the proteins in a family, or to subsets of the family if more than one function is represented within the family. These relations may be very complex, and scientific accuracy will require application of multiple, diverse methods. In order to accomplish this aim, the project will expand InterPro, a widely used resource that already contains (though with limited data integration mechanisms) eleven different databases, three of which are involved in this project: PANTHER, Pfam and TIGRFAM. A fourth classification resource, the Structure-Function Linkage Database (SFLD), will also be incorporated into InterPro. These four databases use complementary methodologies to represent and describe protein relationships, which will be integrated to address the problem of protein function classification with unprecedented accuracy, precision and ease-of-use. As proteins do not generally work in isolation, additional structured annotations relating to pathways and complexes will be added to sets of families, to define functional characteristics present in a genome. The products of this work will be used to enhance sequence analysis tools used by the scientific community, as well as to provide enhanced educational materials, and will be broadly accessible over the web at http://ebi.ac.uk/interpro.

Technical Summary

Scientists desperately need effective methods to better decode, organize, and more fully exploit still rapidly increasing sequencing data. Classification of proteins into hierarchical families that deliver meaningful functional assignments offers one primary solution. InterPro, one of the most widely used resources for protein family annotation, represents 11 different databases, including Pfam, TIGRFAM and PANTHER, combined into a single resource providing value-added annotations. A fourth classification resource, the Structure-Function Linkage Database (SFLD), will be incorporated into InterPro as part of the work proposed here. These databases approach the problem of functional annotation using complementary methodologies, which we propose to combine to address this problem with unprecedented accuracy, precision and ease-of-use.

Planned Impact

N/A for this submission
 
Title Bacterial Genomes: From DNA to Protein Function Using Bioinformatics - Interview 
Description This video is an interview as part of the Wellcome Genome Campus Advanced Courses and Scientific Conference course titled "Bacterial Genomes: From DNA to Protein Function Using Bioinformatics". It discusses how protein signatures and profile HMMs are used to functionally classify proteins. 
Type Of Art Film/Video/Animation 
Year Produced 2018 
Impact This video is provided as part of an available online course, and additionally provides a high level overview of the project, useful for a general audience. 
URL https://www.futurelearn.com/courses/bacterial-genomes-bioinformatics/0/steps/47029
 
Title HMMER webinar 
Description This video is a recording of a webinar entitled HMMER: Fast and sensitive sequence similarity searches 
Type Of Art Film/Video/Animation 
Year Produced 2018 
Impact 382 views on YouTube 
URL https://www.ebi.ac.uk/training/online/course/hmmer-fast-and-sensitive-sequence-similarity-searches
 
Title PANTHER case study webinar 
Description This video is a webinar-style recording which gives an introduction to the PANTHER resource and the way they generate their protein signatures. 
Type Of Art Film/Video/Animation 
Year Produced 2017 
Impact It is currently available on the EBI training YouTube channel. It is planned to be used within an online training course detailing different member database approaches to protein family model building. 
URL https://www.youtube.com/watch?v=LB1SjvUt2fc
 
Title SFLD case study webinar 
Description This video is a webinar-style recording which gives an introduction to the SFLD resource and the way they generate their protein signatures. 
Type Of Art Film/Video/Animation 
Year Produced 2017 
Impact It is currently available on the EBI training YouTube channel. It is planned to be used within an online training course detailing different member database approaches to protein family model building. 
URL https://www.youtube.com/watch?v=Kb7q3X1-BRE&feature=youtu.be
 
Title TIGRFAM case study webinar 
Description This video is a webinar-style recording which gives an introduction to the TIGRFAM resource and the way they generate their protein signatures. 
Type Of Art Film/Video/Animation 
Year Produced 2017 
Impact It is currently available on the EBI training YouTube channel. It is planned to be used within an online training course detailing different member database approaches to protein family model building. 
URL https://www.youtube.com/watch?v=642lC_9tBLo
 
Description We have now established new approaches for performing large-scale comparison between protein family sets. These are being employed to accelerate the process of manual curation, flagging where integration of new signatures appears to be relatively simple or, more importantly, where there appear to be important differences that need inspection. This has accelerated the rate of integrations into InterPro and is leading to more accurate classification. Through this work, we have added the SFLD database to InterPro, which includes more fine-grained per-residue annotations. Understanding the key residues that are important for catalysis allows scientists to study the evolution of these enzymes, as well as how best to optimise the enzymes or modify them to work on different substrates. We have integrated hundreds of PANTHER entries, focusing on the family level of their classification. Collectively, this collaborative work has allowed both InterPro and the member databases involved in this project (PANTHER, SFLD and TIGRFAM) to develop deeper insight into the principles behind the resources and how best to work together. We have also extended Genome Properties to work with any InterPro entry, allowing modelling of a broader range of pathways and systems. In particular, we have now expanded the resource to include complexes and to also cover eukaryotic genomes, which opens an important new direction for the resource. As part of this work, we have developed a new website for Genome Properties that allows the phylogenetic profiling of functions. This in turn has allowed a more coarse-grained analysis of known, as well as novel, proteomes.
Exploitation Route Phylogenetic profiling as performed by the new Genome Properties resource permits users to upload new proteomes and allows them to explore the functional pathways and systems that they encode. This has clear applicability to the analysis of metagenome-assembled genomes. The broader annotations in InterPro will expand the annotations of protein sequences, allowing molecular biologists to understand the evolution of protein families better, and will additionally allow functional screening to be focussed on those that have an as yet undetermined function. We have published online training materials, which can be accessed by a broad audience. This should help users of InterPro, Pfam and Genome Properties to gain a greater understanding of the individual resources and member databases and thus a deeper understanding of the annotations provided.
Sectors Aerospace, Defence and Marine,Agriculture, Food and Drink,Chemicals,Energy,Environment,Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

 
Description Biomedical Resources
Amount £1,154,000 (GBP)
Funding ID 108433/Z/15/Z 
Organisation Wellcome Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 09/2015 
End 08/2020
 
Title Jaccard analysis for comparison of protein signatures 
Description We have developed a method whereby it is possible to compare two protein signatures by comparing the sets of proteins that each signature matches. By calculating the Jaccard index (intersection/union) and defining a cut-off value for that index, it is possible to automatically estimate whether two protein signatures should be included in one InterPro entry (i.e. they describe the same protein family or domain). Further, by calculating the Jaccard containment index it is possible to automatically identify signatures which should exist in hierarchies together (i.e. parent and child relationships). 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact The development of this method has provided InterPro curators with an automatic prediction of relationships between InterPro member database signatures. This greatly assists in the manual curation of signatures for integration into InterPro. 
URL https://github.com/rdfinn/Jaccard
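The comparison described in this record can be illustrated with a minimal Python sketch. It is not the code from the repository above: the function names, the example protein identifiers and both cut-off values (0.75) are assumptions chosen purely for illustration. Each signature is represented by the set of proteins it matches.

```python
def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def containment_index(a: set, b: set) -> float:
    """Fraction of the proteins matched by `a` that are also matched by `b`."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def suggest_relationship(a: set, b: set,
                         same_cutoff: float = 0.75,
                         child_cutoff: float = 0.75) -> str:
    """Suggest how two signatures relate; both cut-offs are hypothetical."""
    if jaccard_index(a, b) >= same_cutoff:
        return "same_entry"      # candidates for a single InterPro entry
    if containment_index(a, b) >= child_cutoff:
        return "a_child_of_b"    # a's matches sit largely inside b's
    if containment_index(b, a) >= child_cutoff:
        return "b_child_of_a"
    return "unrelated"


# Hypothetical match sets for two signatures
narrow = {"P1", "P2"}
broad = {"P1", "P2", "P3", "P4"}
print(suggest_relationship(narrow, broad))
```

In this sketch the containment index is asymmetric by design: a narrow signature can be fully contained in a broad one while the reverse containment stays low, which is what makes it usable for proposing parent and child relationships. Any such suggestion would still be reviewed by a curator.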
 
Title Genome Properties 
Description Genome Properties (GP) is an annotation system whereby functional attributes can be assigned to a genome, based on the presence of a defined set of protein family markers within that genome. For example, a species can be proposed to synthesise proline if it can be shown that the genome for that species encodes all the proteins required to carry out the various biochemical steps in the proline biosynthesis pathway. The resource was developed by and previously hosted by the TIGRFAM group. We have re-engineered the database, integrated the resource into InterPro by assigning InterPro identifiers to the steps, provided a reference set of genomes for comparative analysis, and developed a user-formattable viewer for the GP results of this reference set and a user's own proteome. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Integrating the GP resource into InterPro greatly increases its visibility and usability. Not only is it now possible to compare GP results across a set of species, it is also possible to upload a novel genome/proteome and compare against a reference set of results, allowing phylogenetic analysis on the novel genome. It is possible to use the GP data in functional analysis of metagenomics datasets, as well as in quality control of proposed marker genomes. 
URL https://www.ebi.ac.uk/interpro/genomeproperties/#home
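The idea of asserting a property from the presence of marker families can be sketched in Python as follows. This is a simplified illustration and not the Genome Properties code: the `Step` class, the accession placeholders (`IPR_A` and so on), the step names and the three result labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step of a property; any listed accession is taken as evidence."""
    name: str
    evidence: frozenset      # placeholder InterPro accessions (hypothetical)
    required: bool = True


def evaluate_property(steps, genome_matches: set) -> str:
    """Return 'YES', 'PARTIAL' or 'NO' from the required steps that are found.

    Assumes the property defines at least one required step.
    """
    required = [s for s in steps if s.required]
    found = [s for s in required if s.evidence & genome_matches]
    if len(found) == len(required):
        return "YES"
    if found:
        return "PARTIAL"
    return "NO"


# A hypothetical three-step pathway; the last step is optional
pathway = [
    Step("step 1", frozenset({"IPR_A"})),
    Step("step 2", frozenset({"IPR_B", "IPR_C"})),
    Step("step 3", frozenset({"IPR_D"}), required=False),
]

# Accessions matched by the proteins of one hypothetical genome
print(evaluate_property(pathway, {"IPR_A", "IPR_C"}))
```

Running the same evaluation over many genomes and many properties would give a genome-by-property table, which is the kind of data a comparative or phylogenetic profile is built from.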
 
Title InterPro 
Description InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. We combine protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact All of the annotations provided by InterPro underpin the automatic annotation pipeline within the UniProt database. InterPro provides GO annotations for tens of millions of sequences in UniProt through the InterPro2GO pipeline. InterPro is the most widely used web service at EMBL-EBI, performing ~15,000,000 searches per month, from around the world. 
URL http://www.ebi.ac.uk/interpro/
 
Title Pfam 
Description The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. Pfam also generates higher-level groupings of related entries, known as clans. A clan is a collection of Pfam entries which are related by similarity of sequence, structure or profile-HMM. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact Pfam is widely used within the research community. 
URL http://pfam.xfam.org
 
Description PANTHER Database 
Organisation University of Southern California
Department Keck School of Medicine
Country United States 
Sector Academic/University 
PI Contribution Integration of PANTHER HMMs into the InterPro resource. Inclusion of PANTHER software within InterProScan to permit monthly calculation of protein matches to UniProt. This in turn allows the automatic annotation of protein sequences, which is an integral component of UniProt.
Collaborator Contribution Supply of protein family HMMs and post-processing software for InterPro integration. Provision of reference trees for use in comparison of protein classifications in InterPro between PANTHER, SFLD and TIGRFAM. Supplier of
Impact Harmonization of protein family definitions. Use of PANTHER reference trees as a scaffold for comparing classifications from disparate databases.
 
Description SFLD added to InterPro Consortium 
Organisation University of California, San Francisco
Department Department of Bioengineering and Therapeutic Sciences
Country United States 
Sector Academic/University 
PI Contribution We have helped SFLD move to a more formal database design, and provided them with software tools and advice to enable the systematic transfer of annotations. We have begun to integrate SFLD into InterPro. This process has functioned as a QC on the SFLD data, and we have fed back any issues we identified. We have focused on the integration of the SFLD subset of gold-standard entries for comparison with TIGRFAM and PANTHER.
Collaborator Contribution SFLD provide the underlying knowledge and data to InterPro, which take the form of multiple sequence alignments, functional annotations and structured ontologies. SFLD provided a subset list of gold-standard families for use in the comparison between PANTHER and TIGRFAM, towards a harmonization of protein family names and functional annotations in InterPro.
Impact The SFLD resource is in the process of being added to InterPro, where it will provide fine grained protein annotations associated with enzymes with chemical reactions.
Start Year 2015
 
Description TIGRFAM database 
Organisation J Craig Venter Institute
Country United States 
Sector Charity/Non Profit 
PI Contribution Integration of TIGRFAM HMMs to InterPro. Generation of editable DESCfile format files for Genome Properties, and subsequent curation of the DESCfiles. Production of visualisation system for Genome Properties.
Collaborator Contribution Provision of TIGRFAM HMMs to InterPro. Provision of Genome Properties flat file data for inclusion in InterPro.
Impact TIGRFAM HMMs have been integrated into InterPro and are included in the effort to harmonize protein names and functions between disparate resources. A database of Genome Properties has been established at InterPro. The properties are stored in an editable DESCfile format (generated from the flat file data provided) and are currently being curated for presentation within InterPro.
 
Title Genome Properties code 
Description Allows the user to utilise the Genome Properties data and assertions in their own analysis. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact It is possible to carry out stand alone analysis of (for example) metagenomics data using Genome Properties data. 
URL https://github.com/ebi-pf-team/genome-properties/tree/master/code
 
Title InterProScan5 
Description Allows the user to compare either a DNA or protein sequence against the collection of InterPro member databases, and to assign InterPro annotations and associated GO terms. 
Type Of Technology Software 
Open Source License? Yes  
Impact This software is widely downloaded and used (assessed through citations, distributed annotations and helpdesk interactions). This tool is widely used in other analysis pipelines, such as genomics and metagenomics analysis. It is updated with every release (bi-monthly) of InterPro to include both data updates and software updates. The software updates comprise scientific developments required by changes in member database post-processing, and general software maintenance. 
URL https://www.ebi.ac.uk/interpro/interproscan.html
 
Description "A case study of 3 protein family building methodologies" e-learning course 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Created a course within EMBL-EBI train-online resource to highlight similarities and differences in protein family building methodologies within InterPro.
Year(s) Of Engagement Activity 2018
URL https://www.ebi.ac.uk/training/online/course/interpro-case-study-3-protein-family-building-methodolo...
 
Description CABANA Workshop talk and training module on Genome Properties 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation and hands on practical session focused on Genome Properties were delivered during the 5 day "CABANA workshop: Introduction to Metagenomics" held at the Faculty of Natural Sciences - University of Buenos Aires (FCEN-UBA), Argentina.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/training/events/2019/cabana-workshop-introduction-metagenomics
 
Description EMBL-EBI workshop at National Veterinary Research Institute, Poland 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The National Veterinary Research Institute, Poland requested a training workshop covering a selected set of EBI resources including InterPro. The delegates were interested in large-scale protein and metagenomics analysis, so the topics covered (in the form of presentations and hands-on training exercises) included InterPro as well as Genome Properties. Delegates reported an enthusiasm to utilise the resources covered.
Year(s) Of Engagement Activity 2017
URL https://www.ebi.ac.uk/training/events/2017/embl-ebi-resources-and-tools-genomics-and-proteomics
 
Description Genome Properties Quick Tour - e-learning 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Created a user-guide within EMBL-EBI train-online resource to briefly describe the content and function of the Genome Properties resource.
Year(s) Of Engagement Activity 2018
URL https://www.ebi.ac.uk/training/online/course/genome-properties-quick-tour
 
Description Genome Properties Tutorial - e-learning 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Created a user-guide within EMBL-EBI train-online resource to describe the content and function of the Genome Properties resource, and to describe how users could analyse their own data using Genome Properties.
Year(s) Of Engagement Activity 2018
URL https://www.ebi.ac.uk/training/online/course/genome-properties-tutorial
 
Description Genome Properties talk at Biocuration 2017 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Selected to present a talk at Biocuration 2017 conference on our work to integrate Genome Properties into InterPro.
Year(s) Of Engagement Activity 2017
 
Description HESI-PATB Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presented a seminar on using InterPro/Pfam/HMMER to study potentially allergenic proteins. Talk title: Protein sequence analysis tools and resources to detect potential allergens.
Year(s) Of Engagement Activity 2019
 
Description InterPro and HMMER sessions at Bioinformatics Resources for Protein Biology course at EBI 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presentation and hands-on training sessions covering both InterPro and HMMER resources, as part of a 3-day EBI course on Bioinformatics Resources for Protein Biology.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/training/events/2019/bioinformatics-resources-protein-biology-3
 
Description InterPro and HMMER sessions within Structural Bioinformatics course at EMBL-EBI. 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presented InterPro and HMMER in the context of talks and hands-on training sessions within the week-long EBI course Structural Bioinformatics.
Year(s) Of Engagement Activity 2018
URL https://www.ebi.ac.uk/training/events/2018/structural-bioinformatics-2
 
Description InterPro session at Bioinformatics Resources for Protein Biology course at EBI 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation and hands-on training session on InterPro and Genome Properties within an EBI organised course covering Bioinformatics Resources for Protein Biology.
Year(s) Of Engagement Activity 2018
URL https://www.ebi.ac.uk/training/events/2018/bioinformatics-resources-protein-biology-2
 
Description InterPro session in Exploring Biological Sequences course at EBI 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presentation and hands-on training covering InterPro as part of the EBI Exploring Biological Sequences course.
Year(s) Of Engagement Activity 2017
URL https://www.ebi.ac.uk/training/events/2017/exploring-biological-sequences
 
Description InterPro session within University of Malta Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Participated in an invited workshop hosted by University of Malta. InterPro session consisted of a presentation and hands-on practical session.
Year(s) Of Engagement Activity 2018
URL https://www.ebi.ac.uk/training/events/2018/embl-ebi-workshop-data-and-tools-transcriptomics-and-prot...
 
Description InterPro, Pfam and HMMER sessions within Protein Structure Analysis course at University of Cambridge 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Presented InterPro, Pfam and HMMER resources as talks and hands-on training sessions, within a University of Cambridge course titled Protein Structure Analysis.
Year(s) Of Engagement Activity 2018
URL https://www.training.cam.ac.uk/event/2419336
 
Description Introduction to InterPro at University of Cambridge 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Presentation and hands-on training of a half-day module covering InterPro and Genome Properties as part of the University of Cambridge training provision.
Year(s) Of Engagement Activity 2018
URL https://www.training.cam.ac.uk/event/2239008
 
Description Peruvian Bioinformatics Symposium on Repeat Proteins talk titled "Genome Properties: Using InterPro to predict coordinated functions in genomes" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A talk on Genome Properties and their application to the analysis of metagenome assembled genomes (MAGs) was presented during the Peruvian Bioinformatics Symposium on Repeat Proteins held at the Pontificia Universidad Católica del Perú, Peru.
Year(s) Of Engagement Activity 2019
URL http://simposio.pucp.edu.pe/refract-latam/
 
Description Pfam: Creating protein families - e-learning 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Created a course within EMBL-EBI train-online resource to describe the process of family building within Pfam.
Year(s) Of Engagement Activity 2018
URL https://www.ebi.ac.uk/training/online/course/pfam-database-creating-protein-families
 
Description Presentation at Institute of Organic Chemistry and Biochemistry of the CAS 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Presented a seminar on the challenges of classifying protein space.
Year(s) Of Engagement Activity 2018
 
Description Presentation to EBI Industry Programme quarterly meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Presentation of work on Genome Properties resource.
Year(s) Of Engagement Activity 2018