Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

Proteins are biological macromolecules that perform a diverse array of crucial functions, from enzymes (e.g. the entities responsible for fermentation) to transporters (e.g. hemoglobin in the blood) to mechanical structures (e.g. actin and myosin in muscle). Proteins are synthesized as linear polymers of building blocks called amino acids. They usually fold into complex three-dimensional (3D) structures, and typically interact with other proteins and molecules to perform their function. Knowledge of protein sequences can facilitate insights into hitherto undiscovered enzymes with potential applications in the biotechnology sector, or novel drugs of interest to the pharmaceutical industry. Detailed understanding of the functional architecture of proteins, including the arrangement of amino acids in a 3D structure, enables scientists to diagnose diseases as well as design more effective enzymes.

These days, our ability to generate new protein sequences based on modern high-throughput DNA sequencing (HTS) techniques far outstrips our ability to functionally characterise them. Thus, most sequences are computationally annotated, by identifying similarities between new sequences and the few experimentally characterised examples, using these to infer function (i.e. annotate). More recently, HTS has been applied directly to environmental samples to discover previously uncultured bacteria and single cell eukaryotes, and to enable the reconstruction of large and complex genomes, such as those of plants. Such approaches are correcting many of the historical biases in the protein sequence databases. However, for humankind to understand and utilise these data, sequences need to be functionally annotated, which is best accomplished using the information gleaned from sets of related sequences (known as protein families).

InterPro is a world leading protein family resource that merges information from 13 different specialist databases to present the user with comprehensive functional analysis of sequences. One of its member databases, Pfam, is a collection of protein domain families containing functional annotations. Both InterPro and Pfam are well-established primary resources in the field of protein research. In this application, we propose crucial developments to both of these resources in order to augment their utility, functionality and scalability, as well as uniquely position them to tackle imminent advances in the field. We will leverage pre-established links with other protein databases and concurrently build additional pipelines to develop and exchange the latest information between these existing and new resources.

We will improve coverage of protein sequences originating from environmental sources by building families for novel sets (or clusters) of related proteins. Considering the fundamental association between protein structure and function, we will develop a pipeline that will not only import structural models for Pfam entries and present them via the website, but will also ensure that the models remain up to date. To increase coverage and functional annotations in both resources, we will integrate new resources to provide sub-domain classifications, and improve annotations through combined literature searches and enhanced curation tools. To refine annotations, we will integrate a new algorithm called TreeGrafter into InterProScan (our software package that performs automatic annotations of protein sequences), and integrate controlled vocabularies for protein attributes from databases like PANTHER with those already in InterPro. We will evaluate the performance of an upgraded version of the HMMER software that is widely used to build protein families, including Pfam, to improve future scalability. Finally, we will focus on eight genomes of agricultural importance, including chicken, salmon, and wheat, by systematically annotating 2000 associated entries in Pfam and by extension, InterPro.

Technical Summary

InterPro and Pfam are preeminent complementary resources in the field of protein research. InterPro draws its information from a compendium of 13 expert member databases, including Pfam, enabling classification of protein sequences into families and prediction of functional domains and sites. Pfam generates protein families, with each curated entry represented by an alignment and profile hidden Markov model (HMM).
In light of the sheer volume of novel protein sequences being constantly discovered, especially through metagenomics, this proposal devises key developments to further improve functionality and scalability of these resources. We will enhance coverage of environmentally derived sequences (MGnify database, Tara Oceans and MMETSP projects) by generating families for the largest novel sequence clusters. We will incorporate de novo structural models and produce deep sequence alignments (using metagenomics sequences) necessary for the detection of co-evolutionary residues, which in turn will be used for structural modelling. The websites will facilitate visualization of these structural models and display co-variance contact sites. We will use a combination of known structures and models to classify additional Pfam entries into clans, as well as review domain boundaries. To increase InterPro coverage and functional annotations, we will integrate new resources (CATH FunFams) to provide sub-domain classifications, improve annotations (especially domains of unknown function) and maximise member database integrations. To enable scaling and refine annotations, we will adopt a new algorithm (TreeGrafter) in InterProScan, harmonise PANTHER and FunFams-based Gene Ontology terms within InterPro, and evaluate performance of an upgraded version of the HMMER software. Finally, we will focus annotation efforts on eight genomes of agricultural importance, including chicken, salmon, and wheat, generating 1000s of Pfam and InterPro entries.

Planned Impact

The field of protein research has witnessed an explosion in novel protein sequences due to advances in sequencing technologies. However, these sequences are meaningless without functional annotation. This proposal focuses on the world leading protein databases, InterPro and Pfam, which are routinely used for protein annotation. Due to their extensive use by researchers worldwide, this application will impact most BBSRC strategic priorities - especially agriculture and food security, industrial biotechnology, and bioscience for health. To maximise the impact of these resources, we propose to exploit multiple computational approaches to (i) improve annotation of metagenomics datasets and eukaryotic marine microbes; (ii) provide co-evolutionary structural models for Pfam entries using deep alignments to build additional models and permit their visualization; (iii) integrate and improve annotations from current and new InterPro databases, such as PANTHER, CDD, and CATH FunFams; (iv) improve scaling and refine annotations by adopting new algorithms and software, like TreeGrafter and HMMER4, and reconcile Gene Ontology terms across databases; (v) systematically annotate eight genomes of agricultural importance. These developments will ensure users in the UK and the world over can derive the maximum benefit from these resources while further cementing their position as exceptional databases of immense importance to the scientific community at large.

Developing new pipelines to build new entries for proteins derived from metagenomics provides a unique exploitable opportunity for InterPro and Pfam. The fact that these resources will extend coverage of marine eukaryotic microbes will have significant, far reaching impacts on other fields and analytical disciplines. This is especially true for the UK Darwin Tree of Life project, which forms part of a global initiative to sequence all eukaryotic species, aiming to revolutionize our understanding of biology, evolution and biodiversity. However, this will only be realised through detailed and accurate functional annotation, such as that provided by InterPro and Pfam.

The agricultural sector represents another area of considerable impact. Providing comprehensive functional annotations for proteins from widely farmed animal and plant species in the UK and worldwide will facilitate insights into the molecular basis of biological features including yield characteristics, capacity to resist disease and tolerance to the vagaries of nature. This will lead to socioeconomic benefits, through maximising land utilisation for growing crops such as wheat and sugar beet (the latter providing nearly 30% of the world's annual sugar production and forming an important source for bioethanol and animal feed), or enhancing the global aquaculture market, projected to reach $20 billion by 2022, where salmon is a substantial component.

Furthermore, the project outputs will be of exceptional value to the commercial sector, eventually benefiting the public. Improved annotations of proteins originating from microbes will lead to new discoveries, such as novel antibiotics for humans and livestock, higher agricultural yields from the understanding of ecological interplay (e.g. food chain microbes), expanded discovery of novel enzymes (e.g. psychrophilic enzymes for detergents) or those with novel catalytic functionality.

We will ensure impact on all academic and industrial audiences by the publication of software, data, and peer reviewed articles. To ensure that resource developments are disseminated as widely as possible, we will deliver onsite training, webinars, participate in community workshops and produce online training materials. We will leverage our professional networks and collaborations, conference platforms and social media channels to further publicise key developments. The public sector will also be engaged, via specific events and the publication of non-specialist articles and interviews.

Publications

Cantelli G (2022) The European Bioinformatics Institute (EMBL-EBI) in 2021. in Nucleic acids research

Mistry J (2021) Pfam: The protein families database in 2021. in Nucleic acids research

 
Title Automatic pipeline to generate potential Pfam profile-HMMs for clusters from MGnify protein sequence set and UniProtKB 
Description The pipeline performs a co-clustering of the MGnify protein sequence set and UniProtKB and generates candidate profile-HMMs for potential inclusion in Pfam. It uses mmseqs to carry out the clustering of MGnify and UniProt which generated a set of 434,651,340 clusters. We kept clusters with at least 1 UniProt and 1 MGnify sequence and generated 10,000 clusters of automatically generated candidate Pfam families that were put forward for curation. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact 382 new families were included in Pfam following the first iteration. 
URL https://github.com/ProteinsWebTeam/mgnify-clustering
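
The cluster selection rule described above (keep only clusters containing at least 1 UniProt and 1 MGnify sequence) can be sketched as follows. This is a minimal illustration, not the code from the repository: it assumes the clustering result is available as (representative, member) pairs, and that MGnify protein accessions can be recognised by an "MGYP" prefix while everything else is treated as UniProt. The function names and example accessions are hypothetical.

```python
from collections import defaultdict


def group_clusters(rows):
    """Group (representative, member) pairs into {representative: [members]}."""
    clusters = defaultdict(list)
    for representative, member in rows:
        clusters[representative].append(member)
    return dict(clusters)


def is_mgnify(accession):
    # Assumption: MGnify protein accessions start with "MGYP";
    # any other accession is treated as a UniProt accession.
    return accession.startswith("MGYP")


def mixed_clusters(clusters):
    """Keep clusters with at least one MGnify and at least one UniProt member."""
    kept = {}
    for representative, members in clusters.items():
        has_mgnify = any(is_mgnify(m) for m in members)
        has_uniprot = any(not is_mgnify(m) for m in members)
        if has_mgnify and has_uniprot:
            kept[representative] = members
    return kept


# Hypothetical example rows (accessions are made up for illustration).
rows = [
    ("MGYP000000000001", "MGYP000000000001"),
    ("MGYP000000000001", "P12345"),
    ("MGYP000000000002", "MGYP000000000002"),
    ("MGYP000000000002", "MGYP000000000003"),
    ("Q9XYZ1", "Q9XYZ1"),
]
selected = mixed_clusters(group_clusters(rows))
```

In this toy input only the first cluster mixes both sources, so it is the only one retained as a candidate for family building.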
 
Title Automatic pipeline to generate potential Pfam profile-HMMs for clusters from Marine Eukaryotic microbiomes protein sequence sets found in UniProtKB 
Description The pipeline clusters UniProt sequences from the marine eukaryotic microbiome from the MMETSP project. This pipeline uses mmseqs to carry out the clustering; it generated a set of 620,056 clusters. We kept clusters with at least 2 UniProt sequences and generated 10,000 clusters of automatically generated candidate Pfam families that were put forward for curation. 
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? Yes  
Impact 16 new families were included in Pfam following the first iteration and will be made available publicly in Pfam release 36.0. 
 
Title DeDuF Pfam entries 
Description Improving Pfam annotations and coverage through the identification of functions for Domains of Unknown Function. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact 44 Pfam families with previously unknown function have been re-annotated and assigned a function. The updated annotation will soon be made available to the public in Pfam release 34.0. 
 
Title DeDuF Pfam entries 
Description Improving Pfam annotations and coverage through the identification of functions for Domains of Unknown Function. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact 1405 Pfam families with previously unknown function have been re-annotated and assigned a function. The updated annotations are available in Pfam release 35.0. 
 
Title Import of co-evolutionary models for Pfam entries with no known structure in Pfam and InterPro 
Description We have provided the Baker group with deep alignments based on UniProtKB sequence alignments for Pfam families with no PDB structure (approx. 6,500 families). They have calculated models for all of these, along with lDDT scores to assess the reliability of each model. The vast majority of models give the correct fold, with the vast majority having an lDDT score higher than 0.6 (considered reasonable models) and some having an lDDT score higher than 0.8 (considered great models). We have made the models and their contact maps available through the InterPro website for the Pfam families with no structure under the "Structure models" tab. In this tab the contact map between the residues is available for the Pfam SEED alignment. We also display the 3D structure of the model, where the contacts between residues can be highlighted by hovering over the contacts in the alignment. The method used to predict the models and a description of the information available in InterPro pages have been included in the InterPro documentation. 
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? Yes  
Impact Providing structural models for Pfam families with no PDB structure allows a better understanding of the three-dimensional (3D) arrangement of amino acids, which can provide key insights into protein function, and allow very distant homologues to be identified. 
URL https://www.ebi.ac.uk/interpro/entry/pfam/PF01050/model/
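
The lDDT cut-offs quoted in the description above (higher than 0.6 for reasonable models, higher than 0.8 for great models) can be expressed as a small classifier. This is a sketch under stated assumptions: the "below threshold" label, the function names and the example scores are hypothetical and are not taken from the InterPro or Baker group pipelines.

```python
from collections import Counter


def classify_lddt(score):
    """Map an lDDT score in [0, 1] to the quality bands quoted in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"lDDT must be in [0, 1], got {score}")
    if score > 0.8:
        return "great"
    if score > 0.6:
        return "reasonable"
    # Hypothetical label: the text does not name the band at or below 0.6.
    return "below threshold"


def summarise(scores):
    """Count how many models fall into each quality band."""
    return Counter(classify_lddt(s) for s in scores.values())


# Hypothetical family identifiers and scores, for illustration only.
example_scores = {"family_a": 0.85, "family_b": 0.72, "family_c": 0.61, "family_d": 0.40}
summary = summarise(example_scores)
```

Both comparisons are strict, following the "higher than" wording, so a score of exactly 0.8 falls in the reasonable band and a score of exactly 0.6 falls below the threshold.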
 
Title Increased annotation of eight key agricultural genomes 
Description We have developed a pipeline to generate lists of proteins that are not covered by integrated entries in InterPro in each of the targeted agriculturally relevant genomes that have associated member database signatures. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? Yes  
Impact Newly integrated proteins in InterPro for each key organism between September 2019 and September 2020: Wheat: 154, Maize: 385, Chicken: 104, Cow: 498, Salmon: 83, Pig: 18239, Sugar beet: 8, Miscanthus: 20. Newly integrated proteins in Pfam entries for each key organism between September 2019 and September 2020: Wheat: 237, Maize: 302, Chicken: 203, Cow: 247, Salmon: 519, Pig: 253. 
URL https://www.ebi.ac.uk/interpro/
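
The selection described above (proteins in a target genome that have member database signature matches but are not covered by any integrated InterPro entry) can be sketched as a set operation. This is an illustrative outline only, not the project's pipeline: the input structures, the function name and all accessions are hypothetical.

```python
def uncovered_proteins(matches, integrated_signatures):
    """Return proteins that have signature matches, none of which is integrated.

    matches: dict mapping a protein accession to the set of member database
        signature accessions that match it (hypothetical input format).
    integrated_signatures: set of signature accessions already integrated
        into InterPro entries.
    """
    return sorted(
        protein
        for protein, signatures in matches.items()
        # Proteins with no signatures at all are out of scope: the list
        # targets proteins that could gain coverage if a signature were integrated.
        if signatures and not (signatures & integrated_signatures)
    )


# Hypothetical proteome matches, for illustration only.
matches = {
    "protein_1": {"sig_A", "sig_B"},   # covered via sig_A
    "protein_2": {"sig_C"},            # matched, but sig_C is not integrated
    "protein_3": set(),                # no member database signature
    "protein_4": {"sig_C", "sig_D"},   # matched, neither is integrated
}
integrated = {"sig_A"}
todo = uncovered_proteins(matches, integrated)
```

The resulting list gives curators the proteins where integrating an existing signature would increase coverage for that genome.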
 
Title Increased annotation of eight key agricultural genomes 
Description We have developed a pipeline to generate lists of proteins that are not covered by integrated entries in InterPro in each of the targeted agriculturally relevant genomes that have associated member database signatures. 
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? Yes  
Impact Newly integrated proteins in InterPro for each key organism between September 2019 and March 2022: Wheat: 4500, Maize: 31878, Chicken: 147, Cow: 340, Salmon: 154, Pig: 216144, Sugar beet: 9, Miscanthus: 19. Newly integrated proteins in Pfam entries for each key organism between September 2019 and March 2022: Wheat: 3223, Maize: 2060, Chicken: 749, Cow: 920, Salmon: 1850, Pig: 980. 
URL https://www.ebi.ac.uk/interpro/
 
Title Preliminary work to investigate using CATH-Gene3D matches to speed up the searches of CATH-FunFams 
Description To speed up the calculation of the CATH-Funfams models for their integration in InterPro, preliminary work was done to investigate whether CATH-Gene3D matches could be used as a pre-filter to speed up the searches of CATH-Funfams. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact It has been found that the numbers of matches for the CATH-Funfams and CATH-Gene3D profiles against the UniProtKB database are not significantly different. Following these good results, the InterPro team has decided to go ahead with the integration of the CATH-Funfams profiles in its database. These will be provided as an automatic annotation of protein sequences, which will be displayed in a similar way to those from the MOBI-DB database. 
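
The pre-filter idea investigated above can be illustrated with a short sketch: a sequence is only searched against FunFam models that belong to a superfamily it already matches in CATH-Gene3D. This is one plausible reading of the approach, not the InterPro implementation. The input structures, the function name and the FunFam identifiers are hypothetical.

```python
def funfam_candidates(gene3d_hits, funfams_by_superfamily):
    """Restrict the FunFam search space using existing Gene3D matches.

    gene3d_hits: dict mapping a sequence ID to the set of CATH superfamily
        IDs it matches (hypothetical input format).
    funfams_by_superfamily: dict mapping a superfamily ID to the list of
        FunFam model IDs defined within it.
    Returns {sequence ID: [FunFam model IDs to search]}, omitting sequences
    with no candidate models.
    """
    candidates = {}
    for seq_id, superfamilies in gene3d_hits.items():
        models = []
        for superfamily in sorted(superfamilies):
            models.extend(funfams_by_superfamily.get(superfamily, []))
        if models:
            candidates[seq_id] = models
    return candidates


# Hypothetical inputs, for illustration only.
gene3d_hits = {
    "seqA": {"3.40.50.300"},
    "seqB": set(),             # no Gene3D match: skipped entirely
    "seqC": {"1.10.8.10"},     # superfamily with no FunFams in this toy mapping
}
funfams_by_superfamily = {
    "3.40.50.300": ["3.40.50.300-FF-000001", "3.40.50.300-FF-000002"],
}
to_search = funfam_candidates(gene3d_hits, funfams_by_superfamily)
```

A pre-filter of this kind only saves time if it does not discard true FunFam matches, which is why comparing match counts between the two profile sets is a sensible preliminary check.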
 
Title InterPro 
Description InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. We combine protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact All of the annotations provided by InterPro underpin the automatic annotation pipeline within the UniProt database. InterPro provides annotations for tens of millions of sequences to UniProt through the InterPro2GO pipeline. InterPro is the most widely used web service at EMBL-EBI, performing ~15,000,000 searches per month, from around the world. Since November 2019, we have released 9 updates of the InterPro data; in total, 1637 new InterPro entries have been created, representing a coverage of 97% of the proteins found in UniProtKB. The InterPro website is continually updated and a number of new features have been added, including the structural models for 6370 families from Pfam 33.1 without PDB structures. These data were generated through a collaboration with the Baker group from the University of Washington. 
URL https://www.ebi.ac.uk/interpro/
 
Title Pfam 
Description The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. Pfam also generates higher-level groupings of related entries, known as clans. A clan is a collection of Pfam entries which are related by similarity of sequence, structure or profile-HMM. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact Pfam is widely used within the research community. In the past year we have been actively migrating the data available in the Pfam website (pfam.xfam.org) into the InterPro website. Two data releases have been made available since November 2019; the total number of Pfam entries is 18259, included in 635 clans. 
URL http://pfam.xfam.org/
 
Description Providing AlphaFold structural models through Pfam and InterPro 
Organisation Alphabet
Department DeepMind
Country United Kingdom 
Sector Private 
PI Contribution We updated the Pfam and InterPro websites to display the AlphaFold structural models.
Collaborator Contribution DeepMind provided a large collection of structural models to EMBL-EBI which became part of the AlphaFold Database. They also discussed the design and implementation of their structural models in the Pfam and InterPro websites.
Impact https://proteinswebteam.github.io/interpro-blog/2021/07/22/AlphaFold-structure-predictions-available-in-InterPro/ Dr Bateman took part in a joint webinar with DeepMind to train scientists in interpreting AlphaFold models.
Start Year 2021
 
Description Structural Models for Pfam 
Organisation University of Washington
Country United States 
Sector Academic/University 
PI Contribution We initially made trRosetta structural models from the Baker group available via the InterPro and Pfam websites. More recently these have been replaced by RoseTTAFold structural models.
Collaborator Contribution The group of David Baker produced a collection of 6,370 trRosetta models of Pfam families with no known structure for Pfam release 33.1, which increases the fraction of Pfam families with structural data to 88%. For more recent Pfam releases the group have provided more accurate RoseTTAFold structural models.
Impact The collection of structural models has been made available via the InterPro and Pfam websites. This work has been described in a blog post and press release.
Start Year 2020
 
Description Development of a public engagement activity: Protein families card game 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Development of a card game, "Protein families", thanks to the support of the Wellcome Genome Campus public engagement fund.
The game is available as a printed version for face to face events, as well as an online version.
It is currently under testing.
Year(s) Of Engagement Activity 2021
 
Description InterPro and Pfam resources in the context of EBI structural bioinformatics course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 30 professionals received an introduction to the InterPro and Pfam resources, including lecture and practical, in the context of the EBI structural bioinformatics course.
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/training/events/structural-bioinformatics2021/
 
Description InterPro and Pfam resources in the context of the EBI Protein course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact An introduction to the InterPro and Pfam resources, including lecture and practicals, was given to professional scientists in the context of the EBI Protein course.
Year(s) Of Engagement Activity 2022
URL https://www.ebi.ac.uk/training/events/bioinformatics-resources-protein-biology-2022/
 
Description InterPro resource in the context of EBI structural bioinformatics course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 30 professionals received an introduction to the InterPro resource, including lecture and practical, in the context of the EBI structural bioinformatics course.
A user testing session of the InterPro website was organised with 3 of the participants.
Year(s) Of Engagement Activity 2020
 
Description InterPro resource in the context of the EBI Protein course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact An introduction to the InterPro resource, including lecture and practicals, was given to professional scientists in the context of the EBI Protein course.
Year(s) Of Engagement Activity 2020
 
Description UCL postgraduates training about InterPro and HMMER 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Postgraduate and undergraduate students from UCL attended a lecture and practical session on how to use InterPro and HMMER resources.
Year(s) Of Engagement Activity 2020,2021
 
Description UCL postgraduates training about InterPro, Pfam and HMMER 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Postgraduate and undergraduate students from UCL attended a lecture and practical session on how to use InterPro, Pfam and HMMER resources.
Year(s) Of Engagement Activity 2022
 
Description UniProt/InterPro joint webinar 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This webinar gave a brief introduction to the UniProt and InterPro websites and highlighted resources that proteomic scientists or other users with protein datasets may find useful to analyse their data. This encompasses searching by protein sequence, identifying protein peptides, and retrieving sequence-specific features and functional information both curated and predicted.
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/training/events/guide-proteomics-data-analysis-using-uniprot-and-interpro/
 
Description Webinar series on InterPro resources 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A series of 4 webinars about InterPro resources was organised:
- Understanding InterPro families, domains and functions
- Using the InterPro website in your research
- Accessing InterPro programmatically
- InterProScan
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/interpro/help/tutorial/