Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods

Lead Research Organisation: European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

Proteins are macromolecules responsible for biological processes in the cell. At their most basic level, they consist of a sequence of amino acids, determined by the sequence of nucleotides (the ATGC building blocks of life) in a gene. Proteins usually fold into three-dimensional structures, allowing them to interact with other molecules and perform their functions. Recent advances in sequencing technologies have led to a substantial accumulation of protein data, and our capacity to generate new protein sequences has surpassed our ability to fully understand their functions. Therefore, it is crucial to develop computational methods that identify sequence or structural similarities between characterised and uncharacterised proteins to transfer functional information from the former to the latter.
InterPro, Pfam and FunFam are world-leading, UK-based resources that group similar protein sequences together, forming protein families. Pfam is a collection of protein domain families containing functional annotations. FunFam focuses on protein structural domains that share a common function. InterPro merges information from 13 expert protein databases, including Pfam and FunFam, into a single searchable resource, and further annotates protein families.
In the past few years, Artificial Intelligence methods have been successfully applied to several biological problems. For instance, DeepMind's AlphaFold has revolutionised the prediction of how protein sequences fold into three-dimensional structures. Several promising tools are being developed by our collaborators to better identify protein families using Deep Learning (DL). These methods outperform current state-of-the-art approaches in terms of accuracy, coverage and computing efficiency, thus making them more environmentally sustainable.
In this ambitious project, we will improve the efficiency, accuracy, and sustainability of InterPro, Pfam and FunFam. This will be accomplished by reducing the technical debt of Pfam, established almost three decades ago, adopting DL approaches to enhance the classification of protein sequences into families, and significantly reducing the carbon footprint of sequence annotation. Finally, we will improve the annotation of agriculturally important plant pathogens, resulting in the creation of hundreds of additional InterPro and Pfam entries.

Technical Summary

InterPro, Pfam, and FunFam are three well-known biological databases in the field of protein research. Pfam and FunFam are collections of protein domain families and protein structural domains, respectively, that share a common function. They are widely used by the scientific community to predict the location of domains and provide functional annotations of novel protein sequences. Pfam is a collection of approximately 20,000 entries that cover three-quarters of known proteins. Matches to Pfam are calculated using a set of profile hidden Markov models (profile-HMMs) built from multiple sequence alignments. FunFam also relies on profile-HMMs to classify CATH-Gene3D annotations into functional families, but its collection is approximately 10 times larger. This classification is based on an automatic agglomerative clustering of CATH-Gene3D, yet it has been shown to produce very high-quality annotations. InterPro is a protein data resource that integrates 13 major protein family databases, including Pfam and FunFam, to present a unified and comprehensive description of protein families, domains, and functionally important sites.
The recent development of modern sequencing technologies has contributed to an unprecedented growth in the number of available protein sequences, especially through metagenomics. In this context, this project aims to expand and enhance the coverage of protein families through the adoption of Deep Learning methods of protein classification. These emerging technologies already outperform current methods (profile-HMMs, position-specific scoring matrices, and motifs) in terms of speed and accuracy, and their adoption will constitute a paradigm shift. As proteins will be annotated faster, we expect the carbon footprint of InterPro, Pfam, and FunFam to decrease significantly. Finally, we will improve the annotation of plant pathogens of agricultural importance, generating hundreds of new InterPro and Pfam entries.
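To make the idea of a "current method" concrete, the following is a minimal sketch of a position-specific scoring matrix, one of the techniques named above. It is deliberately simplified and is not the Pfam or InterPro implementation: the toy alignment, the uniform background frequency, the pseudocount value, and the function names build_pssm and score are all assumptions made for illustration, and gaps and insertions (which profile-HMMs handle) are ignored.

```python
import math
from collections import Counter

# Toy alignment (hypothetical data): equal-length, gap-free sequences.
ALIGNMENT = ["ACDA", "ACEA", "ACDG", "TCDA"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, sequence):
    """Sum the per-position scores of a sequence as long as the matrix."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the number of columns")
    return sum(column[aa] for column, aa in zip(pssm, sequence))


pssm = build_pssm(ALIGNMENT)
family_like = score(pssm, "ACDA")  # resembles the alignment
unrelated = score(pssm, "WWWW")    # residues never seen in the alignment
```

A sequence resembling the alignment scores above zero and an unrelated one below zero. Real family models add position-specific gap handling and calibrated significance thresholds, which this sketch omits.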

Publications

 
Title InterPro 
Description InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. We combine protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact All of the annotations provided by InterPro underpin the automatic annotation pipeline within the UniProt database. InterPro provides tens of millions of sequences to UniProt through the InterPro2GO pipeline. InterPro is the most widely used web service at EMBL-EBI, performing ~15,000,000 searches per month from around the world. Since November 2019, we have released 9 updates of the InterPro data. In total, 1637 new InterPro entries have been created, representing a coverage of 97% of the proteins found in UniProtKB. The InterPro website is continually updated and a number of new features have been added, including the structural models for 6370 families from Pfam 33.1 without PDB structures. These models were generated in collaboration with the Baker group from the University of Washington. 
URL https://www.ebi.ac.uk/interpro/
 
Title Pfam 
Description Protein Family database 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact The annotation of the millions of sequences that are generated by modern DNA sequencing technologies. 
URL http://pfam.xfam.org
 
Title Pfam-N 
Description Deep learning method to predict protein families and domains trained on the Pfam dataset 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact Pfam-N significantly enhances the annotation of UniProtKB, covering 85.7% of sequences, an 8.9% increase over Pfam 36.0. Improved deep learning models boost annotation accuracy, recovering 97% of known annotations with 93% precision. The model identifies 22.8 million previously unannotated proteins, including 10 million lacking any prior functional insights. These improvements expand Pfam's utility, supporting more accurate function predictions and enabling new discoveries in structural and functional genomics. This research strengthens bioinformatics resources, benefiting computational biology and biomedical research communities. 
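The two accuracy figures above can be read as recall ("recovering 97% of known annotations") and precision (93%); treating the first figure as recall is an interpretation, not something the record states. The sketch below shows how the two quantities are computed from match counts. The counts are hypothetical, chosen only so that the arithmetic lands near the reported percentages, and the function name precision_recall is illustrative.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from annotation counts."""
    predicted = true_positives + false_positives  # everything the model called
    known = true_positives + false_negatives      # everything it should find
    if predicted == 0 or known == 0:
        raise ValueError("need at least one prediction and one known annotation")
    return true_positives / predicted, true_positives / known


# Hypothetical counts, not taken from the Pfam-N evaluation.
precision, recall = precision_recall(
    true_positives=930, false_positives=70, false_negatives=29
)
```

Precision answers "how many predicted annotations are correct", while recall answers "how many known annotations were found", so the two figures are not interchangeable.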
 
Description Lucy Colwell - Google DeepMind 
Organisation Google
Department DeepMind
Country United Kingdom 
Sector Private 
PI Contribution Assessment of deep learning models. Integration of Pfam-N data in the InterPro infrastructure
Collaborator Contribution Developed Deep Learning models for Pfam and InterPro member databases
Impact Pfam-N annotations are available on the InterPro website and as part of the InterPro release files
Start Year 2022
 
Title InterProScan 
Description InterProScan combines different protein signature recognition methods from the InterPro member databases. Sequences are submitted in FASTA format. Matches are then calculated against the signatures of all the required member databases, and the results are output in a variety of formats. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact InterProScan has enhanced its functionality by incorporating Gene Ontology (GO) annotations from PANTHER, complementing the existing annotations from InterPro. Additionally, the latest update introduces the reporting of representative domains. These representative domains are automatically chosen from those that match a given sequence, aiming to optimise sequence coverage while minimising overlap between domains. In the XML and JSON outputs of InterProScan, domains are now accompanied by a "representative" attribute, indicating whether a domain has been selected as representative or not. This refinement enhances the precision and comprehensiveness of domain information provided by InterProScan. 
URL https://www.ebi.ac.uk/interpro/download/InterProScan/
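
The record above describes representative domains as chosen to optimise sequence coverage while minimising overlap. The following is a minimal sketch of one way such a selection could work, using a greedy longest-first rule. It is not InterProScan's actual algorithm: the Match type, the pick_representatives function, the 30% overlap threshold, and the accessions are all hypothetical, and the output dictionaries only mimic the idea of a "representative" flag rather than the real XML or JSON schema.

```python
from typing import NamedTuple


class Match(NamedTuple):
    accession: str  # hypothetical identifier
    start: int      # 1-based, inclusive
    end: int        # inclusive


def overlap(a, b):
    """Number of residues shared by two matches (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def pick_representatives(matches, max_overlap_fraction=0.3):
    """Greedy choice: take longer matches first, skip heavy overlaps."""
    chosen = []
    by_length = sorted(matches, key=lambda m: (-(m.end - m.start + 1), m.start))
    for m in by_length:
        length = m.end - m.start + 1
        if all(overlap(m, c) / length <= max_overlap_fraction for c in chosen):
            chosen.append(m)
    return sorted(chosen, key=lambda m: m.start)


matches = [
    Match("DOM_A", 10, 120),
    Match("DOM_B", 15, 110),   # almost entirely inside DOM_A
    Match("DOM_C", 130, 200),
    Match("DOM_D", 115, 140),  # straddles DOM_A and DOM_C
]
representatives = pick_representatives(matches)
annotated = [
    {"accession": m.accession, "representative": m in representatives}
    for m in matches
]
```

With these toy matches, DOM_A and DOM_C are kept, while DOM_B and DOM_D are flagged as non-representative because too much of their length overlaps an already chosen match.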
 
Description Bury St Edmunds - Suffolk Family Carers Science Spooktacular 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Presentation of the protein families game to children and their parents/carers
Year(s) Of Engagement Activity 2024
 
Description Deep learning advisory group meeting 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Bi-annual deep learning advisory group meeting to present the latest developments in DL methods in InterPro and discuss next steps
Year(s) Of Engagement Activity 2024,2025
 
Description EBI Protein & complex day - InterPro & Pfam 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other audiences
Results and Impact Presentation entitled "Predicting molecular complexes with AlphaFold and beyond", given to other teams working at EBI
Year(s) Of Engagement Activity 2024
 
Description EMBL structural biology retreat 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Presentation entitled "Using structure prediction to computationally identify protein function", given to EMBL employees
Year(s) Of Engagement Activity 2025
 
Description InterPro and Pfam resources in the context of EBI structural bioinformatics course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 30 professionals received an introduction to the InterPro and Pfam resources, including lecture and practical, in the context of the EBI structural bioinformatics course.
Year(s) Of Engagement Activity 2022,2023,2024
URL https://www.ebi.ac.uk/training/events/structural-bioinformatics2021/
 
Description InterPro release blog posts 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact For each InterPro release (every 2 months), we write a blog post presenting the latest features and updates made to the InterPro website.
Year(s) Of Engagement Activity 2020,2021,2022,2023,2024,2025
URL https://proteinswebteam.github.io/interpro-blog/
 
Description InterPro sequence search and analysis in the context of EBI Job Dispatcher workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact 250+ professionals received an introduction to InterPro and the InterProScan sequence analysis tool
Year(s) Of Engagement Activity 2024,2025
 
Description InterPro social media posts 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Since 2020, we have been posting weekly on the social media platform X about new releases and features available on the InterPro website.
Year(s) Of Engagement Activity 2020,2021,2022,2023,2024,2025
URL https://twitter.com/InterProDB
 
Description Pfam release blog posts 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact For each Pfam release, we write a blog post presenting the release data content, the source of new entries, and interesting cases.
Year(s) Of Engagement Activity 2020,2021,2022,2023,2024
URL https://xfam.wordpress.com
 
Description Seminar at APBJC 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation on how AI is used in Pfam and InterPro at an international conference
Year(s) Of Engagement Activity 2024
 
Description UCL postgraduates training about InterPro, Pfam and HMMER 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Postgraduate and undergraduate students from UCL attended a lecture and practical session on how to use InterPro, Pfam and HMMER resources.
Year(s) Of Engagement Activity 2022,2023,2024