Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods PID 7012435
Lead Research Organisation:
UNIVERSITY COLLEGE LONDON
Department Name: Structural Molecular Biology
Technical Summary
InterPro, Pfam, and FunFam are three well-known biological databases in the field of protein research. Pfam and FunFam are collections of protein domain families and protein structural domains, respectively, whose members share a common function. They are widely used by the scientific community to predict the location of domains and to provide functional annotations for novel protein sequences. Pfam is a collection of approximately 20,000 entries that cover three-quarters of known proteins. Matches to Pfam are calculated using a set of profile hidden Markov models (HMMs) built from multiple sequence alignments. FunFam also relies on profile HMMs to classify CATH-Gene3D annotations into functional families, but its collection is approximately 10 times larger. This classification is based on an automatic agglomerative clustering of CATH-Gene3D and has nevertheless been shown to produce very high-quality annotations. InterPro is a protein data resource that integrates 13 major protein family databases, including Pfam and FunFam, to present a unified and comprehensive description of protein families, domains, and functionally important sites.
The recent development of modern sequencing technologies has contributed to an unprecedented growth of the number of available protein sequences, especially through metagenomics. In this context, this project aims to expand and enhance the coverage of protein families through the adoption of Deep Learning methods of protein classification. These emerging technologies already outperform current methods (profile-HMMs, position-specific scoring matrices, and motifs) in terms of speed and accuracy, and their adoption will constitute a paradigm shift. As proteins will be annotated faster, we expect the carbon footprint of InterPro, Pfam, and FunFam to decrease significantly. Finally, we will improve the annotation of plant pathogens of agricultural importance, generating hundreds of new InterPro and Pfam entries.
People | Christine Orengo (Principal Investigator) |
Publications

Blum M (2025) InterPro: the protein sequence classification resource in 2025, in Nucleic Acids Research
Title | Evaluation of InterPro-N predictions |
Description | The InterPro team is developing AI-based approaches (InterPro-N) for automatically assigning domain boundaries to protein chains in UniProt. We evaluated predictions from three InterPro-N models (model A, model B and model C) against domain annotations from CATH. We analyzed two datasets, "ref" and "new" (total entries analyzed: ~200). The "ref" dataset contains a random sample of UniProtKB accessions with at least one existing InterPro96 annotation, while entries in the "new" dataset have no InterPro96 annotation. We used the following scoring scheme:
Score 6 [Correct] is given when:
1. Both (i) the CATH superfamily annotation and (ii) the predicted domain boundaries match the CATH assignments (from our latest CATH-TED protocol)
2. All domains are predicted by the model (for multi-domain proteins)
3. The prediction agrees with other member databases (where such an assignment is available)
Score 5 is given when:
1. The CATH superfamily annotation is correctly predicted by the model, but the domain boundaries are not
Score 4 [May be correct] is given when:
1. The model has a partial annotation match (e.g. only the T (topology) level matches)
2. Only one of the domains is correctly predicted (for multi-domain proteins)
Score 3 [May be wrong] is given when:
1. The structure is non-globular (small helix) and has no assignment in CATH (e.g. A0A507EA72)
Score 2 [Wrong] is given when:
1. The model predictions do not match the superfamily assignment in CATH
2. The domain is non-globular and no other member database provides supporting evidence for the domain assignment
3. The structural comparison scores are below thresholds (i.e. TM-score < 0.50)
Score 1 [Uncertain] is given when:
1. No structure is available in the AlphaFold Database, so the prediction cannot be verified using structural/CATH information
We observed that all three models performed equally on the "ref" dataset, while model A performed best on the "new" dataset. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2024 |
Provided To Others? | No |
Impact | The evaluation scheme helps in selecting the most appropriate model for predicting annotations with InterPro-N. The work and the writing of the manuscript are ongoing. |
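
The scoring scheme in the Description above can be sketched as a small decision function. This is only an illustrative sketch, not the evaluation code used by the team: the `Prediction` fields, the function names, and in particular the order in which the criteria are checked are assumptions, since the record does not say how overlapping criteria (for example scores 2 and 3 for non-globular structures) were resolved. Only the six score values and the TM-score threshold of 0.50 come from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

TM_SCORE_THRESHOLD = 0.50  # threshold quoted in the Description


@dataclass
class Prediction:
    """Hypothetical per-entry evaluation record (field names are assumptions)."""
    has_afdb_structure: bool                 # structure available in the AlphaFold Database
    is_globular: bool = True
    has_cath_assignment: bool = True
    superfamily_match: bool = False          # CATH superfamily predicted correctly
    topology_match: bool = False             # partial match at the T (topology) level
    boundaries_match: bool = False           # domain boundaries agree with CATH
    all_domains_predicted: bool = True       # relevant for multi-domain proteins
    other_db_support: Optional[bool] = None  # None: no other member database assignment
    tm_score: Optional[float] = None         # None: no structural comparison performed


def score_prediction(p: Prediction) -> int:
    """Map one evaluated prediction to a score from 1 to 6 (assumed precedence)."""
    if not p.has_afdb_structure:
        return 1  # Uncertain: nothing to verify against
    if p.tm_score is not None and p.tm_score < TM_SCORE_THRESHOLD:
        return 2  # Wrong: structural comparison below threshold
    if not p.is_globular:
        # Assumption: support from another member database separates 3 from 2.
        if not p.has_cath_assignment and p.other_db_support:
            return 3  # May be wrong
        if not p.other_db_support:
            return 2  # Wrong: non-globular with no supporting evidence
    if p.superfamily_match:
        if not p.all_domains_predicted:
            return 4  # May be correct: only some domains predicted
        if p.boundaries_match and p.other_db_support is not False:
            return 6  # Correct
        return 5      # Superfamily correct, boundaries (or agreement) not
    if p.topology_match:
        return 4      # May be correct: partial annotation match
    return 2          # Wrong: no match with the CATH superfamily


def tally_scores(predictions: Iterable[Prediction]) -> Counter:
    """Count how many predictions received each score, e.g. per model and dataset."""
    return Counter(score_prediction(p) for p in predictions)
```

Calling `tally_scores` once per model on the "ref" and "new" entries would give score distributions that can be compared across models, which is the kind of comparison the Description summarises.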