Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods PID 7012435

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Structural Molecular Biology

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

InterPro, Pfam, and FunFam are three well-known biological databases in the field of protein research. Pfam and FunFam are collections of protein domain families and protein structural domains, respectively, that share a common function. They are widely used by the scientific community to predict the location of domains and provide functional annotations of novel protein sequences. Pfam is a collection of approximately 20,000 entries that cover three-quarters of known proteins. Matches to Pfam are calculated using a set of profile hidden Markov models (HMMs) built from multiple sequence alignments. FunFam also relies on profile HMMs to classify CATH-Gene3D annotations into functional families, but its collection is approximately 10 times larger. This classification is based on automatic agglomerative clustering of CATH-Gene3D, yet it has been shown to produce very high-quality annotations. InterPro is a protein data resource that integrates 13 major protein family databases, including Pfam and FunFam, to present a unified and comprehensive description of protein families, domains, and functionally important sites.
Modern sequencing technologies have driven unprecedented growth in the number of available protein sequences, especially through metagenomics. In this context, this project aims to expand and enhance the coverage of protein families through the adoption of Deep Learning methods of protein classification. These emerging technologies already outperform current methods (profile HMMs, position-specific scoring matrices, and motifs) in terms of speed and accuracy, and their adoption will constitute a paradigm shift. As proteins will be annotated faster, we expect the carbon footprint of InterPro, Pfam, and FunFam to decrease significantly. Finally, we will improve the annotation of plant pathogens of agricultural importance, generating hundreds of new InterPro and Pfam entries.

Funded Value:

£130,886

Funded Period:

Feb 24 - Jan 27

Funder:

BBSRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

BB/X018563/1

Principal Investigator:

Christine Orengo

Research Subject:

Info. & commun. Technol. (16%)

Omic sciences & technologies (24%)

Plant & crop science (16%)

Tools, technologies & methods (40%)

Research Topic:

Artificial Intelligence (16%)

Bioinformatics (40%)

Interaction with organisms (16%)

Proteomics (24%)

Organisations

UNIVERSITY COLLEGE LONDON (Lead Research Organisation)

People	ORCID iD
Christine Orengo (Principal Investigator)

Publications


Blum M (2025) InterPro: the protein sequence classification resource in 2025 in Nucleic Acids Research

Research Tools and Methods


Title	Evaluation of InterPro-N predictions
Description	The InterPro team is developing AI-based approaches (InterPro-N) for automatically assigning domain boundaries to protein chains in UniProt. We evaluated predictions from three InterPro-N models (model A, model B and model C) against domain annotations from CATH. We analyzed two datasets, "ref" and "new" (total entries analyzed: ~200). The "ref" dataset contains a random sample of UniProtKB accessions that have at least one existing InterPro96 annotation, while entries in the "new" dataset have no InterPro96 annotation. We used the following scoring scheme:
Score 6 [Correct] is given when:
1. Both (i) the CATH superfamily annotation and (ii) the predicted domain boundaries match the CATH assignments (by our latest CATH-TED protocol)
2. All domains are predicted by the model (in the case of multi-domain proteins)
3. The prediction agrees with other member databases (if such an assignment is available)
Score 5 is given when:
1. The CATH superfamily annotation is correctly predicted by the model, but the domain boundaries are not
Score 4 [May be correct] is given when:
1. The model has a partial annotation match (e.g. the T (topology) level matches)
2. Only one of the domains is correctly predicted (in the case of multi-domain proteins)
Score 3 [May be wrong] is given when:
1. The structure is non-globular (small helix) and has no assignment in CATH (e.g. A0A507EA72)
Score 2 [Wrong] is given when:
1. The model predictions do not match the superfamily assignment in CATH
2. The domain is non-globular and no other supporting evidence for the domain assignment is provided in any other member database
3. The structural comparison scores are below thresholds (i.e. TM-score < 0.50)
Score 1 [Uncertain] is given when:
1. There is no structure available in the AlphaFold Database, so the prediction cannot be verified using structural/CATH information
We observed that all three models performed equally well on the "ref" dataset annotations, while model A performed best on the "new" dataset.
Type Of Material	Improvements to research infrastructure
Year Produced	2024
Provided To Others?	No
Impact	The evaluation scheme used in our approach helps select the appropriate model for predicting annotations with InterPro-N. The work and the writing of the manuscript are continuing.
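
The scoring scheme in the Description above can be sketched as a small function. This is a minimal illustration, not the evaluators' actual tooling: the `Evaluation` record, its field names, and the `score` function are all hypothetical, and the record does not say which rule wins when several criteria apply, so the precedence used here (missing structure first, then the TM-score threshold, then non-globular cases, then superfamily and topology matches) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold quoted in the description (TM-score < 0.50 counts as wrong).
TM_SCORE_THRESHOLD = 0.50

# Labels as given in the description; score 5 has no label there.
LABELS = {6: "Correct", 4: "May be correct", 3: "May be wrong",
          2: "Wrong", 1: "Uncertain"}


@dataclass
class Evaluation:
    """Hypothetical per-entry record; all field names are assumptions."""
    has_structure: bool = True          # structure present in AlphaFold DB
    superfamily_match: bool = False     # CATH superfamily agrees
    boundaries_match: bool = False      # domain boundaries agree with CATH
    all_domains_predicted: bool = True  # relevant for multi-domain proteins
    topology_match: bool = False        # partial match at the T level
    globular: bool = True
    has_cath_assignment: bool = True
    other_db_agrees: Optional[bool] = None  # None: no assignment available
    tm_score: Optional[float] = None        # None: not computed


def score(ev: Evaluation) -> int:
    """Map one evaluated prediction to a score from 1 to 6.

    The order of the checks is an assumed precedence, not one stated
    in the source.
    """
    if not ev.has_structure:
        return 1  # cannot be verified structurally
    if ev.tm_score is not None and ev.tm_score < TM_SCORE_THRESHOLD:
        return 2  # structural comparison below threshold
    if not ev.globular and not ev.has_cath_assignment:
        # Assumed reading: supporting evidence elsewhere softens 2 to 3.
        return 3 if ev.other_db_agrees else 2
    if ev.superfamily_match:
        if not ev.all_domains_predicted:
            return 4  # only some domains correctly predicted
        if not ev.boundaries_match:
            return 5  # right superfamily, wrong boundaries
        # Assumed: explicit disagreement with other databases demotes to 4.
        return 6 if ev.other_db_agrees is not False else 4
    if ev.topology_match:
        return 4  # partial annotation match
    return 2  # no match with the CATH superfamily assignment


if __name__ == "__main__":
    entry = Evaluation(superfamily_match=True, boundaries_match=True)
    s = score(entry)
    print(s, LABELS.get(s, ""))
```

Scoring each model's predictions this way and tallying the results per dataset ("ref" and "new") would give the kind of per-model comparison the description reports.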
