Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods

Lead Research Organisation: European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

Proteins are macromolecules responsible for biological processes in the cell. At their most basic level, they consist of a sequence of amino acids, determined by the sequence of nucleotides (the ATGC building blocks of life) in a gene. Proteins usually fold into three-dimensional structures, allowing them to interact with other molecules and perform their functions. Recent advances in sequencing technologies have led to a substantial accumulation of protein data, and our capacity to generate new protein sequences has surpassed our ability to fully understand their functions. It is therefore crucial to develop computational methods that identify sequence or structural similarities between characterised and uncharacterised proteins, so that functional information can be transferred from the former to the latter.
InterPro, Pfam and FunFam are world-leading, UK-based resources that group similar protein sequences together, forming protein families. Pfam is a collection of protein domain families containing functional annotations. FunFam focuses on protein structural domains that share a common function. InterPro merges information from 13 expert protein databases, including Pfam and FunFam, into a single searchable resource, and further annotates protein families.
In the past few years, Artificial Intelligence methods have been successfully applied to several biological problems. For instance, DeepMind's AlphaFold has revolutionised the prediction of how protein sequences fold into three-dimensional structures. Several promising tools are being developed by our collaborators to better identify protein families using Deep Learning (DL). These methods outperform current state-of-the-art approaches in terms of accuracy, coverage and computing efficiency, and their lower computing cost makes them more environmentally sustainable.
In this ambitious project, we will improve the efficiency, accuracy, and sustainability of InterPro, Pfam and FunFam. This will be accomplished by reducing the technical debt of Pfam, which was established almost three decades ago; adopting DL approaches to enhance the classification of protein sequences into families; and significantly reducing the carbon footprint of sequence annotation. Finally, we will improve the annotation of agriculturally important plant pathogens, resulting in the creation of hundreds of additional InterPro and Pfam entries.

Technical Summary

InterPro, Pfam, and FunFam are three well-known biological databases in the field of protein research. Pfam and FunFam are collections of protein domain families and protein structural domains, respectively, that share a common function. They are widely used by the scientific community to predict the location of domains and provide functional annotations of novel protein sequences. Pfam is a collection of approximately 20,000 entries that cover three-quarters of known proteins. Matches to Pfam are calculated using a set of profile hidden Markov models (profile-HMMs) built from multiple sequence alignments. FunFam also relies on profile-HMMs to classify CATH-Gene3D annotations into functional families, but its collection is approximately 10 times larger. Although this classification is based on an automatic agglomerative clustering of CATH-Gene3D, it has been shown to produce very high-quality annotations. InterPro is a protein data resource that integrates 13 major protein family databases, including Pfam and FunFam, to present a unified and comprehensive description of protein families, domains, and functionally important sites.
The recent development of modern sequencing technologies has contributed to an unprecedented growth in the number of available protein sequences, especially through metagenomics. In this context, this project aims to expand and enhance the coverage of protein families through the adoption of Deep Learning methods of protein classification. These emerging technologies already outperform current methods (profile-HMMs, position-specific scoring matrices, and motifs) in terms of speed and accuracy, and their adoption will constitute a paradigm shift. As proteins will be annotated faster, we expect the carbon footprint of InterPro, Pfam, and FunFam to decrease significantly. Finally, we will improve the annotation of plant pathogens of agricultural importance, generating hundreds of new InterPro and Pfam entries.
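
As a purely illustrative aside, the profile-based methods named above share one idea: summarise a multiple sequence alignment column by column, then score a new sequence against that summary. The sketch below builds a position-specific scoring matrix, the simplest of these methods, from a toy alignment. It is not how Pfam or FunFam compute matches; a profile-HMM additionally models insertions and deletions. The alignment, the function names `build_pssm` and `score`, the uniform background frequency, and the pseudocount value are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy alignment (invented for illustration, not real family data).
TOY_ALIGNMENT = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]


def build_pssm(alignment, pseudocount=1.0):
    """Build per-column log2-odds scores from equal-length aligned sequences.

    Assumes a uniform background frequency over the 20 standard amino acids;
    gap characters and non-standard residues are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, sequence):
    """Sum the column scores for an ungapped sequence of the same length."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the number of columns")
    # Residues outside the 20 standard amino acids contribute a neutral 0.0.
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, sequence))


if __name__ == "__main__":
    pssm = build_pssm(TOY_ALIGNMENT)
    for candidate in ("MKVLA", "WWWWW"):
        print(candidate, round(score(pssm, candidate), 2))
```

A sequence resembling the alignment scores higher than an unrelated one. Deep Learning classifiers replace this hand-built column summary with a learned representation of the sequence.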
