Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods PID 7012435

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

InterPro, Pfam, and FunFam are three well-known biological databases in the field of protein research. Pfam and FunFam are collections of protein domain families and of protein structural domains, respectively, whose members share a common function. They are widely used by the scientific community to predict the location of domains and to provide functional annotations of novel protein sequences. Pfam is a collection of approximately 20,000 entries that cover three-quarters of known proteins. Matches to Pfam are calculated using a set of profile hidden Markov models (profile-HMMs) built from multiple sequence alignments. FunFam also relies on profile-HMMs to classify CATH-Gene3D annotations into functional families, but its collection is approximately 10 times larger. Although this classification is based on automatic agglomerative clustering of CATH-Gene3D, it has been shown to produce very high-quality annotations. InterPro is a protein data resource that integrates 13 major protein family databases, including Pfam and FunFam, to present a unified and comprehensive description of protein families, domains, and functionally important sites.
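To illustrate the idea of matching a sequence against a model derived from a multiple sequence alignment, the sketch below builds a position-specific log-odds profile from a toy ungapped alignment and slides it along a query sequence. It is a deliberately simplified stand-in, not a profile-HMM: it has no insert or delete states and is not how Pfam or FunFam compute matches. The function names, the toy alignment, the uniform background frequencies, and the pseudocount value are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped toy alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Assumption: uniform background frequency for every amino acid.
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_match(profile, sequence):
    """Slide the profile along the sequence; return (best_score, start)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute 0.
        score = sum(profile[i].get(aa, 0.0) for i, aa in enumerate(window))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Hypothetical four-column alignment and query, for illustration only.
toy_alignment = ["GDSL", "GDSL", "GESL", "GDSI"]
toy_profile = build_profile(toy_alignment)
score, start = best_match(toy_profile, "MKTAGDSLPQ")
print(f"best window starts at index {start} with score {score:.2f}")
```

A real profile-HMM additionally models insertions and deletions relative to the alignment columns, which is what lets it locate domains in sequences of varying length.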
The recent development of modern sequencing technologies has contributed to unprecedented growth in the number of available protein sequences, especially through metagenomics. In this context, this project aims to expand and enhance the coverage of protein families through the adoption of Deep Learning methods for protein classification. These emerging technologies already outperform current methods (profile-HMMs, position-specific scoring matrices, and motifs) in terms of speed and accuracy, and their adoption will constitute a paradigm shift. Because proteins will be annotated faster, we expect the carbon footprint of InterPro, Pfam, and FunFam to decrease significantly. Finally, we will improve the annotation of plant pathogens of agricultural importance, generating hundreds of new InterPro and Pfam entries.
