Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology


Technical Summary

InterPro and Pfam are preeminent complementary resources in the field of protein research. InterPro draws its information from a compendium of 13 expert member databases, including Pfam, enabling classification of protein sequences into families and prediction of functional domains and sites. Pfam generates protein families, with each curated entry represented by an alignment and profile hidden Markov model (HMM).
In light of the sheer volume of novel protein sequences being constantly discovered, especially through metagenomics, this proposal devises key developments to further improve functionality and scalability of these resources. We will enhance coverage of environmentally derived sequences (MGnify database, Tara Oceans and MMETSP projects) by generating families for the largest novel sequence clusters. We will incorporate de novo structural models and produce deep sequence alignments (using metagenomics sequences) necessary for the detection of co-evolving residues, which in turn will be used for structural modelling. The websites will facilitate visualization of these structural models and display co-variance contact sites. We will use a combination of known structures and models to classify additional Pfam entries into clans, as well as review domain boundaries. To increase InterPro coverage and functional annotations, we will integrate new resources (CATH FunFams) to provide sub-domain classifications, improve annotations (especially domains of unknown function) and maximise member database integrations. To enable scaling and refine annotations, we will adopt a new algorithm (TreeGrafter) in InterProScan, harmonise PANTHER and FunFams-based Gene Ontology terms within InterPro, and evaluate performance of an upgraded version of the HMMER software. Finally, we will focus annotation efforts on eight genomes of agricultural importance, including chicken, salmon, and wheat, generating thousands of Pfam and InterPro entries.
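
As a rough illustration of why deep alignments matter for detecting co-variation, the sketch below scores every pair of alignment columns by mutual information. This is a deliberately simple stand-in chosen for illustration, not the method used by the project; the function names and the toy alignment are hypothetical, and practical contact prediction relies on much deeper alignments and more sophisticated statistical models.

```python
import math
from collections import Counter
from itertools import combinations


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(col_i, col_j):
    """MI between two alignment columns: H(i) + H(j) - H(i, j)."""
    joint = list(zip(col_i, col_j))
    return column_entropy(col_i) + column_entropy(col_j) - column_entropy(joint)


def covarying_pairs(alignment, top=3):
    """Return the `top` column pairs ranked by mutual information.

    `alignment` is a list of equal-length aligned sequences (hypothetical
    input format for this sketch). Each result is (col_i, col_j, score).
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = list(zip(*alignment))
    scored = [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(len(columns)), 2)
    ]
    # Highest score first; ties broken by column index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j, score) for score, i, j in scored[:top]]


# Toy alignment (made up): columns 1 and 3 change together, 0 and 2 are fixed.
toy_alignment = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
]

if __name__ == "__main__":
    for i, j, score in covarying_pairs(toy_alignment, top=1):
        print(f"columns {i} and {j}: MI = {score:.2f} bits")
```

With only four sequences, any two columns that vary at all can look correlated by chance; adding many diverse sequences, such as those from metagenomics, is what makes pairwise statistics of this kind informative.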

Planned Impact

The field of protein research has witnessed an explosion in novel protein sequences due to advances in sequencing technologies. However, these sequences are meaningless without functional annotation. This proposal focuses on the world-leading protein databases, InterPro and Pfam, which are routinely used for protein annotation. Due to their extensive use by researchers worldwide, this application will impact most BBSRC strategic priorities - especially agriculture and food security, industrial biotechnology, and bioscience for health. To maximise the impact of these resources, we propose to exploit multiple computational approaches to (i) improve annotation of metagenomics datasets and eukaryotic marine microbes; (ii) provide co-evolutionary structural models for Pfam entries using deep alignments to build additional models and permit their visualization; (iii) integrate and improve annotations from current and new InterPro databases, such as PANTHER, CDD, and CATH FunFams; (iv) improve scaling and refine annotations by adopting new algorithms and software, like TreeGrafter and HMMER4, and reconcile Gene Ontology terms across databases; (v) systematically annotate eight genomes of agricultural importance. These developments will ensure users in the UK and the world over can derive the maximum benefit from these resources while further cementing their position as exceptional databases of immense importance to the scientific community at large.

Developing new pipelines to build new entries for proteins derived from metagenomics provides a unique exploitable opportunity for InterPro and Pfam. The fact that these resources will extend coverage of marine eukaryotic microbes will have significant, far-reaching impacts on other fields and analytical disciplines. This is especially true for the UK Darwin Tree of Life project, which forms part of a global initiative to sequence all eukaryotic species, aiming to revolutionize our understanding of biology, evolution and biodiversity. However, this will only be realised through detailed and accurate functional annotation, such as that provided by InterPro and Pfam.

The agricultural sector represents another area of considerable impact. Providing comprehensive functional annotations for proteins from widely farmed animal and plant species in the UK and worldwide will facilitate insights into the molecular basis of biological features including yield characteristics, capacity to resist disease and tolerance to the vagaries of nature. This will lead to socioeconomic benefits, through maximising land utilisation for growing crops such as wheat and sugar beet (the latter providing nearly 30% of the world's annual sugar production and forming an important source for bioethanol and animal feed), or enhancing the global aquaculture market, projected to reach $20 billion by 2022, where salmon is a substantial component.

Furthermore, the project outputs will be of exceptional value to the commercial sector, eventually benefiting the public. Improved annotations of proteins originating from microbes will lead to new discoveries, such as novel antibiotics for humans and livestock, higher agricultural yields from the understanding of ecological interplay (e.g. food chain microbes), expanded discovery of novel enzymes (e.g. psychrophilic enzymes for detergents) or those with novel catalytic functionality.

We will ensure impact on all academic and industrial audiences by the publication of software, data, and peer-reviewed articles. To ensure that resource developments are disseminated as widely as possible, we will deliver onsite training and webinars, participate in community workshops and produce online training materials. We will leverage our professional networks and collaborations, conference platforms and social media channels to further publicise key developments. The public sector will also be engaged, via specific events and the publication of non-specialist articles and interviews.

Publications

Description This project is ongoing. We have provided information on CATH functional families to InterPro. We will continue to provide updated CATH data as the project progresses.
Exploitation Route InterPro is widely used by biologists for obtaining predicted functions for query proteins.
Sectors Education, Healthcare, Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology

URL https://www.ebi.ac.uk/interpro/
 
Description CATH structural and functional annotations are now being disseminated by InterPro, which is widely used by industry. Outside academia, InterPro is one of the web portals most widely used by biologists in industry, with over 716,000 unique visitors per year. It combines protein family data from multiple resources to give greater confidence in the resulting annotations.
First Year Of Impact 2019
Sector Education, Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology