18-BBSRC-NSF/BIO : CIBR:Implementing an explicit phylogenetic framework for large-scale protein sequence annotation

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

Proteins are the primary molecular machines that perform the instructions encoded in our genomes. Proteins ultimately shape the response of our cells, tissues, organs, and bodies to the surrounding environment, either directly (e.g. muscle contraction) or through their functional outputs (e.g. the electrical signals along the dendrites to produce a nerve impulse or action potential). Therefore, understanding the functional role(s) performed by each protein is critical to research and development in many areas of science, particularly biology, medicine and applied biotechnology.

The rapid increase in throughput of next-generation sequencing technologies has important ramifications, in that our ability to sequence an organism's genome and determine the proteins it encodes far outpaces our ability to experimentally characterise the function of a protein. Thus, for every functionally characterised protein, there are now many thousands of proteins that will never be experimentally characterised. Molecular biology increasingly relies on our ability to computationally group related sequences and to transfer functional annotations from the few experimentally characterised proteins to related, yet uncharacterised, proteins.

Knowledge on proteins has been collected and stored in public databases like UniProt, a world-leading resource on protein sequences and function. Currently, there are over 150 million sequences in UniProt, with the number doubling every two years. Therefore, it is crucial to develop new and reliable computational methods for inferring protein function that can be scaled to billions of sequences.

We aim to implement an annotation system that incorporates evolutionary information, permitting the level of annotation transfer to be tuned accordingly, while also ensuring scalability and speed of annotation that meets current and future demands. This new annotation system will integrate the most innovative features present in two pre-existing methods that are currently used in producing world-class resources. The Gene Ontology (GO) Consortium has developed software for explicit evolutionary modelling of GO annotation gain and loss along specific branches of phylogenetic trees, and has applied it to inferring GO annotations for experimentally uncharacterised proteins. UniProt has developed the UniRule system, which applies annotation "rules" that combine information on protein families and domains (from the InterPro resource) with a range of other types of information, like taxonomy, to make more precise and informative annotations.

Our goal is to create a next-generation, large-scale annotation system that merges the two approaches, and to implement this annotation system in the UniProt resource, thereby increasing the quality of functional annotations in the database for the benefit of the scientific community. We propose three specific aims to achieve this goal: (1) convert existing UniRule rules into explicit evolutionary models, (2) integrate software to apply the evolutionary models (TreeGrafter) into the UniProt annotation pipeline, and (3) develop software for ongoing curation of new evolutionary models of additional annotation types and protein families. The result will be an annotation pipeline based on explicit evolutionary principles, which will enable seamless sharing of information between the UniProt and GO curation processes, and substantially improve the accuracy, comprehensiveness and informativeness of inferred protein annotations in public databases.

Technical Summary

Advances in DNA sequencing technologies are rapidly expanding our knowledge of protein sequences, but only a small fraction of these proteins has been experimentally characterised. The UniProt protein knowledgebase aims to maximise the utility of protein sequence data to the scientific community: it not only presents the sequences but also provides "annotations", i.e. data that help to infer functional information about those sequences, like predicted active sites. The current approach to large-scale annotation of proteins in UniProt, called UniRule, relies on ad hoc rules to define sets of proteins that should be annotated similarly. While these rules implicitly utilise information about evolutionary relationships (e.g. membership in a protein family), they do not model functional evolution explicitly and are thus limited in the specificity of annotations they can express; for example, a protein family may have two or more distinct subtypes that perform related but distinguishably different functions. Here, we propose to implement an explicit evolutionary approach to large-scale sequence annotation. We will build upon previous work (1) on evolutionary modelling of gain and loss of protein functions (represented as Gene Ontology terms, GO) in gene families; and (2) on software to reconstruct the evolutionary history of any arbitrary protein sequence by placing it in the context of a phylogenetic tree. This will enable the specificity of the annotations transferred to be decided based on the evolutionary difference between the characterised protein and the protein to be annotated. Broadening the range of annotations, coupled with increasing numbers of sequences, will present key technical challenges that will be addressed during the course of this work. Overall, implementing this approach within UniProt will integrate the large-scale annotation systems already used in the UniProt and GO projects and result in increased specificity and coverage of annotations in UniProt.
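
The idea of tuning annotation specificity to evolutionary difference can be illustrated with a minimal sketch. It is not the project's method: the function name, the distance thresholds and the annotation labels are all hypothetical, chosen only to show how a closer relative could receive a more specific annotation than a distant one.

```python
from typing import List, Optional, Tuple

# (maximum distance, annotation), ordered from most specific to most general.
# Thresholds and labels are invented for illustration only.
LEVELS: List[Tuple[float, str]] = [
    (0.3, "subfamily-level function"),
    (1.0, "family-level function"),
]


def annotation_for_distance(
    distance: float, levels: List[Tuple[float, str]] = LEVELS
) -> Optional[str]:
    """Return the most specific annotation allowed at this distance.

    `distance` stands for some measure of evolutionary difference between
    the characterised protein and the query (an assumption of this sketch).
    Returns None when the query is too distant for any transfer.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    for max_distance, annotation in levels:
        if distance <= max_distance:
            return annotation
    return None


close = annotation_for_distance(0.1)    # within the subfamily threshold
medium = annotation_for_distance(0.5)   # only the family threshold applies
distant = annotation_for_distance(2.0)  # beyond every threshold
```

In this sketch a query beyond every threshold receives no annotation at all, which reflects the conservative choice of preferring no transfer over an over-specific one.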

Planned Impact

The field of protein research has witnessed an explosion in novel protein sequences due to advances in sequencing technologies. However, our ability to understand their role(s) within a cell largely relies on our ability to functionally annotate them. In modern molecular biology, the vast majority of functional annotations is performed computationally by identifying similarities between the few experimentally characterised sequences and uncharacterised ones, followed by the transfer of annotations. This proposal aims to build on two recent developments in the field of computational functional inference: (i) evolutionary modelling of Gene Ontology (GO) annotations using phylogenetic trees; and (ii) annotation "rules", applied by the UniRule system, that combine information on protein families with other data types like taxonomy. Merging these two approaches within the UniProt production framework will increase the number and quality of functional annotations, benefiting the entire scientific community.

The impact of this project will be facilitated by the high profile of both the UniProt and GO resources that provide the distinct annotation functionalities. This impact is measurable in terms of the numbers of users, citations, and educational materials. The UniProt website receives 700,000 visitors and 6 million hits per month. Similarly, the GO resource has been cited over 70,000 times. One of the major impacts expected from this project will be the increased accuracy, comprehensiveness and informativeness of inferred UniProt annotations, including GO annotations. This will benefit the broad user community of UniProt, encompassing academia and biotechnological industries.

The improved functional annotations will help academics ranging from evolutionary biologists to those studying pathogenesis. To ensure dissemination of the proposed new developments to users, we will create a new online training module focused on the use of evolutionary models in bioinformatics to explore the basic principles of molecular evolution that enable adaptation of gene functions through a gene family. We will also investigate the provisioning of online "learning pathways", which will group different combinations of training modules according to a user's research background. People at over 400,000 unique IP addresses have accessed the EMBL-EBI online training resource, thus amply demonstrating the demand for bioinformatics training.

The project outputs will be of exceptional value to the commercial sector as well, eventually benefiting the public. For example, improved annotations of proteins will lead to the discovery of novel antibiotics for humans and livestock, and higher agricultural yields by helping to understand the mechanisms underlying abiotic resistance. More precise sub-family annotations provided by this work will facilitate discovery of novel enzymes with distinct functionalities, i.e. ability to break down alternative substrates or substitute processes that are currently performed via organic synthesis (chemical).

We will also disseminate project outputs to academic and industrial audiences via the publication of software, data, and peer-reviewed articles. We will leverage our professional networks and collaborations, conference platforms and social media channels to further publicise key developments. The public sector will also be engaged, via specific events and the publication of non-specialist articles and interviews.

This transatlantic project involves four staff members directly, as well as other staff members within the three participating groups. It will expose all teams to different approaches of working, as well as strengthen international collaborations, especially in the field of bioinformatics. All staff members will receive relevant scientific training and opportunities to develop their interpersonal skills.

Publications

MacDougall A (2020) UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase. in Bioinformatics (Oxford, England)

UniProt Consortium (2021) UniProt: the universal protein knowledgebase in 2021. in Nucleic acids research

 
Description The overwhelming majority of sequences in public databases remain experimentally uncharacterized, a trend which is increasing rapidly with the continued development of modern sequencing technologies. There are currently over 140 million sequences in UniProt, with the number currently doubling every two years. The number of protein sequences arising from metagenomic assembly now runs into the billions. It is therefore indispensable to develop powerful and reliable computational methods for inferring, or "annotating", protein function. We are working on implementing an annotation system based on explicit modeling of evolutionary processes, rather than on the status quo of arbitrary, ad hoc rules. This system will integrate the most innovative elements present in two existing systems that are already used in production for world-class resources, and have been extensively tested. PI Thomas's group at USC (and collaborators) has developed software for explicit evolutionary modeling of Gene Ontology (GO) annotation gain and loss along specific branches of phylogenetic trees, and has applied it to inferring GO annotations for experimentally uncharacterized proteins [13]. PI Martin's group at EMBL-EBI has developed the UniRule system (currently used for annotation of the UniProt resource), which applies annotation "rules" that combine "signatures" of protein families and domains (from the InterPro resource) with additional information, such as the species origin of a sequence, to make more precise and informative annotations not only of GO terms, but of many other annotation types. Co-I Bateman's group, also at EMBL-EBI, runs the InterPro database and is responsible for maintaining the InterProScan software that supplies the UniRule system with protein family annotations.
They are also responsible for the Pfam database and the HMMER web server, and as such are world-leading experts in applying profile hidden Markov models (HMMs) for protein annotation. The goal of this proposal is to create a next-generation, large-scale annotation system that merges the two approaches, and to implement this annotation system in the UniProt production environment. We propose three specific aims to achieve this goal: 1) convert existing UniRule rules into explicit evolutionary models, 2) integrate software for applying the evolutionary models (TreeGrafter) into the UniProt annotation pipeline, and 3) develop software for ongoing curation of new evolutionary models of additional annotation types and protein families. The final result will be an annotation pipeline based on explicit evolutionary principles, which will enable seamless information sharing between the UniProt and GO curation processes, and substantially improve the accuracy, comprehensiveness and informativeness of inferred UniProt annotations.
The major achievements during the past year have been:
(1) Conversion of the existing UniRules into the evolutionary framework of gain and loss of protein characteristics (annotations) along specific branches of the phylogenetic tree. Building on our work from the previous year, we have modified our algorithm for automatically converting UniRules into evolutionary gain and loss of characteristics, and run the program on all UniRules. We have reviewed the resulting tree-based rules, and approximately 65% of all existing rules can now be straightforwardly mapped to our evolutionary framework. These rules are now accurate enough for testing at large scale once the TreeGrafter implementation is complete.
(2) Development of a proof-of-concept procedure showing how the TreeGrafter process can be decomposed into the relevant steps, and which data need to be captured and propagated to InterPro and/or UniProtKB. During the past year, we have implemented TreeGrafter in Python to facilitate its maintenance and its integration into InterProScan, and we have updated the underlying grafting software to support EPA-ng (Barbera P., et al. Systematic Biology, 2019; https://doi.org/10.1093/sysbio/syy054) as an alternative to RAxML, which dramatically reduces both runtime and memory footprint. The input/output (I/O) operations required have also been greatly reduced by reading the trees only once instead of once for each grafting operation. Furthermore, by replacing the current PANTHER subfamily HMM scoring with the more efficient and more accurate TreeGrafter, we are able to decrease the carbon footprint of InterProScan, as PANTHER 15.0 subfamily HMM scoring was responsible for 35% of the overall compute time of InterPro production.
(3) Evaluation and study of the database backend for storing annotation blocks, and their gains and losses along branches in the evolutionary trees. We have implemented a new condition type in the UniRule database backend and UniRule web interface that specifies a "gain" or "loss" of an annotation associated with a certain node in the phylogenetic tree of a PANTHER family. TreeGrafter will graft a protein into the phylogenetic tree, and the protein will be annotated with the annotations from any ancestral node with an annotation "gain" condition. A protein grafted to a node specified in a "loss" condition will not be annotated even if there are ancestral nodes with a "gain" condition. This new feature enables UniRule both to import evolutionary models (from conversion of existing rules) into the database backend and to display and modify them in the UniRule web interface.
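
The "gain" and "loss" conditions described above can be illustrated with a minimal sketch. It is not the UniRule or TreeGrafter implementation: the `Node` class, the node names and the annotation labels are hypothetical, and the sketch assumes that conditions are applied in order from the root to the graft node, so that a loss at a node overrides gains on that node's ancestors.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """One node of a family tree (hypothetical structure, not the UniRule schema)."""
    name: str
    parent: Optional["Node"] = None
    gains: Set[str] = field(default_factory=set)   # annotations gained at this node
    losses: Set[str] = field(default_factory=set)  # annotations lost at this node


def inherited_annotations(graft_node: Node) -> Set[str]:
    """Collect the annotations a protein grafted at `graft_node` would inherit."""
    # Build the path from the graft node up to the root.
    path: List[Node] = []
    node: Optional[Node] = graft_node
    while node is not None:
        path.append(node)
        node = node.parent

    # Replay the path root-first: add gains, then remove losses at each node.
    annotations: Set[str] = set()
    for step in reversed(path):
        annotations |= step.gains
        annotations -= step.losses
    return annotations


# Toy tree: ANN_A is gained at the root and lost again on one branch.
root = Node("AN0", gains={"ANN_A"})
kept = Node("AN1", parent=root, gains={"ANN_B"})
lost = Node("AN2", parent=root, losses={"ANN_A"})
below_lost = Node("AN3", parent=lost)

kept_annotations = inherited_annotations(kept)
lost_annotations = inherited_annotations(below_lost)
```

Under these assumptions a protein grafted at `AN1` inherits both labels, while a protein grafted at or below `AN2` inherits nothing, because the loss removes the annotation gained at the root.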
Exploitation Route Knowing how proteins function is key to understanding biological systems in health and disease. Upon completion, the project is expected to have a large impact on biological research by dramatically increasing the number and specificity of functional annotations in the UniProt resource. The work will also result in improvements to the InterProScan software, which is also widely used outside of the UniProt resource for protein and genome annotation. This will increase our knowledge of protein function in many newly sequenced species and provide a foundation for biomedical and biotechnological research.
Sectors Agriculture, Food and Drink; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology

 
Description Implementing an explicit phylogenetic framework for large-scale protein sequence annotation 
Organisation University of Southern California
Department Department of Preventive Medicine
Country United States 
Sector Academic/University 
PI Contribution Our team are experts in the development of protein resources (UniProt) and computational methods for protein function prediction. We have developed computational systems for annotation of proteins at large scale based on protein family classification. We contribute knowledge of production systems and method development.
Collaborator Contribution The USC team are experts in evolutionary modelling and phylogenetic approaches to family classification and orthology inference. This is important in protein function prediction and the team provides expertise in this area.
Impact - Convert existing UniRule annotation rules into explicit evolutionary models - Develop software for ongoing curation of new evolutionary models of additional annotation types and protein families.
Start Year 2019
 
Description Automated Annotation in UniProt 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Webinar presenting the automatic annotation systems in UniProt
Year(s) Of Engagement Activity 2022
URL https://www.ebi.ac.uk/training/events/automated-annotation-uniprot/
 
Description Automatic Annotation Systems in UniProt 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Webinar presenting automatic annotation systems
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/training-beta/events/automated-annotation-uniprot/
 
Description EMBL-EBI webinar "Automatic annotation systems in UniProt: UniRule and ARBA" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This webinar was for scientists and bioinformaticians with an interest in functional annotation of protein sequences. We described the two main automated annotation systems currently in use. First, UniRule, which is an established UniProt system in which curators manually develop rules for annotation. Second ARBA (Association-Rule-Based Annotator), which has recently been introduced as a significant improvement in fully automated functional annotation. ARBA is a multiclass learning system which uses rule mining techniques to generate concise annotation models. ARBA employs a data exclusion set that censors data not suitable for computational annotation, and generates human-readable rules for each UniProt release. We also briefly touched on the mechanism UniProt has set up to enable researchers to run these automated annotation systems on their own protein datasets.
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/training/events/automated-annotation-uniprot/
 
Description Large scale, classification and rule-based functional annotation of proteins 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Presentation at a workshop on the Future of Protein Function, a topic that is key to understanding health and disease mechanisms.
Year(s) Of Engagement Activity 2022