14 NSFBIO:Bilateral BBSRC-NSF/BIO Collaborative Research: ABI Development: A Critical Assessment of Protein Function Annotation

Lead Research Organisation: European Bioinformatics Institute

Department Name: Sequence Database Group

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

The accurate annotation of protein function is key to understanding life. However, with its inherent difficulty and expense, experimental characterization of function cannot scale up to accommodate the vast amount of sequence data already available. Therefore, the computational annotation of protein function is of primary importance. It is now possible to collect data that comprehensively profile many different states of complex biological systems. Using these data it should be possible to understand and explain the underlying systems, but significant challenges remain. One of the primary challenges is that, as researchers collect more data from many different organisms in many different systems, they discover more and different genes. Assigning functions to these newly discovered genes represents a key step towards interpretation of high-throughput data. The mission of the Automated Function Prediction Special Interest Group (AFP-SIG), founded in 2005, is to bring together bioinformaticians and biologists who are addressing this key challenge of gene function prediction. AFP-SIG has created CAFA: the Critical Assessment of (protein) Function Annotation. CAFA is a community-driven challenge to assess the performance of protein function prediction software, and it has been carried out twice since 2010. The investigators will provide the following outcomes: (1) robust open-source software to be used in function prediction and assessment of function prediction methods, incorporated into the high-profile annotation pipelines of UniProt-GOA; (2) expansion of the AFP community by engaging bioinformaticians, biocurators and experimentalists, thereby improving the quality and relevance of function prediction methods; (3) large-scale experimental screens in Drosophila, Candida and Pseudomonas for novel associations of targeted functional terms with genes; (4) a CAFA event, incorporating both the curated annotations from the literature and our experimental screens.

Planned Impact

N/A

Funded Value:

£349,848

Funded Period:

Sep 15 - Aug 18

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/N004876/1

Principal Investigator:

Maria J. Martin

Claire O'Donovan

Research Subject:

Tools, technologies & methods (100%)

Research Topic:

Bioinformatics (100%)

Organisations

People	ORCID iD
Maria J. Martin (Principal Investigator)	http://orcid.org/0000-0001-5454-2815
Claire O'Donovan (Principal Investigator)
Michele Magrane (Co-Investigator)

Publications


Boudellioua I (2016) Prediction of Metabolic Pathway Involvement in Prokaryotic UniProtKB Data by Association Rule Mining. in PloS one

Jiang Y (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. in Genome biology

Rifaioglu AS (2018) Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. in Proteins

Saidi R (2017) Rule Mining Techniques to Predict Prokaryotic Metabolic Pathways. in Methods in molecular biology (Clifton, N.J.)

Zhou N (2019) The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. in Genome biology

Key Findings
Impact Summary
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Description	Sequence and structure genomics have generated a wealth of data but extracting meaningful information from genomic data is becoming increasingly difficult. Proteins are the primary effectors of the instructions encoded in genomes, and they ultimately shape cells, tissues, organs and bodies in response to the environment. However, understanding proteins and how they function remains one of the most challenging problems in biology. UniProt is the world's most important catalogue of protein information, facilitating scientific discovery by helping scientists to understand protein function. As the number of proteins in UniProt continues to grow rapidly as a result of genome sequencing, providing functional annotation for these proteins presents a significant challenge that requires computational support. This award facilitated the work of The Critical Assessment of Functional Annotation (CAFA) community by bringing together researchers who focus on understanding and accurately predicting function of gene products. The CAFA challenge has been held three times, demonstrating the performance improvement of the participating methods over time. In this project, UniProt provided the reference data set necessary to improve and assess predictions by the CAFA community. UniProt biocurators targeted curation of classes of proteins that have been underrepresented in biocuration efforts, in particular, moonlighting and intrinsically disordered proteins. Biocurators at the EMBL-EBI created a reference set of 3400 Gene Ontology annotations (GOA) to 557 distinct proteins that had been identified as either disordered or having moonlighting behaviour. These annotations were created in the GOA database using Protein2GO, the EMBL-EBI's GO annotation tool, which was enhanced to allow the easy tagging and identification of disordered and moonlighting proteins. 
The annotations that were created as part of the CAFA project are included in the files that are published on the GOA and GO Consortium FTP sites, and are also visible in QuickGO, the EMBL-EBI's GO annotation browser. This work allowed UniProt to identify computational methods that performed well in the CAFA experiment and could be integrated into the UniProt and GOA annotation pipeline. We used the CATH and FunFams computational predictors from Christine Orengo's group and identified new annotations as well as potentially incorrect annotations in UniProt records. The evaluation was facilitated by PredComp (Prediction Comparator), a tool developed in this project that compares annotations currently in the UniProt database with those from external prediction programs, and the results were verified manually by biocurators. There is ongoing work together with UniProt to assess the utility of community predictors and the potential development of interfaces allowing their integration in this resource. This award provided the platform for the EMBL-EBI team's participation in the CAFA challenge. Two computational methods for protein function prediction developed by the EMBL-EBI team, Domain Architecture Alignment and Classification (DAAC) and the ARBA multiclass predictor, participated in the CAFA3 Challenge for benchmarking computational methods in this field. ARBA (Association-Rule-Based Annotator) is a multiclass predictor that exhaustively finds the most representative models defining significant relationships between protein attributes and protein functions. The system was evaluated on UniProtKB data, where it achieved very promising results. Preliminary benchmarking suggests that ARBA could significantly surpass UniProt's current annotation systems in terms of coverage and learning ability. DAAC (Domain Architecture Alignment and Classification) is a protein function predictor that uses the Gene Ontology (GO) and Enzyme Commission (EC) number systems.
DAAC detects remote homologies between protein sequences by aligning their domain architectures (i.e. the specific arrangement of structural domains on the sequence) and calculating a pairwise similarity. DAAC works especially well in detecting remote homologies between multi-domain proteins, which are usually missed by conventional sequence-based approaches. This award allowed the evaluation of ARBA and DAAC within the CAFA challenge. Both performed well in 'mode 2', where methods are evaluated only on the proteins that CAFA3 provided for the benchmarking. ARBA was amongst the ten best methods on ten occasions and was ranked first in five of them. DAAC was the 'top performer' in 6 different categories and it ranked second and third in the detailed evaluation (composed of 60-65 categories). According to the official CAFA3 paper, published in Genome Biology in 2019, DAAC ranked number 1 in the official GO molecular function prediction category (mode 2).
Exploitation Route	Gene Ontology annotations are publicly available to users of the UniProt and Gene Ontology resources. These proteins are now annotated with functional information contributing to our knowledge of biology. Benchmarking computational methods for protein function prediction is critical to the credibility of these approaches. As these two methods (DAAC and ARBA) were shown to perform well in the CAFA3 Challenge, we will be using them in the computational annotation of proteins in the UniProt resource. UniProt is a world-leading resource on protein function, delivering services to the scientific community.
Sectors	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Education,Healthcare
URL	https://www.ebi.ac.uk/QuickGO/annotations?assignedBy=CAFA
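
The domain-architecture comparison attributed to DAAC in the Description above can be illustrated with a small sketch. This is not DAAC's actual algorithm or scoring scheme, which the record does not specify. It assumes each protein is represented as an ordered list of domain identifiers, aligns the two lists by longest common subsequence, and divides the aligned length by the longer architecture. The function name and the domain labels (DOM_A, DOM_B, ...) are hypothetical.

```python
from typing import Sequence


def architecture_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity of two ordered domain architectures.

    Assumption for illustration only: similarity is the length of the
    longest common subsequence of domains divided by the length of the
    longer architecture. DAAC's real scoring is not described in the source.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = LCS length of a[:i] and b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                score[i][j] = score[i - 1][j - 1] + 1
            else:
                score[i][j] = max(score[i - 1][j], score[i][j - 1])
    longest = max(len(a), len(b))
    # Two empty architectures are treated as identical.
    return score[-1][-1] / longest if longest else 1.0


# Hypothetical architectures: the second protein lacks the middle domain.
protein_1 = ["DOM_A", "DOM_B", "DOM_C"]
protein_2 = ["DOM_A", "DOM_C"]
similarity = architecture_similarity(protein_1, protein_2)
```

Because the comparison works on domain order rather than residues, two proteins with little sequence identity can still score highly if they share the same arrangement of domains, which is the property the Description highlights for multi-domain proteins.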


Description	The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. The challenge requires a well-annotated data set against which computational methods are evaluated. This award was very important in making the challenge possible, providing (i) a data set for the evaluation and (ii) a platform for researchers to efficiently benchmark their methods. The CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens. The evaluated methodologies provide the means to add functional knowledge to large experimental datasets. The benefits extend beyond academia: industry, including the pharmaceutical sector, benefits from deeper biological knowledge and from advances in research.
First Year Of Impact	2018
Sector	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology
Impact Types	Societal,Policy & public services


Title	CAFA Function Special Interest group platform
Description	The CAFA challenge website provides a platform for researchers to benchmark their methods on protein function prediction
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes
Impact	The CAFA platform allows researchers to benchmark their methods in protein function prediction. The CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
URL	https://www.biofunctionprediction.org/cafa/


Title	Experimental validation dataset
Description	We curated the experimental data set of GO annotations used for the validation and benchmarking of the computational approaches evaluated as part of the CAFA challenge. The data were kept hidden until the predictions had been submitted to the competition. They are now available to the wider scientific community through the Gene Ontology Annotation Project, the UniProt Knowledgebase and the Gene Ontology Consortium.
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes
Impact	This data set was essential for the success of the project, because a manually curated benchmark set is required to validate the computational models. In addition, this set of annotations for moonlighting proteins and disordered proteins is of value to the wider bioscience community, as this work was new in these very exciting areas of protein research.
URL	http://www.ebi.ac.uk/QuickGO/GAnnotation?source=CAFA


Description	CAFA: Critical Assessment of Protein Function Annotation
Organisation	Indiana University
Department	Computer Science
Country	United States
Sector	Academic/University
PI Contribution	We provided the expertise in Gene Ontology annotation and a data set for the benchmarking of methods from participating groups. We also contributed to the CAFA evaluation and participated in meetings and conferences. The data set is publicly available for the benefit of the scientific community.
Collaborator Contribution	Partners contributed with expertise in Protein Function assessment and provided a platform for benchmarking and evaluation of our newly developed methods. They also provided the framework for leveraging methodologies for use in databases like UniProt.
Impact	Data sets for CAFA benchmarking - We focused on annotations of classes of proteins that have been underrepresented in biocuration efforts. To address underrepresented proteins, we focused on coverage of moonlighting proteins and intrinsically disordered proteins. We developed a tool to evaluate and compare the resulting annotations with those in UniProt. This tool has been investigated as a way to flag potentially incorrect annotations, provide confidence scores for existing electronic annotations, help biocurators identify novel experimental annotations from the literature, and provide another layer of computational function annotation for UniProt's users. For the CAFA3 project, we continued targeted curation of classes of proteins that have been underrepresented in biocuration efforts, in particular, moonlighting and intrinsically disordered proteins. GOA curators at the EMBL-EBI created a reference set of 3400 GO annotations to 557 distinct proteins that had been identified as either disordered or having moonlighting behavior. These annotations were created in the GOA database using Protein2GO, the EMBL-EBI's GO annotation tool, which was enhanced to allow the easy tagging and identification of disordered and moonlighting proteins. The annotations that were created as part of the CAFA project are included in the files that are published on the GOA and GO Consortium FTP sites, and are also visible in QuickGO, the EMBL-EBI's GO and GO annotation browser. We identified computational methods that performed well in the CAFA3 experiment and could be integrated into the UniProt-GOA annotation pipeline. We used the CATH and FunFams computational predictors from Christine Orengo's group and identified new annotations as well as potentially incorrect annotations in UniProt records.
The evaluation was facilitated by the PredComp (Prediction Comparator - a tool that compares annotation currently in the UniProt database with those from external prediction programs) developed in this project and verified manually by GOA curators. There is ongoing work together with UniProt to assess the utility of community predictors and the potential development of interfaces allowing their integration in this resource.
Start Year	2015
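
The kind of comparison that PredComp is described as performing in the Impact field above (existing UniProt annotations against those from an external predictor) can be sketched with plain set operations. This is a minimal illustration, not PredComp's implementation, which the record does not describe. The function name, the bucket names and the example term sets are assumptions, and the GO identifiers serve only as placeholders.

```python
from typing import Dict, Iterable, Set


def compare_annotations(existing: Iterable[str],
                        predicted: Iterable[str]) -> Dict[str, Set[str]]:
    """Split GO terms for one protein into three review buckets.

    confirmed      - already in the database and also predicted
    candidate_new  - predicted but absent from the database
    unsupported    - in the database but not predicted (worth a second look)
    """
    existing_set, predicted_set = set(existing), set(predicted)
    return {
        "confirmed": existing_set & predicted_set,
        "candidate_new": predicted_set - existing_set,
        "unsupported": existing_set - predicted_set,
    }


# Hypothetical term sets for a single protein.
report = compare_annotations(
    existing={"GO:0005524", "GO:0016301"},
    predicted={"GO:0005524", "GO:0003677"},
)
```

A real comparison would also need to account for the ontology structure, since a predicted term may be an ancestor or descendant of an existing one rather than an exact match. In either case the buckets are only candidates for the manual verification by curators that the record describes.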


Description	CAFA: Critical Assessment of Protein Function Annotation
Organisation	Iowa State University
Country	United States
Sector	Academic/University
Start Year	2015


Description	CAFA: Critical Assessment of Protein Function Annotation
Organisation	University of Pennsylvania
Department	Department of Systems Pharmacology and Translational Therapeutics
Country	United States
Sector	Academic/University
Start Year	2015


Description	CAFA: Critical Assessment of Protein Function Annotation
Organisation	University of Washington
Department	Department of Biomedical Informatics and Medical Education
Country	United States
Sector	Academic/University
Start Year	2015


Title	The CAFA website
Description	The CAFA challenge needed a website for accessing data sets and methods for protein prediction evaluation.
Type Of Technology	Webtool/Application
Year Produced	2017
Open Source License?	Yes
Impact	The website allowed researchers to access the platform for benchmarking their protein function prediction methods
URL	https://www.biofunctionprediction.org/cafa/


Description	A New Entropy for Measuring Annotation Consistency with Regards to Protein Signature
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This talk was a COSI track presentation in the session "Function: Gene and Protein Function Annotation" during the ISMB 2018 conference in Chicago, US.
Year(s) Of Engagement Activity	2018
URL	https://www.iscb.org/ismb2018


Description	CAFA town hall workshop
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	The CAFA protein prediction participants discussed methods and quality control for a fair assessment. We discussed with the CAFA board the definition of a new branch of the competition for database production purposes (in addition to research). We will define metrics to quantify production requirements for protein functional annotation. The best systems would be shortlisted so that their predictions could be accommodated by the UniProt resource (on the FTP site as a first step).
Year(s) Of Engagement Activity	2020


Description	Flash talk and poster titled ARBA: Association-Rule-Based Annotator
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	We carried out an evaluation study by performing a two-round five-fold cross-validation process on UniProtKB/Swiss-Prot prokaryotic entries. The cross-validation reveals how the system performs globally and for each individual pathway, which allows us to establish the list of functional annotations that the system can predict efficiently. Results suggested that specific combinations of protein domains (recorded in our rules) strongly determine the pathways in which proteins are involved, and thus provide information that lets us assign pathway membership to prokaryotic proteins very accurately (with a global F1-measure of 98.2% and a global AUC of 98.7%). Applying ARBA to all UniProtKB/Swiss-Prot prokaryotic entries produced a set of 568,006 rules, from which ARBA then selected only 1,347 (a reduction of more than 99%). The latter were aggregated by pathway annotation to form 356 models, one for each pathway annotation.
Year(s) Of Engagement Activity	2016
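
The rule selection step described in Results and Impact above can be illustrated with a generic association-rule sketch. It is not ARBA's implementation: the support and confidence thresholds, the restriction to one- and two-domain antecedents, the function name and the toy data (D1, D2, pathway_X, ...) are all assumptions made for illustration. The final lines reproduce the reduction rate from the two rule counts quoted in the record.

```python
from collections import Counter
from itertools import combinations


def mine_rules(proteins, min_support=2, min_confidence=1.0):
    """Return (domains, pathway, support, confidence) rules from toy data.

    proteins: list of (set_of_domains, pathway) pairs.
    A rule "domains -> pathway" is kept when it is seen at least
    min_support times and holds in at least min_confidence of the
    proteins that carry those domains.
    """
    antecedent_counts = Counter()
    joint_counts = Counter()
    for domains, pathway in proteins:
        for size in (1, 2):  # assumption: antecedents of one or two domains
            for combo in combinations(sorted(domains), size):
                antecedent_counts[combo] += 1
                joint_counts[(combo, pathway)] += 1
    rules = []
    for (combo, pathway), support in joint_counts.items():
        confidence = support / antecedent_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, pathway, support, confidence))
    return sorted(rules)


# Hypothetical training data: domain sets with a known pathway.
toy_proteins = [
    ({"D1", "D2"}, "pathway_X"),
    ({"D1", "D2"}, "pathway_X"),
    ({"D1", "D3"}, "pathway_Y"),
    ({"D3"}, "pathway_Y"),
]
rules = mine_rules(toy_proteins)

# Reduction rate implied by the figures quoted in the record.
reduction = 1 - 1347 / 568006
```

In the toy data the rule "D1 -> pathway_X" is discarded because D1 also occurs in a pathway_Y protein, which mirrors on a small scale how filtering shrinks a large candidate rule set to a compact one.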


Description	Function SIG meeting ISMB 2019
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	ISMB is one of the largest international conferences on computational biology, gathering over 1000 attendees. One of its tracks is the Function SIG, which gathers experts in function prediction methods and is organised by, and includes talks from, participants in the CAFA challenge. Eighteen researchers presented their methodologies and outputs, and one of the organizers presented an update on the Critical Assessment of Function Annotation challenge. The meeting led to interaction and collaboration between the participants. Our group presented three posters, "How does my drug target function in health and disease?", "UniRule: a semi-automatic pipeline for functional annotation" and "Large-Scale Benchmarking of Protein Descriptors for Protein Ligand Prediction in Target-Based Modelling and Proteochemometrics", which generated discussions between researchers.
Year(s) Of Engagement Activity	2019
URL	https://www.iscb.org/cms_addon/conferences/ismbeccb2019/posters.php?track=Function%20COSI&session=A


Description	Hackathon on Pathway effect prediction for protein targets
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This was a hackathon project that was organised as part of the Biohackathon 2018 in Paris. ELIXIR France with the support of the ELIXIR Hub and the ELIXIR interoperability platform and in collaboration with COST CHARME, the National Bioscience Database Center (NBDC) and the Database Center for Life Science (DBCLS), organised this with the plan that it would complement and work in collaboration with the BioHackathon planned in Japan at the end of 2018. The topics were aligned to challenges proposed by ELIXIR platforms (data, tools, compute, interoperability and training), ELIXIR communities (Human Data, Rare Diseases, Marine Metagenomics, Plant Science, Metabolomics and Proteomics), a selection of new tools and communities (e.g., Cytoscape and reproducible networks) proposed by ELIXIR France, and a set of common challenges proposed by the sister BioHackathon organised in Japan. This was one of 29 hacking projects that were preselected; it was part of the interoperability platform of ELIXIR.
Year(s) Of Engagement Activity	2018
URL	https://bh2018paris.info/index.html


Description	Measuring Enzymatic Activity Consistency with Regards to Protein Signatures
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This talk was presented at the HIBIT 2018 conference in Turkey.
Year(s) Of Engagement Activity	2018


Description	Poster titled A Self-training Approach for Functional Annotation of UniProtKB Proteins
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	To deal with the small amount of labelled data, ARBA was self-trained on UniProtKB/Swiss-Prot data. Proteins that contain pathway annotations constitute the labelled dataset, while those that do not contain any pathway annotation constitute the unlabelled dataset. The system performs self-training in two main steps. In the first step, annotations are propagated from the labelled to the unlabelled set based on their similarity. The similarity criterion is defined as a Boolean that is true if the two proteins have the same attributes, e.g. signatures. In the second step, ARBA iteratively learns from these data, retrains itself and adds to the labelled instances until a desired performance level is reached. The output of self-training is the final learning dataset from which the annotation model was built. This model was validated in two 2-fold cross-validation runs. The results, averaged over the two runs, are shown in Table 1. Finally, we applied this model to predict metabolic pathways in UniProtKB/TrEMBL bacterial data, where coverage is poor (currently 3.5%).
Year(s) Of Engagement Activity	2017
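
The two self-training steps described in this entry can be illustrated with a minimal Python sketch. It is not ARBA's implementation: the attribute-voting learner, the confidence threshold, the round limit and all function names are assumptions made for illustration, and the loop stops when no confident prediction remains rather than at a measured performance level.

```python
from collections import Counter, defaultdict


def propagate(labelled, unlabelled):
    """Step 1: copy an annotation to every unlabelled protein whose attribute
    set is identical to that of a labelled protein (Boolean similarity)."""
    by_attrs = {}
    for attrs, pathway in labelled:
        by_attrs.setdefault(frozenset(attrs), pathway)
    propagated, remaining = [], []
    for attrs in unlabelled:
        key = frozenset(attrs)
        if key in by_attrs:
            propagated.append((attrs, by_attrs[key]))
        else:
            remaining.append(attrs)
    return propagated, remaining


def train(labelled):
    """Hypothetical learner: count, per attribute, how often each pathway occurs."""
    votes = defaultdict(Counter)
    for attrs, pathway in labelled:
        for attribute in attrs:
            votes[attribute][pathway] += 1
    return votes


def predict(model, attrs):
    """Return (pathway, confidence), where confidence is the winning vote share."""
    tally = Counter()
    for attribute in attrs:
        tally.update(model.get(attribute, {}))
    if not tally:
        return None, 0.0
    pathway, count = tally.most_common(1)[0]
    return pathway, count / sum(tally.values())


def self_train(labelled, unlabelled, threshold=0.8, max_rounds=10):
    """Step 2: retrain repeatedly, moving confident predictions into the
    labelled set (threshold and max_rounds are assumed stopping controls)."""
    propagated, pool = propagate(labelled, unlabelled)
    labelled = list(labelled) + propagated
    for _ in range(max_rounds):
        model = train(labelled)
        confident, rest = [], []
        for attrs in pool:
            pathway, confidence = predict(model, attrs)
            if pathway is not None and confidence >= threshold:
                confident.append((attrs, pathway))
            else:
                rest.append(attrs)
        if not confident:
            break
        labelled += confident
        pool = rest
    return train(labelled), labelled
```

Calling `self_train` with a list of `(attribute_set, pathway)` pairs and a list of unannotated attribute sets returns the final model together with the enlarged learning dataset.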


Description	Poster titled Manual Annotation of a subset of the CAFA3 target set
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	This poster presented the Critical Assessment of protein Function Annotation algorithms (CAFA) to a diverse audience at the GO consortium meeting in the USA. The CAFA challenge is a large-scale assessment whose purpose is to evaluate new computational methods that are capable of predicting Human Phenotype Ontology and Gene Ontology (GO) terms for proteins, based on their sequence or structure. For the latest CAFA challenge (CAFA3), the UniProt Gene Ontology Annotation (GOA) team has contributed and curated two target sets: the first consists of intrinsically disordered proteins, and the second consists of moonlighting proteins. Participants in CAFA3 predicted the biological processes, molecular functions and cellular components of the target sets using GO. To create our target sets, we used data from the DisProt database for intrinsically disordered proteins (http://www.disprot.org/) and the MoonProt database of moonlighting proteins (http://www.moonlightingproteins.org/) to generate a potential list of proteins for CAFA. We identified 627 proteins as potentially being intrinsically disordered and 306 proteins as potential moonlighting proteins. For each candidate protein we found appropriate literature to provide experimental evidence of whether it was intrinsically disordered or had a secondary function. In total, we found more than 1100 papers associated with both target sets that needed to be read, evaluated, and, if suitable, curated using the GO. After evaluation, the intrinsically disordered dataset comprised 472 proteins and the moonlighting dataset comprised 156 proteins. In total, 766 papers were curated, which resulted in the creation of 6981 new GO annotations to be used as a benchmark for evaluating the predictions submitted by CAFA3 contestants. These annotations for the target sets are publicly accessible through the QuickGO website (http://www.ebi.ac.uk/QuickGO/).
Year(s) Of Engagement Activity	2017


Description	Poster titled Manual Annotation of a subset of the CAFA3 target set
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	This poster presented the Critical Assessment of protein Function Annotation algorithms (CAFA) to a diverse audience at the ISMB/ECCB meeting in Prague. The CAFA challenge is a large-scale assessment whose purpose is to evaluate new computational methods that are capable of predicting Human Phenotype Ontology and Gene Ontology (GO) terms for proteins, based on their sequence or structure. For the latest CAFA challenge (CAFA3), the UniProt Gene Ontology Annotation (GOA) team has contributed and curated two target sets: the first consists of intrinsically disordered proteins, and the second consists of moonlighting proteins. Participants in CAFA3 predicted the biological processes, molecular functions and cellular components of the target sets using GO. To create our target sets, we used data from the DisProt database for intrinsically disordered proteins (http://www.disprot.org/) and the MoonProt database of moonlighting proteins (http://www.moonlightingproteins.org/) to generate a potential list of proteins for CAFA. We identified 627 proteins as potentially being intrinsically disordered and 306 proteins as potential moonlighting proteins. For each candidate protein we found appropriate literature to provide experimental evidence of whether it was intrinsically disordered or had a secondary function. In total, we found more than 1100 papers associated with both target sets that needed to be read, evaluated, and, if suitable, curated using the GO. After evaluation, the intrinsically disordered dataset comprised 472 proteins and the moonlighting dataset comprised 156 proteins. In total, 766 papers were curated, which resulted in the creation of 6981 new GO annotations to be used as a benchmark for evaluating the predictions submitted by CAFA3 contestants. These annotations for the target sets are publicly accessible through the QuickGO website (http://www.ebi.ac.uk/QuickGO/).
Year(s) Of Engagement Activity	2017


Description	Predicting Protein Function with Deep Learning and multi-source data
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	A talk presenting research work, given in the Function program of the ISMB 2021 conference.
Year(s) Of Engagement Activity	2021
URL	https://www.youtube.com/watch?v=nEV5qJMmaJA


Description	Presentation titled "Update on Protein Functional Annotation in UniProt in 2020" during ISMB 2020
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Presentation during the 28th ISMB conference held virtually during 2020.
Year(s) Of Engagement Activity	2020
URL	https://www.youtube.com/watch?v=-MqwczUHMG8&ab_channel=ISCB


Description	Talk and Poster titled A Self-training Approach for Functional Annotation of UniProtKB Proteins
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Automatic annotation systems are essential to reduce the gap between the amount of protein sequence data and functional information in public databases such as UniProtKB. These systems rely on manually annotated (also called labelled) data to learn rules for predicting annotations. Manually labelled data are, however, often scarce or time-consuming to obtain as they have to be reviewed by expert human curators. On the other hand, unlabelled data are abundant and comparatively easy to gather. In this work, we present a self-training automatic annotation approach that utilises unlabelled data in order to improve the accuracy of predictions. We evaluated our system on a set of entries in UniProtKB/Swiss-Prot. The results show improvement in different performance metrics when self-training is used. The generated model was then used to predict metabolic pathway involvement of UniProtKB/TrEMBL proteins. It covered 86% of the proteins currently annotated by UniProt pipelines and could also annotate 6.7 million proteins that lacked any previous pathway annotations.
Year(s) Of Engagement Activity	2017


Description	Talk and Poster titled Rule Mining and Selection for Protein Functional Annotation
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	In the scope of this work, we suggest that association rule mining and selection techniques can be used effectively as computational methods for functional prediction. We introduce our automatic annotation system, ARBA (Association-Rule-Based Annotator), which can be used to enhance the quality of automatically generated annotations as well as to annotate proteins with unknown functions. ARBA learns on data from UniProtKB/Swiss-Prot and uses InterPro signatures and organism taxonomy as attributes to predict most of the protein functional annotations including Gene Ontology terms, metabolic pathways, EC numbers, etc. With respect to certain quality measures, we find all rules that define significant relationships between attributes and pathway annotations in UniProtKB/Swiss-Prot entries. The set of extracted rules represents the comprehensive knowledge that could explain protein pathway involvement. However, these rules comprise redundant information and their high number makes it infeasible to apply them to large sets of data such as UniProtKB/TrEMBL. To address this issue, ARBA puts these rules into a fast competition process called SkyRule based on two concepts, namely dominance and comparability. The rules are thereby considerably reduced in number and aggregated to form concise prediction models that assign functional annotations to UniProtKB entries. To give a picture of the efficiency of ARBA in this paper, we briefly report its performance in the case of prediction of metabolic pathway involvement for prokaryotes.
Year(s) Of Engagement Activity	2016
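
The idea of reducing a rule set through dominance and comparability can be illustrated with a generic skyline-style filter in Python. This is only a sketch and not the SkyRule procedure itself: the entry does not define either concept, so the definitions below (rules are comparable when they predict the same annotation; a rule dominates another when it is at least as good on support and confidence and strictly better on one) and the `Rule` fields are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical association rule: attributes -> annotation, with quality measures."""
    attributes: frozenset
    annotation: str
    support: float
    confidence: float


def comparable(a, b):
    """Assumed definition: only rules predicting the same annotation compete."""
    return a.annotation == b.annotation


def dominates(a, b):
    """Assumed definition: a is no worse than b on every measure and better on one."""
    no_worse = a.support >= b.support and a.confidence >= b.confidence
    better = a.support > b.support or a.confidence > b.confidence
    return comparable(a, b) and no_worse and better


def select_rules(rules):
    """Keep only the rules that no other rule dominates (the skyline)."""
    return [r for r in rules if not any(dominates(other, r) for other in rules)]
```

A pairwise filter like this is quadratic in the number of rules, so it only illustrates the selection criterion and says nothing about how the fast competition process mentioned above achieves its speed.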


Description	Talk and training course entitled "Machine learning for protein function prediction"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	This talk and training course were presented as part of the EMBL predoc bioinformatics course to students enrolled in the EMBL International PhD Programme.
Year(s) Of Engagement Activity	2018
URL	https://www.ebi.ac.uk/research/eipp


Description	Talk entitled "Predicting protein functions with deep learning and multi-source data" during ISMB 2020
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk presented during the 28th ISMB conference (ISMB 2020), held virtually.
Year(s) Of Engagement Activity	2020
URL	https://www.youtube.com/watch?v=nEV5qJMmaJA&ab_channel=ISCB


Description	Training talk and Hands-on "Protein Function Prediction with Machine Learning and Interactive Analytics" at the ECCB 2018
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	This was a training talk followed by a hands-on session on Protein Function Prediction with Machine Learning and Interactive Analytics at ECCB 2018.
Year(s) Of Engagement Activity	2018
URL	http://eccb18.org/tutorial-10/


Description	UniProt and Gene Ontology: the need for functional annotation across the span of taxonomic biodiversity
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	A talk in the Function program of ISMB, an international conference that gathers over 1000 researchers in computational biology. The talk presented data resources relevant to this audience.
Year(s) Of Engagement Activity	2021