Expanding Genome3D and disseminating the structural annotations via InterPro and PDBe

Lead Research Organisation: European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

1. Improve SCOP/CATH mapping to increase structural data integrated in InterPro

We have already developed a protocol which identifies domain residue ranges for a given PDB classified in CATH or SCOP. The overlap between ranges is calculated to determine whether domains are equivalent. Two superfamilies are judged equivalent depending on the percentage of equivalent domains. Recent work by the PDBe in a joint project with CATH - ending November 2015 - has explored more sophisticated approaches. These examine the multi-domain contexts of the domains being compared and identify blocks of equivalent multi-domain architectures between two superfamilies. This will be further developed to increase the number of equivalent superfamilies. The SCOP/CATH mapping will be exploited in new protocols for integrating predicted structural data into InterPro.
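The overlap-based protocol described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the representation of a domain as a list of inclusive residue ranges, and the 0.8 overlap cut-off are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Segment = Tuple[int, int]   # inclusive residue range (start, end)
Domain = List[Segment]      # a domain may be made of several segments


def residues(domain: Domain) -> Set[int]:
    """Expand a domain's residue ranges into a set of residue numbers."""
    out: Set[int] = set()
    for start, end in domain:
        out.update(range(start, end + 1))
    return out


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Shared residues divided by the size of the larger domain."""
    ra, rb = residues(a), residues(b)
    if not ra or not rb:
        return 0.0
    return len(ra & rb) / max(len(ra), len(rb))


def equivalent_domain_percentage(
    cath: Dict[str, List[Domain]],
    scop: Dict[str, List[Domain]],
    min_overlap: float = 0.8,  # hypothetical cut-off, not taken from the source
) -> float:
    """Percentage of CATH domains with an equivalent SCOP domain on the same chain.

    Both inputs map a PDB chain identifier to the domains that one
    superfamily contributes to that chain.
    """
    total = 0
    matched = 0
    for chain, cath_domains in cath.items():
        for cath_domain in cath_domains:
            total += 1
            if any(
                overlap_fraction(cath_domain, scop_domain) >= min_overlap
                for scop_domain in scop.get(chain, [])
            ):
                matched += 1
    return 100.0 * matched / total if total else 0.0


# Toy data: two CATH domains and two SCOP domains on one hypothetical chain.
cath_sf = {"1abcA": [[(1, 100)], [(101, 200)]]}
scop_sf = {"1abcA": [[(3, 98)], [(150, 260)]]}
percentage = equivalent_domain_percentage(cath_sf, scop_sf)
```

In this toy case only the first pair of domains overlaps strongly enough, so half of the CATH domains have an equivalent. Two superfamilies would then be judged equivalent when this percentage passes a second cut-off, which the text does not specify.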

2. Develop a 3D viewer to view sequence variations in a structural context

Displaying structural data in a way that works reliably across different web browsers is a challenge, especially when the display includes additional components (e.g. features to show conserved positions). Whilst 3D models can be viewed on the Genome3D website using the JSmol viewer, structures are not integrated with sequence data. There are excellent Java-based tools for analysing protein sequence and structure (e.g. JalView, Jmol); however, working with Java in modern web browsers is no longer viable due to security concerns. Additionally, since JSmol is a direct port of the large Java codebase of Jmol, it imposes significant limitations on future development, such as a large web footprint. However, alternatives are being developed that address these limitations. We will evaluate the available 3D molecular viewers, identifying robust candidates that conform to web standards (HTML5/WebGL). The selected viewer will be integrated with other JavaScript components to provide an intuitive, interactive and reusable structural feature viewer.

Planned Impact

The data provided by the project is essential for a wide range of biologists, and this proposal addresses key strategic areas for the BBSRC in Data Driven Bioscience: (1) improved accuracy of structural data used by structural and computational biologists to analyse protein evolution and predict protein structures and functions; (2) generation of consensus data that will aid the provision of structural annotations for millions of protein sequences in InterPro, and hence UniProt. Such annotations will be critical for understanding the impacts of genetic variations in these proteins, e.g. variations that could be causing disease in humans or animals or modifying the efficiencies of the proteins in different crop and animal strains.

Currently, InterPro contains less than two thirds of the structural annotations in Gene3D and SUPERFAMILY and none of the predictions from PHYRE, FUGUE, pDomTHREADER. By integrating data from all 5 of these Genome3D resources this project will significantly increase the amount of structural data available to biologists. Collaborations between PDBe, SCOP and CATH to map between SCOP and CATH and to develop a platform for assigning domain boundaries to new structures will be incredibly valuable for increasing the numbers of PDB structures classified. Currently <75% of structures in the PDB are classified in either SCOP or CATH and these collaborations will share the task of manual curation - the most time consuming aspect of the classification.

Dissemination through websites and workshops
As evidenced by the web statistics (CATH and SCOP > 10,0000, InterPro 135,000 and PDBe 45,000 unique users/month), data generated by all resources is widely used by biologists both in academia and in industry. Companies frequently use the resources to determine the structures and functions of query proteins. Recent analyses of web statistics by Genome3D groups showed that ~20% of accesses came from industry. Furthermore, the algorithms and data provided by FTP download are used by a number of pharmaceutical companies, including Pfizer India, Cubist and Lilley Pharmaceuticals.

In addition to providing information on equivalent superfamilies, the project will provide a range of other consensus data valuable for both academia and industry. For example, consensus data on domain boundary assignments will be highly valuable for structural biologists in pharmaceutical companies to guide the generation of domain constructs for structure determination.

We will publicise the SCOP/CATH mapping, consensus data and integration of Genome3D in InterPro by presenting in a Technology track of the annual ISMB conference, which typically has participants from industry. We will hold a Genome3D workshop at UCL in Dec 2018 to present the integration in InterPro and PDBe. Results will also be reported at an EBI workshop at which Orengo regularly presents and at a Bioinformatics course at UCL which is open both to academics and researchers from industry. We will aim to publish in the NAR Database issue in 2017 and 2019.

Interaction with the Public
UCL hosts visits by 6th form science students, at which the Orengo group gives presentations on domain structure classifications and the benefits of using protein structure to understand protein functions and the impacts of genetic variations. UCL is one of 6 Beacons for Public Engagement in the UK and has a dedicated Public Engagement Unit that will provide training. All the PIs have expertise in communicating scientific strategies and discoveries to the public.

Training received by the research project staff
Researchers in all the groups will be working closely. Researchers will receive hands-on training from other PDRAs in the Orengo, Finn and Velankar groups. All the institutes have excellent training schemes and career development courses and the PDRAs will be working in world class laboratories of internationally renowned scientists. They will have opportunities to present their work within the groups.

Publications

Description InterPro is an amalgamation of 15 different member databases, with each database contributing signatures that reflect its concept of a protein family. InterPro curators then group similar signatures into a single InterPro entry, and define relationships between entries where appropriate. Each InterPro entry is labelled with a 'type' (family, domain, repeat or site), reflecting the biological entity that the constituent signatures represent. As part of release 65.0, a new entry type called Homologous Superfamily was added to complement the existing types. A Homologous Superfamily is a group of proteins that share a common evolutionary origin, indicated by their structural similarities. These are the most divergent relationships in InterPro, and were previously merged into the Domain and Family types.

The Homologous Superfamily type was created to accommodate signatures from CATH-Gene3D and SUPERFAMILY. Both of these databases use a slightly different methodology to InterPro's other member databases, in that they often utilise a collection of underlying profile hidden Markov models (HMMs) to represent diverse structural families, rather than one single model. Creating the new entry type better encapsulates what CATH-Gene3D and SUPERFAMILY are trying to describe and allows us to better represent their data in InterPro.

Following the creation of this new entry type, we have increased the CATH-Gene3D and SUPERFAMILY coverage levels from 20% to 35% and from 73% to 79%, respectively. Through this work, we have used mathematical approaches (the Jaccard index and the Jaccard containment index) to link entries of the new type to other InterPro entries, ensuring that links are maintained between the structurally defined, yet diverse, entries and InterPro entries of other types.

In parallel with the creation of PDBe protein pages (PDBe-KB), and as part of this project, we have designed a pipeline to generate residue conservation profiles for protein sequences. The development was based on previous work on the 3Dpatch tool, developed with the Finn team. The residue conservation profiles are generated every week for new structures as part of the PDBe release process for cross-linking UniProt accessions to PDB identifiers. A similar mechanism has also been included in InterPro, where a conservation profile can be retrieved based on scores against Pfam profiles.

The Genome3D predictions have been incorporated at multiple levels in the InterPro website. Genome3D is composed of two protein domain classification resources: CATH and SCOP. These are supplemented by structure prediction and annotation methods based on one or both of the two classifications. The prediction methods provide access to models where a 3D structure for the protein may not be known. Genome3D predictions are displayed, where appropriate, at the level of a protein, a structure and an InterPro entry. The InterPro entry view gathers all the predictions for each sequence for which Genome3D has a prediction, and shows where the sequence has been modelled. The protein and structure views are similar: they provide models in the absence of a structure, or a decomposition of the structure into globular domains. This data is dynamically retrieved from Genome3D and cached during the lifetime of an InterPro release (two months).
Exploitation Route The creation of the Homologous Superfamily entry type has allowed us to increase the sequence and amino acid residue coverage provided by InterPro, and allows a better representation of structure-based signatures. In the future we plan to continue the integration of signatures from CATH-Gene3D and SUPERFAMILY to reach 70-90% integration.

It is known that residues showing lower substitution rates are critical for protein folding, hydrophobic core stabilization, intermolecular recognition, and enzymatic activity. The residue conservation profiles for protein sequences make it easy to find hot-spots of conservation that may be binding sites. These conserved sites could then be used as possible drug targets or help in the interpretation of deleterious mutations.

The inclusion of Genome3D models in InterPro adds a major new data type. It allows researchers to connect protein sequences, domain annotations and structural models, giving an integrated overview across these dimensions for understanding the link between sequence conservation and spatial organisation. This information is essential for the bioengineering of enzymes and/or the design of drugs.
Sectors Agriculture, Food and Drink; Chemicals; Education; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Title CATH-B integration in PDBe
Description CATH-B is a daily updated version of CATH that contains more accurate and up-to-date data than the yearly CATH release. Every week, the data from CATH-B are integrated into the PDBe SIFTS database.
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact CATH-B provides up-to-date protein classification. 
 
Title Jaccard index and containment for overlapping entries 
Description Following the creation of the Homologous superfamily entry type in InterPro, we have developed an automated approach for linking between homologous superfamilies and other entry types (such as families or domains) to allow the connection of structural data to other entries within InterPro that may lack a sequence representative with a known protein structure. This method uses the Jaccard index and containment to determine whether two entries are overlapping or not with a threshold of 0.75. 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact The overlapping entries are displayed on each InterPro entry under the "Overlapping entries" section for Homologous superfamily and under the "Overlapping Homologous superfamilies" for other entry types. They allow the connection of structural data to other entries within InterPro. 
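
The two measures named in this record can be illustrated with a short sketch. It assumes, purely for illustration, that each entry is represented by the set of protein accessions it matches and that two entries count as overlapping when either measure reaches the 0.75 threshold; the record gives the threshold but not the exact rule, and the function names and accessions below are hypothetical.

```python
from typing import Set


def jaccard_index(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: Set[str], b: Set[str]) -> float:
    """Fraction of the members of `a` that are also members of `b`."""
    return len(a & b) / len(a) if a else 0.0


def entries_overlap(a: Set[str], b: Set[str], threshold: float = 0.75) -> bool:
    """Assumed rule: overlapping if the Jaccard index or either containment
    value reaches the threshold (0.75 is the value quoted in the record)."""
    best = max(jaccard_index(a, b), containment(a, b), containment(b, a))
    return best >= threshold


# Hypothetical protein sets for a homologous superfamily and a family entry.
superfamily = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"}
family = {"P1", "P2", "P3", "P4", "P5", "P6"}

print(jaccard_index(superfamily, family))    # 6 shared of 8 in the union
print(containment(family, superfamily))      # the family lies fully inside
print(entries_overlap(superfamily, family))
```

Containment is useful alongside the Jaccard index because a small family can sit entirely inside a large, diverse superfamily and still have a low Jaccard score.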
 
Title LiteMol 3D Viewer 
Description The LiteMol 3D Viewer, developed in collaboration with CEITEC, is a new WebGL-based viewer with a low memory footprint. It is compatible with all major browsers without any additional plugins, and therefore with tablets and mobile devices. Based on the requested visualization, LiteMol automatically queries the Coordinate and Density servers to fetch relevant atomic coordinates or portions of electron density or electric potential maps, respectively. It accepts as input PDBx/mmCIF as well as the BinaryCIF format.
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact The viewer has the ability to generate interactive visualisations of 3D coordinate data with standard representations, as well as overlaid experimental data and annotations such as sequence or structure annotations and quality assessment information from wwPDB validation reports. 
URL https://webchemdev.ncbr.muni.cz/LiteMol/
 
Title InterPro 
Description InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. We combine protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool.
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact All of the annotations provided by InterPro underpin the automatic annotation pipeline within the UniProt database. Through the InterPro2GO pipeline, InterPro provides GO term annotations for tens of millions of sequences in UniProt. InterPro is the most widely used web service at EMBL-EBI, performing ~15,000,000 searches per month for users around the world.
URL http://www.ebi.ac.uk/interpro/
 
Title PDBe 
Description PDBe is the European resource for the collection, organisation and dissemination of data on biological macromolecular structures. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact This resource provides researchers with access to known protein structures. Notably, PDBe focuses on annotating these with additional biological features, such as the annotations that come from InterPro and CATH-b, as well as calculating residue conservation profiles. 
URL https://www.ebi.ac.uk/pdbe/
 
Title CATH/SCOP mapping code 
Description This code has been developed to perform the mapping between two structural classification databases: CATH and SCOP.
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The mapping between CATH and SCOP has allowed us to prioritise the integration of Gene3D and SUPERFAMILY entries in InterPro.
URL https://github.com/typhainepl/genome3D
 
Title InterProScan5 
Description Allows the user to scan a DNA or protein sequence against the collection of InterPro member databases and to assign InterPro annotations and associated GO terms.
Type Of Technology Software 
Open Source License? Yes  
Impact This software is widely downloaded and used (as assessed through citations, distributed annotations and helpdesk interactions). The tool is widely used in other analysis pipelines, such as genomics and metagenomics analyses. It is updated with every InterPro release (every two months) to include both data updates and software updates. The software updates are of two kinds: scientific developments required by changes in member database post-processing, and general software maintenance.
URL https://www.ebi.ac.uk/interpro/interproscan.html
 
Title Pipeline to generate residue conservation data for PDBe protein sequences 
Description This process was designed based on a previous project (3Dpatch). It generates residue conservation annotations for PDBe and UniProt protein sequences and saves them in CSV files.
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact The degree of evolutionary conservation of amino acid residues often indicates functional significance. The residues showing lower substitution rates are critical for protein folding, hydrophobic core stabilization, intermolecular recognition and enzymatic activity. The data generated by the pipeline could be of use in finding new drug targets.
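
The idea of a per-residue conservation profile can be illustrated with a generic sketch. This is not the pipeline's method, which builds on 3Dpatch and profile scores and is not detailed here; the example swaps in a simple Shannon-entropy score over the columns of a multiple sequence alignment, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_conservation(column: str) -> float:
    """Score one alignment column as 1 minus its normalised Shannon entropy.

    Gaps ('-' and '.') are ignored. Entropy is normalised by log2(20),
    the maximum for the 20 standard amino acids, so a fully conserved
    column scores 1.0.
    """
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = 0.0
    for count in Counter(residues).values():
        p = count / total
        entropy -= p * math.log2(p)
    return 1.0 - entropy / math.log2(20)


def conservation_profile(alignment: List[str]) -> List[float]:
    """Return one conservation score per column of an aligned sequence set."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return ["".join(col) for col in zip(*alignment)] and [
        column_conservation("".join(col)) for col in zip(*alignment)
    ]


# Toy alignment of four hypothetical sequences, three columns wide.
toy_alignment = ["MKV", "MRV", "MKL", "MKV"]
profile = conservation_profile(toy_alignment)
```

Here the first column is invariant and receives the top score, while the other two columns each contain one substitution and score lower. Runs of high-scoring positions are the kind of conservation hot-spot the record refers to.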
 
Description 5th Advanced in silico Drug Design workshop/challenge 2020 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Workshop and presentation of PDBe ligand tools at Palacky University Olomouc (UPOL) in the Czech Republic.
Year(s) Of Engagement Activity 2020
URL https://fch.upol.cz/en/5add/
 
Description EMBL training course "Mining PDBe and PDBe-KB using a graph database" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact EMBL training course titled "Advanced workshop on the PDBe graph database".

This workshop covered the use of the PDBe graph database to extract data for solving complex structural biology queries. It introduced the PDBe graph database and how to write Cypher queries to retrieve data of interest. Workshop participants were then able to use the graph database to explore data relevant to their own research with support and guidance from the development team at PDBe.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/training/events/mining-pdbe-and-pdbe-kb-using-graph-database/
 
Description EMBL training course "Structural bioinformatics (Virtual)" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This course explored bioinformatics data resources and tools for the investigation, analysis, and interpretation of biomacromolecular structures. It focused on how best to analyse and interpret available structural data to gain useful information given specific research contexts. The course content also covered predicting protein structure and function, and exploring interactions with other macromolecules as well as with low-MW compounds. Workshops were presented on PDBe search, pages and tools, as well as PDBe-KB pages. This course was a virtual event delivered via a mixture of live-streamed sessions, pre-recorded lectures, and tutorials with live support.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/training/events/structural-bioinformatics-virtual/
 
Description EMBL-EBI online tutorial "Why include Genome3D annotations in InterPro?" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This course is for anyone who would like a brief introduction to Genome3D annotations in InterPro. Undergraduate-level knowledge of biology and knowledge of InterPro (InterPro: Quick tour) would be an advantage. By the end of the course participants will be able to explain the role of Genome3D annotations in InterPro.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/training/online/courses/genome3d-annotations-in-interpro/relationship-between-...
 
Description EMBL-EBI training course "Summer school in bioinformatics (Virtual)" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This course provided an introduction to the use of bioinformatics in biological research, giving participants guidance for using bioinformatics in their work whilst also providing hands-on training in tools and resources appropriate to their research. Participants were initially introduced to bioinformatics theory and practice, including best practices for undertaking bioinformatics analysis, data management and reproducibility. To enable specific exploration of resources in their particular field of interest, participants were divided into focused groups to work on a small project set by EMBL-EBI resource and research staff, ending in a presentation from each group on the final day of the course to bring together learnings from all participants. The course included training and mentoring by experts from EMBL-EBI and external institutes. PDBe supervised the group project for independent exploration and analysis of PDBe-KB data.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/training/events/summer-school-bioinformatics-virtual/
 
Description EMBL-EBI webinar "Genome3D annotations in InterPro" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Genome3D provides consensus structural annotations and 3D models for sequences from ten model organisms, including human. These data are generated by several UK-based resources that together form the Genome3D consortium: SCOP, CATH, SUPERFAMILY, Gene3D, FUGUE, pDomTHREADER and PHYRE. InterPro, meanwhile, provides functional analysis of proteins by classifying them into homologous superfamilies and families, and by predicting domains, repeats and important sites, based on data from 14 member databases. This webinar presented the new InterPro entry type, Homologous superfamily, as well as describing domain and structure predictions from Genome3D annotations, and how they are integrated in InterPro.
Year(s) Of Engagement Activity 2019
URL https://www.youtube.com/watch?v=ZuFJu4iwTsg
 
Description Homologous superfamily: a new InterPro entry type 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Blog post presenting the new InterPro entry type: Homologous superfamily
Year(s) Of Engagement Activity 2017
URL https://proteinswebteam.github.io/interpro-blog/2017/10/03/Homologous-superfamily/
 
Description PDBe API webinar series "Using the PDBe graph API" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This webinar was part of a 6-part PDBe API webinar series, introducing different levels of programmatic access at PDBe. The series ranged from basic data retrieval and search using the PDBe API to more advanced features, including access and reuse of PDBe data visualisation components.

This webinar introduced the PDBe graph API, which is generated from the PDBe graph database and contains an even richer level of data than our standard API. We highlighted how this API supports our PDBe-KB aggregated views, with specific case studies that demonstrate the possibilities through this API.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/training/events/using-pdbe-graph-api/
 
Description Poster presentation at ECCB conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presenting the new InterPro entry type and the pipeline to generate protein sequence residue conservation.
Year(s) Of Engagement Activity 2018