CRESTANO - Common REst api for Structural ANnotation

Lead Research Organisation: European Bioinformatics Institute
Department Name: Protein Data Bank in Europe

Abstract

The Protein Data Bank in Europe (PDBe; pdbe.org) is one of the core resources at the European Bioinformatics Institute (EMBL-EBI). PDBe is a founding member of the Worldwide Protein Data Bank (wwPDB), which manages the PDB, the single global archive of biomacromolecular structure data. The other wwPDB partners are RCSB, PDBj and BMRB. PDBe has operated a deposition and annotation facility for PDB data since 1998. Over the years, PDBe has developed advanced tools and services for analysis of biomacromolecules (including unique tools such as PDBeFold, PDBePISA and PDBeMotif) and for delivery of PDB data to the user community. In addition, PDBe develops and maintains critical resources such as SIFTS (Structure Integration with Function, Taxonomy and Sequences), a vital source of up-to-date cross-reference information to other biological data resources.

The Cambridge Crystallographic Data Centre (CCDC; www.ccdc.cam.ac.uk) manages the Cambridge Structural Database (CSD), the main archive for small-molecule crystal structure data. CSD contains structural data for organic and organometallic compounds obtained using single crystal X-ray and neutron diffraction methods or based on powder diffraction data. The archive was established in 1965 and now contains more than 600,000 structures of small molecules. In addition to archiving the small molecule structural data, CCDC has developed many tools for the analysis of these data.

The information available in the PDB archive is used by structural biologists and the wider biomedical community to understand the structures archived in the PDB, while CSD data amongst many other applications can be used by chemists and biochemists for automatic screening of natural molecules suitable as drug candidates. Recently, to improve the annotation and validation of small molecule information in the PDB, wwPDB has entered into a collaboration with CCDC. As part of this, CCDC has made a number of tools available to the wwPDB partners, including Mogul, which will be used for validation of small molecule geometry during deposition and annotation of PDB data. This will constitute a major improvement as analysis of ligand structures in the PDB has shown that the majority of ligand models can be improved. The structure validation pipeline, which includes Mogul and which will become a critical part of the new wwPDB Deposition and Annotation system (D&A), is being developed at PDBe.

The goal of the present project is to implement a web-services API that will provide access to biomacromolecular structure data and advanced analyses and annotations of those structures available from PDBe. Additionally, CCDC will develop infrastructure to allow access to small-molecule data in the Cambridge Structural Database (CSD) for those compounds that are present in both the CSD and the PDB. This will facilitate real-time programmatic access to up-to-date information from PDBe databases and advanced tools and services, which will become available to any bioinformatics and structural-biology workflow systems as well as individual programs. In addition, access to experimentally determined structures from the CSD will provide better quality starting models for ligands during the macromolecular structure determination process. This, in turn, will improve the quality of deposited ligand data in the PDB, benefitting chemoinformatics research and informing the structure-based design of new drugs.

In this project, we propose to develop a method to provide access to the following types of PDBe and CCDC data and information in an integrated framework:
1. PDB data from the PDBe database infrastructure
2. Advanced analysis and annotations on biomacromolecular assemblies
3. Ligand environment and 3D structural motifs data from PDBeMotif
4. Up-to-date cross-references for all PDB entries, taken from the SIFTS resource
5. Data-quality indicators for all PDB entries
6. Access to CSD data for molecules that are also in the PDB

Technical Summary

PDBe (Protein Data Bank in Europe) has developed many unique and advanced tools and services such as PDBePISA, for prediction of biomacromolecular assemblies and analysis of interfaces, PDBeMotif, for access to structural ligand-binding information and 3D structural motifs, and SIFTS, for up-to-date cross-references to UniProt, CATH, SCOP, Pfam, InterPro, PubMed, NCBI taxonomy, GO, and IntEnz for all PDB entries. PDBe has also established resources specific for X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy and cryo-Electron Microscopy (EM). PDBe is currently implementing the wwPDB validation pipelines for X-ray, NMR and EM data deposited to the PDB and EMDB archives. The proposed PDBe e-infrastructure will create a unique and unified web service API for accessing data and annotations from PDBe databases and its advanced services and tools.
In this project, we propose to develop a REpresentational State Transfer (REST) web-service API to provide access to the following classes of PDBe and CCDC data and information in an integrated framework:
1. PDB data from the PDBe database infrastructure
2. Advanced analysis and annotations on biomacromolecular assemblies from PDBePISA
3. Ligand-environment data from PDBeMotif
4. 3D structural motifs from PDBeMotif
5. Up-to-date cross-references to UniProt, CATH, SCOP, Pfam, InterPro, PubMed, NCBI taxonomy, GO and IntEnz for all PDB entries, taken from the SIFTS resource developed by PDBe and UniProt
6. Data-quality indicators for all PDB entries and representative structures using the wwPDB validation-pipeline data available at PDBe
7. Access to CSD data for molecules that are also in the PDB

Planned Impact

The new infrastructure will allow integration into the PDB of small molecule data from the Cambridge Structural Database (CSD) for all small molecules that are found in the PDB in complex with biomacromolecules. Comparing the structures of small molecules in isolation (CSD data) and bound to their biomacromolecular targets (PDB data) will improve our understanding of whether binding results in geometric strain, which in turn may help elucidate the mode of substrates and signaling molecules. Alternatively, it may aid design of improved inhibitors or antagonists with reduced strain and possibly tighter binding. Finally, such comparisons can help users to assess if any unusual geometry or conformation is likely to be of biological significance or more likely to be an artifact of the structure-determination protocol.

Structures of small molecules in the CSD are almost always of high quality and represent strain-free conformations due to the higher resolution and better observation-to-parameter ratios obtained with small-molecule crystals. Thus, the CSD is potentially a very valuable source of high-quality starting models for structural biologists, and more frequent use of these structures would result in better quality ligand data in the PDB. Currently, it is possible to freely request individual CSD structures if the user knows the CSD identification code, but there is no mechanism that allows external structure-based queries of CSD. The proposed e-infrastructure at CSD will allow wwPDB annotators to query the CSD for structures that are identical or very similar to newly deposited ligands in the PDB. In this way, representative coordinates from CSD can be incorporated into the wwPDB chemical component dictionary and distributed publicly.

The web services will enable identification of high-quality starting models for use in structure building and refinement. The availability of structure-quality information will benefit developers of model-building and refinement software by identifying the most suitable starting models for ligands and biomacromolecules to use in the structure-determination process, be it by X-ray, NMR or EM. Programmatic access to annotations could also inform, for example, the interpretation of unexplained electron-density features in the active site of a protein, by providing information about all possible ligands found in a given protein environment.

The e-infrastructure will allow PDBe to bring together relevant data (e.g. data related to a particular ligand or protein molecule) from distinct PDBe advanced tools and resources and integrate it to provide users with a single user interface showing all the available information.

 
Description The Protein Data Bank in Europe (PDBe) is a founding member of the Worldwide Protein Data Bank (wwPDB), the organisation that manages the single global repository of bio-macromolecular structure data. PDBe also manages the global repository of 3D electron microscopy data, the Electron Microscopy Data Bank (EMDB).

PDBe has developed many advanced tools for analysis of macromolecular structure data. In collaboration with the UniProt team, PDBe has also developed the SIFTS (Structure Integration with Function, Taxonomy and Sequence) resource that provides up-to-date data on the biological context of the macromolecular structure information in the PDB by integrating information from other biomedical databases.

The PDBe REST API is designed to provide programmatic access to all information in the PDBe database. This includes information available in the PDB and EMDB archives, but also improved and value-added information such as assembly data from PISA, validation information from the wwPDB validation resource, and cross-reference information to other biomedical databases based on the SIFTS resource.

The API contains separate modules for PDB, EMDB, SIFTS, PISA and validation information with relevant calls aggregated in each module. This allows for easy integration of all macromolecular structure information into an application or workflow for analysis of biomedical information. The API also makes it possible to access structure-related information in manageable data blocks without having to read a large file or parse data that is not relevant.
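The modular layout described above can be illustrated with a minimal client sketch. It is not taken from the PDBe documentation: the base URL, the path segment chosen for each module, and the helper names `endpoint_url` and `fetch_block` are all assumptions made for illustration, so the real addresses should be checked against the API documentation before use.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL (not stated in this record); verify against the API docs.
BASE_URL = "https://www.ebi.ac.uk/pdbe/api"

# The five modules named in the text. The path segment used for each
# module is an assumption made for this sketch.
MODULES = {
    "pdb": "pdb",
    "emdb": "emdb",
    "sifts": "mappings",
    "pisa": "pisa",
    "validation": "validation",
}


def endpoint_url(module, call, identifier, base_url=BASE_URL):
    """Build the URL for one call in one module (hypothetical helper)."""
    if module not in MODULES:
        raise ValueError(f"unknown module: {module!r}")
    # Percent-encode each path part of the call separately.
    call_path = "/".join(quote(part) for part in call.strip("/").split("/") if part)
    parts = [base_url.rstrip("/"), MODULES[module]]
    if call_path:
        parts.append(call_path)
    parts.append(quote(identifier))
    return "/".join(parts)


def fetch_block(module, call, identifier, timeout=30):
    """Fetch a single JSON data block instead of a whole archive file.

    Requires network access and assumes the server returns JSON.
    """
    with urlopen(endpoint_url(module, call, identifier), timeout=timeout) as response:
        return json.load(response)
```

Under these assumptions, `endpoint_url("pdb", "entry/summary", "1cbs")` gives `https://www.ebi.ac.uk/pdbe/api/pdb/entry/summary/1cbs`, and `fetch_block` would retrieve only that block. Keeping one small URL builder per module mirrors the idea of aggregating related calls, so a workflow can request just the data it needs.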

A similar REST API, developed by the Cambridge Crystallographic Data Centre (CCDC), provides coordinate data from the small-molecule archive, the Cambridge Structural Database (CSD), for molecules that are common to the PDB and the CSD. These coordinates will be integrated into the wwPDB Chemical Component Dictionary (CCD) and made publicly available via the PDB FTP archive.
Exploitation Route The PDBe REST API is integrated in the PDBe production workflow and supports the redesigned PDB and EMDB entry pages. The REST API has recently been integrated in Jmol/JSmol, a 3D macromolecular visualisation applet, where it is used to display value-added annotations such as structure and sequence domain annotation and validation-related information. This makes the annotations available to all data resources that use Jmol/JSmol to provide interactive 3D representations of macromolecules.

The CCDC team has developed a web service to provide access to CSD molecule data for small molecules found in the PDB. To keep this information up to date, the web service is made available to the wwPDB partners so that it can be integrated into the wwPDB deposition and annotation system. The coordinate data for existing compounds are also provided to the wwPDB partners for distribution via the wwPDB FTP site.
Sectors Digital/Communication/Information Technologies (including Software),Education,Healthcare,Pharmaceuticals and Medical Biotechnology

URL http://wwwdev.ebi.ac.uk/pdbe/api/doc/api-index.html
 
Description The macromolecular structure information available in the Protein Data Bank (PDB) and the Electron Microscopy Data Bank (EMDB) is used by structural biologists and the wider biomedical community, while CSD data, amongst many other applications, are used by chemists and biochemists for automatic screening of natural molecules suitable as drug candidates. The PDBe API makes it possible to integrate the information available in the PDB and EMDB, as well as value-added information from other PDBe resources, into workflows, and it provides up-to-date information to other biomedical data resources. The API has been integrated in visualisation applications such as Jmol/JSmol to deliver value-added annotation, such as information on the quality of a structure or the structure and sequence domain annotation, efficiently to general users. Similarly, the API can support data analysis tools by making up-to-date macromolecular structure data available via an easy-to-use programmatic interface.
First Year Of Impact 2014
Sector Agriculture, Food and Drink,Chemicals,Digital/Communication/Information Technologies (including Software),Education,Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
Impact Types Societal,Economic

 
Title CCDC API 
Description CCDC API for obtaining CSD coordinate data for molecules in the PDB 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact A server which will be used by the wwPDB annotation system to update CSD coordinates for existing entries as well as adding coordinates for new molecules. 
URL https://api.ccdc.cam.ac.uk/crestano
 
Title PDBe REST API
Description REST API for all PDBe information 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact The API makes all PDB and EMDB data available, together with value-added information from PISA, SIFTS and the validation data.
URL http://wwwdev.ebi.ac.uk/pdbe/api/doc/api-index.html
 
Description ACS meeting (Aug 2014) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact There were queries about methods used in our project and the API

There were further queries about the CCDC API and CSD coordinate data for molecules in the PDB
Year(s) Of Engagement Activity 2014
URL http://www.acs.org/content/acs/en/meetings/fall-2014.html
 
Description CSHL course: X-Ray Methods in Structural Biology 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Invited lecture as well as two practicals and discussions with students in the course and fellow course instructors.
Year(s) Of Engagement Activity 2016
URL http://meetings.cshl.edu/courses.aspx?course=c-crys&year=16
 
Description IUCr - 2014 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The talk was followed by questions and discussions.

There were further inquiries on the CSD coordinate data and the REST API
Year(s) Of Engagement Activity 2014
URL http://www.iucr.org/iucr/cong/iucr-xxiii
 
Description Protein Structure Determination in Industry Conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Keynote speaker at the Protein Structure Determination in Industry Conference in Malmö, Sweden. Interacted with the other participants, most of whom are in pharma and biotech companies.
Year(s) Of Engagement Activity 2016
URL http://psdi2016.org/index.html
 
Description Seminar in Uppsala 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Invited lecture and follow-up discussion with local scientists and students.
Year(s) Of Engagement Activity 2016
 
Description Seminar in Vienna 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Invited lecture on "The wonderful world of structure archiving - what's happening and what's next?" as well as discussions with various local scientists and students.
Year(s) Of Engagement Activity 2016
 
Description Structural bioinformatics training course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Co-organiser of and lecturer in this practical course, interacting with students and fellow instructors.
Year(s) Of Engagement Activity 2016,2017,2018
URL http://www.ebi.ac.uk/training/events/2016/structural-bioinformatics-2016
 
Description Talk: CCP4 Developers Meeting (2014) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact CCP4 developer meeting

Some of the developers have shown interest in the PDBe API and the CSD coordinate data
Year(s) Of Engagement Activity 2014
 
Description Training provided to Grenoble Trainee 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Aine Barry from EMBL Grenoble visited to receive training on the API so that she could develop an application to help her team at EMBL Grenoble. Aine visited for two days and was given detailed information about the PDBe search and entry API.

Aine has gone back and developed a prototype application.
Year(s) Of Engagement Activity 2015
 
Description Univ of Copenhagen PhD Day (keynote speaker) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Keynote speaker at the PhD Day of the Dept of Biology of the Univ of Copenhagen. Also interacted with students who presented posters as well as the PhD students who organised the PhD Day.
Year(s) Of Engagement Activity 2016
URL http://phdday.wixsite.com/2016
 
Description Visit by Roberto Mosca, IRB Barcelona, Spain 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Roberto visited PDBe to learn more about the entry and search API and new developments at PDBe which are based on the API

Roberto is planning to use the API in his research and development work
Year(s) Of Engagement Activity 2015
 
Description Visit by Ville Uski, STFC, UK to discuss API development 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Ville is responsible for developing the CCP4 API and wanted to discuss the technologies used in the development of the PDBe API. The information shared will help him make decisions in the technology selection process for the CCP4 API.

The sharing of information has resulted in Ville saving a lot of time and will result in CCP4 benefiting from PDBe experience.
Year(s) Of Engagement Activity 2015
 
Description York Roadshow (2014) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact The participants carried out the exercises based on the REST API tutorials. The discussions that followed helped one of the participants integrate the API in the CCP4 programme suite which will make it available to the wider structural biology community

There is increasing interest in the REST API for integration in research workflows as well as established programs such as CCP4
Year(s) Of Engagement Activity 2014