CIBR 19-BBSRC-NSF/BIO: Next generation PDB - FACT infrastructure with value added FAIR data supporting diverse research and education user communities

Lead Research Organisation: European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

The vision of this US RCSB Protein Data Bank/Protein Data Bank in Europe collaborative project is to improve data deposition, delivery, and management of three-dimensional (3D) macromolecular structure information stored in the Protein Data Bank. This work will benefit researchers, educators, and their students across the natural, physical, and engineering sciences.

The principle that "form (meaning shape/3D structure) dictates function in biology" was first revealed in the Watson and Crick publication of the DNA double helix structure. Since their landmark discovery, interdisciplinary collaborative teams of biologists, physicists, chemists, and engineers have generated ~160,000 experimentally determined 3D structures of proteins and nucleic acids, which are centrally stored in a public data resource known as the Protein Data Bank (PDB). Founded in 1971 as the first open-access digital data resource in biology, the PDB has grown more than 20,000-fold to become the single global archive housing richly annotated 3D structures of proteins, DNA, and RNA. This public-domain 3D structure data resource has had an enormous impact on fundamental biology, biomedicine, biotechnology, and bioenergy by enabling atomic-level understanding of naturally occurring and engineered biomolecules, and by facilitating discovery of nearly 90% of the new drugs approved by the United States (US) Food and Drug Administration between 2010 and 2016. Today, new PDB structures are coming from macromolecular crystallography (MX), nuclear magnetic resonance spectroscopy (NMR), single-particle cryo-electron microscopy (3DEM), and micro-crystal electron diffraction (microED). X-ray free electron lasers and new integrative methods for structure determination are accelerating biomedical research with insights into ever more complex biological systems at the atomic level. Cryo-electron tomography even allows studies of macromolecular machines "caught in the act" inside frozen cells.

Since 2003, the Worldwide Protein Data Bank (wwPDB, wwpdb.org) partnership has managed the PDB Core Archive (hereafter PDB archive) as a global Public Good according to the FACT principles of Fairness-Accuracy-Confidentiality-Transparency and the FAIR principles of Findability-Accessibility-Interoperability-Reusability. The wwPDB includes locally-funded partners in the US (Research Collaboratory for Structural Bioinformatics Protein Data Bank, RCSB PDB), Europe (Protein Data Bank in Europe, PDBe) and Asia (Protein Data Bank Japan, PDBj), plus a specialist NMR resource (BioMagResBank, BMRB). The wwPDB also enables equitable sharing of PDB data archiving and management costs between US, Europe, and Asia. In 2019, RCSB PDB, PDBe, and PDBj jointly processed 13,377 new structures coming into the PDB archive using the web-based, global wwPDB OneDep software system for deposition, validation, and biocuration. Also, in 2019, RCSB PDB, PDBe, and PDBj jointly enabled download of ~800 million PDB structure data files by millions of users from around the world.

Today, wwPDB partners are confronting significant software engineering challenges resulting from (i) the relentless growth in the number and size/complexity of newly deposited MX and 3DEM structures, and (ii) the need to manage incoming data as groups of related structures (or investigations) coming from serial femtosecond X-ray crystallography (SFX) using X-ray Free Electron Lasers (XFEL) and 3DEM.

Technical Summary

This project aims to improve data deposition, delivery, and management of three-dimensional (3D) macromolecular structure information stored in the single global public data resource known as the Protein Data Bank (PDB). The PDB currently houses ~160,000 experimentally determined 3D structures of proteins and nucleic acids. It is managed according to the FAIR Principles on an open access basis by the Worldwide Protein Data Bank (wwPDB; wwpdb.org) partnership. The project addresses significant software engineering challenges resulting from (i) the relentless growth in the number and size/complexity of newly deposited structures, and (ii) the need to manage incoming data as groups of related structures (or investigations). The project will improve the fidelity and completeness of 3D structure data deposited into the PDB by harvesting data automatically from structure determination software packages and by streamlining the wwPDB data deposition, validation, and biocuration system known as OneDep. The project will improve the "FAIR"ness of PDB data for researchers, educators, and students by extending chemical metadata for small-molecule ligands (e.g. bound cofactors and inhibitors); incorporating enhanced descriptions of macromolecular assemblies; grouping related PDB structures into investigations for more efficient, parallel data delivery; and creating a "Next Generation" PDB data repository with up-to-date metadata. Finally, the project will modernise wwPDB information technology infrastructure to future-proof PDB data management and weekly PDB archive release to the public domain by developing new application programming interfaces (APIs) and microservices infrastructure, and updating existing mechanisms for synchronisation of data and software across wwPDB data centres in the US, Europe, and Asia. This work will directly benefit researchers, educators, and their students across the natural, physical, and engineering sciences.
 
Title SIFTS mappings in updated mmCIF 
Description PDBe's 'updated' PDBx/mmCIF files, containing SIFTS mapping information and additional, standardised metadata, are now available directly from the EMBL-EBI FTP area. These files will also be made available through the NextGen FTP archive. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact PDBe's release process creates 'updated' PDBx/mmCIF files from PDB archive files. These files contain remapped enumerations and additional information and yield more consistent, standardised metadata, without altering core PDB information such as atomic coordinates and experimental data. The updated PDBx/mmCIF files have now been further enriched by the addition of three new 'SIFTS-specific' categories, providing an improved mapping between structure and sequence data. 
URL https://ftp.ebi.ac.uk/pub/databases/msd/updated_mmcif/
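
As a rough illustration of how a user might check whether a downloaded 'updated' PDBx/mmCIF file carries SIFTS-specific categories, the sketch below collects category names from mmCIF text using only the Python standard library. It is a simplified scan, not a full mmCIF parser. The function names, the category prefix `pdbx_sifts`, and the sample content are assumptions made for illustration; the record above states only that three 'SIFTS-specific' categories were added.

```python
def mmcif_categories(text):
    """Return the set of category names found in mmCIF text.

    Simplified scan: a data name looks like `_category.item`, and
    semicolon-delimited multi-line text values are skipped.
    """
    categories = set()
    in_text_block = False
    for line in text.splitlines():
        # A ';' in the first column opens or closes a multi-line text value.
        if line.startswith(";"):
            in_text_block = not in_text_block
            continue
        if in_text_block:
            continue
        if line.startswith("_"):
            token = line.split()[0]
            if "." in token:
                categories.add(token[1:].split(".", 1)[0])
    return categories


def sifts_categories(text, prefix="pdbx_sifts"):
    """Return sorted category names starting with `prefix` (an assumed prefix)."""
    return sorted(c for c in mmcif_categories(text) if c.startswith(prefix))


# Made-up sample content, for illustration only (not taken from a real entry).
SAMPLE = """data_1ABC
_entry.id 1ABC
loop_
_pdbx_sifts_xref_db.entity_id
_pdbx_sifts_xref_db.asym_id
1 A
_struct.title
;A title with a line that
_looks.like a tag
;
"""

if __name__ == "__main__":
    print(sifts_categories(SAMPLE))
```

For real files, a dedicated mmCIF parsing library would be more robust than this line-based scan, which ignores quoting rules and other details of the format.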
 
Title wwPDB NextGen Archive 
Description Since 1 February 2023, the wwPDB has enriched PDB entries with additional annotation and distributed the latest version of each entry via the next generation archive (NextGen). 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact This enriched PDB archive supplements the structure model files in the main PDB archive with metadata annotation drawn from external database resources. 
URL https://files-nextgen.wwpdb.org/
 
Description CCP4 WG2 Meeting 2023 Feb 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The planned improvements and changes to the PDB infrastructure were presented and discussed.
Year(s) Of Engagement Activity 2022
 
Description Protein Data Bank Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Protein Data Bank (https://www.wwpdb.org) archives information about the 3D shapes of proteins, nucleic acids, and complex assemblies that help students and researchers understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. This workshop coincides with the PDB's 50th anniversary.
Year(s) Of Engagement Activity 2021
URL https://www.rsc.org/events/detail/47412/workshop-on-open-source-tools-for-chemistry