PDBHarvest - Harvesting more and better metadata from CCP4 projects to enrich structure depositions to the PDB

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: Protein Data Bank in Europe


In the era of data-driven biology, the research community is increasingly dependent on the availability of accurate and complete metadata information for different experiments archived in the biological data resources. The Protein Data Bank (PDB) is the single global repository of high-resolution three-dimensional (3D) macromolecular structure data. The PDB is managed by the Worldwide Protein Data Bank (wwPDB) consortium of which PDBe is a founding member. The high-resolution data archived in the PDB can help in the design and discovery of new therapeutics relevant to the pharmaceutical, animal health, food safety and biotechnology industries. Over 80% of the structures available in the PDB are determined using X-ray crystallography. In recent years there have been rapid advances in structure-determination methodology, instrumentation and software. This has resulted in a rapid growth of the number of structures determined each year. The improvements have also enabled crystallographers to address ever more challenging biological systems including integral membrane proteins and large macromolecular machines such as ribosomes and chaperones. The emergence of hybrid methods, where data from a wide range of experiments can be used in structure determination, allows researchers to study even larger and more complex systems. These scientific advances make it necessary to accurately capture additional and more complex metadata. This can be achieved by implementing automated data-capturing ("data-harvesting") functionality in popular software packages used by crystallographers. CCP4 is one of the most popular program suites that supports all steps in the structure-determination process. It is widely used by researchers in academia and industry. We plan to implement data-harvesting infrastructure in CCP4 to facilitate automatic data capture and easy deposition to the PDB.
The new harvesting functionality will export the metadata in mmCIF file format; this is a flexible format that allows for future extensions to capture even more metadata. The additional metadata will enrich the information available in the PDB archive and will allow for better use of the archive information by the biomedical research community. The additional information can be "mined" by crystallographic methods developers to detect hitherto unknown correlations and possibly gain new insights that could lead to better methods. To support the data-harvesting efforts, the project will also modify the wwPDB deposition and annotation system so that it will accept the upload of the new harvest file. The modified wwPDB deposition software will allow for automatic extraction of the additional metadata and simplify the deposition process for CCP4 users, while providing the entire user community with more and more accurate structural data.

Technical Summary

With the increasing importance of data-driven research in biomedicine, which is critically dependent on the diverse and large volumes of data available from public data resources, it is imperative that the data archives collaborate closely with the research community to support automated, accurate and complete capture of experimental metadata ("data-harvesting"). The structural biology community and the Worldwide Protein Data Bank (wwPDB; the organisation that manages the single global repository of high-resolution macromolecular structure data, the PDB) are at the forefront of such data-harvesting efforts. Over 80% of the structures available in the PDB were determined using macromolecular crystallography (MX). The MX structure-determination process is complex and usually non-linear, with steps including data integration, data scaling, solving the phase problem, model building and finally refinement and validation of the model. Many software suites such as CCP4, Phenix, SHELX and Global Phasing support this complex process or substantial parts thereof. Amongst these, CCP4 is the most popular. It provides an intuitive user interface that makes the complex steps in the structure-determination pipeline easily accessible.

We plan to implement data-harvesting functionality to take advantage of recent CCP4 developments that include an improved user interface (GUI-2) and data-storage infrastructure. The new infrastructure will capture additional metadata, including information on the sample sequence and the chemical description of the non-standard amino acids, nucleotides and small molecules bound to the macromolecules. The data will be exported in mmCIF format, which is an extensible, dictionary-based data format. To provide easy access to the new functionality, it will be made available via the GUI-2 user interface. The project will also implement the necessary updates to the wwPDB deposition system to upload and process data from the new harvest files.
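
As an illustration of what exporting sample-sequence metadata in mmCIF format could look like, the following is a minimal sketch in Python. It is not the CCP4 implementation: the function names (format_value, write_category, build_harvest) and the data-block name are hypothetical, and only a small subset of the CIF quoting rules is handled. The item names in the entity_poly category are taken from the PDBx/mmCIF dictionary.

```python
def format_value(value: str) -> str:
    """Format one data value using a simplified subset of the CIF quoting rules."""
    if value == "":
        return "?"  # CIF placeholder for an unknown value
    needs_quotes = any(ch.isspace() for ch in value) or value[0] in "_#$'\";[]"
    if "\n" in value or (needs_quotes and "'" in value and '"' in value):
        # Multi-line (or awkwardly quoted) values go in a semicolon-delimited
        # text field. Assumption: no line of the value itself starts with ';'.
        return "\n;" + value.strip("\n") + "\n;"
    if needs_quotes:
        quote = "'" if "'" not in value else '"'
        return quote + value + quote
    return value


def write_category(name: str, items: dict) -> str:
    """Write one mmCIF category as key-value pairs, terminated by a '#' line."""
    lines = []
    for key, value in items.items():
        tag = f"_{name}.{key}"
        formatted = format_value(str(value))
        if formatted.startswith("\n"):
            lines.append(tag + formatted)  # text field starts on the next line
        else:
            lines.append(f"{tag} {formatted}")
    lines.append("#")
    return "\n".join(lines)


def build_harvest(block_id: str, entity_id: str, sequence: str) -> str:
    """Build a tiny harvest-style data block holding one polymer sequence."""
    # Wrap long sequences at 80 characters so they land in a text field.
    wrapped = "\n".join(sequence[i:i + 80] for i in range(0, len(sequence), 80))
    category = write_category(
        "entity_poly",
        {
            "entity_id": entity_id,
            "type": "polypeptide(L)",
            "pdbx_seq_one_letter_code": wrapped,
        },
    )
    return "\n".join([f"data_{block_id}", "#", category]) + "\n"


if __name__ == "__main__":
    # Hypothetical block name and an arbitrary example sequence.
    print(build_harvest("harvest", "1", "MKTAYIAK"), end="")
```

Because mmCIF is dictionary-based, further categories (for example, chemical descriptions of bound ligands) could be added as extra write_category calls without changing the file structure, which is the extensibility the summary above refers to.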

Planned Impact

Increasingly, life-science research is driven by the enormous amounts and variety of biological data available in public archives and resources. Biological data resources play a major role in supporting researchers by accurately capturing, curating and archiving the relevant data and the corresponding metadata. The Protein Data Bank (PDB), managed by the Worldwide Protein Data Bank consortium (wwPDB), is the single global repository of high-resolution 3D macromolecular structures. In this project, PDBe will work closely with Collaborative Computational Project No. 4 in Protein Crystallography UK (CCP4), the project responsible for the most popular structure-determination program suite used by researchers in academia and industry. PDBe and CCP4 will implement infrastructure to automatically capture relevant metadata and provide mechanisms to upload it during PDB deposition. The immediate beneficiaries of the new infrastructure will be the users of the CCP4 program suite at over 8000 academic and 140 industrial sites. The CCP4 project will also benefit from having an up-to-date harvesting infrastructure designed to be extensible in the future. The direct integration of the harvesting functionality into the CCP4 user interface will make it immediately available to its users via the CCP4 update mechanism, thus reducing the time of adoption of the new infrastructure by the user community to a minimum.

The new developments will make the data-deposition process easier for CCP4 users while increasing the accuracy and amount of data transferred, thus enriching the information available to all users of the PDB. Automated harvesting of relevant metadata and its deposition to the PDB also ensures that the outcomes of research funded by the public and private sectors are captured in a form that the wider research community can use. The enriched archive will allow for data mining to uncover hitherto unknown knowledge, trends and correlations and gain new insights that could lead to improved methods.
The planned inclusion of automatic harvesting of sample information will allow for accurate representation of the macromolecules studied by making their sequences available in the harvest file. This will allow for better integration of macromolecular structure information with other resources and the possibility to transfer value-added annotations from other life-science domains onto macromolecular structures and vice versa. Such transfer of knowledge can have direct applications, e.g. structure information can be valuable in understanding the effect of genetic variation on the function of a macromolecule or in understanding the mechanism of action of small molecule effectors, potentially leading to design of better therapeutic molecules.

The new harvesting functionality will also transfer valuable information about the chemical and 3D structure of non-standard residues and small molecules not yet represented in the PDB archive. This will make it easier to assess the quality of the models of these molecules. At present, the sometimes poor quality of ligand coordinates means that the data is not usable by pharmaceutical companies without additional in-house checks. Since these checks will involve calculation of electron density maps and their inspection by at least one expert, such checks are time-consuming and expensive.

In summary, archiving of additional and better quality metadata will enrich the global archive of macromolecular structure data, the PDB, with a direct impact on all the communities that use this information.

The proposed work will directly contribute to the professional development of the staff involved. Apart from contributing to the proposed work, the experienced software developer named on the proposal will benefit from experience in handling macromolecular structure data from different stages of the structure-determination process.


Description Archiving research information is a critical part of the scientific process. The software developed in this project supports better archiving of structural biology data.
Exploitation Route The software is part of the CCP4 distribution and will be used by all users of the CCP4 package. CCP4 has further updated the software to include additional data in the harvest file. This makes it easier to deposit structure data to the PDB.
Sectors Other

Title Software provided in CCP4 distribution 
Description The software allows for generation of a file to be transferred to wwPDB 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Better provision of data and metadata during wwPDB deposition 
Description "What does PDB do to improve data quality?" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact One talk and one institutional seminar were presented at the University of Strasbourg, France as part of the Proteopedia training workshop.
Year(s) Of Engagement Activity 2019
Description CCP4 developers meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Provided an overview of the software developed in the project to allow better information transfer
Year(s) Of Engagement Activity 2016
Description PDBe lunchtime byte 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact This event was part of the CCP4 Study Weekend 2019 at the University of Nottingham, where a talk was presented and calendars distributed to attendees.
Year(s) Of Engagement Activity 2019