BioChemGRAPH - an integrated knowledge graph to facilitate basic and translational research

Lead Research Organisation: Cambridge Crystallographic Data Centre
Department Name: The Cambridge Crystallographic Data Cent


Small molecules such as drugs, sugars, amino acids, and lipids play important roles in regulating complex biological processes. They do so by interacting with macromolecules such as proteins or nucleic acids to facilitate, disrupt or change their function or interaction patterns. The three-dimensional structures of small molecules in complex with macromolecules, together with the associated biochemical experiments, can provide key insights into their function and mode of action. This understanding can support the design of new enzymes or new drug candidates.

The Protein Data Bank (PDB) is the single global archive of macromolecular structures and also contains more than 30,000 unique small molecules observed in nearly 120,000 complexes. The ChEMBL database archives biochemical assay data that provide complementary information for understanding the biological role of these small molecules. The Cambridge Crystallographic Data Centre (CCDC) manages an archive of over 1 million small-molecule crystal structures that can provide insights into small-molecule interactions, which can be especially important in the design of new drug molecules. Exponential growth in the scale and diversity of data, driven by recent technological and scientific advances, has transformed the life sciences into a data-driven activity, with multidisciplinary teams using data from many sources to drive innovation in biotechnology, synthetic biology, agriculture, and human health. Users currently expend major effort standardising, curating and integrating data from multiple data resources to gain a comprehensive understanding of biological systems.

The BioChemGRAPH project aims to establish a collaboration between PDBe, ChEMBL and CCDC to create an easily accessible resource that integrates structural, functional and biochemical annotations of small molecules in one place. These data will be added to an existing community-driven integration platform, namely PDBe-KB, which already provides an aggregated view of structural and functional annotations for macromolecules in the PDB. Building on PDBe-KB's efforts in the macromolecular community, this project will promote interoperability between small molecule resources by implementing common data standards. The project also aims to improve the findability and accessibility of small molecule annotations via uniform data access mechanisms and to develop intuitive web components to visualise these valuable data through web interfaces. It will significantly increase the synergies between structural and biochemical data and will lead to increased understanding of the role of small molecules in biological systems (e.g. enzymatic mechanisms) and support translational research in a number of areas, including synthetic biology, target validation, and drug development. Interconnections between targets and small molecules, which currently require manual collation, would be established automatically as part of the proposed project and could help with target validation and drug repurposing by highlighting potential cross-reactivity and side effects. In order to achieve this automated linking, we will build upon and expand UniChem, a service that offers mappings between small molecule entries from numerous data resources. With the help of this readily available integrated resource, we will develop user-friendly web pages aggregating all the data for a given small molecule and relevant macromolecules. Advanced users will also benefit from the expanded programmatic access mechanisms to the integrated data.
The new infrastructure will thus help researchers by providing a robust, time-saving mechanism to obtain all relevant small molecule-related data and associated biochemical, structural and functional information on macromolecules.

Technical Summary

BioChemGRAPH will develop and sustain an automated process to integrate information for the common set of small molecules from three well-maintained and curated core data archives: structures of >30,000 unique small molecules observed in complex with macromolecules from the Protein Data Bank (PDB), small molecule crystal structures from the Cambridge Structural Database (CSD), and associated biochemical assay and target information from ChEMBL. These data will be integrated into the PDBe-KB data resource, a community-driven effort that integrates structural and functional annotations for macromolecules from >20 international data resources.

To achieve this objective, we will build upon and expand UniChem, a service that offers mappings between entries describing small molecules present in numerous data resources. We will map the target information between ChEMBL and the PDB by using UniProtKB accessions cross-referenced in each resource. The project will implement common data standards for the exchange of annotations and will improve findability and accessibility via uniform data access mechanisms (RESTful API and FTP), and via intuitive web components to visualise the data. With the help of this readily available integrated resource, we will develop user-friendly web interfaces aggregating all the data for a given small molecule and its macromolecular binding partners. BioChemGRAPH will thus address a key data integration challenge by developing a robust, time-saving mechanism to obtain all small molecule-related data and associated biochemical, structural and functional information on macromolecules, facilitating increased understanding of the role of small molecules in biological systems (e.g. enzymatic mechanisms) and translational research in a broad range of fields, including synthetic biology, target validation and drug development.
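The linking step described above can be illustrated with a minimal sketch. It assumes that a compound cross-reference table (of the kind a UniChem-style mapping would supply) and per-resource records carrying UniProtKB accessions are already available as in-memory Python objects. All identifiers, record layouts and the function name `link_structures_to_assays` are hypothetical placeholders chosen for illustration; they are not taken from the project or from any of the resources named.

```python
from collections import defaultdict

# Hypothetical toy inputs; none of these records come from a real resource.
# Compound cross-references (PDB ligand code -> ChEMBL compound ID).
pdb_to_chembl = {"LIG_A": "CHEMBL_A", "LIG_B": "CHEMBL_B"}

# (PDB ligand code, PDB entry, UniProtKB accession of the bound macromolecule)
pdb_complexes = [
    ("LIG_A", "1abc", "ACC_1"),
    ("LIG_A", "2xyz", "ACC_2"),
    ("LIG_B", "3def", "ACC_3"),
]

# (ChEMBL compound ID, UniProtKB accession of the assay target, assay ID)
chembl_assays = [
    ("CHEMBL_A", "ACC_1", "ASSAY_1"),
    ("CHEMBL_A", "ACC_9", "ASSAY_2"),
    ("CHEMBL_B", "ACC_3", "ASSAY_3"),
]


def link_structures_to_assays(pdb_to_chembl, pdb_complexes, chembl_assays):
    """Pair each structural complex with assays on the same compound and target.

    The compound is matched through the cross-reference table and the target
    is matched through the shared UniProtKB accession.
    """
    # Index assays by (compound, accession) so each complex is a single lookup.
    assays_by_key = defaultdict(list)
    for compound, accession, assay in chembl_assays:
        assays_by_key[(compound, accession)].append(assay)

    links = []
    for ligand, entry, accession in pdb_complexes:
        compound = pdb_to_chembl.get(ligand)
        if compound is None:
            continue  # no cross-reference for this ligand
        for assay in assays_by_key.get((compound, accession), []):
            links.append(
                {
                    "pdb_ligand": ligand,
                    "pdb_entry": entry,
                    "uniprot": accession,
                    "chembl_compound": compound,
                    "assay": assay,
                }
            )
    return links


links = link_structures_to_assays(pdb_to_chembl, pdb_complexes, chembl_assays)
for link in links:
    print(link)
```

In this toy data only complexes whose ligand and target both have a counterpart on the assay side produce a link; the complex with accession `ACC_2` and the assay on `ACC_9` remain unlinked. A real pipeline would also need to handle one-to-many compound mappings and obsolete or merged accessions, which this sketch ignores.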

Planned Impact

Integrating data from a large number of resources is a challenging but essential task in the new era of data-driven biology. However, keeping track of the increasing number of individual data resources, their curation practices, policies and data formats is a nearly impossible task. Core data resources are uniquely positioned to bring together the growing community of data providers to agree common data exchange standards and thus facilitate the creation of integrated resources with robust data access mechanisms, potentially saving users time. PDBe-KB is one such resource that brings together >20 resources from the Structural Bioinformatics community, including MetalPDB, MCSA and 3DLigandSite, which provide functional annotations for structures in the PDB.

The BioChemGRAPH project will deliver an integrative data resource that will serve as a central repository for structural, functional and biochemical annotations for the common set of small-molecule compounds (e.g. drugs, cofactors, inhibitors) and associated target information contained in the PDB, ChEMBL and CSD archives. The integration efforts will not only enrich the information already available in these resources, but also make it more accessible to the broader scientific user community and achieve further impact by combining it with annotations contributed to PDBe-KB. The focus of this work was identified as a clear priority based on feedback from our users through surveys and interviews. The survey received 113 responses from academia (63%) and industry (37%) and guided the definition of needs that will be addressed by this proposal. In particular, the resource will enable users to answer scientific questions that were highlighted in the survey, such as (1) are there annotations explaining the biological role of a particular ligand; (2) are there other small molecules that have the same scaffold or fragments, or bind in similar binding pockets; (3) where are the functional groups of a small molecule based on the statistics of observed atomic-level interactions of the ligand; and (4) which fragment hotspots overlay with the small molecule of interest.

As is evident from the support letters, BioChemGRAPH will not only benefit the academic research community but will also provide valuable support for small and medium-sized enterprises and large pharmaceutical companies. The integrated knowledge graph will support a broad range of research areas: for example, groups studying microbial resistance, rare diseases, drug design or target validation could more easily conduct comparative studies and transfer structural, functional and interaction-specific annotations between similar small molecules. The resource will also serve as a valuable research tool for investigating the effects of genetic variation, studying drug resistance, supporting drug repurposing efforts, understanding enzymatic mechanisms at the molecular level, designing synthetic enzymes and pursuing other translational research goals. It will enable explorative analyses based on structural similarity and on the commonality of scaffolds, fragments or binding pockets. It will also significantly help initiatives like Open Targets and the Illuminating the Druggable Genome project by providing rich data sets in a standardised and interoperable format.

BioChemGRAPH will also deliver a common library of data visualisation tools that will expose the collated data in a uniform manner and lower the barrier to accessing bioactivity and structural data for small molecules. This open source library will potentially standardise visualisation tools for displaying small molecule-related annotations for all relevant data resources. These visualisation tools will also be used to enhance the aggregated views of proteins, effectively linking the small molecule and macromolecule-specific web interfaces.

