BioChemGRAPH - an integrated knowledge graph to facilitate basic and translational research

Lead Research Organisation: European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

Small molecules such as drugs, sugars, amino acids and lipids play important roles in regulating complex biological processes. They do so by interacting with macromolecules such as proteins or nucleic acids to facilitate, disrupt or change their function or interaction patterns. The three-dimensional structures of small molecules in complex with macromolecules and the associated biochemical experiments can provide key insights into their function and mode of action. This understanding can support the design of new enzymes or new drug candidates.

The Protein Data Bank (PDB) is the single global archive of macromolecular structures and also contains more than 30,000 unique small molecules observed in nearly 120,000 complexes. The ChEMBL database archives biochemical assay data that provide complementary information to understand the biological role of the small molecules. The Cambridge Crystallographic Data Centre (CCDC) manages an archive of over 1 million small molecule crystal structures that can provide insights into small-molecule interactions, which can be especially important in the design of new drug molecules. Exponential growth in the scale and diversity of data due to recent technological and scientific advances has transformed the life sciences into a data-driven activity, with multidisciplinary teams using data from many sources to drive innovation in biotechnology, synthetic biology, agriculture and human health. Users currently expend major effort standardising, curating and integrating data from multiple data resources to gain a comprehensive understanding of biological systems.

The BioChemGRAPH project aims to establish a collaboration between PDBe, ChEMBL and CCDC to create an easily accessible resource that integrates structural, functional and biochemical annotations of small molecule data in one place. These data will be added into an existing community-driven integration platform, namely PDBe-KB, which already provides an aggregated view of structural and functional annotations for macromolecules in the PDB. Building on PDBe-KB's efforts in the macromolecular community, this project will promote interoperability between small molecule resources by implementing common data standards. The project also aims to improve the findability and accessibility of small molecule annotations via uniform data access mechanisms and develop intuitive web components to visualise these valuable data through web interfaces. It will significantly increase the synergies between structural and biochemical data and will lead to increased understanding of the role of small molecules in biological systems (e.g. enzymatic mechanisms) and translational research in a number of areas, including synthetic biology, target validation and drug development. Interconnections between targets and small molecules, which currently require manual collation, would be automatically established as part of the proposed project and could help with target validation and potential drug repurposing, highlighting potential cross-reactivity and side effects. In order to achieve this automated linking, we will build upon and expand UniChem, a service that offers mappings between small molecule entries from numerous data resources. With the help of a readily available integrated resource, we will develop user-friendly web pages aggregating all the data for a given small molecule and relevant macromolecules. Advanced users will also benefit from the expanded programmatic access mechanisms to the integrated data. The new infrastructure will thus help researchers by providing a robust, time-saving mechanism to obtain all relevant small molecule-related data and associated biochemical, structural and functional information on macromolecules.

Technical Summary

BioChemGRAPH will develop and sustain an automated process to integrate information for the common set of small molecules from three well maintained and curated core data archives: structures of >30,000 unique small molecules observed in complex with macromolecules from the Protein Data Bank (PDB), small molecule crystal structures from the Cambridge Structural Database (CSD) and associated biochemical assay and target information from ChEMBL. These data will be integrated into the PDBe-KB data resource, a community-driven effort that integrates structural and functional annotations for macromolecules from >20 international data resources.
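
As an illustration only, the idea of a "common set of small molecules" can be sketched as grouping records from each archive by a shared structure key and keeping the keys that occur in every archive. The sketch below assumes records are matched on a structure key such as a standard InChIKey; the function name, record layout, identifiers and keys are hypothetical placeholders, not the project's actual pipeline or real data.

```python
from collections import defaultdict


def common_small_molecules(records):
    """Return structure keys present in every resource, with their local IDs.

    `records` is an iterable of (resource, local_id, structure_key) tuples.
    This record layout is an assumption made for illustration.
    """
    by_key = defaultdict(dict)
    resources = set()
    for resource, local_id, structure_key in records:
        resources.add(resource)
        # Collect every local identifier seen for this key in this resource.
        by_key[structure_key].setdefault(resource, []).append(local_id)
    # Keep only keys that were observed in all resources.
    return {key: ids for key, ids in by_key.items() if set(ids) == resources}


# Placeholder records; the identifiers and keys are made up.
example_records = [
    ("PDB", "LIG1", "KEY-A"),
    ("ChEMBL", "CHEMBL_X1", "KEY-A"),
    ("CSD", "REFCODE1", "KEY-A"),
    ("PDB", "LIG2", "KEY-B"),
    ("ChEMBL", "CHEMBL_X2", "KEY-B"),
]
common = common_small_molecules(example_records)
```

Here only the first key would be retained, because the second has no CSD record.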

To achieve this objective, we will build upon and expand UniChem, a service that offers mappings between entries describing small molecules present in numerous data resources. We will map the target information between ChEMBL and the PDB by using UniProtKB accessions cross-referenced in each resource. The project will implement common data standards for the exchange of annotations and will improve findability and accessibility via uniform data access mechanisms (RESTful API and FTP), and via intuitive web components to visualise the data. With the help of this readily available integrated resource, we will develop user-friendly web interfaces aggregating all the data for a given small molecule and its macromolecular binding partners. BioChemGRAPH will thus address a key data integration challenge by developing a robust, time-saving mechanism to obtain all small molecule-related data and associated biochemical, structural and functional information on macromolecules, facilitating increased understanding of the role of small molecules in biological systems (e.g. enzymatic mechanisms) and translational research in a broad range of fields, including synthetic biology, target validation and drug development.
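
The target mapping described above amounts to a join on a shared UniProtKB accession. A minimal sketch of that join follows, assuming each resource can export simple tuples that carry the accession; the function name, tuple layouts and all identifiers are hypothetical and do not reflect the actual ChEMBL or PDBe schemas.

```python
def link_targets_by_uniprot(chembl_targets, pdb_chains):
    """Join ChEMBL targets to PDB chains that share a UniProtKB accession.

    chembl_targets: iterable of (target_id, accession) tuples.
    pdb_chains: iterable of (entry_id, chain_id, accession) tuples.
    Both layouts are assumptions made for illustration.
    """
    # Index PDB chains by accession so each target lookup is a dict access.
    pdb_by_accession = {}
    for entry_id, chain_id, accession in pdb_chains:
        pdb_by_accession.setdefault(accession, []).append((entry_id, chain_id))

    links = []
    for target_id, accession in chembl_targets:
        # Targets without a matching accession simply produce no links.
        for entry_id, chain_id in pdb_by_accession.get(accession, []):
            links.append((target_id, accession, entry_id, chain_id))
    return links


# Placeholder inputs; none of these identifiers are real.
example_links = link_targets_by_uniprot(
    [("TARGET_1", "ACC0001"), ("TARGET_2", "ACC0002")],
    [("ENTRY_1", "A", "ACC0001"), ("ENTRY_2", "B", "ACC0001"), ("ENTRY_3", "A", "ACC0003")],
)
```

Indexing one side by accession keeps the join linear in the size of the inputs, which matters when either archive holds many cross-references.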

Planned Impact

Integrating data from a large number of resources is a challenging but essential task in the new era of data-driven biology. However, keeping track of the increasing number of individual data resources, their curation practices, policies and data formats is a nearly impossible task. Core data resources are uniquely positioned to bring together the growing community of data providers to agree common data exchange standards and thus facilitate the creation of integrated resources with robust data access mechanisms, offering potential time savings to users. PDBe-KB is one such resource that brings together >20 resources from the Structural Bioinformatics community, including MetalPDB, MCSA and 3DLigandSite, which provide functional annotations for structures in the PDB.

The BioChemGRAPH project will deliver an integrative data resource that will serve as a central repository for structural, functional and biochemical annotations for the common set of small molecular compounds (e.g. drugs, cofactors, inhibitors) and associated target information contained in the PDB, ChEMBL and CSD archives. The integration efforts will not only enrich the information already available in these resources, but also make it more accessible to the broader scientific user community and achieve further impact by combining it with annotations contributed to PDBe-KB. The focus of this work was identified as a clear priority based on feedback from our users through surveys and interviews. The survey received 113 responses from academia (63%) and industry (37%) and guided the definition of needs that will be addressed by this proposal. In particular, the resource will enable users to answer scientific questions that were highlighted in the survey, such as (1) are there annotations explaining the biological role of a particular ligand; (2) are there other small molecules that have the same scaffold or fragments, or bind in similar binding pockets; (3) where are the functional groups of a small molecule based on the statistics of observed atomic-level interactions of the ligand; and (4) which fragment hotspots overlay with the small molecule of interest.

As evident from the support letters, BioChemGRAPH will not only benefit the academic research community but will also provide valuable support for small and medium-sized enterprises and large pharmaceutical companies. The integrated knowledge graph will support a broad range of research areas: for example, groups studying microbial resistance, rare diseases, drug design or target validation could more easily conduct comparative studies and transfer structural, functional and interaction-specific annotations between similar small molecules. The resource will also serve as a valuable research tool for investigating the effects of genetic variation and drug resistance, supporting drug repurposing efforts, understanding enzymatic mechanisms at the molecular level, designing synthetic enzymes and pursuing other translational research goals. It will enable explorative analyses based on structural similarity and the commonality of scaffolds, fragments or binding pockets. It will also significantly help initiatives like Open Targets and the Illuminating the Druggable Genome project by providing rich data sets in a standardised and interoperable format.

BioChemGRAPH will also deliver a common library of data visualisation tools that will expose the collated data in a uniform manner and lower the barrier to accessing bioactivity and structural data for small molecules. This open source library will potentially standardise visualisation tools for displaying small molecule-related annotations for all relevant data resources. These visualisation tools will also be used to enhance the aggregated views of proteins, effectively linking the small molecule and macromolecule-specific web interfaces.
 
Title PDBe graph database (ligands) 
Description We integrated ligand-related information into the PDBe graph database as part of the BioChemGraph project. This included atomic-level residue-residue interactions and cofactor-like, reactant-like and drug-like annotations. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact The data added to the PDBe graph database as part of the BioChemGraph project make the database even more useful for data discovery, especially in the context of drug design. 
URL https://www.ebi.ac.uk/pdbe/pdbe-kb/graph
 
Title LigEnv web component 
Description LigEnv is a data visualisation web component to display the atomic level environment of a small molecule in relation to its binding pocket. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact This is a reusable web component that we use on the PDBe entry pages. 
URL https://github.com/PDBeurope/ligand-env
 
Title PDBe Arpeggio 
Description An open-source wrapper for the Arpeggio tool that we use to generate atomic-level residue-residue interaction data. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact We use this software to generate atom-level interactions data for ligands in the PDB archive. 
URL https://github.com/PDBeurope/arpeggio
 
Title PDBe CCDUtils 
Description An open-source Python package which is a wrapper for RDKit and provides various methods for manipulating small-molecule data. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact We use this software for all the ligand-related work in PDBe and PDBe-KB. 
URL https://github.com/PDBeurope/ccdutils
 
Description 3D-BioInfo Annual Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PDBe-KB (FunPDBe, BioChemGraph, covariation-related works), 3D-Beacons and AlphaFold DB were presented at the 3D-BioInfo Annual Meeting 2021.
Year(s) Of Engagement Activity 2021
 
Description 3D-BioInfo community webinar series 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Webinar giving an overview of the 3D Beacons project
Year(s) Of Engagement Activity 2021
 
Description 6th European Crystallographic School 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation, introducing PDBe and PDBe-KB resources
Year(s) Of Engagement Activity 2021
URL https://akcongress.com/ecs6/
 
Description A guide to analysing binding sites in protein structures 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Introduction to interpreting ligand binding sites in protein structures using tools at PDBe and PDBe-KB
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/training/events/guide-analysing-binding-sites-protein-structures/
 
Description Bringing molecular structure to life: 50 years of the PDB 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation of PDBe and PDBe-KB resources to make PDB data more accessible
Year(s) Of Engagement Activity 2021
URL https://www.embl.org/about/info/course-and-conference-office/events/pdb21-01/
 
Description British Crystallographic Association (BCA) 2021 meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presentation of recent improvements at PDBe-KB pages
Year(s) Of Engagement Activity 2021
 
Description CCP4 Study Weekend 2021 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presentation of recent improvements at PDBe-KB pages and Q&A about PDB data and deposition
Year(s) Of Engagement Activity 2021
URL https://ccp4sw2021.meeting-mojo.com/page/agenda
 
Description EBI/Sanger Seminar Series 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk describing projects for enriching structural data, explaining how it has enabled the development of PDBe-KB aggregated views
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/about/events/events/internal-seminar/2021/ebisanger-seminar-series-sameer-vela...
 
Description EIPP Bioinformatics Predocs Course 2021 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Training session with PhD students to introduce them to accessing PDB data at PDBe and PDBe-KB resources
Year(s) Of Engagement Activity 2021
 
Description EMBL-EBI virtual Pavia workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Training session with PhD students to introduce them to accessing PDB data at PDBe and PDBe-KB resources
Year(s) Of Engagement Activity 2021
URL https://www.ebi.ac.uk/training/events/embl-ebi-workshop-university-pavia-2021/
 
Description EMBL-EBI workshop: The Open University, 2021 (Virtual) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Undergraduate students
Results and Impact Presentation, introducing PDBe and PDBe-KB resources
Year(s) Of Engagement Activity 2021
 
Description Infection Biology Retreat 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PDBe-KB, 3D-Beacons and the AlphaFold DB were presented at the EMBL Infection Biology Retreat in the context of infectious diseases.
Year(s) Of Engagement Activity 2021
 
Description King's College Structural Seminar Series 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Presentation, introducing PDBe-KB resources
Year(s) Of Engagement Activity 2021
 
Description PDBe-KB at ECCB 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A presentation that gave an update on the latest developments in PDBe-KB, including work related to the BioChemGraph project.
Year(s) Of Engagement Activity 2022
 
Description PSDI Virtual Conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PDBe-KB, with a focus on ligand-related work (i.e. BioChemGraph project), was presented to an audience of mainly drug discovery and development experts.
Year(s) Of Engagement Activity 2021