BioChemGRAPH - an integrated knowledge graph to facilitate basic and translational research

Lead Research Organisation: Cambridge Crystallographic Data Centre
Department Name: The Cambridge Crystallographic Data Cent

Technical Summary

BioChemGRAPH will develop and sustain an automated process to integrate information for the common set of small molecules from three well-maintained, curated core data archives: structures of >30,000 unique small molecules observed in complex with macromolecules from the Protein Data Bank (PDB), small molecule crystal structures from the Cambridge Structural Database (CSD) and associated biochemical assay and target information from ChEMBL. These data will be integrated into the PDBe-KB data resource, a community-driven effort that integrates structural and functional annotations for macromolecules from >20 international data resources.

To achieve this objective, we will build upon and expand UniChem, a service that offers mappings between entries describing small molecules present in numerous data resources. We will map the target information between ChEMBL and the PDB by using UniProtKB accessions cross-referenced in each resource. The project will implement common data standards for the exchange of annotations and will improve findability and accessibility via uniform data access mechanisms (RESTful API and FTP), and via intuitive web components to visualise the data. With the help of this readily available integrated resource, we will develop user-friendly web interfaces aggregating all the data for a given small molecule and its macromolecular binding partners. BioChemGRAPH will thus address a key data integration challenge by developing a robust, time-saving mechanism to obtain all small molecule-related data and associated biochemical, structural and functional information on macromolecules, facilitating increased understanding of the role of small molecules in biological systems (e.g. enzymatic mechanisms) and translational research in a broad range of fields, including synthetic biology, target validation and drug development.
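
The target mapping described above, which links ChEMBL and the PDB through UniProtKB accessions cross-referenced in each resource, can be illustrated with a minimal sketch. It is not the project's implementation: the record layout, the function name `map_targets_to_structures` and every identifier and accession below are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

# Hypothetical placeholder records; real identifiers and accessions would come
# from ChEMBL and the PDB. Each pair is (resource identifier, UniProtKB accession).
chembl_targets = [
    ("TARGET_A", "ACC0001"),
    ("TARGET_B", "ACC0002"),
    ("TARGET_C", "ACC0003"),
]
pdb_entries = [
    ("1aaa", "ACC0001"),
    ("1aab", "ACC0001"),
    ("3ccc", "ACC0003"),
    ("9zzz", "ACC0009"),
]


def map_targets_to_structures(targets, entries):
    """Link each target to the PDB entries that share its UniProtKB accession."""
    # Index the PDB side by accession so each target needs only one lookup.
    pdb_by_accession = defaultdict(set)
    for pdb_id, accession in entries:
        pdb_by_accession[accession].add(pdb_id)

    mapping = {}
    for target_id, accession in targets:
        # A membership test does not create a new key in the defaultdict.
        if accession in pdb_by_accession:
            mapping[target_id] = sorted(pdb_by_accession[accession])
    return mapping


target_to_pdb = map_targets_to_structures(chembl_targets, pdb_entries)
print(target_to_pdb)
```

Targets with no structure sharing their accession (TARGET_B here) are simply absent from the result, and structures whose accession matches no target (9zzz) are ignored.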

Planned Impact

Integrating data from a large number of resources is a challenging but essential task in the new era of data-driven biology. However, keeping track of the increasing number of individual data resources, their curation practices, policies and data formats is a nearly impossible task. Core data resources are uniquely positioned to bring together the growing community of data providers to agree common data exchange standards and thus facilitate the creation of integrated resources with robust data access mechanisms, offering potential time-saving to users. PDBe-KB is one such resource that brings together >20 resources from the Structural Bioinformatics community, including MetalPDB, MCSA and 3DLigandSite, which provide functional annotations for structures in the PDB.

The BioChemGRAPH project will deliver an integrative data resource that will serve as a central repository for structural, functional and biochemical annotations for the common set of small molecular compounds (e.g. drugs, cofactors, inhibitors) and associated target information contained in the PDB, ChEMBL and CSD archives. The integration efforts will not only enrich the information already available in these resources, but also make it more accessible to the broader scientific user community and achieve further impact by combining it with annotations contributed to PDBe-KB. The focus of this work was identified as a clear priority based on feedback from our users through surveys and interviews. The survey received 113 responses from academia (63%) and industry (37%) and guided the definition of needs that will be addressed by this proposal. In particular, the resource will enable users to answer scientific questions that were highlighted in the survey, such as (1) are there annotations explaining the biological role of a particular ligand; (2) are there other small molecules that have the same scaffold or fragments, or bind in similar binding pockets; (3) where are the functional groups of a small molecule based on the statistics of observed atomic-level interactions of the ligand; and (4) which fragment hotspots overlay with the small molecule of interest.

As evident from the support letters, BioChemGRAPH will not only benefit the academic research community but will also provide valuable support for small and medium-sized enterprises and large pharmaceutical companies. The integrated knowledge graph will support a broad range of research areas: for example, groups studying microbial resistance, rare diseases, drug design or target validation could more easily conduct comparative studies and transfer structural, functional and interaction-specific annotations between similar small molecules. The resource will also serve as a valuable research tool for investigating the effects of genetic variation and drug resistance, supporting drug repurposing efforts, understanding enzymatic mechanisms at the molecular level, designing synthetic enzymes and pursuing other translational research goals. It will enable explorative analyses based on structural similarity and on the commonality of scaffolds, fragments or binding pockets. It will also significantly help initiatives like Open Targets and the Illuminating the Druggable Genome project by providing rich data sets in a standardised and interoperable format.

BioChemGRAPH will also deliver a common library of data visualisation tools that will expose the collated data in a uniform manner and lower the barrier to accessing bioactivity and structural data for small molecules. This open source library will potentially standardise visualisation tools for displaying small molecule-related annotations for all relevant data resources. These visualisation tools will also be used to enhance the aggregated views of proteins, effectively linking the small molecule and macromolecule-specific web interfaces.

Publications

 
Title A method for automatically intersecting structural chemistry and biology data resources 
Description Building on previous work for reliably generating standard identifiers for small-molecule crystal structures undertaken as part of this award, we established automated workflows to identify the intersection between the Cambridge Structural Database (CSD) and EBI's UniChem resource. The intersection is identified using InChIs for chemical components in UniChem and the CSD and the method is implemented using GitHub Workflows. The intersection is regenerated on a regular basis to ensure it is kept up to date as resources grow. 
Type Of Material Improvements to research infrastructure 
Year Produced 2022 
Provided To Others? No  
Impact This process has enabled us to identify the intersection between the CSD, PDB and ChEMBL and thus links between chemical and biological data resources that can be incorporated into a knowledge graph connecting biological and chemical data. Standard identifiers and links to the CSD for the intersecting structures will be submitted for inclusion in UniChem so they become available to other researchers wishing to connect across chemical and biological data resources. We also intend to build on the automated workflow described here to help maintain links between the CSD and other chemistry data resources. We anticipate being able to report on these impacts more fully in a future submission. 
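
The InChI-based intersection described in this entry can be illustrated with a minimal sketch: index each resource by InChI and keep the strings present in both. This is an assumption-laden illustration, not the workflow implemented by the project. The function names and all record identifiers are hypothetical, and the three InChI strings (ethanol, benzene, water) serve only as sample data.

```python
# Illustrative standard InChI strings used only as sample data.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
WATER = "InChI=1S/H2O/h1H2"

# Hypothetical (identifier, InChI) records standing in for the two resources.
csd_records = [
    ("CSDENTRY01", ETHANOL),
    ("CSDENTRY02", BENZENE),
    ("CSDENTRY03", ETHANOL),
]
unichem_records = [
    ("UCI_1", ETHANOL),
    ("UCI_2", WATER),
]


def index_by_inchi(records):
    """Group record identifiers under the InChI string they share."""
    index = {}
    for identifier, inchi in records:
        index.setdefault(inchi.strip(), set()).add(identifier)
    return index


def intersect_by_inchi(left_records, right_records):
    """Return {InChI: (left identifiers, right identifiers)} for shared InChIs."""
    left = index_by_inchi(left_records)
    right = index_by_inchi(right_records)
    shared = left.keys() & right.keys()  # set intersection of dictionary keys
    return {
        inchi: (sorted(left[inchi]), sorted(right[inchi]))
        for inchi in sorted(shared)
    }


intersection = intersect_by_inchi(csd_records, unichem_records)
print(intersection)
```

Because the comparison is exact string equality, it relies on both resources producing InChIs in the same standard form. Rerunning the function on refreshed inputs regenerates the intersection, in the spirit of the regular regeneration described in the entry.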
 
Title A methodology for reliably generating standard identifiers for small-molecule crystal structures 
Description The Cambridge Structural Database (CSD) contains over 1 million experimentally determined crystal structures, many of which are relevant to biological processes. Our goal is to be able to incorporate links to these crystal structures into data resources that bridge chemistry and biology. We aim to do this using the IUPAC International Chemical Identifier (InChI), which provides a standard unique identifier for a chemical compound. In order to generate reliable InChIs from crystal structure data, we needed to develop a methodology that takes into account experimental artefacts, such as crystallographic disorder, that can lead to an imperfect 3D model. The methodology developed exploits the 2D chemical representation of a structure and the 3D crystallographic data stored in a CSD entry to generate an InChI that encapsulates chemical connectivity and stereochemistry. 
Type Of Material Improvements to research infrastructure 
Year Produced 2022 
Provided To Others? No  
Impact Applying this methodology has enabled InChIs to be generated for almost 95% of organic crystal structures in the CSD. Being able to reliably generate InChIs for crystal structures allows us to establish sustainable workflows to intersect the CSD with EBI and other biological data resources in order to integrate knowledge from chemical and biological domains. This will be described as a separate Research Tool and Method. The methodology developed has also been incorporated into a development version of the CSD Python API which is scheduled to be released outside of the current reporting period. This will enable structural chemists in academia and industry to take advantage of InChIs to link across their own chemistry and biology resources. We will report on this in a future submission. 
 
Description One slide presented in a talk at the 25th IUCr Congress 14-22 Aug 2021 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact One slide on the intersection between CSD/PDB/ChEMBL presented at the 25th IUCr Congress 14-22 Aug 2021: Presentation on the CSD in a session on "Exemplary practice in chemical, biological and materials database archiving". 40-50 people attended.
Year(s) Of Engagement Activity 2021
 
Description Presentation at the CCDC Journal Club 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Members from different teams (software development, science, commercial) across the organisation attended the talk and provided feedback and input.
Year(s) Of Engagement Activity 2021
 
Description Presentation at the CCDC Science Showcase meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Talk to report progress made on project to the wider organisation. Attendees from other departments could provide feedback and suggestions.
Year(s) Of Engagement Activity 2021
 
Description Presentation at the Chem-Bio Informatics Society (CBI) Annual Meeting 2021 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Francis L Atkinson presented a talk titled "Data integration to facilitate drug discovery" upon invitation to the Chem-Bio Informatics Society (CBI) Annual Meeting in Japan. The talk was well attended and generated discussion among the panellists and audience.
Year(s) Of Engagement Activity 2021
URL https://cbi-society.org/taikai/taikai21/SS/SS-12_2021CBI_Abs_CCDCPDBj.pdf
 
Description Project mentioned in a talk given in a session on "Enabling FAIR Publication, Exchange, and Reuse of Chemistry Data" at the Fall 2021 ACS National Meeting. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact BioChemGraph mentioned in "FAIR Crystallographic Data Services: Bridging Academia and Industry" - talk given in a session on "Enabling FAIR Publication, Exchange, and Reuse of Chemistry Data" at the Fall 2021 ACS National Meeting - noted as an example of how CCDC can more fully embrace semantic technologies. Reached 40-50 people.
Year(s) Of Engagement Activity 2021
 
Description Talk at February 2023 Cambridge Cheminformatics Network Meeting on "InChIng Towards Better Molecular Identifiers" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Presentation to highlight the importance of the International Chemical Identifier (InChI) and how it can be used to enable projects such as BioChemGraph. Jointly given with the Secretary of the IUPAC InChI Subcommittee. Prompted discussions with the wider community about current limitations of InChI and how these might be overcome.
Year(s) Of Engagement Activity 2023
 
Description Talk entitled "Bridging Structural Chemistry Communities through FAIR" at ConTech Pharma 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk targeted at Pharma Industry and solution providers. About 20 people engaged with it.
Year(s) Of Engagement Activity 2022
URL https://www.stm-publishing.com/contech-pharma-delivering-successful-fair-data-projects-1st-and-2nd-m...
 
Description Talk entitled "Using InChIs to connect across dimensions and domains" at Fall 2022 ACS National Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk described outcomes of work undertaken for the BioChemGraph project to establish reliable standard chemical identifiers (InChIs) for crystal structures in the Cambridge Structural Database. It contributed to a symposium highlighting the impact of InChI across scientific domains and discussing priorities for future development.
Year(s) Of Engagement Activity 2022