BioChemGRAPH - an integrated knowledge graph to facilitate basic and translational research
Lead Research Organisation:
Cambridge Crystallographic Data Centre
Department Name: The Cambridge Crystallographic Data Cent
Technical Summary
BioChemGRAPH will develop and sustain an automated process to integrate information for the common set of small molecules from three well maintained and curated core data archives: structures of >30,000 unique small molecules observed in complex with macromolecules from the Protein Data Bank (PDB), small molecule crystal structures from the Cambridge Structural Database (CSD) and associated biochemical assay and target information from ChEMBL. These data will be integrated into the PDBe-KB data resource, a community-driven effort that integrates structural and functional annotations for macromolecules from >20 international data resources.
To achieve this objective, we will build upon and expand UniChem, a service that offers mappings between entries describing small molecules present in numerous data resources. We will map the target information between ChEMBL and the PDB by using UniProtKB accessions cross-referenced in each resource. The project will implement common data standards for the exchange of annotations and will improve findability and accessibility via uniform data access mechanisms (RESTful API and FTP), and via intuitive web components to visualise the data. With the help of this readily available integrated resource, we will develop user-friendly web interfaces aggregating all the data for a given small molecule and its macromolecular binding partners. BioChemGRAPH will thus address a key data integration challenge by developing a robust, time-saving mechanism to obtain all small molecule-related data and associated biochemical, structural and functional information on macromolecules, facilitating increased understanding of the role of small molecules in biological systems (e.g. enzymatic mechanisms) and translational research in a broad range of fields, including synthetic biology, target validation and drug development.
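The target mapping described above amounts to a join on a shared UniProtKB accession. The following is a minimal sketch of that idea, not the project's implementation; all identifiers, accessions and the `map_by_accession` helper are hypothetical placeholders introduced for illustration.

```python
from collections import defaultdict

# Hypothetical cross-reference records: (resource identifier, UniProtKB accession).
# None of these values are real ChEMBL, PDB or UniProtKB identifiers.
chembl_targets = [
    ("CHEMBL_TARGET_A", "UNIPROT_ACC_1"),
    ("CHEMBL_TARGET_B", "UNIPROT_ACC_2"),
    ("CHEMBL_TARGET_C", "UNIPROT_ACC_3"),
]
pdb_entries = [
    ("PDB_ENTRY_1", "UNIPROT_ACC_1"),
    ("PDB_ENTRY_2", "UNIPROT_ACC_1"),
    ("PDB_ENTRY_3", "UNIPROT_ACC_3"),
]


def map_by_accession(left, right):
    """Join two lists of (identifier, accession) pairs on the accession."""
    # Index the right-hand resource by accession for constant-time lookup.
    right_by_accession = defaultdict(list)
    for identifier, accession in right:
        right_by_accession[accession].append(identifier)

    # Keep only left-hand identifiers whose accession also occurs on the right.
    mapping = {}
    for identifier, accession in left:
        if accession in right_by_accession:
            mapping[identifier] = sorted(right_by_accession[accession])
    return mapping


target_to_pdb = map_by_accession(chembl_targets, pdb_entries)
print(target_to_pdb)
```

A target with no structure in the other resource (here `CHEMBL_TARGET_B`) simply drops out of the mapping; a real pipeline would probably also need to handle obsolete or merged accessions.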
Planned Impact
Integrating data from a large number of resources is a challenging but essential task in the new era of data-driven biology. However, keeping track of the increasing number of individual data resources, their curation practices, policies and data formats is a nearly impossible task. Core data resources are uniquely positioned to bring together the growing community of data providers to agree common data exchange standards and thus facilitate the creation of integrated resources with robust data access mechanisms, offering potential time savings to users. PDBe-KB is one such resource that brings together >20 resources from the Structural Bioinformatics community, including MetalPDB, MCSA and 3DLigandSite, which provide functional annotations for structures in the PDB.
The BioChemGRAPH project will deliver an integrative data resource that will serve as a central repository for structural, functional and biochemical annotations for the common set of small molecular compounds (e.g. drugs, cofactors, inhibitors) and associated target information contained in the PDB, ChEMBL and CSD archives. The integration efforts will not only enrich the information already available in these resources, but also make it more accessible to the broader scientific user community and achieve further impact by combining it with annotations contributed to PDBe-KB. The focus of this work was identified as a clear priority based on feedback from our users through surveys and interviews. The survey received 113 responses from academia (63%) and industry (37%) and guided the definition of needs that will be addressed by this proposal. In particular, the resource will enable users to answer scientific questions that were highlighted in the survey, such as (1) are there annotations explaining the biological role of a particular ligand; (2) are there other small molecules that have the same scaffold or fragments, or bind in similar binding pockets; (3) where are the functional groups of a small molecule based on the statistics of observed atomic-level interactions of the ligand; and (4) which fragment hotspots overlay with the small molecule of interest.
As evident from the support letters, BioChemGRAPH will not only benefit the academic research community but will also provide valuable support for small and medium-sized enterprises and large pharmaceutical companies. The integrated knowledge graph will support a broad range of research areas. For example, groups studying microbial resistance, rare diseases, drug design or target validation could more easily conduct comparative studies and transfer structural, functional and interaction-specific annotations between similar small molecules. The resource will also serve as a valuable research tool for investigating the effects of genetic variation, drug resistance and drug repurposing, as well as for understanding enzymatic mechanisms at the molecular level, designing synthetic enzymes and pursuing other translational research goals. It will enable explorative analyses based on structural similarity and on the commonality of scaffolds, fragments or binding pockets. It will also significantly help initiatives like Open Targets and the Illuminating the Druggable Genome project by providing rich data sets in a standardised and interoperable format.
BioChemGRAPH will also deliver a common library of data visualisation tools that will expose the collated data in a uniform manner and lower the barrier to accessing bioactivity and structural data for small molecules. This open source library will potentially standardise visualisation tools for displaying small molecule-related annotations for all relevant data resources. These visualisation tools will also be used to enhance the aggregated views of proteins, effectively linking the small molecule and macromolecule-specific web interfaces.
Description | We have identified challenges associated with establishing links between different data resources in chemistry, crystallography and biology. We have subsequently developed solutions that address these challenges and implemented these in software and data systems. We have further developed existing technology that identifies preferred protein-ligand binding interactions indicated by small-molecule crystal structures to provide annotations that can be associated with residues in protein binding sites. These annotations are provided in machine-accessible form to enable their inclusion in knowledge graphs and other systems. |
Exploitation Route | The links between data resources in chemistry and biology that are now exposed in human and machine accessible resources can be used by others to aggregate data and information across domains to inform our understanding of biological mechanisms and guide the development of new therapeutic molecules. |
Sectors | Education Healthcare Pharmaceuticals and Medical Biotechnology |
URL | https://www.ebi.ac.uk/about/news/updates-from-data-resources/biochemgraph-data/ |
Title | A method for automatically intersecting structural chemistry and biology data resources |
Description | Building on previous work for reliably generating standard identifiers for small-molecule crystal structures undertaken as part of this award, we established automated workflows to identify the intersection between the Cambridge Structural Database (CSD) and EBI's UniChem resource. The intersection is identified using InChIs for chemical components in UniChem and the CSD and the method is implemented using GitHub Workflows. The intersection is regenerated on a regular basis to ensure it is kept up to date as resources grow. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2022 |
Provided To Others? | No |
Impact | This process has enabled us to identify the intersection between the CSD, PDB and ChEMBL and thus links between chemical and biological data resources that can be incorporated into a knowledge graph connecting biological and chemical data. Standard identifiers and links to the CSD for the intersecting structures will be submitted for inclusion in UniChem so they become available to other researchers wishing to connect across chemical and biological data resources. We also intend to build on the automated workflow described here to help maintain links between the CSD and other chemistry data resources. We anticipate being able to report on these impacts more fully in a future submission. |
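The InChI-based intersection described in this record can be pictured as indexing each resource by InChI string and keeping the keys common to both. The sketch below illustrates that idea only and is not the CCDC workflow; the entry identifiers and the `intersect_by_inchi` helper are hypothetical, while the three InChI strings are the standard InChIs for water, ethanol and benzene.

```python
from collections import defaultdict

WATER = "InChI=1S/H2O/h1H2"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

# Hypothetical (identifier, InChI) records standing in for each resource.
csd_records = [
    ("CSD_ENTRY_1", WATER),
    ("CSD_ENTRY_2", ETHANOL),
    ("CSD_ENTRY_3", ETHANOL),  # a second crystal structure of the same compound
]
unichem_records = [
    ("UNICHEM_1", ETHANOL),
    ("UNICHEM_2", BENZENE),
]


def index_by_inchi(records):
    """Build {inchi: set of identifiers} from (identifier, inchi) pairs."""
    index = defaultdict(set)
    for identifier, inchi in records:
        index[inchi.strip()].add(identifier)
    return index


def intersect_by_inchi(left_records, right_records):
    """Return the InChIs present in both inputs with the identifiers on each side."""
    left_index = index_by_inchi(left_records)
    right_index = index_by_inchi(right_records)
    shared = left_index.keys() & right_index.keys()
    return {
        inchi: {
            "csd": sorted(left_index[inchi]),
            "unichem": sorted(right_index[inchi]),
        }
        for inchi in sorted(shared)
    }


intersection = intersect_by_inchi(csd_records, unichem_records)
print(intersection)
```

Because the comparison is an exact string match, it only works if both resources generate InChIs with the same options, which is presumably why reliable identifier generation had to come first.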
Title | A methodology for annotating protein binding sites with interaction hotspots derived from small-molecule crystal structure data |
Description | The CCDC's Fragment Hotspots tool uses small-molecule crystal data derived from the Cambridge Structural Database to identify interactions that determine fragment binding in a ligand-binding site of a protein. The tool has been developed to enable annotation of protein residues in the Protein Data Bank in a form that can be incorporated into the PDBe Knowledge Base. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | The Fragment Hotspots tool can now operate on proteins represented in mmCIF format rather than the deprecated PDB format, enabling it to work on larger, more complex protein structures. Interaction hotspots in binding sites can be annotated by the Fragment Hotspots tool in a way that can be incorporated into the PDBe Knowledge Base. Fragment Hotspots has been packaged to enable easier deployment on a wider range of computational infrastructures. It is now more feasible to systematically run Fragment Hotspots analysis on a large number of protein structures. |
URL | https://github.com/ccdc-opensource/hotspots |
Title | A methodology for reliably generating standard identifiers for small-molecule crystal structures |
Description | The Cambridge Structural Database (CSD) contains over 1 million experimentally determined crystal structures, many of which are relevant to biological processes. Our goal is to be able to incorporate links to these crystal structures into data resources that bridge chemistry and biology. We aim to do this using the IUPAC International Chemical Identifier (InChI), which provides a standard unique identifier for a chemical compound. In order to generate reliable InChIs from crystal structure data, we needed to develop a methodology that takes into account experimental artefacts such as crystallographic disorder that can lead to an imperfect 3D model. The methodology developed exploits the 2D chemical representation of a structure and the 3D crystallographic data stored in a CSD entry to generate an InChI that encapsulates chemical connectivity and stereochemistry. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2022 |
Provided To Others? | No |
Impact | Applying this methodology has enabled InChIs to be generated for almost 95% of organic crystal structures in the CSD. Being able to reliably generate InChIs for crystal structures allows us to establish sustainable workflows to intersect the CSD with EBI and other biological data resources in order to integrate knowledge from chemical and biological domains. This will be described as a separate Research Tool and Method. The methodology developed has also been incorporated into a development version of the CSD Python API which is scheduled to be released outside of the current reporting period. This will enable structural chemists in academia and industry to take advantage of InChIs to link across their own chemistry and biology resources. We will report on this in a future submission. |
Title | Publicly accessible links between structural chemistry and biology data resources |
Description | The EBI UniChem data resource has been populated with InChIs for crystal structures published in the Cambridge Structural Database (CSD) and their corresponding CSD database entry identifiers. Automated workflows have been established to ensure that UniChem will be regularly updated with InChIs and identifiers as the CSD grows. A challenge here was establishing how to link between a single reference chemical structure and possibly many crystal forms that contain that structure. Heuristics were developed to provide the mappings that best fit with the expectations of UniChem and its user community. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2024 |
Provided To Others? | Yes |
Impact | This development will enable researchers and data services to connect biological structures and data in EBI resources to small molecule crystal structures and their properties. Connections can also be identified between CSD structures and records in other data resources indexed by UniChem. These connections can be added to knowledge graphs, including the PDBe Knowledge Base, to allow correlation of structures and properties across chemistry and biology by AI and other computational methods. As a result of this development, links between structures in the CSD and structures in PDBe and ChEMBL are now being added to information pages in these data resources. |
URL | https://www.ebi.ac.uk/unichem/ |
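The one-to-many problem mentioned in this record (one reference chemical structure, possibly many crystal forms) can be illustrated by grouping entries on their InChI and then choosing a single representative per group. The heuristics actually used are not described here, so the rule below (take the lexicographically first entry identifier) is purely an assumption for illustration, as are the identifiers and the `pick_representatives` helper.

```python
from collections import defaultdict

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

# Hypothetical crystal structure entries: several forms may share one InChI.
crystal_entries = [
    ("ENTRY_B02", ETHANOL),
    ("ENTRY_B01", ETHANOL),
    ("ENTRY_A01", BENZENE),
    ("ENTRY_B03", ETHANOL),
]


def pick_representatives(entries):
    """Group (entry_id, inchi) pairs by InChI and select one entry per group.

    Selection rule (an illustrative assumption, not the project's heuristic):
    the lexicographically smallest entry identifier represents the group.
    """
    forms = defaultdict(list)
    for entry_id, inchi in entries:
        forms[inchi].append(entry_id)
    return {
        inchi: {"representative": min(ids), "all_forms": sorted(ids)}
        for inchi, ids in forms.items()
    }


representatives = pick_representatives(crystal_entries)
print(representatives[ETHANOL])
```

Keeping the full list of forms alongside the representative means the single link exposed to a mapping service does not discard the other crystal structures of the same compound.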
Title | CSD Python API enhancements |
Description | The CSD Python API has been enhanced with methods that can generate InChIs and InChIKeys for chemical components in crystal structures in the Cambridge Structural Database as well as those loaded from external sources. |
Type Of Technology | Software |
Year Produced | 2023 |
Impact | This is an important addition to core functionality that will increase the ability of researchers worldwide to connect crystal structure data contained in the Cambridge Structural Database (CSD) into a wide range of computational research workflows. It also lays the foundation for more effective retrieval and navigation of crystal structure data based on the identity of constituent chemical components. |
Title | Fragment Hotspots Enhancements |
Description | Given a whole protein and no prior knowledge, the Fragment Hotspots application identifies hotspots within protein binding sites that are favorable for acceptor, donor and hydrophobic groups based on experimental 3D crystallographic data. The enhancement delivered by this project is to implement output of a machine-readable representation of these hotspots to facilitate interoperability with other tools and products, in particular the PDBe Knowledge Base. |
Type Of Technology | Software |
Year Produced | 2024 |
Open Source License? | Yes |
Impact | It is now possible to generate hotspot annotations systematically across large collections of proteins such as the Protein Data Bank and output these in a machine-readable form. |
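A machine-readable hotspot annotation of the kind described in this record could look like a list of per-residue records serialised to JSON. The sketch below is only an illustration: the field names, the example values and the `to_json` helper are assumptions, and it does not reproduce the Fragment Hotspots output format or the PDBe Knowledge Base exchange schema.

```python
import json
from dataclasses import asdict, dataclass

# Group types named in the description above.
ALLOWED_GROUP_TYPES = {"acceptor", "donor", "hydrophobic"}


@dataclass
class HotspotAnnotation:
    """One residue-level hotspot annotation (hypothetical field names)."""

    chain_id: str
    residue_number: int
    residue_name: str
    group_type: str
    score: float

    def __post_init__(self):
        # Reject group types other than the three mentioned in the record.
        if self.group_type not in ALLOWED_GROUP_TYPES:
            raise ValueError(f"unknown group type: {self.group_type!r}")


def to_json(entry_id, annotations):
    """Serialise the annotations for one protein entry to a JSON string."""
    payload = {
        "entry_id": entry_id,
        "annotations": [asdict(a) for a in annotations],
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder values, not results from a real calculation.
example = [
    HotspotAnnotation("A", 42, "ASP", "acceptor", 17.5),
    HotspotAnnotation("A", 87, "LEU", "hydrophobic", 12.0),
]
document = to_json("EXAMPLE_ENTRY", example)
print(document)
```

Validating the group type at construction time keeps malformed records out of the serialised file, which matters when annotations are generated systematically across many structures.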
Description | EBI Structural Bioinformatics Course 2024 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Participation in a BioChemGraph Session and Tutorial providing examples of the annotations of protein binding sites that can be generated using Fragment Hotspots and the scientific underpinnings of these. |
Year(s) Of Engagement Activity | 2024 |
URL | https://www.ebi.ac.uk/training/events/structural-bioinformatics-3/ |
Description | One slide presented in a talk at the 25th IUCr Congress 14-22 Aug 2021 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | One slide on the intersection between CSD/PDB/ChEMBL presented at the 25th IUCr Congress 14-22 Aug 2021: Presentation on the CSD in a session on "Exemplary practice in chemical, biological and materials database archiving". 40-50 people attended. |
Year(s) Of Engagement Activity | 2021 |
Description | Presentation at the CCDC Journal Club |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Members from different teams (software development, science, commercial) across the organisation attended the talk and provided feedback and input. |
Year(s) Of Engagement Activity | 2021 |
Description | Presentation at the CCDC Science Showcase meeting |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Talk to report progress made on project to the wider organisation. Attendees from other departments could provide feedback and suggestions. |
Year(s) Of Engagement Activity | 2021 |
Description | Presentation at the Chem-Bio Informatics Society (CBI) Annual Meeting 2021 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Francis L Atkinson presented a talk titled "Data integration to facilitate drug discovery" upon invitation to the Chem-Bio Informatics Society (CBI) Annual Meeting in Japan. The talk was well attended and generated discussion among the panellists and audience. |
Year(s) Of Engagement Activity | 2021 |
URL | https://cbi-society.org/taikai/taikai21/SS/SS-12_2021CBI_Abs_CCDCPDBj.pdf |
Description | Presentations at regional Crystallography Meetings 2023 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | BioChemGraph was mentioned in presentations given at regional crystallography meetings in Belgium and Croatia as an example of new possibilities for using insights generated from data and knowledge in the Cambridge Structural Database to advance research across domains. |
Year(s) Of Engagement Activity | 2023 |
Description | Project mentioned in a talk given in a session on "Enabling FAIR Publication, Exchange, and Reuse of Chemistry Data" at the Fall 2021 ACS National Meeting. |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | BioChemGraph mentioned in "FAIR Crystallographic Data Services: Bridging Academia and Industry" - talk given in a session on "Enabling FAIR Publication, Exchange, and Reuse of Chemistry Data" at the Fall 2021 ACS National Meeting - noted as an example of how CCDC can more fully embrace semantic technologies. Reached 40-50 people. |
Year(s) Of Engagement Activity | 2021 |
Description | Talk at February 2023 Cambridge Cheminformatics Network Meeting on "InChIng Towards Better Molecular Identifiers" |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation to highlight the importance of the International Chemical Identifier (InChI) and how it can be used to enable projects such as BioChemGraph. Jointly given with the Secretary of the IUPAC InChI Subcommittee. Prompted discussions with the wider community about current limitations of InChI and how these might be overcome. |
Year(s) Of Engagement Activity | 2023 |
Description | Talk entitled "Bridging Structural Chemistry Communities through FAIR" at ConTech Pharma |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Talk targeted at Pharma Industry and solution providers. About 20 people engaged with it. |
Year(s) Of Engagement Activity | 2022 |
URL | https://www.stm-publishing.com/contech-pharma-delivering-successful-fair-data-projects-1st-and-2nd-m... |
Description | Talk entitled "Using InChIs to connect across dimensions and domains" at Fall 2022 ACS National Meeting |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Talk described outcomes of work undertaken for the BioChemGraph project to establish reliable standard chemical identifiers (InChIs) for crystal structures in the Cambridge Structural Database. It contributed to a symposium highlighting the impact of InChI across scientific domains and discussing priorities for future development. |
Year(s) Of Engagement Activity | 2022 |
Description | Talk in session on "Cross-Disciplinary Data Exchange" at Fall 2023 ACS National Meeting |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Talk referenced the importance of the BioChemGraph project for addressing industrial and academic research needs, particularly with regard to data interoperability and exchange across disciplines. |
Year(s) Of Engagement Activity | 2023 |
Description | Talks relating to data interoperability and best practices in data management at IUCr 2023 World Congress |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | BioChemGraph was featured as a key enabler of data interoperability across the domains of structural chemistry and biology. This raised awareness within the global crystallographic community of the potential that BioChemGraph has to make connections across information resources that can support data-driven research. |
Year(s) Of Engagement Activity | 2023 |