Re-engineering ChEBI for a sustainable future

Lead Research Organisation: European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

Endogenous small molecules play important roles in regulating complex biological processes and, thus, life itself. Small molecules also serve as powerful tools, with wide-ranging applications in medicine (i.e. as drugs), the biological sciences and biotechnology. An ever-increasing number of novel compounds with a wide range of interesting and potentially useful properties are being identified from sources such as plants, fungi and microorganisms.

Small molecules are thus clearly of critical interest to the scientific community. However, many biologists lack the detailed expertise and knowledge to fully understand and appreciate the many complex and subtle aspects of small molecules, and in particular the many nuances associated with the accurate representation of chemical structures. A further challenge is that the same small molecule will often be referenced by multiple names and synonyms in the scientific literature and in databases. To take one very simple example, the non-steroidal anti-inflammatory drug aspirin is also referred to as acetylsalicylic acid, 2-(acetyloxy)benzoic acid and o-acetylsalicylic acid, among many other synonyms. This complexity and ambiguity are significant obstacles and can lead to wasted effort, inaccurate results and misleading conclusions. The Chemical Entities of Biological Interest (ChEBI) database acts as a reliable and trusted resource that provides "definitive" information about small molecules, thereby delivering a solution to many of these challenges. ChEBI provides the community with biological, chemical and semantic information on small chemical compounds relevant to biology. ChEBI also assigns each distinct molecular structure a stable, unchanging identifier, which is used by multiple other resources to definitively identify that specific compound, much as a grid reference unambiguously identifies a specific location on the earth's surface. In addition, ChEBI incorporates standard naming systems from global bodies such as the International Union of Pure and Applied Chemistry (IUPAC) and the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). All of the information and data in ChEBI are freely available and downloadable without restriction. For these reasons, ChEBI is very widely used as a small molecule reference database by a number of leading biomedical databases. ChEBI is also used by a very large number of users who access its information via the public website.
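
The role of a stable identifier in the aspirin example above can be illustrated with a small sketch. It is not ChEBI's implementation: the in-memory `SYNONYMS` table and the `normalise` and `resolve` helpers are hypothetical, and the identifier CHEBI:15365 is used here for aspirin on the assumption that it matches the current ChEBI entry, which should be checked against the database itself.

```python
# Hypothetical in-memory synonym table (an assumption for illustration);
# a real lookup would query ChEBI rather than a hard-coded dictionary.
SYNONYMS = {
    "aspirin": "CHEBI:15365",
    "acetylsalicylic acid": "CHEBI:15365",
    "2-(acetyloxy)benzoic acid": "CHEBI:15365",
    "o-acetylsalicylic acid": "CHEBI:15365",
}


def normalise(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.strip().lower().split())


def resolve(name: str):
    """Return the stable identifier for a name, or None if it is unknown."""
    return SYNONYMS.get(normalise(name))


if __name__ == "__main__":
    for name in ["Aspirin", "2-(acetyloxy)benzoic acid", "paracetamol"]:
        print(name, "->", resolve(name))
```

Every synonym resolves to the same identifier, so downstream resources can store that identifier rather than any one of the names.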

The aim of our proposal is to ensure the continued availability and growth of this critical resource for the bioscience community. ChEBI was originally developed in 2004 and, as a consequence, its underlying computer code is now out of date and increasingly difficult to support and maintain. Indeed, there is a growing risk that it will become incompatible with current computer systems in the near future. We therefore propose to completely overhaul and modernise the ChEBI infrastructure, code base and associated software tools. A new user-friendly website will be developed which will enable users to search, retrieve and download data. Advanced users will benefit from improved programmatic access to the data. We will develop a new annotation, curation and submission tool that will improve the overall efficiency of our expert ChEBI curators, for example by automating a number of currently time-consuming manual processes. This will reduce the time and effort required to create new entries. This tool will also benefit users who submit entries to ChEBI by significantly streamlining the submission process. Our project will enable ChEBI to benefit from recent advances in software development techniques and will deliver a new infrastructure platform, which is essential if ChEBI is to continue to fulfil the critical role it plays in the global bioinformatics community.

Technical Summary

ChEBI is a database and ontology containing information about chemical entities of biological interest. It is widely used as a 'small molecule' reference database by a number of leading global resources such as Gene Ontology, UniProt and Rhea, providing identifiers, structures and annotations to enable chemical entities to be unambiguously identified within biological databases, ontologies, models and the literature. ChEBI is also widely used through its public website and API as a rich source of information about small molecules. ChEBI is curated by human experts, and provides a reliable, non-redundant collection of chemical entities and related data such as detailed structure, synonyms, chemical formula, charge, molecular mass and links to external databases. Furthermore, ChEBI also contains an extensive ontology which enables the relationships between chemical entities to be defined on the basis of their shared chemical structure features together with their biological properties and roles.
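
How an ontology lets relationships between entities be followed can be sketched with a toy graph. The terms and edges below are illustrative assumptions and are not taken from the ChEBI ontology itself; the relation names `is_a` and `has_role` stand in for ontology relationship types, and `EDGES` and `ancestors` are hypothetical names.

```python
from collections import deque

# Toy ontology fragment: (term, relation) -> list of parent terms.
# These entries are illustrative only and do not reproduce ChEBI content.
EDGES = {
    ("acetylsalicylic acid", "is_a"): ["benzoic acids"],
    ("benzoic acids", "is_a"): ["aromatic carboxylic acid"],
    ("aromatic carboxylic acid", "is_a"): ["carboxylic acid"],
    ("acetylsalicylic acid", "has_role"): ["non-steroidal anti-inflammatory drug"],
}


def ancestors(term: str, relation: str = "is_a") -> list:
    """Return every term reachable from `term` by following `relation` edges."""
    seen = []
    queue = deque(EDGES.get((term, relation), []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.append(parent)
            # Keep walking upwards through the same relation type.
            queue.extend(EDGES.get((parent, relation), []))
    return seen


if __name__ == "__main__":
    print(ancestors("acetylsalicylic acid"))
    print(ancestors("acetylsalicylic acid", relation="has_role"))
```

Keeping structural classification (`is_a`) separate from biological role (`has_role`) mirrors the distinction the summary draws between shared chemical structure features and biological properties and roles.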

Since its creation in 2004, ChEBI's software infrastructure has not undergone any major enhancements and is now significantly outdated, resulting in a large and growing maintenance burden. The overall goal of our project is to completely overhaul and modernise ChEBI's software infrastructure so that ChEBI can continue to provide its critical service to the bioscience community. The work will be divided into four distinct work packages covering (1) the core database and web services, (2) more powerful and scalable searching capabilities using Elasticsearch and RDKit, (3) a new web interface and ontology visualisation tool and (4) a new suite of curator tools that will improve efficiency and enable a wider pool of curators to contribute to ChEBI. Documentation and training will be developed to enable users to benefit from these developments, which will affect not only ChEBI itself but also a multitude of other global bioinformatics resources.

Publications

Andrés-Hernández L (2022) Establishing a Common Nutritional Vocabulary - From Food Production to Diet. in Frontiers in nutrition

Witting M (2024) Challenges and perspectives for naming lipids in the context of lipidomics. in Metabolomics : Official journal of the Metabolomic Society

 
Title: ChEBI
Description: Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on 'small' chemical compounds, which are either products of nature or synthetic products used to intervene in the processes of living organisms. ChEBI incorporates an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified.
Type Of Material: Database/Collection of data
Provided To Others? Yes
Impact: ChEBI is a key component of multiple global biodata resources, which draw upon various aspects of the database including chemical structures, the ontology, molecule names and stable molecule identifiers.
URL: https://www.ebi.ac.uk/chebi/