Illuminating the dark metabolome: de novo identification of small molecules from their mass spectra using transformer-based deep learning

Lead Research Organisation: University of Liverpool
Department Name: Biochemistry & Systems Biology

Abstract

The metabolic activities of living cells and organisms lead to the production of many thousands of different molecules. To analyse and quantify them, scientists use methods that separate them in special tubes (known as chromatography columns) and then determine their nature by giving them an electric charge and fragmenting them in the gas phase, then measuring the masses (strictly the mass-to-charge ratios) of the fragments. These 'fragment fingerprints', known as mass spectra, may be compared with those of known molecules stored in databases, and thereby used to identify the molecules. The big problem here is that most of the mass spectra generated bear little or no relation to the comparatively few molecules (relative to all plausible molecules) that ARE in the databases. What is therefore needed is a method that allows one to propose a structure from the mass spectra 'de novo', i.e. without recourse to databases of experimental mass spectra.
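The database-matching step described above can be sketched as a peak-aligned cosine similarity between a query spectrum and a library spectrum. This is a minimal illustration with made-up peaks; real spectral search engines use more sophisticated scoring and m/z tolerance handling:

```python
import math

def cosine_match(spec_a, spec_b, tol=0.01):
    """Score two mass spectra (lists of (m/z, intensity) peaks) by the
    cosine of the angle between their aligned intensity vectors.
    Peaks are paired greedily when their m/z values differ by < tol."""
    pairs = []
    used = set()
    for mz_a, int_a in spec_a:
        for j, (mz_b, int_b) in enumerate(spec_b):
            if j not in used and abs(mz_a - mz_b) < tol:
                pairs.append((int_a, int_b))
                used.add(j)
                break
    num = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical spectra score ~1.0; spectra sharing no peaks score 0.0.
query = [(91.05, 100.0), (119.08, 40.0)]
print(cosine_match(query, query))
print(cosine_match(query, [(50.0, 100.0)]))
```

The point of the sketch is the failure mode the abstract describes: when no library spectrum scores highly against the query, the molecule simply cannot be identified this way.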

Although the number of experimental mass spectra is small, given a molecular structure it is possible to fragment it inside a computer to produce all (or a sensible subset) of the fragments that it COULD create. The ZINC database contains more than 10 billion molecular structures that obey chemical rules.
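As a minimal illustration of the in silico fragmentation idea, the toy below treats a molecule as a linear chain of groups and breaks each "bond" in turn, collecting both fragment masses. The group masses are invented for illustration; real fragmenters perform combinatorial bond-breaking over the full molecular graph:

```python
# Toy in silico fragmenter: a molecule is modelled as a linear chain of
# groups (monoisotopic masses in Da). Breaking each single "bond" in
# turn yields the masses of both resulting fragments.

def fragment_masses(chain):
    frags = set()
    for i in range(1, len(chain)):
        frags.add(round(sum(chain[:i]), 4))   # left-hand fragment
        frags.add(round(sum(chain[i:]), 4))   # right-hand fragment
    return sorted(frags)

# Hypothetical 3-group molecule: masses are illustrative, not real data.
print(fragment_masses([15.0235, 14.0157, 17.0027]))
# -> [15.0235, 17.0027, 29.0392, 31.0184]
```

Applied across billions of database structures, this kind of exhaustive enumeration is what generates the computer-made spectra used for training.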

Modern methods of 'deep learning' or 'generative artificial intelligence (AI)' allow one to relate paired 'in silico' (computer-generated) mass spectra with the structures that 'caused' them, and in an earlier study we used just such a method, known as a 'transformer', trained with some 21 million computer-generated mass spectra, to learn the mass-spectrum-to-structure mapping. This transformer consisted of a neural network with some 400 million parameters, and could indeed generalise to predict the structures of molecules on which it had not been trained. Although this was, for 2020 (when the work was performed), a very large network - three years earlier it would have been the largest ever published by anyone, including the likes of Google, Facebook and Amazon - it was nowhere near the kinds of network size that were even then being published (e.g. Google Switch > 1 trillion parameters - Hutson, M. (2021) The language machines. Nature 591, 22-25). Since it is well known (as 'scaling laws') that bigger networks can in effect learn more, the first requirement of this project is to increase the size of both the dataset used to train the network and the network itself, and to see how much this improves generalisation.
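The 'scaling laws' referred to above take the form of a power law in which test loss falls smoothly as model size grows. The sketch below uses hypothetical constants, chosen purely to show the qualitative effect of scaling a ~400-million-parameter model up by successive orders of magnitude, not fitted to any real model:

```python
# Illustrative power-law scaling: loss L(N) = (n_c / N) ** alpha,
# where N is the number of model parameters. The constants n_c and
# alpha are invented for this sketch.

def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (4e8, 4e9, 4e10):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The monotonic decrease is the motivation: each order-of-magnitude increase in parameters buys a further, predictable reduction in loss, which is why the project's first step is to scale both the dataset and the network.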

A variety of other strategies will also be tried to improve the ability of our new network to generalise to most of the biologically relevant chemical space. These include changing the representation of the structure of the small molecules given to the computer, removing nodes that do little or nothing, changing the architecture of the transformer, and 'fine tuning' the transformer by training it additionally not only with computer-generated mass spectra but also with composite mass spectra obtained experimentally using a variety of instruments that we already possess.
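One way of changing the structure representation can be sketched as follows: rather than feeding a SMILES string to the network character by character, it can be split into chemically meaningful tokens (two-letter elements, bracket atoms, bonds, ring closures). The regex below is an illustrative sketch covering common organic-subset SMILES, not a complete grammar:

```python
import re

# Minimal regex SMILES tokenizer. Bracket atoms and two-letter element
# symbols (Br, Cl, Si) are tried before single-letter atoms so that
# "Cl" is one token rather than "C" + "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|[BCNOPSFI]|[bcnops]|=|#|\(|\)|/|\\|@|\+|-|[0-9]|%[0-9]{2})"
)

def tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize("ClCBr"))
```

Different tokenisations change the vocabulary and sequence lengths the transformer must handle, which is one of the representational choices to be explored.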

The result will potentially be a solution to the biggest problem besetting those who study metabolism in any organism - the fact that they cannot even identify the molecules that they can observe, and which can be seen to be intimately involved in the processes of interest.

Technical Summary

Most comprehensive 'omics' studies of small molecules - known as 'untargeted metabolomics' - use liquid chromatography coupled to high-resolution tandem (and MS^n) mass spectrometry. The mass spectra so generated are then compared with the spectra of molecules in databases, which could allow one to identify them. As is well known, the problem is that most experimental mass spectra do not remotely resemble anything in the databases and thus identification is impossible. We need 'de novo' methods that do not rely on these databases.

We previously developed a deep neural network that could learn to map ~21M computer-generated mass spectra to the structures behind them, and showed that it could effect a reasonable but incomplete generalisation to 'unknown' experimental mass spectra. We now seek to improve this network in a number of ways, including substantial increases in the size of the training set and of the transformer, encoding molecular structures in different ways, exploiting Graphcore's Intelligent Processing Units and massive cloud resources, and assessing novel and sparse transformer architectures. The transformers will be fine-tuned with a series of experimental mass spectra that we shall generate as part of this project on our suite of BBSRC-funded mass spectrometers, and tested on 'unknowns' on which the transformers have not been trained.

The result is expected to be a massive improvement in our ability to predict the structures of small molecules from their mass spectra de novo, without the need for the spectra to be in existing libraries.
