Combatting antimicrobial resistance through new software for natural product discovery

Lead Research Organisation: University of Glasgow
Department Name: School of Computing Science


The rate of chemical discovery of new antibiotics is too slow. As a result, bacteria are evolving resistance to current medicines faster than new chemistry is being discovered. Globally, antimicrobial resistance is already thought to be responsible for 700,000 deaths per year, and, in the absence of new solutions, this is estimated to rise to 10 million by 2050.

Bacteria themselves are excellent producers of compounds with biologically active properties. In fact, over 70% of the antibiotics approved between 1981 and 2016 are bacterially produced natural products or derivatives thereof. Many of these compounds are assembled by groups of enzymes that are themselves encoded in areas of the bacterial genome known as biosynthetic gene clusters. Technological advances have increased the number, quality and availability of bacterial genome sequences. This wealth of data has revealed that both the number and diversity of predicted biosynthetic gene clusters greatly exceed expectations.

The knowledge that bacteria have the potential to produce this vast reservoir of undiscovered chemistry has re-invigorated the research community. Often, bacterial strains are genome-sequenced and cultured in an attempt to detect the molecules being produced by the biosynthetic gene clusters identified in the sequence. Whilst mature computational tools exist to analyse the resulting mass spectrometry and sequence data sets independently, the community lacks a platform to bring these two data types together. This absence results in a severe bottleneck in the analysis pipeline, as researchers are forced to manually link the predicted gene clusters with their products, which are hidden somewhere in the mass spectrometry data. Given that a typical strain can easily contain around 100 biosynthetic gene clusters and mass spectrometry of the cultured strain can easily result in fragment spectra for 2000 molecules, there are on the order of 200,000 candidate pairings per strain, and the space of potential links is too vast for manual investigation.

We will develop and implement the computational tools that can link the gene clusters and their products in these large datasets in an automated way. The tools will allow import of the output of popular spectral and genomic analysis software. Our platform will then predict links and allow users to interactively explore the results, for example by investigating the content of the gene clusters and spectra that have been linked together to see if the link is likely to be genuine. Crucially, this software will be built in a modular manner, with future development in mind. It will therefore be the vehicle within which future tools (e.g. more advanced linking tools optimised for particular natural product gene clusters) can be developed, deployed and benchmarked.

Technical Summary

Bacterial natural products (specialized metabolites) have high potential as future antibiotics. Genome sequencing is revealing that the number and diversity of biosynthetic gene clusters (BGCs; groups of genes encoding enzymes that assemble specialized metabolites) exceed previous expectations. A key challenge is matching the predicted BGCs to their products measured in bacterial culture via mass spectrometry (MS).
Computational tools such as GNPS molecular networking (for MS data) and antiSMASH (for predicting BGCs) are maturing and provide the means to analyse these data types separately. However, no tools exist to help identify which of the ~2000 fragment spectra observed in a typical MS analysis of a single strain (under one fermentation/extraction condition) correspond to which of the tens of BGCs predicted from the genome. Researchers currently perform this matching manually, resulting in a severe analysis bottleneck.
Large chemical and sequence datasets are becoming common: e.g. a combined genomic and mass spectral analysis of 146 strains was recently published. As we enter this data-rich era, the development of tools based on statistical and machine learning approaches is urgently required. We propose the development of software that will predict biosynthetic-chemical links by mining the data for shared patterns. Groups of similar spectra can be matched to groups of similar BGCs based upon the strains present/absent in the two groups, or biosynthetic features present in the spectra (predicted from the BGCs), or combinations of both.
Various research challenges exist in this area: how to build the link scoring methods, how to group spectra across strains, how to group BGCs across strains, etc. Development and benchmarking of the tools to answer these challenges requires the two data sets to be accessible in a shared analysis space. We propose the development of software that will bring together these two omics data types, including the first automatic link prediction approaches.
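The strain presence/absence matching described above can be illustrated with a minimal sketch. This is not the proposed software or its scoring method: it assumes, purely for illustration, that each group of spectra and each group of BGCs has already been reduced to the set of strains in which it was observed, and it uses a simple Jaccard overlap as a stand-in link score. The group names (MF1, GCF1, etc.), strain names and function names are hypothetical.

```python
from itertools import product


def strain_overlap_score(spectra_strains, bgc_strains):
    """Jaccard index of two strain sets: shared strains / strains in either set."""
    union = spectra_strains | bgc_strains
    if not union:
        return 0.0
    return len(spectra_strains & bgc_strains) / len(union)


def rank_links(spectral_groups, bgc_groups):
    """Score every (spectral group, BGC group) pair and sort best-first."""
    links = [
        (strain_overlap_score(s_strains, b_strains), s_name, b_name)
        for (s_name, s_strains), (b_name, b_strains)
        in product(spectral_groups.items(), bgc_groups.items())
    ]
    # Highest score first; ties broken alphabetically for a stable ordering.
    return sorted(links, key=lambda link: (-link[0], link[1], link[2]))


# Hypothetical toy data: the strains in which each group was observed.
spectral_groups = {
    "MF1": {"strainA", "strainB", "strainC"},
    "MF2": {"strainD"},
}
bgc_groups = {
    "GCF1": {"strainA", "strainB", "strainC"},
    "GCF2": {"strainC", "strainD"},
}

ranked = rank_links(spectral_groups, bgc_groups)
for score, spectra_name, bgc_name in ranked:
    print(f"{spectra_name} <-> {bgc_name}: {score:.2f}")
```

In this toy example the pair whose strain sets coincide exactly (MF1 and GCF1) is ranked first. A realistic scoring method would also need to account for BGCs that are present but not expressed under the culture conditions used, which a symmetric overlap measure penalises too heavily.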

Planned Impact

This research has a wide range of stakeholders: researchers, whose pipelines will be made more efficient; the pharmaceutical industry, which will be able to accelerate the drug discovery process; government policy makers (particularly with respect to strategies to overcome antibiotic resistance); research funders; and the public (through the health benefits that can be conferred).

Expanding the chemical resources (and knowledge of them and their biosynthesis) from microorganisms is a global fundamental research goal because of the urgency with which solutions are required to combat antimicrobial resistance. This has clear relevance within both academia and industry. The proposed research will identify links between biosynthetic gene clusters and the specialized metabolites they produce. When such predictions are experimentally validated with synthetic biology approaches, the result is a powerful tool for the discovery and prioritization of new antibiotics. Our software will be not only the first solution to this problem, but also a platform that can become the basis for the development of the next generation of tools in this area, based on machine learning and data science techniques.

We will combine metabolomics, genomics and software development in a manner that will, ultimately, provide a method to assess the potential of bacteria to address worldwide health issues. The public will benefit indirectly from the efficient discovery of new chemistry to address the antimicrobial resistance crisis.

To maximise uptake, we will provide a Docker image of our software that is easy to install and use. To maximise community involvement, all source code will be available and the software will be designed in a modular way to easily enable the addition (and benchmarking) of new modules for link prediction.

All staff involved in this project will receive excellent exposure to (and therefore training in) a vital multidisciplinary area.