Combatting antimicrobial resistance through new software for natural product discovery

Lead Research Organisation: University of Glasgow
Department Name: School of Computing Science


The rate of chemical discovery of new antibiotics is too slow: bacteria are evolving resistance to current medicines faster than new chemistry is being discovered. Globally, antimicrobial resistance is already thought to be responsible for 700,000 deaths per year and, in the absence of new solutions, this is estimated to rise to 10 million per year by 2050.

Bacteria themselves are excellent producers of compounds with biologically active properties. In fact, over 70% of the antibiotics approved between 1981 and 2016 are bacterially produced natural products or derivatives thereof. Many of these compounds are assembled by groups of enzymes that are themselves encoded in areas of the bacterial genome known as biosynthetic gene clusters. Technological advances have increased the number, quality and availability of bacterial genome sequences. This wealth of data has revealed that both the number and diversity of predicted biosynthetic gene clusters greatly exceed expectations.

The knowledge that bacteria have the potential to produce this vast reservoir of undiscovered chemistry has re-invigorated the research community. Often, bacterial strains are genome sequenced and cultured in an attempt to detect the molecules being produced by the biosynthetic gene clusters identified in the sequence. Whilst mature computational tools exist to analyse the resulting mass spectrometry and sequence data sets independently, the community lacks a platform to bring these two data types together. This absence results in a severe bottleneck in the analysis pipeline, as researchers are forced to attempt to manually link the predicted gene clusters with their products, which are hidden somewhere in the mass spectrometry data. Given that a typical strain can easily contain around 100 biosynthetic gene clusters, and mass spectrometry of the cultured strain can easily result in fragment spectra for 2000 molecules, a single strain already yields on the order of 100 x 2000 = 200,000 candidate pairings. The space of potential links is clearly too vast for manual investigation.

We will develop and implement the computational tools that can link the gene clusters and their products in these large datasets in an automated way. The tools will allow import of the output of popular spectral and genomic analysis software. Our platform will then predict links and allow users to interactively explore the results, for example by investigating the content of the gene clusters and spectra that have been linked together to see if the link is likely to be genuine. Crucially, this software will be built in a modular manner, with future development in mind. It will therefore be the vehicle in which future tools (e.g. more advanced linking tools optimised for particular natural product gene clusters) can be developed, deployed and benchmarked.

Technical Summary

Bacterial natural products (specialized metabolites) have high potential as future antibiotics. Genome sequencing is revealing that the number and diversity of biosynthetic gene clusters (BGCs; groups of genes encoding enzymes that assemble specialized metabolites) exceed previous expectations. A key challenge is matching the predicted BGCs to their products measured in bacterial culture via mass spectrometry (MS).
Computational tools such as GNPS molecular networking (for MS data) and antiSMASH (for predicting BGCs) are maturing and provide the means to analyse these data types separately. However, tools do not exist to help identify which of the ~2000 fragment spectra observed in a typical MS analysis of a single strain (under one fermentation/extraction condition) corresponds to which of the tens of BGCs predicted from the genome. Researchers perform this matching manually, resulting in a severe analysis bottleneck.
Large chemical and sequence datasets are becoming common: e.g. a combined genomic and mass spectral analysis of 146 strains was recently published. As we enter this data-rich era, development of tools based on statistical and machine learning approaches is urgently required. We propose the development of software that will predict biosynthetic-chemical links by mining the data for shared patterns. Groups of similar spectra can be matched to groups of similar BGCs based upon the strains present/absent in the two groups, or biosynthetic features present in the spectra (predicted from the BGCs), or combinations of both.
Various research challenges exist in this area: how to build the link scoring methods, how to group spectra across strains, how to group BGCs across strains, etc. Development and benchmarking of the tools to answer these challenges requires the two data sets to be accessible in a shared analysis space. We propose the development of software that will bring together these two omic data types including the first automatic link prediction approaches.
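The strain presence/absence idea described in this summary can be illustrated with a minimal sketch. It is not the project's scoring method: it assumes, purely for illustration, that each group of spectra and each group of BGCs is represented by the set of strains in which it was observed, and it uses Jaccard overlap as a stand-in score. The function names, group names and strain identifiers are all hypothetical.

```python
from itertools import product


def strain_overlap_score(spectra_strains, bgc_strains):
    """Jaccard overlap between two sets of strain identifiers (illustrative score)."""
    union = spectra_strains | bgc_strains
    if not union:
        return 0.0
    return len(spectra_strains & bgc_strains) / len(union)


def rank_links(spectral_groups, bgc_groups):
    """Score every (spectral group, BGC group) pair and return them best first."""
    scored = [
        (strain_overlap_score(s_strains, b_strains), s_name, b_name)
        for (s_name, s_strains), (b_name, b_strains)
        in product(spectral_groups.items(), bgc_groups.items())
    ]
    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))


# Made-up toy data: which strains each group was observed in.
spectral_groups = {
    "spectra_A": {"strain1", "strain2", "strain3"},
    "spectra_B": {"strain4"},
}
bgc_groups = {
    "bgc_X": {"strain1", "strain2", "strain3"},
    "bgc_Y": {"strain3", "strain4"},
}

links = rank_links(spectral_groups, bgc_groups)
for score, spectra_name, bgc_name in links:
    print(f"{spectra_name} <-> {bgc_name}: {score:.2f}")
```

In this toy case the pair whose strain sets coincide exactly ranks first. A real scoring method would also need to account for BGCs that are present in a genome but not expressed under the culture conditions used, which a plain set overlap does not capture.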

Planned Impact

This research has a wide range of stakeholders: researchers, whose pipelines will be made more efficient; the pharmaceutical industry, which will be able to accelerate the drug discovery process; government policy makers (particularly with respect to strategies to overcome antibiotic resistance); research funders; and the public (through the health benefits that can be conferred).

Expanding the chemical resources (and knowledge of them and their biosynthesis) from microorganisms is a global fundamental research goal because of the urgency with which solutions are required to combat antimicrobial resistance. This has clear relevance within both academia and industry. The proposed research will identify links between biosynthetic gene clusters and the specialized metabolites they produce. When such predictions are experimentally validated with synthetic biology approaches, the result is a powerful tool for the discovery and prioritization of new antibiotics. Our software will be the first solution to this problem, but also a platform that can become the basis for the development of the next generation of tools in this area, based on machine learning and data science techniques.

We will combine metabolomics, genomics and software development in a manner that will provide a method to assess the potential of bacteria to address worldwide health issues. Ultimately, the public will indirectly benefit from the efficient discovery of new chemistry to address the antimicrobial resistance crisis.

To maximise uptake, we will provide a Docker image of our software that is easy to install and use. To maximise community involvement, all source code will be available and the software will be designed in a modular way so that new modules for link prediction can easily be added and benchmarked.

All staff involved in this project will receive excellent exposure (and therefore training) in a vital multidisciplinary area.
Title Metabolomics Data 
Description Metabolomics data, HR-MS/MS profiles of all 26 strains in five media = 130 metabolomics profiles plus controls 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? No  
Impact The dataset will become publicly available once we have finished analysing the data and the publication is complete 
Title NPLinker tool 
Description The NPLinker tool is the main output of this research. It is a platform in which users can mine paired metabolomic and genomic data for links between metabolites and the gene clusters that produce them. The code is currently available on GitHub and a paper is in preparation. 
Type Of Material Computer model/algorithm 
Year Produced 2019 
Provided To Others? Yes  
Impact None yet. 
Description Prof. Juho Rousu 
Organisation Aalto University
Department Department of Computer Science
Country Finland 
Sector Academic/University 
PI Contribution Prof. Rousu and I obtained funding from SICSA for him to visit my group in Glasgow for three months in Summer 2019. Prof. Rousu is an expert in the analysis of metabolomic data. Collaboration from his visit has two strands: one stemming from the BBSRC project (Combatting...), in which we are working together on new IOKR methods for predicting the products of biosynthetic gene clusters, and a second stemming from the EPSRC project (Closed-loop...), in which we are building probabilistic models that incorporate retention time into annotation and that could be used in a closed-loop context to prioritise MS acquisition.
Collaborator Contribution Prof. Rousu has provided expertise in kernel methods for metabolite ID and in retention time prediction. His group also funded a visit by one of his PGR students (Eric Bach) to my group for several weeks in Summer 2019 (a direct result of Prof. Rousu's visit).
Impact 1 Draft publication awaiting submission
Start Year 2019
Description ScotChem Natural Products in the Bioeconomy 
Organisation Robert Gordon University
Country United Kingdom 
Sector Academic/University 
PI Contribution Dr Duncan was invited to co-organize the inaugural ScotChem Natural Products in the Bioeconomy workshop as a result of ongoing work from this project.
Collaborator Contribution The workshop was over-subscribed and held at the University of Aberdeen. It provided an opportunity to engage with industry and academic partners and foster initial collaborations. Research feedback and exchange of ideas were valuable for the development of NPLinker.
Impact A subsequent multidisciplinary grant (microbiology, environmental science, chemistry, molecular biology), funded by the Scottish Universities Life Sciences Alliance, was secured with Robert Gordon University (RGU, PI) and K. Duncan (co-I).
Start Year 2019
Description Glasgow Science Centre - Curiosity Live 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Glasgow Science Centre runs a biannual regional event called "Curiosity Live", which is attended by many high schools over the course of several days. The Duncan group ran an activity stand, "medicines from microorganisms", at the events on both November 7th 2019 and March 12th 2020, reaching over 1000 students at each event (over 2000 combined). The stands featured several interactive activities for the students to engage with directly, including "isolating their own bacteria from soil/sediment", a "Where do medicines come from" game and "match the drug to the organism". There were some great questions from students of all ages about careers, drug discovery, microbiology and chemistry. The stands were run by undergraduate and postgraduate members of the Duncan group, encompassing 10 individuals over the two events. Both events were additionally profiled online by the Glasgow Science Centre @gsk1 (twitter and instagram), reaching the wider public, and also on our own social media (twitter @kate_duncan, @medicinesfromthesea instagram). Due to the success of our first event, we were invited back in March 2020, and look forward to contributing to further events.
Year(s) Of Engagement Activity 2019,2020
Description University of Strathclyde - Open Day 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact On 5th October, we ran a stand called "microbiology and molecular biology" at the University of Strathclyde Open Day. This engaged approx. 400 senior high school students (prospective university students) and their parents in hands-on activities such as "actinomycetes - a source of medicines" and "chemical extraction". This resulted in multiple questions about postgraduate and undergraduate study of microbiology and career choices. The activities were run by current postgraduate students.
Year(s) Of Engagement Activity 2019