Open source pipelines for integrated metabolomics analysis by NMR and mass spectrometry

Lead Research Organisation: University of Liverpool
Department Name: Institute of Integrative Biology

Abstract

Research in the Life Sciences is now commonly performed using high-tech instrumentation, producing very large amounts of data about a system of interest. These techniques are collectively called 'omics (e.g. genomics, proteomics and metabolomics). In different ways they can measure how genes are switched on or off, how the proteins encoded by those genes behave in a cell or tissue of interest, or how the metabolites (biochemical molecules in cells) change in abundance, as the system behaves normally or is put under stress by disease, dysfunction or the introduction of toxic substances. The metabolites studied can include molecules that provide energy or structure to cells (e.g. fats and sugars), the structural building blocks of DNA and proteins (e.g. nucleotides, amino acids) and essential co-factors to biological processes (e.g. vitamins). In fundamental research, and in clinical situations, the presence of a particular metabolite at an unusual abundance can be an indicator (a biomarker) of a particular state, such as a disease. Indeed, metabolomics research is applied in studies on cancer, infectious disease, heart disease, diabetes and many others.

One of the greatest challenges in metabolomics research is that the analysis of the data is very difficult. Multiple different processing steps are needed to get from the raw data as delivered by the instrument (primarily nuclear magnetic resonance (NMR) spectroscopy or mass spectrometry (MS)) to the final results the researcher is interested in, i.e. quantitative and statistically significant differences in particular metabolites between samples. There are multiple software packages (both commercial and free) that can perform individual steps within a complete pipeline, but there is very little good software that makes it easy to perform a full analysis. In this project, we will build such software for data generated from NMR or MS, using a software framework called Galaxy. Galaxy has been designed to construct a web interface on top of other software packages, enabling different (previously disconnected) packages to be joined together into an easy-to-use pipeline. Joining modules together requires data files in a standardized format as the input and output of each step, so we will also work within international organizations to help agree on a universally applied standard format to be used in our pipeline and by other software developers working in metabolomics. Our pipeline will make it much easier for scientists to analyse their data and, in particular, to compare or integrate data coming from the two complementary techniques (NMR and MS) to get a more complete picture of the system being studied. This will enable many more researchers, who currently lack detailed knowledge in metabolomics, to embrace and exploit this powerful technology. Lastly, we will make it easier for scientists to put their data into public databases when they publish their research, enabling other scientists to verify their findings and in some cases re-analyse their data in their own labs.

Technical Summary

Metabolomics comprises an important suite of techniques in modern Life Sciences research, typically performed by NMR spectroscopy or mass spectrometry (MS), applied in a range of fields for biomarker discovery, as well as for understanding metabolic networks in complex and dynamic systems. One of the biggest challenges preventing more widespread adoption of these powerful techniques is that data analysis is difficult, especially when data sets are collected in high-throughput modes. Each technique presents its own challenges, requiring pipelines of (often poorly connected) tools for an end-to-end analysis, and a significant amount of manual analysis for steps where robust software is lacking. For individual steps within a workflow there is commercial or free software at different stages of maturity; however, there are few solutions that offer the capability for automated analysis from data collection through to statistical analysis. In the genomics and proteomics domains, the Galaxy framework has become a popular mechanism for building pipelines of modular tools (originally command-line tools) through a web interface. Galaxy can be easily configured to run on single servers, compute clusters or cloud-based solutions. In this project our groups at the Universities of Liverpool and Birmingham, both of which have a track record in Galaxy development, will collaborate to build a set of metabolomics tools in Galaxy, enabling the construction of analysis pipelines for both NMR and MS analyses. Crucially, the pipelines will deliver data sets to a shared statistical analysis toolkit, enabling integrated analysis of data sets derived from both techniques. We will also contribute to the development of international data standards for metabolomics, and our new pipelines will facilitate the deposition of experimental metabolomics data into the MetaboLights database at the EBI.
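
The idea of modular steps that share one data format, with separate NMR and MS pipelines feeding a common analysis stage, can be illustrated with a minimal Python sketch. This is not the project's code and does not use the Galaxy API: the table format, the function names and the example intensities are all hypothetical, chosen only to show why a shared format lets steps be chained and outputs be merged.

```python
from typing import Callable, Dict, List

# Hypothetical common format: sample name -> {feature name -> intensity}.
FeatureTable = Dict[str, Dict[str, float]]
Step = Callable[[FeatureTable], FeatureTable]


def normalise_to_total(table: FeatureTable) -> FeatureTable:
    """Scale each sample so that its intensities sum to 1.

    Assumes every sample has a non-zero total intensity.
    """
    result = {}
    for sample, features in table.items():
        total = sum(features.values())
        result[sample] = {name: value / total for name, value in features.items()}
    return result


def prefix_features(prefix: str) -> Step:
    """Build a step that tags feature names with their technique of origin."""
    def step(table: FeatureTable) -> FeatureTable:
        return {
            sample: {f"{prefix}:{name}": value for name, value in features.items()}
            for sample, features in table.items()
        }
    return step


def run_pipeline(table: FeatureTable, steps: List[Step]) -> FeatureTable:
    """Apply each step in order; every step reads and writes the same format."""
    for step in steps:
        table = step(table)
    return table


def merge_tables(first: FeatureTable, second: FeatureTable) -> FeatureTable:
    """Combine two tables, keeping only the samples present in both."""
    shared = sorted(set(first) & set(second))
    return {sample: {**first[sample], **second[sample]} for sample in shared}


# Made-up intensities for two samples measured by both techniques.
nmr_raw = {"s1": {"lactate": 2.0, "glucose": 6.0},
           "s2": {"lactate": 4.0, "glucose": 4.0}}
ms_raw = {"s1": {"m/z 180.06": 30.0, "m/z 89.02": 10.0},
          "s2": {"m/z 180.06": 10.0, "m/z 89.02": 10.0}}

nmr = run_pipeline(nmr_raw, [normalise_to_total, prefix_features("NMR")])
ms = run_pipeline(ms_raw, [normalise_to_total, prefix_features("MS")])

# A single table that a shared statistics stage could consume.
combined = merge_tables(nmr, ms)
```

Because every step accepts and returns the same table type, steps can be reordered or swapped without changing their neighbours, which is the property a standardized interchange format is meant to provide.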

Planned Impact

Impact on health and society: The overall purpose of the project is to make data analysis for metabolomics more straightforward. Metabolomics is a technique increasingly used in human, animal and plant research, and as such, there is the potential for longer term (indirect) impacts, for example through facilitating biomarker discovery and the understanding of molecular mechanisms in fields including ageing, human and environmental health, food safety, industrial biotechnology, bioenergy and synthetic biology.

Economic impact: The facilitation of public data deposition has the potential for long term (indirect) economic impact, since it provides the opportunity for data sets (often collected at great expense) to be re-purposed or re-analysed, fostering new research areas or in some cases reducing the requirement to collect new data.

Staff development: The postdocs involved will have the opportunity to work as part of an international network (for example working with the EBI, COSMOS, MSI and PSI) in a cutting-edge software project. The PIs will benefit through exchange of skills and expertise between partners (the team has strong expertise in software engineering, MS, NMR, data analysis and statistics).

Publications

 
Description We have developed open source software for processing NMR metabolomics data, called tameNMR. Current software provision for NMR metabolomics relies primarily on closed source commercial software, with a lack of transparency in the methods applied. tameNMR uses a fully open source approach, with open formats at each stage of the pipeline, and the popular Galaxy framework for deployment. This means that research teams can use the method in their own labs at no cost, and the code makes fully transparent which manipulations are applied to the data.

We have also contributed to the creation of a data standard for NMR, called nmrML, which will facilitate open data sharing between labs.
Exploitation Route The software can be used by other labs working in NMR metabolomics. The code is open source, so it can be used and adapted freely by others.
Sectors Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Description We have assisted the development and publication of the open data standard nmrML, which will help data sharing efforts in metabolomics. In turn, this should enable data in the public domain to be re-used for new purposes, and should bring indirect economic benefits as the standard is adopted in industry.
First Year Of Impact 2017
Sector Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title tameNMR 
Description Galaxy-based pipeline software for processing NMR metabolomics data. 
Type Of Technology Webtool/Application 
Year Produced 2017 
Impact Impacts still on-going 
URL https://github.com/PGB-LIV/tameNMR