Open source pipelines for integrated metabolomics analysis by NMR and mass spectrometry

Lead Research Organisation: University of Birmingham
Department Name: School of Biosciences

Abstract

Research in the Life Sciences is now commonly performed using high-tech instrumentation, producing very large amounts of data about a system of interest. These techniques are collectively called 'omics (including genomics, proteomics and metabolomics) and in different ways can measure how genes are switched on or off, how the proteins encoded by those genes behave in a cell or tissue of interest, or how the metabolites (biochemical molecules in cells) change in abundance, as the system behaves normally or is put under stress by disease, dysfunction or the introduction of toxic substances. The metabolites studied can include molecules that provide energy or structure to cells (e.g. fats and sugars), the structural building blocks of DNA and proteins (e.g. nucleotides and amino acids) and essential co-factors to biological processes (e.g. vitamins). In fundamental research, and in clinical situations, the presence of a particular metabolite at an unusual abundance can be an indicator (a biomarker) of a particular state, such as a disease. Indeed, metabolomics research is applied in studies on cancer, infectious disease, heart disease, diabetes and many others.

One of the greatest challenges in metabolomics research is that the analysis of the data is very difficult. Multiple different processing steps are needed to get from the raw data as delivered by the instrument (primarily nuclear magnetic resonance (NMR) spectroscopy or mass spectrometry (MS)) to the final results the researcher is interested in, i.e. quantitative and statistically significant differences in particular metabolites between samples. There are multiple software packages (both commercial and free) that can perform individual steps within a complete pipeline, but there is very little good software that makes it easy to perform a full analysis. In this project, we will build such software for data generated from NMR or MS, using a software framework called Galaxy. Galaxy has been designed to construct a web interface on top of other software packages, enabling different (previously disconnected) packages to be joined together into an easy-to-use pipeline. Joining modules together requires data files in a standardized format as the input and output of each step, so we will also work within international organizations to help agree on a universally applied standard format to be used in our pipeline and by other software developers working in metabolomics. Our pipeline will make it much easier for scientists to analyse their data and, in particular, to compare or integrate data coming from the two complementary techniques (NMR and MS) to get a more complete picture of the system being studied. This will enable many more researchers, who currently lack detailed knowledge of metabolomics, to embrace and exploit this powerful technology. Lastly, we will make it easier for scientists to put their data into public databases when they publish their research, enabling other scientists to verify their findings and in some cases re-analyse their data in their own labs.

Technical Summary

Metabolomics comprises an important suite of techniques in modern Life Sciences research, typically performed by NMR spectroscopy or mass spectrometry (MS), applied in a range of fields for biomarker discovery, as well as for understanding metabolic networks in complex and dynamic systems. One of the biggest challenges preventing more widespread adoption of these powerful techniques is that data analysis is difficult, especially when data sets are collected in high-throughput modes. Each technique presents its own challenges, requiring pipelines of (often poorly connected) tools for an end-to-end analysis, and a significant amount of manual analysis for steps where robust software is lacking. For individual steps within a workflow, commercial or free software exists at different stages of maturity; however, there are few solutions that offer the capability for automated analysis from data collection through to statistical analysis. In the genomics and proteomics domains, the Galaxy framework has become a popular mechanism for building pipelines of modular tools (originally of command-line nature) through a web interface. Galaxy can be easily configured to run on single servers, compute clusters or cloud-based solutions. In this project our groups at the Universities of Liverpool and Birmingham, both of which have a track record in Galaxy development, will collaborate to build a set of metabolomics tools in Galaxy, enabling the construction of analysis pipelines for both NMR and MS analyses. Crucially, the pipelines will deliver data sets to a shared statistical analysis toolkit, enabling integrated analysis of data sets derived from both techniques. We will also contribute to the development of international data standards for metabolomics, and our new pipelines will facilitate the deposition of experimental metabolomics data into the MetaboLights database at the EBI.
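
The idea of modular steps that exchange one agreed data format, and then feed a shared statistics stage regardless of technique, can be illustrated with a minimal Python sketch. This is not Galaxy's API and not the project's software: the PeakMatrix type, the step functions and the sample values are all hypothetical names and data chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PeakMatrix:
    """Hypothetical shared format passed between every pipeline step."""
    technique: str                    # "NMR" or "MS"
    features: List[str]               # feature labels, e.g. ppm bins or m/z values
    samples: Dict[str, List[float]]   # sample name -> intensities aligned with features


# Every step takes a PeakMatrix and returns a new PeakMatrix,
# so steps can be chained in any order.
Step = Callable[[PeakMatrix], PeakMatrix]


def drop_empty_features(m: PeakMatrix) -> PeakMatrix:
    """Remove features whose intensity is zero in every sample."""
    keep = [i for i in range(len(m.features))
            if any(vals[i] != 0 for vals in m.samples.values())]
    return PeakMatrix(
        m.technique,
        [m.features[i] for i in keep],
        {name: [vals[i] for i in keep] for name, vals in m.samples.items()},
    )


def normalise_total_intensity(m: PeakMatrix) -> PeakMatrix:
    """Scale each sample so that its intensities sum to 1."""
    scaled = {}
    for name, vals in m.samples.items():
        total = sum(vals)
        scaled[name] = [v / total for v in vals] if total else list(vals)
    return PeakMatrix(m.technique, m.features, scaled)


def run_pipeline(m: PeakMatrix, steps: List[Step]) -> PeakMatrix:
    """Apply the steps in order, passing each output to the next step."""
    for step in steps:
        m = step(m)
    return m


def mean_intensity(m: PeakMatrix) -> Dict[str, float]:
    """Stand-in for a shared statistics stage: mean per feature across samples."""
    n = len(m.samples)
    return {f: sum(vals[i] for vals in m.samples.values()) / n
            for i, f in enumerate(m.features)}


if __name__ == "__main__":
    # Made-up intensities; the same steps would accept an "MS" matrix unchanged.
    nmr = PeakMatrix("NMR", ["1.33ppm", "3.05ppm", "9.99ppm"],
                     {"s1": [2.0, 2.0, 0.0], "s2": [1.0, 3.0, 0.0]})
    processed = run_pipeline(nmr, [drop_empty_features, normalise_total_intensity])
    print(mean_intensity(processed))
```

Because each step consumes and produces the same structure, a step written for one technique can be reused for the other, and the final statistics function never needs to know which instrument produced the data.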

Planned Impact

Impact on health and society: The overall purpose of the project is to make data analysis for metabolomics more straightforward. Metabolomics is a technique increasingly used in human, animal and plant research, and as such, there is the potential for longer term (indirect) impacts, for example through facilitating biomarker discovery and the understanding of molecular mechanisms in fields including ageing, human and environmental health, food safety, industrial biotechnology, bioenergy and synthetic biology.

Economic impact: The facilitation of public data deposition has the potential for long-term (indirect) economic impact, since it provides the opportunity for data sets (often collected at great expense) to be re-purposed or re-analysed, fostering new research areas or in some cases reducing the requirement to collect new data.

Staff development: The postdocs involved will have the opportunity to work as part of an international network (for example working with the EBI, COSMOS, MSI and PSI) in a cutting-edge software project. The PIs will benefit through exchange of skills and expertise between partners (the team has strong expertise in software engineering, MS, NMR, data analysis and statistics).
 
Description Following an earlier NERC grant to develop Galaxy workflows, in this BBSRC grant we have continued integrating our existing metabolomics software tools into Galaxy workflows. This includes the signal processing and analysis of both direct infusion mass spectrometry and liquid chromatography-mass spectrometry based metabolomics. As part of this effort we have also conducted, and published, an international survey on the use of workflows.
Exploitation Route We anticipate widespread uptake of our Galaxy workflows for metabolomics research.
Sectors Agriculture, Food and Drink; Environment; Healthcare

 
Description The overall purpose of the project was to make data processing, analysis and dissemination for mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy metabolomics more accessible, reproducible, and transparent. Galaxy has become a popular web-based platform for building computational workflows of modular tools [ref]. We have developed a complete set of Galaxy-based tools and training material that cover a wide range of computational steps needed to get from the raw data as delivered by the instrument to the processed dataset available for biological interpretation. Additionally, we have contributed extensively to the development and integration of international data standards for metabolomics into (Galaxy-based) web-established workflows [ref]. Finally, the tools and training material have been disseminated through a number of training courses and programs (see Impact). The Galaxy-based tools and workflows that have been developed, including training material, make it much easier for scientists to process and analyse their MS and NMR datasets and subsequently deposit their datasets in public repositories [refs], such as MetaboLights. The tools and training material have been used to train several hundred scientists (through the Birmingham Metabolomics Training Centre, FutureLearn and other external training courses). As a result, the project has enabled researchers who lack skills and knowledge in metabolomics to integrate metabolomics technology into their area of science (e.g. human and environmental health, industrial biotechnology, food safety, bioenergy and synthetic biology). The tools and workflows developed to assist in depositing metabolomics datasets to public repositories have the potential for longer-term scientific and economic impacts, such as facilitating biomarker discovery, reuse of data, or reducing the amount of unnecessary data collection [ref].
The activities and dissemination of the outputs of this project have supported the development of the Galaxy platform and the associated science communities. The project and its outputs have indirectly supported the growth of the Galaxy community for metabolomics, which has resulted in the establishment of a number of Galaxy-based initiatives (e.g. PhenoMeNal, Workflow4Metabolomics, ELIXIR's Galaxy community) that use and actively develop Galaxy to make computational tasks within metabolomics more accessible, reproducible, and transparent.
1: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5192046/
2: https://academic.oup.com/bioinformatics/article/33/16/2598/3204983
3: https://www.nature.com/articles/nprot.2016.156
First Year Of Impact 2017
Sector Agriculture, Food and Drink; Chemicals; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology
Impact Types Societal

 
Title Galaxy-M metabolomics workflows 
Description Metabolomics data processing and analysis workflows embedded into Galaxy 
Type Of Material Data handling & control 
Year Produced 2014 
Provided To Others? Yes  
Impact International networking; other labs wanting us to join research grant applications 
 
Description Research collaboration with GigaScience
Organisation GigaScience
Country United Kingdom 
Sector Private 
PI Contribution Provide domain expertise in metabolomics
Collaborator Contribution Provide technical expertise in tools such as Galaxy; facilitate a link to IT activities in China
Impact See publications
Start Year 2013
 
Description First ever Massive Open Online Course (MOOC) on metabolomics titled 'Metabolomics: Understanding Metabolism in the 21st Century' 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact We developed and ran the first ever Massive Open Online Course (MOOC) on metabolomics, titled as above. The course ran for 4 weeks with more than 2000 active learners.
Year(s) Of Engagement Activity 2015
URL https://www.futurelearn.com/courses/metabolomics