Unifying metabolome and proteome informatics

Lead Research Organisation: University of Manchester

Department Name: Medical and Human Sciences

Abstract

Biologists increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control the function of every living organism. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans, as well as how they break down under disease and how this breakdown can be averted.

Often the systems biology approach starts with a 'snapshot' of a particular biological sample. Mass spectrometry is a pervasive technique for gaining a snapshot of a sample, and it does this by ionising the sample and then measuring each constituent compound's mass and quantity based on the resulting charge. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Classes of protein and metabolite require different sample preparation, ionisation and chromatography approaches. These all add different kinds of biases and variation which make it extremely challenging to infer links between compounds, especially if the compounds are from different classes. To make matters worse, many snapshots are needed to capture different 'angles' of the biological process under investigation, and the instrumental conditions themselves are not entirely reproducible over time. All this has led systems biology to become a progressively computational discipline.
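To make the measurement concrete: a mass spectrometer reports a mass-to-charge ratio (m/z) for each ion rather than a mass directly. The following minimal Python sketch is illustrative only and is not part of the project software. It assumes positive-mode ionisation by protonation, and the function name `protonated_mz` is hypothetical.

```python
PROTON_MASS = 1.007276466  # mass of a proton in daltons (Da)


def protonated_mz(neutral_mass, charge=1):
    """Return the m/z of an [M + zH]^z+ ion.

    neutral_mass: monoisotopic mass of the neutral compound in Da.
    charge: number of protons added (assumed positive-mode protonation).
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# Glucose (C6H12O6) has a monoisotopic mass of about 180.06339 Da.
glucose_mz = protonated_mz(180.06339, charge=1)
# A hypothetical 1000 Da peptide observed with two charges.
peptide_mz = protonated_mz(1000.0, charge=2)
```

The same neutral compound therefore appears at different m/z values depending on its charge state, which is one reason downstream software has to group related signals before quantification.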

The academic disciplines for studying global patterns of proteins ('proteomics') and metabolites ('metabolomics') have broadly originated from different fields, and therefore there is little synergy between the two. This is also the case for the computational aspect, despite the fact that both are applied to mass spectrometry data. Cross-fertilisation of methodology and ideas therefore has the prospect of seeding novel, effective approaches to analysis. The project team is involved in the development of the prominent mzMatch and ProteoSuite informatics packages for metabolomics and proteomics respectively. They are the most actively developed academic metabolome and proteome informatics packages in the UK. Therefore there is a timely opportunity to lead a concerted effort bringing together the informatics community, methodology and software for metabolomics and proteomics to: (a) Establish a new, powerful unified informatics workflow 'borrowing strength' in methodology advancements across both fields, greater than the sum of its parts and with coherent statistical properties enabling optimal integration into systems biology research; (b) Underpin cross-disciplinary collaborations, new understanding and mobility between metabolomics and proteomics fields; and (c) Support development of joint data exchange and reporting standards for optimal integration of metabolomics and proteomics data.

To achieve this, we will first integrate mzMatch into ProteoSuite with unified data exchange and reporting. This will then enable the development of the novel unified informatics pipeline. The key is to use the same underlying statistical methodology for both types of omics, with analysis differing only in biological models utilised, thus underpinning coherent delivery to downstream systems biology modelling. We will also spearhead a programme of community involvement to encourage long-term community participation in the unified informatics approach. This will include an international one-day workshop drawing in leading groups from both metabolome and proteome informatics disciplines for the first time, in order to foster a shared mind-set towards unifying the two fields.

Technical Summary

Metabolome and proteome informatics research has originated from different fields, yet their distinct perspectives have been applied to identical or similar problems. Cross-fertilisation of methodology and ideas has the prospect of seeding novel, effective approaches to analysis. Because the two fields place differing emphasis on different stages of the pipeline, a unified pipeline will maximise the potential of the whole workflow for both disciplines. To this end, we propose to bring together metabolome and proteome informatics by harnessing the prominent, open source mzMatch (metabolomics) and ProteoSuite (proteomics) packages as the central nexus to establish a unified informatics suite 'borrowing strength' in methodology advancements across both fields. The fundamental benefit will be statistically consistent and comparable metabolomics and proteomics data for optimised systems biology modelling. To attain this, we will:

1) Integrate mzMatch into ProteoSuite with unified data exchange and reporting. This will: promote synergy and researcher mobility between fields; facilitate teaching and learning of a common workflow and software; facilitate development of unified data standards through cohesive data sharing and re-use; enable an open API for community-centric development of unified informatics methodology.

2) Establish the unified informatics pipeline. The key is to use the same underlying statistical methodology for both types of omics, with analysis differing only in biological models utilised. To achieve this, we will develop novel: (a) integrated feature detection and isotope distribution modelling for metabolomics; (b) Bayesian mixture modelling for consensus identification and robust quantification in proteomics.

3) Bring together metabolome and proteome informatics communities. We will spearhead a programme of community involvement including an international one-day workshop, in order to foster a shared mind-set towards unifying the two fields.
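As a rough illustration of what isotope distribution modelling in step 2(a) involves, the sketch below predicts the relative heights of the isotope peaks of a compound from its carbon count alone, using a binomial model. This is a deliberately simplified, carbon-only approximation and not the method developed in the project; the function name `carbon_isotope_pattern` and its parameters are hypothetical.

```python
from math import comb

C13_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13


def carbon_isotope_pattern(n_carbons, max_peaks=4, p=C13_ABUNDANCE):
    """Return probabilities of observing 0, 1, 2, ... carbon-13 atoms.

    Each carbon is treated as independently carbon-13 with probability p,
    so the number of heavy atoms follows a Binomial(n_carbons, p) law.
    Other elements (H, N, O, S) are ignored in this simplification.
    """
    if n_carbons < 0:
        raise ValueError("n_carbons must be non-negative")
    n_peaks = min(max_peaks, n_carbons + 1)
    return [
        comb(n_carbons, k) * p**k * (1 - p) ** (n_carbons - k)
        for k in range(n_peaks)
    ]


# Expected pattern for a six-carbon metabolite such as glucose.
pattern = carbon_isotope_pattern(6)
```

Comparing an observed cluster of peaks against such a predicted pattern is one way feature detection can decide whether the peaks belong to the same compound.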

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics and metabolomics vendors. The proposed unified informatics suite and pipeline will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses amongst systems biology researchers.

The proposed unified informatics pipeline could be seen to be in competition with software products from vendors and instrument manufacturers, particularly Progenesis LC-MS and CoMet (Nonlinear Dynamics, Newcastle, UK). However, since our software is distributed with a permissive license allowing for its unrestricted re-use in other software packages, both free and commercial, we hope that our work will aid commercial software products similarly and therefore raise the bar for the whole field.

There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our aim is to establish, through cross-fertilisation between metabolome and proteome informatics, a powerful unified workflow with coherent statistical properties enabling optimal downstream integration into the systems biology paradigm. Due to its success and further substantial promise, the BBSRC, UK research councils and industry have invested greatly in the systems biology approach. The potential improvements yielded by our unified workflow will therefore have a clear dissemination route to the public through reduced resources, costs and overheads required for discoveries realised with systems approaches in environmental, biological and biomedical science, and the characterisation of those discoveries.

The PDRA employed on this grant will benefit significantly from exposure to the wealth of metabolome and proteome informatics expertise we will bring together in our proposed community-based initiative, particularly since the PDRA will be encouraged to play a significant role in public dissemination. They will also benefit from the uniquely concentrated cross-disciplinary interaction at CADET and the Manchester Institute of Biotechnology.

Funded Value:

£144,291

Funded Period:

Jun 14 - Jan 15

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/L018616/1

Principal Investigator:

Andrew Dowsey

Research Subject:

Mathematical sciences (48%)

Tools, technologies & methods (48%)

Research Topic:

Bioinformatics (48%)

Statistics & Appl. Probability (48%)

Organisations

University of Manchester (Lead Research Organisation)

People	ORCID iD
Andrew Dowsey (Principal Investigator)	http://orcid.org/0000-0002-7404-9128
Simon Rogers (Co-Investigator)
Rainer Breitling (Co-Investigator)

Publications

Daly R (2014) MetAssign: probabilistic annotation of metabolites from LC-MS data using a Bayesian clustering approach. in Bioinformatics (Oxford, England)

Van Der Hooft JJ (2016) Topic modeling for untargeted substructure exploration in metabolomics. in Proceedings of the National Academy of Sciences of the United States of America

Van Der Hooft JJJ (2017) Unsupervised Discovery and Comparison of Structural Families Across Multiple Samples in Untargeted Metabolomics. in Analytical chemistry

Wandy J (2015) Incorporating peak grouping information for alignment of multiple liquid chromatography-mass spectrometry datasets. in Bioinformatics (Oxford, England)

Wandy J (2018) Ms2lda.org: web-based topic modelling for substructure discovery in mass spectrometry. in Bioinformatics (Oxford, England)

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Software and Technical Products
Engagement Activities


Description	We have published journal papers on Bayesian clustering and alignment approaches for metabolomics that significantly improve on the state of the art. We are preparing further manuscripts on seaMass-enabled peak/feature detection for metabolomics (and proteomics), and a Bayesian approach for metabolite quantification based on our protein-level differential expression model in proteomics (Freeman et al., Diabetes 2016). Pilot work based on these technologies has contributed to a University of Liverpool EPSRC Impact Accelerator Award on clinical diagnostics through metabolomics profiling. We have also brought the metabolomics and proteomics communities together for the first time by running the 1st International Workshop for Proteome and Metabolome Informatics. This was run as part of ECCB 2014, Strasbourg, which facilitated wider participation. We brought metabolomics practitioners together with synthetic biology, robotics and data science experts at a national workshop run in March 2016.
Exploitation Route	We have already increased international collaboration and awareness through our workshops. We plan to disseminate our complete unified pipeline for proteomics and metabolomics in the near future, including through the EPSRC Impact Accelerator Award.
Sectors	Agriculture, Food and Drink; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology


Description	Technology developed in this grant is being further developed for characterising impurities in oligonucleotide drugs, funded by AstraZeneca.
First Year Of Impact	2022
Sector	Pharmaceuticals and Medical Biotechnology


Description	Novel semi-supervised Bayesian learning to rapidly screen new oligonucleotide drugs for impurities
Amount	£104,203 (GBP)
Organisation	AstraZeneca
Sector	Private
Country	United Kingdom
Start	08/2021
End	09/2025


Description	University of Liverpool EPSRC Impact Accelerator
Amount	£21,844 (GBP)
Organisation	University of Liverpool
Sector	Academic/University
Country	United Kingdom
Start	03/2016
End	06/2016


Description	University of Manchester BBSRC Impact Accelerator
Amount	£20,926 (GBP)
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	01/2016
End	06/2016


Title	Topic Modeling for Untargeted Substructure Exploration in Metabolomics
Description	The dataset consists of liquid chromatography mass spectrometry (LC-MS) data and LC-MS with gas-phase fragmentation experiments (LC-MS/MS) data generated from four beer samples.
Type Of Material	Database/Collection of data
Year Produced	2016
Provided To Others?	Yes


Title	MS2LDA
Description	Fragmentation spectra, which provide the characteristic fingerprints of compounds, also contain structural information where a subset of fragment peaks may correspond to a shared chemical substructure in a class of compounds. The aim of this site is to provide an online platform that allows users to perform unsupervised substructure discovery in fragmentation experiments, decompose fragmentation experiments into characterized substructures (Mass2Motifs) found in MS/MS spectra of reference compounds, and integrate fragmentation analysis with comparative metabolomics experiments. In our proposed method (MS2LDA - Simon Rogers et al.), discrete fragment and neutral loss features are extracted from fragmentation spectra. Related features that tend to co-occur are detected using the Latent Dirichlet Allocation (LDA) model. MS2LDA is analogous to LDA for text: where LDA finds topics interpreted as 'football-related', 'business-related' and 'environment-related', MS2LDA finds sets of co-occurring mass fragments or losses (Mass2Motifs) that can be interpreted as 'Asparagine-related', 'Hexose-related' and 'Adenine-related'. The tool currently accepts fragmentation experiments in various formats (mzML, MSP, MGF). Optionally, an MS1 peak list can be added, to which the MS1 peaks found in the fragmentation experiment are matched prior to running LDA or Decomposition.
Type Of Technology	Webtool/Application
Year Produced	2017
Open Source License?	Yes
Impact	MS2LDA aids in structural annotation of metabolites and guides prioritization of analysis by using Mass2Motif prevalence. Uptake is promising.
URL	http://ms2lda.org/
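The feature extraction step described for MS2LDA can be pictured as turning each fragmentation spectrum into a 'document' of fragment and neutral loss 'words' that a topic model such as LDA can consume. The sketch below is a minimal illustration under stated assumptions and is not the MS2LDA implementation: the function name `spectrum_to_words`, the fixed two-decimal binning and the intensity-to-count scaling are all hypothetical choices.

```python
from collections import Counter


def spectrum_to_words(precursor_mz, peaks, decimals=2, scale=100):
    """Convert one MS/MS spectrum into a bag of words.

    precursor_mz: m/z of the fragmented (parent) ion.
    peaks: list of (fragment_mz, intensity) pairs.
    Each peak yields a 'fragment_<mz>' word and, if the fragment is lighter
    than the precursor, a 'loss_<precursor - mz>' word. Word counts are the
    peak intensities rescaled so the most intense peak has count `scale`.
    """
    words = Counter()
    if not peaks:
        return words
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        return words
    for mz, intensity in peaks:
        count = round(scale * intensity / max_intensity)
        if count <= 0:
            continue  # drop peaks too weak to register at this scale
        words[f"fragment_{mz:.{decimals}f}"] += count
        loss = precursor_mz - mz
        if loss > 0:
            words[f"loss_{loss:.{decimals}f}"] += count
    return words


# Hypothetical spectrum: precursor at m/z 181.07 with two fragment peaks.
doc = spectrum_to_words(181.07, [(163.06, 100.0), (85.03, 40.0)])
```

A collection of such word counts, one per spectrum, forms the document-term matrix on which a standard LDA implementation could then be run to look for co-occurring features.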


Description	1st International Workshop on Unified Proteome and Metabolome Informatics
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Lead organiser for the first workshop bringing together the metabolome and proteome informatics fields, held at the European Conference on Computational Biology 2014, Strasbourg, France. Workshop was held on Saturday 6th September 2014, and has contributed to impact of new HUPO and ISCB Computational Mass Spectrometry Working Groups.
Year(s) Of Engagement Activity	2014
URL	http://www.ebi.ac.uk/eccb/2014/eccb14.loria.fr/index.html


Description	The fusion of data science, autonomous systems and robotics to enable next-generation synthetic biology and biomedicine
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	On 9th March 2016 we held a workshop on Big Data, Robotics and Autonomous Systems in Synthetic Biology, including Metabolomics. On the day before we held a smaller workshop targeted solely at Data Science, with 18 delegates from across Liverpool, Manchester and Glasgow. The main workshop brought together researchers in Data Science, and Robotics and Autonomous Systems (RAS), with the combined expertise of the new Synthetic Biology centres across the UK, to discuss how synthetic biology can be enabled and scaled up through technology development. In particular, the main focus of the workshop was to envisage the jump from automation to autonomy, where cutting-edge Data Science and RAS are harnessed to realise machine-intelligent instrumentation and eventually a self-optimising synthetic biology paradigm. There were 50 delegates from 10 UK universities (Bristol, Newcastle, Warwick, UCL, Birmingham, Cambridge, Manchester, Edinburgh, Imperial, Liverpool, Glasgow) as well as the BBSRC, KTN and two industrial participants (Synthace, m2p-labs). The morning sessions included 5 scene-setting talks and 10 lightning talks from delegates. In the afternoon, we facilitated two breakout sessions: first a cross-theme breakout (what are the key challenges restricting autonomy?), followed by an intra-theme breakout (how can they be solved and sustained?). Agreed outcomes at a UK-wide level include Data Exchange & Standardisation - defining the core set of metadata needed to transfer data between different instruments, with a network on standardisation around RAS, and communication with manufacturers to encourage them to support an annual automation users' forum. A number of research consortia were also built by attendees, including: (i) Autonomous systems 'learning to learn'; Robots should be able to operate outside of a pre-defined framework, i.e. robot technicians are still very much needed; (ii) Potential value in very large data repositories for data mining, e.g. a database for assemblies and constructs; Genome foundries talking to each other to share data and protocols.
Year(s) Of Engagement Activity	2016
URL	http://www.biospi.org/workshops/bdras-synbio/