Unifying metabolome and proteome informatics

Lead Research Organisation: University of Manchester
Department Name: Medical and Human Sciences


Biologists increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control the function of every living organism. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans, as well as how they break down under disease and how this breakdown can be averted.

Often the systems biology approach starts with a 'snapshot' of a particular biological sample. Mass spectrometry is a pervasive technique for gaining a snapshot of a sample, and it does this by ionising the sample and then measuring each constituent compound's mass and quantity based on the resulting charge. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Classes of protein and metabolite require different sample preparation, ionisation and chromatography approaches. These all add different kinds of biases and variation which make it extremely challenging to infer links between compounds, especially if the compounds are from different classes. To make matters worse, many snapshots are needed to capture different 'angles' of the biological process under investigation, and the instrumental conditions themselves are not entirely reproducible over time. All this has led systems biology to become a progressively computational discipline.

The academic disciplines for studying global patterns of proteins ('proteomics') and metabolites ('metabolomics') have broadly originated from different fields, and therefore there is little synergy between the two. This is also the case for the computational aspect, despite the fact that both are applied to mass spectrometry data. Cross-fertilisation of methodology and ideas therefore has the prospect of seeding novel and effective approaches to analysis. The project team is involved in the development of the prominent mzMatch and ProteoSuite informatics packages for metabolomics and proteomics respectively. They are the most actively developed academic metabolome and proteome informatics packages in the UK. Therefore there is a timely opportunity to lead a concerted effort bringing together the informatics community, methodology and software for metabolomics and proteomics to: (a) Establish a new, powerful unified informatics workflow 'borrowing strength' in methodology advancements across both fields, greater than the sum of its parts and with coherent statistical properties enabling optimal integration into systems biology research; (b) Underpin cross-disciplinary collaborations, new understanding and mobility between metabolomics and proteomics fields; and (c) Support development of joint data exchange and reporting standards for optimal integration of metabolomics and proteomics data.

To achieve this, we will first integrate mzMatch into ProteoSuite with unified data exchange and reporting. This will then enable the development of the novel unified informatics pipeline. The key is to use the same underlying statistical methodology for both types of omics, with analysis differing only in biological models utilised, thus underpinning coherent delivery to downstream systems biology modelling. We will also spearhead a programme of community involvement to encourage long-term community participation in the unified informatics approach. This will include an international one-day workshop drawing in leading groups from both metabolome and proteome informatics disciplines for the first time, in order to foster a shared mind-set towards unifying the two fields.

Technical Summary

Metabolome and proteome informatics research has originated from different fields, yet their distinct perspectives have been applied to identical or similar problems. Cross-fertilisation of methodology and ideas has the prospect of seeding novel and effective approaches to analysis. Because the two fields place different emphasis on different stages of the pipeline, a unified pipeline will maximise the potential of the whole workflow for both disciplines. To this end, we propose to bring together metabolome and proteome informatics by harnessing the prominent, open source mzMatch (metabolomics) and ProteoSuite (proteomics) packages as the central nexus to establish a unified informatics suite 'borrowing strength' in methodology advancements across both fields. The fundamental benefit will be statistically consistent and comparable metabolomics and proteomics data for optimised systems biology modelling. To attain this, we will:

1) Integrate mzMatch into ProteoSuite with unified data exchange and reporting. This will: promote synergy and researcher mobility between fields; facilitate teaching and learning of a common workflow and software; facilitate development of unified data standards through cohesive data sharing and re-use; enable an open API for community-centric development of unified informatics methodology.

2) Establish the unified informatics pipeline. The key is to use the same underlying statistical methodology for both types of omics, with analysis differing only in biological models utilised. To achieve this, we will develop novel: (a) integrated feature detection and isotope distribution modelling for metabolomics; (b) Bayesian mixture modelling for consensus identification and robust quantification in proteomics.

3) Bring together metabolome and proteome informatics communities. We will spearhead a programme of community involvement including an international one-day workshop, in order to foster a shared mind-set towards unifying the two fields.

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics and metabolomics vendors. The proposed unified informatics suite and pipeline will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses amongst systems biology researchers.

The proposed unified informatics pipeline could be seen to be in competition with software products from vendors and instrument manufacturers, particularly Progenesis LC-MS and CoMet (Nonlinear Dynamics, Newcastle, UK). However, since our software is distributed with a permissive license allowing for its unrestricted re-use in other software packages, both free and commercial, we hope that our work will aid commercial software products similarly and therefore raise the bar for the whole field.

There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our aim is to establish, through cross-fertilisation between metabolome and proteome informatics, a powerful unified workflow with coherent statistical properties enabling optimal downstream integration into the systems biology paradigm. Due to its success and further substantial promise, the BBSRC, UK research councils and industry have invested greatly in the systems biology approach. The potential improvements yielded by our unified workflow will therefore have a clear dissemination route to the public through reduced resources, costs and overheads required for discoveries realised with systems approaches in environmental, biological and biomedical science, and the characterisation of those discoveries.

The PDRA employed on this grant will benefit significantly from exposure to the wealth of metabolome and proteome informatics expertise we will bring together in our proposed community-based initiative, particularly since the PDRA will be encouraged to play a significant role in public dissemination. They will also benefit from the uniquely concentrated cross-disciplinary interaction at CADET and the Manchester Institute of Biotechnology.


Description We have published journal papers on Bayesian clustering and alignment approaches for metabolomics that significantly improve on the state-of-the-art. We are preparing further manuscripts on seaMass-enabled peak/feature detection for metabolomics (and proteomics), and a Bayesian approach for metabolite quantification based on our protein-level differential expression model in Proteomics (Freeman et al, Diabetes 2016). Pilot work based on these technologies has contributed to a University of Liverpool EPSRC Impact Accelerator Award on clinical diagnostics through metabolomics profiling.

We have also brought the metabolomics and proteomics communities together for the first time by running the 1st International Workshop for Proteome and Metabolome Informatics. This was run as part of ECCB 2014, Strasbourg, which facilitated wider participation. We brought metabolomics practitioners together with synthetic biology, robotics and data science experts at a national workshop run in March 2016.
Exploitation Route We have already increased international collaboration and awareness through our workshops. We will continue to disseminate our complete unified pipeline for proteomics and metabolomics, including through the EPSRC Impact Accelerator Award.
Sectors Agriculture, Food and Drink; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology

Description Technology developed in this grant is being extended to characterise impurities in oligonucleotide drugs, funded by AstraZeneca.
First Year Of Impact 2022
Sector Pharmaceuticals and Medical Biotechnology
Description Novel semi-supervised Bayesian learning to rapidly screen new oligonucleotide drugs for impurities
Amount £104,203 (GBP)
Organisation AstraZeneca 
Sector Private
Country United Kingdom
Start 09/2021 
End 09/2025
Description University of Liverpool EPSRC Impact Accelerator
Amount £21,844 (GBP)
Organisation University of Liverpool 
Sector Academic/University
Country United Kingdom
Start 04/2016 
End 06/2016
Description University of Manchester BBSRC Impact Accelerator
Amount £20,926 (GBP)
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 01/2016 
End 06/2016
Title Topic Modeling for Untargeted Substructure Exploration in Metabolomics 
Description The dataset consists of liquid chromatography mass spectrometry (LC-MS) data and LC-MS with gas-phase fragmentation experiments (LC-MS/MS) data generated from four beer samples. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Title MS2LDA 
Description Fragmentation spectra, which provide the characteristic fingerprints of compounds, also contain structural information where a subset of fragment peaks may correspond to a shared chemical substructure in a class of compounds. The aim of this site is to provide an online platform that allows users to perform unsupervised substructure discovery in fragmentation experiments, decompose fragmentation experiments into characterized substructures (Mass2Motifs) found in MS/MS spectra of reference compounds, and integrate fragmentation analysis with comparative metabolomics experiments. In our proposed method (MS2LDA - Simon Rogers et al.), discrete fragment and neutral loss features are extracted from fragmentation spectra. Related features that tend to co-occur are detected using the Latent Dirichlet Allocation model. The figure below shows the analogy between LDA for text and MS2LDA for fragment and neutral loss features. LDA finds topics interpreted as 'football-related', 'business-related' and 'environment-related'. MS2LDA finds sets of co-occurring mass fragments or losses (Mass2Motifs) that can be interpreted as 'Asparagine-related', 'Hexose-related' and 'Adenine-related'. The tool currently accepts fragmentation experiments in various formats (mzML, MSP, MGF), and optionally an MS1 peak list can be added, to which the MS1 peaks found in the fragmentation experiment are then matched prior to running LDA or Decomposition.
Type Of Technology Webtool/Application 
Year Produced 2017 
Open Source License? Yes  
Impact MS2LDA aids in structural annotation of metabolites and guides prioritization of analysis by using Mass2Motif prevalence. Uptake is promising. 
URL http://ms2lda.org/
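The feature extraction step described for MS2LDA above (discrete fragment and neutral loss features taken from each fragmentation spectrum) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MS2LDA implementation: the function name `spectrum_to_words`, the fixed bin width, and the tuple-keyed output are all hypothetical choices made for this example.

```python
from collections import Counter


def spectrum_to_words(precursor_mz, peaks, bin_width=0.01):
    """Convert one MS/MS spectrum into a bag of discrete 'words'.

    Hypothetical helper for illustration only (not part of MS2LDA).
    precursor_mz: m/z of the precursor ion.
    peaks: iterable of (fragment_mz, intensity) pairs.
    bin_width: assumed m/z bin size used to discretise masses.
    Returns a Counter keyed by (kind, bin_index), weighted by intensity.
    """
    words = Counter()
    for fragment_mz, intensity in peaks:
        if intensity <= 0:
            continue  # ignore empty or invalid peaks
        # Fragment feature: the observed fragment mass, discretised.
        words[("fragment", round(fragment_mz / bin_width))] += intensity
        # Neutral loss feature: precursor mass minus fragment mass.
        loss = precursor_mz - fragment_mz
        if loss > 0:
            words[("loss", round(loss / bin_width))] += intensity
    return words


# Made-up spectrum with a precursor at m/z 200.00 and two fragment peaks.
example = spectrum_to_words(200.0, [(85.03, 10.0), (137.05, 5.0)])
for (kind, bin_index), weight in sorted(example.items()):
    print(kind, bin_index, weight)
```

In this framing each spectrum plays the role of a document and each binned fragment or loss plays the role of a word, so the resulting counts could be handed to any Latent Dirichlet Allocation implementation to look for co-occurring feature sets.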
Description 1st International Workshop on Unified Proteome and Metabolome Informatics 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Lead organiser for the first workshop bringing together the metabolome and proteome informatics fields, held at the European Conference on Computational Biology 2014, Strasbourg, France. Workshop was held on Saturday 6th September 2014, and has contributed to impact of new HUPO and ISCB Computational Mass Spectrometry Working Groups.
Year(s) Of Engagement Activity 2014
URL http://www.ebi.ac.uk/eccb/2014/eccb14.loria.fr/index.html
Description The fusion of data science, autonomous systems and robotics to enable next-generation synthetic biology and biomedicine 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact On 9th March 2016 we held a workshop on Big Data, Robotics and Autonomous Systems in Synthetic Biology, including Metabolomics. On the day before we held a smaller workshop targeted solely at Data Science, with 18 delegates across Liverpool, Manchester and Glasgow. The main workshop brought together researchers in Data Science, and Robotics and Autonomous Systems (RAS), with the combined expertise of the new Synthetic Biology centres across the UK, to discuss how synthetic biology can be enabled and scaled up through technology development. In particular, the main focus of the workshop was to envisage the jump from automation to autonomy, where cutting edge Data Science and RAS are harnessed to realise machine-intelligent instrumentation and eventually a self-optimising synthetic biology paradigm. There were 50 delegates from 10 UK universities (Bristol, Newcastle, Warwick, UCL, Birmingham, Cambridge, Manchester, Edinburgh, Imperial, Liverpool, Glasgow) as well as the BBSRC, KTN and two industrial participants (Synthace, m2p-labs). The morning sessions included 5 scene-setting talks and 10 lightning talks from delegates. In the afternoon, we facilitated two breakout sessions: first a cross-theme breakout (what are the key challenges restricting autonomy?), followed by an intra-theme breakout (how can they be solved and sustained?). Agreed outcomes at a UK-wide level include Data Exchange & Standardisation: defining the core set of metadata needed to transfer data between different instruments, establishing a network on standardisation around RAS, and communicating with manufacturers to encourage them to support an annual automation users' forum. A number of research consortia were also built by attendees, including: (i) Autonomous systems 'learning to learn'; robots should be able to operate outside of a pre-defined framework, i.e. robot technicians are still very much needed; (ii) Potential value in very large data repositories for data mining, e.g. a database for assemblies and constructs; genome foundries talking to each other to share data and protocols.
Year(s) Of Engagement Activity 2016
URL http://www.biospi.org/workshops/bdras-synbio/