A holistic statistical modelling approach to quantitative discovery proteomics and metabolomics for underpinning integrative systems medicine

Lead Research Organisation: University of Liverpool
Department Name: Electrical Engineering and Electronics

Abstract

Medical researchers increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control human function, how they break down under disease and how this breakdown can be averted. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans. This huge body of evidence from life sciences research provides ample justification for the widespread potential in translation to systems medicine, for empowering medical research, biomarker discovery and personalised medicine.

Often the systems medicine approach starts with snapshots of a particular biological sample and supporting readings or clinical data. Mass spectrometry is a pervasive technique for gaining a snapshot of a sample, and it does this by ionising the sample and then measuring each constituent compound's mass and quantity based on the resulting charge. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Due to technical and biological variations, it is necessary to analyse multiple samples to get reliable readings. Furthermore, different classes of proteins and metabolites require different sample preparation, different chromatography settings and different types of mass spectrometry instrumentation. These all add different kinds of biases and variation. Moreover, in biomedical research, despite stringent control of confounding factors in experimental design, a step-change in complexity and variation is evident within typical disease models and clinical samples.

Unfortunately, bioanalytical and bioinformatics methodology for protein and metabolite mass spectrometry is fundamentally reliant on the simplifying characteristics of well-controlled systems biology studies, and performs poorly on complex biomedical samples. Since the datasets are so large, the existing computational techniques tend to convert the rich raw data from mass spectrometry output to a symbolic representation of compounds too early on. The integration of the complement of protein and metabolite measurements from biomedical samples into rigorous statistical models for translational research, clinical trial design and clinical diagnostic and prognostic prediction is reliant on their appropriate and accurate statistical handling. Unfortunately, this is exceptionally problematic with current approaches.

We instead advocate that all experimental raw data across proteins, metabolites and gene expression should be modelled together, so that statistical 'strength' can be borrowed across the collection when making decisions about whether a compound or compound interaction truly exists in the data and at what level of confidence and relative quantity between health and disease. We propose that with a holistic model precisely evaluating all the statistical variation and bias across complete experimental designs, we can significantly increase our understanding of underlying variations in mass spectrometry experiments in the clinical setting and provide an enabling pathway to improving data analysis and interpretation, ultimately leading to enhanced sensitivity and robustness of these technologies to benefit translational and clinical research.

Technical Summary

The huge body of evidence from life sciences research provides ample justification for the widespread potential of mass spectrometry-based discovery proteomics and metabolomics in systems medicine. Current bioinformatics methodology is reliant on the characteristics of controlled systems biology studies, but there is a step-change in complexity, heterogeneity and confounding within typical disease models and clinical samples.

The pervasive approach to label-free liquid chromatography-mass spectrometry analysis is an ad-hoc series of data-reduction steps. Once the rich raw data has been discarded, multiple confounding effects are coalesced into single measurements with complex error distributions, non-linearity and large numbers of outliers and missing values. Data integration into rigorous statistical models for pre-clinical and clinical trial design, prediction models and latent endotype models is reliant on appropriate statistical handling, which is currently problematic. There is an opportunity to supersede the current ad-hoc quantitation pipeline with holistic Bayesian modelling able to cope with the demands of translational and clinical data:

(1) We will develop variational Bayes latent variable selection methodology for unmixing, extraction and quantification under uncertainty, with strength borrowed via consensus signal morphology across runs. This will deliver statistically valid measurements with approximate posteriors.

(2) We will develop a holistic modelling platform for systems medicine with Bayesian mixed-effects, integrating Bayesian protein and metabolite inference on the extracted posteriors, Bayesian transcriptomics methodology and experimental metadata.

(3) We will acquire and disseminate first-in-class translational and clinical proteomics/metabolomics model experiments faithfully reflecting the challenges, plus apply the methodology to production experiments. This will answer how to appropriately validate for systems medicine applications.
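
The idea of borrowing statistical strength that runs through aims (1) and (2) can be illustrated with a minimal sketch. It is not the project's methodology: it swaps the proposed variational Bayes and Bayesian mixed-effects models for a much simpler empirical Bayes normal-normal model, and the function name partial_pool, its inputs and the example numbers are all hypothetical. Each protein's log-ratio estimate is pulled toward the mean of the collection, and noisier estimates are pulled further.

```python
from statistics import mean


def partial_pool(estimates, variances):
    """Shrink per-protein estimates toward the collection mean.

    Illustrative assumptions (not from the source): estimates are
    per-protein log-ratios, variances are their known sampling variances,
    and the true values follow a common normal distribution whose mean is
    approximated by the unweighted mean of the estimates.
    """
    n = len(estimates)
    grand = mean(estimates)
    # Method-of-moments estimate of the between-protein variance,
    # floored at zero so the shrinkage weight stays in [0, 1].
    total_var = sum((y - grand) ** 2 for y in estimates) / (n - 1)
    tau2 = max(total_var - mean(variances), 0.0)

    pooled = []
    for y, s2 in zip(estimates, variances):
        # Weight on the protein's own data: precise estimates keep most
        # of their value, noisy ones move toward the shared mean.
        w = tau2 / (tau2 + s2) if tau2 + s2 > 0 else 0.0
        pooled.append(grand + w * (y - grand))
    return pooled


# Hypothetical example: the last protein is measured far less precisely.
example = partial_pool([0.0, 2.0, -2.0, 4.0], [0.1, 0.1, 0.1, 4.0])
```

In this toy setting the poorly measured fourth protein moves much closer to the shared mean than the three well-measured ones, which is the behaviour a full hierarchical model would formalise with proper posterior uncertainty.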

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics and metabolomics vendors. The developed feature extraction methodology will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in translational research in industry and academia, as well as a wider audience of users and uses.

The proposed discovery engine could be seen to be in direct competition with informatics products from software vendors and instrument manufacturers. In fact we perceive a symbiotic relationship with user-centric discovery packages such as Progenesis (Nonlinear Dynamics, Newcastle, UK). The majority of development time for these packages is spent in data import/export, graphical interface, workflow, and results presentation. They also expose interfaces to popular search engines for feature identification including Mascot (Matrix Science, London, UK), which is an essential source of complementary information for the proposed platform. We will therefore investigate the commercialisation of our methodology, for which there is no equivalent in their systems. This could potentially occur in the short to medium term. Nevertheless, we are committed to providing our methodology freely for academic use.

There is considerable potential in this application for providing indirect benefits across the spectrum of public health, treatment and quality of life for patients and the general public in the UK and abroad. Our stated aim is to enable reliable and precise statistical evidence from large-scale cross-omics experiments, such as those using the systems medicine approach, which is becoming increasingly indispensable. This improvement will disseminate to the National Health Service and public through a reduction in the considerable costs of the systems medicine approach and through increased sensitivity for making medical discoveries and characterising those discoveries. Since the system is designed to elucidate underlying effects across experimental and omics techniques, it is also possible that tertiary disease processes could be identified which otherwise would go unnoticed. This has the potential to deliver further novel breakthroughs and characterise potentially interfering disease processes, therefore avoiding subsequent misallocation of resources.
 
Description Bilateral BBSRC/NSF Responsive Mode Grant
Amount £1,043,055 (GBP)
Funding ID BB/M024954/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 10/2015 
End 09/2018
 
Description Enabling advanced analytics for all users of the proteomics facility
Amount £4,172 (GBP)
Organisation University of Bristol 
Sector Academic/University
Country United Kingdom
Start 01/2018 
End 07/2018
 
Description Methodology Research Panel
Amount £594,485 (GBP)
Funding ID MR/N028457/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 04/2017 
End 03/2020
 
Description University of Liverpool EPSRC Impact Accelerator
Amount £21,844 (GBP)
Organisation University of Liverpool 
Sector Academic/University
Country United Kingdom
Start 04/2016 
End 06/2016
 
Description University of Manchester BBSRC Impact Accelerator
Amount £20,926 (GBP)
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 01/2016 
End 06/2016
 
Description Prof Jeffrey Morris 
Organisation University of Texas
Department M. D. Anderson Cancer Center
Country United States 
Sector Academic/University 
PI Contribution Translation of Prof Morris' Wavelet Functional Mixed Model methodology to the proteomics LC-MS (Liquid Chromatography - Mass Spectrometry) field.
Collaborator Contribution Access to Prof Morris' expertise and unpublished methodology in order to create our novel differential analysis workflow for raw LC-MS data.
Impact Two publications [Liao et al, IEEE ISBI 2014; Dowsey et al Proteomics, 2010, 4226-57] plus a successful submission to the September 2014 BBSRC Bilateral NSF/BIO-BBSRC responsive mode call [BB/M024954/1].
Start Year 2009
 
Title BayesProt v1.0 
Description BayesTraq: a Bayesian mixed-effects model for protein quantification in iTraq clinical proteomics 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Significantly improves the sensitivity and robustness of differential analysis in iTRAQ proteomics
URL http://www.biospi.org/research/ms/bayestraq/
 
Title mzMLb 
Description
Type Of Technology Software 
Year Produced 2018 
Impact Proteomics Standards Initiative standards compatible binary mass spectrometry data format for efficient read/write speed and storage space requirements 
URL https://github.com/biospi/mzmlb
 
Title seaMass 
Description The seaMass software is our open source dissemination route for the LC-MS (Liquid Chromatography - Mass Spectrometry) analysis algorithms developed by our group, including signal restoration and visualisation. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact The software has only recently been released, but there is strong interest in its incorporation into the ProteoSuite consortium's BBSRC BBR funded user-centric proteomics software (http://www.proteosuite.org/?q=aboutus).
URL http://www.biospi.org/research/ms/seamass/
 
Description Conference Talk: Building a new computational biomarker discovery platform for clinical SWATH with image analysis techniques 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk sponsored by vendor SCIEX at 'ProteoMMX 4.0', Chester, 5th-7th April 2016. Part of activities around collaboration with SCIEX and the Manchester MRC funded Stoller Biomarker Discovery Centre.
Year(s) Of Engagement Activity 2016
 
Description Conference Talk: Robust protein-level differential analysis for iTRAQ through a Bayesian mixed-effects model 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Abstract selected for plenary talk at 'British Society for Proteome Research Meeting', Reading, 20th-22nd Jul 2015. Led to collaboration with MRC Centre for Drug Safety Science at University of Liverpool on analysis of their clinical data.
Year(s) Of Engagement Activity 2015
URL http://www.bspr.org/event/bspr-meeting-2015