A holistic statistical modelling approach to quantitative discovery proteomics and metabolomics for underpinning integrative systems medicine

Lead Research Organisation: University of Liverpool

Department Name: Electrical Engineering and Electronics

Abstract

Medical researchers increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control human function, how they break down under disease and how this breakdown can be averted. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans. This huge body of evidence from life sciences research provides ample justification for the widespread potential in translation to systems medicine, for empowering medical research, biomarker discovery and personalised medicine.

Often the systems medicine approach starts with snapshots of a particular biological sample and supporting readings or clinical data. Mass spectrometry is a pervasive technique for gaining a snapshot of a sample, and it does this by ionising the sample and then measuring the mass-to-charge ratio and quantity of each constituent compound. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Due to technical and biological variations, it is necessary to analyse multiple samples to get reliable readings. Furthermore, classes of proteins and metabolites require different sample preparation, different chromatography settings and different types of mass spectrometry instrumentation. These all add different kinds of biases and variation. Moreover, in biomedical research, despite stringent control of confounding factors in experimental design, a step-change in complexity and variation is evident within typical disease models and clinical samples.

Unfortunately, bioanalytical and bioinformatics methodology for protein and metabolite mass spectrometry is fundamentally reliant on the simplifying characteristics of well-controlled systems biology studies, and performs poorly on complex biomedical samples. Since the datasets are so large, the existing computational techniques tend to convert the rich raw data from mass spectrometry output to a symbolic representation of compounds too early on. The integration of the complement of protein and metabolite measurements from biomedical samples into rigorous statistical models for translational research, clinical trial design and clinical diagnostic and prognostic prediction is reliant on their appropriate and accurate statistical handling. Unfortunately, this is exceptionally problematic with current approaches.

We instead advocate that all experimental raw data across proteins, metabolites and gene expression should be modelled together, so statistical 'strength' can be borrowed across the collection when making decisions about whether a compound or compound interaction truly exists in the data and at what level of confidence and relative quantity between health and disease. We propose that with a holistic model precisely evaluating all the statistical variation and bias across complete experimental designs, we can significantly increase our understanding of underlying variations in mass spectrometry experiments in the clinical setting and provide an enabling pathway to improving data analysis and interpretation, ultimately leading to enhanced sensitivity and robustness of these technologies to benefit translational and clinical research.
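
The idea of borrowing statistical strength can be illustrated with a minimal sketch. It assumes a simple normal-normal hierarchical model with known per-compound measurement variances, which is far simpler than the holistic model proposed here. The function name `shrink_means` and the moment-based estimate of the between-compound variance are illustrative choices, not part of the project's methodology.

```python
import numpy as np

def shrink_means(y, sigma2):
    """Partial pooling under a normal-normal model (illustrative only).

    y      : observed effect per compound (e.g. log fold change)
    sigma2 : known measurement variance of each observed effect
    Returns the shrunken estimates and their conditional variances.
    """
    y = np.asarray(y, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    # Precision-weighted grand mean across all compounds.
    w = 1.0 / sigma2
    mu = np.sum(w * y) / np.sum(w)

    # Crude method-of-moments estimate of the between-compound
    # variance, floored at zero (an assumption of this sketch).
    tau2 = max(np.var(y, ddof=1) - np.mean(sigma2), 0.0)

    # Shrinkage factor in [0, 1]: noisier measurements are pulled
    # more strongly towards the grand mean.
    b = sigma2 / (sigma2 + tau2)
    post_mean = b * mu + (1.0 - b) * y
    post_var = (1.0 - b) * sigma2  # conditional on mu and tau2
    return post_mean, post_var

if __name__ == "__main__":
    effects = [2.0, -1.5, 0.3, 0.1]
    variances = [1.0, 1.0, 0.1, 0.1]
    means, post_vars = shrink_means(effects, variances)
    print(means, post_vars)
```

In this toy setting, the two noisy measurements are pulled further towards the shared mean than the two precise ones, which is the sense in which information from the whole collection informs each individual decision.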

Technical Summary

The huge body of evidence from life sciences research provides ample justification for the widespread potential of mass spectrometry-based discovery proteomics and metabolomics in systems medicine. Current bioinformatics methodology is reliant on the characteristics of controlled systems biology studies, but there is a step-change in complexity, heterogeneity and confounding within typical disease models and clinical samples.

The pervasive approach to label-free liquid chromatography-mass spectrometry analysis is an ad-hoc series of data-reduction steps. Once the rich raw data is gone, multiple confounding effects are coalesced into single measurements with complex error distributions, non-linearity and large numbers of outliers and missing values. Data integration into rigorous statistical models for pre-clinical and clinical trial design, prediction models and latent endotype models is reliant on appropriate statistical handling, which is currently problematic. There is an opportunity to supersede the current ad-hoc quantitation pipeline with holistic Bayesian modelling able to cope with the demands of translational and clinical data:

(1) We will develop variational Bayes latent variable selection methodology for unmixing, extraction and quantification under uncertainty, with strength borrowed via consensus signal morphology across runs. This will deliver statistically valid measurements with approximate posteriors.

(2) We will develop a holistic modelling platform for systems medicine with Bayesian mixed-effects models, integrating Bayesian protein and metabolite inference on the extracted posteriors, Bayesian transcriptomics methodology and experimental metadata.

(3) We will acquire and disseminate first-in-class translational and clinical proteomics/metabolomics model experiments faithfully reflecting the challenges, plus apply the methodology to production experiments. This will answer how to appropriately validate for systems medicine applications.
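
As a rough illustration of the kind of Bayesian mixed-effects model referred to in aim (2), one generic form for the log-intensity of a single protein is sketched below. All symbols are assumptions introduced here for illustration and are not the project's actual specification: $y_{ijk}$ is the log-intensity of feature $k$ of protein $j$ in sample $i$, $x_i$ is a disease indicator, $\beta_j$ is the log fold change of interest, $u_i$ is a random sample effect and $v_{jk}$ is a random feature effect.

```latex
\begin{align}
  y_{ijk} &= \mu_j + \beta_j x_i + u_i + v_{jk} + \varepsilon_{ijk}, \\
  u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
  v_{jk} \sim \mathcal{N}(0, \sigma_v^2), \qquad
  \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma_\varepsilon^2).
\end{align}
```

In a Bayesian treatment, priors would also be placed on $\mu_j$, $\beta_j$ and the variance components, and inference would report a posterior distribution for $\beta_j$ rather than a single point estimate.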

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics and metabolomics vendors. The developed feature extraction methodology will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in translational research in industry and academia, as well as a wider audience of users and uses.

The proposed discovery engine could be seen to be in direct competition with informatics products from software vendors and instrument manufacturers. In fact we perceive a symbiotic relationship with user-centric discovery packages such as Progenesis (Nonlinear Dynamics, Newcastle, UK). The majority of development time for these packages is spent in data import/export, graphical interface, workflow, and results presentation. They also expose interfaces to popular search engines for feature identification including Mascot (Matrix Science, London, UK), which is an essential source of complementary information for the proposed platform. We will therefore investigate the commercialisation of our methodology, for which there is no equivalent in their systems. This could potentially occur in the short to medium term. Nevertheless, we are committed to providing our methodology freely for academic use.

There is considerable potential in this application for providing indirect benefits across the spectrum of public health, treatment and quality of life for patients and the general public in the UK and abroad. Our stated aim is to enable reliable and precise statistical evidence from large-scale cross-omics experiments, such as those using the systems medicine approach, which is becoming increasingly indispensable. This improvement will disseminate to the National Health Service and public through a reduction in the considerable costs of the systems medicine approach, and through increased sensitivity for making medical discoveries and characterising those discoveries. Since the system is designed to elucidate underlying effects across experimental and omics techniques, it is also possible that tertiary disease processes could be identified which otherwise would go unnoticed. This has the potential to deliver further novel breakthroughs and characterise potentially interfering disease processes, therefore avoiding subsequent misallocation of resources.

Funded Value:

£276,589

Funded Period:

Feb 15 - Jun 16

Funder:

MRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

MR/L011093/2

Principal Investigator:

Andrew Dowsey

Research Subject:

Mathematical sciences (30%)

Omic sciences & technologies (42%)

Tools, technologies & methods (24%)

Research Topic:

Bioinformatics (24%)

Metabolomics / Metabonomics (12%)

Proteomics (18%)

Statistics & Appl. Probability (30%)

Transcriptomics (12%)

Organisations

People	ORCID iD
Andrew Dowsey (Principal Investigator)	http://orcid.org/0000-0002-7404-9128

Publications

Aitken JF (2017) Quantitative data describing the impact of the flavonol rutin on in-vivo blood-glucose and fluid-intake profiles, and survival of human-amylin transgenic mice. in Data in brief

Aitken JF (2017) Rutin suppresses human-amylin/hIAPP misfolding and oligomer formation in-vitro, and ameliorates diabetes and its impacts in human-amylin/hIAPP transgenic mice. in Biochemical and biophysical research communications

Deutsch E (2018) Expanding the Use of Spectral Libraries in Proteomics in Journal of Proteome Research

Dowsey A (2017) The need for statistical contributions to bioinformatics at scale, with illustration to mass spectrometry in Statistical Modelling

Freeman OJ (2016) Metabolic Dysfunction Is Restricted to the Sciatic Nerve in Experimental Diabetic Neuropathy. in Diabetes

Liao H (2016) Proteome Informatics

Philbert SA (2021) Widespread severe cerebral elevations of haptoglobin and haemopexin in sporadic Alzheimer's disease: Evidence for a pervasive microvasculopathy. in Biochemical and biophysical research communications

Xu J (2016) Elevation of brain glucose and polyol-pathway intermediates with accompanying brain-copper deficiency in patients with Alzheimer's disease: metabolic basis for dementia. in Scientific reports

Xu J (2019) Regional protein expression in human Alzheimer's brain correlates with disease severity. in Communications biology

Related Projects

Project Reference	Relationship	Related To	Start	End	Award Value
MR/L011093/1			19/08/2014	18/01/2015	£331,010
MR/L011093/2	Transfer	MR/L011093/1	01/02/2015	29/06/2016	£276,589
MR/L011093/3	Transfer	MR/L011093/2	31/07/2016	30/11/2017	£120,914

Further Funding
Collaboration
Software and Technical Products
Engagement Activities


Description	Bilateral BBSRC/NSF Responsive Mode Grant
Amount	£1,043,055 (GBP)
Funding ID	BB/M024954/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	09/2015
End	09/2018


Description	Enabling advanced analytics for all users of the proteomics facility
Amount	£4,172 (GBP)
Organisation	University of Bristol
Sector	Academic/University
Country	United Kingdom
Start	01/2018
End	07/2018


Description	Methodology Research Panel
Amount	£594,485 (GBP)
Funding ID	MR/N028457/1
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	03/2017
End	03/2020


Description	University of Liverpool EPSRC Impact Accelerator
Amount	£21,844 (GBP)
Organisation	University of Liverpool
Sector	Academic/University
Country	United Kingdom
Start	03/2016
End	06/2016


Description	University of Manchester BBSRC Impact Accelerator
Amount	£20,926 (GBP)
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	01/2016
End	06/2016


Description	Prof Jeffrey Morris
Organisation	University of Texas
Department	M. D. Anderson Cancer Center
Country	United States
Sector	Academic/University
PI Contribution	Translation of Prof Morris' Wavelet Functional Mixed Model methodology to the proteomics LC-MS (Liquid Chromatography - Mass Spectrometry) field.
Collaborator Contribution	Access to Prof Morris' expertise and unpublished methodology in order to create our novel differential analysis workflow for raw LC-MS data.
Impact	Two publications [Liao et al, IEEE ISBI 2014; Dowsey et al Proteomics, 2010, 4226-57] plus a successful submission to the September 2014 BBSRC Bilateral NSF/BIO-BBSRC responsive mode call [BB/M024954/1].
Start Year	2009


Title	BayesProt v1.0
Description	BayesTraq: a Bayesian mixed-effects model for protein quantification in iTRAQ clinical proteomics
Type Of Technology	Software
Year Produced	2015
Open Source License?	Yes
Impact	Significantly improves the sensitivity and robustness of differential analysis in iTRAQ proteomics
URL	http://www.biospi.org/research/ms/bayestraq/


Title	mzMLb
Description	A
Type Of Technology	Software
Year Produced	2018
Impact	Proteomics Standards Initiative standards compatible binary mass spectrometry data format for efficient read/write speed and storage space requirements
URL	https://github.com/biospi/mzmlb


Title	seaMass
Description	The seaMass software is our open source dissemination route for the LC-MS (Liquid Chromatography - Mass Spectrometry) analysis algorithms developed by our group, including signal restoration and visualisation.
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	The software has only recently been released, but there is strong interest in its incorporation into the ProteoSuite consortium's BBSRC BBR funded user-centric proteomics software (http://www.proteosuite.org/?q=aboutus).
URL	http://www.biospi.org/research/ms/seamass/


Description	Conference Talk: Building a new computational biomarker discovery platform for clinical SWATH with image analysis techniques
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Invited talk sponsored by vendor SCIEX at 'ProteoMMX 4.0', Chester, 5th-7th April 2016. Part of activities around collaboration with SCIEX and the Manchester MRC funded Stoller Biomarker Discovery Centre.
Year(s) Of Engagement Activity	2016


Description	Conference Talk: Robust protein-level differential analysis for iTRAQ through a Bayesian mixed-effects model
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Abstract selected for plenary talk at 'British Society for Proteome Research Meeting', Reading, 20th-22nd Jul 2015. Led to collaboration with MRC Centre for Drug Safety Science at University of Liverpool on analysis of their clinical data.
Year(s) Of Engagement Activity	2015
URL	http://www.bspr.org/event/bspr-meeting-2015