Bilateral NSF/BIO-BBSRC: Bayesian Quantitative Proteomics

Lead Research Organisation: University of Bristol

Department Name: Clinical Veterinary Science

Abstract

Research in the life sciences is being driven forward by cutting-edge techniques for studying the molecules acting in cells. The functional molecules in cells are proteins - the expression, activity and interactions of particular proteins in any given cell define its structure and what it is capable of doing. As one example, we are often interested in studying which proteins are present in diseased cells, and in what quantities, compared with normal cells, since the identity of the proteins may help us understand the disease process and guide the search for new drug targets. The technologies used to study proteins on a large scale are collectively called proteomics. The main method used in proteomics is mass spectrometry (MS), which can measure the molecular weight and abundance of molecules.

The majority of proteomics workflows perform a protein digestion step prior to MS. Digestion breaks all the proteins up into short chains, called peptides. This step has become common because peptides, owing to their lower mass, are easier to analyse by MS and produce data that are simpler to interpret. The set of peptides is then identified and often quantified across different conditions (e.g. disease versus healthy cells). We often know that a peptide was derived from a specific parent protein, and so we can use the identity and quantification of that peptide as a proxy measure for the behaviour of the protein across our samples of interest; such workflows are called "bottom-up". One issue with the digestion of proteins is that some proteins break down more quickly than others - for some proteins/peptides digestion is incomplete, producing unreliable quantification data, an effect that at present is not fully understood or compensated for by the analysis software.

While bottom-up studies dominate the field, they currently have several significant drawbacks. Proteins tend to exist in cells in multiple different, related forms, which have been called proteoforms. These arise through the gene encoding the protein being processed in different ways (alternative splicing), or through the addition of functionally important chemical groups, called post-translational modifications (PTMs). Since only one or a few peptides differ between proteoforms, individual proteoforms are far more challenging (or impossible with current techniques) to quantify accurately. Current practice in proteomics generally ignores this problem - losing vast amounts of data about the true nature of the molecules in the system. There are MS techniques for studying intact proteins and their proteoforms (called top-down methods), but at present these do not function in high-throughput mode, and thus are typically used for targeted studies on a small number of proteins.

In order to make a step change in the quantification and discovery of proteoforms, we will develop an integrated suite of analysis techniques using a powerful statistical technique called Bayesian modelling. With Bayesian approaches, the problem at hand is simulated many thousands of times probabilistically. By interpreting the range of different conclusions reached, we can get an idea of how certain we are about the results, which is crucial given the subtle nature of the evidence within the MS datasets. In essence, our computational techniques will deliver the same quality of data about individual proteoforms (including novel discovery of PTMs) as top-down techniques, but based on bottom-up (peptide-focussed) workflows - thus, for the first time, enabling highly accurate proteoform-level discovery and quantification in high-throughput mode. To ensure rapid and wide uptake of our new methods, we will integrate our advancements into a freely available software suite we are developing, ProteoSuite.
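As an illustration of the idea of drawing many probabilistic simulations and reading certainty from their spread, the following minimal Python sketch may help. It is not the project's model: it assumes a deliberately simple setting in which each peptide log2 ratio is the protein's true log2 fold change plus Normal noise, with a Normal prior centred on zero. The function names, the noise and prior standard deviations, and the four example ratios are all hypothetical.

```python
import random
import statistics


def posterior_samples(log_ratios, noise_sd=0.5, prior_sd=2.0, n_draws=10000, seed=0):
    """Draw samples of a protein's log2 fold change given peptide log2 ratios.

    Assumed toy model: ratio_i ~ Normal(mu, noise_sd), mu ~ Normal(0, prior_sd).
    This conjugate pair has a closed-form Normal posterior, sampled directly here.
    """
    n = len(log_ratios)
    precision = 1.0 / prior_sd**2 + n / noise_sd**2
    post_mean = (sum(log_ratios) / noise_sd**2) / precision
    post_sd = precision**-0.5
    rng = random.Random(seed)
    return [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]


def summarise(samples):
    """Summarise the draws: mean, central 95% interval, probability of up-regulation."""
    ordered = sorted(samples)
    low = ordered[int(0.025 * len(ordered))]
    high = ordered[int(0.975 * len(ordered)) - 1]
    prob_up = sum(s > 0 for s in samples) / len(samples)
    return {"mean": statistics.fmean(samples), "interval95": (low, high), "prob_up": prob_up}


# Hypothetical peptide-level log2 ratios (disease versus healthy) for one protein.
peptide_log_ratios = [0.9, 1.2, 0.7, 1.1]
summary = summarise(posterior_samples(peptide_log_ratios))
print(summary)
```

The width of the reported interval, rather than the single mean value, is what conveys how much the data support the conclusion; with fewer or noisier peptides the interval widens.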

Technical Summary

Tandem Mass Spectrometry (MS/MS) coupled to Liquid Chromatography (LC) is the primary technique used in proteomics. The most common approach is LC separation of tryptic fragments derived from a proteome digestion, followed by tandem MS of the peptides. This entire workflow is conceived as a series of discrete steps, some chemical, some instrumental, some informatics and some statistical. Existing software concentrates on subcomponents of the workflow, and comprises a series of deterministic, self-contained steps. No methods propagate uncertainty from one step to the next, nor do they borrow strength either within or across steps - this starkly contrasts with recent advancements in processing RNA-seq data.

We propose to translate the whole protein quantification pipeline into a rigorous statistical framework underpinned by Bayesian methodology. The new framework will enable us to integrate evidence across all experimentally acquired datasets, and allow us to borrow strength from unused structure within a proteomics workflow, including digestion dynamics. Our proposed pipeline consists of three synergistic developments: (1) Utilisation of all unidentified (peptide) features, as well as identified features, to infer the most likely mixture of proteins present in a sample; (2) Differential quantification of complex mixtures of known proteoforms; (3) Discovery of unknown proteoforms and all modifications (PTMs) carried by their quantification signatures. These advancements will elicit a step-change in quantification sensitivity and interpretation at the proteoform level for the first time. We will disseminate this end-to-end analysis solution within the user-centric, standards-compliant ProteoSuite package, and as a Galaxy workflow for high-throughput pipelines.
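To make development (2) more concrete, the sketch below shows the underlying deconvolution problem in its simplest form: peptides shared between proteoforms carry the summed signal of every proteoform that contains them, so proteoform abundances have to be recovered jointly. This is a point-estimate stand-in using projected gradient descent for non-negative least squares, not the Bayesian mixture model proposed here, and the membership matrix, intensities and function name are hypothetical.

```python
def deconvolve(membership, peptide_intensity, n_iter=2000, step=0.1):
    """Estimate non-negative proteoform abundances from peptide intensities.

    membership[i][j] is 1 if peptide i occurs in proteoform j, else 0.
    Minimises the squared error between predicted and observed peptide
    intensities by gradient descent, clipping abundances at zero.
    """
    n_pep = len(membership)
    n_form = len(membership[0])
    abundance = [1.0] * n_form
    for _ in range(n_iter):
        # Residual of each peptide under the current abundance estimate.
        residual = [
            sum(membership[i][j] * abundance[j] for j in range(n_form)) - peptide_intensity[i]
            for i in range(n_pep)
        ]
        for j in range(n_form):
            grad = sum(membership[i][j] * residual[i] for i in range(n_pep))
            abundance[j] = max(0.0, abundance[j] - step * grad)
    return abundance


# Hypothetical example: peptide 0 is shared, peptide 1 is unique to
# proteoform 0, and peptide 2 is unique to proteoform 1.
membership = [
    [1, 1],
    [1, 0],
    [0, 1],
]
peptide_intensity = [40.0, 30.0, 10.0]
forms = deconvolve(membership, peptide_intensity)
print(forms)
```

In this noise-free toy case the shared peptide adds no ambiguity because each proteoform also has a unique peptide; the harder cases motivating the proposal are those where unique peptides are few, noisy or missing, which is where a probabilistic treatment of uncertainty matters.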

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics vendors. The proposed Bayesian Quantitative Proteomics platform will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses amongst systems biology researchers.

There is potential for direct impact through the licensing of some or all of the software tools we develop, an arrangement we are already working towards for other packages with Waters Inc.

There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our aim is to establish a powerful platform for differential proteoform analysis and discovery, enabling a wealth of new investigations in the biological sciences and translational medicine. Owing to the success and further substantial promise of the systems biology approach, the BBSRC, UK research councils and industry have invested greatly in it. The potential improvements yielded by our workflow will therefore have a clear dissemination route to the public through reduced resources, costs and overheads required for discoveries realised with systems approaches in environmental, biological and biomedical science, and the characterisation of those discoveries.

The PDRAs employed on this grant will benefit significantly from exposure to the wealth of proteome informatics expertise we will bring together, particularly since the PDRAs will be encouraged to play a significant role in public dissemination. All staff will benefit through being engaged within an international, cutting-edge interdisciplinary project.

Funded Value:

£238,698

Funded Period:

Jul 16 - Mar 19

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/M024954/2

Principal Investigator:

Andrew Dowsey

Research Subject:

Mathematical sciences (33%)

Omic sciences & technologies (44%)

Tools, technologies & methods (22%)

Research Topic:

Bioinformatics (22%)

Proteomics (44%)

Statistics & Appl. Probability (33%)

Organisations

People	ORCID iD
Andrew Dowsey (Principal Investigator)	http://orcid.org/0000-0002-7404-9128
Magnus Rattray (Co-Investigator)	http://orcid.org/0000-0001-8196-5565

Publications

Liao H (2016) Proteome Informatics

Dowsey A (2017) The need for statistical contributions to bioinformatics at scale, with illustration to mass spectrometry in Statistical Modelling

Deutsch EW (2018) Expanding the Use of Spectral Libraries in Proteomics. in Journal of proteome research

Xu J (2019) Regional protein expression in human Alzheimer's brain correlates with disease severity in Communications Biology

Kassab S (2019) Cognitive dysfunction in diabetic rats is prevented by pyridoxamine treatment. A multidisciplinary investigation. in Molecular metabolism

Bhamber R (2020) mzMLb: a future-proof raw mass spectrometry data format based on standards-compliant mzML and optimized for speed and storage requirements

Bhamber RS (2021) mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. in Journal of proteome research

Scholefield M (2021) Severe and Regionally Widespread Increases in Tissue Urea in the Human Brain Represent a Novel Finding of Pathogenic Potential in Parkinson's Disease Dementia. in Frontiers in molecular neuroscience

Philbert SA (2021) Widespread severe cerebral elevations of haptoglobin and haemopexin in sporadic Alzheimer's disease: Evidence for a pervasive microvasculopathy. in Biochemical and biophysical research communications

Sang C (2022) Coenzyme A-Dependent Tricarboxylic Acid Cycle Enzymes Are Decreased in Alzheimer's Disease Consistent With Cerebral Pantothenate Deficiency in Frontiers in Aging Neuroscience

Key Findings
Impact Summary
Further Funding
Collaboration
Software and Technical Products


Description	We have developed BayesProt v2.0, a Bayesian mixture modelling tool to deconvolute quantifications of different protein isoforms. We have submitted an applied manuscript on the use of this tool, and a technical manuscript is in preparation. We have discovered significant limitations in false discovery rate control in proteomics and have developed a method that, for the first time, assesses the uncertainty in the false discovery rate, for which we are preparing a manuscript. These techniques seeded the BBSRC European Partnering Award BB/R021430/1. We are also preparing to finish our group sparse regression technique for deconvoluting mass spectrometry features, which would feed into BayesProt.
Exploitation Route	We will engage with the University of Bristol's agent for intellectual property commercialisation.
Sectors	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment,Healthcare,Pharmaceuticals and Medical Biotechnology
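
For readers unfamiliar with why a false discovery rate carries uncertainty of its own, the following Python sketch may be useful. It is not the method developed under this grant, which this record does not describe. It assumes the common target-decoy estimate (decoy matches passing a score threshold divided by target matches passing it) and uses a plain bootstrap to show how much that estimate can vary. The function names, scores and counts are hypothetical.

```python
import random


def decoy_fdr(psms, threshold):
    """Target-decoy FDR estimate: decoys passing divided by targets passing."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def bootstrap_fdr_interval(psms, threshold, n_boot=2000, seed=0):
    """Central 95% bootstrap interval for the target-decoy FDR estimate."""
    rng = random.Random(seed)
    estimates = sorted(
        decoy_fdr(rng.choices(psms, k=len(psms)), threshold) for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


# Hypothetical peptide-spectrum matches as (score, is_decoy) pairs:
# 40 targets and 2 decoys score highly, the rest score poorly.
psms = [(5.0, False)] * 40 + [(5.0, True)] * 2 + [(1.0, False)] * 10 + [(1.0, True)] * 48
point = decoy_fdr(psms, threshold=3.0)
low, high = bootstrap_fdr_interval(psms, threshold=3.0)
print(point, low, high)
```

Because the estimate here rests on only two passing decoys, the bootstrap interval is wide relative to the point estimate, which is the kind of instability a single reported FDR value hides.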


Description	Technology developed in this grant is being adapted for detecting pathogens in the environment, funded by Dstl.
Sector	Aerospace, Defence and Marine


Description	Belgium: Taming the application of statistics in proteomics and metabolomics
Amount	£10,323 (GBP)
Funding ID	BB/R021430/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	07/2018
End	06/2019


Description	Enabling advanced analytics for all users of the proteomics facility
Amount	£4,172 (GBP)
Organisation	University of Bristol
Sector	Academic/University
Country	United Kingdom
Start	01/2018
End	07/2018


Description	Identification of hazardous chemical and biological contamination on surfaces using spectral signatures
Amount	£44,891 (GBP)
Organisation	Defence Science & Technology Laboratory (DSTL)
Sector	Public
Country	United Kingdom
Start	10/2021
End	02/2022


Description	Methodology Research Panel
Amount	£594,485 (GBP)
Funding ID	MR/N028457/1
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	04/2017
End	03/2020


Description	University of Liverpool EPSRC Impact Accelerator
Amount	£21,844 (GBP)
Organisation	University of Liverpool
Sector	Academic/University
Country	United Kingdom
Start	04/2016
End	06/2016


Description	Prof Jeffrey Morris
Organisation	University of Texas
Department	M. D. Anderson Cancer Center
Country	United States
Sector	Academic/University
PI Contribution	Translation of Prof Morris' Wavelet Functional Mixed Model methodology to the proteomics LC-MS (Liquid Chromatography - Mass Spectrometry) field.
Collaborator Contribution	Access to Prof Morris' expertise and unpublished methodology in order to create our novel differential analysis workflow for raw LC-MS data.
Impact	Two publications [Liao et al, IEEE ISBI 2014; Dowsey et al Proteomics, 2010, 4226-57] plus a successful submission to the September 2014 BBSRC Bilateral NSF/BIO-BBSRC responsive mode call [BB/M024954/1].
Start Year	2009


Description	Proteomics Standards Initiative
Organisation	Human Proteome Organization
Department	Proteomics Standards Initiative
Country	United States
Sector	Charity/Non Profit
PI Contribution	Expertise on signal compression and data representation for application to the PSI's mzML standard interchange format for proteomics
Collaborator Contribution	Implementation and validation of new signal compression approaches for mzML
Impact	One publication [Teleman et al, Molecular and Cellular Proteomics, 1537-42, 2014], with open source implementation in ProteoWizard (http://proteowizard.sourceforge.net/)
Start Year	2013


Title	BayesProt v1.0
Description	BayesTraq: a Bayesian mixed-effects model for protein quantification in iTraq clinical proteomics
Type Of Technology	Software
Year Produced	2015
Open Source License?	Yes
Impact	Significantly improves the sensitivity and robustness of differential analysis in iTraq proteomics
URL	http://www.biospi.org/research/ms/bayestraq/


Title	mzMLb
Description	A
Type Of Technology	Software
Year Produced	2018
Impact	Proteomics Standards Initiative standards compatible binary mass spectrometry data format for efficient read/write speed and storage space requirements
URL	https://github.com/biospi/mzmlb
