Bilateral NSF/BIO-BBSRC: Bayesian Quantitative Proteomics

Lead Research Organisation: University of Bristol
Department Name: Clinical Veterinary Science

Abstract

Research in the life sciences is being driven forward by cutting-edge techniques for studying the molecules acting in cells. The functional molecules in cells are proteins - the expression, activity and interactions of particular proteins in any given cell define its structure and what it is capable of doing. As one example, we are often interested in studying what proteins are present in diseased cells and in what quantities, compared with normal cells, since the identity of the proteins may help us understand the disease process and aid the search for new drug targets. The technologies used to study proteins on a large scale are collectively called proteomics. The main method used in proteomics is mass spectrometry (MS), which can measure the molecular weight and abundance of molecules.

The majority of proteomics workflows perform a step of protein digestion prior to MS. Digestion breaks all the proteins up into small chains, called peptides. This step has become common because peptides, having lower mass, are easier to analyse by MS and produce data that are simpler to interpret. The set of peptides is then identified and often quantified across different conditions (e.g. disease versus healthy cells). We often know that a peptide was derived from a specific parent protein, and so we can use the identity and quantification of that peptide as a proxy measure for the behaviour of the protein across our samples of interest; such workflows are called "bottom-up". One issue with the digestion of proteins is that some proteins break down more quickly than others - for some proteins/peptides digestion is incomplete, producing unreliable quantification data, which at present is not fully understood or compensated for by the analysis software.

While bottom-up studies dominate the field, they currently have several significant drawbacks. Proteins tend to exist in cells in multiple related forms, called proteoforms, which arise when the gene encoding the protein is processed in different ways (alternative splicing) or when functionally important chemical groups are added (post-translational modifications, PTMs). Since only one or a few peptides differ between proteoforms, individual proteoforms are far more challenging (or impossible with current techniques) to quantify accurately. Current practice in proteomics generally ignores this problem - losing vast amounts of data about the true nature of the molecules in the system. There are MS techniques for studying intact proteins and their proteoforms (called top-down methods), but at present these do not function in high-throughput mode, and thus are typically used for targeted studies on a small number of proteins.

In order to make a step change in the quantification and discovery of proteoforms, we will develop an integrated suite of analysis techniques using a powerful statistical technique called Bayesian modelling. With Bayesian approaches, the problem at hand is simulated many thousands of times probabilistically. By interpreting the range of different conclusions reached, we can get an idea of how certain we are about the results, which is crucial given the subtle nature of the evidence within the MS datasets. In essence, our computational techniques will deliver the same quality of data about individual proteoforms (including novel discovery of PTMs) as top-down techniques, but based on bottom-up (peptide-focussed) workflows - thus, for the first time, enabling highly accurate proteoform-level discovery and quantification in high-throughput mode. To ensure rapid and wide uptake of our new methods, we will integrate our advancements into a freely available software suite we are developing, ProteoSuite.
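
As a rough illustration of how repeated probabilistic simulation yields a measure of certainty, the sketch below draws many samples from a posterior distribution and summarises their spread. It is a minimal toy example, not the project's method: the peptide log-ratios are made-up numbers, the noise level is assumed known, the prior is assumed flat, and the function names `posterior_draws` and `summarise` are hypothetical.

```python
import random
import statistics


def posterior_draws(log_ratios, noise_sd, n_draws=10000, seed=0):
    """Sample the posterior of the mean log-ratio for one protein.

    Assumption: normal measurement noise with known noise_sd and a flat
    prior, so the posterior is Normal(sample mean, noise_sd / sqrt(n)).
    """
    rng = random.Random(seed)
    n = len(log_ratios)
    centre = statistics.fmean(log_ratios)
    spread = noise_sd / n ** 0.5
    return [rng.gauss(centre, spread) for _ in range(n_draws)]


def summarise(draws, level=0.95):
    """Report the posterior mean, a central credible interval and the
    fraction of draws in which the protein is more abundant in disease."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(tail * len(ordered))]
    hi = ordered[int((1.0 - tail) * len(ordered)) - 1]
    prob_up = sum(d > 0 for d in draws) / len(draws)
    return {"mean": statistics.fmean(draws), "interval": (lo, hi), "prob_up": prob_up}


# Hypothetical log2(disease / healthy) ratios for five peptides of one protein.
peptide_log_ratios = [0.8, 1.1, 0.4, 0.9, 1.3]
summary = summarise(posterior_draws(peptide_log_ratios, noise_sd=0.5))
print(summary)
```

The width of the interval, rather than a single point estimate, is what conveys how strongly the peptide evidence supports a change in abundance.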

Technical Summary

Tandem Mass Spectrometry (MS/MS) coupled to Liquid Chromatography (LC) is the primary technique used in proteomics. The most common approach is LC separation of tryptic fragments derived from a proteome digestion, followed by tandem MS of the peptides. This entire workflow is conceived as a series of discrete steps, some chemical, some instrumental, some informatics and some statistical. Existing software concentrates on subcomponents of the workflow, and comprises a series of deterministic, self-contained steps. No methods propagate uncertainty from one step to the next, nor do they borrow strength either within or across steps - this starkly contrasts with recent advancements in processing RNA-seq data.

We propose to translate the whole protein quantification pipeline into a rigorous statistical framework underpinned by Bayesian methodology. The new framework will enable us to integrate evidence across all experimentally acquired datasets, and allow us to borrow strength from unused structure within a proteomics workflow, including digestion dynamics. Our proposed pipeline consists of three synergistic developments: (1) Utilisation of all unidentified (peptide) features, as well as identified features, to infer the most likely mixture of proteins present in a sample; (2) Differential quantification of complex mixtures of known proteoforms; (3) Discovery of unknown proteoforms and all modifications (PTMs) carried by their quantification signatures. These advancements will elicit a step-change in quantification sensitivity and interpretation at the proteoform level for the first time. We will disseminate this end-to-end analysis solution within the user-centric, standards-compliant ProteoSuite package, and as a Galaxy workflow for high-throughput pipelines.
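
To make development (2) more concrete, the sketch below shows one simple way that abundances of two known proteoforms could be inferred from peptides that are partly shared between them, with uncertainty attached. It is an illustrative toy under stated assumptions and not the pipeline proposed here: the peptide intensities are invented, the noise is assumed normal with a known standard deviation, the prior is flat on non-negative abundances, and a basic random-walk Metropolis sampler stands in for whatever inference scheme the real framework uses. The names `log_posterior` and `metropolis` are hypothetical.

```python
import math
import random
import statistics


def log_posterior(abund, peptides, noise_sd):
    """Unnormalised log posterior: flat prior on abundances >= 0 and
    normal noise around the summed abundance of the parent proteoforms."""
    if min(abund) < 0:
        return -math.inf
    total = 0.0
    for membership, intensity in peptides:
        expected = sum(m * a for m, a in zip(membership, abund))
        total -= 0.5 * ((intensity - expected) / noise_sd) ** 2
    return total


def metropolis(peptides, noise_sd, n_iter=20000, burn_in=2000, step=0.2, seed=1):
    """Random-walk Metropolis sampler over the two proteoform abundances."""
    rng = random.Random(seed)
    current = [5.0, 5.0]  # arbitrary starting point
    current_lp = log_posterior(current, peptides, noise_sd)
    draws = []
    for i in range(n_iter):
        proposal = [a + rng.gauss(0.0, step) for a in current]
        proposal_lp = log_posterior(proposal, peptides, noise_sd)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            draws.append(tuple(current))
    return draws


# Hypothetical data: (membership in proteoforms A and B, observed intensity).
peptides = [
    ((1, 1), 10.2), ((1, 1), 9.7), ((1, 1), 10.1),  # shared peptides
    ((1, 0), 3.1),                                   # unique to proteoform A
    ((0, 1), 6.8),                                   # unique to proteoform B
]
draws = metropolis(peptides, noise_sd=0.3)
mean_a = statistics.fmean(d[0] for d in draws)
mean_b = statistics.fmean(d[1] for d in draws)
sd_a = statistics.stdev(d[0] for d in draws)
print(f"A: {mean_a:.2f} +/- {sd_a:.2f}, B: {mean_b:.2f}")
```

Because the shared peptides only constrain the sum of the two abundances, the posterior spread for each proteoform is driven largely by the few peptides that distinguish them, which is the difficulty described in the abstract above.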

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics vendors. The proposed Bayesian Quantitative Proteomics platform will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses amongst systems biology researchers.

There is potential for direct impact through licensing some or all of the software tools we develop, as we are already working towards for other packages with Waters Inc.

There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our aim is to establish a powerful platform for differential proteoform analysis and discovery enabling a wealth of new investigations in the biological sciences and translational medicine. Due to its success and further substantial promise, the BBSRC, UK research councils and industry have invested greatly in the systems biology approach. The potential improvements yielded by our workflow will therefore have a clear dissemination route to the public through reduced resources, costs and overheads required for discoveries realised with systems approaches in environmental, biological and biomedical science, and the characterisation of those discoveries.

The PDRAs employed on this grant will benefit significantly from exposure to the wealth of proteome informatics expertise we will bring together, particularly since the PDRAs will be encouraged to play a significant role in public dissemination. All staff will benefit through being engaged within an international, cutting-edge interdisciplinary project.

Publications

Liao H (2016) Proteome Informatics

Deutsch EW (2018) Expanding the Use of Spectral Libraries in Proteomics. in Journal of proteome research

 
Description We have developed BayesProt v2.0, a Bayesian mixture modelling tool to deconvolute quantifications of different protein isoforms. We have submitted an applied manuscript on the use of this tool, and a technical manuscript is in preparation. We have discovered significant limitations in false discovery rate control in proteomics and have developed a method that, for the first time, assesses the uncertainty in the false discovery rate; a manuscript on this is in preparation. These techniques seeded the BBSRC European Partnering Award BB/R021430/1. We are also preparing to finish our group sparse regression technique for deconvoluting mass spectrometry features, which would feed into BayesProt.
Exploitation Route We will engage with the University of Bristol's agent for intellectual property commercialisation.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Pharmaceuticals and Medical Biotechnology

 
Description Technology developed in this grant is being extended to detect pathogens in the environment, funded by Dstl.
Sector Aerospace, Defence and Marine
 
Description Belgium: Taming the application of statistics in proteomics and metabolomics
Amount £10,323 (GBP)
Funding ID BB/R021430/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 07/2018 
End 06/2019
 
Description Enabling advanced analytics for all users of the proteomics facility
Amount £4,172 (GBP)
Organisation University of Bristol 
Sector Academic/University
Country United Kingdom
Start 01/2018 
End 07/2018
 
Description Identification of hazardous chemical and biological contamination on surfaces using spectral signatures
Amount £44,891 (GBP)
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 10/2021 
End 02/2022
 
Description Methodology Research Panel
Amount £594,485 (GBP)
Funding ID MR/N028457/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 04/2017 
End 03/2020
 
Description University of Liverpool EPSRC Impact Accelerator
Amount £21,844 (GBP)
Organisation University of Liverpool 
Sector Academic/University
Country United Kingdom
Start 04/2016 
End 06/2016
 
Description Prof Jeffrey Morris 
Organisation University of Texas
Department M. D. Anderson Cancer Center
Country United States 
Sector Academic/University 
PI Contribution Translation of Prof Morris' Wavelet Functional Mixed Model methodology to the proteomics LC-MS (Liquid Chromatography - Mass Spectrometry) field.
Collaborator Contribution Access to Prof Morris' expertise and unpublished methodology in order to create our novel differential analysis workflow for raw LC-MS data.
Impact Two publications [Liao et al, IEEE ISBI 2014; Dowsey et al Proteomics, 2010, 4226-57] plus a successful submission to the September 2014 BBSRC Bilateral NSF/BIO-BBSRC responsive mode call [BB/M024954/1].
Start Year 2009
 
Description Proteomics Standards Initiative 
Organisation Human Proteome Organization
Department Proteomics Standards Initiative
Country United States 
Sector Charity/Non Profit 
PI Contribution Expertise on signal compression and data representation for application to the PSI's mzML standard interchange format for proteomics
Collaborator Contribution Implementation and validation of new signal compression approaches for mzML
Impact One publication [Teleman et al, Molecular and Cellular Proteomics, 1537-42, 2014], with open source implementation in ProteoWizard (http://proteowizard.sourceforge.net/)
Start Year 2013
 
Title BayesProt v1.0 
Description BayesTraq: a Bayesian mixed-effects model for protein quantification in iTraq clinical proteomics 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Significantly improves the sensitivity and robustness of differential analysis in iTraq proteomics 
URL http://www.biospi.org/research/ms/bayestraq/
 
Title mzMLb 
Description
Type Of Technology Software 
Year Produced 2018 
Impact Proteomics Standards Initiative standards compatible binary mass spectrometry data format for efficient read/write speed and storage space requirements 
URL https://github.com/biospi/mzmlb