Bilateral NSF/BIO-BBSRC: Bayesian Quantitative Proteomics
Lead Research Organisation:
University of Liverpool
Department Name: Electrical & Electronic Engineering
Abstract
Research in the life sciences is being driven forward by cutting-edge techniques for studying the molecules acting in cells. The functional molecules in cells are proteins - the expression, activity and interactions of particular proteins in any given cell define its structure and what it is capable of doing. As one example, we are often interested in studying what proteins are present in diseased cells and in what quantities, compared with normal cells, since the identity of the proteins may help us understand the disease process and guide the search for new drug targets. The technologies used to study proteins on a large scale are collectively called proteomics. The main method used in proteomics is mass spectrometry (MS), which can measure the molecular weight and abundance of molecules.
The majority of proteomics workflows perform a step of protein digestion prior to MS. The result of digestion is that all the proteins become broken up into small chains, called peptides. This step has become common, because peptides are easier to analyse by MS, due to their lower mass, producing simpler data to interpret. The set of peptides is then identified and often quantified across different conditions (e.g. disease versus healthy cells). We often know that a peptide was derived from a specific parent protein, and so we can use the identity and quantification of that peptide as a proxy measure for the behaviour of the protein across our samples of interest, and as such these workflows are called "bottom-up". One issue with the digestion of proteins is that some proteins break down quicker than others - for some proteins/peptides digestion is incomplete, producing unreliable quantification data, which at present is not fully understood or compensated for by the analysis software.
While bottom-up studies dominate the field, they currently have several significant drawbacks. Proteins tend to exist in multiple different, related forms in cells, called proteoforms, which arise either through the gene encoding the protein being processed in different ways (alternative splicing) or through the addition of functionally important chemical groups, called post-translational modifications (PTMs). Since only one or a few peptides differ between proteoforms, proteoforms are far more challenging (or impossible with current techniques) to quantify accurately. Current practice in proteomics generally ignores this problem - losing vast amounts of data about the true nature of the molecules in the system. There are MS techniques for studying intact proteins and their proteoforms (called top-down methods), but at present these do not function in high-throughput mode, and thus are typically used for targeted studies on a small number of proteins.
In order to make a step change in the quantification and discovery of proteoforms, we will develop an integrated suite of analysis techniques using a powerful statistical technique called Bayesian modelling. With Bayesian approaches, the problem at hand is simulated many thousands of times probabilistically. By interpreting the range of different conclusions reached, we can get an idea of how certain we are about the results, which is crucial given the subtle nature of the evidence within the MS datasets. In essence, our computational techniques will deliver the same quality of data about individual proteoforms (including novel discovery of PTMs) as top-down techniques, but based off bottom-up (peptide-focussed) workflows - thus, for the first time, enabling highly accurate proteoform-level discovery and quantification in high-throughput mode. To ensure rapid and wide uptake of our new methods, we will integrate our advancements into a freely available software suite we are developing, ProteoSuite.
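As a rough illustration of the idea of simulating a problem many thousands of times and reading certainty off the spread of outcomes, the sketch below draws posterior samples of the difference in mean log2 intensity of one peptide between disease and control samples. It is a minimal sketch, not the project's method: the function names, the data values, and the simplifying assumptions (normal likelihood, flat prior, sample standard deviation treated as known) are all hypothetical.

```python
import random
import statistics


def posterior_log_fold_change(control, disease, n_draws=10000, seed=0):
    """Draw posterior samples of the mean log2 difference (disease - control).

    Assumption: normal likelihood, flat prior on each group mean, and the
    sample standard deviation treated as known, so each group mean has an
    approximately normal posterior with sd = stdev / sqrt(n).
    """
    rng = random.Random(seed)

    def draw_mean(values):
        mean = statistics.mean(values)
        sem = statistics.stdev(values) / len(values) ** 0.5
        return rng.gauss(mean, sem)

    return [draw_mean(disease) - draw_mean(control) for _ in range(n_draws)]


def summarise(draws, level=0.95):
    """Median, central credible interval and P(difference > 0) of the draws."""
    ordered = sorted(draws)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return {
        "median": statistics.median(ordered),
        "interval": (ordered[lo_idx], ordered[hi_idx]),
        "prob_up": sum(d > 0 for d in ordered) / len(ordered),
    }


# Hypothetical log2 intensities of one peptide in four samples per condition.
control = [20.1, 19.8, 20.3, 20.0]
disease = [21.0, 21.4, 20.9, 21.2]
summary = summarise(posterior_log_fold_change(control, disease))
print(summary)
```

The width of the reported interval, rather than a single point estimate, is what conveys how certain the conclusion is; a full Bayesian model would also place priors on the variances and share information across peptides.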
Technical Summary
Tandem Mass Spectrometry (MS/MS) coupled to Liquid Chromatography (LC) is the primary technique used in proteomics. The most common approach is LC separation of tryptic fragments derived from a proteome digestion, followed by tandem MS of the peptides. This entire workflow is conceived as a series of discrete steps, some chemical, some instrumental, some informatics and some statistical. Existing software concentrates on subcomponents of the workflow, and comprises a series of deterministic, self-contained steps. No methods propagate uncertainty from one step to the next, nor do they borrow strength either within or across steps - this starkly contrasts with recent advancements in processing RNA-seq data.
We propose to translate the whole protein quantification pipeline into a rigorous statistical framework underpinned by Bayesian methodology. The new framework will enable us to integrate evidence across all experimentally acquired datasets, and allow us to borrow strength from unused structure within a proteomics workflow, including digestion dynamics. Our proposed pipeline consists of three synergistic developments: (1) Utilisation of all unidentified (peptide) features, as well as identified features, to infer the most likely mixture of proteins present in a sample; (2) Differential quantification of complex mixtures of known proteoforms; (3) Discovery of unknown proteoforms and all modifications (PTMs) carried by their quantification signatures. These advancements will, for the first time, elicit a step-change in quantification sensitivity and interpretation at the proteoform level. We will disseminate this end-to-end analysis solution within the user-centric, standards-compliant ProteoSuite package, and as a Galaxy workflow for high-throughput pipelines.
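To make the notion of inferring a mixture of proteoforms from peptide-level evidence more concrete, the following toy sketch computes a grid-approximated posterior over the fraction of a protein's abundance carried by one of two proteoforms. It is only an illustration under stated assumptions, not the proposed pipeline: the function name, the 0/1 peptide membership vectors, the known total abundance, the Gaussian noise model, the uniform prior and all data values are hypothetical.

```python
import math


def proteoform_posterior(observed, in_a, in_b, total, noise_sd, grid_size=101):
    """Grid posterior over w, the fraction of `total` held by proteoform A.

    Assumptions: each peptide's expected intensity is the summed abundance of
    the proteoforms that contain it, noise is Gaussian with known sd, and the
    prior on w is uniform on [0, 1].
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_post = []
    for w in grid:
        log_lik = 0.0
        for y, a, b in zip(observed, in_a, in_b):
            expected = total * (a * w + b * (1 - w))
            log_lik += -0.5 * ((y - expected) / noise_sd) ** 2
        log_post.append(log_lik)
    peak = max(log_post)  # subtract the peak before exponentiating for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return grid, [wt / norm for wt in weights]


# Hypothetical peptides: two shared, one unique to A, one unique to B.
in_a = [1, 1, 1, 0]
in_b = [1, 1, 0, 1]
observed = [101.0, 98.0, 31.0, 68.0]

grid, probs = proteoform_posterior(observed, in_a, in_b, total=100.0, noise_sd=5.0)
posterior_mean = sum(w * p for w, p in zip(grid, probs))
print(round(posterior_mean, 3))
```

In this toy setting only the two proteoform-specific peptides carry information about the split, which mirrors the point that few peptides distinguish proteoforms and that the resulting uncertainty needs to be quantified rather than ignored.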
Planned Impact
As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry and associated proteomics vendors. The proposed Bayesian Quantitative Proteomics platform will increase the amount of usable data extracted from LC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses amongst systems biology researchers.
There is potential for direct impact through the licensing of some or all of the software tools we develop, as we are already working towards with Waters Inc. for other packages.
There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our aim is to establish a powerful platform for differential proteoform analysis and discovery enabling a wealth of new investigations in the biological sciences and translational medicine. Due to its success and further substantial promise, the BBSRC, UK research councils and industry have invested greatly in the systems biology approach. The potential improvements yielded by our workflow will therefore have a clear dissemination route to the public through reduced resources, costs and overheads required for discoveries realised with systems approaches in environmental, biological and biomedical science, and the characterisation of those discoveries.
The PDRAs employed on this grant will benefit significantly from exposure to the wealth of proteome informatics expertise we will bring together, particularly since the PDRAs will be encouraged to play a significant role in public dissemination. All staff will benefit through being engaged within an international, cutting-edge interdisciplinary project.
Publications
Sang C
(2022)
Coenzyme A-Dependent Tricarboxylic Acid Cycle Enzymes Are Decreased in Alzheimer's Disease Consistent With Cerebral Pantothenate Deficiency.
in Frontiers in aging neuroscience
Kassab S
(2019)
Cognitive dysfunction in diabetic rats is prevented by pyridoxamine treatment. A multidisciplinary investigation
in Molecular Metabolism
Deutsch EW
(2018)
Expanding the Use of Spectral Libraries in Proteomics.
in Journal of proteome research
Lu J
(2023)
Five Inhibitory Receptors Display Distinct Vesicular Distributions in Murine T Cells
in Cells
Lu J
(2023)
Five inhibitory receptors display distinct vesicular distributions in T cells.
in bioRxiv : the preprint server for biology
Patassini S
(2015)
Identification of elevated urea as a severe, ubiquitous metabolic defect in the brain of patients with Huntington's disease.
in Biochemical and biophysical research communications
Mcharg S
(2022)
Mast cell infiltration of the choroid and protease release are early events in age-related macular degeneration associated with genetic risk at both chromosomes 1q32 and 10q26.
in Proceedings of the National Academy of Sciences of the United States of America
Bhamber RS
(2021)
mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements.
in Journal of proteome research
Liao H
(2016)
Proteome Informatics
Description | We have developed BayesProt v2.0, a Bayesian mixture modelling tool to deconvolute quantifications of different protein isoforms. We have submitted an applied manuscript on the use of this tool, and a technical manuscript is in preparation. We have discovered significant limitations in false discovery rate control in proteomics and have developed a method that, for the first time, assesses the uncertainty in the false discovery rate, for which we are preparing a manuscript. These techniques seeded the BBSRC European Partnering Award BB/R021430/1. We are also preparing to finish our group sparse regression technique for deconvoluting mass spectrometry features that would feed into BayesProt. |
Exploitation Route | We will engage with the University of Bristol's agent for intellectual property commercialisation. |
Sectors | Agriculture Food and Drink Digital/Communication/Information Technologies (including Software) Environment Healthcare Pharmaceuticals and Medical Biotechnology |
Description | Technology developed in this grant is being further developed for detection of pathogens in the environment, funded by Dstl |
Sector | Aerospace, Defence and Marine |
Description | Belgium: Taming the application of statistics in proteomics and metabolomics |
Amount | £10,323 (GBP) |
Funding ID | BB/R021430/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 06/2018 |
End | 06/2019 |
Description | Enabling advanced analytics for all users of the proteomics facility |
Amount | £4,172 (GBP) |
Organisation | University of Bristol |
Sector | Academic/University |
Country | United Kingdom |
Start | 01/2018 |
End | 07/2018 |
Description | Identification of hazardous chemical and biological contamination on surfaces using spectral signatures |
Amount | £44,891 (GBP) |
Organisation | Defence Science & Technology Laboratory (DSTL) |
Sector | Public |
Country | United Kingdom |
Start | 09/2021 |
End | 02/2022 |
Description | Methodology Research Panel |
Amount | £594,485 (GBP) |
Funding ID | MR/N028457/1 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2017 |
End | 03/2020 |
Description | University of Liverpool EPSRC Impact Accelerator |
Amount | £21,844 (GBP) |
Organisation | University of Liverpool |
Sector | Academic/University |
Country | United Kingdom |
Start | 03/2016 |
End | 06/2016 |
Description | Prof Jeffrey Morris |
Organisation | University of Texas |
Department | M. D. Anderson Cancer Center |
Country | United States |
Sector | Academic/University |
PI Contribution | Translation of Prof Morris' Wavelet Functional Mixed Model methodology to the proteomics LC-MS (Liquid Chromatography - Mass Spectrometry) field. |
Collaborator Contribution | Access to Prof Morris' expertise and unpublished methodology in order to create our novel differential analysis workflow for raw LC-MS data. |
Impact | Two publications [Liao et al, IEEE ISBI 2014; Dowsey et al Proteomics, 2010, 4226-57] plus a successful submission to the September 2014 BBSRC Bilateral NSF/BIO-BBSRC responsive mode call [BB/M024954/1]. |
Start Year | 2009 |
Description | Proteomics Standards Initiative |
Organisation | Human Proteome Organization |
Department | Proteomics Standards Initiative |
Country | United States |
Sector | Charity/Non Profit |
PI Contribution | Expertise on signal compression and data representation for application to the PSI's mzML standard interchange format for proteomics |
Collaborator Contribution | Implementation and validation of new signal compression approaches for mzML |
Impact | One publication [Teleman et al, Molecular and Cellular Proteomics, 1537-42, 2014], with open source implementation in ProteoWizard (http://proteowizard.sourceforge.net/) |
Start Year | 2013 |
Title | BayesProt v1.0 |
Description | BayesTraq: a Bayesian mixed-effects model for protein quantification in iTraq clinical proteomics |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | Significantly improves the sensitivity and robustness of differential analysis in iTraq proteomics |
URL | http://www.biospi.org/research/ms/bayestraq/ |
Title | mzMLb |
Description | A |
Type Of Technology | Software |
Year Produced | 2018 |
Impact | Proteomics Standards Initiative standards compatible binary mass spectrometry data format for efficient read/write speed and storage space requirements |
URL | https://github.com/biospi/mzmlb |