ProteoFormer - a software toolkit for top-down proteomics

Lead Research Organisation: University of Manchester
Department Name: Medical and Human Sciences

Technical Summary

Most proteomics workflows employ a "bottom-up" method, in which the identity and intensity of peptide ions measured by mass spectrometry (MS) are used as a proxy measure for the parent protein, since signals from analysis of intact proteins were too complex to interpret on older, low-resolution instruments. Recent improvements in instrument resolution mean that the isotope pattern of high-molecular-weight species can now be resolved, enabling direct analysis of intact proteins: "top-down" proteomics. A major advantage of top-down analysis is the ability to characterise closely related protein forms (proteoforms) that arise from paralogs, alternative splicing or different post-translational modifications (PTMs).

MS data from intact proteins can be hugely complex to interpret, and the majority of software tools have been developed for peptide data. In this project, we will develop software to tackle two related problems. Firstly, ionisation of intact proteins generates ions in high charge states, often with overlapping signals. We have previously developed the seaMass software, which identifies, separates and de-charges overlapping isotope patterns in complex LC-MS peptide data, and does so robustly in the presence of ion-counting noise. We will adapt seaMass to high-charge-state intact protein data, producing de-charged MS1 and MS2 spectra. Secondly, the traditional sequence database search used in peptide identification (in which modifications must be specified in advance) is not suitable for intact protein data. We are developing the Proteoformer software, which converts a fragment spectrum into a set of peptide regular expressions (pep regexes) that are used to search a protein database index. Once the correct database protein has been identified, the PTMs and processing events that have occurred can be identified de novo, without being specified in advance. Both software packages will be provided with a graphical interface through our Proteosuite software, and will work with open data standards.
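The pep regex idea can be illustrated with a minimal sketch. It is not the Proteoformer implementation: the function names, the tolerance, the toy database and the rule for ending a run are all assumptions made for illustration. Gaps between consecutive de-charged fragment masses that match a residue mass become regex pieces, isobaric residues such as I and L become a character class, and a gap that matches no residue simply ends the run, so nothing has to be said in advance about modifications.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def gap_to_pattern(gap, tol):
    """Regex piece for the residues whose mass matches `gap`, or None."""
    hits = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol)
    if not hits:
        return None
    return hits[0] if len(hits) == 1 else "[" + "".join(hits) + "]"


def spectrum_to_regexes(masses, tol=0.01, min_len=2):
    """Turn runs of residue-sized mass gaps into regex strings (hypothetical helper)."""
    regexes, run = [], []
    ordered = sorted(masses)
    for lighter, heavier in zip(ordered, ordered[1:]):
        piece = gap_to_pattern(heavier - lighter, tol)
        if piece is not None:
            run.append(piece)
            continue
        # An unexplained gap (for example a modified residue) ends the run.
        if len(run) >= min_len:
            regexes.append("".join(run))
        run = []
    if len(run) >= min_len:
        regexes.append("".join(run))
    return regexes


def search_database(regexes, database):
    """Map protein name to the regexes that match it in either reading direction."""
    hits = {}
    for name, seq in database.items():
        matched = [r for r in regexes
                   if re.search(r, seq) or re.search(r, seq[::-1])]
        if matched:
            hits[name] = matched
    return hits


# Toy data: a ladder whose gaps spell T, A, Y and I/L after an arbitrary offset.
fragment_masses = [500.0]
for residue in "TAYI":
    fragment_masses.append(fragment_masses[-1] + RESIDUE_MASS[residue])

toy_database = {"protein_1": "MKTAYIAKQR", "protein_2": "MGSSHHHHHH"}
regexes = spectrum_to_regexes(fragment_masses)
print(regexes)
print(search_database(regexes, toy_database))
```

In this toy case the ladder yields the single pattern TAY[IL], which matches only protein_1. A real search would run against an indexed database rather than a linear scan, and would have to cope with noise peaks and missing fragments.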

Planned Impact

Our developments will have impacts through the following routes:

- The development of seaMass-TD and Proteoformer will make it more straightforward for top-down analysis to be performed on a much wider range of instruments, producing high-quality and reliable results. This will open up this important technology for studying proteins in their native state in the cell, for basic and applied research across numerous domains.

- Our software has the potential to increase sales of mass spectrometers capable of performing top-down analysis. In particular, we are working locally with Waters to develop software compatible with their data, since current software has typically been designed for mass spectrometers produced by Thermo.

- We will explore routes for commercialisation of Proteoformer and seaMass-TD, as discussed in the Pathways to Impact document.

- We will work with international consortia concerned with data sharing and standardisation in proteomics, namely the Proteomics Standards Initiative (PSI), ProteomeXchange and EBI's PRIDE database, to ensure that current standards can appropriately handle top-down data and that researchers can submit data to the leading public repositories for community re-analysis.

Publications

 
Description We have developed a new top-down proteomics deconvolution strategy based on seaMass, called seaMass-TD. Like Waters MaxEnt and Thermo ReSpect, seaMass-TD is a true mathematical deconvolution of the raw data: it models the prior belief as a set of constraints (mass relationships between charge states, peak FWHM/shape) and uses an iterative method to solve the inverse problem of finding the most probable deconvolution that fits the model. seaMass-TD differs from these tools in three respects: (a) By learning the range of protein isotope distributions generated from UniProt, relaxed to allow small deviations caused by unknown proteoforms, we enable deconvolution of overlapping proteoforms whilst also probabilistically outputting a range of monoisotopic peak candidates for each; (b) A sparse regression approach is used, based on the assumption that there are far fewer proteins in the dataset than datapoints. Improbable proteins are thus eliminated after only a few iterations, so seaMass-TD is orders of magnitude faster than MaxEnt, allowing it to process at high mass resolution like Xtract/MS-Deconv but, for the first time, on data that are not isotopically resolved; (c) Through implementation of group sparse regression, we allow complete flexibility in the charge state distribution of each proteoform, inferring both the isotope and charge state distribution for each.
Exploitation Route In order to develop the technique further to enable deconvolution of high mass proteins and complex LC-MS data, the method has been used as pilot work for a BBSRC responsive mode application.
Sectors Agriculture, Food and Drink; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology

URL http://www.biospi.org/research/ms/seamass-td/
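The "mass relationships between charge states" used as a constraint in the description above follow from the standard relation m/z = (M + z * m_p) / z for a protein of neutral mass M carrying z protons of mass m_p. The sketch below is only an illustration of that relation, not seaMass-TD code; the function names and the example mass are assumptions. It builds a charge-state ladder and recovers the charge and neutral mass from two adjacent peaks.

```python
PROTON_MASS = 1.007276466  # mass of a proton in daltons


def mz_of(mass, charge):
    """m/z of a species of neutral mass `mass` carrying `charge` protons."""
    return (mass + charge * PROTON_MASS) / charge


def charge_ladder(mass, charges):
    """m/z positions of one protein across a range of charge states."""
    return [mz_of(mass, z) for z in charges]


def infer_from_adjacent(mz_low, mz_high):
    """Charge and neutral mass from two adjacent peaks of one ladder.

    Assumes mz_low carries charge z + 1 and mz_high carries charge z, so
    (mz_low - m_p) / (mz_high - mz_low) = z.
    """
    z = round((mz_low - PROTON_MASS) / (mz_high - mz_low))
    mass = z * (mz_high - PROTON_MASS)
    return z, mass


# Hypothetical protein of neutral mass 16950 Da observed at charges 20 and 19.
peak_20, peak_19 = charge_ladder(16950.0, [20, 19])
charge, neutral_mass = infer_from_adjacent(peak_20, peak_19)
print(charge, round(neutral_mass, 3))
```

Pairwise inference of this kind breaks down when ladders from several proteoforms overlap, which is the situation the regression-based deconvolution described above is meant to handle.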
 
Description Technology developed in this grant is being adapted to characterise impurities in oligonucleotide drugs, funded by AstraZeneca.
Sector Pharmaceuticals and Medical Biotechnology
 
Description Novel semi-supervised Bayesian learning to rapidly screen new oligonucleotide drugs for impurities
Amount £104,203 (GBP)
Organisation AstraZeneca 
Sector Private
Country United Kingdom
Start 09/2021 
End 09/2025
 
Title seaMass-TD 
Description seaMass-TD is the first method for deconvoluting top-down proteomics spectra that infers high-resolution output from isotopically unresolved input. It extends the Peptide Simplex deisotoping model to whole proteins, and extends the seaMass sparse inference model to group sparsity in order to link together the full charge state ladders of these whole-protein isotope distributions.
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Currently an alpha quality version demonstrating its power; impact ongoing. 
URL http://www.biospi.org/research/ms/seamass-td/
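Group sparsity can be illustrated with the standard group soft-thresholding step used in group-lasso style solvers. This is a generic textbook operator, shown on the assumption that each group holds the coefficients of one candidate proteoform across its charge states; it is not taken from the seaMass-TD code, and the names and numbers are hypothetical. Each group is shrunk as a whole, so a weakly supported proteoform is removed at every charge state at once, while a well supported one keeps the relative shape of its charge state distribution.

```python
import math


def group_soft_threshold(coeffs, groups, threshold):
    """Shrink each group of coefficients towards zero by its Euclidean norm.

    `groups` is a list of index lists; a group whose norm is at or below
    `threshold` is set to zero entirely.
    """
    out = list(coeffs)
    for indices in groups:
        norm = math.sqrt(sum(coeffs[i] ** 2 for i in indices))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for i in indices:
            out[i] = scale * coeffs[i]
    return out


# Two hypothetical proteoforms, each with coefficients at two charge states.
coefficients = [3.0, 4.0, 0.3, 0.4]
ladders = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(coefficients, ladders, threshold=1.0)
print(shrunk)
```

Here the first group (norm 5) is scaled by 0.8 and survives, while the second group (norm 0.5) falls below the threshold and is zeroed at both charge states.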