ProteoFormer - a software toolkit for top-down proteomics

Lead Research Organisation: University of Liverpool
Department Name: Institute of Integrative Biology

Abstract

Research in the life sciences is being driven forward by cutting-edge techniques for studying the molecules acting in cells. The functional molecules in cells are proteins - the expression, activity and interactions of particular proteins in any given cell define its structure and what it is capable of doing. For example, we are often interested in studying which proteins are present in diseased cells and in what quantities, compared with normal cells, since the identity of the proteins may help us understand the overall disease process and guide the search for new drug targets. The set of technologies used to study proteins on a large scale is called proteomics, as the complete set of proteins present in a given cell or sample of interest is described as the "proteome". The main method used in proteomics is mass spectrometry (MS). MS is essentially a technique for determining the molecular weight of molecules, and it can also provide information about the abundance of a given molecule in the sample. MS has been available for many decades, but recent years have seen huge strides in the ability to perform high-throughput workflows, studying tens of thousands of molecules in any given sample, and great advances in instrument resolution, such that molecules of almost identical mass can be differentiated.

The majority of proteomics workflows perform a step of protein digestion prior to MS. Digestion breaks all the proteins in the sample, in a predictable manner, into small chains called peptides. This step has become common because peptides have a lower mass and are therefore easier to analyse by MS, producing data that are simpler to interpret. The set of peptides is then identified and often quantified across different conditions (e.g. disease versus healthy cells). We often know in advance that a peptide was derived from a specific parent protein, and so we can use the identity and quantification of that peptide as a proxy measure for the behaviour of the protein across our samples of interest. Such workflows are called "bottom-up". However, while bottom-up studies dominate the field, they have a significant drawback. Proteins tend to exist in multiple different, related forms in cells, which have been called proteoforms. Proteins may become activated or deactivated by the addition or removal of one or more chemical groups, called post-translational modifications (PTMs). Many important diseases, including cancer and neurodegeneration, have been shown to be associated with dysfunction of PTMs. Bottom-up studies have a severely limited ability to study these important proteoforms, since the small pieces of a protein (the peptides) cannot tell us which proteoform we are looking at, only that one of the many possible proteoforms is present.
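As an illustration of what "broken up in a predictable manner" can mean, the sketch below performs an in-silico digestion. The text above does not name an enzyme; the rule used here (cleave after K or R unless the next residue is P) is an assumed, commonly used trypsin-like rule, and the function name `digest` and the example sequence are hypothetical.

```python
import re


def digest(protein, missed_cleavages=0):
    """Split a protein sequence into peptides.

    Assumed rule (not taken from the project): cleave after K or R,
    except when the next residue is P.
    """
    # Zero-width split points: preceded by K/R and not followed by P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Optionally join up to `missed_cleavages` neighbouring pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# Hypothetical sequence, for illustration only.
print(digest("MAKRPSTRGLK"))  # ['MAK', 'RPSTR', 'GLK']
```

Each peptide is analysed separately after digestion, so a modification seen on one peptide cannot be linked to a modification seen on another peptide from the same protein. This is the proteoform ambiguity described above.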

Our groups are pioneering techniques to study intact proteins by MS, in so-called top-down studies. Recent developments in MS instruments enable much larger molecules to be studied effectively, opening up the possibility of studying each proteoform in the state in which it is present in the cell. However, MS produces very complex and often overlapping signals from closely related proteins, which are difficult to interpret. Software for identification and quantification of peptide signals from MS data has been developed for over twenty years, but research on interpreting protein-level data is quite limited. We specialise in developing software for proteomics and we are devising a new algorithm that will, firstly, simplify the complex MS data signal coming from the instrument and, secondly, confidently identify the proteins present, including all PTMs on the different proteoforms we detect. The software will help to advance the field of top-down proteomics, making this a more broadly accessible method, and thus improving our ability to study proteins in cells for a wide range of applications.

Technical Summary

Most proteomics workflows employ a "bottom-up" method, in which the identity and intensity of peptide ions measured by mass spectrometry (MS) are used as a proxy measure for the parent protein, since signals from analysis of intact proteins were too complex to interpret on older, low-resolution instruments. Recent improvements mean that the isotope pattern of large molecular weight species can now be resolved, enabling direct analysis of intact proteins - "top-down" proteomics. A major advantage of top-down analysis is the ability to characterise closely related proteins (proteoforms), resulting from paralogs, alternative splicing or different post-translational modifications (PTMs) on proteins.
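To show why resolving the isotope pattern matters, the sketch below infers the charge of an ion from the spacing of its isotope peaks and converts the measured m/z to a neutral mass. It assumes positive-mode, protonated ions and an isotope spacing of about 1.00335 Da (the 13C/12C mass difference). The function names and the example values are hypothetical, and this is not the seaMass algorithm.

```python
PROTON_MASS = 1.007276      # Da
ISOTOPE_SPACING = 1.00335   # Da, approximate 13C - 12C mass difference


def charge_from_spacing(mz_peaks):
    """Estimate charge z from the mean gap between adjacent isotope peaks.

    Adjacent isotopes of an ion with charge z are ~1.00335 / z apart in m/z.
    """
    if len(mz_peaks) < 2:
        raise ValueError("need at least two isotope peaks")
    peaks = sorted(mz_peaks)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return round(ISOTOPE_SPACING / mean_gap)


def neutral_mass(mz, charge):
    """De-charge a protonated ion: M = z * (m/z - proton mass)."""
    return charge * (mz - PROTON_MASS)


# Hypothetical isotope peaks of a 20+ ion (spacing ~0.05 m/z units).
peaks = [848.007276 + i * ISOTOPE_SPACING / 20 for i in range(4)]
z = charge_from_spacing(peaks)
print(z, neutral_mass(peaks[0], z))
```

At a charge of 20 the isotope peaks are only about 0.05 m/z units apart, which is why older, low-resolution instruments could not separate them.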

MS data from intact proteins can be hugely complex to interpret, and the majority of software tools have been developed for peptide data. In this project, we will develop software to tackle two related problems. Firstly, ionisation of intact proteins generates high-charge state ions, often with overlapping signals. We have previously developed the seaMass software, which is able to identify, separate and de-charge overlapping isotope patterns in complex LC-MS peptide data robustly under ion counting noise. We will adapt seaMass for detecting high-charge state, intact protein data, producing de-charged MS1 and MS2 spectra. Secondly, the traditional sequence database search used in peptide identification (in which modifications must be specified in advance) is not suitable for intact protein data. We are developing the Proteoformer software, which converts a fragment spectrum into a set of peptide regular expressions (pep regexes) that are used to search a protein database index. Following identification of the correct database protein, we can dynamically identify the PTMs and processing events that have occurred de novo. Both software packages will be provided with a graphical interface through our Proteosuite software, and will work with open data standards.
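The idea of searching a protein database with a regular expression derived from a fragment spectrum can be sketched as follows. This is an assumed, simplified illustration and not the Proteoformer implementation: it takes a ladder of de-charged fragment masses, maps each mass difference to the amino-acid residues it could represent, and writes any unexplained gap as a wildcard. The function names, the tolerance and the toy database are hypothetical.

```python
import re

# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for(delta, tol=0.01):
    """Residues whose mass matches a fragment mass difference within tol."""
    return sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)


def pep_regex(fragment_masses, tol=0.01):
    """Build a regex from consecutive differences in a fragment ladder."""
    ladder = sorted(fragment_masses)
    parts = []
    for low, high in zip(ladder, ladder[1:]):
        candidates = residues_for(high - low, tol)
        if len(candidates) == 1:
            parts.append(candidates[0])
        elif candidates:
            parts.append("[" + "".join(candidates) + "]")  # e.g. I/L are isobaric
        else:
            parts.append(".+")  # unexplained gap: several or modified residues
    return "".join(parts)


def search(database, pattern):
    """Return names of proteins whose sequence contains the pattern."""
    rx = re.compile(pattern)
    return [name for name, seq in database.items() if rx.search(seq)]


# Hypothetical ladder: an arbitrary start mass followed by A, L, K.
ladder = [500.0]
for aa in "ALK":
    ladder.append(ladder[-1] + RESIDUE_MASS[aa])

pattern = pep_regex(ladder)
print(pattern)  # A[IL]K
print(search({"P1": "MSTALKGD", "P2": "MSTAQKGD"}, pattern))  # ['P1']
```

Because the regex is built from the spectrum rather than from a list of modifications specified in advance, a gap that no single residue explains is kept as a wildcard instead of causing the match to fail. The mass of that gap could then be examined after the candidate protein is found.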

Planned Impact

Our developments will have impacts through the following routes:

- The development of seaMass-TD and Proteoformer will make it more straightforward for top-down analysis to be performed on a much wider range of instruments, producing high-quality and reliable results. This will open up this important technology for studying proteins in their native state in the cell, for basic and applied research across numerous domains.

- Our software has the potential to increase sales of mass spectrometers capable of performing top-down analysis. In particular, we are working locally with Waters to develop software compatible with their data, since current software has typically been designed for mass spectrometers produced by Thermo.

- We will explore routes for commercialisation of Proteoformer and seaMass-TD, as discussed in the Pathways to Impact document.

- We will work with international consortia aimed at data sharing and standardisation in proteomics - the Proteomics Standards Initiative (PSI), ProteomeXchange and EBI's PRIDE database - to ensure that current standards can appropriately handle top-down data, and that researchers can submit data to the leading public repositories for community re-analysis.

Publications

Collins A (2018) phpMs: A PHP-Based Mass Spectrometry Utilities Library. in Journal of proteome research

Vizcaíno JA (2017) The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics. in Molecular & cellular proteomics : MCP

 
Description We have developed new software for understanding what proteins exist in nature, through mass spectrometry (MS) analysis of intact proteins. The software is able to deconvolute the complex signals generated by MS from proteins (with multiple overlapping signals from different molecules), and then search those patterns against a sequence database to identify the proteins and their modifications.
Exploitation Route The software needs further development before it is robust enough to be used in other labs.
Sectors Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology

URL http://pgb.liv.ac.uk/proteoformer-td/
 
Title Proteoformer-TD 
Description New software for top-down proteomics - interpreting fragment spectra from intact proteins analysed by mass spectrometry. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact These will be published in due course. 
URL http://pgb.liv.ac.uk/proteoformer-td/
 
Title phpMs-TopDown 
Description The software performs processing of "top-down" mass spectrometry data on proteins, taking input from a peak detection algorithm to perform: feature detection along the time axis, association of fragment ions to precursors, database search and modification site localisation. 
Type Of Technology Webtool/Application 
Year Produced 2019 
Open Source License? Yes  
Impact There is currently a lack of open source software for top-down proteomics, so the software has the potential for impact in labs that are employing this up-and-coming method. 
URL https://github.com/PGB-LIV/phpMs-TopDown