Database on demand - creating customized sequence databases for efficient protein identification

Lead Research Organisation: European Bioinformatics Institute
Department Name: Proteomics Services Team

Abstract

The field of proteomics attempts to identify and characterize the protein complement of cells or tissues. The most popular analytical technique for achieving these goals is mass spectrometry. The mass spectra obtained from these instruments are usually identified by comparing them with spectra predicted from the protein sequences in sequence databases. Sophisticated computer algorithms, such as the MASCOT search engine (http://www.matrixscience.com), have been developed to automate this task and to accommodate the large amounts of data generated by this approach. Interestingly, only a minor fraction of the acquired spectra can be assigned to known proteins. Since all proteins can potentially undergo changes during their lifetime in (or outside) the cell, the search algorithms are built to take certain changes into account. Mass differences caused by the addition of so-called posttranslational modifications (e.g. phosphorylation) can usually be taken into account as an option by these search engines. Unfortunately, proteolytic cleavage, another biologically relevant form of protein processing, is not taken into consideration. The biological relevance of cleavage events is exemplified by the fact that many proteins found in the circulatory system (e.g. in plasma or serum) show signs of proteolytic degradation. The cleavage patterns that these proteins or their fragments carry are hypothesized to have great significance as biomarkers for abnormal processes in the body at large. The ability to identify such degradation products reliably and quickly can thus play an important role in the early detection of disease. Another point that is often overlooked by search engines concerns common contaminants found in samples, ranging from the pig trypsin that is used to digest the samples to mycobacterial or viral infection of the cell lines under study.
Finally, the occurrence of sequence variations (through splice variants or single amino acid polymorphisms) can further confound the identification process. Research into the frequency and importance of these minor sequence variations is therefore not straightforward. It is clear from the above that the spectra that elude identification for these reasons are of great biological interest. It is also clear that the tools for reliably identifying such spectra are available, provided that they can match the spectrum against the correct sequence. To furnish these algorithms with an enhanced set of sequences against which to match the acquired mass spectra, simple pre-processing steps applied to the original sequence database suffice. In this project, we propose to develop a tool that will allow the user to obtain such a customized, enriched sequence database. On a user-friendly web form, the user will be able to specify the pre-processing steps (or a combination of them) that should be applied to the sequence database. The software will subsequently generate the corresponding database and format it in such a way that it can readily be used in search engines such as MASCOT. Upon notification that the process has completed, the user will simply need to download the generated database by following a web link and upload it into MASCOT (or any other search engine). The software will be a highly modular layer between the user and the sequence database that will enable pre-processing steps suited to current-day proteomics analyses, and it will be easily extensible to meet future requirements from the community. This simple step of enriching the sequence database against which mass spectra are matched will enhance the identification efficiency of current research projects (as well as enabling the re-analysis of previous efforts) and has the potential to unlock novel and highly interesting biological findings.
As such, the tool holds great promise as a means to raise the value-for-money of proteomics experiments, while at the same time expanding the reach of the field.
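
As an illustration of one such pre-processing step, the following minimal Python sketch expands a database entry with annotated single amino acid substitutions, so that the variant sequences become visible to a search engine. It is not the project's implementation: the function names, the (header, sequence) entry layout, and the header suffix convention are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical entry layout: (header, sequence). Not taken from the project.
Entry = Tuple[str, str]
# Hypothetical variant layout: (1-based position, reference residue, variant residue).
Variant = Tuple[int, str, str]


def apply_substitution(seq: str, position: int, ref: str, alt: str) -> str:
    """Return seq with the residue at the 1-based position changed from ref to alt."""
    if not 1 <= position <= len(seq):
        raise ValueError(f"position {position} is outside a sequence of length {len(seq)}")
    if seq[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}, found {seq[position - 1]}")
    return seq[:position - 1] + alt + seq[position:]


def expand_with_variants(entry: Entry, variants: Iterable[Variant]) -> List[Entry]:
    """Return the original entry followed by one extra entry per annotated variant."""
    header, seq = entry
    expanded = [entry]
    for position, ref, alt in variants:
        # Assumed naming convention: append e.g. "_T3S" to the original header.
        variant_header = f"{header}_{ref}{position}{alt}"
        expanded.append((variant_header, apply_substitution(seq, position, ref, alt)))
    return expanded
```

For example, `expand_with_variants(("P1", "MKTAYIAK"), [(3, "T", "S")])` keeps the original entry and adds a second entry whose sequence carries serine at position 3. Keeping each step as a small function over entries is one way to make steps freely combinable, in the spirit of the modular layer described above.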

Technical Summary

Mass spectrometry driven research faces the challenge of identifying and characterizing the protein complement of cells and tissues. Present-day high-throughput experiments can generate tens of thousands of fragmentation spectra in one run, with only a fraction of these spectra leading to identifications. A substantial number of the unidentified spectra probably remain unidentified because of the unexpected nature of their precursors. Search engines try to accommodate this variability by allowing mass differences resulting from protein modifications. Major contributions to protein variability that are consistently overlooked, however, are splice variants, single amino acid polymorphisms, and proteolytic cleavage or sequential degradation. Proteins in highly interesting body fluids such as serum or plasma are known to undergo such cleavages, as do proteins involved in biological processes such as apoptosis. These problems can be solved by pre-processing sequence databases to expose the information contained in them for interpretation by existing search engines. Firstly, existing annotation in sequence databases can be used to predict variants, polymorphisms and sequence truncation events. Known events will thus become detectable. Secondly, by systematically pre-processing all proteins in a sequence database according to a suitable hypothesis, rich yet targeted search bases can be developed for the identification of mass spectra. The aim of the project is to develop a web-based interface where the user can select the appropriate pre-processing steps. The tool will then generate the desired database in FASTA format and deposit it on a file server. The user will be notified of job completion and will receive a hyperlink that allows direct downloading of the customized database. This database can easily be uploaded to a search engine to serve as an enriched search base. As such, the tool will maximize the value obtained from proteomics studies, and enhance their reach.
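
The second approach, systematic pre-processing under a hypothesis, can be sketched as follows, assuming the hypothesis is sequential N-terminal degradation. This is a minimal Python illustration rather than the tool itself: the function names, the minimum length cut-off, and the header convention of appending the residue range are assumptions chosen for the example.

```python
from typing import Iterable, Iterator, List, Tuple


def n_terminal_truncations(seq: str, min_length: int = 6) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start, subsequence) for the intact sequence and each
    N-terminally shortened form that keeps at least min_length residues."""
    for start in range(1, len(seq) - min_length + 2):
        yield start, seq[start - 1:]


def truncated_entries(accession: str, seq: str, min_length: int = 6) -> List[Tuple[str, str]]:
    """Build (header, sequence) pairs; the 'accession|start-end' header is an assumed convention."""
    return [
        (f"{accession}|{start}-{len(seq)}", sub)
        for start, sub in n_terminal_truncations(seq, min_length)
    ]


def to_fasta(entries: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format entries as FASTA text, wrapping sequence lines at the given width."""
    lines: List[str] = []
    for header, seq in entries:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Calling `to_fasta(truncated_entries("P1", "MKTAYIAK"))` produces three FASTA records, covering residues 1-8, 2-8 and 3-8. Because every truncated form is a separate record, the size of the resulting database grows with sequence length, which is why a targeted hypothesis and a length cut-off matter in practice.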
