Further Development of the Genome Annotating Proteomic Pipeline

Lead Research Organisation: Cranfield University
Department Name: Cranfield Health

Abstract

The completion of the Human Genome Project, which set out to read the entire DNA 'blueprint' for a human being, was one of the greatest scientific achievements of recent times. However, there is a limit to what the genome itself can tell us about how a human being grows and functions, as it does not change with time or tissue type, and varies little between individuals of the same species. To get a real insight into biology and disease mechanisms, we need to look at the proteins expressed from the genome, as these are the molecules that actually do the work in the body. As the protein complement of a cell varies tremendously over time, between tissue types, and in response to environmental conditions, developing methods to identify which proteins are present in a particular sample is crucial. Proteomics is the name given to the science of identifying all the proteins within a sample. This is typically achieved by separating the proteins, breaking them up into smaller peptides, and analysing the peptides with mass spectrometry (MS) to generate a characteristic spectrum which can be matched to a sequence known to exist in the genome. This spectral matching is a very difficult process, due to the complexity and volume of the MS data collected. In a previously funded BBSRC project, we developed the Genome Annotating Proteomic Pipeline (GAPP), which uses distributed computing and advanced algorithms to confidently identify peptides. This system is freely available via the web (www.gapp.info), and has proven popular within the proteomics community. However, it could be easier to extract biologically relevant results from the system, and there are some missing features which would make GAPP much more widely applicable, in particular support for single MS data and for other species. The aim of this proposal is to add this functionality and improve GAPP, to widen the user community and increase the scientific value of GAPP to these users.
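
As a rough illustration of the "break into peptides, then match by mass" idea described above, the following minimal Python sketch digests a protein sequence with the standard trypsin rule (cleave after K or R, unless the next residue is P) and looks up which resulting peptides agree with an observed mass. It is not GAPP's algorithm: the function names, the toy sequence, and the 10 ppm tolerance are all assumptions made for this example, and real spectral matching is considerably more involved.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_mass(observed, peptides, tol_ppm=10.0):
    """Return peptides whose theoretical mass is within tol_ppm of observed.

    The 10 ppm default is an illustrative assumption, not a GAPP setting.
    """
    hits = []
    for pep in peptides:
        theoretical = peptide_mass(pep)
        if abs(theoretical - observed) / theoretical * 1e6 <= tol_ppm:
            hits.append(pep)
    return hits


# Toy sequence, invented for this example.
peptides = tryptic_digest("AAKGGRPLKW")
candidates = match_mass(288.1797, peptides)
```

Matching on peptide mass alone, as here, is closest in spirit to single MS data; tandem MS data additionally provides fragment ions, which is what allows a peptide's sequence to be scored rather than only its mass.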

Technical Summary

In a previous two-year BBSRC-funded project we developed a fully automated, high-throughput peptide identification system, which we subsequently named GAPP (Genome Annotating Proteomic Pipeline). The pipeline takes as input a series of MS/MS peak lists from a given experimental sample, and produces a series of database entries corresponding to the peptides observed within the sample, along with related confidence scores and a list of the proteins to which the peptides belong. In a recent comparative study, the pipeline was shown to outperform other systems in terms of the number of peptides confidently identified, and its application has already led to one high-impact publication, in PNAS. The purpose of this proposal is to request additional funding to continue the development of GAPP, specifically:
1. Modify the identification pipeline and associated peptide identification database so that data can be accepted from any species for which sequence data exists in Ensembl format.
2. Modify the pipeline and database so that they can handle single MS data, as well as the tandem MS data currently supported.
3. Make substantial improvements to the user interface for database querying and visualisation, as described in the case for support.
4. Implement other peptide identification algorithms that are developed during the project, provided they have a clear scientific benefit, such as improving proteome coverage, increasing confidence in peptide observations, or increasing data throughput.
