Pipeline for interpretation and storage of organelle proteomics data

Lead Research Organisation: University of Cambridge
Department Name: Biochemistry

Abstract

Organelle proteomics is an emerging field within the study of proteins. The proteome is the set of proteins expressed by a cell, or found in a biological fluid, at a given time and under given conditions. Within the cells of more complex organisms such as fungi, plants and animals, many proteins are found in specific subcellular structures called organelles, where they carry out their function. Determining the subcellular location(s) of a protein is highly desirable to biologists for two reasons. Firstly, it can help elucidate the protein's role in the cell, as proteins are spatially organised according to their function, and location is an important determinant of the specificity of their molecular interactions. Secondly, it refines our knowledge of cellular processes by pinpointing certain activities to specific organelles. Unfortunately, most organelles cannot be purified away from contaminants well enough to yield an accurate catalogue of the proteins of any given organelle. Recently, several high-throughput methods based on quantitative strategies have emerged that overcome the need to produce a pure organelle for analysis. Each of these methods relies on quantitative proteomics to characterise the distribution pattern of organelles amongst partially enriched fractions generated by various separation technologies, and has the potential to discriminate between genuine organelle residents and contaminants without preparation of pure organelles. For all of these methods, two data analysis stages are essential: the first deals with appropriate normalisation of quantitative data and removal of system bias; the second involves robust multivariate processing of the signals and a statistical assessment of the confidence in the results, which is required to match distribution and enrichment patterns to those of known organelle markers.
Fully curated data sets containing information about experimental design, data manipulation and the assignment of proteins to subcellular locations would be of immense value to biologists. To date there is no easily accessible, streamlined software suite that allows data analysis and the capture of data and metadata in a standardised way, so that they can be easily accessed and interpreted by the community at large. In this proposal, three groups that already have fruitful collaborations in place and highly complementary areas of expertise, Lilley (organelle proteomics), Huber (statistics, software), and Hermjakob and Martens (PRIDE database), all within the Cambridge area, aim to produce a facile organelle proteomics pipeline for the growing organelle proteomics community. The pipeline will aid not only data analysis but also data storage and the presentation of data for submission to all the major journals. Its output will be easily accessible to a wide variety of biologists and will facilitate data sharing amongst a growing cohort of scientists.

Technical Summary

Determining the subcellular location(s) of a protein is essential to the elucidation of cellular processes. Most organelles cannot be purified away from significant contamination by organelles with similar physical properties. Recently, several high-throughput proteomics methods have emerged that overcome the need for pure organelles. These methods rely on characterising the distribution pattern of organelles amongst partially enriched fractions generated by various technologies. For all of these methods, two data analysis stages are essential: first, appropriate normalisation of quantitative data and removal of system bias; second, robust multivariate processing and a statistical assessment of the confidence in the results, which is required to match distribution and enrichment patterns to those of known organelle markers. To date there is no easily accessible, streamlined software suite that allows data analysis and the capture of data and metadata in a standardised way, so that they can be easily accessed and interpreted by the community. PRIDE is now the global repository for proteomics data, recommended by major journals for deposition of relevant datasets. However, PRIDE does not have the functionality to handle deposition of organelle proteomics datasets. Here we aim to produce a facile, open source organelle proteomics pipeline that can be used in a user-friendly, explanatory and scientifically sound manner. The culmination of the project will be a suite of open source software into which raw data will be loaded. After data normalisation, a choice of statistical tests will be available, with a clear explanation of how these tests operate, to allow clustering of the data to reveal assignments to organelles where possible. These assignments, the raw data, and details of the experimental design and starting samples will then be captured by PRIDE, allowing storage of the complete information about the experiment and aiding publication and data sharing.
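
The two analysis stages described above can be illustrated with a minimal sketch. It is not the project's software and does not reproduce any particular published method: it assumes each protein has one intensity per gradient fraction, normalises each profile to sum to 1, and assigns an unlabelled protein to the organelle whose averaged marker profile is nearest, with a simple distance margin standing in for a proper statistical confidence. All function names and the toy intensities are hypothetical.

```python
import math
from collections import defaultdict


def normalise(profile):
    """Scale a protein's fraction intensities so they sum to 1."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must have a positive total intensity")
    return [x / total for x in profile]


def marker_centroids(markers):
    """Average the normalised profiles of the markers for each organelle.

    `markers` maps protein id -> (organelle label, raw profile).
    """
    grouped = defaultdict(list)
    for label, profile in markers.values():
        grouped[label].append(normalise(profile))
    return {
        label: [sum(col) / len(col) for col in zip(*profiles)]
        for label, profiles in grouped.items()
    }


def assign(profile, centroids):
    """Return (best label, margin) for one unlabelled protein.

    The margin is the distance to the second-nearest centroid minus the
    distance to the nearest one; a small margin marks an ambiguous call.
    """
    p = normalise(profile)
    ranked = sorted(
        (math.dist(p, centroid), label) for label, centroid in centroids.items()
    )
    best_dist, best_label = ranked[0]
    margin = ranked[1][0] - best_dist if len(ranked) > 1 else float("inf")
    return best_label, margin


# Toy, made-up intensities across four gradient fractions (not real data).
markers = {
    "m1": ("mitochondrion", [10, 30, 50, 10]),
    "m2": ("mitochondrion", [12, 28, 48, 12]),
    "e1": ("ER", [50, 30, 10, 10]),
    "e2": ("ER", [48, 32, 12, 8]),
}
centroids = marker_centroids(markers)
label, margin = assign([20, 60, 100, 20], centroids)
```

In this toy case the query profile has the same shape as the mitochondrial markers, so it is assigned to that label with a large margin. A real analysis would replace the margin with a statistically grounded confidence measure and would normally treat proteins that match no marker set well as unassigned.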

Publications

 
Description The outcomes of this grant were informatics tools that allow rapid exploitation of the latest cutting-edge technology by providing a mechanism for visualisation, analysis and sharing of organelle proteomics datasets. Other outputs of this work include the novel application of statistical approaches, which may lead to new biological understanding, and the development of R-based software for the proteomics community.
Exploitation Route In the biotechnology, pharmaceutical and clinical sectors. The outputs have been disseminated in an R package (MSnbase), which is now part of Bioconductor, in three peer-reviewed papers and in another manuscript submitted to Bioinformatics. The PI and PDRA have also given numerous talks, and the PDRA has taught on a variety of R courses.
Sectors Education, Healthcare, Pharmaceuticals and Medical Biotechnology

 
Description The results achieved during the grant have been used in subsequent software created in the Lilley lab.
First Year Of Impact 2011
Sector Digital/Communication/Information Technologies (including Software)
 
Description Tools and Resources Development Fund (toolkit)
Amount £150,000 (GBP)
Funding ID BB/H0242471 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 02/2011 
End 08/2012
 
Title pRoloc 
Description Organelle proteomics software
Type Of Technology Software 
Year Produced 2012 
Open Source License? Yes  
Impact Used by many remote users