Pipeline for interpretation and storage of organelle proteomics data

Lead Research Organisation: University of Cambridge

Department Name: Biochemistry

Abstract

Organelle proteomics is an emerging field within the study of proteins. The proteome is the set of proteins expressed by a cell, or found in a biological fluid, at a given time and under given circumstances. Within the cells of more complex organisms such as fungi, plants and animals, many proteins are found in specific subcellular structures called organelles, where they carry out their function. Determining the subcellular location(s) of a protein is valuable to biologists for two reasons. Firstly, it can help elucidate the protein's role in the cell, because proteins are spatially organised according to their function and location is an important determinant of the specificity of their molecular interactions. Secondly, it refines our knowledge of cellular processes by pinpointing certain activities to specific organelles. Unfortunately, most organelles cannot be purified away from contaminants well enough to yield an accurate catalogue of the proteins in any given organelle. Recently, several high-throughput methods based on quantitative strategies have emerged that overcome the need to produce a pure organelle for analysis. Each of these methods relies on quantitative proteomics to characterize the distribution pattern of organelles amongst partially enriched fractions generated by various separation technologies, and has the potential to discriminate between genuine organelle residents and contaminants without preparation of pure organelles. For all of these methods, two data analysis stages are essential: the first deals with appropriate normalisation of the quantitative data and removal of system bias; the second involves robust multivariate processing of the signals, which is required to match distribution and enrichment patterns to those of known organelle markers, together with a statistical assessment of the confidence in the results.
Fully curated data sets containing information about experimental design, data manipulation and the assignment of proteins to subcellular locations would be of immense value to biologists. To date there is no easily accessible, streamlined software suite that allows data analysis and the capture of data and metadata in a standardized way, so that they can be easily accessed and interpreted by the community at large. In this proposal, three groups with successful collaborations already in place and highly complementary areas of expertise, Lilley (organelle proteomics), Huber (statistics, software), and Hermjakob and Martens (PRIDE database), all within the Cambridge area, aim to produce an easy-to-use organelle proteomics pipeline for the growing organelle proteomics community. It will aid not only data analysis but also data storage and the presentation of data for submission to all the major journals. Its output will be easily accessible to a wide variety of biologists and will facilitate data sharing amongst a growing cohort of scientists.

Technical Summary

Determining the subcellular location(s) of a protein is essential in the elucidation of cellular processes. Most organelles cannot be purified free of significant contamination by organelles with similar physical properties. Recently, several high-throughput proteomics methods have emerged that overcome the need for pure organelles. These methods rely on characterizing the distribution pattern of organelles amongst partially enriched fractions generated by various technologies. For all of these methods, two data analysis stages are essential: first, appropriate normalisation of the quantitative data and removal of system bias; second, robust multivariate processing, with a statistical assessment of the confidence in the results, to match distribution and enrichment patterns to those of known organelle markers. To date there is no easily accessible, streamlined software suite that allows data analysis and the capture of data and metadata in a standardized way, so that they can be easily accessed and interpreted by the community. PRIDE is now the global repository for proteomics data, recommended by major journals for deposition of relevant datasets, but it does not have the functionality to handle deposition of organelle proteomics datasets. Here we aim to produce an easy-to-use, open source organelle proteomics pipeline that can be used in a user-friendly, explanatory and scientifically sound manner. The culmination of the project will be a suite of open source software into which raw data will be loaded. After data normalisation, a choice of statistical tests will be available, with a clear explanation of how each test operates, to allow clustering of the data to reveal assignments to organelles where possible. These assignments, the raw data, and details of the experimental design and starting samples will then be captured by PRIDE, allowing storage of the complete information about the experiment and aiding publication and data sharing.
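
The two analysis stages described above can be illustrated with a minimal sketch. It is not the project's software or its statistical method: the function names, the sum-to-one normalisation, the nearest-centroid rule and the intensity values are all assumptions chosen for brevity. Each protein's intensities across fractions are first normalised, then the protein is matched to the organelle whose marker proteins have the most similar average profile, with a simple score indicating how clear-cut the match is.

```python
from math import sqrt

def normalise(profile):
    """Scale a protein's intensities so they sum to 1 across fractions."""
    total = sum(profile)
    if total == 0:
        raise ValueError("profile has no signal")
    return [x / total for x in profile]

def centroid(profiles):
    """Mean profile of a set of equally long profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def distance(a, b):
    """Euclidean distance between two profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(profile, markers):
    """Return (organelle, score) for the closest marker centroid.

    score = 1 - d_best / d_second: near 1 means a clear match,
    near 0 means the two closest organelles are hard to tell apart.
    Requires marker sets for at least two organelles.
    """
    p = normalise(profile)
    dists = sorted(
        (distance(p, centroid([normalise(m) for m in ms])), name)
        for name, ms in markers.items()
    )
    (d1, best), (d2, _) = dists[0], dists[1]
    score = 1.0 - d1 / d2 if d2 > 0 else 0.0
    return best, score

# Hypothetical intensities across four fractions (illustrative values only)
markers = {
    "mitochondrion": [[10, 40, 30, 20], [12, 38, 32, 18]],
    "ER": [[40, 30, 20, 10], [44, 28, 18, 10]],
}
organelle, score = assign([11, 41, 29, 19], markers)
print(organelle, round(score, 2))
```

A real analysis would replace the nearest-centroid rule with a properly validated multivariate classifier and a formal estimate of assignment confidence; the sketch only shows where normalisation and marker matching sit relative to each other.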

Funded Value:

£118,713

Funded Period:

Jan 10 - Mar 11

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/G024618/1

Principal Investigator:

Kathryn Lilley

Research Subject:

Omic sciences & technologies (40%)

Tools, technologies & methods (20%)

Research Topic:

Bioinformatics (20%)

Proteomics (40%)

Organisations

University of Cambridge (Lead Research Organisation)

People	ORCID iD
Kathryn Lilley (Principal Investigator)
Henning Hermjakob (Co-Investigator)	http://orcid.org/0000-0001-8479-0262
Wolfgang Huber (Co-Investigator)
Lennart Martens (Researcher Co-Investigator)

Publications

Gatto L (2012) MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. in Bioinformatics (Oxford, England)

Gatto L (2010) Organelle proteomics experimental designs and analysis. in Proteomics

Lilley KS (2011) Challenges for proteomics core facilities. in Proteomics

Thul P (2017) A subcellular map of the human proteome. in Science

Villanueva E (2024) System-wide analysis of RNA and protein subcellular localization dynamics. in Nature methods

Key Findings

Description	The outcomes of this grant were informatics tools that allow rapid exploitation of the latest cutting-edge technology by providing a mechanism for the visualization, analysis and sharing of organelle proteomics datasets. Other outputs include the novel application of statistical approaches, which may lead to new biological understanding, and the development of R-based software for the proteomics community.
Exploitation Route	In the biotechnology, pharmaceutical and clinical sectors. The outputs have been disseminated in an R package (MSnbase), which is now part of Bioconductor, three peer-reviewed papers and another manuscript submitted to Bioinformatics. The PI and PDRA have also given numerous talks, and the PDRA has taught on a variety of R courses.
Sectors	Education; Healthcare; Pharmaceuticals and Medical Biotechnology

Impact Summary

Description	The results achieved during the grant have been used in subsequent software created in the Lilley lab.
First Year Of Impact	2011
Sector	Digital/Communication/Information Technologies (including Software)

Further Funding

Description	Tools and Resources Development Fund (toolkit)
Amount	£150,000 (GBP)
Funding ID	BB/H0242471
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	02/2011
End	08/2012

Software and Technical Products

Title	pRoloc
Description	Organelle proteomics software
Type Of Technology	Software
Year Produced	2012
Open Source License?	Yes
Impact	Used by many remote users