Toolkit for Interpretation of Organelle Proteomics Data

Lead Research Organisation: University of Cambridge
Department Name: Biochemistry

Abstract

Organelle proteomics is an up-and-coming field, as the functionality of proteins and cellular mechanisms are clearly linked to the subcellular locations of the proteins. Such is the growing prominence of the field that the Human Proteome Organisation (HUPO) is holding its annual Barbados Workshop in January 2010 on the subject, in an attempt to move the field forward so that optimal protocols and data analysis methods are disseminated through the proteomics and cell biology communities. A major frustration within the field is that the collection of high-quality data is very time-consuming and resource-intensive. The data sets produced are, in general, extremely rich sources of information; to date, however, they have not been analysed to their full potential because of a lack of suitable statistical workflows. For example, a recent paper published in MCP (Andreyev et al., Mol. Cell. Proteomics, online 2009) contains a very rich dataset that has only been superficially mined to produce a limited organelle marker data set; the rest of the data is not fully analysed and languishes in a spreadsheet within the supplemental data. Basic analytical strategies, pioneered by the applicants' laboratories, have shown the potential of such datasets to produce robust organelle proteome lists containing organelle-specific annotations of proteins of unknown localization. Even the applicants' substantial organelle proteomics datasets, however, have only been analysed to a limited extent with established statistical approaches. In this proposal we aim to create more sophisticated statistical tools, building on what has already been established by the applicants, which will enable assignment of proteins to subcellular locations using semi-supervised pattern recognition algorithms.
These tools will lead to assignment of protein-organelle membership, resolution of protein association with multiple organelles, and identification of changes in protein-organelle association across multiple experimental conditions. They will be produced as freely available software for analysis of standard organelle proteomics data generated by the most common approaches. Application of these novel statistical approaches will lead to the creation of optimal organelle proteomics datasets, which will themselves be deposited in PRIDE, a publicly accessible proteomics data repository. In summary, the proposal will create a much-needed tool to allow robust analysis of organelle proteomics datasets and enable re-analysis of existing very rich data sets, so that optimal mining of these data is achieved. It will also offer optimal tools for analysis of future organelle proteomics datasets, which the proteomics and cell biology communities are starting to produce in earnest. The above work plan will be expedited by a multidisciplinary team that includes Kathryn Lilley, developer of the organelle proteomics technologies, and Matthew Trotter, a bioinformatician and statistician.

Technical Summary

Organelle proteomics is an up-and-coming field, as the functionality of proteins and cellular mechanisms are clearly linked to their subcellular location. Such is the growing prominence of the field that the Human Proteome Organisation (HUPO) is holding its annual Barbados Workshop in 2010 on the subject. A major frustration within the field is that the collection of high-quality data is very resource-intensive. The data sets produced are extremely rich sources of information which, to date, have not been analysed to their full potential because of a lack of suitable statistical workflows. Analysis techniques pioneered by the applicants' laboratories have demonstrated the potential of such datasets to produce robust organelle proteomes. The aims of the proposal are to: i) collate data sets from many organelle proteomics approaches, including LOPIT, an approach developed in the PI's laboratory; ii) further develop semi-supervised pattern recognition algorithms to assign protein-organelle membership, resolve protein association with multiple organelles, and identify changes in protein-organelle association across experimental conditions, and develop a freely available toolkit for organelle proteomics data analysis in the R statistical language; iii) apply the methods from ii) to the data sets from i) and deposit the findings in PRIDE and in specialist organism databases such as LOCATE, SUBA and FlyMine. The above work plan will be expedited by a multidisciplinary team including Kathryn Lilley, developer of the LOPIT technology, and Matthew Trotter, a statistician who has worked extensively with gene and protein expression data. The organelle proteomics and cell biology communities will be the major beneficiaries of this work. The long-overdue need for the proposed work is exemplified by letters of support not only from very productive collaborators of the applicants, but also from two of the top organelle proteomics laboratories in the world.
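
As an illustration of the kind of semi-supervised assignment described in aim ii), the following is a minimal Python sketch and not the toolkit itself (which the proposal states will be written in R). It assumes each protein has a fractionation profile (relative abundance across gradient fractions), that a few proteins are known organelle markers, and that unlabelled proteins can be assigned to the nearest marker centroid by self-training. The function names, the distance threshold and the example data are all hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroids(profiles, labels):
    """Mean profile per organelle, computed from currently labelled proteins."""
    groups = {}
    for protein, organelle in labels.items():
        groups.setdefault(organelle, []).append(profiles[protein])
    return {
        organelle: [sum(col) / len(col) for col in zip(*rows)]
        for organelle, rows in groups.items()
    }


def self_train(profiles, markers, max_distance=0.2, max_rounds=10):
    """Illustrative self-training: start from marker labels, then repeatedly
    assign unlabelled proteins that lie within max_distance of their nearest
    organelle centroid. Proteins that never come close enough stay unassigned.
    The threshold is an arbitrary assumption, not a recommended value."""
    labels = dict(markers)
    for _ in range(max_rounds):
        cents = centroids(profiles, labels)
        newly_assigned = {}
        for protein, profile in profiles.items():
            if protein in labels:
                continue
            organelle, d = min(
                ((org, dist(profile, c)) for org, c in cents.items()),
                key=lambda pair: pair[1],
            )
            if d <= max_distance:
                newly_assigned[protein] = organelle
        if not newly_assigned:
            break  # nothing new was labelled, so the centroids will not change
        labels.update(newly_assigned)
    return labels


# Invented toy data: three fractions, two organelles, three unlabelled proteins.
profiles = {
    "m1": [0.8, 0.1, 0.1], "m2": [0.7, 0.2, 0.1],   # mitochondrial markers
    "e1": [0.1, 0.1, 0.8], "e2": [0.1, 0.2, 0.7],   # ER markers
    "u1": [0.75, 0.15, 0.1],                        # resembles the mitochondrial markers
    "u2": [0.1, 0.15, 0.75],                        # resembles the ER markers
    "u3": [0.4, 0.2, 0.4],                          # ambiguous profile
}
markers = {"m1": "mitochondrion", "m2": "mitochondrion", "e1": "ER", "e2": "ER"}

result = self_train(profiles, markers)
print(result)
```

In this toy example the two proteins whose profiles resemble the markers are assigned, while the ambiguous profile is left unassigned rather than forced into an organelle. A protein of that kind is the sort of case that the aim of resolving association with multiple organelles would need to address with a more suitable model.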

Planned Impact

Organelle proteomics is an up-and-coming field, as the functionality of proteins and cellular mechanisms are clearly linked to the subcellular locations of the proteins. The data sets produced are extremely rich sources of information which, to date, have not been analysed to their full potential because of a lack of suitable statistical workflows. Analysis techniques pioneered by the applicants' laboratories, using principal components analysis and hierarchical clustering, have shown the potential of such datasets to produce robust organelle proteomes. In order to perform the proposed work, we will: i) collate data sets from many organelle proteomics approaches; ii) develop semi-supervised pattern recognition algorithms to assign protein-organelle membership, resolve protein association with multiple organelles, and identify changes in protein-organelle association across experimental conditions, and develop a freely available toolkit for organelle proteomics data analysis in the R statistical language; iii) apply the methods from ii) to the data sets from i) and deposit the findings in PRIDE and in specialist organism databases such as LOCATE, SUBA and FlyMine. The above work plan will be expedited by a multidisciplinary team that includes Kathryn Lilley, developer of the LOPIT technology, and Matthew Trotter, a statistician who has worked extensively with gene and protein expression data.
Who will benefit from this research? The organelle proteomics community will be the major beneficiary of this work. The long-overdue need for the proposed work is exemplified by letters of support not only from very productive collaborators of the applicants, but also from two of the top organelle proteomics laboratories in the world. Moreover, cell biologists, both academic and within the pharmaceutical sector, will also benefit, as this proposal underpins the interface between modern 'omics technologies and more classical cell biological methodologies.
How will they benefit from this research? The benefit will be a pipeline that enables optimal mining of organelle proteomics data sets, in the form of robust analytical methods for high-throughput organelle proteomics datasets. Furthermore, approaches will be developed to characterise sets of proteins whose correlated change in subcellular location upon specific perturbation will give insight into cellular mechanisms. Additionally, fully characterised organelle proteomics datasets will be deposited in publicly accessible databases, and subcellular location information will be communicated to organism-specific databases.
What will be done to ensure that they have the opportunity to benefit from this research? The statistical tools produced will be implemented in the R statistical programming environment (www.r-project.org) in order to synchronise with existing efforts to provide open-source R scripts for handling raw LOPIT output to the Bioconductor suite of bioinformatics software (BBSRC: BB/G024618/1). Manuscripts will be written which describe not only the novel statistical approaches developed but also their demonstration by re-analysis of existing data and of novel datasets produced within the applicants' laboratory during the course of the funding period. It is envisaged that these manuscripts will be submitted to high-impact journals with a large general readership, such as Nature Methods and Nature Biotechnology. KSL is invited to give numerous talks at the top proteomics conferences worldwide, and she will endeavour to publicise the work described here at such events. KSL and MWBT have recently submitted an FP7-Infrastructure proposal in collaboration with other top proteomics laboratories in Europe. A large portion of that proposal is given over to forming transnational training facilities, and KSL and MWBT intend to offer organelle proteomics data analysis training as part of it.

 
Description: We developed a toolkit in which we modified software tools developed by the imaging community and applied them to organelle proteomics data. The tools have been deposited as open-source software and are now being used by researchers worldwide.
Exploitation Route: The software produced is being used by others, as judged by conference proceedings and new collaborations.
Sectors: Digital/Communication/Information Technologies (including Software); Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Description: Researchers in the field of organelle proteomics have benefited from the tools developed here. Indeed, the tools have led to new collaborations between the PI's lab and top organelle proteomics groups worldwide. They have also attracted interest from mass spectrometry vendors.
First Year Of Impact: 2012
Sector: Digital/Communication/Information Technologies (including Software)
 
Description: Tools and Resources Development Fund (data fusion)
Amount: £150,000 (GBP)
Funding ID: BB/K00137X/1
Organisation: Biotechnology and Biological Sciences Research Council (BBSRC)
Sector: Public
Country: United Kingdom
Start: 08/2012
End: 02/2014
 
Description: sLoLa
Amount: £3,700,000 (GBP)
Funding ID: BB/L002817/1
Organisation: Biotechnology and Biological Sciences Research Council (BBSRC)
Sector: Public
Country: United Kingdom
Start: 03/2014
End: 02/2018
 
Title: pRoloc
Description: Organelle proteomics software
Type Of Technology: Software
Year Produced: 2012
Open Source License?: Yes
Impact: Used by many remote users