Automated identification of optimal data-specific organelle clusters using freely available protein annotations

Lead Research Organisation: University College London
Department Name: Genetics Evolution and Environment


Organelle proteomics is the systematic study of proteins and their assignment to sub-cellular compartments such as organelles and macro-molecular complexes. The field is growing in importance and popularity, and over the last few years it has gained considerable attention because of the role organelles play in carrying out defined cellular processes.

The most information-rich datasets are generated using high-accuracy mass-spectrometry (MS), a technique that identifies and quantifies the proteome content of complex biological samples. These high-quality datasets have been mined using a variety of robust supervised statistical machine learning (ML) methods, which have been shown to yield valuable protein-organelle predictions (BBSRC: BB/G024618/1 and BB/H024247/1). These classification methods require a set of tens of expert-curated, ground-truth marker proteins of known localisation; they then match proteins of unknown localisation to organelles based on how closely their MS data resemble those of the marker proteins. However, inherent issues still limit the optimal application of such contemporary classification methods: (1) the limited number of organelle markers and the reliance on time-consuming manual curation and (2) the limited number of organelle classes, which systematically underestimates the sub-cellular diversity recorded in the datasets.
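
The marker-based matching described above can be illustrated with a minimal sketch. It is not the supervised method used in the cited grants: a nearest-centroid rule stands in for the ML classifiers, and the protein names, organelle labels and abundance profiles are invented toy data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two quantitative profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_by_markers(profiles, markers):
    """Assign each non-marker protein to the organelle with the closest marker centroid.

    profiles: dict mapping protein name -> list of abundances across fractions
    markers:  dict mapping marker protein name -> organelle label
    """
    # Collect the marker profiles belonging to each organelle.
    grouped = {}
    for protein, organelle in markers.items():
        grouped.setdefault(organelle, []).append(profiles[protein])

    # Average the marker profiles column-wise to get one centroid per organelle.
    centroids = {
        organelle: [sum(column) / len(column) for column in zip(*rows)]
        for organelle, rows in grouped.items()
    }

    # Every protein that is not a marker gets the label of its nearest centroid.
    predictions = {}
    for protein, profile in profiles.items():
        if protein in markers:
            continue
        predictions[protein] = min(
            centroids, key=lambda organelle: euclidean(profile, centroids[organelle])
        )
    return predictions


# Toy data (hypothetical): abundances across three gradient fractions.
profiles = {
    "mitoA": [0.70, 0.20, 0.10],
    "mitoB": [0.60, 0.30, 0.10],
    "erA": [0.10, 0.20, 0.70],
    "erB": [0.10, 0.30, 0.60],
    "unknown1": [0.65, 0.25, 0.10],
    "unknown2": [0.15, 0.25, 0.60],
}
markers = {
    "mitoA": "mitochondrion",
    "mitoB": "mitochondrion",
    "erA": "ER",
    "erB": "ER",
}

predictions = classify_by_markers(profiles, markers)
print(predictions)
```

The sketch also makes the two limitations visible: a protein can only ever be assigned to an organelle that has curated markers, and every centroid depends entirely on the quality of that curation.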

In this proposal we aim to improve protein-organelle association by applying state-of-the-art methods that remove the need for ground-truth marker proteins and accurately assign proteins to a broader set of sub-cellular compartments. These unsupervised approaches will look specifically for patterns in the organelle proteomics data. We will also make use of the vast amounts of freely available protein annotation data, such as the Gene Ontology. These annotations, while prone to erroneous or misleading information, are available for tens of thousands of proteins and describe all organelles identified so far. The sheer amount of annotation data makes it possible to overcome its uncertainty and to investigate the sub-cellular environment at a much more meaningful level of diversity. In addition, the proposed methods will allow complete automation of the data analysis, thus permitting the treatment of more and bigger datasets.

We will develop a framework that uses these annotations to guide the extrapolation and elucidation of patterns in the MS data. This will lead to the creation of optimal organelle proteomics datasets, which will be deposited in a public access proteomics data repository through the main ProteomeXchange submission portal. The tools will be made freely available as open-source software for use by the whole proteomics community.

The work proposed in this grant will be implemented by a multidisciplinary team bringing together expertise in state-of-the-art mass-spectrometry based proteomics approaches (KSL), database annotation (CD), contemporary pattern recognition methods (AP, TB, SBH and LG), computational bioinformatics and code development (LG) and applied mathematics (LMS). LG, KSL and LMS have worked together previously on organelle proteomics grants that resulted in the release of the current state-of-the-art toolkits for organelle proteomics data analysis.

Technical Summary

Localisation of proteins inside cells is of paramount importance to study their function, refine our comprehension of sub-cellular processes and organisation, and understand the effect of perturbations at the sub-cellular level. Various dedicated experimental designs based on biochemical separation and quantitative mass-spectrometry have been described and refined over the years. The major breakthrough in organelle proteomics data analysis has been the application of state-of-the-art supervised machine learning (ML) techniques. These techniques use the quantitative profiles of the proteins and permit optimal classification of proteins of unknown localisation based on the definition of sub-cellular markers. These markers are proteins of known localisation, identified through manual database mining, literature search and, most crucially, expert curation. Manual curation of a dataset containing thousands of proteins, although currently the most reliable solution, is an extremely time-consuming task. Furthermore, the quest for tens of highly reliable markers per organelle favours large, well-characterised organelles at the expense of smaller, less studied compartments, leading to systematic under-representation of the true organelle diversity in the experimental data. Our project proposes a major shift in the analysis of organelle proteomics data: abandoning supervised ML, which requires rigid sets of highly reliable markers, and instead employing unsupervised and semi-unsupervised approaches that rely on the vast amount of freely available database annotations such as the Gene Ontology. These novel approaches will allow us to (1) automate the analysis of our datasets without expensive manual curation and (2) assess the true cellular diversity that underpins such experiments at a much finer scale. These techniques will be made accessible within the open-source pRoloc framework for organelle proteomics data analysis.
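
As a rough illustration of the annotation-guided idea, and not the method developed in this project or the pRoloc API, the sketch below clusters quantitative profiles without markers and then labels each cluster by a reliability-weighted vote over noisy annotation terms. The single-linkage clustering, the distance threshold, the protein names, the annotation terms and the reliability weights are all assumptions made for the example.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two quantitative profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_profiles(profiles, threshold):
    """Single-linkage clustering: proteins closer than `threshold` end up together."""
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Merge the clusters of every pair of proteins with similar profiles.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if euclidean(profiles[a], profiles[b]) < threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for name in names:
        clusters[find(name)].append(name)
    return list(clusters.values())


def label_clusters(clusters, annotations, reliability):
    """Label every cluster with its highest reliability-weighted annotation term."""
    labels = {}
    for members in clusters:
        votes = defaultdict(float)
        for protein in members:
            for term in annotations.get(protein, []):
                votes[term] += reliability.get(term, 0.0)
        label = max(votes, key=votes.get) if votes else "unlabelled"
        # Unannotated members inherit the label of their cluster.
        for protein in members:
            labels[protein] = label
    return labels


# Toy data (hypothetical): profiles across three fractions, noisy annotations
# and made-up per-term reliability weights.
profiles = {
    "p1": [0.70, 0.20, 0.10],
    "p2": [0.65, 0.25, 0.10],
    "p3": [0.68, 0.22, 0.10],
    "p4": [0.10, 0.20, 0.70],
    "p5": [0.12, 0.23, 0.65],
    "p6": [0.30, 0.40, 0.30],
}
annotations = {
    "p1": ["mitochondrion"],
    "p2": ["mitochondrion", "nucleus"],  # "nucleus" plays the misleading term
    "p4": ["endoplasmic reticulum"],
    "p5": ["endoplasmic reticulum"],
}
reliability = {"mitochondrion": 0.9, "nucleus": 0.4, "endoplasmic reticulum": 0.8}

clusters = cluster_profiles(profiles, threshold=0.15)
labels = label_clusters(clusters, annotations, reliability)
print(labels)
```

In this toy run the unannotated protein p3 takes the label of its cluster, the misleading "nucleus" term on p2 is outvoted, and p6 forms its own unlabelled cluster, which is the kind of candidate compartment a marker-based analysis would not report.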

Planned Impact

Who will benefit from this research?

The developments proposed in this project will benefit the organelle proteomics community in particular, as we will develop and share improved tools to analyse such data. The proteomics field as a whole will also benefit: our methods and software, although focused on organelle proteomics data, have a much wider scope and can be applied in other fields. Computational biologists will benefit from the open-source organelle proteomics analysis methods and the quality software that will be distributed to the wider community. Cell biologists, both in academia and in the pharmaceutical sector, will also benefit immensely, as this proposal underpins the interface between modern omics technologies and more classical cell biological methodologies.

Our work is aimed at experimentalists who will use our tools to analyse their data, as well as computational scientists and developers who want to re-use or adapt our methods and software infrastructure for new projects and topics.

How will they benefit from this research?

The toolkit will enable more thorough mining of proteomics data produced by widely used gradient-based proteomics approaches, giving unprecedented insight into the underlying sub-cellular diversity of these data. In addition, it will provide a benchmark upon which to add new data analysis methods as the technology and data annotation progress. The statistical machine learning methods will be made available for the R statistical programming environment and the Bioconductor project and will inter-operate with existing complementary software. We expect our novel methods to be applicable in other omics areas of research, owing to the inherent cross-disciplinary nature of the computer science, mathematics and machine learning that underpin many areas of computational biology. Lastly, the project will contribute knowledge and scientific advancement through the dissemination of data and improved analysis of complex multivariate data, facilitating the interpretation and understanding of relevant biological processes. Fully characterised organelle proteomics datasets will be deposited in publicly accessible databases (via the ProteomeXchange portal) upon publication of the peer-reviewed research outputs, and the detailed analysis methodologies will be documented and distributed with software releases to facilitate application of our methods to new datasets and use cases.

The research staff will benefit from the multi-disciplinary research environment and extend their national and international research network through on-going collaborations. In addition to the benefits of improved tools and data, the academic beneficiaries will also be invited to workshops that will be organised in the frame of the European FP7 project to promote our approaches.

What will be done to ensure that they have the opportunity to benefit from this research?

The algorithms and tools developed in this proposal will be implemented in the R statistical programming environment and will be deposited in the Bioconductor suite of bioinformatics software. The algorithms will be implemented as independent modules that will be contributed to, and compatible with, the current pRoloc analysis framework (developed by LG and LMS in BBSRC: BB/H024247/1 and BB/G024618/1), to form a freely available open-source toolkit for the analysis of organelle proteomics data. It is envisaged that the resulting manuscripts will be submitted to high-impact journals with a large general readership, such as Nature Methods and Nature Biotechnology. KSL, LG and CD are invited to give numerous talks at the top proteomics and computational conferences worldwide, and they will endeavour to publicise the work described here at such events.
Description The spatial diversity in mass spectrometry based spatial proteomics experiments such as LOPIT and hyperLOPIT is far greater than is currently documented in the literature. Many sub-cellular compartments are missed when relying on manual annotation of the data. We have developed a simple computational method, relying on public repositories, together with an associated interactive user interface that enables users to highlight and explore this diversity. The computational method and the interactive exploration interface are available through the open source pRoloc and pRolocGUI packages.
Exploitation Route Automatic annotation has identified new needs in terms of machine learning for spatial proteomics data. We are currently working in collaboration with Sean Holden from the Computer Laboratory at the University of Cambridge to address these new needs.
Sectors Agriculture, Food and Drink; Healthcare; Pharmaceuticals and Medical Biotechnology

Title GO reliability scores 
Description We estimated reliability scores for Gene Ontology terms using the approach described in (Skunca & Dessimoz, PLOS Comp Biol 2013, DOI:10.1371/journal.pcbi.1002533), updated with 2014 data. This dataset was provided to our collaborators (BBSRC grant BB/L018497/1) and will form the basis of a new semi-supervised method for subcellular localisation of proteins from proteomics data. 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact Although we have shared the dataset with our collaborators, it is as yet unpublished. 
Title pRoloc 
Description pRoloc is a complete infrastructure to support and guide the sound analysis of quantitative mass-spectrometry-based spatial proteomics data. It provides unsupervised and supervised machine learning functionality for data exploration and protein classification, as well as novelty detection to identify new putative sub-cellular clusters. The software builds upon existing infrastructure for data management and data processing. 
Type Of Technology Software 
Year Produced 2012 
Impact Dissemination of software, visualisation and analysis methods for spatial proteomics. The software and associated techniques have been applied by other groups, both in collaboration with the authors and independently. 
Title pRolocGUI 
Description Interactive visualisation of organelle (spatial) proteomics data on the basis of pRoloc, pRolocdata and shiny. 
Type Of Technology Software 
Year Produced 2014 
Impact Interactive interface used to disseminate the first large-scale and high-resolution stem cell localisation map (Christoforou et al., 2016 doi:10.1038/ncomms9992). 
Title pRolocdata 
Description Mass-spectrometry based spatial proteomics data sets from Dunkley et al. (2006), Foster et al. (2006), Tan et al. (2009), Hall et al. (2009), Trotter et al. (2010), Ferro et al. (2010), Nikolovski et al. (2012, 2014), Breckels et al. (2013), Groen et al. (2014) and Christoforou et al. (2015), and protein complex separation data from Kristensen et al. (2012), Havugimana et al. (2012), Kirkwood et al. (2013) and Fabre et al. (2015). 
Type Of Technology Software 
Year Produced 2012 
Impact Dissemination of various spatial proteomics datasets, including the data underlying the implementation of reproducible spatial proteomics data analysis.