Automated identification of optimal data-specific organelle clusters using freely available protein annotations

Lead Research Organisation: University College London

Department Name: Genetics Evolution and Environment

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

Knowing where proteins localise inside cells is essential for studying their function, refining our understanding of sub-cellular processes and organisation, and understanding the effect of perturbations at the sub-cellular level. Various dedicated experimental designs based on biochemical separation and quantitative mass spectrometry have been described and refined over the years. The major breakthrough in organelle proteomics data analysis has been the application of state-of-the-art supervised machine learning (ML) techniques. These techniques use the quantitative profiles of the proteins to classify proteins of unknown localisation, based on a set of defined sub-cellular markers. These markers are proteins of known localisation, identified through manual database mining, literature search and, most crucially, expert curation. Although manual curation is currently the most reliable solution, applying it to a dataset containing thousands of proteins is extremely time consuming. Furthermore, the quest for tens of highly reliable markers per organelle favours large, well-characterised organelles at the expense of smaller, less studied compartments, leading to systematic under-representation of the true organelle diversity in the experimental data. Our project proposes a major shift in the analysis of organelle proteomics data: abandoning supervised ML, which requires rigid sets of highly reliable markers, and instead employing unsupervised and semi-supervised approaches that rely on the vast amount of freely available database annotations such as the Gene Ontology. These novel approaches will allow us to (1) automate the analysis of our datasets without expensive manual curation and (2) assess the true cellular diversity that underpins such experiments at a much finer scale. These techniques will be made accessible within the open source pRoloc framework for organelle proteomics data analysis.
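The marker-based supervised approach described in this summary can be illustrated with a minimal sketch. It assumes each protein has a quantitative profile across gradient fractions and that a few proteins carry marker labels. A simple nearest-centroid rule stands in here for the state-of-the-art classifiers the summary refers to; it is not the pRoloc implementation, and the protein identifiers, profile values and the `classify` helper are all invented for illustration.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(profiles):
    """Column-wise mean of a list of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def classify(profiles, markers):
    """Assign each unlabelled protein to the organelle with the nearest marker centroid.

    profiles: dict mapping protein id -> list of fraction intensities
    markers:  dict mapping protein id -> organelle label (the curated markers)
    """
    by_organelle = defaultdict(list)
    for protein, organelle in markers.items():
        by_organelle[organelle].append(profiles[protein])
    centroids = {org: centroid(ps) for org, ps in by_organelle.items()}

    predictions = {}
    for protein, profile in profiles.items():
        if protein in markers:
            continue  # markers already have a known localisation
        predictions[protein] = min(
            centroids, key=lambda org: euclidean(profile, centroids[org])
        )
    return predictions


# Hypothetical data: three gradient fractions per protein.
profiles = {
    "P1": [0.60, 0.30, 0.10],
    "P2": [0.55, 0.35, 0.10],
    "P3": [0.10, 0.30, 0.60],
    "P4": [0.15, 0.25, 0.60],
    "P5": [0.58, 0.32, 0.10],
    "P6": [0.12, 0.28, 0.60],
}
markers = {"P1": "mitochondrion", "P2": "mitochondrion", "P3": "ER", "P4": "ER"}

predictions = classify(profiles, markers)
print(predictions)
```

The sketch also shows the limitation the summary raises: a protein can only be assigned to an organelle for which markers were supplied, so compartments without curated markers are invisible to the classifier.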

Planned Impact

Who will benefit from this research?

The developments proposed in this project will benefit the organelle proteomics community in particular, as we will develop and share improved tools to analyse such data. The proteomics field as a whole will also benefit: our methods and software, although focused on organelle proteomics data, have a much wider scope and can be applied in other fields. Computational biologists will benefit from the open-source organelle proteomics analysis methods and the quality software that will be distributed to the wider community. Cell biologists, both academic and within the pharmaceutical sector, will also benefit greatly, as this proposal sits at the interface of modern omics technologies and more classical cell biological methodologies.

Our work is targeted to experimentalist users who will use our tools to analyse their data, as well as computational scientists and developers who want to re-use or adapt our methods and software infrastructure to new projects and topics.

How will they benefit from this research?

The toolkit will enable far deeper mining of the data produced by widely used gradient-based proteomics approaches, giving new insight into the sub-cellular diversity underlying these data. In addition, it will provide a benchmark against which new data analysis methods can be added as the technology and data annotation progress. The statistical machine learning methods will be made available for the statistical programming environment R and the Bioconductor project and will inter-operate with existing complementary software. We expect our novel methods to be applicable in other omics areas of research, owing to the inherently cross-disciplinary nature of the computer science, mathematics and machine learning that underpin many areas of computational biology. Lastly, the project will contribute knowledge and scientific advancement through the dissemination of data and improved analysis of complex multivariate data, facilitating interpretation and understanding of relevant biological processes. Fully characterised organelle proteomics datasets will be deposited in publicly accessible databases (via the ProteomeXchange portal) upon publication of the peer-reviewed research outputs, and the detailed analysis methodologies will be documented and distributed with software releases to facilitate application of our methods to new datasets and use cases.

The research staff will benefit from the multi-disciplinary research environment and extend their national and international research network through on-going collaborations. In addition to the benefits of improved tools and data, the academic beneficiaries will also be invited to workshops that will be organised in the frame of the European FP7 project to promote our approaches.

What will be done to ensure that they have the opportunity to benefit from this research?

The algorithms and tools developed in this proposal will be implemented in the R statistical programming environment (www.r-project.org) and will be deposited in the Bioconductor suite of bioinformatics software. The algorithms will be implemented as independent modules that will be contributed to, and compatible with, the current pRoloc analysis framework (developed by LG and LMS in BBSRC: BB/H024247/1 and BB/G024618/1), to form a freely available open-source toolkit for the analysis of organelle proteomics data. It is envisaged that manuscripts describing this work will be submitted to high impact journals with large general readership, such as Nature Methods and Nature Biotechnology. KSL, LG and CD are invited to give numerous talks at the top proteomics and computational conferences worldwide, and will endeavour to publicise the work described here at such events.

Funded Value:

£10,175

Funded Period:

Apr 14 - Sep 15

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/L018241/1

Principal Investigator:

Christophe Dessimoz

Research Subject:

Info. & commun. Technol. (16%)

Omic sciences & technologies (48%)

Tools, technologies & methods (32%)

Research Topic:

Artificial Intelligence (16%)

Bioinformatics (32%)

Functional genomics (16%)

Proteomics (32%)

Organisations

University College London (Lead Research Organisation)

People	ORCID iD
Christophe Dessimoz (Principal Investigator)

Publications

Škunca N (2015) Phylogenetic profiling: how much input data is enough? in PloS one

Tan G (2015) Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. in Systematic biology

Sojo V (2016) Membrane Proteins Are Dramatically Less Conserved than Water-Soluble Proteins across the Tree of Life. in Molecular biology and evolution

Piližota I (2019) Phylogenetic approaches to identifying fragments of the same gene, with application to the wheat genome. in Bioinformatics (Oxford, England)

Jiang Y (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. in Genome biology

Altenhoff AM (2016) Standardized benchmarking in the quest for orthologs. in Nature methods

Altenhoff AM (2018) The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. in Nucleic acids research

Altenhoff AM (2015) The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. in Nucleic acids research

Key Findings
Research Databases and Models
Software and Technical Products


Description	The spatial diversity in mass spectrometry-based spatial proteomics experiments such as LOPIT and hyperLOPIT is far greater than is currently documented in the literature. Many sub-cellular compartments are missed when relying on manual annotation of the data. We have developed a simple computational method relying on public repositories, together with an associated interactive user interface, that enables users to highlight and explore this diversity (see URL below). Both are available through the open source pRoloc (http://bioconductor.org/packages/pRoloc) and pRolocGUI (http://bioconductor.org/packages/pRolocGUI) packages.
Exploitation Route	Automatic annotation has identified new needs in terms of machine learning for spatial proteomics data. We are currently working in collaboration with Sean Holden from the Computer Laboratory at the University of Cambridge to address these new needs.
Sectors	Agriculture, Food and Drink; Healthcare; Pharmaceuticals and Medical Biotechnology
URL	https://lgatto.github.io/pRoloc/articles/pRoloc-goannotations.html
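One way to picture an annotation-driven exploration of this kind is to group proteins by a public annotation term and rank the terms by how tightly their members sit together in the quantitative profile space. The following is a minimal sketch under that assumption only. It is not the method implemented in pRoloc; the `rank_terms` helper, the tightness measure (mean pairwise Euclidean distance), the protein identifiers and the profile values are all hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(members, profiles):
    """Average distance over all pairs of member proteins (needs >= 2 members)."""
    pairs = list(combinations(members, 2))
    return sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def rank_terms(annotations, profiles, min_size=2):
    """Rank annotation terms from tightest to loosest cluster.

    annotations: dict mapping term name -> list of annotated protein ids
    profiles:    dict mapping protein id -> list of fraction intensities
    Terms with fewer than min_size measured proteins are skipped.
    """
    scores = {}
    for term, proteins in annotations.items():
        members = [p for p in proteins if p in profiles]
        if len(members) >= min_size:
            scores[term] = mean_pairwise_distance(members, profiles)
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical profiles (three gradient fractions) and annotations.
profiles = {
    "P1": [0.60, 0.30, 0.10],
    "P2": [0.55, 0.35, 0.10],
    "P3": [0.10, 0.30, 0.60],
    "P4": [0.20, 0.30, 0.50],
    "P5": [0.58, 0.32, 0.10],
    "P6": [0.12, 0.28, 0.60],
}
annotations = {
    "mitochondrion": ["P1", "P2", "P5"],
    "endoplasmic reticulum": ["P3", "P4", "P6"],
    "cytoplasm": ["P1", "P3", "P5", "P6"],  # broad term spanning both groups
    "peroxisome": ["P9"],                   # not measured, so skipped
}

ranking = rank_terms(annotations, profiles)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

In this toy example the broad term ranks last because its members are spread across the gradient, while specific compartment terms form tight clusters. No curated marker list is needed, which is the property the findings above emphasise.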


Title	GO reliability scores
Description	We estimated reliability scores for Gene Ontology terms using the approach described in Skunca & Dessimoz (PLOS Comp Biol 2013, DOI:10.1371/journal.pcbi.1002533), updated with 2014 data. This dataset was provided to our collaborators (BBSRC grant BB/L018497/1) and will form the basis of a new semi-supervised method for subcellular localisation of proteins from proteomics data.
Type Of Material	Database/Collection of data
Year Produced	2014
Provided To Others?	Yes
Impact	Although we have shared the dataset with our collaborators, it is as yet unpublished.


Title	pRoloc
Description	pRoloc is a complete infrastructure to support and guide the sound analysis of quantitative mass-spectrometry-based spatial proteomics data. It provides functionality for unsupervised and supervised machine learning for data exploration and protein classification and novelty detection to identify new putative sub-cellular clusters. The software builds upon existing infrastructure for data management and data processing.
Type Of Technology	Software
Year Produced	2012
Open Source License?	Yes
Impact	Dissemination of software, visualisation and analysis methods for spatial proteomics. The software and associated techniques have been applied by other groups, both in collaboration with the authors and independently.
URL	http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html


Title	pRolocGUI
Description	Interactive visualisation of organelle (spatial) proteomics data on the basis of pRoloc, pRolocdata and shiny.
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	Interactive interface used to disseminate the first large-scale, high-resolution stem cell localisation map (Christoforou et al., 2016 doi:10.1038/ncomms9992).
URL	http://bioconductor.org/packages/release/bioc/html/pRolocGUI.html


Title	pRolocdata
Description	Mass-spectrometry based spatial proteomics data sets from Dunkley et al. (2006), Foster et al. (2006), Tan et al. (2009), Hall et al. (2009), Trotter et al. (2010), Ferro et al. (2010), Nikolovski et al. (2012, 2014), Breckels et al. (2013), Groen et al. (2014) and Christoforou et al. (2015), and protein complex separation data from Kristensen et al. (2012), Havugimana et al. (2012), Kirkwood et al. (2013) and Fabre et al. (2015).
Type Of Technology	Software
Year Produced	2012
Open Source License?	Yes
Impact	Dissemination of various spatial proteomics data and data underlying the implementation of reproducible spatial proteomics data analysis.
URL	http://bioconductor.org/packages/release/data/experiment/html/pRolocdata.html
