An Integrated Open Source Software Resource for Quantitative Proteomics

Lead Research Organisation: European Bioinformatics Institute

Department Name: Proteomics Services Team

Abstract

In a scientific sense, a living system such as a plant, animal, organ or cell can be considered to be a complex machine. The basic components that make up this machine are molecules, of which there are several main types - genes, proteins and metabolites. To understand how these molecules work together to produce the complex living systems that we see around us we need to have analytical methods capable of detecting and quantifying these molecules. This proposal deals with one aspect of this analysis - proteomics - the science of identifying and quantifying proteins. The most popular approach in proteomics is to simplify a sample by separating all the proteins, digesting those proteins with an enzyme into much smaller components (peptides) and then analysing all these peptides with mass spectrometry (MS). Identification of proteins can then be carried out by computational analysis of the mass spectrum acquired from each peptide - peptides are usually mapped to proteins by comparison of observed spectra to those in a database. Protein quantity is typically calculated from mass spectral peak intensities, or by simply considering how many peptides have been observed from each protein. Within this general analytical schema there are a great many variations according to the laboratory that is doing the analysis, the samples being analysed, or the overall aim of the experiment. Factors that may differ between experimental protocols include the protein separation method (some people use gels, others liquid chromatography), different types of mass spectrometry, different search databases (some are simulated from protein sequences, others are libraries of experimentally acquired spectra), and different methods of quantitation (for instance there are various methods of labelling which are used to distinguish peptides from different samples during the analysis). This plethora of quantitative proteomic methods has two major disadvantages for proteomics practitioners. 
Firstly, it is a challenge to devise standard data formats for sharing proteomic data because there are so many experimental parameters to capture and different parameters are required for different protocols. Secondly, for each different protocol it can be necessary to perform a different computational analysis of the data - this has led to the development of many different software tools, particularly for quantitative proteomics in which each tool can be specific for a particular type of mass spectrometer, a particular type of labelling or a particular quantitation algorithm. The resulting array of incompatible software is bewildering to the typical proteomics practitioner, and because effort is spread across many tools there is limited resource to optimise the robustness and usability of each individual tool. In the work described in this proposal the four main centres of proteome informatics expertise in the UK aim to work together to develop an integrated suite of analysis and statistical processing tools for all popular variants of quantitative proteomics. The software will cover the whole range of quantitative proteomic data analysis, from extracting abundance data from the original MS spectra through to statistical analysis and deposition of results into the public proteomic data repository, PRIDE. A key component needed to get this working will be standard data formats to link each step of the data analysis. We will therefore be making a substantial contribution to the completion of the necessary quantitative data standards as part of this project. Overall, we aim to produce a robust, easy to use, standards-compliant software suite that will prove invaluable for proteomics practitioners seeking to analyse and share their quantitative proteomic data, regardless of the specific quantitative protocol they use.

Technical Summary

The aim of the project is for a consortium of proteome informatics experts from the EBI and the Universities of Manchester, Liverpool and Cranfield to deliver an open-source software workbench for quantitative proteomics which makes it simple for bench scientists to analyse their data using state of the art methods, submit to public repositories and re-analyse public data sets. We will work with the Proteomics Standards Initiative to define the new standard for quantitative data (mzQuantML) and provide on-going support for standards for mass spectra (mzML), transition design (TraML) and identifications (mzIdentML). These standards will underpin the software toolkit, which will be entirely independent of the analysis platform used. The software toolkit will have a common user interface, integrating a number of existing and new resources, developed at different sites. The software will provide high-quality identification data through the integration of algorithms previously developed for using multiple search engines, re-scoring identifications and inferring the presence or abundance of proteins, where there is ambiguity. We will incorporate standards-compliant quantitation tools based on Cranfield's existing X-Tracker platform and Manchester's SILACanalyser. We also commit to implement improved algorithms and new methods as they appear in the literature. We will integrate Cranfield's MRMaid transition design into PRIDE and provide support for TraML in MRMaid and X-Tracker. The software will incorporate a statistical analysis module, allowing the user to interpret the effects of using different methods and optimise the parameters of the algorithms. It will provide a simple mechanism for upload of a complete experiment to PRIDE, including spectra (mzML), identifications (mzIdentML) and quantitations (mzQuantML) within a wrapper capturing experimental metadata. The software will also provide support for reanalysis of reposited datasets in PRIDE.

Planned Impact

The major direct beneficiaries beyond academic researchers are pharmaceutical and biotechnology companies engaged in proteomic research, vendors of mass spectrometers, and companies involved in developing software for proteomic data analysis. As evidenced by our letters of support from industrial collaborators, there is considerable interest in proteomics. Many pharmaceutical companies have now outsourced their proteomic research or use data in the public domain in their analyses, not least from the various HUPO projects (as captured in PRIDE). The industry therefore stands to benefit significantly if we can bring about a major increase in the amount of quantitative data deposited in public databases, with sufficient metadata to draw conclusions about its validity and reliability. We also anticipate that this application will move the field of proteomic analysis forward, such that it is simple to analyse quantitative data with a variety of tools, regardless of the experimental method employed. This improved accessibility and increased confidence in obtained results may have the effect of rejuvenating proteomics within industrial settings, overcoming existing concerns about achieving reproducible results. We predict that some of this impact will be realised within the five-year duration of the project, helping to cement the UK's reputation as one of the world leaders in proteomics. In the longer term, findings emanating from proteomic analyses have great potential to improve health and quality of life, presenting economic opportunities across a broad range of sectors. In the health sector alone, quantitative proteomics can be used in discovery pipelines for new drugs and vaccines, or to find biomarkers of disease states that can be used as the basis of new diagnostic products. While nucleic acid-based techniques are currently preferred due to a perceived greater reproducibility and simpler analysis, the fact remains that proteins are the functional molecules in cells.
Indeed, some clinically relevant tissues are only really accessible via proteomics, such as plasma, where there is no RNA component to study, yet this remains a relatively easy sample to obtain and analyse. Studies of comparative genomics or transcriptomics will only ever be indirect indicators of cellular states or processes. Countless studies have demonstrated poor correlations between the level of cellular RNA and the corresponding abundance of protein, and many signalling events are controlled by post-translational modifications. There is a growing realisation that a system-wide approach to biological research is required to understand the complexities of living organisms, and quantitative proteomics is a key tool in the armoury of those wishing to follow this approach. Over the course of the project we expect to see this systems approach gain momentum in application areas such as pharmaceuticals, bioprocessing, plant science (including research intended to mitigate climate change) and food science. As detailed in the case for support, we have a comprehensive plan for communication and engagement to ensure that the work carried out has the maximum possible impact on the beneficiaries mentioned above. In addition to passive dissemination via the project web site we will also be presenting our software at as many conferences and seminars as possible. Furthermore, the proposed training courses will present a perfect opportunity to engage the user community, particularly as we plan to hold at least one alongside the EBI/BSPR proteomics conference, which is traditionally attended by many industry-based scientists and vendors of proteomic hardware and software. It should also be noted that all four members of the consortium are already working with industrial partners and therefore have a direct route via which to publicise the work to these organisations and garner feedback to ensure that the software suite produced has maximum relevance to their needs.

Funded Value:

£228,575

Funded Period:

Nov 10 - Oct 15

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/I000909/1

Principal Investigator:

Henning Hermjakob

Research Subject:

Omic sciences & technologies (51%)

Tools, technologies & methods (16%)

Research Topic:

Bioinformatics (16%)

Proteomics (51%)

Organisations

European Bioinformatics Institute (Lead Research Organisation)

People	ORCID iD
Henning Hermjakob (Principal Investigator)	http://orcid.org/0000-0001-8479-0262

Publications

Gonzalez-Galarza F (2012) A Critical Appraisal of Techniques, Software Packages, and Standards for Quantitative Proteomic Analysis. in OMICS: A Journal of Integrative Biology

Griss J (2014) The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience. in Molecular & cellular proteomics : MCP

Griss J (2016) Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. in Nature Methods

Jones AR (2012) The mzIdentML data standard for mass spectrometry-based proteomics results. in Molecular & cellular proteomics : MCP

Mayer G (2014) Controlled vocabularies and ontologies in proteomics: overview, principles and practice. in Biochimica et biophysica acta

Mayer G (2013) The HUPO proteomics standards initiative- mass spectrometry controlled vocabulary. in Database : the journal of biological databases and curation

Perez-Riverol Y (2016) PRIDE Inspector Toolsuite: Moving Toward a Universal Visualization Tool for Proteomics Data Standard Formats and Quality Assessment of ProteomeXchange Datasets. in Molecular & cellular proteomics : MCP

Perez-Riverol Y (2015) Making proteomics data accessible and reusable: current state of proteomics databases and repositories. in Proteomics

Perez-Riverol Y (2016) Ten Simple Rules for Taking Advantage of Git and GitHub. in PLoS computational biology

Perez-Riverol Y (2015) ms-data-core-api: an open-source, metadata-oriented library for computational proteomics. in Bioinformatics (Oxford, England)

Key Findings
Impact Summary
Research Databases and Models
Software and Technical Products
Engagement Activities


Description	We have developed international standard data formats for quantitative proteomics. In the context of this grant, the mzTab standard format was developed (PMID: 24980485, http://www.psidev.info/mztab). mzTab captures identification and quantification information and is currently supported by popular proteomics analysis tools such as MaxQuant, Mascot and OpenMS. Since the end of 2018, it has been possible to submit data to the world-leading PRIDE database in this format. Results are parsed and can be visualised in the PRIDE web interface and in the PRIDE Inspector stand-alone tool. mzTab is the first standard format for quantitative proteomics to have gained some adoption in the field, as demonstrated by implementations in both commercial and academic software.
Exploitation Route	Although the format was originally developed to also support metabolomics results, that support was very basic. The mzTab-M extension was therefore developed to support metabolomics results more appropriately (PMID: 30688441).
Sectors	Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology
URL	http://www.psidev.info/mztab


Description	As an off-shoot from this work, the University of Liverpool team have received further funding for the Proteolabels software, which will ultimately be a commercial product. As a key point, the mzTab data standard captures identification and quantification information and is currently supported by popular proteomics analysis tools such as MaxQuant, Mascot (a commercial tool from Matrix Science) and OpenMS. Since the end of 2018, it has been possible to submit data to the world-leading PRIDE database in this format. Results are parsed and can be visualised in the PRIDE web interface and in the PRIDE Inspector stand-alone tool. mzTab is therefore the first standard data format for quantitative proteomics to have gained some adoption in the field. An extended version for metabolomics data, mzTab-M, has also recently been developed, which will open the way for new implementations in both academic and commercial software.
First Year Of Impact	2013
Sector	Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Title	PRIDE
Description	The PRIDE database is the major global repository for mass spectrometry-based protein expression data.
Type Of Material	Database/Collection of data
Provided To Others?	Yes
Impact	In the context of the ProteomeXchange consortium, the stability of PRIDE and its good user support have contributed to a community-wide shift from "closed" to "open" data, and data deposition in PRIDE or one of its partner databases is now mandatory for publication in the major journal in the domain, MCP.
URL	http://www.ebi.ac.uk/pride/


Title	ProteoSuite
Description	One of the main objectives of the project was to deliver a simple user interface providing access to all the quantitative software in a single environment. This software, ProteoSuite (http://www.proteosuite.org/), was newly developed in this project, led by the Liverpool group.
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	This is a beta version, not yet a final release.


Title	jmzTab
Description	mzTab is the most recent standard format developed by the Proteomics Standards Initiative (PSI). mzTab is a flexible tab-delimited file that can capture identification and quantification results coming from mass spectrometry (MS)-based proteomics and metabolomics approaches. We here present an open-source Java Application Programming Interface (API) for mzTab called jmzTab. The software allows the efficient processing of mzTab files, providing read and write capabilities, and is designed to be embedded in other software packages. The second key feature of the jmzTab model is that it provides a flexible framework to maintain the logical integrity between the metadata and the table-based sections in the mzTab files. In this article, as two example implementations, we also describe two stand-alone tools that can be used to validate mzTab files and to convert PRIDE XML files to mzTab.
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	jmzTab is a library used in other tools such as PRIDE Inspector. It can also be used to parse and output mzTab files. mzTab is an open standard data format developed by the Proteomics Standards Initiative.
URL	https://github.com/PRIDE-Utilities/jmzTab
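
To make the "tab-delimited file with metadata and table-based sections" idea concrete, the following is a minimal Python sketch of how such a file can be split by its line prefixes. It is not jmzTab (which is a Java API) and does not reproduce its interface. The function name read_mztab and the sample content are hypothetical, and the sample is heavily trimmed, so it would not pass a real mzTab validator. The sketch only assumes the line prefixes used by mzTab: MTD for metadata, COM for comments, and header/row pairs such as PRH/PRT and PSH/PSM.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, heavily trimmed mzTab content for illustration only.
EXAMPLE_MZTAB = (
    "COM\tIllustrative file, not real data\n"
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tmzTab-mode\tSummary\n"
    "MTD\tdescription\tToy example\n"
    "PRH\taccession\tdescription\tprotein_abundance_assay[1]\n"
    "PRT\tP12345\tExample protein A\t1520.5\n"
    "PRT\tP67890\tExample protein B\tnull\n"
)

# Each header prefix announces the columns of the rows with the paired prefix.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(handle):
    """Split an mzTab stream into a metadata dict and per-section row lists."""
    metadata = {}
    headers = {}
    tables = defaultdict(list)
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields or fields[0] == "COM":
            continue  # skip blank lines and comments
        prefix, values = fields[0], fields[1:]
        if prefix == "MTD":
            metadata[values[0]] = values[1] if len(values) > 1 else ""
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = values
        elif prefix in headers:
            # Pair each cell with its column name from the section header.
            tables[prefix].append(dict(zip(headers[prefix], values)))
    return metadata, dict(tables)


metadata, tables = read_mztab(io.StringIO(EXAMPLE_MZTAB))
for row in tables["PRT"]:
    print(row["accession"], row["protein_abundance_assay[1]"])
```

Keeping the metadata and the tables in separate structures mirrors the split described above. A fuller implementation would also check that the columns in each table are consistent with what the metadata section declares, which is the kind of logical integrity the description attributes to the jmzTab model.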


Title	ms-data-core-api
Description	The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Program Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to identifications to quantitative results. The library contains readers for three of the most used Proteomics Standards Initiative standard file formats: mzML, mzIdentML, and mzTab. In addition to mzML, it also supports other common mass spectra formats: dta, ms2, mgf, pkl, apl (text-based), mzXML and mzData (XML-based). Also, it can be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, we present a set of algorithms and tools whose implementation illustrates the simplicity of developing applications using the library.
Type Of Technology	Software
Year Produced	2015
Open Source License?	Yes
Impact	The API is used in tools like PRIDE Inspector Toolsuite, and in the PRIDE internal submission pipeline. Other colleagues in the field are also using this library as a common data model.
URL	https://github.com/PRIDE-Utilities/ms-data-core-api
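
As an illustration of what reading one of the text-based spectrum formats listed above involves, here is a minimal Python sketch of an MGF reader. It is not part of ms-data-core-api (a Java library) and does not mirror its data model. The function name read_mgf and the sample spectrum are hypothetical. The sketch assumes only the common MGF layout: spectra delimited by BEGIN IONS and END IONS, KEY=value parameter lines, and whitespace-separated m/z and intensity pairs.

```python
import io

# Hypothetical single-spectrum MGF content for illustration only.
EXAMPLE_MGF = """\
BEGIN IONS
TITLE=toy spectrum 1
PEPMASS=500.25 1200.0
CHARGE=2+
100.1 10.0
200.2 20.5
END IONS
"""


def read_mgf(handle):
    """Yield one dict per spectrum with its parameters and (m/z, intensity) peaks."""
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=... or CHARGE=2+
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                parts = line.split()
                spectrum["peaks"].append((float(parts[0]), float(parts[1])))


spectra = list(read_mgf(io.StringIO(EXAMPLE_MGF)))
print(len(spectra), spectra[0]["params"]["TITLE"], spectra[0]["peaks"])
```

A library offering a common data model would typically wrap several such format-specific readers behind one interface, so that downstream tools see the same spectrum structure whichever file type was read.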


Description	Career Q&A
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	This career Q&A with year 10 students was carried out virtually for the local college, and it is hoped that it will encourage more students to think about entering not only science in general but also the field of bioinformatics.
Year(s) Of Engagement Activity	2020