PROCESS - Proteomics data Collection, Software and Standards to support open access and long term management of data

Lead Research Organisation: European Bioinformatics Institute
Department Name: Proteomics Services Team

Technical Summary

The PROCESS project will provide the framework needed for sharing, public data deposition and re-analysis of proteomics experimental data, based around mass spectrometry (MS), developed in the context of the Proteomics Standards Initiative (PSI), in which the applicants take a lead role. The framework comprises the evolution and maintenance of standard data formats (mzML, mzIdentML, mzQuantML and mzTab), controlled vocabularies (PSI-MS and PSI-MOD) and software tools (PRIDE Inspector, PSI Validator, and Java programming interfaces to the standards). The standards will be developed to cope with evolution in proteomics technology, including the capability to handle ambiguity in protein modification sites and protein grouping, data independent acquisition in MS and top-down proteomics. We will also develop international standards for compression of proteomics data sets to ensure that software performance and database architectures can scale up to the outputs of the newest instruments.

The PRIDE database at the EBI is the primary public database for experimental proteomics data. It has recently initiated a (potentially huge) raw data archive service for the community, in which the PSI standards play a central role. The PROCESS outputs will ensure that the wider research community will have long-term access to experimental proteomics data for re-use across a range of purposes. New modules will be created in the PRIDE Inspector software for data visualisation and analysis, and the further development of the programming interfaces will help bioinformatics developers to build tools for re-analysis of data sets.

Planned Impact

The direct beneficiaries include:

- Vendors of commercial software, including UK SMEs Matrix Science and Nonlinear Dynamics, will benefit (see Pathways to Impact).
- Vendors of instruments will benefit through increased compatibility of their raw data with a range of analysis software and easier deposition of data into PRIDE. Letters of support from Waters and AB Sciex demonstrate their commitment to PROCESS.
- Numerous pharmaceutical companies use mass spectrometry for analysis of proteins or metabolites. They will benefit through easier connectivity between software packages and more data in the public domain for re-analysis.
- Research councils and charities funding research will benefit through the potential for increased impact of the (proteomics) projects they fund, as public data deposition becomes straightforward and expected of all projects.

As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as PROCESS will help the field to become less fragmented and data analysis to become more straightforward. These benefits could be realised in any area of basic biology, biomedical or clinical science, for example leading to new drugs or biomarkers being discovered.

Staff employed will benefit from:
- Exposure to numerous international collaborations through the PSI (see letters of support)
- New collaborations with industry, particularly in relation to the shared development of software (see letters of support)

Publications

 
Description In the context of the ProteomeXchange consortium, the stability of PRIDE and its good user support have contributed to a community-wide shift from "closed" to "open" data. Data deposition in PRIDE or one of its partner databases is now mandatory for publication in many journals, for instance Molecular and Cellular Proteomics (MCP), journals from the Nature group, journals from the PLOS group and, from 2018, the Journal of Proteome Research (JPR).
Exploitation Route The developments of this grant have led to a rapid increase in data deposition in the PRIDE database, the UK repository of the international ProteomeXchange consortium. As of February 2019, the 2016 NAR publication for PRIDE has been cited >1800 times according to Google Scholar. In parallel, the 2013 NAR PRIDE publication has been cited >1400 times. Additionally, the main ProteomeXchange publication (Nature Biotechnology, 2014) has been cited >1600 times.
Sectors Agriculture, Food and Drink; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL http://www.proteomeXchange.org
 
Description Our group has a very strong track record of engagement with industry in many projects in proteome informatics. For example, proteomics open data standards produced by the PSI (Proteomics Standards Initiative), such as mzML and mzIdentML (support for mzTab is ongoing), have been implemented in commercial products from the leading mass spectrometry (MS) instrument manufacturers. These include Thermo Fisher Scientific, Waters (in the ProteinLynx Global Server, PLGS) and SCIEX (in ProteinPilot®), as well as the market-leading search engine Mascot (from Matrix Science) and other tools developed by SMEs (Small and Medium Enterprises), such as Peaks (Bioinformatics Solutions) and Scaffold (Proteome Software). Some of our previously developed APIs have also already been integrated into commercial software, e.g. the jmzIdentML library (https://github.com/PRIDE-Utilities/jmzIdentML) has been integrated into the Waters PLGS software.
First Year Of Impact 2015
Sector Pharmaceuticals and Medical Biotechnology
Impact Types Cultural

 
Title PRIDE 
Description The PRIDE database is the major repository for mass spectrometry-based protein expression data globally.
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact In the context of the ProteomeXchange consortium, the stability of PRIDE and its good user support have contributed to a community-wide shift from "closed" to "open" data, and data deposition in PRIDE or one of its partner databases is now mandatory for publication in the major journal in the domain, MCP.
URL http://www.ebi.ac.uk/pride/
 
Title PRIDE Inspector Toolsuite 
Description The original PRIDE Inspector tool was developed as an open source standalone tool to enable the visualization and validation of mass spectrometry (MS)-based proteomics data, either before data submission or once publicly available in the PRIDE (PRoteomics IDEntifications) database. The initial implementation of the tool focused on visualizing PRIDE data by supporting the PRIDE XML format and direct access to private (password protected) and public experiments in PRIDE. The ProteomeXchange (PX) Consortium has been set up to enable a better integration of existing public proteomics repositories, maximizing their benefit to the scientific community through the implementation of standard submission and dissemination pipelines. Within the Consortium, PRIDE is focused on supporting submissions of tandem MS data. The increasing use and popularity of the new PSI (Proteomics Standards Initiative) data standards such as mzIdentML and mzTab, and the diversity of workflows supported by the PX resources, prompted us to design and implement a new suite of algorithms and libraries that would build upon the success of the original PRIDE Inspector and would enable users to visualize and validate PX "complete" submissions. The PRIDE Inspector Toolsuite supports the handling and visualization of different experimental output files, ranging from spectra (mzML, mzXML and the most popular peak list formats) and peptide and protein identification results (mzIdentML, PRIDE XML, mzTab) to quantification data (mzTab, PRIDE XML), using a modular and extensible set of open-source, cross-platform libraries. We believe that the PRIDE Inspector Toolsuite represents a milestone in the visualization and quality assessment of proteomics data. It is freely available at http://github.com/PRIDE-Toolsuite/.
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact PRIDE Inspector is widely used by PRIDE users as a visualisation and basic analysis tool for proteomics open standard formats. 
URL http://github.com/PRIDE-Toolsuite/
 
Title jmzTab 
Description mzTab is the most recent standard format developed by the Proteomics Standards Initiative (PSI). mzTab is a flexible tab-delimited file that can capture identification and quantification results coming from mass spectrometry (MS)-based proteomics and metabolomics approaches. We here present an open-source Java Application Programming Interface (API) for mzTab called jmzTab. The software allows the efficient processing of mzTab files, providing read and write capabilities, and is designed to be embedded in other software packages. The second key feature of the jmzTab model is that it provides a flexible framework to maintain the logical integrity between the metadata and the table-based sections in the mzTab files. In this article, as two example implementations, we also describe two stand-alone tools that can be used to validate mzTab files and to convert PRIDE XML files to mzTab. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact jmzTab is a library used in other tools such as PRIDE Inspector. It can also be used to parse and output mzTab files. mzTab is an open standard data format developed by the Proteomics Standards Initiative.
URL https://github.com/PRIDE-Utilities/jmzTab
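
To illustrate the structure the jmzTab description refers to (a metadata section plus table-based sections in one tab-delimited file), here is a minimal Python sketch. It does not use jmzTab and does not reproduce its API; the function name `parse_mztab` and the toy input are assumptions made for illustration, and the input is a heavily reduced fragment rather than a complete, valid mzTab file. It relies only on the mzTab 1.0 convention that every line starts with a three-letter prefix (MTD for metadata, PRH/PRT for the protein header and rows, and so on).

```python
from collections import defaultdict

# Header prefix -> row prefix for the table-based sections (mzTab 1.0 line prefixes).
TABLE_SECTIONS = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def parse_mztab(text):
    """Split mzTab-style text into a metadata dict and per-section row lists.

    Hypothetical helper for illustration only; it performs no validation
    beyond checking that table rows follow their header line.
    """
    metadata = {}
    headers = {}  # row prefix -> list of column names
    rows = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "COM":  # comment lines carry no data
            continue
        if prefix == "MTD":
            metadata[fields[1]] = fields[2]
        elif prefix in TABLE_SECTIONS:
            headers[TABLE_SECTIONS[prefix]] = fields[1:]
        elif prefix in headers:
            rows[prefix].append(dict(zip(headers[prefix], fields[1:])))
        else:
            raise ValueError(f"unexpected line prefix: {prefix!r}")
    return metadata, dict(rows)


# Toy input: two metadata lines and a three-column protein table (assumed subset).
EXAMPLE = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tToy example",
    "PRH\taccession\tdescription\ttaxid",
    "PRT\tP12345\tExample protein\t9606",
    "PRT\tQ67890\tAnother protein\t9606",
])

metadata, tables = parse_mztab(EXAMPLE)
print(metadata["mzTab-version"])
print([row["accession"] for row in tables["PRT"]])
```

A real consumer would additionally check that table columns agree with what the metadata declares, which is the kind of logical integrity between sections that the description attributes to the jmzTab model.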
 
Title ms-data-core-api 
Description The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Program Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to identifications to quantitative results. The library contains readers for three of the most used Proteomics Standards Initiative standard file formats: mzML, mzIdentML, and mzTab. In addition to mzML, it also supports other common mass spectra formats: dta, ms2, mgf, pkl, apl (text-based), mzXML and mzData (XML-based). Also, it can be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, we present a set of algorithms and tools whose implementation illustrates the simplicity of developing applications using the library. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact The API is used in tools like PRIDE Inspector Toolsuite, and in the PRIDE internal submission pipeline. Other colleagues in the field are also using this library as a common data model. 
URL https://github.com/PRIDE-Utilities/ms-data-core-api
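
As a rough illustration of what reading a text-based spectrum format into a common data model involves, the following Python sketch parses MGF-style text into a simple spectrum record. It is not the ms-data-core-api (which is written in Java) and does not mirror its classes; `Spectrum` and `read_mgf` are hypothetical names, and only the TITLE and PEPMASS fields are interpreted.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    """Hypothetical minimal spectrum record; not the library's actual model."""
    title: str = ""
    precursor_mz: float = 0.0
    peaks: list = field(default_factory=list)  # (m/z, intensity) pairs


def read_mgf(text):
    """Parse MGF-style text into a list of Spectrum objects."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            current = Spectrum()
        elif current is None or not line:
            continue  # skip blank lines and anything outside an ion block
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif "=" in line:
            key, value = line.split("=", 1)
            if key == "TITLE":
                current.title = value
            elif key == "PEPMASS":
                # PEPMASS may carry an optional intensity after the m/z value
                current.precursor_mz = float(value.split()[0])
            # other KEY=VALUE fields (e.g. CHARGE) are ignored in this sketch
        else:
            mz, intensity = line.split()[:2]
            current.peaks.append((float(mz), float(intensity)))
    return spectra


# Toy input with a single spectrum and two peaks.
EXAMPLE = """BEGIN IONS
TITLE=toy spectrum
PEPMASS=500.25 1000.0
CHARGE=2+
100.0 10.0
200.5 20.0
END IONS
"""

spectra = read_mgf(EXAMPLE)
print(len(spectra), spectra[0].precursor_mz, spectra[0].peaks)
```

Readers for other peak-list formats could return the same `Spectrum` type, which is the general idea behind placing several file formats behind one data model.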
 
Description Career Q&A 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact This career Q&A with year 10 students was carried out virtually for the local college, and it is hoped that it will encourage more students to think about entering not only science in general but also the field of bioinformatics.
Year(s) Of Engagement Activity 2020
 
Description DNA workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact An introduction to science for primary school children on the topic of DNA.
Year(s) Of Engagement Activity 2020
 
Description Great Abington KS2 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact A "meet the experts" session and open lab tour for Great Abington KS2 school.
Year(s) Of Engagement Activity 2020