PROCESS - Proteomics data Collection, Software and Standards to support open access and long term management of data

Lead Research Organisation: University of Liverpool
Department Name: Institute of Integrative Biology

Abstract

Proteomics is the science of studying large numbers of proteins, the key molecules that perform the functional roles in cells. It is the natural partner to genomics, the study of the genes that encode those proteins. Proteomic studies are performed in laboratories all over the world, investigating disease processes as well as the basic function of cells in humans, animals, plants and microorganisms. Proteins are challenging molecules to work with, but mass spectrometry (MS) technology has developed over many years to the point where it is now possible to identify and quantify many hundreds, or even thousands, of proteins simultaneously in one type of sample compared with another. For example, we can test how a cell responds during a disease process compared with a healthy cell, allowing us to begin understanding the complex and dynamic molecular changes involved.

MS can produce very large raw data sets, running to many gigabytes for a single sample analysed. The raw files are processed, often in two stages by different software packages that first identify and then quantify the proteins that were analysed by the instrument. In the past, the raw data files were encoded in a data format specific to each instrument vendor, effectively tying scientists to the software provided with the instrument, which has not always been the optimal solution for analysing the data. A global consortium of academic and industrial researchers, called the Proteomics Standards Initiative (PSI), has collaborated to agree open access standards for storing raw data, protein identification data and quantitative data. These standards mean that open-source (and free) software can now be developed, capable of analysing data arising from any type of instrument. They also mean that data sets generated at high cost can be deposited in a public repository, such as the PRIDE database, hosted at the European Bioinformatics Institute. Deposited data can then be re-used for integration with, and interpretation of, data from other studies, improving our knowledge about genomes and biological systems, and improving software tools in this field.

In this project, we are requesting support so that the PSI standard data formats can continue to be maintained and evolve as new proteomics techniques are described in the literature. We are also developing interfaces so that other groups can develop new software packages easily, using the PSI standards as inputs and outputs. The standards are being used as part of a recently released raw data archive within PRIDE, which will store very large amounts of data for the entire scientific community. As such, we are working on software to make it straightforward for research labs to deposit and visualise data in PRIDE, as well as optimising the way in which data is compressed and stored, so that the system can scale for the needs of the next generation of instruments. These developments of PSI standards, software and PRIDE are essential for making sure that proteomics data are open access for all researchers and not restricted to the small number of laboratories with specialised, expensive software.

Technical Summary

The PROCESS project will provide the framework needed for sharing, public deposition and re-analysis of experimental proteomics data based on mass spectrometry (MS). The framework will be developed in the context of the Proteomics Standards Initiative (PSI), in which the applicants take a lead role. It comprises the evolution and maintenance of standard data formats (mzML, mzIdentML, mzQuantML and mzTab), controlled vocabularies (PSI-MS and PSI-MOD) and software tools (PRIDE Inspector, PSI Validator, and Java programming interfaces to the standards). The standards will be developed to cope with evolution in proteomics technology, including the capability to handle ambiguity in protein modification sites and protein grouping, data-independent acquisition in MS and top-down proteomics. We will also develop international standards for compression of proteomics data sets to ensure that software performance and database architectures can scale up to the outputs of the newest instruments.

The PRIDE database at the EBI is the primary public database for experimental proteomics data. It has recently initiated a (potentially huge) raw data archive service for the community, in which the PSI standards play a central role. The PROCESS outputs will ensure that the wider research community has long-term access to experimental proteomics data for re-use across a range of purposes. New modules will be created in the PRIDE Inspector software for data visualisation and analysis, and the further development of the programming interfaces will help bioinformatics developers to build tools for re-analysis of data sets.

Planned Impact

The direct beneficiaries include:

- Vendors of commercial software, including UK SMEs Matrix Science and Nonlinear Dynamics, will benefit (see Pathways to Impact).
- Vendors of instruments will benefit, through increased compatibility of their raw data with a range of analysis software and easier deposition of data into PRIDE. Letters of support from Waters and AB Sciex demonstrate their commitment to PROCESS.
- Numerous pharmaceutical companies use mass spectrometry for analysis of proteins or metabolites. They will benefit through easier connectivity between software packages and more data in the public domain for re-analysis.
- Research councils and charities funding research will benefit through the potential for increased impact of the proteomics (and metabolomics) projects they fund, as public data deposition becomes straightforward and expected of all projects.

As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as PROCESS will help the field to become less fragmented and data analysis to become more straightforward. These benefits could be realised in any area of basic biology, biomedical or clinical science, for example leading to new drugs or biomarkers being discovered.

Staff employed will benefit from:
- Exposure to numerous international collaborations, through the PSI (see letters of support).
- New collaborations with industry, particularly in relation to the shared development of software (see letters of support).

Publications

Description The Life Sciences have been transformed by the emergence of high-throughput "omics" techniques, including proteomics - global studies of protein expression using mass spectrometry (MS). We have developed and led the Proteomics Standards Initiative (PSI), an international academic-industry partnership supporting open data, for more than 15 years. When PSI began, there was almost no sharing of proteomics data. Through our sustained efforts, a substantial proportion of published studies are now accompanied by completely open data, in freely available databases. This transformation encourages high-quality reproducible science and allows new discoveries and impact, through large-scale re-analysis of data. This grant funded the development of standards (see publications list) and software to support data sharing and re-analysis, and funded the PSI annual workshops for three years, at which the proteomics community came together to work on standards and software.
Exploitation Route Standards and associated software are open source.
Sectors Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://www.psidev.info/
 
Description The standards have been implemented by major corporations and SMEs, enabling easier communication between different platforms. One of the main outcomes of PROCESS was the improvement and stabilisation of the mzIdentML standard, and a better software ecosystem. mzIdentML is now implemented by all of the following vendors: Thermo, SCIEX, Waters, Matrix Science, Bruker, Proteome Software, Bioinformatics Inc. The software ecosystem makes it easier for data to be re-used from public databases in the ProteomeXchange consortium, leading to economic benefits (through data re-use) and indirect benefits arising from results (potentially in biomedical research, healthcare and related areas).
First Year Of Impact 2014
Sector Healthcare, Manufacturing, including Industrial Biotechnology
Impact Types Economic

 
Title mzIdentML 1.2 
Description Updates to the mzIdentML data standard for proteomics in mzIdentML 1.2 
Type Of Material Computer model/algorithm 
Year Produced 2017 
Provided To Others? Yes  
Impact The standard is exported from commercial and free software, and read by the major databases in the field. 
URL https://github.com/HUPO-PSI/mzIdentML
 
Title proBED data standard 
Description Data standard for displaying proteomics data on genomes 
Type Of Material Computer model/algorithm 
Year Produced 2017 
Provided To Others? Yes  
Impact proBED allows proteomics data to be displayed on genome browsers, thus connecting up two major types of public data in omics research. 
URL http://www.psidev.info/probed
 
Title mzidLibrary 
Description Library of analysis routines for mzIdentML data standard used in proteomics 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact Facilitates open source development in proteomics 
URL https://code.google.com/p/mzidentml-lib/
 
Title mzqLibrary 
Description Open source pipeline for processing quantitative proteomics data 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact The software was the first to implement the PSI mzQuantML standard as a reference architecture for processing standard compliant data. 
URL https://github.com/PGB-LIV/mzqLibrary