PRIDE Converter - Efficient Database Deposition of Mass Spectrometry Data

Lead Research Organisation: European Bioinformatics Institute
Department Name: Proteomics Services Team

Abstract

Public availability of biological data has been of paramount importance in the rapid development of molecular biology. However, the amount of proteomics data in public repositories is still low in comparison with other disciplines such as genomics and transcriptomics. One of the most prominent public repositories for proteomics data is the PRoteomics IDEntifications database (PRIDE, http://www.ebi.ac.uk/pride) at the EBI in Cambridge. Proteomics journals are increasingly mandating deposition of mass spectrometry (MS) data in public repositories in general, and in PRIDE in particular, to support the publication of related manuscripts. At the same time, funding agencies (such as BBSRC and the Wellcome Trust in the UK) are clearly supporting this trend as a way to maximize the value of the funds provided. In practical terms, however, this data-sharing policy cannot succeed unless the research community is provided with reliable and user-friendly submission tools that can efficiently capture the technical and biological metadata. The submission tool PRIDE Converter (http://code.google.com/p/pride-converter) was developed with that idea in mind. It is open source, platform-independent software, and a large part of its success can be attributed to its easy-to-use graphical user interface (GUI). PRIDE Converter is currently by far the most comprehensive and popular tool of this kind, since it has made the submission of MS data a much easier and more straightforward process. It has been the key factor in the huge growth of data content in PRIDE over the last two years and has become the de facto PRIDE submission tool for most researchers. From Jan 2009 to Sept 2010, PRIDE received 243 data depositions, comprising more than 63.6 million mass spectra, through PRIDE Converter.
The redevelopment proposed here is based on user feedback gathered by PRIDE curators in direct exchange with data depositors, on discussions with journal editors, and on the recent development of community standards for mass spectrometry (MS). Beyond the technical objectives a) and b) (see 'Technical Summary'), we urgently need to implement support for current community data standards. mzML, for the representation of mass spectra, has recently been published and supersedes mzData, which is currently supported by PRIDE Converter. mzIdentML, for the representation of protein and peptide identifications, has recently been released and is already supported by Mascot 2.3 and other tools. Key objective c) of this proposal is to implement full mzML/mzIdentML support in PRIDE Converter. In addition to standards for data representation, the HUPO Proteomics Standards Initiative has published a series of 'Minimum Requirements' documents, describing the metadata items which should be reported for proteomics experiments. Currently, adherence to these Minimum Requirements documents is not validated by PRIDE Converter. Key objective d) is to implement such validation, and also to make adherence to these requirements efficient by providing a template mechanism for repetitive submission processes. This will make reuse of the data more feasible and will allow more reliable global re-analysis of data (meta-analysis studies). In its current form, PRIDE Converter provides only rudimentary support for quantitative MS technologies, which are quickly becoming the standard proteomics approach. Key objective e) is to implement lightweight PRIDE Converter support for quantitative proteomics tools. The final objective f) is the standardisation of protein inference. Currently, proteomics search tools usually select one of a range of equivalent protein choices for a given peptide set. We will standardise this process as much as possible between different search tools, to ensure optimal comparability of proteomics data.
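
To illustrate the protein inference issue behind objective f), the following is a minimal sketch in Java, not the method used by PRIDE Converter or by any search tool. It groups protein accessions that are supported by exactly the same peptide set and reports each group under one representative. The class name, the method name and the tie-break rule (alphabetically first accession) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of PRIDE Converter.
class ProteinGroupSketch {

    /**
     * Groups accessions whose peptide sets are identical.
     * Returns representative accession -> all accessions in that group (sorted).
     * Assumed rule: the representative is the alphabetically first accession.
     */
    static Map<String, List<String>> groupEquivalent(Map<String, Set<String>> peptidesByProtein) {
        // Key: a peptide set; value: every accession explained by exactly that set.
        Map<Set<String>, SortedSet<String>> bySet = new HashMap<Set<String>, SortedSet<String>>();
        for (Map.Entry<String, Set<String>> e : peptidesByProtein.entrySet()) {
            Set<String> key = new TreeSet<String>(e.getValue()); // defensive copy
            SortedSet<String> group = bySet.get(key);
            if (group == null) {
                group = new TreeSet<String>();
                bySet.put(key, group);
            }
            group.add(e.getKey());
        }
        // TreeMap keeps the output order independent of input order.
        Map<String, List<String>> result = new TreeMap<String, List<String>>();
        for (SortedSet<String> group : bySet.values()) {
            result.put(group.first(), new ArrayList<String>(group));
        }
        return result;
    }
}
```

Because the representative depends only on the group members and not on the order in which a tool happened to list them, two tools reporting the same equivalent proteins would end up with the same choice under this assumed rule.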

Technical Summary

While the extension of standards support in PRIDE Converter requires significant new development, the key technical challenge is the support for large submissions, in terms of both the number and the size of processed files. Currently, each individual input file needs to be processed interactively, an unrealistic demand for submissions that often comprise dozens or hundreds of files. Objective a) is to modularise PRIDE Converter to allow usage in batch mode, automating multi-file processing based on templates. Another technical challenge is the size of individual files, which can often be several gigabytes. Currently, PRIDE Converter uses an in-memory data model (DOM), requiring all data to be held in the main memory of the computer. The overall memory consumption is 2-3 times the size of the PRIDE XML output file. Thus, standard desktops with 4 GB of main memory are often insufficient to process large files. We quite frequently have to do custom conversions for data depositors on large-memory EBI machines, a process which is clearly not scalable. Objective b) is to redevelop the PRIDE Converter memory management to overcome this limitation. PRIDE Converter has so far had more than 150 minor updates, usually in response to user feedback (http://code.google.com/p/pride-converter/updates/). However, two years after the original release, the tool has become difficult to maintain, and urgently needed major updates are impossible without a major redevelopment. While the basic technologies (Java, XML) will remain unchanged, the redevelopment will modularise the source code, implementing a strict MVC (model-view-controller) concept and allowing reuse of components for both the interactive mode and the new batch processing mode. As with the current PRIDE Converter, the project will be completely open source, allowing users to freely adapt the modular system to their particular needs.
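
The memory limitation behind objective b) can be illustrated with a small sketch. At 2-3 times the output size, a 2 GB PRIDE XML file would need roughly 4-6 GB with the in-memory model, which already exceeds a 4 GB desktop. A streaming parser, by contrast, visits one XML event at a time, so memory use stays roughly constant regardless of file size. The example below uses the standard Java StAX API (javax.xml.stream). It is only an illustration of the general technique, not the redevelopment's actual design; the class name, the method name and the element name used in the usage note are assumptions.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of PRIDE Converter.
class StreamingElementCounter {

    /** Counts start tags with the given local name without building a document tree. */
    static int countElements(Reader xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            int count = 0;
            try {
                // Pull events one by one; nothing already read is retained.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalArgumentException("Malformed XML input", e);
        }
    }
}
```

Passing a java.io.FileReader for a large file and an element name such as "spectrum" (assumed here) would count the spectra in a single pass. A real converter would process each element as it is read instead of merely counting it, but the event loop would have the same shape.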

Planned Impact

The proposed work directly supports BBSRC's data sharing policy (http://www.bbsrc.ac.uk/web/FILES/Policies/data_sharing_policy.pdf) by reducing the 'activation energy' needed to start a virtuous circle of lower resistance to data deposition, more public data availability, more re-use of public data, and resulting awareness of the benefits of publicly available data, not only for the community, but also for the data producer through improved visibility and citations. As described in the previous section, the primary beneficiaries are academic proteomics researchers, both as data producers and as data consumers. However, the pharmaceutical industry, and life-science-based industry more generally, also stands to benefit from a larger volume of more useful proteomics data. As PRIDE follows a strict open data policy, no IP restrictions will limit the usefulness of the data to the commercial sector. In recent years, the pharmaceutical industry in particular has also become much more open to collaboration and even to public data release in areas considered precompetitive. In some instances, public data release is hampered more by the time and effort required than by IP considerations. Thus, reducing the effort required for public data release might even be a step towards tapping the vast treasure of currently private data generated, often in high quality, by the commercial sector, and releasing it into the public domain.
 
Description The improvements to the PRIDE data submission tool have contributed to a gradual cultural change in the domain of mass spectrometry based proteomics; submission of proteomics data to a public repository is now an almost normal part of manuscript submission.
Exploitation Route The PRIDE Converter tool is part of the PRIDE production process, and has been under continuous development since the end of the grant.
Sectors Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://www.ebi.ac.uk/pride
 
Description The improvements to the PRIDE data submission tool have contributed to a gradual cultural change in the domain of mass spectrometry based proteomics; submission of proteomics data to a public repository is now an almost normal part of manuscript submission. As a result, re-use of proteomics data and citations of the PRIDE database have risen dramatically. As of March 2017, the 2016 PRIDE NAR article has 322 Google Scholar citations.
First Year Of Impact 2015
Sector Education, Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Policy & public services

 
Title PRIDE Converter 2 
Description PRIDE Converter 2 enables the conversion from a number of proteomics output formats into PRIDE XML 
Type Of Technology Software 
Year Produced 2012 
Open Source License? Yes  
Impact The number of submissions to the PRIDE database increased. In addition, the user experience was much better than with the original PRIDE Converter tool. PRIDE Converter 2 is now (11/2014) used for the vast majority of data depositions to the PRIDE database, >700 in 2014.
URL https://code.google.com/p/pride-converter-2/
 
Description Career Q&A 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact This career Q&A with year 10 students was carried out virtually for the local college, and it is hoped that it will encourage more students to think about entering not only science but also the field of bioinformatics.
Year(s) Of Engagement Activity 2020
 
Description DNA workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact An introduction to science for primary school children on the topic of DNA.
Year(s) Of Engagement Activity 2020
 
Description Great Abington KS2 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact A 'meet the experts' session and open lab tour for Great Abington KS2 school.
Year(s) Of Engagement Activity 2020