ProteomeHarvest - Excel/XML Bridge for User-friendly Proteomics Data Collection

Lead Research Organisation: European Bioinformatics Institute
Department Name: Proteomics Services Team

Abstract

Today, scientific experiments in molecular biology in general, and in proteomics in particular, are often done on a large scale, producing large numbers of individual data items. These large data sets are then the basis of scientific publications. Often only relatively few results actually contribute to the final conclusions reached by the researcher, but the complete result sets can provide valuable knowledge to other researchers comparing them to their own results. However, to allow others to understand how the experiments were done, they need to be described in a very detailed manner. To avoid 'comparing apples and pears', this description needs to be done in a systematic manner, using established rules or standards on how to describe experiments. In addition, the data needs to be easily accessible to other researchers, which can best be achieved by entering it into large databases accessible over the internet. Overall, a lot of effort is needed to describe a large experiment in the detailed, standardised manner which allows others to understand it. In other projects, we are working on setting up common rules for the description of proteomics experiments. However, even the best rules are useless if they are not applied. As scientists, like everybody else, tend to do only the minimum amount of work needed to achieve their goals, their experiment descriptions are often incomplete, focussing only on the aspects they consider relevant. And of course they quickly tire of properly entering the data into databases if they have to use complicated, unfamiliar tools that they must install on their computer. On the other hand, there are programs they know well, because they use them almost every day anyway to manage their data. The main purpose of this proposal is to use one such tool, Microsoft Excel, to develop forms which allow scientists to enter their results into a database as easily as possible.
Biologists are used to Excel: they are familiar with its functionality, and they nearly always have it installed on their computer anyway. We plan to develop Excel forms which are as user-friendly as possible, but still capture all the data necessary to appropriately describe the results of a large experiment, according to established rules and standards. While Excel is often used to store experiment results, this is usually done in a very unsystematic manner, and it is typically very difficult to transfer the data into XML, a file format which is nowadays practically the standard way of entering data into databases. Also, so far it has been difficult to use and regularly update controlled vocabularies in Excel. Controlled vocabularies are lists of permitted words which can be entered in a specific field in a form, to avoid typing errors and to ensure everybody uses the same word for the same thing. In this project, we propose to develop advanced Excel forms for proteomics data harvesting. These forms should provide researchers with an easy tool to store their data in a systematic manner, ready for sending to a database. The forms will be able to communicate with a database on the internet to obtain up-to-date controlled vocabularies, and they will be able to send the data directly, in the form of XML, to a database on the internet. We will develop and test these forms for the existing PRIDE proteomics database, making use of the existing database for data storage, and using OLS, the ontology lookup service developed as part of PRIDE, to keep the controlled vocabularies in the Excel forms up to date. By providing Excel forms as a user-friendly way to store proteomics data and send it to public databases, we hope to convince researchers to invest a little extra effort to make their valuable data accessible to their colleagues, and thus to maximise the use of data that has already been paid for by the taxpayer.
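To illustrate how a controlled vocabulary constrains a form field, the following is a minimal sketch in Python rather than the project's own Excel implementation. The vocabulary contents and the names ALLOWED_SPECIES and check_term are hypothetical; in the proposed forms the permitted terms would be obtained from OLS instead of being hard-coded.

```python
# Hypothetical controlled vocabulary for one form field.
# Assumption: in the proposed forms these terms would be fetched from OLS.
ALLOWED_SPECIES = {"Homo sapiens", "Mus musculus", "Saccharomyces cerevisiae"}


def check_term(value, vocabulary):
    """Return the canonical vocabulary term matching `value`, or None.

    Matching ignores case and surrounding whitespace, so minor differences
    in typing still map to the single agreed spelling. Misspelt or unknown
    words are rejected rather than guessed.
    """
    normalised = value.strip().casefold()
    for term in vocabulary:
        if term.casefold() == normalised:
            return term
    return None
```

An entry such as "  homo sapiens " would be mapped to the agreed term "Homo sapiens", while a misspelling such as "Homo sapien" would be rejected, which is the behaviour a drop-down list in a form enforces implicitly.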

Technical Summary

In the BBSRC-funded ISPIDER project, we are developing the Proteomics Identification Database (PRIDE), which as of January 2006 contains more than 180,000 protein identifications based on more than 500,000 peptide identifications, and fully implements the HUPO PSI standards. However, in common with other proteomics databases, PRIDE only captures a fraction of the proteomics data generated with public funds. A major obstacle to a more comprehensive capture of proteomics data is the extra effort required of authors to submit their data to public data repositories. While XML is today the de facto standard for systematic data communication, there are still few user-friendly tools available for XML data management. Microsoft Excel is still the standard 'data management system' for the vast majority of scientists in molecular biology. In the HUPO Plasma Proteome Project, we participated in the central collection of proteomics data in a multinational, collaborative project (Adamski et al., 2005). Participants were invited to submit their results either in a collection of Excel spreadsheets, or as XML, supported by a customised version of the PEDRO XML tool (Garwood et al., 2004). Only three out of 18 laboratories submitted their data in XML; 15 chose the Excel spreadsheets, in spite of specific encouragement to use the XML schema. The aim of the ProteomeHarvest project is to develop an advanced Excel spreadsheet for user-friendly data submission to the PRIDE database. This spreadsheet will combine the user-friendliness and familiarity of Excel with modern, XML-based technologies for well-structured data entry. The form will automatically generate well-formed XML suitable for direct submission to the PRIDE database. By developing a maximally user-friendly data submission tool for PRIDE, we hope to increase the quality and quantity of proteomics data deposited in public databases, and thus maximise the use of valuable data generated by the proteomics community.
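The step from spreadsheet rows to XML can be sketched as follows, using only the Python standard library. This is an illustration of the general idea, not the project's implementation: the element names ExperimentCollection and Identification, the column names, and the function rows_to_xml are hypothetical and do not reproduce the actual PRIDE XML schema. The sketch also assumes that every column header is a valid XML element name.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Serialise spreadsheet-like rows (a list of dicts) to an XML string.

    Each row becomes one <Identification> element and each column becomes
    a child element named after the column header. Element names here are
    placeholders, not the PRIDE schema.
    """
    root = ET.Element("ExperimentCollection")
    for row in rows:
        ident = ET.SubElement(root, "Identification")
        for column, value in row.items():
            child = ET.SubElement(ident, column)
            # ElementTree escapes characters such as & and < on output.
            child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example: two rows as they might be read from a worksheet.
example_rows = [
    {"Accession": "P12345", "Score": 42},
    {"Accession": "Q67890", "Score": 17},
]
example_xml = rows_to_xml(example_rows)
```

Building the document through an XML library rather than by string concatenation means that special characters in cell values are escaped automatically, so the output stays well-formed whatever the user types.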

Publications
