Standards-compliant software tools for curation and public deposition of proteomics data

Lead Research Organisation: University of Liverpool
Department Name: Veterinary Preclinical Science

Abstract

We are now in the 'post-genomic' era of biological research, since the complete genetic makeup of many organisms is now known. The genome sequence essentially provides us with the parts list of the complete organism and the blueprint describing how the organism functions. The genome sequence is only a static representation, though: it is the same in all cells of an organism and does not change during the organism's lifetime. All cellular functions are provided by proteins, which are each encoded by a gene within the genome. The set of proteins expressed in each cell changes dramatically as the organism develops and as it is placed under different environmental conditions, such as external stresses on cells. Researchers are interested in studying the complete set of proteins in a sample, which is called the proteome. Many research laboratories around the world are now performing these proteomics studies, using mass spectrometry to identify large sets of proteins concurrently. Proteomics studies may be used simply to identify the proteins present in the samples of interest, or to compare the proteins present in one condition with those present in another. These techniques can be used to find the proteins that are switched on or off, for example during a disease process or following infection with a parasitic organism, helping us to understand the molecular basis of the response. These studies can generate vast data sets, which are produced first by the mass spectrometer and then by the software package chosen for data analysis. It can be difficult for researchers to access and analyse data produced in other laboratories, unless they use the same software package in their own laboratory, because each piece of software generally only works with its own data format. A worldwide consortium has formed to standardise how data from these studies are reported, and in this context, my group has led an initiative to create a single data format for storing the results of proteomics studies.
We have recently released a standard format for protein identifications made from mass spectrometry data, called mzIdentML. In this application, we are aiming to build software that makes it easy for developers around the world to build new applications, or alter existing ones, so that they can read and write data in this format. This will mean that data produced in one laboratory could be analysed using a variety of software packages, not just the software in use in the source laboratory. There is also a growing movement for researchers to store their data in public databases when they publish a study. This will allow other researchers to download the data to check their findings and to integrate the published data with their own results. These processes will be greatly simplified by having a standard format that is accepted worldwide. In this application, we are developing a set of software tools for performing standard tasks on identification data, for example performing statistical analysis to determine the significance of particular results. These tools are open-source, as we intend other software developers, working both in academia and industry, to re-use them in their applications. We are also developing a graphical viewer so that scientists can visualise their data in different ways to help understand the results contained in the large data sets. The viewer will be integrated into an application produced by one of the main public databases, making it simple for scientists to analyse their data and then upload it directly to the public database when they publish a journal article. These developments will help scientists working in proteomics to share their data. In turn, this will help proteomics databases to grow rapidly, which will benefit all molecular biology researchers as they will have access to huge protein data sets for data mining, allowing new biological discoveries to be made.

Technical Summary

In proteomics research, there is a growing consensus that the scientific community would benefit significantly from open access to all data sets on which published results are based. Most journals now recommend that data sets are deposited in public databases, prior to publication. However, it is still relatively challenging for laboratory scientists to comply with data sharing initiatives, due to the fragmentation of data formats and analysis tools used in different laboratories. In response, the Proteomics Standards Initiative has recently released a standard format, mzIdentML, to store peptides and proteins identified from mass spectrometry data. The format has been developed in a consortium comprising the main software vendors and producers of open-source software and, as such, represents the first community agreement on a unifying data standard. In this application, we will develop an open-source programming interface to mzIdentML to enable bioinformatics developers to incorporate mzIdentML support in their software tools. The programming interface will comprise re-usable libraries of common routines performed on proteomic identification data, including setting identification thresholds, converting peptide evidence into protein evidence, calculating global statistics and combining or comparing files. We will also produce a free viewer for mzIdentML that enables bench scientists to visualise data and call any of the library functions for curating their results. The viewer will also help scientists to submit their data directly to public proteomics databases. These developments will greatly simplify proteomic data sharing, and will lead to a large increase in the number of publicly available data sets.
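As an illustration of the kind of library routines listed above (setting identification thresholds, converting peptide evidence into protein evidence, and calculating global statistics), the following is a minimal Python sketch. It does not use the mzIdentML schema or the project's programming interface: the `PeptideMatch` record, the function names, the protein accessions and the scores are all hypothetical, and the false discovery rate is estimated with a simple target-decoy ratio chosen here only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideMatch:
    """One peptide-spectrum match (hypothetical record, not the mzIdentML schema)."""
    spectrum_id: str
    peptide: str
    protein: str      # hypothetical protein accession
    score: float      # in this sketch, a higher score means a better match
    is_decoy: bool    # True if the match is to a decoy sequence


def apply_threshold(matches, min_score):
    """Keep only the matches scoring at or above min_score."""
    return [m for m in matches if m.score >= min_score]


def estimate_fdr(matches):
    """Global statistic: decoy count divided by target count (target-decoy estimate)."""
    decoys = sum(1 for m in matches if m.is_decoy)
    targets = len(matches) - decoys
    return decoys / targets if targets else 0.0


def peptides_to_proteins(matches):
    """Group the distinct target peptides by the protein they support."""
    proteins = defaultdict(set)
    for m in matches:
        if not m.is_decoy:
            proteins[m.protein].add(m.peptide)
    return dict(proteins)


# Made-up example data, for illustration only.
matches = [
    PeptideMatch("s1", "LVNELTEFAK", "PROT_A", 62.0, False),
    PeptideMatch("s2", "YLYEIAR", "PROT_A", 48.5, False),
    PeptideMatch("s3", "AEFVEVTK", "PROT_A", 15.0, False),
    PeptideMatch("s4", "GDVAFVK", "PROT_B", 41.0, False),
    PeptideMatch("s5", "KAVFEDG", "DECOY_1", 33.0, True),
    PeptideMatch("s6", "RAIEYLY", "DECOY_2", 12.0, True),
]

accepted = apply_threshold(matches, min_score=30.0)
fdr = estimate_fdr(accepted)
proteins = peptides_to_proteins(accepted)

print(f"{len(accepted)} matches accepted, estimated FDR {fdr:.2f}")
for accession, peptides in sorted(proteins.items()):
    print(accession, sorted(peptides))
```

Keeping each routine as a small function over plain records means the same steps could be called from a command-line tool or from a graphical viewer without change, which is the kind of re-use the summary describes.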

Planned Impact

There is considerable potential in this application for providing direct benefits to the commercial sector. In the context of proteomics data standards, my group has collaborations with Matrix Science, Thermo, Waters and Proteome Software, who have all helped to develop mzIdentML and are currently working on implementations. The commercial sector has recognised the benefits of open-data standards, since users are requesting transparent access to their data and wish to use more than one tool for data analysis. The PSI has a good reputation with industrial partners, so they are willing to dedicate significant funds and developer time towards implementing PSI standards, in the knowledge that specifications will be well designed and likely to gain broad community acceptance. As one example of a specific benefit, Matrix Science (UK) produces the market-leading proteomics search engine (Mascot), and they have invested considerable effort in mzIdentML, through their director (David Creasy) working on the project since its inception in 2004 and currently co-chairing the PSI-PI group. Matrix Science are the first company to implement mzIdentML export in their software and thus, if worldwide uptake of the standard is good, they will have a competitive advantage. The proposed software tools are open-source and, as such, could be seen to be in direct competition with products from software vendors and instrument manufacturers. In reality, the mzidView software will be complementary to commercial search engines and LIMS, since it only offers functionality for post-processing results. The API and transformation libraries are completely open-source, released under a permissive licence, which will enable commercial entities to use them directly within their software. The mission of the PSI is to encourage as much open data sharing as possible, and this means facilitating both open-source software implementations and commercial solutions.
Journals publishing proteomics studies will benefit through increased (and easier) deposition of data, which will ease the peer-review process and will prevent journals having to provide ad hoc solutions for upload of data as supplementary material to their own websites in inappropriate formats. The software will greatly increase the number of proteomics data sets deposited in public databases. In economic terms, data sharing is highly valuable. In the BBSRC grant awards database, there are 785 records that include 'proteomics' as a keyword. This huge investment will not reach its full potential unless the data sets produced can be mined by the wider scientific community for deriving new biological knowledge. In terms of public health, this application could have considerable indirect benefits. Our stated aim is to greatly increase the number of publicly available proteomics data sets. These data sets could have any number of unexpected uses from understanding basic biological processes, to the search for disease biomarkers or in drug and vaccine development.

Publications

 
Description The project aimed to develop standard formats for data coming from "proteomics" studies, and software that makes those formats easier for research groups to use. These developments allow better communication between software used in different labs and support analysis or re-use of public data, with the potential to improve the transparency of scientific publications (results can be checked) or to allow data to be used for a new purpose.
Exploitation Route Standards are being further developed by the Proteomics Standards Initiative - see reports on PROCESS grant.
Sectors Healthcare, Pharmaceuticals and Medical Biotechnology

 
Description Standards developed in this grant have been implemented in a range of commercial software - see PROCESS grant reports.
Sector Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Economic