GRAPPA - Global compRehensive Atlas of Peptide and Protein Abundance

Lead Research Organisation: University of Liverpool
Department Name: Institute of Integrative Biology

Abstract

Proteins are the key molecules that carry out functions in biological systems: they act as enzymes to catalyse reactions, as signalling transducers that allow cells to respond to changing environments, and as structural components of cells, amongst many other roles. Biological and biomedical research has undergone a technology revolution in recent decades, whereby "Big Data" approaches have become widespread, enabling samples to be analysed in a high-throughput manner. For the analysis of proteins, a suite of technologies collectively called "proteomics" uses mass spectrometry to measure thousands of proteins simultaneously. This enables researchers to study which groups of proteins change in abundance during, for example, disease processes, helping them to understand the disease and potentially develop therapeutic targets.

Proteomics techniques rely upon expensive instrumentation, and the data are challenging to collect and process, requiring complex protocols for both lab analysis and downstream data and statistical analysis. As a result, the data are potentially highly valuable and very often have considerable scope for applications beyond the initial study. More generally in the biosciences, there has been a move towards greater transparency and open access to data, enabling results to be validated and widening access to research beyond select labs with access to the best technology. In proteomics, the applicants have been involved with making data open access for over 15 years, through the design of data standards and the development of freely available public repositories. The situation now is one where there are vast amounts of data, particularly deposited in the EBI's PRIDE database, the world-leading resource for proteomics datasets. Most relevant journals require that authors make their data available through an umbrella collection of databases called ProteomeXchange, of which PRIDE is the leading member. These datasets are routinely re-used for new purposes, for example to support defining where genes exist in genomes or to search for modifications to proteins. However, most re-use of proteomics data is currently done by specialist research groups that themselves have expertise in proteomics data analysis. PRIDE mostly contains raw data as collected from the instrument (prior to complex processing) or lists of proteins that have been identified, but not quantitative values represented in a standard way. Quantitative measurements of proteins are potentially highly valuable, because they allow researchers in a wide range of disciplines to understand how proteins are distributed in different cells or tissues under standard and changing conditions (such as disease).

Our overall goal in this proposal is to build data analysis pipelines and reprocess hundreds of datasets already in PRIDE, as well as those deposited in the coming years, so that we can unlock the potentially huge untapped value in quantitative proteomics data. The data will be represented in a new "PRIDE Quant" module and passed via a pipeline to the EBI's Expression Atlas database, which is designed to present a "biologist-friendly" view of the data, where researchers from any discipline can visualise data and download it in large batches for analysis in any downstream application.

Technical Summary

The world-leading PRIDE database now contains >14,000 proteomics datasets. All of these contain raw mass spectrometry (MS) data and some contain standardised lists of protein identifications, but currently none contain quantitative data expressed in a standard format. As such, there is vast untapped potential for quantitative data re-use by the majority of research groups, who do not have the capability to re-process datasets themselves.

In this project, we will develop robust, open, cloud-based data analysis pipelines that will be used to process hundreds of publicly available datasets, using standardised data processing and normalisation protocols. All datasets will be made available within a new portal, PRIDE Quant, to support computational users, and will be passed to the Expression Atlas database to provide a biologist-friendly view of the data. Data processing will largely focus on human samples, for which the highest data volumes exist, including both "baseline" datasets (e.g. to provide cell line or tissue/organ-level estimates of protein abundance) and "differential" expression datasets for various diseases including cancer, dementia, diabetes and major infectious diseases.
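To make the idea of a standardised normalisation step concrete, the following is a minimal sketch only. The proposal does not state which normalisation protocol will be used; log2 transformation followed by per-sample median centring is assumed here purely for illustration, and the function name, data layout and toy values are hypothetical.

```python
import math
from statistics import median


def median_normalise(samples):
    """Log2-transform raw intensities and centre each sample on its median.

    samples: dict mapping sample name -> dict of protein accession -> raw
    intensity. Missing (None) or non-positive intensities are dropped,
    i.e. treated as "not quantified" (an assumption of this sketch).
    """
    normalised = {}
    for sample, intensities in samples.items():
        logged = {
            protein: math.log2(value)
            for protein, value in intensities.items()
            if value is not None and value > 0
        }
        if not logged:
            normalised[sample] = {}
            continue
        centre = median(logged.values())
        # After centring, every sample has a median log2 abundance of 0,
        # which removes global differences in loading between samples.
        normalised[sample] = {p: v - centre for p, v in logged.items()}
    return normalised


# Hypothetical toy data: "heart" was measured at 10x the overall intensity.
raw = {
    "liver": {"P1": 8.0, "P2": 16.0, "P3": 32.0},
    "heart": {"P1": 80.0, "P2": 160.0, "P3": 320.0, "P4": 0.0},
}
norm = median_normalise(raw)
```

In this toy case the two samples differ only by a global scale factor, so after centring their normalised values agree, which is the behaviour a cross-dataset comparison relies on.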

We will develop several exemplar applications of the data, including displays showing correlations between gene and protein expression for matched samples, co-expression networks generated from proteomics data, and vast maps of peptide-level abundance to support new research in proteome bioinformatics.
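Two of these exemplar applications can be sketched in a few lines. The code below is an illustration under stated assumptions, not the project's method: it uses Pearson correlation, a fixed threshold of 0.9 for drawing network edges, and hypothetical accessions and abundance values, none of which are specified in the proposal.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(abundance, threshold=0.9):
    """Return (protein_a, protein_b, r) for pairs with |r| >= threshold.

    abundance: dict mapping protein accession -> list of abundances, with
    the same sample order for every protein.
    """
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if not math.isnan(r) and abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical protein abundances across four matched samples.
protein = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.1, 5.9, 8.0],
    "P3": [4.0, 1.0, 3.5, 2.0],
}
# Hypothetical transcript levels for the gene encoding P1, same samples.
transcript_p1 = [0.9, 2.2, 2.8, 4.1]

r_gene_protein = pearson(transcript_p1, protein["P1"])
edges = coexpression_edges(protein, threshold=0.9)
```

Here `r_gene_protein` is the kind of value a gene-versus-protein correlation display would plot for one gene, and `edges` is the edge list of a small co-expression network. In practice a rank-based measure such as Spearman correlation may be preferred for skewed abundance data; that choice is left open in this sketch.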

Planned Impact

Human proteomics data have considerable potential to support pharmaceutical industry development pipelines, for example in biomarker discovery or in testing how particular proteins are distributed across tissues or cell types. Many pharmaceutical companies do not have in-house proteomics analysis capabilities; they will be able to mine any datasets they wish straightforwardly, without requiring local or specialist bioinformatics support.

Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, thanks to the re-analysis of public proteomics datasets and the integration of quantitative proteomics data in Expression Atlas.

More broadly, as proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits across a wide range of areas in basic biology, biomedical and clinical science, as more value will be derived from datasets.

Life scientists worldwide will be able to benefit from the training activities planned (both face-to-face and via on-line resources).

Staff employed will benefit by:

- Receiving further training in a key enabling technology for the BBSRC (proteomics), and gaining exposure to a multi-disciplinary team, conferences, workshops and new national and international collaborations (for example through the Proteomics Standards Initiative).

- Acquiring the skills needed to work with bioinformatics software in a cloud environment, which is becoming increasingly important given the growing size of datasets and the need for suitable IT infrastructure.
 
Description: We have generated standardised maps of the proteins expressed in multiple human tissues, which will support research in a variety of biological and biomedical fields. We have also generated the same comparative data for major rodent models (rat and mouse), showing that protein abundance is well conserved across species. These resources will be highly useful to assist in the design of new experiments, as well as helping to interpret current results.
Exploitation Route: We have a new grant about to start that will further develop the outputs, with more data and a new type of MS data called DIA.
Sectors: Healthcare, Pharmaceuticals and Medical Biotechnology