Galaxy Workflows for Proteomics Informed by Transcriptomics (PIT)

Lead Research Organisation: Queen Mary University of London
Department Name: School of Biological and Chemical Sciences

Abstract

Identifying which proteins are present in a given biological sample, and in what quantities, is essential to understanding many biological processes. A technique called "shotgun proteomics" has become the method of choice for tackling this problem. In a shotgun proteomics analysis, proteins are first broken down into more easily analysable segments (peptides) using a cleavage enzyme. The peptides are then separated using liquid chromatography (LC) before being injected individually into a tandem mass spectrometer (MS/MS), which breaks them into fragments, producing a spectrum of product ions that can be considered a fingerprint for each peptide. Software is used to match the acquired spectra to peptides, and these peptide identifications are then used to infer the presence of proteins.
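The cleavage step described above can be illustrated in software. The following is a minimal sketch, assuming the enzyme is trypsin and using the common rule that it cuts after lysine (K) or arginine (R) unless the next residue is proline (P). The abstract does not name the enzyme, and the function name and example sequence are hypothetical.

```python
def tryptic_digest(protein, min_length=1):
    """Split a protein sequence into peptides.

    Assumed rule (not stated in the abstract): cleave after K or R,
    except when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep any residues left over after the final cleavage site.
    if start < len(protein):
        peptides.append(protein[start:])
    # Very short peptides are often uninformative, so allow filtering them out.
    return [p for p in peptides if len(p) >= min_length]


if __name__ == "__main__":
    # Hypothetical sequence, used only to show the behaviour.
    print(tryptic_digest("MKWVTFISLLRPSAK"))
```

In this sketch the R followed by P is not treated as a cleavage site, so the example sequence yields two peptides rather than three.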

Working out which peptide is represented by each of the acquired spectra is clearly a crucial part of shotgun proteomics. In theory, because we understand the principles of peptide fragmentation, it should be possible to take any peptide spectrum and work out the sequence of the peptide from which it came. In practice this is usually too difficult because the combination of imperfect MS/MS spectra and the huge number of peptides that could potentially exist makes incorrect identifications very likely. To circumvent this problem, protein identification software seeks to match peptide spectra only to those peptide sequences that might reasonably be expected to be in the sample. Currently this is done by searching against the sequences of all proteins that the species under study is known to produce (the "proteome"), downloaded from an online database (e.g. UniProt). However, high-quality proteomes are only available for a small number of species. What if you want to do proteomics on a sample from a species for which a proteome is not available, or on a sample from an experiment involving multiple species, or unknown species?
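To illustrate why constraining the list of candidate peptides matters, the sketch below filters a peptide list by precursor mass, a simplified version of the first step a database search engine typically performs before comparing fragment spectra. It is not the algorithm of any particular search engine. The function names, the 10 ppm tolerance and the example peptides are assumptions made for illustration; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def candidates(observed_mass, peptides, tol_ppm=10.0):
    """Return the peptides whose mass is within tol_ppm of observed_mass."""
    tol = observed_mass * tol_ppm / 1e6
    return [p for p in peptides if abs(peptide_mass(p) - observed_mass) <= tol]


if __name__ == "__main__":
    # Hypothetical search space: the smaller it is, the fewer false candidates.
    search_space = ["PEPTIDE", "EDITPEP", "PEPTIDES", "GG"]
    print(candidates(peptide_mass("PEPTIDE"), search_space))
```

Note that "PEPTIDE" and its rearrangement "EDITPEP" have the same mass, so both pass the filter. Distinguishing them requires the fragment spectrum, and the more candidates that survive this step, the greater the chance of an incorrect match.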

We recently developed (and tested, and published) a solution to this problem, which we call proteomics informed by transcriptomics (PIT). The key to PIT is the creation of a sample-specific list of proteins that may be present, derived from gene transcripts found in the sample. Transcripts are copies of genes that are used to make proteins, so by knowing which transcripts are present in a sample we can predict which proteins might be present. The transcripts are found by using a next generation sequencing technique called RNA-seq. Until very recently, RNA-seq involved mapping short reads to a reference genome, but software is now available that can assemble transcripts de novo.
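The step from transcripts to predicted proteins can be sketched as a six-frame translation. The code below is a minimal illustration, assuming the standard genetic code and treating every stretch between stop codons as a candidate protein sequence. It is a simplified stand-in and not the implementation used in PIT; the function names, the minimum length and the example transcript are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unrecognised codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(transcript, min_length=3):
    """Collect stop-to-stop segments from all six reading frames."""
    transcript = transcript.upper()
    orfs = set()
    for strand in (transcript, reverse_complement(transcript)):
        for frame in range(3):
            protein = translate(strand[frame:])
            for segment in protein.split("*"):
                if len(segment) >= min_length:
                    orfs.add(segment)
    return sorted(orfs)


if __name__ == "__main__":
    # Hypothetical transcript, used only to show the behaviour.
    for orf in six_frame_orfs("ATGGCCAAATGA"):
        print(orf)
```

In practice a much larger minimum length would be used, and the resulting sequences would be written to a FASTA file that serves as the sample-specific search database.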

The PIT approach therefore makes it possible to identify and quantify proteins in complex samples when a reference proteome (or genome) is not available. This opens many new areas of research for species that do not have well-annotated genomes (which include many pests, pathogens and plants), and also for experiments where proteins from multiple species are present (so-called "metaproteomics") or where the proteome is changing (e.g. during viral infection). There are also a number of additional spin-off benefits, such as the ability to find protein variants that are specific to the individual under study (i.e. not present in any reference proteome) and the possibility of annotating genomes.

Currently, the main challenge of the PIT approach is the complexity of the data analysis necessary to integrate the transcriptomic and proteomic data and report results in a way that is useful to biologists. The aim of this proposal is therefore to put together a suite of easy-to-use, connected software tools that enable the typical bench scientist to perform the necessary data analysis within an acceptable timescale and without bioinformatics support. To help achieve this we plan to implement the software within the popular Galaxy framework. Galaxy provides an easy-to-use web browser interface and can take advantage of powerful computing resources.

Technical Summary

Popular software for the identification of proteins from MS/MS spectra (e.g. Mascot, OMSSA, MaxQuant) searches acquired spectra against a peptide list derived from the proteome of the species under study. This makes the protein identification problem tractable by constraining the peptide search space, but limits the application of proteomics to those few species for which there is a high quality well annotated genome, and does not inherently find variant peptides that differ from the reference proteome.

We have recently shown that these limitations can be overcome by generating a sample-specific protein database from transcriptomes assembled de novo from RNA-seq short reads acquired from the same sample. This technique, which we refer to as proteomics informed by transcriptomics (PIT), has recently (Sept 2012) been accepted for publication in Nature Methods. We showed that, for case studies including adenovirus-infected HeLa cells, the PIT approach identified >95% of the peptides found using a traditional proteomics search against a reference proteome and, thanks to the more tightly defined search space and the ability to match sample-specific variant peptides, detected several hundred additional peptides that were present.

The aim of the proposed project is to use our unique experience of the PIT approach to develop a suite of Galaxy-based workflows that allow the typical bench scientist to perform the data analysis workflows needed to support PIT and extract biologically relevant information from the results obtained. The workflows will be composed of numerous individual tools, some of which are already available for Galaxy (e.g. Trinity for de novo transcript assembly, and getORF for deriving protein sequences from the transcripts), others that need to be "wrapped" for use in Galaxy (e.g. OMSSA protein search engine), and a further set that must be specially written for PIT (these will be primarily for data integration and downstream analysis and reporting).

Planned Impact

As a fundamental methodology that substantially improves our ability to study proteins and understand genomes, the potential beneficiaries of the PIT approach that will be facilitated by the proposed software development are broad and numerous.

As already mentioned, the concept of PIT analysis emanated from the infectious diseases community, following the realisation that traditional proteomics was not well suited to many of the studies that they were undertaking, especially with non-model organisms such as mosquito and bat. If we take virology as an example where PIT can bring new insights, the improved understanding of viruses that PIT can provide clearly has great potential to impact on human health, animal welfare, public policy and the economy. This is just one example among many others, including food security and industrial biotechnology, which are of intense interest to both academia and industry (as evidenced by the supplied letters of support).

This proposal will also help bolster the UK's position in proteomics research. Despite proteomics being a very competitive area, BBSRC funding has helped the UK to establish several internationally competitive research groups, both in laboratory proteomics and proteome informatics. This has led to commercial activities, including the formation of the very successful proteomics software companies Matrix Science and Nonlinear Dynamics. With continued investment we see no reason why the UK cannot retain its leading position in proteomics, and in this case also help bolster our expertise in the increasingly important area of data integration.

In terms of timescale, we expect some benefits to be realised during the project itself, as researchers at Bristol are already doing PIT analysis and our proposed software will be made available to them as it is developed. This will allow them to get more out of their data in less time (as well as helping us refine our software in response to their feedback). Scientific benefits will extend further once the PIT workflows are made generally available towards the end of the project, and any societal benefits that follow from novel scientific insights would become apparent in subsequent years.

 
Description We have created GIO, a software system that uses the well-established Galaxy platform to make integrated analysis of transcriptomic and proteomic data available to the typical bench scientist via a simple web interface.
Exploitation Route Due to the open and modular nature of our system, there is significant scope for further development by ourselves and others.
Sectors Agriculture, Food and Drink; Education; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology

URL http://gio.sbcs.qmul.ac.uk
 
Description Understanding Alternative Splicing in Human Cancer by Proteomics Informed by Transcriptomics 
Organisation Queen Mary University of London
Department Centre for Molecular Oncology
Country United Kingdom 
Sector Hospitals 
PI Contribution We are applying our PIT methodology to experimental data obtained by the group of Dr Pabhakar Rajan, in an effort to help him understand the role that alternative splicing plays in cancer.
Collaborator Contribution The partner has provided high quality multi-omic data, and valuable domain knowledge.
Impact This collaboration is multi-disciplinary, leading to novel software tools and improved biological understanding. These will be published in due course.
Start Year 2018
 
Title GIO: Galaxy Integrated Omics 
Description Galaxy-based Integrated Omics (GIO) is a customised Galaxy server dedicated to providing easy access to proteomics tools and pipelines. 
Type Of Technology Webtool/Application 
Year Produced 2014 
Open Source License? Yes  
Impact Used in a number of research projects, and for teaching proteome informatics at Masters level and beyond. 
URL http://gio.sbcs.qmul.ac.uk/