ProteoGenomics: Dynamic Linkage of Genomes and Proteomes through Ensembl and ProteomeXchange

Lead Research Organisation: University of Liverpool
Department Name: Institute of Integrative Biology

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

The Distributed Annotation System (DAS) has long been the workhorse for integrating external data sources into Ensembl. The UCSC Genome Browser has developed and switched to the more modern and efficient 'TrackHub' technology, and Ensembl now provides preliminary support for TrackHubs as well. A particular challenge is keeping the integration of mass spectrometry (MS)-based proteomics information up to date, because different labs use different search databases and genome assemblies are updated regularly. However, high-quality proteogenomics integration has the potential for huge benefits; prominent examples are reliable identification of isoforms, post-translational modifications (PTMs) and quantitative protein expression information. The UK-based PRIDE database is one of the major resources for MS data worldwide, as well as a major driver in the international ProteomeXchange (PX) consortium.
In this project, we will provide an integrated proteogenomics infrastructure to vastly improve the current situation. We will improve TrackHub support for Ensembl, demonstrating its usefulness and performance through the complex use case of proteogenomics. This project will involve massively parallel re-analysis of MS data as new genome builds are released, via the "ProteoAnnotator" pipeline. The reprocessing of the proteomics data sets will be done at different levels: peptide/protein identification, quantification (using spectral counting), identification aimed at improving genome annotation and unrestricted search of PTMs. The reanalysed data sets will be sourced from the PRIDE repository (as part of the PX consortium), as well as from existing BBSRC-funded proteogenomics projects, and will be mapped onto Ensembl. We will focus on human and model organisms represented in Ensembl and Ensembl Genomes, like mouse, rat and Arabidopsis, and eukaryotic pathogens such as Toxoplasma, Plasmodium and Trypanosoma.
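
As an illustration of the quantification level mentioned above, spectral counting estimates relative protein abundance by counting the peptide-spectrum matches (PSMs) assigned to each protein. The sketch below is a minimal example under stated assumptions, not the ProteoAnnotator implementation: the function name, the tuple layout and the accessions are hypothetical, and it assumes each PSM has already been assigned to a single protein.

```python
from collections import Counter


def spectral_counts(psms):
    """Count peptide-spectrum matches (PSMs) per protein accession.

    `psms` is an iterable of (spectrum_id, peptide_sequence, protein_accession)
    tuples. This layout is an assumption made for illustration only.
    """
    counts = Counter()
    for _spectrum_id, _peptide, protein in psms:
        counts[protein] += 1
    return dict(counts)


# Hypothetical PSMs: two spectra matched to PROT_A, one to PROT_B.
psms = [
    ("scan_1", "PEPTIDEK", "PROT_A"),
    ("scan_2", "PEPTIDEK", "PROT_A"),
    ("scan_3", "ANOTHERK", "PROT_B"),
]
counts = spectral_counts(psms)
```

In practice, raw counts are often normalised (for example by protein length) before comparison across proteins; that step is omitted here.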

Planned Impact

The direct beneficiaries include:

- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines. It is important to highlight that all the software developed in the context of "ProteoGenomics" will be open-source using the Apache 2.0 licence.
- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, since the envisioned integration of proteomics data in Ensembl constitutes an important step forward for the field. In addition, there will be a higher incentive for public data deposition in the proteomics field due to the increased visibility of proteomics data in Ensembl.
- As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits, as "ProteoGenomics" will integrate proteomics information at the genome level. These benefits could be realised in any area of basic biology, biomedical or clinical science. For instance, reprocessing datasets will make it possible to find new post-translational modifications (PTMs) or genome features such as new exon-intron boundaries or DNA variation information.

Staff employed will benefit from:
- Training in two key enabling technologies for the BBSRC (genomics and proteomics) and exposure to new collaborations.
 
Description We have developed infrastructure and software for re-processing proteomics data in the public domain, and using it to annotate genomes - including visual display as tracks of data via the newly developed proBED data format.

The software infrastructure (proBED) and track hubs support genome annotation processes, by clearly and simply displaying the level of protein expression support for predicted gene models, which until now has been technically challenging to implement.
Exploitation Route The software produced and data standards are already being picked up by other groups.
Sectors Healthcare, Manufacturing, including Industrial Biotechnology

 
Description Impacts are still ongoing; however, software and data standards produced in this grant are being picked up by other research teams and industry.
First Year Of Impact 2016
Sector Healthcare, Leisure Activities, including Sports, Recreation and Tourism
Impact Types Economic

 
Title mzIdentML 1.2 
Description Updates to the mzIdentML data standard for proteomics in mzIdentML 1.2 
Type Of Material Computer model/algorithm 
Year Produced 2017 
Provided To Others? Yes  
Impact The standard is exported from commercial and free software, and read by the major databases in the field.
URL https://github.com/HUPO-PSI/mzIdentML
 
Title proBED data standard 
Description Data standard for displaying proteomics data on genomes 
Type Of Material Computer model/algorithm 
Year Produced 2017 
Provided To Others? Yes  
Impact proBED allows proteomics data to be displayed on genome browsers, thus connecting up two major types of public data in omics research. 
URL http://www.psidev.info/probed
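
To illustrate how a peptide identification can be placed on a genome browser track, the sketch below maps a peptide's position within a protein onto genomic coordinates and writes a standard 12-column BED line. It is a simplified example under stated assumptions: it handles forward-strand genes only, the function names and coordinates are hypothetical, and it emits only the core BED columns. The proBED specification defines further proteomics-specific columns that are not reproduced here.

```python
def peptide_to_blocks(cds_exons, aa_start, aa_len):
    """Map a peptide to genomic blocks on the forward strand.

    cds_exons: sorted list of (start, end) coding intervals,
               0-based and half-open, as in BED.
    aa_start:  0-based index of the peptide's first residue in the protein.
    aa_len:    peptide length in residues.
    """
    nt_start = aa_start * 3           # offset within the spliced coding sequence
    nt_end = nt_start + aa_len * 3
    blocks = []
    offset = 0                        # coding nucleotides before the current exon
    for ex_start, ex_end in cds_exons:
        ex_len = ex_end - ex_start
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + ex_len)
        if lo < hi:                   # peptide overlaps this exon
            blocks.append((ex_start + lo - offset, ex_start + hi - offset))
        offset += ex_len
    return blocks


def bed12_line(chrom, name, blocks, strand="+"):
    """Format genomic blocks as one tab-separated BED12 line."""
    start, end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - start) for s, _ in blocks)
    fields = [chrom, start, end, name, 0, strand,
              start, end, "0", len(blocks), sizes, starts]
    return "\t".join(str(f) for f in fields)


# Hypothetical gene with two coding exons; the peptide spans the splice junction.
blocks = peptide_to_blocks([(100, 130), (200, 260)], aa_start=8, aa_len=4)
line = bed12_line("chr1", "PEPTIDE_1", blocks)
```

A peptide that crosses an exon-exon junction yields two blocks, which is the kind of evidence that supports a predicted splice site in a gene model.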
 
Title ProteoAnnotator 
Description Abstract from paper: The recent massive increase in capability for sequencing genomes is producing enormous advances in our understanding of biological systems. However, there is a bottleneck in genome annotation - determining the structure of all transcribed genes. Experimental data from MS studies can play a major role in confirming and correcting gene structure - proteogenomics. However, there are some technical and practical challenges to overcome, since proteogenomics requires pipelines comprising a complex set of interconnected modules as well as bespoke routines, for example in protein inference and statistics. We are introducing a complete, open source pipeline for proteogenomics, called ProteoAnnotator, which incorporates a graphical user interface and implements the Proteomics Standards Initiative mzIdentML standard for each analysis stage. All steps are included as standalone modules with the mzIdentML library, allowing other groups to re-use the whole pipeline or constituent parts within other tools. We have developed new modules for pre-processing and combining multiple search databases, for performing peptide-level statistics on mzIdentML files, for scoring grouped protein identifications matched to a given genomic locus to validate that updates to the official gene models are statistically sound, and for mapping end results back onto the genome. ProteoAnnotator is available from http://www.proteoannotator.org/. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact Pipeline for re-annotating genomes with proteomics data. Being implemented to connect two major public databases - EBI Ensembl and EBI PRIDE. 
URL http://www.proteoannotator.org/