ProteoGenomics: Dynamic Linkage of Genomes and Proteomes through Ensembl and ProteomeXchange

Lead Research Organisation: University of Liverpool

Department Name: Institute of Integrative Biology

Technical Summary

The Distributed Annotation System (DAS) has long been the workhorse for integrating external data sources into Ensembl. The UCSC Genome Browser has since developed, and switched to, the more modern and efficient 'TrackHub' technology, for which Ensembl now also provides preliminary support. Up-to-date integration of mass spectrometry (MS)-based proteomics information is a particular challenge, because different labs use different search databases and genome assemblies are updated regularly. However, high-quality proteogenomics integration has the potential for huge benefits, such as reliable identification of isoforms and post-translational modifications (PTMs), and quantitative protein expression information. The UK-based PRIDE database is one of the major resources for MS data worldwide, as well as a major driver of the international ProteomeXchange (PX) consortium.
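As a rough illustration of what a track hub involves, the sketch below generates the three plain-text files that a minimal hub is generally built from (hub.txt, genomes.txt and a per-assembly trackDb.txt). The hub name, assembly name, track name, URL and email address are all hypothetical placeholders, and the function is not taken from the project's software; it only shows the general shape of the files under those assumptions.

```python
def build_track_hub(hub_name, assembly, track_name, big_bed_url, email):
    """Return the text of the files in a minimal track hub, keyed by relative path.

    All argument values are supplied by the caller; nothing here is specific
    to the project described above.
    """
    hub_txt = "\n".join([
        f"hub {hub_name}",
        f"shortLabel {hub_name}",
        f"longLabel {hub_name} peptide evidence",
        "genomesFile genomes.txt",
        f"email {email}",
    ]) + "\n"

    # One stanza per assembly, pointing at that assembly's track definitions.
    genomes_txt = f"genome {assembly}\ntrackDb {assembly}/trackDb.txt\n"

    # A single track backed by a remote bigBed file with 12 BED columns.
    track_db_txt = "\n".join([
        f"track {track_name}",
        f"bigDataUrl {big_bed_url}",
        f"shortLabel {track_name}",
        f"longLabel {track_name} mapped to {assembly}",
        "type bigBed 12",
    ]) + "\n"

    return {
        "hub.txt": hub_txt,
        "genomes.txt": genomes_txt,
        f"{assembly}/trackDb.txt": track_db_txt,
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    files = build_track_hub(
        hub_name="ExamplePeptides",
        assembly="hg38",
        track_name="peptides",
        big_bed_url="https://example.org/peptides.bb",
        email="someone@example.org",
    )
    for path, text in files.items():
        print(f"--- {path} ---\n{text}")
```

Because the hub is only a set of small text files pointing at remotely hosted data, a track can be refreshed after a genome build changes by regenerating the data file and leaving the hub definition in place.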
In this project, we will provide an integrated proteogenomics infrastructure to vastly improve the current situation. We will improve TrackHub support for Ensembl, demonstrating its usefulness and performance through the complex use case of proteogenomics. This project will involve massively parallel re-analysis of MS data as new genome builds are released, via the "ProteoAnnotator" pipeline. The reprocessing of the proteomics data sets will be done at different levels: peptide/protein identification, quantification (using spectral counting), identification aimed at improving genome annotation and unrestricted search of PTMs. The reanalysed data sets will be sourced from the PRIDE repository (as part of the PX consortium), as well as from existing BBSRC-funded proteogenomics projects, and will be mapped onto Ensembl. We will focus on human and model organisms represented in Ensembl and Ensembl Genomes, like mouse, rat and Arabidopsis, and eukaryotic pathogens such as Toxoplasma, Plasmodium and Trypanosoma.
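The quantification level mentioned above, spectral counting, can be sketched in a few lines. The example below is a minimal illustration and not the ProteoAnnotator implementation: it assumes peptide-spectrum matches are already available as (spectrum ID, protein accession) pairs, and the function names and sample data are hypothetical. The length normalisation step (NSAF) is a common refinement added here for illustration; the summary itself only specifies spectral counting.

```python
from collections import Counter


def spectral_counts(psms):
    """Count the distinct spectra matched to each protein.

    psms: iterable of (spectrum_id, protein_accession) pairs.
    A spectrum reported twice for the same protein is counted once.
    """
    seen = set()
    counts = Counter()
    for spectrum_id, protein in psms:
        if (spectrum_id, protein) not in seen:
            seen.add((spectrum_id, protein))
            counts[protein] += 1
    return counts


def nsaf(counts, lengths):
    """Normalised spectral abundance factor.

    Divides each count by the protein length, then scales the results
    so that they sum to 1 across all proteins in `counts`.
    """
    saf = {protein: counts[protein] / lengths[protein] for protein in counts}
    total = sum(saf.values())
    return {protein: value / total for protein, value in saf.items()}


if __name__ == "__main__":
    # Hypothetical matches: spectrum s2 is listed twice for protein P1.
    psms = [("s1", "P1"), ("s2", "P1"), ("s2", "P1"), ("s3", "P2")]
    counts = spectral_counts(psms)
    print(dict(counts))
    print(nsaf(counts, {"P1": 100, "P2": 50}))
```

Raw counts favour long proteins, which produce more peptides, so dividing by length before comparing proteins is one simple way to make the values more comparable within a sample.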

Planned Impact

The direct beneficiaries include:

- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines. It is important to highlight that all the software developed in the context of "ProteoGenomics" will be open-source using the Apache 2.0 licence.
- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, since the envisioned integration of proteomics data in Ensembl constitutes an important step forward for the field. In addition, there will be a higher incentive for public data deposition in the proteomics field due to the increased visibility of proteomics data in Ensembl.
- As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as "ProteoGenomics" will integrate proteomics information at the genome level. These benefits could be realised in any area of basic biology, biomedical or clinical science. For instance, through the reprocessing of datasets, it will be possible to find new post-translational modifications (PTMs) or genome features such as new exon-intron boundaries or DNA variation information.

Staff employed will benefit from:
- Training in two key enabling technologies for the BBSRC (genomics and proteomics) and exposure to new collaborations.

Funded Value:

£206,035

Funded Period:

Jun 14 - Dec 16

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/L024128/1

Principal Investigator:

Andrew Jones

Research Subject:

Omic sciences & technologies (98%)

Research Topic:

Genomics (28%)

Proteomics (70%)

Organisations

University of Liverpool (Lead Research Organisation)

People	ORCID iD
Andrew Jones (Principal Investigator)

Publications

Deutsch EW (2023) Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work. in Journal of proteome research

Deutsch EW (2017) Proteomics Standards Initiative: Fifteen Years of Progress and Future Work. in Journal of proteome research

Ghali F (2014) ProteoAnnotator--open source proteogenomics annotation software supporting PSI standards. in Proteomics

Hoffmann N (2019) mzTab-M: A Data Standard for Sharing Quantitative Results in Mass Spectrometry Metabolomics. in Analytical chemistry

Krishna R (2015) A large-scale proteogenomics study of apicomplexan pathogens-Toxoplasma gondii and Neospora caninum. in Proteomics

Menschaert G (2018) The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data. in Genome biology

Ren Z (2019) Improvements to the Rice Genome Annotation Through Large-Scale Analysis of RNA-Seq and Proteomics Data Sets. in Molecular & cellular proteomics : MCP

Ren Z (2018) Improvements to the rice genome annotation through large-scale analysis of RNA-Seq and proteomics datasets

Silmon De Monerri N (2015) Integration of RNA-seq and proteomics data with genomics for improved genome annotation in Apicomplexan parasites. in Proteomics

Vizcaíno JA (2017) The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics. in Molecular & cellular proteomics : MCP

Key Findings
Impact Summary
Research Databases and Models
Software and Technical Products


Description	We have developed infrastructure and software for re-processing proteomics data in the public domain, and using it to annotate genomes - including visual display as tracks of data via the newly developed proBED data format. The software infrastructure (proBED) and track hubs support genome annotation processes, by clearly and simply displaying the level of protein expression support for predicted gene models, which until now has been technically challenging to implement.
Exploitation Route	The software and data standards produced are already being picked up by other groups.
Sectors	Healthcare; Manufacturing, including Industrial Biotechnology


Description	Impacts are still ongoing; however, the software and data standards produced under this grant are being picked up by other research teams and industry.
First Year Of Impact	2016
Sector	Healthcare; Leisure Activities, including Sports, Recreation and Tourism
Impact Types	Economic


Title	mzIdentML 1.2
Description	Version 1.2 update of the mzIdentML data standard for proteomics
Type Of Material	Computer model/algorithm
Year Produced	2017
Provided To Others?	Yes
Impact	The standard is exported from commercial and free software, and read by the major databases in the field.
URL	https://github.com/HUPO-PSI/mzIdentML


Title	proBED data standard
Description	Data standard for displaying proteomics data on genomes
Type Of Material	Computer model/algorithm
Year Produced	2017
Provided To Others?	Yes
Impact	proBED allows proteomics data to be displayed on genome browsers, thus connecting up two major types of public data in omics research.
URL	http://www.psidev.info/probed
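To make the idea of displaying a peptide on a genome concrete, the sketch below writes one peptide as a line in the standard 12-column BED layout, which genome browsers use for features split across several blocks (here, a peptide spanning two exons). It is an illustration only: the function name and coordinates are hypothetical, and the additional proteomics-specific columns defined by the proBED specification are deliberately omitted.

```python
def peptide_to_bed12(chrom, strand, name, blocks, score=0):
    """Format a peptide's genomic footprint as a single 12-column BED line.

    blocks: list of (start, end) genomic intervals covered by the peptide,
            0-based and half-open as BED requires, in any order.
    """
    blocks = sorted(blocks)
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]

    # Block sizes, and block starts relative to chrom_start, comma-separated.
    block_sizes = ",".join(str(end - start) for start, end in blocks)
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks)

    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,   # thickStart, thickEnd: draw the whole feature thick
        "0",                      # itemRgb: default colour
        len(blocks), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


if __name__ == "__main__":
    # Hypothetical 8-residue peptide (24 nucleotides) split across two exons.
    print(peptide_to_bed12("chr1", "+", "PEPTIDEK", [(1000, 1012), (2000, 2012)]))
```

Each block corresponds to the part of the peptide encoded by one exon, so a browser draws the gap between blocks as an intron, which is what lets a spliced peptide confirm an exon-intron boundary in a gene model.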


Title	ProteoAnnotator
Description	Abstract from paper: The recent massive increase in capability for sequencing genomes is producing enormous advances in our understanding of biological systems. However, there is a bottleneck in genome annotation - determining the structure of all transcribed genes. Experimental data from MS studies can play a major role in confirming and correcting gene structure - proteogenomics. However, there are some technical and practical challenges to overcome, since proteogenomics requires pipelines comprising a complex set of interconnected modules as well as bespoke routines, for example in protein inference and statistics. We are introducing a complete, open source pipeline for proteogenomics, called ProteoAnnotator, which incorporates a graphical user interface and implements the Proteomics Standards Initiative mzIdentML standard for each analysis stage. All steps are included as standalone modules with the mzIdentML library, allowing other groups to re-use the whole pipeline or constituent parts within other tools. We have developed new modules for pre-processing and combining multiple search databases, for performing peptide-level statistics on mzIdentML files, for scoring grouped protein identifications matched to a given genomic locus to validate that updates to the official gene models are statistically sound, and for mapping end results back onto the genome. ProteoAnnotator is available from http://www.proteoannotator.org/.
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	Pipeline for re-annotating genomes with proteomics data. Being implemented to connect two major public databases - EBI Ensembl and EBI PRIDE.
URL	http://www.proteoannotator.org/