ProteoGenomics: Dynamic Linkage of Genomes and Proteomes through Ensembl and ProteomeXchange
Lead Research Organisation:
University of Liverpool
Department Name: Institute of Integrative Biology
Technical Summary
The Distributed Annotation System (DAS) has long been the workhorse for integrating external data sources into Ensembl. The UCSC Genome Browser has developed and switched to the more modern and efficient 'TrackHub' technology, and Ensembl now provides preliminary support for TrackHubs as well. A particular challenge is keeping the integration of mass spectrometry (MS)-based proteomics information up to date, because different labs use different search databases and genome assemblies are updated regularly. However, high-quality proteogenomics integration offers potentially huge benefits, prominent examples being the reliable identification of isoforms and post-translational modifications (PTMs), and quantitative protein expression information. The UK-based PRIDE database is one of the major resources for MS data worldwide, as well as a major driver in the international ProteomeXchange (PX) consortium.
In this project, we will provide an integrated proteogenomics infrastructure to vastly improve the current situation. We will improve TrackHub support for Ensembl, demonstrating its usefulness and performance through the complex use case of proteogenomics. This project will involve massively parallel re-analysis of MS data as new genome builds are released, via the "ProteoAnnotator" pipeline. The proteomics data sets will be reprocessed at four levels: peptide/protein identification, quantification (using spectral counting), identification aimed at improving genome annotation, and unrestricted search of PTMs. The reanalysed data sets will be sourced from the PRIDE repository (as part of the PX consortium), as well as from existing BBSRC-funded proteogenomics projects, and will be mapped onto Ensembl. We will focus on human and on model organisms represented in Ensembl and Ensembl Genomes, such as mouse, rat and Arabidopsis, as well as eukaryotic pathogens such as Toxoplasma, Plasmodium and Trypanosoma.
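Spectral counting, the quantification approach named in the summary, estimates relative protein abundance from the number of peptide-spectrum matches (PSMs) assigned to each protein. The sketch below is illustrative only. It assumes the normalised spectral abundance factor (NSAF), one common length-corrected variant that the summary does not specify, and the accessions, lengths and PSM assignments are hypothetical. It is not the ProteoAnnotator implementation.

```python
from collections import Counter


def spectral_counts(psm_proteins):
    """Count how many PSMs were assigned to each protein accession."""
    return Counter(psm_proteins)


def nsaf(counts, lengths):
    """Normalised spectral abundance factor (an assumed variant).

    Each protein's count is divided by its length in residues, and the
    resulting values are scaled so that they sum to 1 across the sample.
    """
    saf = {acc: counts[acc] / lengths[acc] for acc in counts}
    total = sum(saf.values())
    return {acc: value / total for acc, value in saf.items()}


# Hypothetical input: one protein accession per identified spectrum.
psms = ["P1", "P1", "P2", "P1", "P3", "P2"]
# Hypothetical protein lengths in amino acids.
lengths = {"P1": 300, "P2": 100, "P3": 50}

counts = spectral_counts(psms)
abundances = nsaf(counts, lengths)
```

Dividing by protein length matters because longer proteins yield more peptides, and therefore more spectra, at the same abundance.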
Planned Impact
The direct beneficiaries include:
- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines. It is important to highlight that all the software developed in the context of "ProteoGenomics" will be open-source using the Apache 2.0 licence.
- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, since the envisioned integration of proteomics data in Ensembl constitutes an important step forward for the field. In addition, there will be a higher incentive for public data deposition in the proteomics field due to the increased visibility of proteomics data in Ensembl.
- As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as "ProteoGenomics" will integrate proteomics information at the genome level. These benefits could be realised in any area of basic biology, biomedical or clinical science. For instance, through the reprocessing of datasets, it will be possible to find new post-translational modifications (PTMs) or genome features such as new exon-intron boundaries or DNA variation information.
Staff employed will benefit from:
- Training in two key enabling technologies for the BBSRC (genomics and proteomics), and exposure to new collaborations.
Organisations
People
Andrew Jones (Principal Investigator)
Publications
Deutsch EW
(2023)
Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work.
in Journal of proteome research
Deutsch EW
(2017)
Proteomics Standards Initiative: Fifteen Years of Progress and Future Work.
in Journal of proteome research
Ghali F
(2014)
ProteoAnnotator--open source proteogenomics annotation software supporting PSI standards.
in Proteomics
Hoffmann N
(2019)
mzTab-M: A Data Standard for Sharing Quantitative Results in Mass Spectrometry Metabolomics.
in Analytical chemistry
Krishna R
(2015)
A large-scale proteogenomics study of apicomplexan pathogens-Toxoplasma gondii and Neospora caninum.
in Proteomics
Menschaert G
(2018)
The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data.
in Genome biology
Ren Z
(2019)
Improvements to the Rice Genome Annotation Through Large-Scale Analysis of RNA-Seq and Proteomics Data Sets.
in Molecular & cellular proteomics : MCP
Silmon De Monerri N
(2015)
Integration of RNA-seq and proteomics data with genomics for improved genome annotation in Apicomplexan parasites.
in Proteomics
Vizcaíno JA
(2017)
The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics.
in Molecular & cellular proteomics : MCP
Description | We have developed infrastructure and software for re-processing proteomics data in the public domain, and using it to annotate genomes - including visual display as tracks of data via the newly developed proBED data format. The software infrastructure (proBED) and track hubs support genome annotation processes, by clearly and simply displaying the level of protein expression support for predicted gene models, which until now has been technically challenging to implement. |
Exploitation Route | The software and data standards produced are already being picked up by other groups. |
Sectors | Healthcare; Manufacturing, including Industrial Biotechnology |
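To make the track hub display mentioned above concrete: a hub is described by a few plain-text files, and each track is declared in a trackDb stanza of "setting value" lines. The sketch below assembles one such stanza for a bigBed track, using setting names from the UCSC track hub conventions. The track name, labels and URL are hypothetical placeholders, not the project's actual hub.

```python
def trackdb_stanza(track, big_data_url, short_label, long_label,
                   track_type="bigBed 12 +", visibility="dense"):
    """Build one trackDb stanza as text.

    The type "bigBed 12 +" declares 12 standard BED columns followed by
    extra columns, which suits BED-derived formats carrying peptide data.
    """
    settings = [
        ("track", track),
        ("bigDataUrl", big_data_url),
        ("shortLabel", short_label),
        ("longLabel", long_label),
        ("type", track_type),
        ("visibility", visibility),
    ]
    return "\n".join(f"{key} {value}" for key, value in settings) + "\n"


# Hypothetical track: all values below are placeholders.
stanza = trackdb_stanza(
    track="example_peptides",
    big_data_url="https://example.org/hub/hg38/peptides.bb",
    short_label="Peptides",
    long_label="Peptide evidence from reprocessed public MS data",
)
print(stanza)
```

Because the hub only points at remotely hosted files, a track can be refreshed after each reanalysis without any change on the browser side.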
Description | Impacts are ongoing; however, the software and data standards produced in this grant are being picked up by other research teams and by industry. |
First Year Of Impact | 2016 |
Sector | Healthcare; Leisure Activities, including Sports, Recreation and Tourism |
Impact Types | Economic |
Title | mzIdentML 1.2 |
Description | Updates to the mzIdentML data standard for proteomics in mzIdentML 1.2 |
Type Of Material | Computer model/algorithm |
Year Produced | 2017 |
Provided To Others? | Yes |
Impact | The standard is exported from commercial and free software, and read by the major databases in the field. |
URL | https://github.com/HUPO-PSI/mzIdentML |
Title | proBED data standard |
Description | Data standard for displaying proteomics data on genomes |
Type Of Material | Computer model/algorithm |
Year Produced | 2017 |
Provided To Others? | Yes |
Impact | proBED allows proteomics data to be displayed on genome browsers, thus connecting up two major types of public data in omics research. |
URL | http://www.psidev.info/probed |
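As an illustration of how a peptide identification can be placed on a genome, the sketch below maps a peptide's position within a protein onto the coding exons of its gene and writes the result as the 12 core BED columns. It is a simplified sketch under stated assumptions: forward strand only, 0-based half-open coordinates, and hypothetical exon coordinates and peptide. The additional proteomics-specific columns that proBED defines beyond the BED core are deliberately omitted, and the function names are invented for this example.

```python
def peptide_to_blocks(cds_exons, pep_start_aa, pep_len_aa):
    """Map a peptide to genomic blocks on the forward strand.

    cds_exons: ordered list of (start, end) coding exon intervals,
               0-based and half-open.
    pep_start_aa: 0-based index of the peptide's first residue.
    pep_len_aa: peptide length in residues.
    """
    nt_start = pep_start_aa * 3
    nt_end = nt_start + pep_len_aa * 3
    blocks = []
    offset = 0  # coding nucleotides covered by earlier exons
    for exon_start, exon_end in cds_exons:
        exon_len = exon_end - exon_start
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + exon_len)
        if lo < hi:  # the peptide overlaps this exon
            blocks.append((exon_start + lo - offset, exon_start + hi - offset))
        offset += exon_len
    if not blocks:
        raise ValueError("peptide lies outside the coding sequence")
    return blocks


def bed12_line(chrom, blocks, name, strand="+", score=0, rgb="0,0,0"):
    """Format genomic blocks as one tab-separated BED12 line."""
    start, end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - start) for s, _ in blocks)
    fields = [chrom, start, end, name, score, strand,
              start, end, rgb, len(blocks), sizes, starts]
    return "\t".join(str(f) for f in fields)


# Hypothetical gene with two coding exons; the peptide spans the junction.
cds_exons = [(1000, 1030), (2000, 2060)]
blocks = peptide_to_blocks(cds_exons, pep_start_aa=8, pep_len_aa=5)
line = bed12_line("chr1", blocks, "PEPTK")
print(line)
```

A peptide that spans a splice junction produces two blocks, which is the kind of evidence that can support or correct a predicted exon-intron boundary.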
Title | ProteoAnnotator |
Description | Abstract from paper: The recent massive increase in capability for sequencing genomes is producing enormous advances in our understanding of biological systems. However, there is a bottleneck in genome annotation - determining the structure of all transcribed genes. Experimental data from MS studies can play a major role in confirming and correcting gene structure - proteogenomics. However, there are some technical and practical challenges to overcome, since proteogenomics requires pipelines comprising a complex set of interconnected modules as well as bespoke routines, for example in protein inference and statistics. We are introducing a complete, open source pipeline for proteogenomics, called ProteoAnnotator, which incorporates a graphical user interface and implements the Proteomics Standards Initiative mzIdentML standard for each analysis stage. All steps are included as standalone modules with the mzIdentML library, allowing other groups to re-use the whole pipeline or constituent parts within other tools. We have developed new modules for pre-processing and combining multiple search databases, for performing peptide-level statistics on mzIdentML files, for scoring grouped protein identifications matched to a given genomic locus to validate that updates to the official gene models are statistically sound, and for mapping end results back onto the genome. ProteoAnnotator is available from http://www.proteoannotator.org/. |
Type Of Technology | Software |
Year Produced | 2014 |
Open Source License? | Yes |
Impact | Pipeline for re-annotating genomes with proteomics data. Being implemented to connect two major public databases - EBI Ensembl and EBI PRIDE. |
URL | http://www.proteoannotator.org/ |