ProteoGenomics: Dynamic Linkage of Genomes and Proteomes through Ensembl and ProteomeXchange
Lead Research Organisation:
European Bioinformatics Institute
Department Name: Proteomics Services Team
Abstract
For researchers in the Life Sciences, it is imperative that they are able to access and view the human genome, and the genomes of model organisms and human pathogens, in an efficient and user-friendly way via the Internet. The genome itself is annotated with information about the locations and functions of genes, and quantitative data about genes and other elements within the genome. The UK-based Ensembl project is a leading genome browser, used by thousands of researchers every day. The value of genomic information is greatly increased when it is integrated with, and can be directly viewed alongside, other biological data sources such as proteomics: a set of technologies devoted to the identification and quantification of proteins, the functional molecules encoded by each gene. From a technical point of view, the large size of modern biological data sets makes it challenging to integrate them efficiently into genome browsers. DAS (Distributed Annotation System) is the prevalent technology used by genome browsers to integrate external data, but it can no longer support much-needed new features or scale to the sizes of modern data sets. Another genome browser, the UCSC Genome Browser, has developed a more modern and efficient technology called 'TrackHubs', specifically designed for large-scale data sets. Both UCSC and Ensembl have developed initial support for this technology, but there are still limitations for many users, and Ensembl's support remains incomplete. In the 'ProteoGenomics' project, we first want to further develop the 'TrackHub' technology, expanding its scope of usage in Ensembl, and making it easier for researchers around the world to discover and use TrackHubs containing different types of research data. Ensembl's TrackHub technology will be expanded to proteomics data for the first time and thus improve the provision of non-genomics biological information in this widely used resource.
In the project, we are going to build technology to integrate proteomics data with the genome data held in Ensembl, in a dynamic and effective way. With this aim in mind, we will use public MS proteomics data submitted to and available in one of the main repositories in the world, the UK-based resource PRIDE, which is also one of the leaders of the ProteomeXchange Consortium of proteomics resources. We will reanalyse the data in PRIDE via our ProteoAnnotator pipeline to provide updated or complementary information to the results originally submitted by the research team that generated the data. We are pioneering techniques for extracting more value from the same data, to understand how proteins vary in their abundance and in the chemical modifications that alter their function. These are two types of results often not generated initially by the research groups submitting data to PRIDE. Through this data reuse and the extraction of new biological findings, the value of the submitted datasets will increase. In addition, 'ProteoGenomics' will provide a portal for datasets from the recently started Human Proteome Project (HPP), providing the global research community with a single entry point to these datasets.
Technical Summary
The Distributed Annotation System (DAS) has for a long time been the workhorse for integration of external data sources into Ensembl. The UCSC Genome Browser has developed and switched to the more modern and efficient 'TrackHub' technology. Ensembl now provides preliminary support for TrackHubs as well. A particular challenge is the up-to-date integration of mass spectrometry (MS)-based proteomics information, due to the use of different search databases in different labs, and regular updates in genome assemblies. However, there is the potential for huge benefits from high-quality proteogenomics integration: e.g. reliable identification of isoforms, post-translational modifications (PTMs) and quantitative protein expression information are prominent examples. The UK-based PRIDE database is one of the major resources for MS data worldwide, as well as a major driver in the international ProteomeXchange (PX) consortium.
In this project, we will provide an integrated proteogenomics infrastructure to vastly improve the current situation. We will improve TrackHub support for Ensembl, demonstrating its usefulness and performance through the complex use case of proteogenomics. This project will involve massively parallel re-analysis of MS data as new genome builds are released, via the "ProteoAnnotator" pipeline. The reprocessing of the proteomics data sets will be done at different levels: peptide/protein identification, quantification (using spectral counting), identification aimed at improving genome annotation and unrestricted search of PTMs. The reanalysed data sets will be sourced from the PRIDE repository (as part of the PX consortium), as well as from existing BBSRC-funded proteogenomics projects, and will be mapped onto Ensembl. We will focus on human and model organisms represented in Ensembl and Ensembl Genomes, like mouse, rat and Arabidopsis, and eukaryotic pathogens such as Toxoplasma, Plasmodium and Trypanosoma.
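As an illustration of the quantification level mentioned above, the following is a minimal sketch of spectral counting, assuming identifications are available as (spectrum, protein) pairs. The function names, the toy data and the NSAF length normalisation are assumptions made for this example; the source states only that spectral counting is used and does not describe the ProteoAnnotator implementation.

```python
from collections import Counter


def spectral_counts(psms):
    """Count peptide-spectrum matches (PSMs) per protein accession."""
    return Counter(protein for _spectrum_id, protein in psms)


def nsaf(counts, lengths):
    """Normalised spectral abundance factor (an assumed normalisation).

    Each count is divided by the protein length, then the values are
    rescaled so that they sum to 1 across the data set.
    """
    saf = {protein: counts[protein] / lengths[protein] for protein in counts}
    total = sum(saf.values())
    return {protein: value / total for protein, value in saf.items()}


# Hypothetical toy input: (spectrum id, protein accession) pairs.
psms = [("s1", "P1"), ("s2", "P1"), ("s3", "P2"), ("s4", "P1")]
lengths = {"P1": 300, "P2": 100}  # protein lengths in amino acids

counts = spectral_counts(psms)
abundance = nsaf(counts, lengths)
```

In this toy case P1 has three times as many spectra as P2 but is also three times as long, so the length-normalised abundances come out equal.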
Planned Impact
The direct beneficiaries include:
- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines. It is important to highlight that all the software developed in the context of "ProteoGenomics" will be open-source using the Apache 2.0 licence.
- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, since the envisioned integration of proteomics data in Ensembl constitutes an important step forward for the field. In addition, there will be a higher incentive for public data deposition in the proteomics field due to the increased visibility of proteomics data in Ensembl.
- As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as "ProteoGenomics" will integrate proteomics information at the genome level. These benefits could be realised in any area of basic biology, biomedical or clinical science. For instance, through the reprocessing of datasets, it will be possible to find new post-translational modifications (PTMs) or genome features such as new exon-intron boundaries or DNA variation information.
Staff employed will benefit:
- Training in two key enabling technologies for the BBSRC (genomics and proteomics) and exposure to new collaborations.
Publications
Aken BL
(2017)
Ensembl 2017.
in Nucleic acids research
Cunningham F
(2015)
Ensembl 2015.
in Nucleic acids research
Cunningham F
(2015)
Improving the Sequence Ontology terminology for genomic variant annotation.
in Journal of biomedical semantics
Deutsch EW
(2017)
The ProteomeXchange consortium in 2017: supporting the cultural change in proteomics public data deposition.
in Nucleic acids research
Deutsch EW
(2017)
Proteomics Standards Initiative: Fifteen Years of Progress and Future Work.
in Journal of proteome research
Deutsch EW
(2023)
Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work.
in Journal of proteome research
Deutsch EW
(2023)
The ProteomeXchange consortium at 10 years: 2023 update.
in Nucleic acids research
Griss J
(2018)
Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra".
in Journal of proteome research
Horvatovich P
(2015)
Quest for Missing Proteins: Update 2015 on Chromosome-Centric Human Proteome Project.
in Journal of proteome research
Howe KL
(2020)
Ensembl Genomes 2020-enabling non-vertebrate genomic research.
in Nucleic acids research
Jarnuczak AF
(2017)
Using the PRIDE Database and ProteomeXchange for Submitting and Accessing Public Proteomics Datasets.
in Current protocols in bioinformatics
Martens L
(2017)
A Golden Age for Working with Public Proteomics Data.
in Trends in biochemical sciences
Menschaert G
(2018)
The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data.
in Genome biology
Perez-Riverol Y
(2020)
Scalable Data Analysis in Proteomics and Metabolomics Using BioContainers and Workflows Engines.
in Proteomics
Perez-Riverol Y
(2016)
Ten Simple Rules for Taking Advantage of Git and GitHub.
in PLoS computational biology
Perez-Riverol Y
(2019)
The PRIDE database and related tools and resources in 2019: improving support for quantification data.
in Nucleic acids research
Perez-Riverol Y
(2015)
ms-data-core-api: an open-source, metadata-oriented library for computational proteomics.
in Bioinformatics (Oxford, England)
Perez-Riverol Y
(2017)
OLS Client and OLS Dialog: Open Source Tools to Annotate Public Omics Datasets
in PROTEOMICS
Perez-Riverol Y
(2017)
Discovering and linking public omics data sets using the Omics Discovery Index.
in Nature biotechnology
Perez-Riverol Y
(2016)
PRIDE Inspector Toolsuite: Moving Toward a Universal Visualization Tool for Proteomics Data Standard Formats and Quality Assessment of ProteomeXchange Datasets
in Molecular & Cellular Proteomics
Reisinger F
(2015)
Introducing the PRIDE Archive RESTful web services.
in Nucleic acids research
Ruffier M
(2017)
Ensembl core software resources: storage and programmatic access for DNA sequence and genome annotation.
in Database : the journal of biological databases and curation
Uszkoreit J
(2018)
Protein inference using PIA workflows and PSI standard file formats
Uszkoreit J
(2019)
Protein Inference Using PIA Workflows and PSI Standard File Formats.
in Journal of proteome research
Vaudel M
(2016)
Exploring the potential of public proteomics data.
in Proteomics
Vizcaíno JA
(2017)
The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics.
in Molecular & cellular proteomics : MCP
Vizcaíno JA
(2016)
2016 update of the PRIDE database and its related tools.
in Nucleic acids research
Yates A
(2016)
Ensembl 2016.
in Nucleic acids research
Zerbino DR
(2018)
Ensembl 2018.
in Nucleic acids research
Zerbino DR
(2016)
Ensembl regulation resources.
in Database : the journal of biological databases and curation
Description | We have developed a REST web service for the PRIDE database. We have also developed a set of data standards (proBed and proBAM) for proteogenomics data, which are compatible with their genomics counterparts (BED and BAM/SAM). Pipelines to integrate automatically proteomics data from PRIDE into Ensembl have been developed, but we are still refining some details. The TrackHub registry has been developed by the Ensembl team. |
Exploitation Route | Existing pipelines can be extended to support other species, to support the new data standards proBed and proBAM, and to include quantitative proteomics data, among other things. |
Sectors | Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology |
URL | https://trackhubregistry.org/ |
Title | Additional file 1: Table S1. of The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data |
Description | Detailed description of the two formats presented, proBAM (S1A) and proBed (S1B). (XLSX 46 kb) |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
URL | https://springernature.figshare.com/articles/Additional_file_1_Table_S1_of_The_proBAM_and_proBed_sta... |
Title | PRIDE Archive RESTful service |
Description | The PRIDE (PRoteomics IDEntifications) database is one of the world-leading public repositories of mass spectrometry (MS)-based proteomics data and it is a founding member of the ProteomeXchange Consortium of proteomics resources. New REST (REpresentational State Transfer) web services have been developed to serve the most popular functionality provided by BioMart (now discontinued due to data scalability issues) and address the data access requirements of the newly developed PRIDE Archive. Using the API (Application Programming Interface) it is now possible to programmatically query for and retrieve peptide and protein identifications, project and assay metadata and the originally submitted files. Searching and filtering is also possible by metadata information, such as sample details (e.g. species and tissues), instrumentation (mass spectrometer), keywords and other provided annotations. |
Type Of Technology | Webtool/Application |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | The API has already been adopted by a few applications and standalone tools such as PeptideShaker, PRIDE Inspector, the Unipept web application and the Python-based BioServices package. It is also heavily used by PRIDE users. |
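The metadata filtering described for this web service can be sketched as building a query URL. This is a hedged illustration only: the base URL, the path and every parameter name below are placeholders invented for the example and are not taken from the PRIDE Archive API documentation, which should be consulted for the real endpoints.

```python
from urllib.parse import urlencode

# Placeholder base URL; NOT the real PRIDE Archive endpoint.
BASE_URL = "https://example.org/pride/api"


def build_project_query(species=None, tissue=None, keyword=None, page_size=10):
    """Build a query URL that filters projects by sample metadata.

    The path and the parameter names are hypothetical.
    """
    params = {
        "species": species,
        "tissue": tissue,
        "keyword": keyword,
        "pageSize": page_size,
    }
    # Drop filters that were not supplied.
    params = {key: value for key, value in params.items() if value is not None}
    return f"{BASE_URL}/projects?{urlencode(params)}"


url = build_project_query(species="Homo sapiens", tissue="liver")
```

The resulting URL could then be fetched with any HTTP client; the response format is deliberately left out because the source does not describe it.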
Title | TrackHub registry |
Description | The TrackHub registry can be used as a registry for Track Hubs that can be displayed in Genome Browsers, enabling their findability (https://trackhubregistry.org/). |
Type Of Technology | Webtool/Application |
Year Produced | 2017 |
Open Source License? | Yes |
Impact | Improve findability of biological data that can be integrated in Genome Browsers such as Ensembl |
URL | https://trackhubregistry.org/ |
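For readers unfamiliar with Track Hubs, a hub is a small set of plain-text files (`hub.txt`, `genomes.txt` and one `trackDb.txt` per assembly) that point a genome browser at remotely hosted data files. The sketch below generates a minimal set of such files for a single bigBed track. The hub name, assembly name, e-mail address and data URL are hypothetical, and the example does not reproduce any hub produced by the project.

```python
def hub_files(hub_name, genome, track_name, big_data_url, email):
    """Return the text of a minimal Track Hub as {relative path: content}."""
    hub_txt = "\n".join([
        f"hub {hub_name}",
        f"shortLabel {hub_name}",
        f"longLabel {hub_name} peptide evidence",
        "genomesFile genomes.txt",
        f"email {email}",
    ]) + "\n"

    # One stanza per assembly, pointing at that assembly's track definitions.
    genomes_txt = f"genome {genome}\ntrackDb {genome}/trackDb.txt\n"

    track_db = "\n".join([
        f"track {track_name}",
        f"bigDataUrl {big_data_url}",
        f"shortLabel {track_name}",
        f"longLabel {track_name} peptides mapped to {genome}",
        "type bigBed 12 +",  # 12 standard BED fields plus extra columns
    ]) + "\n"

    return {
        "hub.txt": hub_txt,
        "genomes.txt": genomes_txt,
        f"{genome}/trackDb.txt": track_db,
    }


# All values below are made up for illustration.
files = hub_files(
    hub_name="demoProteoHub",
    genome="hg38",
    track_name="demoPeptides",
    big_data_url="https://example.org/data/peptides.bb",
    email="user@example.org",
)
```

Writing these three strings to disk under a public URL, alongside the referenced bigBed file, gives a hub that can be submitted to a registry or loaded directly in a browser that supports Track Hubs.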
Title | proBed and proBAM data standards for proteogenomics data |
Description | We introduce here two novel standard data formats, proBAM and proBed, that have been developed to address the current challenges of integrating mass spectrometry-based proteomics data with genomics and transcriptomics information in proteogenomics studies. proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data. Therefore, existing popular genomics tools such as SAMtools and Bedtools, and several widely used genome browsers, can already be used to manipulate and visualize these formats "out-of-the-box." |
Type Of Technology | New Material/Compound |
Year Produced | 2018 |
Impact | https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1377-x We also highlight that a number of specific additional software tools, properly supporting the proteomics information available in these formats, are now available providing functionalities such as file generation, file conversion, and data analysis. |
URL | http://www.psidev.info/probam |
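To show why BED compatibility matters, here is a minimal sketch that maps a peptide onto genomic coordinates and writes the 12 standard BED columns that proBed builds on. It assumes a forward-strand transcript whose coding exons are given as 0-based, half-open intervals. The function names and the coordinates are hypothetical, and the additional proteomics columns defined by proBed are not reproduced here.

```python
def peptide_blocks(cds_exons, aa_start, aa_len):
    """Map a peptide to genomic blocks on the forward strand.

    cds_exons: sorted (start, end) coding intervals, 0-based half-open.
    aa_start:  0-based index of the first residue in the protein.
    """
    nt_start, nt_end = 3 * aa_start, 3 * (aa_start + aa_len)
    blocks, offset = [], 0  # offset = coding bases consumed so far
    for ex_start, ex_end in cds_exons:
        ex_len = ex_end - ex_start
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + ex_len)
        if lo < hi:  # the peptide overlaps this exon
            blocks.append((ex_start + lo - offset, ex_start + hi - offset))
        offset += ex_len
    return blocks


def bed12_line(chrom, name, blocks, strand="+"):
    """Format genomic blocks as one line with the 12 standard BED fields."""
    start, end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - start) for s, _ in blocks)
    fields = [chrom, start, end, name, 0, strand,
              start, end, "0", len(blocks), sizes, starts]
    return "\t".join(str(field) for field in fields)


# Hypothetical transcript: two coding exons; the peptide spans the junction.
exons = [(100, 130), (200, 260)]
blocks = peptide_blocks(exons, aa_start=8, aa_len=5)
line = bed12_line("chr1", "PEPTIDE", blocks)
```

Because the peptide crosses an exon-exon junction, it is emitted as two blocks inside a single BED record, which is the representation that existing BED tools and genome browsers already render as a spliced feature.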
Description | Career Q&A |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | This career Q&A with Year 10 students was carried out virtually for the local college, in the hope that it would encourage more students to think about entering not only science in general but also the field of bioinformatics. |
Year(s) Of Engagement Activity | 2020 |
Description | DNA workshop |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | An introduction to science for primary school children on the topic of DNA. |
Year(s) Of Engagement Activity | 2020 |
Description | Great Abington KS2 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | A 'meet the experts' session and open lab tour for Great Abington KS2 school. |
Year(s) Of Engagement Activity | 2020 |