ProteoGenomics: Dynamic Linkage of Genomes and Proteomes through Ensembl and ProteomeXchange

Lead Research Organisation: European Bioinformatics Institute

Department Name: Proteomics Services Team

Abstract

For researchers in the Life Sciences, it is imperative to be able to access and view the human genome, and the genomes of model organisms and human pathogens, in an efficient and user-friendly way via the Internet. The genome itself is annotated with information about the locations and functions of genes, and quantitative data about genes and other elements within the genome. The UK-based Ensembl project is a leading genome browser, used by thousands of researchers every day. The value of genomic information is greatly increased when it is integrated with, and can be directly viewed alongside, other biological data sources such as proteomics - a set of technologies devoted to the identification and quantification of proteins, the functional molecules encoded by each gene. From a technical point of view, the large size of modern biological data sets makes it challenging to integrate them efficiently into genome browsers. DAS (Distributed Annotation System) is the prevalent technology used by genome browsers to integrate external data, but it can no longer support much-needed new features or scale to the sizes of modern data sets. Another genome browser, the UCSC Genome Browser, has developed a more modern and efficient technology called 'TrackHubs', specifically designed for large-scale data sets. Both UCSC and Ensembl have developed initial support for this technology, but there are still limitations for many users, and Ensembl's support remains incomplete. In the 'ProteoGenomics' project, we first want to further develop the 'TrackHub' technology, expanding its scope of usage in Ensembl, and making it easier for researchers around the world to discover and use TrackHubs containing different types of research data. Ensembl's TrackHub technology will be expanded to proteomics data for the first time and thus improve the provision of non-genomics biological information in this widely used resource.
In the project, we are going to build technology to integrate proteomics data with the genome data held in Ensembl, in a dynamic and effective way. With this aim, we will use public mass spectrometry (MS) proteomics data submitted to and available in one of the main repositories in the world, the UK-based resource PRIDE, which is also one of the leaders of the ProteomeXchange Consortium of proteomics resources. We will reanalyse the data in PRIDE via our ProteoAnnotator pipeline to provide updated or complementary information to the results originally submitted by the research team that generated the data. We are pioneering techniques for extracting more value from the same data, to understand how proteins vary in their abundance and in the chemical modifications that occur on them and alter their function. These two types of results are often not generated initially by the research groups submitting data to PRIDE. Through this data reuse and the extraction of new biological findings, the value of the submitted datasets will increase. In addition, 'ProteoGenomics' will provide a portal for datasets from the recently started Human Proteome Project (HPP), providing the global research community with a single entry point to these datasets.

Technical Summary

The Distributed Annotation System (DAS) has for a long time been the workhorse for integration of external data sources into Ensembl. The UCSC Genome Browser has developed and switched to the more modern and efficient 'TrackHub' technology. Ensembl now provides preliminary support for TrackHubs as well. A particular challenge is the up-to-date integration of mass spectrometry (MS)-based proteomics information, due to the use of different search databases in different labs, and regular updates in genome assemblies. However, there is the potential for huge benefits from high-quality proteogenomics integration: reliable identification of isoforms, post-translational modifications (PTMs) and quantitative protein expression information are prominent examples. The UK-based PRIDE database is one of the major resources for MS data worldwide, as well as a major driver in the international ProteomeXchange (PX) consortium.
In this project, we will provide an integrated proteogenomics infrastructure to vastly improve the current situation. We will improve TrackHub support for Ensembl, demonstrating its usefulness and performance through the complex use case of proteogenomics. This project will involve massively parallel re-analysis of MS data as new genome builds are released, via the "ProteoAnnotator" pipeline. The reprocessing of the proteomics data sets will be done at different levels: peptide/protein identification, quantification (using spectral counting), identification aimed at improving genome annotation and unrestricted search of PTMs. The reanalysed data sets will be sourced from the PRIDE repository (as part of the PX consortium), as well as from existing BBSRC-funded proteogenomics projects, and will be mapped onto Ensembl. We will focus on human and model organisms represented in Ensembl and Ensembl Genomes, like mouse, rat and Arabidopsis, and eukaryotic pathogens such as Toxoplasma, Plasmodium and Trypanosoma.
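
As an illustration of the quantification level mentioned above, spectral counting estimates relative protein abundance from the number of peptide-spectrum matches (PSMs) assigned to each protein. The following is a minimal sketch of that idea only, not the ProteoAnnotator implementation; the function name, spectrum identifiers and protein accessions are invented for the example.

```python
from collections import Counter


def spectral_counts(psms):
    """Count peptide-spectrum matches (PSMs) per protein accession.

    `psms` is an iterable of (spectrum_id, protein_accession) pairs.
    Each spectrum is counted at most once per protein.
    """
    seen = set()
    counts = Counter()
    for spectrum_id, protein in psms:
        if (spectrum_id, protein) not in seen:
            seen.add((spectrum_id, protein))
            counts[protein] += 1
    return counts


# Hypothetical input: identifiers and accessions are placeholders.
example_psms = [
    ("scan_1", "PROT_A"),
    ("scan_2", "PROT_A"),
    ("scan_2", "PROT_A"),  # duplicate report of the same match, ignored
    ("scan_3", "PROT_B"),
]
counts = spectral_counts(example_psms)
```

Raw counts of this kind are usually normalised (for example by protein length) before proteins are compared; that step is omitted here.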

Planned Impact

The direct beneficiaries include:

- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines. It is important to highlight that all the software developed in the context of "ProteoGenomics" will be open-source using the Apache 2.0 licence.
- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, since the envisioned integration of proteomics data in Ensembl constitutes an important step forward for the field. In addition, there will be a higher incentive for public data deposition in the proteomics field due to the increased visibility of proteomics data in Ensembl.
- As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as "ProteoGenomics" will integrate proteomics information at the genome level. These benefits could be realised in any area of basic biology, biomedical or clinical science. For instance, through the reprocessing of datasets, it will be possible to find new post-translational modifications (PTMs) or genome features such as new exon-intron boundaries or DNA variation information.

Staff employed will benefit from:
- Training in two key enabling technologies for the BBSRC (genomics and proteomics) and exposure to new collaborations.

Funded Value:

£481,807

Funded Period:

Aug 14 - Aug 17

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/L024225/1

Principal Investigator:

Henning Hermjakob

Research Subject:

Biomolecules & biochemistry (24%)

Omic sciences & technologies (24%)

Tools, technologies & methods (48%)

Research Topic:

Bioinformatics (48%)

Functional genomics (24%)

Protein expression (24%)

Organisations

European Bioinformatics Institute (Lead Research Organisation)

People	ORCID iD
Henning Hermjakob (Principal Investigator)	http://orcid.org/0000-0001-8479-0262
Juan Antonio Vizcaino (Co-Investigator)	http://orcid.org/0000-0002-3905-4335
Paul Flicek (Co-Investigator)

Publications

Aken B (2017) Ensembl 2017. in Nucleic acids research

Cunningham F (2015) Ensembl 2015. in Nucleic acids research

Cunningham F (2015) Improving the Sequence Ontology terminology for genomic variant annotation. in Journal of biomedical semantics

Deutsch E (2022) The Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work

Deutsch E (2023) Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work

Deutsch EW (2017) Proteomics Standards Initiative: Fifteen Years of Progress and Future Work. in Journal of proteome research

Deutsch EW (2023) Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work. in Journal of proteome research

Deutsch EW (2023) The ProteomeXchange consortium at 10 years: 2023 update. in Nucleic acids research

Deutsch EW (2017) The ProteomeXchange consortium in 2017: supporting the cultural change in proteomics public data deposition. in Nucleic acids research

Griss J (2018) Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra". in Journal of proteome research

Key Findings
Research Databases and Models
Software and Technical Products
Engagement Activities


Description	We have developed a REST web service for the PRIDE database. We have also developed a set of data standards (proBed and proBAM) for proteogenomics data, which are compatible with their genomics counterparts (BED and BAM/SAM). Pipelines to automatically integrate proteomics data from PRIDE into Ensembl have been developed, but we are still refining some details. The TrackHub registry has been developed by the Ensembl team.
Exploitation Route	Existing pipelines can be extended to support other species, to support the new data standards proBed and proBAM, and to include quantitative proteomics data, among other things.
Sectors	Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology
URL	https://trackhubregistry.org/


Title	Additional file 1: Table S1. of The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data
Description	Detailed description of the two formats presented, proBAM (S1A) and proBed (S1B). (XLSX 46 kb)
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/Additional_file_1_Table_S1_of_The_proBAM_and_proBed_sta...


Title	PRIDE Archive RESTful service
Description	The PRIDE (PRoteomics IDEntifications) database is one of the world-leading public repositories of mass spectrometry (MS)-based proteomics data and it is a founding member of the ProteomeXchange Consortium of proteomics resources. New REST (REpresentational State Transfer) web services have been developed to serve the most popular functionality provided by BioMart (now discontinued due to data scalability issues) and address the data access requirements of the newly developed PRIDE Archive. Using the API (Application Programming Interface) it is now possible to programmatically query for and retrieve peptide and protein identifications, project and assay metadata and the originally submitted files. Searching and filtering is also possible by metadata information, such as sample details (e.g. species and tissues), instrumentation (mass spectrometer), keywords and other provided annotations.
Type Of Technology	Webtool/Application
Year Produced	2015
Open Source License?	Yes
Impact	The API has already been adopted by a few applications and standalone tools such as PeptideShaker, PRIDE Inspector, the Unipept web application and the Python-based BioServices package. It is also heavily used by PRIDE users.
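
To illustrate what filtering by metadata through a REST interface can look like, the sketch below builds a query URL from sample-level filters. It is a generic example only: the base URL, the `/projects` path and the parameter names (`species`, `tissue`, `instrument`) are placeholders chosen for illustration and are not taken from the PRIDE Archive API; the service's own documentation gives the real endpoints.

```python
from urllib.parse import urlencode

# Placeholder base URL; not a real PRIDE endpoint.
BASE_URL = "https://example.org/archive-api"


def build_project_query(base_url, **filters):
    """Build a GET URL for a hypothetical project search.

    Filters set to None are dropped; the rest are URL-encoded in
    alphabetical order so the result is deterministic.
    """
    params = {key: value for key, value in sorted(filters.items())
              if value is not None}
    return f"{base_url}/projects?{urlencode(params)}"


# Hypothetical filter names, for illustration only.
url = build_project_query(
    BASE_URL, species="Homo sapiens", tissue="liver", instrument=None
)
```

A client would then issue an HTTP GET request for `url` and parse the response in whatever format the service returns.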


Title	TrackHub registry
Description	The TrackHub registry can be used as a registry for Track Hubs that can be displayed in Genome Browsers, enabling their findability (https://trackhubregistry.org/).
Type Of Technology	Webtool/Application
Year Produced	2017
Open Source License?	Yes
Impact	Improve findability of biological data that can be integrated in Genome Browsers such as Ensembl
URL	https://trackhubregistry.org/
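
For context, a Track Hub in the UCSC convention is a small set of plain-text files (typically hub.txt, genomes.txt and one trackDb.txt per assembly) that point a genome browser at remotely hosted data files. The sketch below writes out the text of such files for one hypothetical peptide track. The hub name, labels, e-mail address, assembly name and data URL are all invented, and the attributes a given browser or the registry requires may differ from this minimal set.

```python
def stanza(fields):
    """Render one Track Hub stanza as 'key value' lines."""
    return "\n".join(f"{key} {value}" for key, value in fields) + "\n"


# hub.txt: describes the hub and points to the genomes file.
hub_txt = stanza([
    ("hub", "exampleProteomicsHub"),
    ("shortLabel", "Example peptides"),
    ("longLabel", "Example peptide evidence mapped to the genome"),
    ("genomesFile", "genomes.txt"),
    ("email", "someone@example.org"),
])

# genomes.txt: one stanza per assembly, pointing to its track database.
genomes_txt = stanza([
    ("genome", "hg38"),
    ("trackDb", "hg38/trackDb.txt"),
])

# trackDb.txt: one stanza per track; the data URL is a placeholder.
track_db_txt = stanza([
    ("track", "examplePeptides"),
    ("bigDataUrl", "https://example.org/hub/hg38/peptides.bb"),
    ("shortLabel", "Peptides"),
    ("longLabel", "Example peptide identifications"),
    ("type", "bigBed 12"),
])
```

Once files like these are hosted at a public URL, the hub address is what would be submitted to a registry so that other researchers can find it.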


Title	proBed and proBAM data standards for proteogenomics data
Description	We introduce here two novel standard data formats, proBAM and proBed, that have been developed to address the current challenges of integrating mass spectrometry-based proteomics data with genomics and transcriptomics information in proteogenomics studies. proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data. Therefore, existing popular genomics tools such as SAMtools and Bedtools, and several widely used genome browsers, can already be used to manipulate and visualize these formats "out-of-the-box."
Type Of Technology	New Material/Compound
Year Produced	2018
Impact	https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1377-x We also highlight that a number of specific additional software tools, properly supporting the proteomics information available in these formats, are now available providing functionalities such as file generation, file conversion, and data analysis.
URL	http://www.psidev.info/probam
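
Because proBed extends BED, a reader written for the standard twelve BED columns can already interpret the genomic part of each record, which is what makes existing tools usable "out-of-the-box". The sketch below parses those twelve columns and keeps any further columns as an uninterpreted list; it does not name or validate the proteomics-specific columns, which are defined in the proBed specification. The example line, including its two extra columns, is invented.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int          # 0-based start of the feature
    end: int            # end of the feature (exclusive)
    name: str
    strand: str
    block_sizes: list   # length of each block (e.g. exon segment)
    block_starts: list  # block offsets relative to `start`
    extra: list         # columns beyond BED12, left uninterpreted


def parse_bed12_line(line):
    """Parse the 12 standard BED columns of a tab-separated line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected at least 12 tab-separated columns")
    # Block lists are comma-separated and may carry a trailing comma.
    sizes = [int(x) for x in cols[10].rstrip(",").split(",")]
    starts = [int(x) for x in cols[11].rstrip(",").split(",")]
    return BedRecord(cols[0], int(cols[1]), int(cols[2]), cols[3],
                     cols[5], sizes, starts, cols[12:])


def genomic_blocks(record):
    """Return the (start, end) genomic interval of each block."""
    return [(record.start + offset, record.start + offset + size)
            for offset, size in zip(record.block_starts, record.block_sizes)]


# Hypothetical record: a peptide split across two exons.
example_line = "\t".join([
    "chr1", "1000", "1230", "pep1", "900", "+", "1000", "1230", "0",
    "2", "12,18", "0,212", "PROT_A", "PEPTIDER",
])
record = parse_bed12_line(example_line)
blocks = genomic_blocks(record)
```

Here the two blocks describe a peptide whose coding sequence is interrupted by an intron, the kind of evidence that can support an exon-intron boundary.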


Description	Career Q&A
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	This career Q&A with year 10 students was carried out virtually for the local college, and it is hoped that it will encourage more students to think about entering not only science but also the field of bioinformatics.
Year(s) Of Engagement Activity	2020


Description	DNA workshop
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	An introduction to science for primary school children on the topic of DNA.
Year(s) Of Engagement Activity	2020


Description	Great Abington KS2
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	A "meet the experts" session and open lab tour for Great Abington KS2 school.
Year(s) Of Engagement Activity	2020