ProteoGenomics: Dynamic Linkage of Genomes and Proteomes through Ensembl and ProteomeXchange

Lead Research Organisation: European Bioinformatics Institute
Department Name: Proteomics Services Team

Abstract

For researchers in the Life Sciences, it is imperative that they are able to access and view the human genome, and the genomes of model organisms and human pathogens, in an efficient and user-friendly way via the Internet. The genome itself is annotated with information about the locations and functions of genes, and quantitative data about genes and other elements within the genome. The UK-based Ensembl project is a leading genome browser, used by thousands of researchers every day. The value of genomic information is greatly increased when it is integrated with, and can be directly viewed alongside, data from other biological fields such as proteomics, a set of technologies devoted to the identification and quantification of proteins, the functional molecules encoded by each gene. From a technical point of view, the large size of modern biological data sets makes it challenging to integrate them efficiently into genome browsers. The Distributed Annotation System (DAS) is the prevalent technology used by genome browsers to integrate external data, but it can no longer support much-needed new features or scale to the sizes of modern data sets. Another genome browser, the UCSC Genome Browser, has developed a more modern and efficient technology called 'TrackHubs', specifically designed for large-scale data sets. Both UCSC and Ensembl have developed initial support for this technology, but there are still limitations for many users, and Ensembl's support remains incomplete. In the 'ProteoGenomics' project, we first want to further develop the 'TrackHub' technology, expanding its scope of usage in Ensembl, and making it easier for researchers around the world to discover and use TrackHubs containing different types of research data. Ensembl's TrackHub technology will be expanded to proteomics data for the first time and will thus improve the provision of non-genomics biological information in this widely used resource.
In the project, we are going to build technology to integrate proteomics data with the genome data held in Ensembl, in a dynamic and effective way. With this aim in mind, we will use public mass spectrometry (MS) proteomics data submitted to and available in one of the main repositories in the world, the UK-based resource PRIDE, which is also one of the leading members of the ProteomeXchange Consortium of proteomics resources. We will reanalyse the data in PRIDE via our ProteoAnnotator pipeline to provide updated or complementary information to the results originally submitted by the research team that generated the data. We are pioneering techniques for extracting more value from the same data, to understand how proteins vary in their abundance and in the chemical modifications that alter their function. These two types of results are often not generated initially by the research groups submitting data to PRIDE. Through this data reuse and the extraction of new biological findings, the value of the submitted datasets will increase. In addition, 'ProteoGenomics' will provide a portal for datasets from the recently started Human Proteome Project (HPP), providing the global research community with a single entry point to these datasets.

Technical Summary

The Distributed Annotation System (DAS) has for a long time been the workhorse for integration of external data sources into Ensembl. The UCSC Genome Browser has developed and switched to the more modern and efficient 'TrackHub' technology. Ensembl now provides preliminary support for TrackHubs as well. A particular challenge is the up-to-date integration of mass spectrometry (MS)-based proteomics information, due to the use of different search databases in different labs, and regular updates in genome assemblies. However, high-quality proteogenomics integration offers potentially huge benefits, such as reliable identification of isoforms and post-translational modifications (PTMs), and quantitative protein expression information. The UK-based PRIDE database is one of the major resources for MS data worldwide, as well as a major driver in the international ProteomeXchange (PX) consortium.
In this project, we will provide an integrated proteogenomics infrastructure to vastly improve the current situation. We will improve TrackHub support for Ensembl, demonstrating its usefulness and performance through the complex use case of proteogenomics. This project will involve massively parallel re-analysis of MS data as new genome builds are released, via the "ProteoAnnotator" pipeline. The reprocessing of the proteomics data sets will be done at different levels: peptide/protein identification, quantification (using spectral counting), identification aimed at improving genome annotation and unrestricted search of PTMs. The reanalysed data sets will be sourced from the PRIDE repository (as part of the PX consortium), as well as from existing BBSRC-funded proteogenomics projects, and will be mapped onto Ensembl. We will focus on human and model organisms represented in Ensembl and Ensembl Genomes, like mouse, rat and Arabidopsis, and eukaryotic pathogens such as Toxoplasma, Plasmodium and Trypanosoma.
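To make the quantification level concrete, the following is a minimal sketch of spectral counting: each protein is scored by the number of peptide-spectrum matches (PSMs) assigned to it. It is not the ProteoAnnotator implementation. The function names, the example accessions and the optional length normalisation (NSAF, a commonly used variant) are assumptions added for illustration.

```python
from collections import Counter


def spectral_counts(psms):
    """Count peptide-spectrum matches (PSMs) per protein accession.

    `psms` is an iterable of (protein_accession, peptide_sequence) pairs,
    one pair per identified spectrum.
    """
    return Counter(protein for protein, _peptide in psms)


def nsaf(counts, lengths):
    """Normalised spectral abundance factor.

    Each count is divided by the protein length, then the values are
    scaled so that they sum to 1 across all proteins.
    """
    saf = {protein: counts[protein] / lengths[protein] for protein in counts}
    total = sum(saf.values())
    return {protein: value / total for protein, value in saf.items()}


# Hypothetical input: three identified spectra mapped to two proteins.
psms = [("P1", "PEPTIDEA"), ("P1", "PEPTIDEB"), ("P2", "PEPTIDEC")]
counts = spectral_counts(psms)

# Hypothetical protein lengths in amino acids.
abundance = nsaf(counts, {"P1": 100, "P2": 50})
```

Raw counts favour long proteins, which yield more peptides, so a length-normalised value such as NSAF is often reported alongside them.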

Planned Impact

The direct beneficiaries include:

- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines. It is important to highlight that all the software developed in the context of "ProteoGenomics" will be open-source using the Apache 2.0 licence.
- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, since the envisioned integration of proteomics data in Ensembl constitutes an important step forward for the field. In addition, there will be a higher incentive for public data deposition in the proteomics field due to the increased visibility of proteomics data in Ensembl.
- As proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits as "ProteoGenomics" will integrate proteomics information at the genome level. These benefits could be realised in any area of basic biology, biomedical or clinical science. For instance, through the reprocessing of datasets, it will be possible to find new post-translational modifications (PTMs) or genome features such as new exon-intron boundaries or DNA variation information.

Staff employed will benefit from:
- Training in two key enabling technologies for the BBSRC (genomics and proteomics) and exposure to new collaborations.

Publications

Aken BL (2017) Ensembl 2017. in Nucleic acids research

Cunningham F (2015) Improving the Sequence Ontology terminology for genomic variant annotation. in Journal of biomedical semantics

Cunningham F (2015) Ensembl 2015. in Nucleic acids research

Deutsch EW (2017) Proteomics Standards Initiative: Fifteen Years of Progress and Future Work. in Journal of proteome research

Deutsch EW (2023) The ProteomeXchange consortium at 10 years: 2023 update. in Nucleic acids research

Howe KL (2020) Ensembl Genomes 2020-enabling non-vertebrate genomic research. in Nucleic acids research

 
Description We have developed a REST web service for the PRIDE database. We have also developed a set of data standards (proBed and proBAM) for proteogenomics data, which are compatible with their genomics counterparts (BED and BAM/SAM). Pipelines to automatically integrate proteomics data from PRIDE into Ensembl have been developed, but we are still refining some details. The TrackHub registry has been developed by the Ensembl team.
Exploitation Route Existing pipelines can be extended to support other species, to support the new data standards proBed and proBAM, and to include quantitative proteomics data, among other things.
Sectors Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology

URL https://trackhubregistry.org/
 
Title PRIDE Archive RESTful service 
Description The PRIDE (PRoteomics IDEntifications) database is one of the world-leading public repositories of mass spectrometry (MS)-based proteomics data and it is a founding member of the ProteomeXchange Consortium of proteomics resources. New REST (REpresentational State Transfer) web services have been developed to serve the most popular functionality provided by BioMart (now discontinued due to data scalability issues) and address the data access requirements of the newly developed PRIDE Archive. Using the API (Application Programming Interface) it is now possible to programmatically query for and retrieve peptide and protein identifications, project and assay metadata and the originally submitted files. Searching and filtering is also possible by metadata information, such as sample details (e.g. species and tissues), instrumentation (mass spectrometer), keywords and other provided annotations. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Open Source License? Yes  
Impact The API has already been adopted by a few applications and standalone tools such as PeptideShaker, PRIDE Inspector, the Unipept web application and the Python-based BioServices package. It is also heavily used by PRIDE users.
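As an illustration of the metadata-based searching described for this service, the sketch below builds a query URL for a REST-style project search. The base URL, the `projects` path and the parameter names (`q`, `species`, `tissue`, `pageSize`) are placeholders chosen for illustration, not the documented PRIDE Archive API, so the real names should be taken from the service documentation before use.

```python
from urllib.parse import urlencode

# Assumption: placeholder base URL, not the real PRIDE Archive endpoint.
BASE_URL = "https://example.org/pride/ws/archive"


def build_project_search_url(keyword, species=None, tissue=None, page_size=10):
    """Build a project-search URL filtered by keyword and sample metadata.

    All path and parameter names here are hypothetical.
    """
    params = {"q": keyword, "pageSize": page_size}
    if species:
        params["species"] = species
    if tissue:
        params["tissue"] = tissue
    # urlencode percent-encodes values and joins them as key=value pairs.
    return f"{BASE_URL}/projects?{urlencode(params)}"


url = build_project_search_url("phospho", species="Homo sapiens")
```

The resulting URL can be passed to any HTTP client. Keeping URL construction separate from the network call makes the query logic easy to check without contacting the server.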
 
Title TrackHub registry 
Description The TrackHub registry is a registry of Track Hubs that can be displayed in genome browsers, making those hubs easier to find (https://trackhubregistry.org/).
Type Of Technology Webtool/Application 
Year Produced 2017 
Open Source License? Yes  
Impact Improves the findability of biological data that can be integrated in genome browsers such as Ensembl.
URL https://trackhubregistry.org/
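For readers unfamiliar with what a registered hub contains, a Track Hub is a small set of plain-text files: `hub.txt`, `genomes.txt` and a `trackDb.txt` per assembly. The sketch below renders a minimal example of each. The setting keys follow the UCSC Track Hub conventions, while every value (hub name, labels, e-mail address, assembly and data URL) is a hypothetical placeholder.

```python
def render_stanza(settings):
    """Render one Track Hub stanza as 'key value' lines."""
    return "\n".join(f"{key} {value}" for key, value in settings.items()) + "\n"


# hub.txt: describes the hub and points at the genomes file.
hub_txt = render_stanza({
    "hub": "proteomics_demo",                 # hypothetical hub name
    "shortLabel": "Proteomics demo",
    "longLabel": "Peptide evidence mapped to the genome (demo)",
    "genomesFile": "genomes.txt",
    "email": "curator@example.org",           # placeholder contact
})

# genomes.txt: one stanza per assembly, pointing at its track database.
genomes_txt = render_stanza({
    "genome": "hg38",                         # assumed assembly name
    "trackDb": "hg38/trackDb.txt",
})

# trackDb.txt: one stanza per track, pointing at the remote data file.
trackdb_txt = render_stanza({
    "track": "peptides",
    "bigDataUrl": "https://example.org/hub/hg38/peptides.bb",  # placeholder
    "shortLabel": "Peptides",
    "longLabel": "Identified peptides (demo track)",
    "type": "bigBed",
})
```

Once the three files are hosted at a public URL, the address of `hub.txt` is what would be submitted to a registry or pasted into a genome browser.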
 
Title proBed and proBAM data standards for proteogenomics data 
Description We introduce here two novel standard data formats, proBAM and proBed, that have been developed to address the current challenges of integrating mass spectrometry-based proteomics data with genomics and transcriptomics information in proteogenomics studies. proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data. Therefore, existing popular genomics tools such as SAMtools and Bedtools, and several widely used genome browsers, can already be used to manipulate and visualize these formats "out-of-the-box."
Type Of Technology New Material/Compound 
Year Produced 2018 
Impact https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1377-x We also highlight that a number of specific additional software tools, properly supporting the proteomics information available in these formats, are now available providing functionalities such as file generation, file conversion, and data analysis. 
URL http://www.psidev.info/probam
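Because proBed is an adaptation of BED, a reader written for the standard 12 BED columns can already recover the genomic placement of each record. The sketch below parses those 12 columns and keeps any further columns as an uninterpreted list, since the proteomics-specific columns are defined in the proBed specification and are not reproduced here. The example line is invented for illustration.

```python
BED12_FIELDS = (
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount",
    "blockSizes", "blockStarts",
)


def parse_bed_line(line):
    """Parse the 12 standard BED columns; keep extra columns untouched."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected at least 12 tab-separated columns")
    record = dict(zip(BED12_FIELDS, cols[:12]))
    for key in ("chromStart", "chromEnd", "score",
                "thickStart", "thickEnd", "blockCount"):
        record[key] = int(record[key])
    for key in ("blockSizes", "blockStarts"):
        # Comma-separated lists; a trailing comma is allowed in BED.
        record[key] = [int(x) for x in record[key].split(",") if x]
    record["extra"] = cols[12:]  # format-specific columns, not interpreted
    return record


def block_coordinates(record):
    """Return genomic (start, end) pairs for each block, 0-based half-open."""
    start = record["chromStart"]
    return [(start + offset, start + offset + size)
            for offset, size in zip(record["blockStarts"], record["blockSizes"])]


# Invented example: a feature split across two blocks, e.g. two exons.
line = "chr1\t1000\t1300\tpeptide_1\t900\t+\t1000\t1300\t0\t2\t30,60\t0,240\textra_a\textra_b"
record = parse_bed_line(line)
blocks = block_coordinates(record)
```

A peptide that spans an exon junction is represented by more than one block, which is why the block columns matter for proteogenomics display.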
 
Description Career Q&A 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact This career Q&A with Year 10 students was carried out virtually for the local college, and it is hoped that it will encourage more students to think about entering not only science in general but also the field of bioinformatics.
Year(s) Of Engagement Activity 2020
 
Description DNA workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact An introduction to science for primary school children on the topic of DNA.
Year(s) Of Engagement Activity 2020
 
Description Great Abington KS2 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact A "meet the experts" session and open lab tour for Great Abington KS2 school.
Year(s) Of Engagement Activity 2020