BBSRC-NSF/BIO PTMeXchange: Globally harmonized re-analysis and sharing of data on post-translational modifications

Lead Research Organisation: European Bioinformatics Institute

Department Name: OMICs

Abstract

Proteins are the key functional molecules in cells, performing multiple biological tasks. These include catalysing reactions, providing structure to cellular components, signalling between different cells and regulating the expression of genes, among many others. Proteins are composed of individual amino acids joined into a long chain, which folds into a strictly controlled 3D structure, giving each protein its highly specific function. The advent of genome sequencing, coupled to advances in mass spectrometry and allied computing techniques, has transformed the study of these molecules into a "Big Data" discipline. This particular branch of "'omics" is referred to as proteomics: the high-throughput study (identification and quantification) of all the proteins that can be detected in a given biological sample. Proteomics is used right across biological and biomedical research for profiling systems as varied as human, model organisms including plants, and infectious diseases/microbes, among many others.

Many biological functions are dependent on chemical modifications that proteins can undergo, called Post-translational Modifications (PTMs). Due to the occurrence of PTMs, one particular gene can produce a great number of different protein entities which can potentially have different biological functions. PTMs can provide a rapid mechanism for changing function, such as switching an enzyme (biological catalyst) "on" and "off". Due to their functional importance, sites of PTMs on proteins are frequently the targets for drug design, particularly against cancer.

In this grant, we will use high-quality data analysis pipelines to study the occurrence of the main types of PTMs across hundreds of proteomics datasets in the public domain, covering human and the main model organisms (e.g. mouse, rat and the model plant Arabidopsis). Three world-leading bioinformatics resources are involved in this proposal, namely PRIDE and PeptideAtlas (proteomics resources), and UniProtKB (protein knowledge-base). We expect that UniProtKB will be the main resource to disseminate the outputs of the project to thousands of researchers working in varied disciplines. We will also showcase possible research applications of the huge amount of data that will be generated, for example studying how PTMs have evolved in different groups of species. We will ensure that all the outputs of the project are disseminated via different training and outreach activities, including workshops, training and online help/tutorials.

Technical Summary

The types and sites of post-translational modifications (PTMs) on proteins are rich and diverse, providing cells with a rapid mechanism for adapting function under different conditions. PTMs are widely studied across all areas of fundamental and applied life sciences research. Proteomics approaches using mass spectrometry (MS) provide the sole high-throughput means to detect and localize protein PTMs. Despite their biological importance, PTM-relevant data is collated in the public domain via disparate resources, with a lack of data provenance. An efficient way to improve the situation is to make PTM information derived from proteomics approaches available through UniProtKB (http://www.uniprot.org/), the world-leading protein-knowledgebase. There are hundreds of relevant PTM proteomics datasets in the public domain since the proteomics community is now widely embracing open data policies (e.g. through the resources PRIDE and PeptideAtlas, part of the ProteomeXchange consortium).
We will develop and deploy in the cloud open and reproducible pipelines to consistently re-analyse hundreds of PTM-relevant public datasets coming from human and the main model organisms. Complementary analysis approaches will be used: primarily standard protein database-based searches, but also spectral library-based and open modification searches. Special attention will be devoted to ensuring that PTM localization is accurate, and community guidelines will be developed with that goal in mind. These data will be widely disseminated to UniProtKB and other knowledge-bases (e.g. neXtProt) and made available at PRIDE, PeptideAtlas, and a new resource, PTMeXchange. These new PTM data will be integrated across studies, to increase statistical power at an unprecedented scale and accuracy. Finally, we will perform several follow-up demonstration studies to understand PTM motifs, function and evolution.

Planned Impact

There is the potential for the following impacts:

- The biggest potential impact is on Pharma, within which there are many efforts in drug design to target cell signalling and PTMs. The results will feed into an improved understanding of these processes and could potentially generate new targets. There is also potential for indirect benefits in the biotech industry (improved understanding of PTMs in fungi) and Agrifood (PTMs on plants), e.g. derived through inference of site conservation from model organisms.

- Software vendors or pharmaceutical research and development teams will benefit, since we envisage they may wish to take up our software for local pipelines (e.g. deployed in their own cloud environments). It is important to highlight that all the software developed during the proposal will be open source.

- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, thanks to the re-analysis of public proteomics datasets and the integration of novel PTM proteomics data in UniProtKB.

- More broadly, as proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits across a wide range of areas in basic biology, biomedical and clinical science, as more value will be derived from datasets.

- Life scientists worldwide will be able to benefit from the training activities planned (both face-to-face and via on-line resources).

Staff employed will benefit:

- They will receive further training in one key enabling technology for the BBSRC (proteomics) and exposure to a multi-disciplinary team, and to conferences, workshops and new national and international collaborations.

- They will acquire the skills needed to work with bioinformatics software in a cloud environment, which is becoming increasingly important given the growing size of datasets and the need for suitable IT infrastructure. The team will also use cutting-edge machine learning methods in WP4, skills that are hugely in demand in academic research and industry.

Funded Value:

£463,036

Funded Period:

Nov 19 - Jun 23

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/S01781X/1

Principal Investigator:

Juan Antonio Vizcaino

Research Subject:

Omic sciences & technologies (48%)

Tools, technologies & methods (48%)

Research Topic:

Bioinformatics (48%)

Proteomics (48%)

Organisations

European Bioinformatics Institute (Lead Research Organisation)

People	ORCID iD
Juan Antonio Vizcaino (Principal Investigator)	http://orcid.org/0000-0002-3905-4335
Maria J. Martin (Co-Investigator)	http://orcid.org/0000-0001-5454-2815

Publications


Bai J (2024) Open-source large language models in action: A bioinformatics chatbot for PRIDE database in Proteomics

Bowler-Barnett EH (2023) UniProt and Mass Spectrometry-Based Proteomics-A 2-Way Working Relationship. in Molecular & cellular proteomics : MCP

Camacho OJM (2024) Phosphorylation in the Plasmodium falciparum Proteome: A Meta-Analysis of Publicly Available Data Sets. in Journal of proteome research

Combe CW (2024) mzIdentML 1.3.0 - Essential progress on the support of crosslinking and other identifications based on multiple spectra. in Proteomics

Deutsch E (2021) Universal Spectrum Identifier for mass spectra in Nature Methods

Deutsch EW (2023) Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work. in Journal of proteome research

Deutsch EW (2023) The ProteomeXchange consortium at 10 years: 2023 update. in Nucleic acids research

Ramsbottom KA (2024) Meta-Analysis of Rice Phosphoproteomics Data to Understand Variation in Cell Signaling Across the Rice Pan-Genome

LeDuc RD (2022) Proteomics Standards Initiative's ProForma 2.0: Unifying the Encoding of Proteoforms and Peptidoforms. in Journal of proteome research

Lussi YC (2023) Searching and Navigating UniProt Databases. in Current protocols

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Engagement Activities


Description	So far, we have achieved the following:
- We have developed the concept of scoring and ranking modifications on a decoy amino acid, i.e. one that cannot be modified, to allow independent estimation of the global false localisation rate (FLR). We tested a variety of amino acids as the decoy, on both synthetic and real datasets, demonstrating that the choice of amino acid can make a substantial difference to the estimated global FLR. The corresponding manuscript is now published (PMID: 35640880).
- We are currently building a map of phosphorylation in rice and in Plasmodium falciparum, by re-analysing different groups of phospho-enriched datasets.
- Rice data has already been made available in UniProt. The UniProt web interface has been updated to support PTMeXchange data.
- A new version of PRIDE's Universal Spectrum Identifier functionality has been developed to support linking back from UniProt to the mass spectrometry-based experimental evidence in PRIDE.
- We have developed a first version of PTM-site-centric data formats to enable the dissemination of information between proteomics software, proteomics resources and UniProt.
- We are developing a set of community guidelines for PTM data analysis (of enriched datasets) and dissemination into UniProt.
Exploitation Route	It is still too early to say.
Sectors	Digital/Communication/Information Technologies (including Software), Healthcare
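
The decoy amino acid idea described in the key findings above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the published implementation (see PMID: 35640880 for the actual method). The names `SiteHit`, `decoy_flr` and `target_to_decoy_ratio` are hypothetical, and the normalisation used here (scaling the decoy count by an assumed ratio of modifiable target residues to decoy residues) is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SiteHit:
    """One reported modification site (hypothetical structure)."""
    residue: str   # amino acid reported as modified, e.g. "S", "T", "Y" or the decoy "A"
    score: float   # localisation score; higher means more confident


def decoy_flr(
    hits: List[SiteHit],
    decoy: str = "A",
    target_to_decoy_ratio: float = 1.0,
) -> List[Tuple[SiteHit, float]]:
    """Rank sites by score and estimate a running global FLR at each rank.

    Every hit on the decoy residue is known to be a false localisation,
    because that residue cannot carry the modification. The number of
    decoy hits, scaled by an assumed target:decoy residue ratio, is used
    as an estimate of the false localisations among the target hits.
    """
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    results = []
    decoys = 0
    targets = 0
    for hit in ranked:
        if hit.residue == decoy:
            decoys += 1
        else:
            targets += 1
        estimated_false = decoys * target_to_decoy_ratio
        # With no target hits yet, the estimate is undefined; report 1.0.
        flr = min(1.0, estimated_false / targets) if targets else 1.0
        results.append((hit, flr))
    return results


example_hits = [
    SiteHit("S", 0.99),
    SiteHit("T", 0.95),
    SiteHit("A", 0.90),  # decoy hit: a known false localisation
    SiteHit("Y", 0.80),
    SiteHit("A", 0.50),  # decoy hit
]
example_flrs = [flr for _, flr in decoy_flr(example_hits)]
```

In this toy input, the estimated FLR stays at zero until the first decoy hit appears in the ranking and rises with each further decoy hit. A threshold on this running estimate (for example 1% or 5%) could then be used to decide which sites to report, and changing `decoy` shows how the choice of decoy amino acid affects the estimate.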


Description	The main objective of this grant was to disseminate high-quality PTM data in UniProt, the world's most used protein knowledge-base. With this in mind, we have selected and reanalysed public phospho-enriched datasets from the PRIDE database, and developed methodology (the phosphoAla decoy amino acid), data formats and pipelines to start integrating this information in UniProt, linking back to the original mass spectrometry proteomics evidence in the PRIDE database and/or in the PeptideAtlas resource. We have already done this for rice and Plasmodium phosphorylation, and have worked on a phospho build for human and mouse data. Additionally, we have started working on other modifications such as ubiquitination and acetylation. We are also trying to build a community project including formalised guidelines, so that everyone can contribute data to PTMeXchange. There is still a lot to be achieved, given the huge amount of data available in the public domain.
First Year Of Impact	2023
Sector	Digital/Communication/Information Technologies (including Software),Healthcare


Description	PTM-AI: Improving the detection and functional characterization of post-translational modifications
Amount	£312,499 (GBP)
Funding ID	BB/Y513829/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	02/2024
End	08/2025


Description	The Open Data Exchange Ecosystem in Proteomics: Evolving its Utility
Amount	£131,897 (GBP)
Funding ID	EP/Y035984/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	03/2024
End	02/2026


Title	Availability of PTM (post translational modification) data in UniProt
Description	Via the PTMeXchange Consortium we aim to link PTM data as shown in UniProt to the original mass spectrometry (MS) evidence in proteomics data repositories such as PRIDE. Since the original PTMeXchange grant finished, we have continued to integrate PTM data in UniProt through the PTM-AI grant.
Type Of Material	Database/Collection of data
Year Produced	2023
Provided To Others?	Yes
Impact	Scientists can now start to access reliable PTM data (phosphorylation, and soon also ubiquitination, acetylation, SUMOylation and methylation) in UniProt
URL	https://www.uniprot.org/


Title	PRIDE database
Description	The PRIDE database is the world-leading data repository for mass spectrometry proteomics data (https://www.ebi.ac.uk/pride/). Since it was created in 2004, a lot of functionality has been, and continues to be, added to PRIDE as a result of different BBSRC grants. PRIDE has become the world-leading resource for mass spectrometry (MS) proteomics datasets and has a huge international impact. PRIDE is also leading the activities of the international ProteomeXchange Consortium. Additionally, public proteomics data included in PRIDE is increasingly being reused and integrated in added-value bioinformatics resources: Expression Atlas (quantitative proteomics datasets), Ensembl (proteogenomics information) and UniProt (for post-translational modification data).
Type Of Material	Database/Collection of data
Provided To Others?	Yes
Impact	PRIDE has become the world-leading proteomics data repository and, as such, has an enormous international impact. It enables data reproducibility and data re-use by third parties.
URL	https://www.ebi.ac.uk/pride/


Description	EuBIC-MS Winter School 2024
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This winter school provided workshops and training for researchers in computational mass spectrometry tools and workflows. It also provided lectures and practical workshops covering the identification, quantification, result interpretation and integration of MS data. It aims to provide researchers with the tools they require to increase their usage of proteomics data.
Year(s) Of Engagement Activity	2024
URL	https://eubic-ms.org/events/2024-winter-school/


Description	Open Data Practices in Proteomics
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Part of the Human Proteome Organisation webinar series, this webinar explores the benefits of making data available in the public domain and how this can be achieved. It enables researchers to discover how these practices can unlock new opportunities for research and innovation in the field of proteomics.
Year(s) Of Engagement Activity	2023
URL	https://www.youtube.com/watch?v=-XeuJ4MlqK0


Description	Proteomics Bioinformatics
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Provision of hands-on training in the basics of mass spectrometry (MS) and proteomics bioinformatics. Training was provided on how to use search engines and post-processing software, quantitative approaches, MS data repositories, the use of public databases for protein analysis, annotation of subsequent protein lists, and incorporation of information from molecular interaction and pathway databases. The course is aimed at research scientists with a minimum of a degree in a scientific discipline, including industrial, laboratory and clinical staff, as well as specialists in related fields. It aims to give researchers the knowledge and tools to use proteomics and proteomics bioinformatics more effectively in their own research.
Year(s) Of Engagement Activity	2023
URL	https://www.ebi.ac.uk/training/events/proteomics-bioinformatics-0/