3D-Proteomics: FAIRification of proteomics data for comprehensive integration with structural biology information
Lead Research Organisation:
European Bioinformatics Institute
Department Name: OMICs
Abstract
Proteins are molecules found in all living organisms that provide structure and carry out most of the important functions in a cell, including catalysing (causing or speeding up) chemical reactions and signalling between different cells. Proteomics is the study of the entire set of proteins in a given biological sample such as a cell or an organism like a bacterium, plant or human. Since proteins are essential for so many crucial functions, proteomics can tell us a lot about how organisms work and also about what happens in illnesses, as well as helping to identify potential treatments. This means that proteomics is used across many areas of beneficial biological and biomedical research.
Currently the primary technology used in proteomics is a technique called mass spectrometry (MS), which works by breaking up a protein into small fragments, sorting them and then reporting their mass. The quantity and identity of the protein can then be determined using different software tools. The structure of a protein is also very important, as the way that a protein is organised via folding will help it to carry out its job. The structure also determines how it is able to interact with other proteins; for example, a protein that transports another protein around a cell needs to have a part that binds to it specifically. Protein structure can be studied using techniques like x-ray crystallography, which makes use of the way that different structures diffract (bend) x-rays. A more recent development called cross-linking MS (CL-MS) is a powerful tool for visualising how proteins fold and join together, and it works by running MS on proteins that are linked by specialised chemical reagents called cross-linkers. Unfortunately, CL-MS does not yet have coordinated, mature open standards, and existing datasets are not well linked to other information about protein structure. This means that it is difficult to compare and integrate findings between research groups and that important knowledge may be missed.
It is important that proteomics databases follow the FAIR principles of being easy to find (Findable), free and open source (Accessible), easily shared and processed (Interoperable) and Reusable. Our research groups manage two world-leading databases: the PRoteomics IDEntifications database (PRIDE), which is a repository for proteomics data generated using MS, and the Protein Data Bank (PDB), which is home to 3D structural data for large molecules including proteins. This project will combine these tools with our expertise in CL-MS in order to develop FAIR data standards and software so that proteomics data generated using CL-MS has a common format and processing pipeline, and so that a suite of software tools is made available in order to process and analyse the data freely and easily. PRIDE will be extended to include these standardised CL-MS data formats, and key software tools for data deposition and visualisation will be made available. As a key point, we will create links between PRIDE and PDB in order to allow for joined-up examination of structural data, including integration between the PDB and PRIDE submission systems. This will mean that researchers will be able to more easily analyse proteins and identify links between their research and other projects, even if they don't have access to CL-MS equipment themselves.
The tools and standards that will be generated by this project will benefit researchers across a wide range of biological and biomedical fields, and will provide an interface between proteomics and structural biology information that will enhance and connect research findings. The software will ensure that important and novel structural proteomics data are made accessible and findable, and the standards will maintain its interoperability and reusability. We will make sure that our work is disseminated widely and we will deliver workshops to train and assist researchers in making full use of these valuable resources.
Technical Summary
Structural biology is one field where proteomics techniques are having an increasing impact. In the interface between proteomics and structural biology, cross-linking mass spectrometry (CL-MS) is the most popular and mature approach. Because of the complementarity to established structural methods, CL-MS has gained popularity in the structural biology community.
The PRIDE database has become by far the world-leading resource, currently storing >85% of proteomics datasets worldwide. PRIDE stores >19,000 datasets, with ~1,000 (~5.4%, Nov 2020) coming from CL-MS, Hydrogen Deuterium eXchange (HDX-MS) and other MS-proteomics techniques. At present, PRIDE cannot handle the integration, access and visualisation of CL-MS data in the same way as datasets exported from standard proteomics workflows (CL-MS datasets are therefore labelled as "partial" submissions).
During the last year, two related community white papers have been published which summarise the conclusions of a series of community meetings. These two white papers form the basis for the main objectives of "3D-Proteomics". The first white paper calls for the integration of PDB with federated data resources in other fields (explicitly mentioning PRIDE for proteomics data) (WP3 in this proposal), e.g. to better support integrative modelling approaches, CL-MS being one of the most prominent use cases that should be supported. The second white paper highlights the need to develop appropriate data standards (WP1) and software tools for CL-MS data (WP2), and to improve data deposition (WP3) and data access and visualisation (WP4), in line with other MS-based proteomics approaches. The two white papers clearly demonstrate the need and demand for the outputs of "3D-Proteomics". As an overall result, CL-MS data will be made 'FAIR-er' (Findable, Accessible, Interoperable and Reusable). Additionally, we will standardise the representation of post-translational modifications in both PDB and PDBe-Knowledge-Base (PDBe-KB).
Publications
Deutsch EW
(2023)
Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work.
in Journal of proteome research
Rehfeldt TG
(2023)
ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics.
in Journal of proteome research
Varadi M
(2022)
PDBe and PDBe-KB: Providing high-quality, up-to-date and integrated resources of macromolecular structures to support basic and applied research and education.
in Protein science : a publication of the Protein Society
Title | PRIDE Crosslinking |
Description | We present a new resource in the PRIDE ecosystem: PRIDE Crosslinking (https://www.ebi.ac.uk/pride/archive/crosslinking) aiming to improve data access and visualisation for MS-based proteomics structural biology studies using the web application xiVIEW, making this data more FAIR (Findable, Accessible, Interoperable and Re-usable). Additionally, it aims to integrate crosslinking MS data with data in the Protein Data Bank (PDB), including PDBe-KB (PDB in Europe-Knowledge-Base), PDB-Dev and AlphafoldDB (database of predicted protein structures). |
Type Of Material | Database/Collection of data |
Year Produced | 2024 |
Provided To Others? | Yes |
Impact | At the moment the resource is quite new and still in development. The software is deployed in the EMBL-EBI cloud infrastructure, which is based on Kubernetes, bringing the tools close to the spatial proteomics data stored in PRIDE. In terms of long-term sustainability, this synergistic approach is the best way forward. On one hand, database providers (in this case the PRIDE database) take responsibility for the data storage and representation. On the other hand, the researchers take responsibility for maintaining the domain-specific open software used (in this case xiVIEW) to access and visualise the data, also getting extra exposure and recognition from doing this through a widely used resource such as PRIDE. As the final outcome, spatial proteomics data in the PRIDE database is more FAIR. Analogous approaches could be followed in the future for different types of proteomics data. |
URL | https://www.ebi.ac.uk/pride/archive/crosslinking |
Description | EuBIC-MS Winter School 2024 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This winter school provided workshops and training for researchers in computational mass spectrometry tools and workflows. It also provided lectures and practical workshops covering the identification, quantification, result interpretation and integration of MS data. It aimed to provide researchers with the tools they require to increase their usage of proteomics data. |
Year(s) Of Engagement Activity | 2024 |
URL | https://eubic-ms.org/events/2024-winter-school/ |
Description | Open data Practises in Proteomics |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Part of the Human Proteome Organisation webinar series, this webinar explores the benefits of making data available in the public domain and how this can be achieved. It enables researchers to discover how these practices can unlock new opportunities for research and innovation in the field of proteomics. |
Year(s) Of Engagement Activity | 2023 |
URL | https://www.youtube.com/watch?v=-XeuJ4MlqK0 |
Description | Proteomics Bioinformatics |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Provision of hands-on training in the basics of mass spectrometry (MS) and proteomics bioinformatics. Training was provided on how to use search engines and post-processing software, quantitative approaches, MS data repositories, the use of public databases for protein analysis, annotation of subsequent protein lists, and incorporation of information from molecular interaction and pathway databases. The course is aimed at research scientists with a minimum of a degree in a scientific discipline, including industrial, laboratory and clinical staff, as well as specialists in related fields. It looks to provide researchers with the knowledge and tools to use proteomics and proteomics bioinformatics more effectively in their own research. |
Year(s) Of Engagement Activity | 2023 |
URL | https://www.ebi.ac.uk/training/events/proteomics-bioinformatics-0/ |