BBSRC-NSF/BIO. Globally harmonized re-analysis of Data Independent Acquisition (DIA) proteomics datasets enables the creation of new resources
Lead Research Organisation:
EMBL - European Bioinformatics Institute
Department Name: Proteomics
Abstract
Proteins are important molecules that carry out most of the activities that take place in each cell of an organism, such as transporting substances and providing structural support. A proteome is the complete set of all the proteins in a system or organism under certain conditions at a given time, and proteomics is the large-scale study of proteomes. Proteomics applies to many parts of biology as it can tell us a lot about how a system or organism works, and can provide vital information about illnesses and potential treatments.
The main technique used in proteomics research is mass spectrometry (MS), which works by breaking up a mixed protein sample into small fragments, sorting them and then reporting their mass. This information is used to determine the identity and amount of the proteins. Recently, an MS approach called data independent acquisition (DIA) has become popular. Traditional MS, called data dependent acquisition (DDA), is biased towards the fragments that have the strongest signal, but DIA is not limited by this. This means that DIA allows researchers to quantify proteins that are present even in very small amounts, giving a better representation of the proteome. Spectral libraries are collections of pre-annotated experimental MS outputs that are used in DIA data analysis. Recently, spectral libraries have also been generated using machine learning, which provides a great opportunity for novel artificial intelligence (AI) approaches to proteomics research. Overall, quantitative DIA data is very rich, as it represents a comprehensive digital record of the proteome that can be analysed using different tools and approaches over time.
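The difference between the two acquisition strategies can be illustrated with a toy sketch. It is a simplification under stated assumptions, not a model of any real instrument: DDA is represented as picking the top-N most intense precursor ions, and DIA as fragmenting everything inside fixed-width m/z windows. The function names, the example m/z and intensity values, and the window settings are all hypothetical.

```python
# Toy illustration only: each precursor is a hypothetical (m/z, intensity) pair.

def dda_select(precursors, top_n):
    """DDA-style selection: keep only the top_n most intense precursors."""
    return sorted(precursors, key=lambda p: p[1], reverse=True)[:top_n]


def dia_select(precursors, mz_start, mz_end, window_width):
    """DIA-style selection: assign every precursor to a fixed-width m/z window."""
    windows = {}
    lo = mz_start
    while lo < mz_end:
        hi = min(lo + window_width, mz_end)
        # Everything inside the window is fragmented, regardless of intensity.
        windows[(lo, hi)] = [p for p in precursors if lo <= p[0] < hi]
        lo = hi
    return windows


# Hypothetical values: two of the five precursors have a very weak signal.
precursors = [
    (402.2, 9.0e6),
    (455.7, 1.2e4),
    (512.3, 4.5e6),
    (588.9, 8.0e3),
    (640.1, 2.1e6),
]

dda = dda_select(precursors, top_n=2)
dia = dia_select(precursors, mz_start=400.0, mz_end=700.0, window_width=100.0)
dia_fragmented = [p for group in dia.values() for p in group]

print("DDA fragments:", dda)
print("DIA fragments:", dia_fragmented)
```

In this sketch the weak precursors are never selected by the DDA-style function, while the DIA-style function covers all five, which is the intuition behind DIA giving fewer missing values.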
The groups involved in this project have been working to make DIA proteomics data freely available worldwide via the ProteomeXchange (PX) consortium, and to ensure that this data is generated and reported using consistent standards via the Proteomics Standards Initiative (PSI). This publicly available data provides a great opportunity for researchers to reconfirm original results and obtain new insights. However, there have so far been very few re-analysis efforts. This may be due to the complex nature of DIA data analysis, and also to the limited availability of spectral libraries.
Our project aims to address this by generating new knowledge from the re-analysis of DIA proteomics datasets and by creating novel infrastructure to better support public DIA proteomics data and spectral libraries. In particular, making spectral libraries Findable, Accessible, Interoperable and Re-usable (FAIR) will enhance the reproducibility of published studies. To achieve these goals we will produce reliable, high-quality protein expression (i.e. protein production) and abundance information from the re-analysis of manually curated public DIA quantitative datasets, and we will make it freely available in PX and via EMBL-EBI's Expression Atlas, where it can be used by non-experts in proteomics. We will also create protein co-expression and abundance maps for different biological conditions using the DIA re-analyses and make them available via PX. This would be the first time that such maps are generated from such large amounts of DIA proteomics data, and they will exploit the distinctive strengths of DIA datasets, such as their size and coverage. Further, we will develop novel infrastructure and data standards to make DIA proteomics data and, as a key point, spectral libraries FAIR. This will involve creating open source tools and infrastructure, and developing PSI standards.
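As a rough illustration of what a protein co-expression map contains, the sketch below computes pairwise Pearson correlations between protein abundance profiles across conditions. This is one common way to quantify co-expression and is an assumption here; the project description does not state which method will be used. The protein names and abundance values are invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles.

    Assumes neither profile is constant (a constant profile has zero
    variance and would cause a division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical abundances of three proteins measured in four conditions.
abundance = {
    "PROT_A": [1.0, 2.0, 3.0, 4.0],
    "PROT_B": [2.0, 4.0, 6.0, 8.0],
    "PROT_C": [4.0, 3.0, 2.0, 1.0],
}

# The "map" is the correlation for every unordered pair of proteins.
coexpression = {
    (a, b): pearson(abundance[a], abundance[b])
    for a, b in combinations(sorted(abundance), 2)
}

for pair, r in coexpression.items():
    print(pair, round(r, 3))
```

Proteins whose abundances rise and fall together across conditions score close to 1, and proteins with opposite trends score close to -1; real analyses would also need to handle missing values and normalisation, which this sketch ignores.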
The co-expression maps, infrastructure and standards that will be generated by this project will benefit researchers across a wide range of biological and biomedical fields, and will provide the ability to strengthen and connect existing research findings. We will disseminate our work widely to train and assist researchers in making full use of these valuable resources.
Technical Summary
Proteomics is a key technology for life sciences research, enabling large scale quantitative measurement of many proteins under different conditions. Recently there has been rapid growth in data independent acquisition (DIA) mass spectrometry (MS) for quantitative proteomics over the more traditional data dependent acquisition (DDA). DIA has the potential to deliver more reproducible measurements, with fewer missing values, but relies on more complex informatics for identifying proteins, often employing previously annotated spectral libraries (SLs).
We have a leading role in enabling unified international data deposition and access via the ProteomeXchange (PX) consortium of databases. There are now 1000s of raw DIA datasets in the public domain with vast potential value for informing on the biology of the samples analysed. However, the value is mostly locked at present, since the public records lack the SLs used to identify proteins, and there is a knowledge gap in how to use public SLs reliably to re-analyse datasets at scale. There is also significant potential for errors in SL construction to be "silent", meaning that incorrect protein identifications in published results cannot be detected. Worse, if the same SLs are used in multiple studies, the errors can be falsely replicated.
In this "DIA-eXchange" project our goal is to unlock the potential in public DIA data by developing database(s), standards and software so that when DIA proteomics datasets are published the SL (and source evidence) is deposited into PX to make DIA proteomics "FAIR"-compliant. This will enable other groups to verify published findings and re-analyse DIA data for new purposes. We will benchmark open source software and different methods of creating SLs to develop best practice guidelines, and re-analyse 100s of datasets ourselves, depositing the standardised/uniform protein abundance values in "added value" biologist-focussed databases, namely EMBL-EBI's Expression Atlas.
Publications


Claeys T
(2023)
lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation
in Nature Communications

George N
(2024)
Expression Atlas update: insights from sequencing data at both bulk and single cell level
in Nucleic Acids Research

Perez-Riverol Y
(2024)
The PRIDE database at 20 years: 2025 update
in Nucleic Acids Research
Title | Availability of DIA (Data Independent Acquisition) proteomics expression data in Expression Atlas |
Description | Expression Atlas (https://www.ebi.ac.uk/gxa/home) is an open science resource at the European Bioinformatics Institute that gives users a powerful way to find information about gene and protein expression. We have made available there the results of the re-analysis of 10 Data Independent Acquisition (DIA) proteomics datasets. |
Type Of Material | Database/Collection of data |
Year Produced | 2021 |
Provided To Others? | Yes |
Impact | To the best of our knowledge, this is the first time that public DIA proteomics datasets have been re-analysed and the results have been made available in an open resource such as Expression Atlas. |
URL | https://www.ebi.ac.uk/gxa/home |
Description | EuBIC-MS Winter School 2024 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This winter school provided workshops and training for researchers in computational mass spectrometry tools and workflows. It also provided lectures and practical workshops covering the identification, quantification, result interpretation and integration of MS data. It aims to provide researchers with the tools they require to increase their usage of proteomics data. |
Year(s) Of Engagement Activity | 2024 |
URL | https://eubic-ms.org/events/2024-winter-school/ |
Description | Open data Practises in Proteomics |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Part of the Human Proteome Organisation webinar series, this webinar explores the benefits of making data available in the public domain and how this can be achieved. It enables researchers to discover how these practices can unlock new opportunities for research and innovation in the field of proteomics. |
Year(s) Of Engagement Activity | 2023 |
URL | https://www.youtube.com/watch?v=-XeuJ4MlqK0 |
Description | Proteomics Bioinformatics |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Provision of hands-on training in the basics of mass spectrometry (MS) and proteomics bioinformatics. Training was provided on how to use search engines and post-processing software, quantitative approaches, MS data repositories, the use of public databases for protein analysis, annotation of subsequent protein lists, and incorporation of information from molecular interaction and pathway databases. The course is aimed at research scientists with a minimum of a degree in a scientific discipline, including industrial, laboratory and clinical staff, as well as specialists in related fields. It aims to give researchers the knowledge and tools to use proteomics and proteomics bioinformatics more effectively in their own research. |
Year(s) Of Engagement Activity | 2023 |
URL | https://www.ebi.ac.uk/training/events/proteomics-bioinformatics-0/ |