In silico mass spectrometry for biologists: Tools and resources for next-generation proteomics

Lead Research Organisation: European Bioinformatics Institute
Department Name: Proteomics

Abstract

Proteins are the key functional molecules in cells, performing multiple biological tasks. These include catalysing reactions, providing structure to cellular components, signalling between different cells and, as transcription factors, regulating the expression of other genes. The recent advent of genome sequencing, coupled to advances in mass spectrometry (MS) and allied computing techniques, has transformed the study of these molecules into a "Big Data" discipline. This particular branch of the "'omics" is referred to as proteomics: the high-throughput study (identification and, importantly, quantification) of all the proteins that can be detected in a given biological sample. For example, discovering which proteins are more abundant in different life cycle stages (during development or during ageing) may give us clues as to which biological pathways control these processes. Proteomics is used right across biological and biomedical research for profiling systems as varied as plants, model organisms, infectious diseases/microbes and chronic diseases of humans and animals, among many others.

Currently, the primary technology used in proteomics is MS. Each assay (or scan) in a given MS run (one given experiment) provides information about which proteins are present in our samples, by studying the peptides generated from them using a defined enzyme (e.g. trypsin). In the mass spectrometer, each peptide is broken up, and the instrument reports the masses of the different fragments in so-called mass spectra. In the most traditional and still most widely used proteomics approaches, called 'data dependent acquisition' (DDA) techniques, only the most abundant peptides are measured by the instrument, and many of the remaining peptides are simply not detected and/or measured. This leaves the possibility that invaluable biological information about the relative levels of proteins in the cell is missed. Recently, a novel group of proteomic approaches that can overcome some of the limitations of DDA, known as Data Independent Acquisition (DIA) methods, has started to come into use. Excitingly, these methods capture a near-complete digital record of the proteome in that experiment, but they require more sophisticated software tools to mine the resulting DIA maps. Relatively few groups are expert in their use, limiting the potential of the community to analyse the growing number of DIA data sets. Additionally, the current software tools are not yet robust enough, nor are they available on user-friendly web-based platforms that the average biologist can use.
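As an illustration of how peptides are generated from proteins with a defined enzyme, the following is a minimal Python sketch of an in silico tryptic digest. It assumes the commonly used simplified rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name, the example sequence and the `missed_cleavages` parameter are hypothetical and are not taken from the project's software.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Split a protein sequence into tryptic peptides.

    Simplified rule (an assumption of this sketch): cleave after K or R,
    except when the following residue is P.
    """
    # First pass: fully cleaved pieces.
    pieces = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing piece that does not end in K or R.
        pieces.append(sequence[start:])

    # Second pass: join neighbouring pieces to model missed cleavages.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# Hypothetical toy sequence, chosen only to exercise the "not before P" rule.
print(tryptic_digest("AKRPGKCR"))
print(tryptic_digest("AKRPGKCR", missed_cleavages=1))
```

Real search engines apply further constraints (peptide length limits, modifications, alternative enzymes), which this sketch deliberately leaves out.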

In this project, we will develop and build open software able to analyse proteomics datasets generated using these novel DIA proteomics approaches in a robust manner, so they can be used in the future by anyone in the community. This will be achieved by making the software available on the European Bioinformatics Institute's "cloud" IT infrastructure. When the project finishes, the generated software pipelines will be ready to be deployed in other similar infrastructures in the UK and internationally. We will also improve and refine current analysis methods by using proteomics data already available in the public domain to extend existing collections of mass spectra, called spectral libraries. This will support a rich portfolio of (re)analysis methods for the user base, with 'plug and play' components, including support for the detection of so-called post-translational modifications (PTMs), which are notoriously difficult to identify otherwise.

The project outputs will greatly benefit a wide range of biological and biomedical researchers interested in proteomic techniques for interrogation of samples, even if they don't have access to mass spectrometers. We will disseminate the outputs by delivering workshops, training and online help/tutorials.

Technical Summary

To date, mass spectrometry (MS)-based proteomics has been largely driven by Data-Dependent Acquisition (DDA) approaches, where complex mixtures of peptide analytes are separated via liquid chromatography and elute into the instrument. This approach is limited by instrument throughput and the stochastic sampling of the analyte, leading to under-sampling and poor detection of low abundance proteins. To address such limitations, Data Independent Acquisition (DIA) approaches are gaining popularity, led by SWATH-MS and MSe/HD-MSe. These methods sample the analyte more uniformly and capture richer, deeper data, but generate data sets that are more challenging to interrogate and require sophisticated software solutions. Indeed, the lack of standard tools and the extra expertise required are holding back the wider adoption of DIA proteomics approaches. Here, we will develop open analysis pipelines for different DIA techniques using non-commercial software (e.g. OpenSWATH, DIA-Umpire, Skyline, etc.), and deploy them in the EBI "Embassy Cloud" infrastructure. We will enable easy access to robust and portable pipelines that can also be deployed in other cloud environments, for wider community benefit. In addition, we will extend the functionality of the world-leading proteomics resource (PRIDE Archive at EMBL-EBI) and related tooling, and extend the data standard mzTab to create a common output format for the analysis. A further compelling aspect is the link to PRIDE Archive, which will support construction of robust spectral libraries (from different instruments and species) that can be used by us and our users to conduct novel DIA analyses. This will make good use of the growing DDA and DIA public datasets in PRIDE Archive to extract new knowledge. Novel results will be communicated to the original submitter and the rest of PRIDE Archive users, and fed into three EMBL-EBI resources: Ensembl, UniProt and the Expression Atlas.
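The difference in sampling between DDA and DIA can be made concrete with a toy Python sketch. It assumes a heavily simplified model: a DDA cycle fragments only the N most intense precursors, while a DIA cycle steps through fixed-width isolation windows and so covers every precursor inside the scanned m/z range. The function names, the precursor values and the 400-1200 m/z range with 25 m/z windows are illustrative assumptions, not settings taken from the project.

```python
def dda_top_n(precursors, n):
    """Return the n most intense precursors (those a toy DDA cycle would fragment)."""
    return sorted(precursors, key=lambda p: p["intensity"], reverse=True)[:n]


def dia_windows(precursors, mz_start, mz_end, width):
    """Assign every precursor in [mz_start, mz_end) to a fixed-width isolation window.

    Returns a dict mapping window index -> list of precursor names.
    """
    windows = {}
    for p in precursors:
        if mz_start <= p["mz"] < mz_end:
            index = int((p["mz"] - mz_start) // width)
            windows.setdefault(index, []).append(p["name"])
    return windows


# Hypothetical precursors: two abundant and two low-abundance peptides.
precursors = [
    {"name": "pepA", "mz": 420.2, "intensity": 9e6},
    {"name": "pepB", "mz": 455.7, "intensity": 5e4},
    {"name": "pepC", "mz": 612.3, "intensity": 3e6},
    {"name": "pepD", "mz": 618.9, "intensity": 8e3},
]

# Toy DDA: only the two most intense precursors are fragmented.
print([p["name"] for p in dda_top_n(precursors, 2)])

# Toy DIA: every precursor falls into some window, including pepB and pepD,
# but co-isolated peptides (pepC and pepD) end up in the same window.
print(dia_windows(precursors, 400.0, 1200.0, 25.0))
```

The shared window for pepC and pepD hints at why DIA data are harder to interrogate: fragments from co-isolated peptides are mixed in the same spectra and must be untangled by software. Real instruments also apply dynamic exclusion, variable window widths and other refinements that this sketch ignores.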

Planned Impact

There is the potential for the following impacts:

- Mass spectrometry vendors (at least SCIEX and Waters) will benefit through the free availability of robust, reliable, reproducible and improved pipelines for the analysis of DIA proteomics datasets. Once these pipelines are robust, vendors will be under less pressure to keep developing their own commercial software solutions, freeing resources that could be focused on other efforts.

- Software vendors or pharmaceutical research and development teams may benefit, since we envisage they may wish to take up our software for local pipelines (e.g. deployed in their own cloud environments). It is important to highlight that all the software developed during the proposal will be open source or at least free-to-use (if the original software used to build the analysis pipelines is not open source). Commercial software will not be part of the developed pipelines.

- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, thanks to the re-analysis of public DIA proteomics datasets and the integration of novel proteomics data in Ensembl, UniProt and the Expression Atlas.

- Leveraging research partnerships and funding with industry via knowledge exchange and innovation funding has been successfully demonstrated at UoM. We have been successful with MRC CiC, P2D, Wellcome Trust ISSF, HEIF, and EPSRC IAA funding streams, which are all aimed at promoting and driving impact. Manchester projects with an MS foundation have always been successful in the life and biomedical sciences, in themselves generating high impact papers and multiple millions of GBP in industry and key stakeholder support.

- There is potential for our infrastructure to assist in clinical biomarker discovery, since DIA-based methods (such as SWATH-MS and MSe/HD-MSe) are growing rapidly in use in this space, as exemplified by the Stoller Biomarker Discovery Centre Manchester (where some of the applicants are involved).

- More broadly, as proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits across a wide range of areas in basic biology and biomedical or clinical science, as more value will be derived from datasets. This includes information on post-translational modifications (PTMs), which are key regulators of cell signalling and thus often studied in the clinical context.

Staff employed will benefit from:

- Further training in one key enabling technology for the BBSRC (proteomics) and exposure to conferences, workshops and new national and international collaborations.

- Acquiring the skills needed to work with bioinformatics software in a cloud environment, something that is becoming increasingly important with the growing size of datasets and the need for suitable IT infrastructure.
 
Description The work is almost complete now (the revised version of the corresponding manuscript is under review; a pre-print is available at https://www.biorxiv.org/content/10.1101/2021.06.08.447493v2). There will be two main outputs from this award:

- An open analysis pipeline that can be used to analyse Data Independent Acquisition (DIA) proteomics data generated using the SWATH-MS approach. DIA is an experimental approach that is gaining popularity in the field and is especially suited to quantitative proteomics. The pipeline is available at https://github.com/PRIDE-reanalysis/DIA-reanalysis.
- A proof-of-concept study demonstrating that it is possible to re-analyse public DIA proteomics datasets. We did this using 10 datasets available in the PRIDE database and have integrated the results into the EBI resource Expression Atlas (https://www.ebi.ac.uk/gxa/home).

This is, to the best of our knowledge, the first time that systematic re-analysis of public DIA datasets has been performed.
Exploitation Route Yes, definitely. The open analysis pipeline can be re-used by others, and the whole approach of re-using DIA public datasets can also be put to work by others in the community. Additionally, the quantitative results have been integrated in the resource Expression Atlas, so that they can be reused further by anyone in the community.
Sectors Digital/Communication/Information Technologies (including Software),Healthcare

 
Description White paper about the data management practices of human sensitive proteomics data
Geographic Reach Multiple continents/international 
Policy Influence Type Contribution to new or Improved professional practice
URL https://www.sciencedirect.com/science/article/pii/S153594762100044X
 
Title Availability of DIA (Data Independent Acquisition) proteomics expression data in Expression Atlas 
Description Expression Atlas (https://www.ebi.ac.uk/gxa/home) is an open science resource at the European Bioinformatics Institute that gives users a powerful way to find information about gene and protein expression. We have made available there the results of the re-analysis of 10 Data Independent Acquisition (DIA) proteomics datasets. 
Type Of Material Data handling & control 
Year Produced 2021 
Provided To Others? Yes  
Impact To the best of our knowledge, this is the first time that public DIA proteomics datasets have been re-analysed and the results have been made available in an open resource such as Expression Atlas. 
URL https://www.ebi.ac.uk/gxa/home
 
Title PRIDE database 
Description The PRIDE database is the world-leading data repository for mass spectrometry (MS) proteomics data (https://www.ebi.ac.uk/pride/). Since its creation in 2004, many capabilities have been added to PRIDE as a result of different BBSRC grants, and more continue to be added. PRIDE has a large international impact and also leads the activities of the international ProteomeXchange Consortium. Additionally, public proteomics data included in PRIDE are increasingly being reused and integrated in added-value bioinformatics resources: Expression Atlas (quantitative proteomics datasets), Ensembl (proteogenomics information) and UniProt (post-translational modification data). 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact PRIDE has become the world-leading proteomics data repository and, as such, has an enormous international impact. It enables data reproducibility and data re-use by third parties. 
URL https://www.ebi.ac.uk/pride/
 
Title MaxDIA 
Description MaxDIA is a software platform for analyzing data-independent acquisition (DIA) proteomics data within the MaxQuant software environment. Using spectral libraries, MaxDIA achieves deep proteome coverage with substantially better coefficients of variation in protein quantification than other software. MaxDIA is equipped with accurate false discovery rate (FDR) estimates on both library-to-DIA match and protein levels, including when using whole-proteome predicted spectral libraries. This is the foundation of discovery DIA: hypothesis-free analysis of DIA samples without library and with reliable FDR control. MaxDIA performs three- or four-dimensional feature detection of fragment data, and scoring of matches is augmented by machine learning on the features of an identification. MaxDIA's bootstrap DIA workflow performs multiple rounds of matching with increasing quality of recalibration and stringency of matching to the library. Combining MaxDIA with two new technologies (BoxCar acquisition and trapped ion mobility spectrometry) both lead to deep and accurate proteome quantification. Our contribution in this software was to help integration of the output of the tool into PRIDE. 
Type Of Technology Software 
Year Produced 2021 
Impact MaxDIA is part of the MaxQuant software tool (from version 2.0), probably the most widely used analysis software tool in proteomics 
 
Title Open analysis pipeline for DIA SWATH-MS proteomics data 
Description For the analysis of the SWATH-MS data, we constructed a reanalysis pipeline with Nextflow. This choice allows the data processing to be executed in single-computer mode, on HPC clusters, and on cloud computing platforms. The pipeline steps can be broken down into raw data conversion, quality control and SWATH window processing, OpenSWATH target generation and data analysis, FDR analysis and multi-run alignment with PyProphet and TRIC, and finally statistical analysis with MSstats and upload to Expression Atlas via custom submission scripts. All analysis software is containerised, either from available software releases or built for purpose, to ensure a well-defined compute environment and software compatibility. The pipeline is available at https://github.com/PRIDE-reanalysis/DIA-reanalysis. 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact This pipeline can be used to re-analyse public DIA SWATH-MS proteomics datasets.
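
The FDR analysis step named in the pipeline description above rests on the general target-decoy idea: matches to deliberately false "decoy" entries are used to estimate how many target matches above a score threshold are likely to be wrong. The Python sketch below illustrates that idea only. It is not PyProphet's algorithm, which uses semi-supervised learning and more refined error estimation. The function name, the scores and the simple decoys/targets estimator are assumptions made for illustration.

```python
def target_decoy_qvalues(matches):
    """Estimate q-values from (score, is_decoy) pairs.

    FDR at a given rank is estimated as (#decoys so far) / (#targets so far),
    and the q-value is the lowest FDR at which a match would still be accepted.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)

    # Running FDR estimate while walking down the ranked list.
    fdrs = []
    targets = 0
    decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # Convert to q-values: minimum FDR at this rank or any lower-scoring rank.
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qvalues)]


# Hypothetical scores: higher means a better match.
matches = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, True)]
for score, is_decoy, q in target_decoy_qvalues(matches):
    print(score, "decoy" if is_decoy else "target", round(q, 3))
```

Target matches whose q-value falls below a chosen cut-off (1% is a common choice in proteomics) would then be carried forward to alignment and quantification.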