Remote streaming 3D visualisation platform for raw and analysed data from biological mass spectrometry repositories

Lead Research Organisation: University of Manchester
Department Name: Medical and Human Sciences

Abstract

Biologists increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control the function of every living organism. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans, as well as how they break down under disease and how this breakdown can be averted. Often the systems biology approach starts with a 'snapshot' of a particular biological sample. Mass spectrometry is a pervasive technique for gaining a snapshot of the proteins or metabolites in a sample, and it does this by ionising the sample and then measuring each constituent compound's mass and quantity based on the resulting charge. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Due to technical and biological variations, it is usually necessary to analyse the sample a number of times to get reliable readings. Interesting biochemicals can also be broken up into characteristic fragments and these measured, which often gives a confident identification of that biochemical.

All this has led systems biology to become a progressively computational discipline. As the datasets become ever larger, however, there is a danger that the process becomes more and more opaque and inaccessible to mass spectrometry practitioners and so more likely to be used as a 'black box'. It is therefore vitally important that tools and platforms are available that allow expert user verification, validation and interpretation of results by checking the raw data acquired, otherwise bias, errors and false assumptions in processing will be routinely overlooked. The massive datasets pose a challenge, however, as existing tools are slow to load and process the data for visualisation, which severely limits productivity and, because of limited memory, precludes the integrated comparison of whole experiments. Part of the reason for this is that existing data formats have not been designed for streamlined retrieval of regions of interest, or for retrieval at the varying levels of detail necessary for fast, efficient visualisation. We propose to design such a representation for standards-compliant data, and from that we will demonstrate interactive 3D visualisations from local storage for the first time without delay. Since memory overhead issues are also mitigated, novel visualisation schemes integrating results and raw data across complete experiments will be possible, greatly facilitating the quality control, verification, validation and expert interpretation of MS analyses.

Furthermore, through development of specialised image compression, we will demonstrate real-time remote visualisation across the Internet, in a manner similar to Google Earth but for the first time extended for the demands of mass spectrometry visualisation. The European Bioinformatics Institute at Hinxton, Cambridge, has through the ProteomeXchange consortium recently launched raw data deposition into its PRIDE public data repositories, which store vast amounts of data from publicly funded experiments from around the world. As of September 2012, it holds 324 million mass spectra. Our remote visualisation platform will demonstrate the potential for immediate and seamless raw data access linked from online publications and web resources, which would lead to substantially improved facility, accessibility and re-use of these strategic community data sources.

Technical Summary

We propose a 3D visualisation platform for cross-validating complete experimental designs of raw and analysed data from proteomics and metabolomics mass spectrometry (MS). Current tools are critically limited by their loading and handling of the massive datasets, which severely limits productivity and precludes the integrated comparison of whole experiments. By designing a visualisation-centric raw data representation for standards-compliant MS data, we will demonstrate GPU-accelerated streaming 2D/3D interactive visualisations from local storage for the first time without delays. Since memory overhead issues are also mitigated, novel visualisation schemes integrating results and raw data across complete experiments will be possible, greatly facilitating the quality control, verification, validation and expert interpretation of MS analyses. Furthermore, through development of biologically-driven signal compression, we will demonstrate real-time remote visualisation across the Internet, showing the potential for public raw data repositories such as PRIDE to enable immediate and seamless visual access to data from publications and web catalogues. This would lead to substantially improved facility, accessibility and re-use of these strategic community data resources.

The compression format will have ramifications for storage of MS data in general. We will provide lossless encoding so that the complete original datasets can also be reconstructed exactly but with the space-saving benefits of our domain-specific compression. We will work with the PSI on an industry-standard raw data representation with pluggable codecs. The visualisation platform will be aimed at standard PCs with consumer grade GPU cards. It will be integrated into PRIDE Inspector using OpenGL with GPU-computation employed for terrain rendering and decompression. We will define an API that allows modular re-use as the interactive visualisation subsystem of Proteosuite and potentially other tools.
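As a rough illustration of what "lossless encoding" with "pluggable codecs" could look like, the following Python sketch registers interchangeable codecs behind a common encode/decode interface and checks that each one reconstructs the original values exactly. Everything in it is an assumption made for illustration: the registry `CODECS`, the codec names, and the byte-shuffle plus zlib scheme are generic stand-ins, not the domain-specific compression or the PSI representation proposed here.

```python
import struct
import zlib

def pack_doubles(values):
    """Serialise a list of floats as little-endian 64-bit doubles."""
    return struct.pack(f"<{len(values)}d", *values)

def unpack_doubles(blob):
    """Inverse of pack_doubles."""
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))

def shuffle(raw, width=8):
    """Group byte 0 of every value, then byte 1, and so on (byte shuffling)."""
    return b"".join(raw[k::width] for k in range(width))

def unshuffle(blob, width=8):
    """Inverse of shuffle: interleave the byte planes back into values."""
    n = len(blob) // width
    planes = [blob[k * n:(k + 1) * n] for k in range(width)]
    return bytes(planes[i % width][i // width] for i in range(len(blob)))

# Hypothetical codec registry: name -> (encode, decode). All entries are lossless.
CODECS = {
    "raw": (pack_doubles, unpack_doubles),
    "zlib": (lambda v: zlib.compress(pack_doubles(v)),
             lambda b: unpack_doubles(zlib.decompress(b))),
    "shuffle-zlib": (lambda v: zlib.compress(shuffle(pack_doubles(v))),
                     lambda b: unpack_doubles(unshuffle(zlib.decompress(b)))),
}

def encode(name, values):
    return CODECS[name][0](values)

def decode(name, blob):
    return CODECS[name][1](blob)

# Toy m/z axis (made-up values): smooth, so the high-order bytes repeat heavily.
mz = [100.0 + 0.001 * i for i in range(2000)]
for name in CODECS:
    blob = encode(name, mz)
    assert decode(name, blob) == mz  # exact reconstruction
    print(name, len(blob), "bytes")
```

A reader only needs the codec name stored alongside the data to pick the matching decoder, which is the property that makes codecs swappable without changing the container format.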

Planned Impact

As well as the academic beneficiaries, the proposed research has prospective impact for the mass spectrometry industry. The visualisation platform will increase the capacity for validation, cross-validation and re-use of mass spectrometry data. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive.

The proposed visualisation platform could be seen to be in competition with software products from vendors and instrument manufacturers, particularly when integrated into Proteosuite. However, since our software is distributed with a permissive license allowing for its unrestricted re-use in other software packages, both free and commercial, we hope that our work will aid commercial software products similarly and therefore raise the level of the whole field.

There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our stated aim is to break the limitations current tools have with loading and handling massive datasets so that quality control, verification, validation and expert interpretation of mass spectrometry analyses can be facilitated, and accessibility, facility and re-use of community raw data sources such as PRIDE can be improved. This improvement will disseminate down to the public through reduced resources, costs and overheads required for environmental, biological and biomedical discoveries and the characterisation of those discoveries. Since the platform will enable improved quality control and diagnostics, it has the potential to characterise potentially interfering effects and false discoveries, therefore avoiding subsequent misallocation of resources.

The PDRA employed on this grant will be encouraged to spearhead public dissemination and will benefit from the unique intensive cross-disciplinary interaction at CADET and EBI that brings together proteomics, metabolomics, bioinformatics and data warehousing expertise, all working towards the same goal.

 
Description As data rates rise, there is a danger that informatics for high-throughput biological LC-MS (Liquid Chromatography-Mass Spectrometry) becomes more and more opaque and inaccessible to practitioners. It is therefore critical that efficient visualisation tools are available to facilitate quality control, verification, validation, interpretation and sharing of raw MS data and the results of MS analyses.

Currently MS data is stored as contiguous spectra. Recall of individual spectra is relatively quick but panoramas, zooming and panning across whole datasets necessitate processing/memory overheads impractical for interactive use. Moreover, visualisation is challenging if significant quantification data at MS1 level is missing due to data-dependent acquisition of MS2 spectra.
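The access-pattern problem can be sketched in a few lines of Python. The example below is illustrative only: the toy data, the tile sizes and the function names `region_scan`, `build_tiles` and `region_tiles` are all assumptions, and the fixed-size tile index is a generic stand-in rather than the representation developed on this grant. It contrasts a region query over spectrum-ordered storage, which visits every spectrum, with the same query over a two-dimensional tile index, which visits only the tiles that overlap the viewport.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Made-up toy dataset: 100 spectra, each (retention_time, sorted m/z, intensities).
spectra = [
    (rt,
     [100.0 + 0.5 * i for i in range(200)],
     [float((i * 7 + rt) % 11) for i in range(200)])
    for rt in range(100)
]

def region_scan(spectra, rt_lo, rt_hi, mz_lo, mz_hi):
    """Sum intensity in a region by walking the spectra in storage order."""
    total, visited = 0.0, 0
    for rt, mzs, intens in spectra:
        visited += 1  # every spectrum is inspected, even far outside the region
        if rt_lo <= rt <= rt_hi:
            i, j = bisect_left(mzs, mz_lo), bisect_right(mzs, mz_hi)
            total += sum(intens[i:j])
    return total, visited

def build_tiles(spectra, rt_step, mz_step):
    """Re-bucket all points into fixed-size (rt, m/z) tiles."""
    tiles = defaultdict(list)
    for rt, mzs, intens in spectra:
        for mz, inten in zip(mzs, intens):
            tiles[(int(rt // rt_step), int(mz // mz_step))].append((rt, mz, inten))
    return tiles

def region_tiles(tiles, rt_step, mz_step, rt_lo, rt_hi, mz_lo, mz_hi):
    """Sum intensity in a region by visiting only the overlapping tiles."""
    total, visited = 0.0, 0
    for r in range(int(rt_lo // rt_step), int(rt_hi // rt_step) + 1):
        for m in range(int(mz_lo // mz_step), int(mz_hi // mz_step) + 1):
            visited += 1
            for rt, mz, inten in tiles.get((r, m), []):
                if rt_lo <= rt <= rt_hi and mz_lo <= mz <= mz_hi:
                    total += inten
    return total, visited

tiles = build_tiles(spectra, rt_step=10, mz_step=10.0)
scan_total, scan_visited = region_scan(spectra, 20, 29, 120.0, 130.0)
tile_total, tile_visited = region_tiles(tiles, 10, 10.0, 20, 29, 120.0, 130.0)
print(scan_total == tile_total, scan_visited, tile_visited)
```

Both queries return the same total, but the scan touches all 100 spectra while the tile lookup touches 2 tiles; on real datasets that difference in visited units is what makes panning and zooming practical or not.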

In order to tackle these issues, we have extended our seaMass technique as a novel signal decomposition method. This automatically models LC-MS data as a two-dimensional surface through selection of a sparse set of weighted B-spline basis functions from an over-complete dictionary. By ordering and partitioning the weights using an octree data model, efficient streaming visualisations are achieved. We have developed the core MS1 visualisation engine and overlay of MS2 annotations, thereby providing a quality control platform for the mass spectrometrist. This work won a poster prize at the BioNetVisA workshop of ECCB 2014, Strasbourg, and a manuscript has been published in the journal Proteomics.
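To make the idea of a surface built from sparse, weighted B-spline basis functions more concrete, here is a minimal Python sketch. It is not the seaMass algorithm: the weights are made up rather than fitted, the `(level, i, j)` key is a hypothetical multi-scale index that only loosely mimics a tree-structured layout, and the coarse-to-fine ordering stands in for the octree ordering, whose actual design is not described here. The only standard ingredient is the uniform cubic B-spline basis function.

```python
def cubic_bspline(t):
    """Uniform cubic B-spline basis function centred at 0 (zero for |t| >= 2)."""
    t = abs(t)
    if t < 1.0:
        return (4.0 - 6.0 * t * t + 3.0 * t ** 3) / 6.0
    if t < 2.0:
        return (2.0 - t) ** 3 / 6.0
    return 0.0

def reconstruct(weights, u, v, max_level):
    """Evaluate the surface at (u, v) using only weights up to max_level.

    weights maps (level, i, j) -> weight; level l uses knot spacing 2**-l,
    so higher levels add progressively finer detail.
    """
    total = 0.0
    for (level, i, j), w in weights.items():
        if level <= max_level:
            s = 2.0 ** level
            total += w * cubic_bspline(s * u - i) * cubic_bspline(s * v - j)
    return total

def stream_order(weights):
    """Order keys coarse-to-fine so a client can draw a preview early."""
    return sorted(weights, key=lambda k: (k[0], k[1], k[2]))

# Hypothetical sparse weights: most dictionary entries are simply absent (zero).
weights = {
    (2, 5, 7): -0.25,  # fine detail
    (0, 1, 1): 2.0,    # coarse peak
    (1, 3, 2): 0.5,    # medium detail
}

received = {}
for key in stream_order(weights):
    received[key] = weights[key]  # simulate weights arriving one at a time
    print(key, reconstruct(received, 1.0, 1.0, max_level=2))
```

Because only non-zero weights are stored and sent, a viewer can render a usable approximation from the first few coarse weights and refine it as finer ones arrive.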
Exploitation Route Since the visualisation is synergistic with the Proteomics Standards Initiative's data interchange formats, we were invited to talk at their Spring Workshop. This led to publication of a collaborative journal paper on data compression.
Sectors Agriculture, Food and Drink, Environment, Healthcare, Pharmaceuticals and Medical Biotechnology

 
Description With Prof Andy Jones (Liverpool) we have been awarded a BBSRC Follow-On-Fund grant to deliver a commercial, production-quality visualisation package for raw LC-MS data and annotated results, based on the outputs of this grant.
First Year Of Impact 2006
 
Description BBSRC Follow-on-Fund
Amount £198,649 (GBP)
Funding ID BB/N019385/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 04/2016 
End 09/2017
 
Description Proteomics Standards Initiative 
Organisation Human Proteome Organization
Department Proteomics Standards Initiative
Country United States 
Sector Charity/Non Profit 
PI Contribution Expertise on signal compression and data representation for application to the PSI's mzML standard interchange format for proteomics
Collaborator Contribution Implementation and validation of new signal compression approaches for mzML
Impact One publication [Teleman et al, Molecular and Cellular Proteomics, 1537-42, 2014], with open source implementation in ProteoWizard (http://proteowizard.sourceforge.net/)
Start Year 2013
 
Title seaMass 
Description The seaMass software is our open source dissemination route for the LC-MS (Liquid Chromatography - Mass Spectrometry) analysis algorithms developed by our group, including signal restoration and visualisation. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact The software has only recently been released, but there is strong interest in its incorporation into the ProteoSuite consortium's BBSRC BBR funded user-centric proteomics software (http://www.proteosuite.org/?q=aboutus). 
URL http://www.biospi.org/research/ms/seamass/
 
Title seaMass-Viz 
Description Interactive real-time streaming visualisation platform for raw proteomics and metabolomics LC-MS data. 
Type Of Technology Software 
Year Produced 2015 
Impact Software is currently a proof-of-concept. A BBSRC Follow-On-Fund grant (BB/N019385/1) was secured to develop a commercial production package based on this technology. 
URL http://www.biospi.org/research/ms/viz/