Signal-based image registration and mixed modelling for differential analysis of large scale cross-omics datasets

Lead Research Organisation: University of Manchester
Department Name: Medical and Human Sciences

Abstract

Biologists increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control the function of every living organism. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans, as well as how they break down under disease and how this breakdown can be averted.

Often the systems biology approach starts with a 'snapshot' of a particular biological sample. Mass spectrometry is a pervasive technique for gaining a snapshot of a sample, and it does this by ionising the sample and then measuring each constituent compound's mass and quantity based on the resulting charge. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Due to technical and biological variations, it will be necessary to analyse the sample a number of times to get reliable readings. Furthermore, classes of protein, metabolite and metals require different sample preparation, different chromatography approaches and different types of mass spectrometry instrumentation. These all add different kinds of biases and variation which make it extremely challenging to infer links between compounds, especially if the compounds are from different classes. To make matters worse, many snapshots are needed to capture different 'angles' of the biological process under investigation, and the instrumental conditions themselves are not entirely reproducible over time.

All this has led systems biology to become a progressively computational discipline. Since the datasets are so large, however, the existing computational techniques tend to convert the rich raw data from mass spectrometry output to a symbolic representation of compounds too early on. We instead advocate that all the data across the samples should be modelled together as raw data, so statistical 'strength' can be borrowed across the collection when making decisions about whether a compound or compound interaction truly exists in the data and at what level of confidence. Unfortunately, the chromatographic step is particularly variable, so corresponding compounds have to be matched to each other before or during analysis. We propose to do this directly on the raw data so that far fewer compounds are missed than when trying to detect them on each dataset in isolation. Furthermore, we propose that with the right 'mixed model' and on the aligned raw data, we can separate out the systematic biases in the data despite their being confounded by intermixed correlations. This will provide high quality evidence for interactions across sample classes and fuel advancements in the systems biology field.

Technical Summary

We propose to develop a generalised algorithm for aligning complex experimental designs of proteomic and metabolomic LC-MS and GC-MS data for the large-scale studies that are necessary to ensure the success of the systems biology approach. By basing the alignment in the complete raw signal domain, simultaneously compensating for differential expression, and providing a GPU-accelerated implementation, we anticipate significantly improved robustness and accuracy, and increased reporting of biochemical features while maintaining throughput. This method will also allow, for the first time, the downstream use of functional mixed modelling (FMM) methodology for differential analysis that will mine deep below the proteome and metabolome which are visible with current data processing algorithms, compensate for confounding effects and present full posterior distributions of statistical certainty. In particular, it will enable the integrated analysis of proteomics and metabolomics datasets for the first time with a universal method that simultaneously models the interdependencies between them.

We will employ a groupwise image registration approach with a physics-based deformation model. This will provide a tractable order of complexity to take into account the full raw data of the whole collection of datasets. The success of this approach is reliant on specialist modelling of the systematic bias and variation inherent in LC-MS and GC-MS. An accelerated FMM approach will then be developed using a variational Bayes formulation for incorporation directly into the alignment process. We believe this is key to (a) avoiding local optima as the posterior probabilities for these will be low, and (b) reducing the complexity of FMM to realise a tractable integrated alignment. The groupwise registration and FMM will be packaged for use by the community as a novel discovery engine, together with its comprehensive validation on large-scale cross-omics datasets.
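
For orientation, the general form of a functional mixed model, as commonly written in the FMM literature, is sketched below. The symbols are assumptions introduced here for illustration and do not come from the proposal: Y_i(t) is the observed signal for sample i at position t (for example, a point in the aligned retention time and m/z plane), X and Z are the fixed and random effect design matrices, B_a(t) and U_b(t) are the corresponding fixed and random effect functions, and E_i(t) is the residual error function.

```latex
% General functional mixed model (illustrative notation, not taken from the proposal)
Y_i(t) = \sum_{a=1}^{p} X_{ia}\, B_a(t) + \sum_{b=1}^{m} Z_{ib}\, U_b(t) + E_i(t),
\qquad i = 1, \dots, N
```

Under this reading, differential expression between sample groups would appear as regions of t where a fixed effect function B_a(t) differs from zero with high posterior probability, while systematic run-to-run effects would be absorbed by the random effect functions U_b(t).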

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry. The discovery engine will increase the amount of usable data extracted from LC-MS and GC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses.

The proposed discovery engine could be seen to be in direct competition with products from software vendors and instrument manufacturers. In fact we perceive a symbiotic relationship with user-centric discovery packages such as Progenesis (Nonlinear Dynamics, Newcastle, UK). The majority of development time for these packages is spent in data import/export, graphical interface, workflow, and results presentation. They also expose interfaces to popular search engines for feature identification including Mascot (Matrix Science, London, UK), which is an essential source of complementary information for a discovery platform. We will therefore investigate the commercialisation of our methods, which could potentially occur in the short to medium term. Nevertheless, we are committed to providing our methods freely for academic use. To maximise dissemination and facility to the academic community we will pursue the interfacing of our discovery engine into the open-source ProteoSuite package of our collaborator Dr Andy Jones, University of Liverpool (see letter of support).

There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our stated aim is to enable reliable and precise statistical evidence from large-scale cross-omics experiments, such as those using a Systems Biology approach, which are becoming increasingly essential. This improvement will filter down to the public through reduced resources, costs and overheads required for environmental, biological and biomedical discoveries and the characterisation of those discoveries. Since the system will identify multiple covariant effects, it is also reasonable to believe that tertiary biological processes could be identified which otherwise would go unnoticed. This has the potential to deliver further novel discoveries and characterise potentially interfering processes, therefore avoiding subsequent misallocation of resources.

The PDRA employed on this grant will be encouraged to spearhead public dissemination and will benefit from the unique intensive cross-disciplinary interaction at CADET that brings together proteomics, metabolomics and bioinformatics expertise all into the same facility and working towards the same goal.

Publications

 
Description In this grant, we have developed a differential analysis engine for label-free discovery mass spectrometry data that employs no prior biological knowledge of any kind. By analysing the raw MS data directly, it is generically applicable to both proteomics and metabolomics data and aims to discover statistically significant differential expression amongst small perturbations of the raw data that current feature detection and matching pipelines miss. To do this, we have adopted a group-wise image registration approach for aligning the raw images, which is needed to perform the wavelet functional mixed modelling (WFMM) method of Morris et al. (J. S. Morris, Statistics and Its Interface, vol. 5, no. 1, pp. 117-136, 2012) on the result. Beyond this grant, we will be partnering this engine with complementary methodology for statistical modelling of biological knowledge, in order to gain the best of both worlds (in BBSRC/NSF grant BB/M024954/1).

Through a new collaboration with Dr Jim Graham of the Centre of Imaging Sciences, University of Manchester, we were able to adapt his image registration technique (M. Rogers and J. Graham, IEEE Trans. Image Process., vol. 16, no. 3, pp. 624-635, 2007) for our purpose. This allowed us to bring forward testing of the WFMM on spike-in and real proteomics data collected by CADET. The results are very encouraging, as they demonstrate robust differential analysis below the detection limits of the leading commercial software Progenesis (Nonlinear Dynamics, Newcastle, UK). To summarise, our novel raw data workflow consists of: (1) Our seaMass sparse Poisson regression technique adapted to model generic smooth curved signals. We call this new approach 'image restoration'. The algorithm re-bins MS1 scans to a regular grid, reconstructs rows missing due to MS2 acquisition, and suppresses noise/bias from ion-counting statistics. The resulting 'images' are then ready for alignment; (2) The alignment method of Jim Graham adapted to correct LC deformation. This group-wise approach smoothly warps the images to align corresponding patterns while normalising retention times; (3) The WFMM approach adapted to large-scale LC-MS datasets through parallel computing of overlapping tiles. A peer-reviewed paper on this workflow was published at the IEEE International Symposium on Biomedical Imaging (ISBI) 2014, PDRA Dr Hanqing Liao won the Early Career Investigator Award for this work at BSPR 2013, and I was awarded a Young Investigator Travel Award for MSACL 2014.
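
To illustrate step (3), the following is a minimal sketch of how a large aligned LC-MS 'image' might be split into overlapping tiles that can then be processed in parallel. It is not the project's implementation: the function name `overlapping_tiles` and the parameters `tile` and `overlap` are hypothetical, and the per-tile work is replaced by a simple coverage count.

```python
from typing import Iterator, Tuple

import numpy as np


def overlapping_tiles(
    shape: Tuple[int, int], tile: int, overlap: int
) -> Iterator[Tuple[slice, slice]]:
    """Yield (row, column) slices that cover a 2-D array with overlapping tiles.

    `tile` is the tile edge length and `overlap` the number of rows/columns
    shared by neighbouring tiles. Both are hypothetical parameters chosen
    for illustration only.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap
    # Stop once the remaining rows/columns are already inside the last tile.
    for r in range(0, max(shape[0] - overlap, 1), step):
        for c in range(0, max(shape[1] - overlap, 1), step):
            yield (
                slice(r, min(r + tile, shape[0])),
                slice(c, min(c + tile, shape[1])),
            )


# Stand-in for an aligned image (retention time x m/z); values are irrelevant here.
image = np.zeros((10, 12))
coverage = np.zeros_like(image, dtype=int)

tiles = list(overlapping_tiles(image.shape, tile=4, overlap=2))
for rows, cols in tiles:
    # A real pipeline would fit a model to image[rows, cols] here;
    # this sketch only records how often each pixel is visited.
    coverage[rows, cols] += 1
```

Because each tile depends only on its own slice of the image, the loop body could be dispatched to separate workers. The overlap gives every tile some context beyond its core region, so that results near tile edges can be discarded or blended when the tiles are recombined.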

We have also developed a new group-wise alignment method that is based purely on image gradients and uses a sparsely regularised representation of the deformation field (http://www.biospi.org/research/ms/giro/). This gives us a significant improvement in robustness, and is being prepared for journal publication.
Exploitation Route We have engaged with the University of Manchester's agent for intellectual property commercialisation, UMIP, and are currently in initial discussions with a mass spectrometry vendor regarding knowledge transfer of selected parts of our workflow. As the proposed research represents a component in a potentially much bigger framework, we will not be seeking to sell the intellectual property or find an exclusive licensee. Instead, we will particularly investigate follow-on funding streams and knowledge transfer partnerships.

We remain committed to offering the discovery engine for academic research free of charge, and are working in close contact with Dr Andy Jones, University of Liverpool to maximise dissemination and facility to the academic community via interfacing our discovery engine into his open-source ProteoSuite package (BBSRC BBR BB/I00095X/1), for which we have been invited into the ProteoSuite consortium (http://www.proteosuite.org/?q=aboutus).

We are also now in the process of performing an extensive validation on clinical proteomics and metabolomics data in follow-up MRC NIRG grant MR/L011093/1, to realise societal healthcare impact and for submission of a high impact general readership journal publication. The alignment methodology is also essential to the new analysis methodology proposed in this MRC grant.
Sectors Agriculture, Food and Drink,Environment,Healthcare,Pharmaceuticals and Medical Biotechnology

URL http://www.biospi.org/
 
Description We have open sourced part of the workflow (http://www.seamass.net/) with the rest to follow soon. Substantial economic, environmental and health benefits could be derived indirectly with our discovery engine through the work of our end users. We are working directly with translational researchers in The University of Manchester to exemplify our approach by discovering novel clinical biomarkers in proteomics data.
First Year Of Impact 2014
Sector Healthcare
 
Description Impact Accelerator Award
Amount £21,844 (GBP)
Organisation University of Liverpool 
Sector Academic/University
Country United Kingdom
Start 04/2016 
End 09/2017
 
Description Research Institute Pump Priming Programme
Amount £11,883 (GBP)
Organisation University of Manchester 
Sector Academic/University
Country United Kingdom
Start 03/2014 
End 07/2014
 
Description University of Manchester BBSRC Impact Accelerator
Amount £20,926 (GBP)
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 01/2016 
End 06/2016
 
Description Young Investigator Travel Award
Amount € 1,500 (EUR)
Organisation Mass Spectrometry: Applications to the Clinical Laboratory (MSACL) 
Sector Charity/Non Profit
Country United States
Start 09/2014 
End 09/2014
 
Description Prof Jeffrey Morris 
Organisation University of Texas
Department M. D. Anderson Cancer Center
Country United States 
Sector Academic/University 
PI Contribution Translation of Prof Morris' Wavelet Functional Mixed Model methodology to the proteomics LC-MS (Liquid Chromatography - Mass Spectrometry) field.
Collaborator Contribution Access to Prof Morris' expertise and unpublished methodology in order to create our novel differential analysis workflow for raw LC-MS data.
Impact Two publications [Liao et al, IEEE ISBI 2014; Dowsey et al Proteomics, 2010, 4226-57] plus a successful submission to the September 2014 BBSRC Bilateral NSF/BIO-BBSRC responsive mode call [BB/M024954/1].
Start Year 2009
 
Title GIRO 
Description A method for the groupwise retention time alignment and intensity normalisation of LC-MS data. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Required to provide the alignment component of our raw data biomarker discovery pipeline [Liao et al, ISBI 2014]. Impact as a standalone method under assessment. 
URL http://www.biospi.org/research/ms/giro/
 
Title seaMass 
Description The seaMass software is our open source dissemination route for the LC-MS (Liquid Chromatography - Mass Spectrometry) analysis algorithms developed by our group, including signal restoration and visualisation. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact The software has only recently been released, but there is strong interest in its incorporation into the ProteoSuite consortium's BBSRC BBR funded user-centric proteomics software (http://www.proteosuite.org/?q=aboutus). 
URL http://www.biospi.org/research/ms/seamass/