Signal-based image registration and mixed modelling for differential analysis of large scale cross-omics datasets

Lead Research Organisation: University of Manchester

Department Name: Medical and Human Sciences

Abstract

Biologists increasingly wish to understand the complex interactions between the building blocks of genes, metabolites and proteins that control the function of every living organism. The field of systems biology has emerged to overcome the deficiencies of the traditional reductionist approach, which has identified the building blocks themselves and many of the individual interactions but has not been able to deduce how systems of these blocks act and react in unison. The application of systems biology is widespread, as it promises to revolutionise our understanding of healthy processes in plants, animals and humans, as well as how they break down under disease and how this breakdown can be averted.

Often the systems biology approach starts with a 'snapshot' of a particular biological sample. Mass spectrometry is a pervasive technique for gaining a snapshot of a sample, and it does this by ionising the sample and then measuring each constituent compound's mass and quantity based on the resulting charge. This is often not enough to separate out the sample fully and therefore a preceding phase of liquid or gas chromatography is used to provide an initial separation. Due to technical and biological variations, it will be necessary to analyse the sample a number of times to get reliable readings. Furthermore, classes of protein, metabolite and metals require different sample preparation, different chromatography approaches and different types of mass spectrometry instrumentation. These all add different kinds of biases and variation which make it extremely challenging to infer links between compounds, especially if the compounds are from different classes. To make matters worse, many snapshots are needed to capture different 'angles' of the biological process under investigation, and the instrumental conditions themselves are not entirely reproducible over time.

All this has led systems biology to become a progressively computational discipline. Since the datasets are so large, however, the existing computational techniques tend to convert the rich raw data from mass spectrometry output to a symbolic representation of compounds too early on. We instead advocate that all the data across the samples should be modelled together as raw data, so statistical 'strength' can be borrowed across the collection when making decisions about whether a compound or compound interaction truly exists in the data and at what level of confidence. Unfortunately, the chromatographic step is particularly variable, so corresponding compounds have to be matched to each other before or during analysis. We propose to do this directly on the raw data so that far fewer compounds are missed than when each dataset is processed in isolation. Furthermore, we propose that with the right 'mixed model' and on the aligned raw data, we can separate out the systematic biases in the data despite their being confounded by intermixed correlations. This will provide high quality evidence for interactions across sample classes and fuel advancements in the systems biology field.

Technical Summary

We propose to develop a generalised algorithm for aligning complex experimental designs of proteomic and metabolomic LC-MS and GC-MS data for the large-scale studies that are necessary to ensure the success of the systems biology approach. By basing the alignment in the complete raw signal domain, simultaneously compensating for differential expression, and providing a GPU-accelerated implementation, we anticipate significantly improved robustness and accuracy, and increased reporting of biochemical features while maintaining throughput. This method will also allow for the first time the downstream use of functional mixed modelling (FMM) methodology for differential analysis that will mine deep below the proteome and metabolome which are visible with current data processing algorithms, compensate for confounding effects and present full posterior distributions of statistical certainty. In particular, it will enable the integrated analysis of proteomics and metabolomics datasets for the first time with a universal method that simultaneously models the interdependencies between them.

We will employ a groupwise image registration approach with a physics-based deformation model. This will provide a tractable order of complexity to take into account the full raw data of the whole collection of datasets. The success of this approach is reliant on specialist modelling of the systematic bias and variation inherent in LC-MS and GC-MS. An accelerated FMM approach will then be developed using a variational Bayes formulation for incorporation directly into the alignment process. We believe this is key to (a) avoiding local optima as the posterior probabilities for these will be low, and (b) reducing the complexity of FMM to realise a tractable integrated alignment. The groupwise registration and FMM will be packaged for use by the community as a novel discovery engine, together with its comprehensive validation on large-scale cross-omics datasets.
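
The groupwise idea above, aligning every run to a reference built from the whole collection rather than to one chosen run, can be illustrated with a deliberately simplified sketch. It is not the proposed method: it assumes 1D chromatogram-like signals, replaces the physics-based deformation model with a single rigid integer shift per run, and does not compensate for differential expression. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def best_shift(signal, reference, max_shift):
    """Return the integer shift (in samples) that best aligns signal to reference."""
    shifts = range(-max_shift, max_shift + 1)
    # Score each candidate shift by its dot product with the reference.
    scores = [np.dot(np.roll(signal, s), reference) for s in shifts]
    return shifts[int(np.argmax(scores))]


def groupwise_align(signals, max_shift=20, n_iter=5):
    """Align all signals to their evolving group mean (assumption: rigid shifts only)."""
    signals = np.asarray(signals, dtype=float)
    shifts = np.zeros(len(signals), dtype=int)
    for _ in range(n_iter):
        # Build the reference from the current alignment of the whole group.
        aligned = np.stack([np.roll(x, s) for x, s in zip(signals, shifts)])
        reference = aligned.mean(axis=0)
        new_shifts = np.array([best_shift(x, reference, max_shift) for x in signals])
        # Remove the common drift so the group as a whole stays in place.
        new_shifts -= int(np.round(new_shifts.mean()))
        if np.array_equal(new_shifts, shifts):
            break
        shifts = new_shifts
    aligned = np.stack([np.roll(x, s) for x, s in zip(signals, shifts)])
    return aligned, shifts


# Synthetic example: one Gaussian peak observed at three different retention times.
t = np.arange(200)
true_offsets = [-6, 0, 6]
runs = [np.exp(-0.5 * ((t - 100 - d) / 4.0) ** 2) for d in true_offsets]
aligned, shifts = groupwise_align(runs)
peak_positions = aligned.argmax(axis=1)
```

In this toy case the recovered shifts cancel the simulated offsets and all three peaks land on the same sample. A real LC-MS alignment would need a smooth, non-rigid warp over the retention time axis, which is what the deformation model in the proposal is intended to provide.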

Planned Impact

As well as the academic beneficiaries, the proposed research has significant prospective impact for the mass spectrometry industry. The discovery engine will increase the amount of usable data extracted from LC-MS and GC-MS and therefore correspondingly increase users' return on investment. This will make commercial mass spectrometry instrumentation, which requires considerable capital and running costs, more attractive. In particular, we hope this extra research capacity will attract a wider uptake of mass spectrometry in environmental, biological and health research in industry and academia, as well as a wider audience of users and uses.

The proposed discovery engine could be seen to be in direct competition with products from software vendors and instrument manufacturers. In fact we perceive a symbiotic relationship with user-centric discovery packages such as Progenesis (Nonlinear Dynamics, Newcastle, UK). The majority of development time for these packages is spent in data import/export, graphical interface, workflow, and results presentation. They also expose interfaces to popular search engines for feature identification including Mascot (Matrix Science, London, UK), which is an essential source of complementary information for a discovery platform. We will therefore investigate the commercialisation of our methods, which could potentially occur in the short to medium term. Nevertheless, we are committed to providing our methods freely for academic use. To maximise dissemination and ease of access for the academic community we will pursue the interfacing of our discovery engine into the open-source ProteoSuite package of our collaborator Dr Andy Jones, University of Liverpool (see letter of support).

There is considerable potential in this application for providing indirect benefits to UK public health, quality of life and environmental sustainability. Our stated aim is to enable reliable and precise statistical evidence from large-scale cross-omics experiments, such as those using a Systems Biology approach, which are becoming increasingly essential. This improvement will filter down to the public through reduced resources, costs and overheads required for environmental, biological and biomedical discoveries and the characterisation of those discoveries. Since the system will identify multiple covariant effects, it is also reasonable to believe that tertiary biological processes could be identified which otherwise would go unnoticed. This has the potential to deliver further novel discoveries and characterise potentially interfering processes, therefore avoiding subsequent misallocation of resources.

The PDRA employed on this grant will be encouraged to spearhead public dissemination and will benefit from the unique intensive cross-disciplinary interaction at CADET that brings together proteomics, metabolomics and bioinformatics expertise all into the same facility and working towards the same goal.

Funded Value:

£120,345

Funded Period:

Jan 13 - Jul 14

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/K004158/1

Principal Investigator:

Andrew Dowsey

Research Subject:

Info. & commun. Technol. (45%)

Mathematical sciences (18%)

Tools, technologies & methods (36%)

Research Topic:

Bioinformatics (36%)

Image & Vision Computing (27%)

Information & Knowledge Mgmt (18%)

Statistics & Appl. Probability (18%)

Organisations

People	ORCID iD
Andrew Dowsey (Principal Investigator)	http://orcid.org/0000-0002-7404-9128
Warwick Dunn (Co-Investigator)	http://orcid.org/0000-0001-6924-0027
Garth Cooper (Co-Investigator)

Publications

Liao H (2014) A new paradigm for clinical biomarker discovery and screening with Mass Spectrometry through biomedical image analysis principles

Stevens A (2014) Network analysis: a new approach to study endocrine disorders. in Journal of molecular endocrinology

Key Findings
Impact Summary
Further Funding
Collaboration
Software and Technical Products


Description	In this grant, we have developed a differential analysis engine for label-free discovery mass spectrometry data that employs no prior biological knowledge of any kind. By analysing the raw MS data directly, it is generically applicable to both proteomics and metabolomics data and aims to discover statistically significant differential expression amongst small perturbations of the raw data that current feature detection and matching pipelines miss. To do this, we have adopted a group-wise image registration approach for aligning the raw images, which is needed to perform the wavelet functional mixed modelling (WFMM) method of Morris et al. (J. S. Morris, Statistics and Its Interface, vol. 5, no. 1, pp. 117-136, 2012) on the result. Beyond this grant, we will be partnering this engine with complementary methodology for statistical modelling of biological knowledge, in order to gain the best of both worlds (in BBSRC/NSF grant BB/M024954/1). Through a new collaboration with Dr Jim Graham of the Centre of Imaging Sciences, University of Manchester, we were able to adapt his image registration technique (M. Rogers and J. Graham, IEEE Trans. Image Process., vol. 16, no. 3, pp. 624-635, 2007) for our purpose. This allowed us to bring forward testing of the WFMM on spike-in and real proteomics data collected by CADET. The results are very encouraging, as they demonstrate robust differential analysis below the detection limits of the leading commercial software Progenesis (Nonlinear Dynamics, Newcastle, UK). To summarise, our novel raw data workflow consists of: (1) Our seaMass sparse Poisson regression technique adapted to model generic smooth curved signals. We call this new approach 'image restoration'. The algorithm re-bins MS1 scans to a regular grid, reconstructs rows missing due to MS2 acquisition, and suppresses noise/bias from ion-counting statistics.
The resulting 'images' are now ready for alignment; (2) The alignment method of Jim Graham adapted to correct LC deformation. This group-wise approach smoothly warps the images to align corresponding patterns while normalizing retention times; (3) The WFMM approach adapted to large-scale LC-MS datasets through parallel computing of overlapping tiles. A peer-reviewed paper on this workflow was published at the IEEE International Symposium on Biomedical Imaging (ISBI) 2014, PDRA Dr Hanqing Liao won the Early Career Investigator Award for this work at BSPR 2013, and I was awarded a Young Investigator Travel Award for MSACL 2014. We have also developed a new group-wise alignment method that is based purely on image gradients and uses a sparsely regularised representation of the deformation field (http://www.biospi.org/research/ms/giro/). This gives us a significant improvement in robustness, and is being prepared for journal publication.
Exploitation Route	We have engaged with the University of Manchester's agent for intellectual property commercialisation, UMIP, and are currently in initial discussions with a mass spectrometry vendor regarding knowledge transfer of selected parts of our workflow. As the proposed research represents a component in a potentially much bigger framework, we will not be seeking to sell the intellectual property or find an exclusive licensee. Instead, we will particularly investigate follow-on funding streams and knowledge transfer partnerships. We remain committed to offering the discovery engine for academic research free of charge, and are working in close contact with Dr Andy Jones, University of Liverpool to maximise dissemination and ease of access for the academic community by interfacing our discovery engine into his open-source ProteoSuite package (BBSRC BBR BB/I00095X/1), for which we have been invited into the ProteoSuite consortium (http://www.proteosuite.org/?q=aboutus). We are also now in the process of performing an extensive validation on clinical proteomics and metabolomics data in follow-up MRC NIRG grant MR/L011093/1, to realise societal healthcare impact and for submission to a high-impact general readership journal. The alignment methodology is also essential to the new analysis methodology proposed in this MRC grant.
Sectors	Agriculture, Food and Drink,Environment,Healthcare,Pharmaceuticals and Medical Biotechnology
URL	http://www.biospi.org/
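
The first workflow step in the Key Findings description above re-bins MS1 scans onto a regular grid so that a run can be treated as an image. The sketch below shows only that re-binning idea, using plain histogram summation. It is not the seaMass sparse Poisson regression, and it does not reconstruct rows missing due to MS2 acquisition or suppress ion-counting noise. The function names, bin settings and scan values are hypothetical.

```python
import numpy as np


def rebin_scan(mz, intensity, mz_min, mz_max, n_bins):
    """Sum the ion intensities of one scan into n_bins equal-width m/z bins."""
    edges = np.linspace(mz_min, mz_max, n_bins + 1)
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    return binned


def scans_to_image(scans, mz_min, mz_max, n_bins):
    """Stack re-binned scans: rows are scans (retention time), columns are m/z bins."""
    return np.stack(
        [rebin_scan(mz, inten, mz_min, mz_max, n_bins) for mz, inten in scans]
    )


# Two hypothetical scans with irregular, scan-specific m/z sampling.
scans = [
    (np.array([400.1, 400.4, 401.7]), np.array([10.0, 5.0, 2.0])),
    (np.array([400.2, 401.2, 401.9]), np.array([8.0, 1.0, 3.0])),
]
image = scans_to_image(scans, mz_min=400.0, mz_max=402.0, n_bins=2)
# image is [[15.0, 2.0], [8.0, 4.0]]
```

Once every run shares the same m/z grid, the rows of different runs become directly comparable, which is the precondition for the image alignment in step (2).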


Description	We have open sourced part of the workflow (http://www.seamass.net/) with the rest to follow soon. Substantial economic, environmental and health benefits could be derived indirectly from our discovery engine through the work of our end users. We are working directly with translational researchers in The University of Manchester to exemplify our approach by discovering novel clinical biomarkers in proteomics data.
First Year Of Impact	2014
Sector	Healthcare


Description	Impact Accelerator Award
Amount	£21,844 (GBP)
Organisation	University of Liverpool
Sector	Academic/University
Country	United Kingdom
Start	04/2016
End	09/2017


Description	Research Institute Pump Priming Programme
Amount	£11,883 (GBP)
Organisation	University of Manchester
Sector	Academic/University
Country	United Kingdom
Start	03/2014
End	07/2014


Description	University of Manchester BBSRC Impact Accelerator
Amount	£20,926 (GBP)
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	01/2016
End	06/2016


Description	Young Investigator Travel Award
Amount	€ 1,500 (EUR)
Organisation	Mass Spectrometry: Applications to the Clinical Laboratory (MSACL)
Sector	Charity/Non Profit
Country	United States
Start	09/2014
End	09/2014


Description	Prof Jeffrey Morris
Organisation	University of Texas
Department	M. D. Anderson Cancer Center
Country	United States
Sector	Academic/University
PI Contribution	Translation of Prof Morris' Wavelet Functional Mixed Model methodology to the proteomics LC-MS (Liquid Chromatography - Mass Spectrometry) field.
Collaborator Contribution	Access to Prof Morris' expertise and unpublished methodology in order to create our novel differential analysis workflow for raw LC-MS data.
Impact	Two publications [Liao et al, IEEE ISBI 2014; Dowsey et al Proteomics, 2010, 4226-57] plus a successful submission to the September 2014 BBSRC Bilateral NSF/BIO-BBSRC responsive mode call [BB/M024954/1].
Start Year	2009


Title	GIRO
Description	A method for the groupwise retention time alignment and intensity normalisation of LC-MS data.
Type Of Technology	Software
Year Produced	2015
Open Source License?	Yes
Impact	Provides the alignment component of our raw data biomarker discovery pipeline [Liao et al, ISBI 2014]. Impact as a standalone method under assessment.
URL	http://www.biospi.org/research/ms/giro/


Title	seaMass
Description	The seaMass software is our open source dissemination route for the LC-MS (Liquid Chromatography - Mass Spectrometry) analysis algorithms developed by our group, including signal restoration and visualisation.
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	The software has only recently been released, but there is strong interest in its incorporation into the ProteoSuite consortium's BBSRC BBR funded user-centric proteomics software (http://www.proteosuite.org/?q=aboutus).
URL	http://www.biospi.org/research/ms/seamass/
