Bayesian methods for modelling and integrating metabolic data

Lead Research Organisation: Imperial College London
Department Name: School of Public Health


Recent advances in biological technology make it possible to take multiple measurements of complex systems, from the cell to the whole organism. However, these technologies generate massive amounts of data, and processing these robustly and efficiently is a major task. The aim of our multidisciplinary project is to devise methods to combine and analyze different data measurements arising from experiments in modern biology that will ultimately aid in the understanding of the causes of common diseases, and lead to the development of new treatments. It is now possible to investigate how complex organisms function by measuring in great detail the chemical composition of, for example, a sample of blood or urine, and also to measure how that composition changes over time, or in reaction to different treatments or experimental conditions. Perhaps most importantly, it is also possible to compare the composition across different groups that may or may not have a particular disease, and to use this comparison to understand how treatments might be developed. This exciting prospect can only be achieved, however, if the experimental data are collected and analyzed as accurately as possible. This is the principal goal of our research. We will focus on so-called 'metabolic' analysis using two specific types of technology (known by the initials NMR and MS) that allow us to measure the amounts of a large number of different chemicals (or metabolites) that are present in the samples of blood or other body fluids being analyzed. Metabolites are small molecules present in all organisms which are essential to the functioning of their living cells.
NMR and MS are both extremely sophisticated measurement procedures that each produce a large amount of data (spectra). Although the measurements from the two technologies contain some information on the same metabolites, most of the information from the two sources is not identical, and an important statistical modelling task involves combining data from them in the most sensible fashion. We will separate this task into two components: first, the mathematical modelling of the NMR and MS metabolite spectra, and second, the combination of the data across the two measurement systems. Both components require major input from both the biologists and the statisticians involved in our research programme. The statistical analysis of the large amounts of data generated by NMR and MS technologies is an extremely challenging task. Some methods for data analysis do already exist, but they do not use all the information at hand. An important advantage of our approach is that we will use physico-chemical information already available about typical metabolites to direct how we build our models and carry out our analysis. Such physico-chemical 'prior' information has only rarely been used in the analysis of metabolite data, but we feel that it provides an important guide as to how analysis should proceed. Thus we will adopt a Bayesian statistical approach that combines data and prior information in a principled fashion. However, despite being scientifically attractive, this modelling approach needs advanced computing methods so that the analysis can be implemented, and a major component of the research we will carry out will be to implement the most efficient computational strategies. Understanding and modelling the content of NMR and MS metabolite spectra is a complicated task that requires both highly specialized chemical knowledge and state-of-the-art statistical techniques.
The novelty of our project is that by using a Bayesian analysis framework we are able to harness and incorporate such specialist information. Our multidisciplinary research team, which combines expertise in modelling, statistics, chemical biology and bioinformatics, will ensure the success of our research programme and facilitate the dissemination of its results to a wide community.

Technical Summary

Nuclear Magnetic Resonance (NMR) and Liquid Chromatography-Mass Spectrometry (LC-MS) provide valuable complementary information on the metabolome. To make best use of this information, our project is centred on modelling and interpreting metabolic data generated by NMR and LC-MS technologies. Our goals are: (i) to statistically integrate information from these complementary platforms, (ii) to help identify unknown metabolites, and (iii) to find groups of metabolites that discriminate between known population groups (e.g. linked to different experimental conditions or phenotypes). To accomplish these goals, we will work within the flexible framework of Bayesian hierarchical models and associated computations and build models that explicitly incorporate prior biological information, e.g. on the location and shape of peaks that make up the NMR profiles. We will use basis functions, e.g. wavelets, to model residual NMR and LC-MS spectrum data and will study the association between NMR and LC-MS profiles by deriving the joint distribution of the coefficients associated with each representation. These coefficients will be given a prior structure (e.g. a mixture model) that helps to find and characterise meaningful clusters of profiles. The development of realistic and reliable statistical models requires access to high-quality, well-defined data sets. We will benefit from data arising from the Consortium for Metabonomic Toxicology (COMET), a large collaborative project which has generated NMR metabolic profiles for over 30,000 urine samples from laboratory rats subjected to treatments known to cause toxic or other physiological stress and for which LC-MS profiles are also available for a subset (n=200). The multidisciplinary character of our project, which combines expertise in statistical modelling, Bayesian computations, bioinformatics and biological chemistry, is key to its success. The bioinformatics tools built will be made available through dedicated project web pages.
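
As a purely illustrative sketch of the modelling idea described above (peaks with prior-informed locations and shapes, plus a wavelet representation of the residual), the following toy example may help. It is not the project's model: the Lorentzian peak shape, the Gaussian prior on amplitudes, the Haar wavelet, every function name and all numerical values are assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np

def lorentzian(x, center, width):
    """Unit-height Lorentzian line shape (an assumed idealisation of a peak)."""
    return width**2 / ((x - center)**2 + width**2)

def posterior_mean_amplitudes(x, y, centers, width, noise_var, prior_var):
    """Posterior mean of peak amplitudes under a N(0, prior_var) prior on each
    amplitude and Gaussian noise (conjugate normal-normal update)."""
    # Design matrix: one column per peak, with locations fixed by prior knowledge.
    B = np.column_stack([lorentzian(x, c, width) for c in centers])
    precision = B.T @ B / noise_var + np.eye(len(centers)) / prior_var
    mean = np.linalg.solve(precision, B.T @ y / noise_var)
    return mean, B

def haar_dwt(signal):
    """Full orthonormal Haar decomposition; the length must be a power of two."""
    approx = np.asarray(signal, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))
        approx = (even + odd) / np.sqrt(2)
    return approx, details[::-1]  # coarsest detail level first

# Simulated toy spectrum (hypothetical values, not real NMR data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 256)
centers = [2.0, 5.0, 7.5]            # assumed known peak locations
true_amps = np.array([3.0, 1.5, 2.0])
width = 0.1
clean = sum(a * lorentzian(x, c, width) for a, c in zip(true_amps, centers))
y = clean + rng.normal(scale=0.05, size=x.size)

# Step 1: estimate peak amplitudes using the prior-informed peak template.
amps, B = posterior_mean_amplitudes(
    x, y, centers, width, noise_var=0.05**2, prior_var=10.0
)

# Step 2: represent whatever the peak model leaves unexplained with wavelets.
residual = y - B @ amps
approx, details = haar_dwt(residual)
```

In a fuller treatment the wavelet coefficients would themselves receive a prior (for example a mixture model, as mentioned above), and peak locations and widths would be uncertain rather than fixed; this sketch fixes them only for brevity.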

