Applications of probabilistic machine learning to medical biostatistics

Lead Research Organisation: University of Warwick
Department Name: Warwick Systems Biology Centre

Abstract

Medical research generates huge amounts of scientific data. My research creates new ways to extract the important scientific knowledge from these large, complex data-sets. I do this using Bayesian statistics, a particularly powerful set of techniques that combine all available information to produce the best possible results and that can handle the rich, complicated structure these data often contain. This can be regarded as a kind of artificial intelligence that makes sense of much bigger data-sets than the human brain could process or organise.

This approach opens up the possibility of systems medicine: modelling whole biological systems at once to gain a better understanding of their function. This is becoming a hugely important area of science, but it requires just the sort of statistical tools I’ll be developing in this research.

The area of medicine I work on is cancer, particularly breast cancer, myeloma, ovarian carcinogenesis and basal cell carcinoma. I work with collaborators in medicine, who have large data-sets that cannot currently be analysed in the way I propose. By applying Bayesian methods of analysis, I will work with them to better understand the medical implications of their data and to move towards improved clinical outcomes.

Technical Summary

Biomedical research accumulates huge volumes of scientific data. Because of this, it is necessary to develop computational and statistical methods that can effectively extract important information from these data. These data-sets can present a number of challenges. The underlying biological systems are often highly complex, leading to rich structure in the data. The data are typically also noisy, as a result of both measurement imprecision and any biological variation intrinsic to the system studied. The data-sets are often also large, particularly those from high-throughput technologies such as microarrays, which allow simultaneous measurement of whole genomes.

This work will develop novel non-parametric Bayesian methods to address key questions in three areas of medical biostatistics. The first is the integration of multiple sources of data into combined biomarker discovery/outcome prediction models. The second is the use of Bayesian model averaging techniques to develop more effective data models for clustering of gene expression data. The third is the creation of hierarchical data models that can simultaneously cluster, identify biomarkers and make outcome predictions.

Non-parametric Bayesian methods represent a powerful way to address these challenges. Because they are highly flexible, they can accurately model the sort of complex structure often found in biomedical data-sets and can select the most appropriate model structure in a principled way, leading directly to superior data modelling results. Their Bayesian nature also makes them suitable for modelling uncertainty in the data and for using all available sources of information in a given analysis. The particular families of non-parametric models I’ll be using are Gaussian Process and Dirichlet Process models, both of which are at the cutting edge of current machine learning research.
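As a purely illustrative aside, the way a Dirichlet Process lets the data determine the number of clusters can be sketched with its Chinese restaurant process representation. This is a minimal sketch under stated assumptions, not the method or software proposed in this project: the function name `crp_assignments`, the concentration parameter `alpha` and the fixed `seed` are all hypothetical choices made for the example.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    Hypothetical illustration: alpha > 0 is the concentration parameter;
    larger values make new clusters more likely.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1  # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    # The number of clusters is not fixed in advance; it is an outcome
    # of the sampling process and tends to grow slowly with the data size.
    for n in (10, 100, 1000):
        labels = crp_assignments(n, alpha=1.0)
        print(n, "items ->", len(set(labels)), "clusters")
```

In a full Dirichlet Process mixture model, each cluster would also carry its own data distribution and the assignments would be inferred from observed data rather than drawn from the prior as here.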

The medical focus of this work will be cancer, with a particular early focus on data-sets investigating breast cancer, myeloma, ovarian carcinogenesis and basal cell carcinoma. My collaborators and I will extend this work both as new data-sets become available and as we identify promising analysis results to follow up, for example in the design of new experiments or clinical trials.

This work will produce original research in probabilistic machine learning, and I will develop these innovations into robust software tools so that the broader scientific community can benefit fully from them.
