Statistics and machine learning for precision medicine

Lead Research Organisation: MRC Biostatistics Unit


Every person is biologically unique: even individuals who are superficially similar may show differences at the genetic and biochemical level. This diversity has real implications for medicine, and helps to explain why apparently similar patients often respond very differently to the same therapies. However, we remain limited in our ability to account for such diversity in the clinic. “Precision medicine” refers to the emerging idea of using molecular measurements (such as gene sequences or protein levels) to match individual patients to the therapies from which they are most likely to benefit. Machine learning (ML) is a field that combines ideas from computer science, mathematics and statistics to find patterns in data. Our research explores how we can use the power of statistics and ML to enable precision medicine. Our long-term goal is to develop methods that can exploit large datasets to provide doctors with information that can help in treating patients. We work on the mathematical and computational side, inventing new ways of looking at complex data to tease out relevant patterns, and we work closely with biomedical researchers in the UK, US and Europe.

Technical Summary

Our work focuses on the development and application of statistical and machine learning approaches that can exploit molecular and genomic data to assist in directing therapies to patients likely to benefit. Our efforts encompass both (i) direct prediction of therapeutic response and (ii) scalable estimation of molecular networks and dynamics that can shed light on disease mechanisms and heterogeneity, inform prediction of response and help in identifying promising therapeutic opportunities. We work on specific biomedical questions, addressed in collaboration with experimental groups, as well as methodological research in statistics and machine learning motivated by such questions.

The potential of computational approaches in medicine is increasingly clear, but the challenges posed by noisy and incomplete data, biological and clinical heterogeneity, and complex underlying processes and dynamics remain substantial. Our work is aimed at developing and exploiting statistical methods that can help to surmount some of these challenges. High-dimensional approaches, networks and graphical models, and inference for dynamical systems are key methodological themes in much of our work.

Two key ongoing projects are:

Data-driven characterization of biological networks in cancer. How is the genomic heterogeneity of cancer manifested at the level of biological networks, such as those involved in cell signalling? Do cancers show altered “wiring” due to genomic aberrations? And if so, how? In close collaboration with experimental partners, we are working on both theoretical and applied aspects of these questions. We are also investigating whether protein signalling networks differ by cancer type and how networks can be used to help discover and define cancer subtypes. Finally, we are developing scalable methods for systematically assessing causal network estimation approaches using interventional data.

Statistical methods for personalized medicine. We are addressing statistical challenges that arise in the prediction of drug response from multiple high-throughput data types. These challenges include the large number of potential predictors (high dimensionality), heterogeneity arising from known and unknown disease subtypes, the limited number of samples and the need to integrate multiple data types.


Description: Mills Lab, MDACC
Organisation: University of Texas
Department: M. D. Anderson Cancer Center
Country: United States
Sector: Academic/University
PI Contribution: We have collaborated with the Mills Lab at MD Anderson on analysis of cancer data, in particular focusing on networks and high-dimensional questions.
Collaborator Contribution: The Mills Lab have led biological and experimental aspects of the research.
Impact: PMIDs 22815361, 22923301, 24871328. Multi-disciplinary: yes, involving statistics, cancer biology and proteomics.
Start Year: 2010

Description: Spellman and Gray Labs, OHSU
Organisation: Oregon Health and Science University
Country: United States
Sector: Academic/University
PI Contribution: We have developed novel statistical and computational methods for cancer research.
Collaborator Contribution: Our partners at OHSU (labs of Paul Spellman and Joe Gray) have led the biological and experimental aspects of the research.
Impact: PMIDs 22578440, 22923301, 25161235. Papers not on PubMed: Joint Estimation of Multiple Related Biological Networks. C. J. Oates, J. Korkola, J. W. Gray & S. Mukherjee. Annals of Applied Statistics, 8(3):1892-1919, 2014. Multi-disciplinary: yes, involving statistics and cancer biology.
Start Year: 2014