Developing novel statistical methodology incorporating biological structure for high-throughput genomic data analysis

Lead Research Organisation: University of Oxford
Department Name: Statistics

Abstract

The tools of modern genetics now enable scientists to explore and probe the human genome at unprecedented levels of scale and detail. Miniaturisation and parallelisation allow geneticists to perform many thousands of experiments on a single chip that fits into the palm of the hand. As a result, geneticists are faced with unprecedented quantities of data that defy human efforts to analyse and interpret them. Advanced computational methods of analysis offer a solution, but existing algorithms are frequently based on ad hoc heuristics and fail to incorporate the biological knowledge and structure that a human interpreter would. This can lead to many discoveries that are subsequently found to be false in follow-up studies, a process that is highly wasteful of time and resources. This study seeks to meet the challenges posed by modern genetics by developing formal statistical methods that can handle massive data sets, exploit known biological structure, and assist in the decision-making processes of clinicians and experimental scientists. Combining sophisticated data analysis methods with advanced genomic technologies will minimise false discoveries and allow efforts to be focused on those discoveries most likely to further the understanding of human disease.

Technical Summary

State-of-the-art high-throughput genomic technologies have transformed the way in which modern genetics is conducted. Massive parallelisation and miniaturisation of analytical techniques now enable thousands of experiments to be conducted on a single microarray. However, the advent of these new high-throughput technologies requires a quantum leap in the statistical techniques used to analyse and study the data produced. Important scientific questions can no longer be answered simply by applying data mining and model-free statistical methodologies to the data. Instead, novel statistical methods must incorporate prior knowledge of biological structures and processes, much as a human researcher draws on their own experience when making scientific inferences. However, incorporating biological structure and models into classical statistical methods is difficult, if not impossible, in many instances.

The objective of this study is to develop novel statistical methodologies whose features are driven by modern biological and clinical analysis requirements; in particular, methodology that allows realistic biological information to be incorporated into models and that produces biologically relevant inferences for the experimentalist. These statistical methods will require developments in fundamental statistical theory as well as new computational methods to allow efficient analysis. The study will use Bayesian statistical methodology, which combines information from experimental data with prior biological knowledge within a coherent mathematical framework.
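As a schematic illustration of this framework only (the project's models remain to be developed, and the symbols below are generic rather than drawn from the proposal), Bayes' theorem combines prior biological knowledge about parameters theta with the likelihood of the experimental data y:

```latex
% Posterior beliefs about biological parameters \theta given data y:
% the likelihood p(y | \theta) is weighted by a prior p(\theta) that
% encodes known biological structure (generic notation, for illustration).
\[
  p(\theta \mid y)
  \;=\;
  \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, d\theta'}
  \;\propto\;
  p(y \mid \theta)\, p(\theta).
\]
```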

The general applicability of these statistical methods will be demonstrated using data sets from two bioanalytical technologies: SNP genotyping microarrays and ChIP-chip. In the first, computational techniques for identifying and deconvolving intra-tumour heterogeneity from SNP genotyping data will be investigated, in order to produce detailed genomic profiles of chromosomal alterations, such as deletions and amplifications, in the different tumour cell populations. This capability has applications in assessing heterogeneous responses to cancer treatments, by allowing variable cellular response to be associated with particular genetic alterations. The study will also investigate the identification of binding sites of DNA-binding proteins, such as transcription factors, from ChIP-chip data. Knowledge of the locations of protein-DNA interactions would further the understanding of gene regulation and epigenetic mechanisms in cells.
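To make the first application concrete, the following is a minimal sketch, not the proposed methodology, of how tumour purity might be estimated from SNP-array B-allele frequencies (BAFs) under a deliberately simplified two-population model: normal cells plus a single tumour clone carrying a hemizygous deletion. All function names, parameters and the grid-search fit are illustrative assumptions.

```python
# Toy model: estimate tumour purity from BAFs at heterozygous SNPs lying
# inside a hemizygous deletion, assuming two cell populations only.
import numpy as np

def expected_baf(purity, n_b_tumour=0, n_total_tumour=1):
    """Expected BAF when a fraction `purity` of cells are tumour cells with
    `n_b_tumour` B alleles out of `n_total_tumour` copies, and the remaining
    normal cells are heterozygous (1 B allele out of 2 copies)."""
    b_copies = purity * n_b_tumour + (1 - purity) * 1
    total_copies = purity * n_total_tumour + (1 - purity) * 2
    return b_copies / total_copies

def fit_purity(observed_baf, grid=np.linspace(0.01, 0.99, 99)):
    """Grid search for the purity minimising the squared error between
    observed BAFs (folded to <= 0.5) and the expected BAF under a deletion."""
    folded = np.minimum(observed_baf, 1 - observed_baf)
    errors = [np.mean((folded - min(expected_baf(p), 1 - expected_baf(p))) ** 2)
              for p in grid]
    return grid[int(np.argmin(errors))]

# Example: simulate noisy BAFs for a deletion in a tumour of 60% purity.
rng = np.random.default_rng(0)
true_purity = 0.6
baf = rng.normal(expected_baf(true_purity), 0.02, size=500)
print(f"estimated purity ~ {fit_purity(baf):.2f}")
```

The deconvolution methods envisaged in the study would go well beyond this toy example, handling multiple tumour cell populations, multiple alteration types and genome-wide data.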

Despite their different applications, both technologies pose a common data analysis problem: the discovery of biological events within large genome-wide data sets, where the signals of interest are embedded in complex noise processes of biochemical origin.
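This shared structure can be sketched, again purely for illustration, as a segmentation problem in which a genome-ordered signal is decoded into "background" and "event" states. The two-state Gaussian hidden Markov model below, with illustrative means, noise level and transition probabilities, is one simple instance of this pattern and not the statistical machinery the study will develop.

```python
# Two-state Gaussian HMM with Viterbi decoding: label each probe along the
# genome as background (state 0) or event (state 1). All model settings are
# illustrative assumptions.
import numpy as np

def viterbi_two_state(y, means=(0.0, 1.0), sd=0.5, stay=0.99):
    """Return the most probable state sequence (0 = background, 1 = event)."""
    log_trans = np.log(np.array([[stay, 1 - stay], [1 - stay, stay]]))
    log_emit = np.stack([
        -0.5 * ((y - m) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))
        for m in means
    ], axis=1)                                   # shape (n, 2)
    n = len(y)
    score = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    score[0] = np.log(0.5) + log_emit[0]         # uniform initial state
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans  # cand[i, j]: from i to j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    states = np.zeros(n, dtype=int)
    states[-1] = score[-1].argmax()
    for t in range(n - 2, -1, -1):                # backtrace
        states[t] = back[t + 1, states[t + 1]]
    return states

# Example: an "event" of elevated signal buried in noise.
rng = np.random.default_rng(1)
signal = np.r_[np.zeros(200), np.ones(60), np.zeros(200)]
observed = signal + rng.normal(0, 0.5, size=signal.size)
decoded = viterbi_two_state(observed)
print("probes decoded as event:", decoded.sum())
```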
