Developing novel statistical methodology incorporating biological structure for high-throughput genomic data analysis

Lead Research Organisation: Imperial College London
Department Name: Dept of Mathematics

Abstract

The tools of modern genetics now enable scientists to explore and probe the human genome at unprecedented levels of scale and detail. Miniaturisation and parallelisation now enable geneticists to perform many thousands of experiments on a single chip that can fit into the palm of the hand. However, geneticists are now faced with quantities of data that defy human efforts to analyse and interpret them. Advanced computational methods of analysis offer a solution, but existing algorithms are frequently based on ad hoc heuristics and fail to incorporate the biological knowledge and structure that a human interpreter would draw on. This can lead to many discoveries that are subsequently found to be false in follow-up studies, a process which is highly wasteful of time and resources. This study seeks to meet the challenges posed by modern genetics by developing formal statistical methods that are able to handle massive data sets, exploit known biological structure, and assist the decision-making of clinicians and experimental scientists. The use of sophisticated data analysis methods with advanced genomic technologies will minimise false discoveries and allow efforts to be focused on those discoveries that may further the understanding of human disease.

Technical Summary

State-of-the-art high-throughput genomic technologies have transformed the way in which modern genetics is conducted. Massive parallelisation and miniaturisation of analytical techniques now enable thousands of experiments to be conducted on a single microarray. However, the advent of these new high-throughput technologies requires a quantum leap in the statistical techniques used to analyse and study the data produced. Important scientific questions can no longer be answered purely by applying simple data mining and model-free statistical methodologies to data. Instead, novel statistical methods must incorporate prior knowledge of biological structures and processes, in the way that a human researcher might draw upon their own experience in making scientific inferences. However, incorporating biological structure and models into classical statistical methods is difficult, if not impossible, in many instances.

The objective of this study is to develop novel statistical methodologies whose features are driven by modern biological and clinical analysis requirements. In particular, the study will develop statistical methodology that allows realistic biological information to be incorporated into models and that produces biologically relevant inferences for the experimentalist. These statistical methods will require developments in fundamental statistical theory as well as new computational methods to allow efficient analysis. This study will utilise Bayesian statistical methodology, which allows information from experimental data and prior biological knowledge to be combined within a coherent mathematical framework.
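As a minimal illustration of how a Bayesian framework combines prior knowledge with data, the sketch below applies Bayes' rule to a single hypothetical microarray measurement. It is not taken from the study: the function names, the Gaussian noise model, and the numerical defaults (signal means of 0.0 and -0.5, noise standard deviation of 0.25) are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_altered(x, prior_altered, mean_normal=0.0, mean_altered=-0.5, sd=0.25):
    """Posterior probability that a probe lies in an altered region.

    x             -- observed signal for one probe (hypothetical units)
    prior_altered -- prior probability of alteration, standing in for
                     biological knowledge held before seeing the data
    The means and sd are illustrative assumptions, not values from the study.
    """
    like_altered = normal_pdf(x, mean_altered, sd)
    like_normal = normal_pdf(x, mean_normal, sd)
    numerator = prior_altered * like_altered
    return numerator / (numerator + (1.0 - prior_altered) * like_normal)


if __name__ == "__main__":
    # The same measurement interpreted under a weak and a strong prior.
    for prior in (0.01, 0.5):
        print(prior, posterior_altered(-0.4, prior))
```

The same observation yields a different posterior under different priors, which is the sense in which prior biological knowledge and experimental data are combined in one calculation.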

The general applicability of these statistical methods will be demonstrated using data sets from two bioanalytical technologies - SNP genotyping microarrays and ChIP-chip. In the first, computational techniques to identify and deconvolve intra-tumour heterogeneity from SNP genotyping data will be investigated in order to produce detailed genomic profiles of the chromosomal alterations, such as deletions and amplifications, in the different tumour cell populations. This capability has applications in assessing heterogeneous responses to cancer treatments by allowing variable cellular response to be associated with particular genetic alterations. The study will also investigate the identification of binding sites for DNA-binding proteins, such as transcription factors, from ChIP-chip data. Knowledge of the locations of protein-DNA interactions would further the understanding of gene regulation and epigenetic mechanisms in cells.

Despite differences in application, both technologies pose a common data analysis problem: discovering biological events in large genome-wide data sets, where those events are embedded within complex noise processes of biochemical origin.
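The problem of recovering events from a noisy genome-wide signal can be sketched with a small hidden Markov model decoded by the Viterbi algorithm. This is an illustrative sketch only, not the study's method: the function name, the Gaussian emission model, the shared noise level, and the symmetric transition probabilities are all assumptions.

```python
import math


def viterbi_segment(signal, means, sd=0.2, stay_prob=0.95):
    """Most probable state path for a noisy signal under a simple HMM.

    signal    -- non-empty list of observations ordered along the genome
    means     -- expected signal level for each hidden state (at least two)
    sd        -- assumed Gaussian noise level, shared by all states
    stay_prob -- assumed probability of remaining in the same state
    """
    n_states = len(means)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))

    def log_emit(x, k):
        # Gaussian log-likelihood; terms constant across states are dropped.
        z = (x - means[k]) / sd
        return -0.5 * z * z

    def log_trans(j, k):
        return log_stay if j == k else log_switch

    # Uniform prior over the initial state.
    score = [-math.log(n_states) + log_emit(signal[0], k) for k in range(n_states)]
    back = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            best_j = max(range(n_states), key=lambda j: score[j] + log_trans(j, k))
            new_score.append(score[best_j] + log_trans(best_j, k) + log_emit(x, k))
            pointers.append(best_j)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical states: 0 = loss, 1 = neutral, 2 = gain.
    levels = [-0.5, 0.0, 0.5]
    observed = [0.0, 0.05, -0.1, -0.6, -0.5, -0.55, 0.0, 0.1]
    print(viterbi_segment(observed, levels))
```

The high probability of staying in the same state encodes the structural assumption that alterations span contiguous runs of probes, so an isolated outlier is treated as noise while a sustained shift is called as an event.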

Publications

Becker J (2013) NucleoFinder: a statistical approach for the detection of nucleosome positions. in Bioinformatics (Oxford, England)

Lee A (2010) On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. in Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America

Titsias MK (2016) Statistical Inference in Hidden Markov Models Using k-Segment Constraints. in Journal of the American Statistical Association

 
Description MRC New Investigator Research Grant
Amount £345,267 (GBP)
Funding ID MR/L001411/1 
Organisation Medical Research Council (MRC) 
Sector Academic/University
Country United Kingdom
Start 02/2014 
End 08/2017
 
Description B-CLL 
Organisation John Radcliffe Hospital
Department Haemato-Molecular Diagnostic Service
Country United Kingdom 
Sector Hospitals 
PI Contribution Bioinformatics and statistical methods development for B-CLL project.
Collaborator Contribution Provision of data and samples. Provision of data and experimental follow-up of computational findings.
Impact Manuscript submitted, reviewed and undergoing revisions. A related project on B-CLL has received further funding from the WT-NHS HICF.
Start Year 2009
 
Description B-CLL 
Organisation University of Oxford
Department Wellcome Trust Centre for Human Genetics
Country United Kingdom 
Sector Academic/University 
PI Contribution Bioinformatics and statistical methods development for B-CLL project.
Collaborator Contribution Provision of data and samples. Provision of data and experimental follow-up of computational findings.
Impact Manuscript submitted, reviewed and undergoing revisions. A related project on B-CLL has received further funding from the WT-NHS HICF.
Start Year 2009
 
Description Ludwig Colon Cancer Initiative 
Organisation Ludwig Institute for Cancer Research
Country United Kingdom 
Sector Academic/University 
PI Contribution Bioinformatics and statistical methodology research and development.
Collaborator Contribution Experimental and laboratory support.
Impact PMID: 20858232. Several other manuscripts in preparation.
Start Year 2009
 
Description MACS 
Organisation Queen Mary University of London
Department Barts and The London School of Medicine and Dentistry
Country United Kingdom 
Sector Academic/University 
PI Contribution Bioinformatics and statistical methods development.
Impact A manuscript describing findings is in preparation.
Start Year 2009
 
Title OncoSNP 
Description Software to find DNA copy number alterations from single nucleotide polymorphism array data from cancer samples. 
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2010
Licensed Yes
Impact Extensions of the software have been incorporated into a successful HICF grant proposal.