Developing novel statistical methodology incorporating biological structure for high-throughput genomic data analysis
Lead Research Organisation: Imperial College London
Department Name: Dept of Mathematics
Abstract
The tools of modern genetics now enable scientists to explore and probe the human genome at unprecedented levels of scale and detail. Miniaturisation and parallelisation now enable geneticists to perform many thousands of experiments on a single chip that can fit into the palm of the hand. However, geneticists are now faced with unprecedented quantities of data that defy human efforts to analyse and interpret them. Advanced computational methods of analysis offer a solution, but existing algorithms are frequently based on ad-hoc heuristics and fail to incorporate the biological knowledge and structure that a human interpreter would draw on. This can lead to many discoveries that are subsequently found to be false in follow-up studies - a process which is highly wasteful of time and resources. This study seeks to meet the challenges posed by modern genetics by developing formal statistical methods that are able to handle massive data sets, exploit known biological structure, and assist in the decision-making process of clinicians and experimental scientists. The use of sophisticated data analysis methods with advanced genomic technologies will minimise false discoveries and allow efforts to be focused on those discoveries that may then further the understanding of human disease.
Technical Summary
State-of-the-art high-throughput genomic technologies have transformed the way in which modern genetics is conducted. Massive parallelisation and miniaturisation of analytical techniques now enable thousands of experiments to be conducted on a single microarray. However, the advent of these new high-throughput technologies requires a quantum leap in the statistical techniques used to analyse and study the data produced. Important scientific questions can no longer be answered purely by applying simple data mining and model-free statistical methodologies to data. Instead, novel statistical methods must incorporate prior knowledge of biological structures and processes, in the way that a human researcher might draw upon their own experience in making scientific inferences. However, incorporating biological structure and models into classical statistical methods is difficult, if not impossible, in many instances.
The objective of this study is to develop novel statistical methodologies whose features are driven by modern biological and clinical analysis requirements. In particular, it seeks statistical methodology that allows realistic biological information to be incorporated into models and that produces biologically relevant inferences for the experimentalist. These statistical methods will require developments in fundamental statistical theory as well as new computational methods to allow efficient analysis. This study will utilise Bayesian statistical methodology that allows the combination of information from experimental data and prior biological knowledge within a coherent mathematical framework.
The general applicability of these statistical methods will be demonstrated using data sets from two bioanalytical technologies - SNP genotyping microarrays and ChIP-chip. In the first, computational techniques to identify and deconvolve intra-tumour heterogeneity from SNP genotyping data will be investigated in order to produce detailed genomic profiles of chromosomal alterations, such as deletions and amplifications, of the different tumour cell populations. This capability has applications in assessing heterogeneous drug responses to cancer treatments by allowing variable cellular response to be associated with particular genetic alterations. The study will also investigate the identification of binding sites for DNA-binding proteins, such as transcription factors, from ChIP-chip data. Knowledge of the locations of protein-DNA interactions would further the understanding of gene regulation and epigenetic mechanisms in cells.
Despite differences in application, both technologies pose common data analysis problems that entail the discovery of certain biological events from large genome-wide data sets that are embedded within complex noise processes whose origins are biochemical.
Organisations
- Imperial College London, United Kingdom (Lead Research Organisation)
- University of Oxford, United Kingdom (Collaboration, Fellow)
- Queen Mary, University of London, United Kingdom (Collaboration)
- John Radcliffe Hospital, United Kingdom (Collaboration)
- Ludwig Institute for Cancer Research (Collaboration)
- University of Manchester, Manchester, United Kingdom (Fellow)
Publications

Becker J
(2013)
NucleoFinder: a statistical approach for the detection of nucleosome positions.
in Bioinformatics (Oxford, England)

Filippi S
(2016)
Scalable Bayesian nonparametric measures for exploring pairwise dependence via Dirichlet Process Mixtures.
in Electronic journal of statistics

Lee A
(2010)
On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods.
in Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America

McGuinness L
(2010)
Presynaptic NMDARs in the hippocampus facilitate transmitter release at theta frequency.
in Neuron

Mouradov D
(2013)
Survival in stage II/III colorectal cancer is independently predicted by chromosomal and microsatellite instability, but not by specific driver mutations.
in The American journal of gastroenterology

Sengupta N
(2013)
Analysis of colorectal cancers in British Bangladeshi identifies early onset, frequent mucinous histotype and a high prevalence of RBFOX1 deletion.
in Molecular cancer

Titsias MK
(2016)
Statistical Inference in Hidden Markov Models Using k-Segment Constraints.
in Journal of the American Statistical Association

Wellcome Trust Case Control Consortium
(2010)
Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls.
in Nature
Description | MRC New Investigator Research Grant |
Amount | £345,267 (GBP) |
Funding ID | MR/L001411/1 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 02/2014 |
End | 08/2017 |
Description | B-CLL |
Organisation | John Radcliffe Hospital |
Department | Haemato-Molecular Diagnostic Service |
Country | United Kingdom |
Sector | Public |
PI Contribution | Bioinformatics and statistical methods development for B-CLL project. |
Collaborator Contribution | Provision of data and samples. Provision of data and experimental follow-up of computational findings. |
Impact | Manuscript submitted, reviewed and undergoing revisions. A related project on B-CLL has received further funding from the WT-NHS HICF. |
Start Year | 2009 |
Description | B-CLL |
Organisation | University of Oxford |
Department | Wellcome Trust Centre for Human Genetics |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | Bioinformatics and statistical methods development for B-CLL project. |
Collaborator Contribution | Provision of data and samples. Provision of data and experimental follow-up of computational findings. |
Impact | Manuscript submitted, reviewed and undergoing revisions. A related project on B-CLL has received further funding from the WT-NHS HICF. |
Start Year | 2009 |
Description | Ludwig Colon Cancer Initiative |
Organisation | Ludwig Institute for Cancer Research |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Bioinformatics and statistical methodology research and development. |
Collaborator Contribution | Experimental and laboratory support. |
Impact | PMID: 20858232. Several other manuscripts in preparation. |
Start Year | 2009 |
Description | MACS |
Organisation | Queen Mary University of London |
Department | Barts and The London School of Medicine and Dentistry |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Bioinformatics and statistical methods development. |
Impact | A manuscript describing findings is in preparation. |
Start Year | 2009 |
Title | OncoSNP |
Description | Software to find DNA copy number alterations from single nucleotide polymorphism array data from cancer samples. |
IP Reference | |
Protection | Copyrighted (e.g. software) |
Year Protection Granted | 2010 |
Licensed | Yes |
Impact | Extensions of the software have been incorporated as part of a successful HICF grant proposal. |