Comparative Analyses and Genomic Sequence-Only Prediction of DNA Breakpoints via Gradient Boosting Machines and Deep Learning

Lead Research Organisation: University of Oxford
Department Name: RDM Radcliffe Department of Medicine

Abstract

Genomic insertions and deletions, which arise through the formation of DNA strand breaks, are the second most significant class of DNA modification after point mutations. However, unlike point mutations, the long- and short-range sequence-context dependence of DNA strand breakpoints, as well as the detailed regional variation of breakage propensities across the genome, has not been extensively interrogated through cutting-edge computational means. This is largely because of the relative sparseness of available data on such DNA alterations, coupled with the complex multi-etiologic nature of DNA strand breaks, which further stratifies already sparse data. Nevertheless, the importance of understanding DNA breakage has led to a number of computational studies outlining potential associations between copy number variant (CNV) breakpoints and non-B DNA conformations, between frequent somatic copy number alteration (SCNA) breakpoints and cancer genes, and between induced DNA breaks and both chromatin packing and abasic sites. A discrete-valued bivariate hidden Markov model developed on CNV data could reach a predictive resolution of ~300 bp for about 400 breakpoints in the human genome, showing the promise of higher-resolution sequence-based prediction of DNA breakpoints, at least for CNVs. The past decade has witnessed the advent of dedicated sequencing methodologies for experimental genome-wide mapping of DNA breakpoints, while cutting-edge machine learning techniques have become readily available, with outstanding recent applications to biological big data. It is therefore the right time to apply state-of-the-art machine learning techniques to the substantially expanded and well-stratified DNA breakpoint data from tissues undergoing different physiological, spontaneous and pathological processes, in order to reveal their fine short- and long-range sequence dependences, along with the commonalities and differences across the processes that lead to DNA strand breaks.
Throughout the project, the student will gain and apply skills in computational biology, advanced data analysis, machine learning and statistics ("Quantitative skills"), while working with genomics datasets and integrating many bottom-up developments from thermodynamic and chemical considerations of nucleic acids ("Interdisciplinary skills").

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
MR/N013468/1       -             -             01/10/2016  30/09/2025  -
2434544            Studentship   MR/N013468/1  01/10/2020  30/09/2024  Patrick Pflughaupt