Development of Novel Computational Strategies to Store and Interpret Next Generation Sequencing Data and Their Application to Multi-Genomic Analyses

Lead Research Organisation: Imperial College London
Department Name: Life Sciences

Abstract

AIM OF THE PHD PROJECT - High-throughput sequencing (HTS) of genomes and transcriptomes will make sequencing data available for numerous samples across many species. However, exploiting this information is hampered by major difficulties in storing the large data sets, transferring them between sites, and visualising them. The aims of this cross-disciplinary PhD project are (1) to develop novel data reduction methods that streamline the storage and analysis of large, complex multi-genomic data; (2) to develop visualisation tools that produce compact visualisations; and (3) to use these tools to mine a biological dataset and investigate specific points of biological interest.

DATA REDUCTION - The first challenge will be to achieve a major reduction in the size of the data without losing the critical metadata associated with each sequenced base (i.e. the quality of the data or even the original read). We will need to develop novel data reduction algorithms, since traditional lossless compression techniques are unsuitable for HTS data: they do not combine rapid decoding starting from any point in the stream with rapid mutual comparison of several compressed streams. Additionally, current DNA compression methods (DNACompress, LCA, and DNAzip) are primarily designed for a single genome. Here we will exploit the repeatability and consistency of sequencing technologies: applying the same technology and method to very similar genome sequences is likely to produce strongly similar systematic deviations (sequencing errors, variations in coverage, etc.). This would make differential compression or other de-duplication techniques highly efficient for the whole data set. The second challenge will be to design protocols to improve data transfers. A large number of scientists will be querying consolidated data sets from several locations around the world. We need to provide efficient storage that supports real-time partial extraction of data at various resolutions, similar to the functionality provided by BigBed and BigWig. In addition to data format definitions, it will be necessary to define protocols that efficiently support the distributed nature of the work.

VISUALISATION - Existing genome browsers are not suited to large-scale comparative genomics studies, as at best they support simultaneous visualisation of a small number of genomes. Visualising a large number of genomes will require new concepts for the navigation and visualisation of genomic data. The data reduction techniques we will develop lead naturally towards compact data visualisation, with the ability to use interactive thresholds and cut-offs to display comparative features and to toggle between data subsets. Once the right queries have been presented to the appropriate databases and the results aggregated, the remaining step is to present the data in a meaningful way.

APPLICATION - Our currently favoured exemplar dataset comes from genomic and transcriptomic studies of the obligate fungal pathogen of barley, Blumeria graminis hordei, and other closely related fungi. A large collaborative effort including Butcher and Spanu (Imperial) is under way with BBSRC support (BB/E000983/1; BB/H001646/1). Several completed genomes (>120 Mbases range) are available and several others are under way with international collaborators, together with transcriptomes. We will use the developed computational tools to study phenotypic variation between species. Other biological topics that can be explored include the analysis of strain data from plant and animal pathogens and cross-genomic studies of related bacteria.
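As an illustration of the differential compression idea described under DATA REDUCTION, the following is a minimal sketch, not the project's method. It assumes that each sample has the same length as a shared reference and differs from it only by substitutions; it ignores quality scores, insertions and deletions. The names `encode`, `decode_region`, `differing_positions` and the block size `BLOCK` are hypothetical. Differences are grouped into fixed-size blocks so that a region can be rebuilt without reading the whole stream, and two encoded samples can be compared without decoding either.

```python
from typing import Dict, List, Tuple

BLOCK = 4  # hypothetical block size; real data would use far larger blocks

# block index -> list of (absolute position, base in the sample)
Diffs = Dict[int, List[Tuple[int, str]]]


def encode(reference: str, sample: str, block: int = BLOCK) -> Diffs:
    """Store only the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    diffs: Diffs = {}
    for pos, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if ref_base != sample_base:
            diffs.setdefault(pos // block, []).append((pos, sample_base))
    return diffs


def decode_region(reference: str, diffs: Diffs, start: int, end: int,
                  block: int = BLOCK) -> str:
    """Rebuild sample[start:end], reading only the blocks that overlap it."""
    if start >= end:
        return ""
    out = list(reference[start:end])
    for b in range(start // block, (end - 1) // block + 1):
        for pos, base in diffs.get(b, []):
            if start <= pos < end:
                out[pos - start] = base
    return "".join(out)


def differing_positions(a: Diffs, b: Diffs) -> List[int]:
    """Positions where two encoded samples disagree, found without decoding."""
    set_a = {item for items in a.values() for item in items}
    set_b = {item for items in b.values() for item in items}
    return sorted({pos for pos, _ in set_a ^ set_b})


reference = "ACGTACGTACGT"
sample_a = "ACGTTCGTACGA"   # differs from the reference at positions 4 and 11
sample_b = "ACGTTCGTACGT"   # differs from the reference at position 4 only

enc_a = encode(reference, sample_a)
enc_b = encode(reference, sample_b)
region = decode_region(reference, enc_a, 3, 6)
shared_free = differing_positions(enc_a, enc_b)
```

In this toy case the two samples share the substitution at position 4, so it contributes nothing to their comparison. This reflects the expectation stated above that closely related genomes sequenced with the same technology will show largely shared deviations.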
