Data mining and characterisation of dark metagenomic sequence data

Lead Research Organisation: University of Glasgow
Department Name: College of Medical, Veterinary & Life Sci

Abstract

Studentship strategic priority area: Mathematics, statistics and computation
Keywords: Data science, bioinformatics, metagenomics, microbiome, virome

The advent of high-throughput sequencing (HTS) has led to a growing deluge of metagenomic sequence data being deposited in online archives. HTS has enabled the routine detection of known pathogenic viruses, the discovery of novel human viruses and human-resident phages, and the characterisation of the human virome, which is essential for understanding the role the microbiome plays in health and disease. However, a significant proportion of sequence data cannot be classified because it displays no detectable homology to any known sequence, and such data is typically excluded from further analyses. The aim of this PhD project is to develop a complete computational framework for mining online archives for 'dark' sequence data, combined with the assembly, storage, clustering, and initial classification of such dark sequences. The project will answer fundamental questions about the extent and diversity of dark sequences, classify these sequences into related groups, and predict their biological function and origin. There are three broad stages: (1) the development of data mining pipelines to automatically retrieve metadata and sequence data from the Sequence Read Archive; (2) the adaptation of existing metagenomic assembly pipelines to sequences of unknown origin, and the development of a database system to store and query the assembled dark sequences; (3) the quantification, analysis and clustering of the identified dark sequences. The project is a combination of data science and bioinformatics/computational biology, with substantial elements of computation, programming and statistics/machine learning. The student undertaking this PhD will be trained in a number of MRC priority quantitative skills areas.
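
As an illustration of the clustering stage only, the following is a minimal sketch of alignment-free grouping of sequences by k-mer composition. It is not the project's stated method: the choice of tetranucleotide frequencies, cosine similarity, the 0.9 threshold, and the function names (`kmer_profile`, `cosine`, `greedy_cluster`) are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt

K = 4  # assumed k-mer length (tetranucleotides); not specified by the project
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return the relative frequency of each unambiguous k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing ambiguous bases (e.g. N) are not in KMERS and are ignored.
    total = sum(counts[km] for km in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[km] / total for km in KMERS]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(seqs, threshold=0.9):
    """Greedily assign each sequence to the first cluster whose representative
    (the cluster's first member) has cosine similarity >= threshold.

    seqs: dict mapping a sequence name to its nucleotide string.
    Returns a list of clusters, each a list of sequence names.
    """
    reps = []  # list of (representative profile, member names)
    for name, seq in seqs.items():
        profile = kmer_profile(seq)
        for rep_profile, members in reps:
            if cosine(profile, rep_profile) >= threshold:
                members.append(name)
                break
        else:
            # No sufficiently similar representative: start a new cluster.
            reps.append((profile, [name]))
    return [members for _, members in reps]


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    example = {
        "contig_a": "ACGT" * 20,
        "contig_b": "ACGT" * 25,
        "contig_c": "GGGGCCCC" * 10,
    }
    print(greedy_cluster(example))
```

Composition-based grouping of this kind does not depend on homology to a reference database, which is why it is sometimes used for sequences of unknown origin; a real analysis would likely need a more robust clustering method and careful threshold selection.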

Publications
