Data mining and characterisation of dark metagenomic sequence data

Lead Research Organisation: University of Glasgow
Department Name: College of Medical, Veterinary, Life Sci

Abstract

Studentship strategic priority area: Mathematics, statistics and computation
Keywords: Data science, bioinformatics, metagenomics, microbiome, virome

The advent of high-throughput sequencing (HTS) has led to an ever-growing deluge of metagenomic sequence data being deposited in online archives. HTS has enabled the routine detection of known pathogenic viruses, the discovery of novel human viruses and human-resident phages, and the characterisation of the human virome, which is essential for our understanding of the role the microbiome plays in health and disease. However, a significant proportion of sequence data cannot be classified because it displays no detectable homology to any known sequence, and such data are typically excluded from further analyses. The aim of this PhD project is to develop a complete computational framework for mining online archives for 'dark' sequence data, combined with the assembly, storage, clustering, and initial classification of such dark sequences. The project will answer fundamental questions about the extent and diversity of dark sequences, classify these sequences into related groups, and predict their biological function and origin. There are three broad stages: (1) the development of data mining pipelines to automatically retrieve metadata and sequence data from the Short Read Archive; (2) the adaptation of existing metagenomic assembly pipelines to sequences of unknown origin, and the development of a database system to store and query the assembled dark sequences; (3) the quantification, analysis and clustering of the identified dark sequences. The project is a combination of data science and bioinformatics/computational biology, with substantial elements of computation, programming and statistics/machine learning. The student undertaking this PhD will be trained in a number of MRC priority quantitative skills areas.
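As a purely illustrative sketch of the clustering idea in stage (3), the snippet below groups sequences by shared k-mer content using a greedy, longest-first scheme. It is not the project's actual method: the function names (`kmers`, `jaccard`, `greedy_cluster`), the k-mer size and the similarity threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of distinct k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(
    seqs: Dict[str, str], k: int = 4, threshold: float = 0.5
) -> List[List[str]]:
    """Greedy clustering (hypothetical parameters, for illustration only).

    Sequences are visited longest first. Each one joins the first cluster
    whose representative shares at least `threshold` Jaccard similarity,
    otherwise it becomes the representative of a new cluster.
    """
    clusters: List[Tuple[Set[str], List[str]]] = []
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        profile = kmers(seqs[name], k)
        for rep_profile, members in clusters:
            if jaccard(profile, rep_profile) >= threshold:
                members.append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append((profile, [name]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Toy contigs, made up for this example.
    toy = {
        "contig_a": "ACGTACGTACGTACGT",
        "contig_b": "ACGTACGTACGTACG",
        "contig_c": "TTTTGGGGCCCCAAAA",
    }
    for group in greedy_cluster(toy):
        print(group)
```

Comparing k-mer sets avoids any alignment against a reference, which suits sequences that by definition have no detectable homology to known ones. A real analysis would more likely rely on established clustering tools and on thresholds tuned to the data.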

Publications


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
MR/S502479/1                                      01/10/2018   31/07/2022
2127267             Studentship    MR/S502479/1   01/10/2018   31/03/2022   Sejal Modha
 
Title Assembled unknown sequences 
Description All assembled unknown sequences generated as part of the current project have been submitted to ENA as third-party annotations and are accessible through BioProject PRJEB41812
Type Of Material Data analysis technique 
Year Produced 2021 
Provided To Others? Yes  
Impact The datasets generated as part of the project have been submitted to ENA and made accessible to the wider research community.
 
Title sejmodha/UnXplore: unxplore_v1 
Description Analysis framework and results described in doi.org/10.1101/2021.01.22.427751 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact A systematic framework was developed to identify the so-called 'dark' sequence matter hidden within publicly available human metagenomic datasets.
URL https://zenodo.org/record/4502139
 
Description Annual GUVMA Rodeo 2019 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other audiences
Results and Impact I ran our Lego-based 'bioinformatics with bricks' activity at the CVR stall at the Annual GUVMA Rodeo 2019.
Year(s) Of Engagement Activity 2019
 
Description Bioinformatics with bricks - SARS-CoV-2 Lego activity 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact A print-at-home activity, based on the existing Lego-based 'bioinformatics with bricks' activity, was developed so that participants can have a go at performing a reference sequence assembly, as a bioinformatician would, for SARS-CoV-2, the virus that causes COVID-19.
Year(s) Of Engagement Activity 2020
URL https://www.gla.ac.uk/researchinstitutes/iii/cvr/events/public%20engagement/resources/