Data mining and characterisation of dark metagenomic sequence data

Lead Research Organisation: University of Glasgow

Department Name: College of Medical, Veterinary, Life Sci

Abstract

Studentship strategic priority area: Mathematics, statistics and computation
Keywords: Data science, bioinformatics, metagenomics, microbiome, virome

The advent of high-throughput sequencing (HTS) has led to an increasing deluge of metagenomic sequence data being deposited in online archives. HTS has enabled the routine detection of known pathogenic viruses, the discovery of novel human viruses and human-resident phages, and the characterisation of the human virome, which is essential for our understanding of the role the microbiome plays in health and disease. However, a significant proportion of sequence data cannot be classified because it displays no detectable homology to any known sequence, and such data are typically excluded from further analyses. The aim of this PhD project is to develop a complete computational framework for mining online archives for 'dark' sequence data, combined with the assembly, storage, clustering, and initial classification of such dark sequences. The project will answer fundamental questions about the extent and diversity of dark sequences, classify these sequences into related groups, and predict their biological function and origin. There are three broad stages: (1) The development of data mining pipelines to automatically retrieve metadata and sequence data from the short read archive. (2) The adaptation of existing metagenomic assembly pipelines towards sequences of unknown origin, and the development of a database system to store and query the assembled dark sequences. (3) The quantification, analysis and clustering of the identified dark sequences. The project is a combination of data science and bioinformatics/computational biology, with substantial elements of computation, programming and statistics/machine learning. The student undertaking this PhD will be trained in a number of MRC priority quantitative skills areas.
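
As a rough illustration of what 'dark' means in the abstract above, the following Python sketch partitions assembled contigs into classified and dark sets according to whether a homology search returned any hit, and reports the dark fraction. It is a minimal sketch under stated assumptions, not the project's actual pipeline: the function names (split_dark, dark_fraction), the contig identifiers and the toy sequences are all hypothetical, and the set of identifiers with hits is assumed to come from a separate homology search step that is not shown.

```python
from typing import Dict, Set, Tuple


def split_dark(
    contigs: Dict[str, str], ids_with_hits: Set[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Partition contigs into (classified, dark).

    A contig counts as 'dark' here if its identifier is absent from
    ids_with_hits, i.e. the (assumed, external) homology search found
    no match to any known sequence.
    """
    classified = {cid: seq for cid, seq in contigs.items() if cid in ids_with_hits}
    dark = {cid: seq for cid, seq in contigs.items() if cid not in ids_with_hits}
    return classified, dark


def dark_fraction(contigs: Dict[str, str], ids_with_hits: Set[str]) -> float:
    """Return the fraction of contigs with no homology hit (0.0 if empty)."""
    if not contigs:
        return 0.0
    _, dark = split_dark(contigs, ids_with_hits)
    return len(dark) / len(contigs)


# Toy, made-up input: four contigs, two of which had a homology hit.
contigs = {
    "contig_1": "ATGCGT",
    "contig_2": "GGCATT",
    "contig_3": "TTAACC",
    "contig_4": "CCGGAA",
}
ids_with_hits = {"contig_1", "contig_3"}

classified, dark = split_dark(contigs, ids_with_hits)
print(sorted(dark))
print(dark_fraction(contigs, ids_with_hits))
```

In practice the dark set would then feed the clustering and quantification described in stage (3); the sketch stops at the partition step because the abstract does not specify the clustering method.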

Student:

Sejal Modha

Period of Study:

Sep 18 - Mar 22

Funder:

MRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2127267

Health Category:

Unclassified

Organisations

University of Glasgow (Lead Research Organisation)

People	ORCID iD
David Robertson (Primary Supervisor)
Sejal Modha (Student)

Publications

Martí-Carreras J (2020) NCBI's Virus Discovery Codeathon: Building "FIVE" - The Federated Index of Viral Experiments API Index, in Viruses

Modha S (2021) Quantifying and cataloguing unknown sequences within human microbiomes

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
MR/S502479/1			30/09/2018	30/07/2022
2127267	Studentship	MR/S502479/1	30/09/2018	30/03/2022	Sejal Modha

Research Databases and Models

Title	Assembled unknown sequences
Description	All assembled unknown sequences generated as part of the current project are submitted to ENA as third party annotations and are accessible through BioProject PRJEB41812.
Type Of Material	Data analysis technique
Year Produced	2021
Provided To Others?	Yes
Impact	The datasets generated as part of the project are submitted to ENA and are made accessible to the wider research community.

Software and Technical Products

Title	sejmodha/UnXplore: unxplore_v1
Description	Analysis framework and results described in doi.org/10.1101/2021.01.22.427751
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	A systematic framework was developed to identify the so-called 'dark' sequence matter hidden within the publicly available human metagenomic datasets.
URL	https://zenodo.org/record/4502139

Engagement Activities

Description	Annual GUVMA Rodeo 2019
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Other audiences
Results and Impact	I ran our Lego-based 'bioinformatics with bricks' activity at the CVR stall at the Annual GUVMA Rodeo 2019.
Year(s) Of Engagement Activity	2019


Description	Bioinformatics with bricks - SARS-CoV-2 Lego activity
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	A print-at-home activity, based on the existing Lego-based 'bioinformatics with bricks' activity, that lets participants have a go at performing a reference sequence assembly like a bioinformatician for SARS-CoV-2, the virus that causes COVID-19.
Year(s) Of Engagement Activity	2020
URL	https://www.gla.ac.uk/researchinstitutes/iii/cvr/events/public%20engagement/resources/
