Data mining and characterisation of dark metagenomic sequence data
Lead Research Organisation:
University of Glasgow
Department Name: College of Medical, Veterinary, Life Sci
Abstract
Studentship strategic priority area: Mathematics, statistics and computation
Keywords: Data science, bioinformatics, metagenomics, microbiome, virome
The advent of high-throughput sequencing (HTS) has led to a growing deluge of metagenomic sequence data being deposited in online archives. HTS has enabled the routine detection of known pathogenic viruses, the discovery of novel human viruses and human-resident phages, and the characterisation of the human virome, which is essential for our understanding of the role the microbiome plays in health and disease. However, a significant proportion of sequence data cannot be classified because it displays no detectable homology to any known sequence, and such data are typically excluded from further analyses. The aim of this PhD project is to develop a complete computational framework for mining online archives for 'dark' sequence data, combined with the assembly, storage, clustering, and initial classification of such dark sequences. The project will answer fundamental questions about the extent and diversity of dark sequences, classify these sequences into related groups, and predict their biological function and origin. There are three broad stages: (1) The development of data mining pipelines to automatically retrieve metadata and sequence data from the Sequence Read Archive (SRA). (2) The adaptation of existing metagenomic assembly pipelines towards sequences of unknown origin, and the development of a database system to store and query the assembled dark sequences. (3) The quantification, analysis and clustering of the identified dark sequences. The project is a combination of data science and bioinformatics/computational biology, with substantial elements of computation, programming and statistics/machine learning. The student undertaking this PhD will be trained in a number of MRC priority quantitative skills areas.
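As a rough illustration of stage (3), the sketch below keeps only contigs that have no homology hit and then groups them by k-mer composition. It is not the project's actual method: the function names, the dictionary-based inputs, the k-mer length, the cosine-similarity threshold and the greedy clustering strategy are all assumptions chosen to keep the example small.

```python
from collections import Counter
from itertools import product
from math import sqrt

K = 2  # k-mer length; a hypothetical choice that keeps the example small


def kmer_profile(seq, k=K):
    """Return the normalised k-mer frequency vector of a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_dark(contigs, hits):
    """Keep contigs with no recorded homology hit (assumed input format:
    contigs maps name -> sequence, hits maps name -> list of matches)."""
    return {name: seq for name, seq in contigs.items() if not hits.get(name)}


def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first cluster whose representative
    profile is similar enough; otherwise start a new cluster."""
    clusters = []  # list of (representative profile, member names)
    for name, seq in seqs.items():
        profile = kmer_profile(seq)
        for representative, members in clusters:
            if cosine(representative, profile) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((profile, [name]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    contigs = {
        "c1": "ATATATATATAT",
        "c2": "TATATATATATA",
        "c3": "GCGCGCGCGCGC",
        "c4": "ACGTACGT",
    }
    hits = {"c4": ["known_virus"]}  # c4 matched a known sequence
    dark = select_dark(contigs, hits)
    print(greedy_cluster(dark))
```

On the toy data, `c4` is dropped because it has a homology hit, and the remaining contigs fall into two groups by composition. A real analysis would likely need longer k-mers, sequence-similarity clustering tools and far larger inputs.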
People | ORCID iD |
---|---|
David Robertson (Primary Supervisor) | |
Sejal Modha (Student) | |
Publications
Martí-Carreras J (2020) NCBI's Virus Discovery Codeathon: Building "FIVE" - The Federated Index of Viral Experiments API Index. in Viruses
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name |
---|---|---|---|---|---|
MR/S502479/1 | | | 30/09/2018 | 30/07/2022 | |
2127267 | Studentship | MR/S502479/1 | 30/09/2018 | 30/03/2022 | Sejal Modha |
Title | Assembled unknown sequences |
Description | All assembled unknown sequences generated as part of the current project are submitted to ENA as third-party annotations and are accessible through BioProject PRJEB41812. |
Type Of Material | Data analysis technique |
Year Produced | 2021 |
Provided To Others? | Yes |
Impact | The datasets generated as part of the project are submitted to ENA and made accessible to the wider research community. |
Title | sejmodha/UnXplore: unxplore_v1 |
Description | Analysis framework and results described in doi.org/10.1101/2021.01.22.427751 |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | A systematic framework was developed to identify the so-called 'dark' sequence matter hidden within the publicly available human metagenomic datasets. |
URL | https://zenodo.org/record/4502139 |
Description | Annual GUVMA Rodeo 2019 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Other audiences |
Results and Impact | I ran our Lego-based bioinformatics with bricks activity at the CVR stall at the Annual GUVMA Rodeo 2019. |
Year(s) Of Engagement Activity | 2019 |
Description | Bioinformatics with bricks - SARS-CoV-2 Lego activity |
Form Of Engagement Activity | Engagement focused website, blog or social media channel |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | A print-at-home activity, based on the existing Lego-based bioinformatics with bricks activity, was developed so that participants can try performing a reference sequence assembly like a bioinformatician for SARS-CoV-2, the virus that causes COVID-19. |
Year(s) Of Engagement Activity | 2020 |
URL | https://www.gla.ac.uk/researchinstitutes/iii/cvr/events/public%20engagement/resources/ |