COMPUTATIONAL METHODS FOR MICROBIAL NEXT GENERATION RE-SEQUENCING DATA

Lead Research Organisation: University of Manchester
Department Name: School of Biological Sciences

Abstract

The overwhelming majority of life that has existed or exists is invisible to the naked eye. These organisms, collectively termed microbes or microorganisms and including the viruses, form large and complex communities. Characterising the species present, genome composition and genetic variation in these communities has been a major focus of 'metagenomics', the genomic study of mixed samples from the environment or from animals or humans, for example from an animal's gut or a soil microbial ecosystem. Contemporary sequencing technologies (next generation sequencing, NGS) have massively parallelised the determination of nucleotide order within genetic material, allowing us to sequence different microbes rapidly. This introduces the potential to explore microbial communities and genetic diversity on an unprecedented scale. Computational methods play a central role in the analysis, alignment and assembly of NGS data. However, the amount of data being generated is outstripping our ability to analyse them routinely, let alone carry out appropriate comparative analysis. This lack of suitable software arises because most research effort is being directed at assembling single complete genomes from next generation sequence data. With microbes, however, many interesting questions concern the diversity of sequences present in a community and population variation, as revealed by 'ultra-deep' sequencing. Emerging approaches aim to build a de novo assembly of the NGS reads (each read is an individual sequence fragment corresponding to a region of a genome), much as a jigsaw puzzle picture is constructed by joining matching pieces together: in de novo assembly the genome sequence is constructed by joining overlapping short reads. The majority of existing de novo assembly approaches for NGS data make extensive use of the de Bruijn graph method.
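
As an illustration of the de Bruijn graph idea mentioned above, the following is a minimal sketch and not the software developed in this project. It assumes error-free reads and a hypothetical function name, de_bruijn_graph; each k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a simple de Bruijn graph from reads (hypothetical helper).

    Nodes are (k-1)-mers. Every k-mer found in a read adds one edge from
    its prefix (first k-1 bases) to its suffix (last k-1 bases). Repeated
    edges are kept, so the list length reflects how often a k-mer was seen.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (illustrative data, not from the project).
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Even in this toy case every k-mer of every read is stored, which hints at why memory use grows quickly for very large NGS data sets.
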
However, building de Bruijn graphs for very large NGS data sets is demanding because it requires substantial computational resources. In this project we propose to develop novel computational methods, based on compressing the individual NGS reads by recasting them as numerical sequences (and working with this transformed/compressed data directly), that will be generically useful for all types of microbial data sets. In order to do this we will explore novel methods for representing short-read sequence data graphically and apply established mathematical approaches for efficient data mining. The particular problem we will address is the assembly of NGS data sets where the variation in the sample needs to be considered in the analysis. In metagenomic data, variation between reads corresponds both to distinct microbial species and to variation within individual species or viral populations. A particularly important focus is the ability to assemble a genome without a reference sequence for comparison (de novo assembly), as an appropriate reference genome is frequently not available for many microbes and, even when a reference is available, genome architecture can vary within a species.
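
To make "recasting reads as numerical sequences" concrete, here is a minimal sketch under stated assumptions. The step values in STEP and the function name read_to_signal are hypothetical choices for illustration; the abstract does not specify which numerical representation the project uses. Each base is mapped to a step, and the running total turns the read into a numeric trace that time series methods can work with.

```python
from itertools import accumulate

# Hypothetical base-to-step mapping, chosen only for illustration.
STEP = {"A": 2, "G": 1, "C": -1, "T": -2}


def read_to_signal(read):
    """Convert a nucleotide read into a cumulative numeric walk."""
    return list(accumulate(STEP[base] for base in read.upper()))


signal = read_to_signal("ACGT")
print(signal)  # [2, 1, 2, 0]
```

Reads containing ambiguous bases such as N would raise a KeyError in this sketch; a real pipeline would need an explicit policy for them.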

Technical Summary

The aim of this project is to address a specific set of unsolved theoretical problems in the fields of metagenomics and microbiology/virology-associated sequencing projects. We will also tackle practical problems, such as the need for more efficient computer pipelines for analysing and assembling NGS data, with particular emphasis on de novo assembly. Data management and processing of NGS short-read data from microbes is usually done with reference to existing genomes. However, owing to high levels of variation, the available algorithms can fail to align homologous reads or can perform poorly in regions with frequent insertions or deletions or where genome architecture is highly variable. In this proposal we will investigate and implement novel methods for alignment and assembly, with and without the use of reference genomes. To achieve this we will use a novel approach for efficient analysis of NGS data that harnesses the speed and accuracy of existing time series data compression/mining techniques. Our approach will use these time series representation techniques to compress the individual NGS reads to lower dimensions, thereby reducing the size of the data to be processed and analysed. Working with this transformed representation of sequence reads will speed up the data analysis and improve the accuracy of results by enabling the use of more thorough heuristics. Existing clustering and indexing approaches and similarity evaluation methods will be used to determine how the reads are linked to either single or multiple genome assemblies, depending on the sample. Pairwise similarity levels of the reads will provide the statistical information required to assess the result of the assembly. We also aim to introduce new methods for visualising NGS alignments graphically, for example in three-dimensional space. The proposed method will provide clear and precise visual information, for example by showing which regions are or are not covered by the assembly.
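
The idea of compressing a read to lower dimensions and then comparing reads in the compressed space can be sketched as follows. This example uses piecewise aggregate approximation (PAA), one established time series representation, purely as an assumed stand-in; the summary does not name the specific technique used. The function names paa and distance, and the toy signals, are hypothetical.

```python
import math


def paa(signal, segments):
    """Piecewise aggregate approximation: the mean of each equal-width frame.

    For simplicity this sketch requires len(signal) to be a multiple of
    `segments`.
    """
    n = len(signal)
    if segments <= 0 or n % segments != 0:
        raise ValueError("signal length must be a positive multiple of segments")
    width = n // segments
    return [sum(signal[i:i + width]) / width for i in range(0, n, width)]


def distance(a, b):
    """Euclidean distance between two reduced representations of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two toy numeric traces of length 8 (illustrative values only).
read_a = [2, 1, 2, 0, 1, 3, 4, 2]
read_b = [2, 1, 2, 0, 1, 3, 2, 0]

reduced_a = paa(read_a, segments=4)  # 8 values compressed to 4
reduced_b = paa(read_b, segments=4)
print(reduced_a, reduced_b, distance(reduced_a, reduced_b))
```

Pairwise distances computed on the reduced vectors are cheaper than comparisons at full resolution, which is the trade-off the summary describes; how much accuracy is retained depends on the representation and the data.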

Planned Impact

Many scientists are using next-generation sequencing technologies in their research. In most projects it is sufficient to combine the short-read data arising from the specific sequencing platform into a consensus sequence, and a number of best-practice computational methods exist for this. However, in next generation re-sequencing projects where the aim is to study depth of variation (for example, in a viral infection of an animal or human), or in metagenomic projects where the aim is to study a community of microbes including viruses, there remains a dearth of appropriate computational methods. The particular nature of metagenomic data (large size and complexity) produces its own challenges as well as some unprecedented opportunities (highlighted in the BBSRC's 2010 Review of Next Generation Sequencing). For the full potential of metagenomic studies to be realised there is a need for novel computational tools for NGS data analysis. Our approach will reduce the complexity of NGS data sets, permitting the implementation of more rigorous algorithms and, as a consequence, improving data storage and analysis and the reliability of the results that can be obtained from NGS data sets.

The wider exploitation of metagenomics approaches will have a number of beneficiaries:

(i) Researchers in academia and not-for-profit organisations who study biodiversity in environmental or organism samples.
(ii) Those detecting and monitoring microbial and viral pathogens in agriculture (plants or animals).
(iii) Public-health researchers who are interested in detecting novel or existing pathogens.
(iv) Researchers in the commercial sector in the form of companies developing screening techniques, environmental monitoring etc.

Our primary means of communicating with potential beneficiaries will be presentations at major conferences and peer-reviewed papers in open access journals. Such traditional channels are important to ensure the quality of the research. We will also engage directly with individuals, institutions and companies that are likely to find our research applicable, starting with the collaborative partners we have listed here. Conference presentations are also very effective for establishing contact with other potential beneficiaries and collaborators in academia or industry.
 
Description We have successfully applied signal transformation and dimensionality reduction methods to high-throughput sequencing data. Despite using compressed sequence transformations, our implementation yields alignments of comparable accuracy to existing aligners, in some cases outperforming other tools at high levels of sequence diversity.
Our results demonstrate that full sequence resolution is not a prerequisite for accurate sequence alignment/assembly and that analytical performance can be retained, and even enhanced, through appropriate dimensionality reduction of sequences. Note that the project is still under way, so not all objectives have yet been met.
Exploitation Route Our approach could be applied to other researchers' data.
Sectors Pharmaceuticals and Medical Biotechnology

URL http://biorxiv.org/content/early/2015/01/27/011940
 
Title Alignment by numbers 
Description Short sequence read aligner based on signal transformation and dimensionality reduction methods. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Novel approach to sequence alignment. 
URL https://github.com/Avramis/Alignment_by_numbers