Mapping short reads in RY-space: a novel strategy for extending the phylogenetic range of Next Generation Sequence mapping algorithms

Lead Research Organisation: University of Liverpool
Department Name: Institute of Integrative Biology

Abstract

Modern DNA sequencers can generate billions of short DNA fragments, each typically only 50 to 100 nucleotides in length. To make sense of these fragments, or reads, it is usually necessary to compare them with previously sequenced DNA, often an entire genome, which may be several billion nucleotides in length. This is a difficult computational challenge. Although comparing one piece of DNA with another is a relatively simple process, the sheer number of fragments generated, combined with the often large size of the reference genome, means that without well-designed software it is hard, if not impossible, to process all the data produced by a single sequencing run within a practical timeframe using current computers. Fast and efficient software has duly been developed but, in order to achieve a reasonable mapping speed with the latest computers, compromises have to be made. Current software can accommodate only a few differences between short DNA sequences for a successful match to be identified (often only two or three differences within a short stretch are possible). It is this limitation which makes it very difficult to compare DNA between species, where a high level of variation is to be expected, and so it is hard to make sense of short-read data if the organism in question has not been sequenced before. Unfortunately, most species of research, economic, and clinical importance have yet to be sequenced. Species that have not been sequenced before must go through a far more expensive process of long-read genome sequencing, and this means that the genomic analysis of many organisms remains financially unfeasible with current technologies. We propose a new way of mapping DNA that will, at least in part, overcome this difficulty. Our approach is based on the observation that evolutionary patterns existing within DNA sequences are more fully revealed when the information content is reduced and the sequence pattern simplified.
A stretch of DNA can be thought of as a complex pattern of four types of nucleotide: Adenine, Cytosine, Guanine and Thymine (A, C, G, and T for short). A and G are purine nucleotides, whilst C and T are pyrimidines. It has long been recognised that, as organisms evolve, the rate at which a purine mutates into another purine, or a pyrimidine into another pyrimidine, will tend to be higher than the rate at which purines mutate into pyrimidines and vice versa. This imbalance in mutation rate will create patterns of purines and pyrimidines within DNA sequences that are more stable motifs of shared ancestry between species than is the case with noisier nucleotide patterns. We will use this more robust pattern to match sequences from different species: we will develop software which will simplify DNA sequences down to their purine and pyrimidine content alone, compare them to identify similarities using approaches equivalent to those currently used with raw DNA sequences, then convert them back into their original nucleotides for subsequent analysis. Because the conversion from individual nucleotides to their purine/pyrimidine identities is simple, the speed with which translated reads can be mapped will be comparable to that achieved with raw DNA. Thus, using our strategy, mapping should be almost as quick as current methods and use similar levels of computer resources. However, the extent to which one species can be compared with another will be far greater, meaning that it will be possible to sequence more organisms with low-cost short-read sequencing technologies even when a reference sequence for that particular species is not available. As a result, lower cost sequencing will become practical for a much wider range of organisms than is currently possible, ensuring that the new techniques currently being developed that rely on short-read sequencing can be applied in many more contexts.
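
The simplify, compare, and convert-back steps described above can be illustrated with a minimal Python sketch. It is not the project's software: the function names (to_ry, mismatches, map_read_ry) and the example sequences are hypothetical, and a naive sliding-window scan stands in for the indexed search a real short-read mapper would use.

```python
# Illustrative sketch only; names and sequences are assumptions, not project code.

# A and G are purines (R); C and T are pyrimidines (Y).
_RY_TABLE = str.maketrans("AGCTagct", "RRYYRRYY")


def to_ry(seq: str) -> str:
    """Collapse a nucleotide string to purine/pyrimidine (RY) symbols.

    Symbols other than A, C, G, T (for example N) are passed through unchanged.
    """
    return seq.translate(_RY_TABLE)


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def map_read_ry(read: str, reference: str, max_mismatches: int = 0) -> list:
    """Return reference offsets where the read matches in RY-space.

    A naive scan of every window; a real mapper would use an index instead.
    """
    ry_read = to_ry(read)
    ry_ref = to_ry(reference)
    width = len(ry_read)
    return [
        i
        for i in range(len(ry_ref) - width + 1)
        if mismatches(ry_read, ry_ref[i:i + width]) <= max_mismatches
    ]


reference = "TTGCATTACATT"
read = "ACGTTGCA"  # differs from reference[2:10] by three transitions

hits = map_read_ry(read, reference)
for pos in hits:
    # "Convert back": recover the original nucleotides at the mapped position.
    original = reference[pos:pos + len(read)]
    print(pos, original, mismatches(read, original))
```

In this toy case the read differs from the matching reference window at three nucleotide positions, all of them purine-to-purine changes, so the two are identical once reduced to RY symbols. The trade-off is that a two-letter alphabet carries less information per position, so chance matches become more likely for a given read length.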

Technical Summary

We will develop a novel algorithm for mapping Next Generation Sequencing (NGS) short reads to a reference sequence that extends the phylogenetic range of existing mapping algorithms without compromising on computer resources or time. By rendering nucleotide sequences in terms of their purine and pyrimidine content alone, we believe more robust phylogenetic detail can be revealed, allowing for mapping over a greater phylogenetic distance than is currently possible with existing short-read mapping tools. The three main aims of this proposal are: 1. Formulate a basic algorithm to demonstrate fully that mapping in purine-pyrimidine sequence space (RY-space) can extend the phylogenetic range within which NGS-generated reads can be mapped, and that this can be achieved with computer resources comparable to those of current short-read mappers. 2. Extend our algorithm to explore the potential of mapping in RY-space, developing the strategy with particular emphasis on how the method could be extended to provide additional phylogenetic insight as part of the mapping process. 3. Develop an open-source software solution that implements our final algorithm and allows RY-space mapping on all major NGS platforms. Our RY-mapper tool will be based on existing short-read mapper algorithms. Our current preference is to employ a Burrows-Wheeler transform, with appropriate backtracking to allow for mismatches, as implemented by the Bowtie and BWA software packages. The algorithm will be adapted to work in RY mapping space. We will develop a novel validation step based around the K80 phylogenetic distance estimation model of Kimura (1980). The range and utility of the mapping tool will be assessed, paying particular attention to read length, the impact of low complexity, GC content, and intra- and inter-gene sequence challenges. Pilot software will consist of Perl wrapper scripts around existing mapping tools. Final software will be based on existing C/C++ open source code.
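
For reference, the K80 (Kimura two-parameter) distance mentioned above can be computed from the proportions of transitions (P) and transversions (Q) between two aligned sequences as d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The Python sketch below shows only this standard formula. How the proposed validation step would apply it is not described here, so the function names and the idea of scoring a read against its mapped reference window are assumptions made for illustration.

```python
# Illustrative sketch of the standard K80 distance; helper names are assumptions.
import math

PURINES = {"A", "G"}


def count_differences(a: str, b: str):
    """Return (transitions, transversions, compared_sites) for aligned sequences.

    Positions where either sequence has a symbol other than A, C, G, T are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = sites = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        # Same chemical class (purine/purine or pyrimidine/pyrimidine) is a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, sites


def k80_distance(a: str, b: str) -> float:
    """Kimura (1980) two-parameter distance between two aligned sequences."""
    ts, tv, n = count_differences(a, b)
    if n == 0:
        raise ValueError("no comparable sites")
    p = ts / n  # proportion of transitions
    q = tv / n  # proportion of transversions
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# One transition (A/G) and one transversion (A/C) over ten sites.
print(k80_distance("AAAAAAAAAA", "GCAAAAAAAA"))
```

Because the formula treats transitions and transversions separately, it fits naturally with RY-space mapping, where transitions are invisible during matching but can be counted once the original nucleotides are restored.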

Planned Impact

Who will benefit from this research? The most immediate benefit will be to researchers, both in the academic community and within the commercial sector, who want to take full advantage of the latest next generation sequencing (NGS) technologies. In particular, areas of research where sequencing resources have been limited until now will benefit, as our methodology and associated software will enable greater use of low-cost short-read sequencing technologies. Other major beneficiaries will be the national and international genome sequencing centres that provide sequencing services to both the academic and commercial sectors. They will benefit in terms of the services they can offer, and also in the range of research areas they will be able to cater for. Longer term, this research will form the foundations for further, much needed, improvements in software provision for next generation sequencing. As the capacity of NGS technologies to generate sequence data continues to increase, they will impose ever greater burdens on the research community's computing infrastructure. Methodological enhancements of the sort described in this proposal will be necessary to ensure we can adequately accommodate this explosion in data. How will they benefit from this research? Our proposal will extend the phylogenetic range of current Next Generation Sequencing (NGS) mapping algorithms. As such, it will provide great benefit to any researcher who plans to exploit low-cost short-read NGS technology but lacks a suitable reference sequence with which to interpret their data. Currently, the most economical method of generating sequence data is to employ short-read sequencing technologies, as represented by the Illumina and Applied Biosystems SOLiD platforms.
However, to work with short reads (typically 50-100 bp), it is almost inevitable that a reference sequence will be required to interpret the data: analysis of short-read data without a guide reference is possible, but is almost always an involved process due to the limitations of current de novo assembly algorithms and the degree of manual intervention required. Current mapping algorithms require that the reference be of the same species as the organism being sequenced. For many researchers, this is not possible: the majority of organisms of economic, clinical, or general research interest have yet to be sequenced sufficiently with longer read technologies. Through our methodology, and with our software, researchers will be far less constrained. It will be possible to compare organisms belonging to different taxa. It will also greatly assist in bringing short-read sequencing technologies into metagenomics studies (for example, gut microflora, soil microbial communities, etc.), bringing with them substantial cost savings. Our proposal will also allow Dr Kevin Ashelford, the Researcher-Co-Investigator on this project, to further develop and enhance his software development skills, which will have direct value in all employment sectors, whether public or private.

Publications

Description We have developed software for improving the alignment of data from short-read sequencers to reference genomes from distantly related species.
Exploitation Route The algorithm may be used for other sequencing applications such as phylogenetics. Once we have released the code, it may be used in a wide spectrum of comparative genomics work.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Education; Pharmaceuticals and Medical Biotechnology

URL https://github.com/riteshkrishna/longshadow
 
Description The software developed in this application has been used by the Centre for Genomic Research at Liverpool to support projects for numerous academics and some commercial groups such as Unilever.
First Year Of Impact 2013
Sector Agriculture, Food and Drink; Education
Impact Types Economic