Tool for finding linked genetic polymorphisms in reference-less complex plant genomes from unassembled next-generation reads.

Lead Research Organisation: Earlham Institute
Department Name: Research Faculty

Abstract

Differences between the genomes of individuals of the same species, called polymorphisms, are the genetic basis of traits such as resistance or susceptibility to disease. By identifying polymorphisms it is possible either to pinpoint the agents of resistance or susceptibility or, at the least, to locate nearby regions of the genome that can act as positional markers associated with the trait of interest. Many wild populations of plant species that are closely related to domesticated varieties important for food and industry are resistant to common diseases that could potentially devastate important crops across the world. Combating these diseases chemically is both costly and environmentally damaging, so breeding resistant varieties is essential for food security. Genetic methods for identifying markers are time consuming and require large amounts of expensive and slow laboratory work. New methods in high-throughput DNA sequencing, known as next-generation sequencing (NGS), can comprehensively sample entire genomes at an affordable cost. These technologies return many millions of small fragments rather than a continuous sequence. The volume of data generated by NGS instruments has created a need for new methods to assemble the fragments or to align them to an existing, previously assembled reference sequence. Currently, polymorphism identification relies on having some sort of reference to which sequence reads can be aligned. Aligned reads are then examined for consensus differences from the reference, which indicate a genetic difference between the sampled genome and the reference. Naturally this is only possible where a reference genome exists. Since the creation of even a rough draft genome sequence can take many months, detecting polymorphisms that specify disease resistance in relatives of agriculturally important organisms that have no such reference becomes a massively time-consuming and difficult task.
Even when a reference sequence is available, identifying polymorphisms among many individuals from a population, for example to associate genotypes with specific phenotypes, requires many cycles of alignment and comparison. Our objective is to develop a tool that takes advantage of recent developments in high-throughput DNA sequencing and new computational methods to identify polymorphisms between multiple sources without the need for comparison with a reference sequence. These methods will allow us to detect genetic variants directly from the raw sequence reads. The time required would be on the order of hours, rather than the months or years needed where assembly is required. The tool will produce short but useful genomic mini-assemblies with embedded polymorphisms that can be utilised by bench workers for downstream experiments. We will be able to rank SNPs (single nucleotide polymorphisms) and classify them based on the provenance of different reads, for example detecting SNPs common to individuals with a trait. The tool will be an important addition to the repertoire of methods available to bioinformaticians involved in polymorphism detection and could be invaluable to projects without an available reference sequence. The tool will also prove useful to bioinformaticians who do have a reference sequence: it will remove the need for many sequential alignments to a reference and compress subsequent polymorphism detection into a single step.

Technical Summary

SNP detection with next-generation sequencing technology currently requires a reference genome to which reads may be aligned. A recent collaboration between bioinformatics groups at TGAC and TSL has extended the Cortex framework (Caccamo and Iqbal) to develop a SNP discovery platform that leverages the multicolour capabilities of Cortex to work directly on bulked samples. Our implementation will also classify SNPs based on the colour and number of paths through each node in a bubble, allowing distinction between homo- and heterozygous SNPs. Each called SNP is ranked using a heuristic that takes into account k-mer coverage, quality scores, deviation from any expected ratio of k-mers in each colour (when detecting heterozygous SNPs) and equality of coverage on each branch. As a proof-of-principle experiment, we ran our software on Illumina reads from cDNA of Solanum berthaultii populations that were either heterozygous resistant to Late Blight or homozygous susceptible, and predicted SNPs that were later verified using conventional sequencing experiments. We propose to extend our implementation to handle bubbles more complex than the naïve canonical simple SNP form, in particular indels, nested bubbles, bubbles that originate from a path on another bubble and bubbles within k of each other. These events are very relevant to the type of heterozygous and homeologous repeats present in plant genomes. We will use the colour concept to allow association studies to relate selected bubbles present in bulks or accessions to interesting phenotypes. The tool is timely and innovative, as it will allow groups working on next-generation genetics projects with un-sequenced organisms to take fuller advantage of next-generation sequence data and speed up research programmes significantly. The bubble detection and colour algorithms we plan will make the tool flexible enough to detect many polymorphism classes and will allow much quicker creation of genetic markers.
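
The idea of a bubble in a coloured graph can be illustrated with a minimal sketch. The Python below is a toy written for this summary, not the Cortex or bubbleparser implementation: all function names, the two made-up sequences and the colour labels are assumptions. It ignores reverse complements, sequencing errors, quality scores and the ranking heuristic, and only finds the simplest case, a single fork whose two equal-length branches rejoin at one node.

```python
# Toy coloured de Bruijn graph. Every name here is hypothetical; this is
# an illustration only, not the Cortex or bubbleparser code.

def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_coloured_graph(samples, k):
    """Map each k-mer to {colour: count}; samples maps colour -> reads."""
    graph = {}
    for colour, reads in samples.items():
        for read in reads:
            for km in kmers(read, k):
                counts = graph.setdefault(km, {})
                counts[colour] = counts.get(colour, 0) + 1
    return graph

def successors(graph, km):
    """k-mers in the graph that overlap km by k-1 bases on the right."""
    return [km[1:] + b for b in "ACGT" if km[1:] + b in graph]

def predecessors(graph, km):
    """k-mers in the graph that overlap km by k-1 bases on the left."""
    return [b + km[:-1] for b in "ACGT" if b + km[:-1] in graph]

def walk_branch(graph, start, max_nodes):
    """Follow a non-branching path from start until a merge node.

    Returns (branch_nodes, merge_node), or None if the path forks,
    dead-ends, or exceeds max_nodes before rejoining.
    """
    path, node = [start], start
    while len(path) <= max_nodes:
        nxt = successors(graph, node)
        if len(nxt) != 1:
            return None
        node = nxt[0]
        if len(predecessors(graph, node)) > 1:
            return path, node
        path.append(node)
    return None

def find_simple_bubbles(graph, k):
    """Return (fork, branch_a, branch_b, merge) for each simple bubble."""
    bubbles = []
    for fork in sorted(graph):
        starts = successors(graph, fork)
        if len(starts) != 2:
            continue
        walked = [walk_branch(graph, s, k) for s in starts]
        if None in walked:
            continue
        (path_a, merge_a), (path_b, merge_b) = walked
        if merge_a == merge_b and len(path_a) == len(path_b):
            bubbles.append((fork, path_a, path_b, merge_a))
    return bubbles

def branch_sequence(fork, path, merge):
    """Spell out the sequence from the fork through a branch to the merge."""
    return fork + "".join(node[-1] for node in path) + merge[-1]

def branch_coverage(graph, path, colours):
    """Minimum k-mer count along the branch, per colour."""
    return {c: min(graph[n].get(c, 0) for n in path) for c in colours}

# Made-up data: two alleles differing at one base. The "resistant" bulk
# carries both alleles (heterozygous); the "susceptible" bulk carries one.
K = 5
REF = "ATCGGATCCTAGCAAGT"
ALT = "ATCGGATCATAGCAAGT"
samples = {"resistant": [REF, ALT], "susceptible": [REF, REF]}
graph = build_coloured_graph(samples, K)

for fork, path_a, path_b, merge in find_simple_bubbles(graph, K):
    for path in (path_a, path_b):
        print(branch_sequence(fork, path, merge),
              branch_coverage(graph, path, sorted(samples)))
```

On this toy input the sketch reports one bubble. The branch spelling GGATCATAGCA has coverage only in the "resistant" colour, while GGATCCTAGCA is covered in both, which is the kind of colour pattern one would expect for an allele present only in a heterozygous resistant bulk.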

Planned Impact

A tool capable of identifying polymorphisms without a reference sequence would benefit scientists working in genetics, speeding research significantly and making it possible to work with organisms for which genomic resources are currently few. This would inspire scientists in many fields to do new analyses with species that are not currently tractable. Further, it would energise the field of polymorphism and SNP detection and help stimulate research into an exciting new branch of tool development. Our tool would expedite experimentation by shortening the time from sequence acquisition to polymorphism detection and help stimulate new discoveries in the field of biotechnology. Non-academic groups who would benefit from our tool include biotechnology companies, those involved in breeding plants and animals for agronomically important traits and, indirectly, the agricultural community, including farmers. The PIs will take the lead on managing the impact plan. The plan will be an agenda item at monthly project meetings. Both PIs have excellent track records in communicating the outcomes of their research to a broad audience. Primarily this is through publication in academic journals, but also increasingly in open forums on the internet. We provide regular project updates and code releases via channels such as Twitter and GitHub. TSL/TGAC has a dedicated communications office for release of information to the general public through websites and the media. Our software will be open source and released under a non-restrictive licence to the academic community via laboratory websites, links from published articles and code-sharing sites. Discoveries made with the tool, e.g. SNPs or genes linked to disease resistance, will be covered by TSL/TGAC's Technology Transfer Policy, which is based on maintaining close links with those who are able to make use of discoveries for the benefit of society.
Discoveries at TSL/JIC are monitored to establish whether they present opportunities to obtain Intellectual Property Protection. This is typically through patenting. The PIs will oversee the impact activities and when necessary will seek the assistance of other project members and expert staff at TSL/TGAC. Where impact activities include technology transfer or outreach/press release, the relevant office at TSL/TGAC will be involved. The PIs have prior experience in writing scientific and general articles, as well as developing websites. Postdoctoral workers will be encouraged to develop their communication skills within both the academic and non-academic community, with the latter aimed at an understanding of the wider value of their research.

Description We developed three tools, bubbleparser, bubbleparser.PM (a Perl module) and 2Kplus2, for identifying variation within the genomes of a plant population from only short lengths of sequence. This enables such analysis to be performed with significantly less data than would be required to first assemble a full genome sequence. We focused on ranking the results to decrease the number of false positives. All the software developed within this grant is available as free open-source software, the former on GitHub and the latter on SourceForge.
Exploitation Route The source code and tools are available for free public use and can be applied to any species or group of species (metagenomic samples).
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Healthcare; Pharmaceuticals and Medical Biotechnology

URL https://github.com/richardmleggett/bubbleparse