A platform for massive parallel sequencing of longPCR amplicons

Lead Research Organisation: Natural History Museum
Department Name: Life Sciences

Abstract

New generation sequencing techniques offer an unprecedented means of sequencing genes and genomes at a fraction of previous costs and at a phenomenal density of coverage. A variety of platforms offer different techniques. 454 pyrosequencing, also known as massive parallel sequencing, has the advantage of providing relatively long sequence reads (~450 nucleotides) in over 1 million individual reaction chambers on a pico-titre plate; developments are under way to capture even longer reads. When templates from different sources are mixed, sequences need to be linked back to their source. Two approaches are possible: (i) processing individual samples on single pico-titre plates or on individual gasketed sections of a plate (up to 16), or (ii) chemically tagging templates with unique sample-specific markers. Long lengths of DNA (up to 20,000 nucleotides) are routinely amplified with specialised polymerase chain reactions for a diversity of purposes by a wide variety of users of molecular tools. By sequencing the ends of these long amplicons using traditional methods, and by relying on bioinformatic tools to unscramble the data accurately, we propose a method that allows hundreds of long amplicons to be pooled, fragmented, massively parallel sequenced, accurately reassembled and identified, thus reducing existing costs by orders of magnitude. The technique will allow routine multiplex sequencing of longPCRs where previously only short fragments could be sequenced, or where expensive sample-specific tagging and/or cloning was required. We will test the methodology by generating longPCR amplicons from parasitic helminths, for which (i) we have a wide diversity of samples available and considerable experience of handling them, and (ii) there is wide interest and need, including in diagnostics, biodiversity studies and evolutionary parasitology. Simulation studies will be used in conjunction with real data to develop, refine and test the bioinformatics pipeline for wider application.
The methodology and associated open access computer applications will be transferable to any biological system where diverse longPCR fragments are sequenced regardless of the origin of the DNA.
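
As a rough illustration of the scale described in the abstract, the sketch below estimates the mean depth of coverage for a pooled run. The read count (1 million) and read length (~450 nucleotides) are the figures quoted above; the pool size of 200 amplicons of 20,000 nucleotides each is a hypothetical stand-in for "hundreds of long amplicons", and the calculation assumes every read is usable and that amplicons are pooled in equal amounts, which a real run would not fully achieve.

```python
def mean_coverage(n_reads, read_len, n_amplicons, amplicon_len):
    """Mean depth of coverage: total bases sequenced / total bases targeted.

    Assumes all reads are usable and amplicons are equally represented
    (both are simplifying assumptions, not claims about a real run).
    """
    total_sequenced = n_reads * read_len
    total_target = n_amplicons * amplicon_len
    return total_sequenced / total_target


# Figures from the abstract: ~1 million reads of ~450 nucleotides.
# Hypothetical pool: 200 amplicons of 20,000 nucleotides each.
coverage = mean_coverage(1_000_000, 450, 200, 20_000)
print(f"Estimated mean coverage: {coverage:.1f}x")  # 112.5x
```

Under these idealised assumptions a pool of this size would still be sequenced at more than 100x on average; uneven pooling or unusable reads would lower the depth for some amplicons.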

Technical Summary

When the identity of the original amplicons needs to be known, high-throughput massively parallel sequencing of mixed amplicons requires sample-specific markers to be added to amplicon libraries, either as Multiplex Identifiers (MIDs) or as user-designed markers in Parallel Tagged Sequencing. Adding sample-specific markers prior to PCR or emPCR library construction is costly and time-consuming. Alternatively, up to 16 individual samples can be run concurrently on separate sections of a 454 plate, although this halves the total number of reads achievable; a mixture of MIDs and gasketed plates allows up to 192 samples to be run concurrently, but numerous additional costly steps are required prior to emPCR. We have noted and shown that amongst many samples of longPCRs, particularly those including relatively rapidly evolving protein-coding genes (e.g. lengths of mitochondrial (mt) DNA), sequences from different species (even sister taxa) are sufficiently different from one another that contig assembly programs can be tailored to untangle and assemble mixed sequences accurately. With sufficient differences between the original longPCRs, and with the long read lengths offered by 454 technology, pooled samples can be multiplexed, massively parallel sequenced and reassembled without chemical tagging of individual reads. Instead, Sanger sequencing of the ends of each longPCR offers quality control and provides unique 500 bp identifiers with which to assign identity to reassembled contigs. Using established primer sets and readily available material, we propose to demonstrate that a mixed, pooled sample of longPCRs from complete 28S rDNA and mtDNA from a diversity of parasitic helminths can be sequenced to completion with >100x coverage in a single 454 run. Bioinformatic tools building on available assembly software and scripts will be developed to optimise a pipeline for accurately reassembling the data. Simulations will be run to evaluate the limits of the approach for future applications.
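
The identity-assignment step described above can be sketched as follows. This is a minimal illustration, not the project's pipeline: it assumes contigs have already been assembled, and it stands in for a proper alignment step with a simple shared k-mer score on both strands. All names (`assign_contigs`, `end_tags`, the toy sequences, the `k` and `min_score` values) are hypothetical, and the toy tags are far shorter than the 500 bp Sanger end sequences.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tag_score(tag, contig_kmers, k):
    """Fraction of the tag's k-mers that occur in the contig."""
    tag_kmers = kmers(tag, k)
    if not tag_kmers:
        return 0.0
    return len(tag_kmers & contig_kmers) / len(tag_kmers)


def assign_contigs(contigs, end_tags, k=6, min_score=0.8):
    """Assign each contig to the sample whose two end tags both match it.

    contigs:  {contig_name: sequence}
    end_tags: {sample_name: (tag_one_end, tag_other_end)}
    Returns {contig_name: sample_name or None}.
    """
    assignments = {}
    for name, seq in contigs.items():
        # Index both strands, since contig orientation is arbitrary.
        contig_kmers = kmers(seq, k) | kmers(revcomp(seq), k)
        best_sample, best_score = None, 0.0
        for sample, tags in end_tags.items():
            # Require both ends to match: take the weaker of the two scores.
            score = min(tag_score(tag, contig_kmers, k) for tag in tags)
            if score > best_score:
                best_sample, best_score = sample, score
        assignments[name] = best_sample if best_score >= min_score else None
    return assignments


# Toy data (hypothetical sequences, for illustration only).
contigs = {
    "contig_1": "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCATTAACG",
    "contig_2": "TTGACCGGTAAACCCGGGTTTAAACGCGCGATATATCGCG",
    "contig_3": "AAAAAAAAAAAAAAAAAAAA",
}
end_tags = {
    "sample_A": ("ATGGCGTACG", "GCATTAACG"),
    # Second tag lies on the reverse strand of contig_2.
    "sample_B": ("CGCGATATAT", "CCGGTCAA"),
}

result = assign_contigs(contigs, end_tags)
print(result)
```

Requiring both end tags to match is one way to use the Sanger reads as a quality check as well as an identifier: a contig that carries only one expected end may be incomplete or chimeric. With real data, sequencing errors would make a tolerant aligner a more suitable choice than exact k-mer matching.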

Planned Impact

Users of longPCR amplicons for screening genetic and genomic diversity are widespread across disciplines as disparate as biomedical science, applied molecular biology, genomics, population genetics, and evolutionary biology. Researchers in any field of biological research or application that requires accurate, cost-effective, high-throughput sequencing of longPCRs, without the need for cloning steps or amplicon barcoding, will benefit. Until now, routine sequencing of longPCRs has not been cost-effective and has taken considerable time, thus preventing routine use of these established applications and high fidelity PCR enzymes. This project will bring the cost down by at least 1-2 orders of magnitude whilst increasing speed and depth of sequencing by many orders of magnitude. Publication through open access articles, online postings to end-user groups, listservers and a dedicated project web page will promote the development of these methods and the associated bioinformatics tools. Other beneficiaries include bioinformaticians developing methods of accurate contig assembly from next generation sequencing (NGS) data, who require dedicated code and methodologies for untangling multiplexed sequences and reassembling larger fragments accurately. The NGS community will benefit from access to open source code, commentaries and results of simulations posted on a variety of websites. Targeted audiences will be addressed through seminars and conferences in association with partners (Applied Genomics Facility, Liverpool), collaborators (Univ. Melbourne) and with direct assistance from the Natural History Museum's Press Office, Research & Consulting Office and Interactive Media teams.
We expect that Roche (454 Life Sciences) will take an active interest in the use of their platform (and longPCR kits) for the development of these tools and resources, and we will engage directly with them and other companies with alternative NGS platforms in promoting the results of the project, in line with BBSRC recommendations.

Publications

Gasser RB (2012) Mitochondrial genome of Angiostrongylus vasorum: comparison with congeners and implications for studying the population genetics and epidemiology of this parasite. In: Infection, Genetics and Evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases

 
Description Next generation sequencing (NGS) allows rapid, high throughput characterization of DNA molecules. An expensive stage of the process is making a "library" of fragments from each DNA that we want to sequence. Usually, an individual library needs to be made for each kind of DNA of interest. Tagging libraries with different short identifying DNA tags makes it possible to sequence multiple libraries in the same sequencing run. However, we have devised a way to pool, sequence and identify multiple long DNA fragments, including complete mitochondrial genomes, without the need for expensive tagging or sample preparation. Success relies on accurate reassembly of the data and on determining the identity of reassembled fragments from short reference sequences.
Exploitation Route Currently being explored with key taxonomic groups (helminths of medical and veterinary importance)
Sectors Agriculture, Food and Drink, Environment, Healthcare, Pharmaceuticals and Medical Biotechnology

 
Description The findings have been successfully implemented in various biodiversity surveys within the NHM. NGS advances have allowed the method to be used across multiple scales. The method is being applied to biodiversity studies as well as characterisation of helminths of biomedical and veterinary importance.
First Year Of Impact 2013
Sector Agriculture, Food and Drink, Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description ARC Linkage
Amount $240,000 (AUD)
Funding ID LP100100091 
Organisation Australian Research Council 
Sector Public
Country Australia
Start 06/2010 
End 06/2012