Using Solexa/Illumina methods to investigate plant pathogen variation and transcriptome

Lead Research Organisation: University of East Anglia
Department Name: Sainsbury Laboratory

Abstract

Using the Illumina Genome Analyzer (www.illumina.com), five races of Hp in addition to Emoy2 (Noco2, Maks9, Cala2, Waco9 and Hind2) will be re-sequenced with the objective of identifying genes that show diversifying selection. To identify expression patterns within these genomic regions, a serial analysis of gene expression (SAGE)-based mRNA profiling method (Velculescu et al. 1995 Science 270: 484-487) will be established using Solexa sequencing. This novel method will be based on the SMART cDNA protocol (www.clontech.com) to obtain reads from the 5' end of transcripts. This will reveal where transcripts start, allow semi-quantitative analysis of expression levels, and show which genes are expressed at different stages of infection. To identify new open reading frames and verify predicted ones, a method will be established to sequence the transcriptome of the pathogen growing in planta. Transcriptome analysis of an obligate biotrophic pathogen will take advantage of a method newly developed in the Jones lab for enriching genes expressed by pathogens in plants (Rougon and Jones, unpublished), which will be combined with a cDNA normalization technique. The Solexa sequencing approach relies on attachment of randomly fragmented (nebulised) DNA to a flow cell. Since short cDNAs do not fragment randomly, a method will be established that allows cDNA concatamerisation prior to random fragmentation. We would like to apply this cDNA method to pathogens whose genomes are not yet sequenced. Currently available methods for assembling short reads that have proved useful with bacterial DNA will be tested and adapted for de novo cDNA assembly. A computational method will be developed to combine all data into one database that allows easy access to information about variation of genome sequences between races, expression levels, gene structure and possible functions. All data will be made publicly available.

Technical Summary

Using a massively parallel sequencing approach, the Illumina Genome Analyzer (www.illumina.com) can generate more than one billion base pairs in a single run. Two runs will be enough to re-sequence and re-assemble one strain of Hp. The assembly will be guided by the available Hp Emoy2 reference genome. The goal of our new SAGE protocol, based on the SMART cDNA method, is to enrich for tags at the 5' UTR of the mRNA, instead of the tags generated from random DpnII or NlaIII sites in the current protocol. To sequence the whole transcriptome starting from total mRNA of an infected leaf, a normalization step is necessary to maximize representation of less abundant genes. Hp genes expressed in planta will be enriched by a cDNA selection method established in the lab (Rougon and Jones, unpublished). To achieve random fragmentation, short cDNAs need to be concatamerised before nebulisation. Concatamerisation will be achieved by modifying the 'Creator SMART cDNA' kit so that 5' ends of the cDNA can be ligated at high concentration to 3' ends. Assembly methods will be based on the recently developed Short Sequence Assembly by progressive K-mer search and 3' read Extension (SSAKE v1.1) program, which was successfully tested for de novo sequence assembly (Warren et al. 2006 Bioinformatics 23: 500-501). SSAKE version 1.1 and related programs are particularly adapted to Solexa sequencing and address the concern that Solexa reads show higher error rates towards the end of the sequence, where signal intensity decreases during a sequencing run. Other short read assembly methods developed in the Birney lab (e.g. 'Velvet') are currently being established and are more tolerant of higher error rates. One run should enable 60x deep sequencing of most genes which, based on our experience with bacterial genome sequencing, should be sufficient to generate contigs of >1000 bp by de novo assembly.
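
The idea of assembling short reads by 3' extension can be illustrated with a small sketch. This is a toy greedy extender written for illustration only, not the SSAKE or Velvet implementation, which are considerably more involved. The function name `greedy_extend`, the `min_overlap` parameter and the example reads are all hypothetical, and the sketch assumes error-free reads on a single strand.

```python
def greedy_extend(seed, reads, min_overlap=4):
    """Extend `seed` in the 3' direction with overlapping reads (toy sketch).

    At each step, pick the unused read whose prefix matches the current
    contig suffix with the longest overlap (at least `min_overlap` bases)
    and append the part of the read that hangs past the contig end.
    """
    contig = seed
    pool = list(reads)
    if seed in pool:
        pool.remove(seed)  # the seed itself is already in the contig

    while True:
        best_read, best_len = None, 0
        for read in pool:
            # len(read) - 1 guarantees the read adds at least one new base
            max_k = min(len(read) - 1, len(contig))
            for k in range(max_k, min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    if k > best_len:
                        best_read, best_len = read, k
                    break  # longest overlap for this read found
        if best_read is None:
            return contig  # no read extends the contig any further
        contig += best_read[best_len:]
        pool.remove(best_read)  # each read is used at most once


# Hypothetical 8-base reads; the last one overlaps nothing.
reads = ["ATGGCGTA", "GCGTACCT", "ACCTGAAC", "TTTTTTTT"]
contig = greedy_extend("ATGGCGTA", reads)
print(contig)
```

Because every step consumes one read from the pool, the loop always terminates. A practical assembler would additionally handle sequencing errors, both strands and repeats, which this sketch ignores.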

Publications

Description The genomes of some filamentous plant pathogens are little studied because they can only be cultivated on living hosts, limiting availability of RNA and DNA. New DNA sequencing methods with high data output enable the gene sets of these agronomically important organisms to be defined.

During the funding period we established (i) genomic, (ii) cDNA and (iii) expression profiling techniques using the GA2 sequencer. Pipelines were developed to analyse and combine the data output and to make it available (http://gbrowse2.tsl.ac.uk/cgi-bin/gb2/gbrowse/hpa_emoy2/).

(i) We resequenced 6 races of Hyaloperonospora arabidopsidis (Hpa) to varying depths, including 33x deep sequencing of race Emoy2, which had previously been sequenced to 9.5x using Sanger sequencing. To complement the Hpa genome assembly, we aligned all Hpa Emoy2 Illumina reads to version 7.1 of the genome and used the VELVET software to assemble the reads that did not match. This revealed 2.4 Mbp missing from the Sanger assembly. Combining Sanger and Illumina data greatly improved the Hpa genome assembly. This version (7.4) was used for further analyses and for the Hpa genome browser, and is close to what will appear in the Hpa Emoy2 genome paper (in prep).

To assess diversity among the 6 races sequenced, all reads were aligned to the Hpa Emoy2 genome, and SNPs, indels (heterozygous and homozygous) and large-scale deletions were detected.
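
As an illustration of how heterozygous and homozygous SNPs can be told apart once reads are aligned, the sketch below classifies a single reference position from its per-base read counts. It is not the pipeline used in the project: the function name `call_site` and the thresholds `min_depth`, `het_min` and `hom_min` are assumptions chosen for the example.

```python
def call_site(ref_base, base_counts, min_depth=8, het_min=0.25, hom_min=0.85):
    """Classify one aligned position from per-base read counts (toy sketch).

    `base_counts` maps each observed base to the number of reads showing it.
    All thresholds are illustrative assumptions, not values from the project.
    """
    depth = sum(base_counts.values())
    if depth < min_depth:
        return "no_call"  # too few reads to decide either way

    # Most frequent non-reference base, or (None, 0) if there is none
    alt_base, alt_count = max(
        ((b, c) for b, c in base_counts.items() if b != ref_base),
        key=lambda bc: bc[1],
        default=(None, 0),
    )
    alt_fraction = alt_count / depth
    if alt_fraction >= hom_min:
        return f"hom_snp:{ref_base}>{alt_base}"
    if alt_fraction >= het_min:
        return f"het_snp:{ref_base}>{alt_base}"
    return "ref"


print(call_site("A", {"A": 1, "G": 19}))
print(call_site("A", {"A": 10, "G": 10}))
```

A simple fraction cut-off like this ignores base qualities and mapping ambiguity, so it should be read as a description of the idea rather than a usable variant caller.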

To identify candidate defence-suppressing proteins ("effectors") we used a list of 147 proteins carrying a signal peptide and a conserved RxLR motif (Kamoun 2006 Annu Rev Phytopathol 44: 41-60). Of these 147 RxLR effectors, 99 had orthologues without any indels in at least one other Hpa race. We tested for positive selection using CODEML pairwise comparisons, the CODEML M7, M8 and M8a models, and yn00 gene evolution tests (Yang Z. 2009 Mol Biol Evol. 26(8):1715-21). In the CODEML pairwise comparisons, 38 RxLR proteins had an omega > 1, suggesting positive selection. Of these, 36 had omega > 5.

Testing the effectors for consistency with the M7 (neutral selection), M8 (diversifying selection) and M8a (relaxed selection) models, we found 35 RxLR effectors that were more likely to have codons under positive selection than under neutral selection (at a significance level of 0.05), of which 20 were not likely to be under relaxed selection. Overall, 74 of 99 effector candidates showed positive selection in at least one test.
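
Nested site models such as these are conventionally compared with a likelihood ratio test: twice the difference in log-likelihood is compared against a chi-square distribution, usually with 2 degrees of freedom for M7 versus M8 and 1 for M8a versus M8. The sketch below shows that calculation only. The function name `lrt_pvalue` and the log-likelihood values are hypothetical and are not results from this project.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt, df):
    """Likelihood ratio test for two nested models (sketch).

    Returns (statistic, p-value), where statistic = 2 * (lnL_alt - lnL_null).
    Only df = 1 and df = 2 are supported, because the chi-square survival
    function has a simple closed form in those two cases.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        return stat, math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return stat, math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Hypothetical log-likelihoods for one gene
stat_m7_m8, p_m7_m8 = lrt_pvalue(lnl_null=-1005.0, lnl_alt=-1000.0, df=2)
stat_m8a_m8, p_m8a_m8 = lrt_pvalue(lnl_null=-1003.0, lnl_alt=-1000.0, df=1)
print(stat_m7_m8, p_m7_m8)
print(stat_m8a_m8, p_m8a_m8)
```

Under this scheme, a gene would count as a candidate for diversifying selection when M8 fits significantly better than M7, and as unlikely to be merely under relaxed selection when M8 also fits significantly better than M8a.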

(ii) We developed a method to sequence total cDNA from infected leaves. The cDNA sequences were used to validate Emoy2 gene models and to re-train gene prediction for the Hpa genome project (paper in preparation). Our cDNA data also revealed how much of the coding sequence was captured in the genome assembly. As a control, we aligned our cDNA data to the Arabidopsis genome and gene models: 35.9% of the cDNA aligned to the genome and 34.3% aligned to the gene models, which validated the method. By contrast, 23.4% of the cDNA aligned to the V7 Hpa genome, but only 10.1% aligned to the gene models (31393 genes). This shows that many Hpa genes have yet to be identified. Of the 31393 predicted Hpa genes, 11923 had evidence for expression (>3 hits). Complementing the Sanger assembly with the Velvet assembly of unaligned Illumina reads increased the amount of cDNA aligning to the combined Illumina-Sanger assembly to 26.5%, allowing us to improve the Hpa genome and gene models.

Emoy2 cDNA data reveal which genes are expressed in planta. Of the 74 effector candidates showing evidence for positive selection, 61 are expressed (>3 hits) and are being tested in further pathogenicity tests.
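
The ">3 hits" criterion can be expressed as a simple count of aligned reads per gene. The sketch below assumes that each aligned cDNA read has already been assigned to one gene model. The function name `expressed_genes` and the gene identifiers are hypothetical.

```python
from collections import Counter


def expressed_genes(read_hits, min_hits_exclusive=3):
    """Return the set of genes supported by more than `min_hits_exclusive` reads.

    `read_hits` is an iterable of gene identifiers, one entry per aligned read.
    """
    counts = Counter(read_hits)
    return {gene for gene, n in counts.items() if n > min_hits_exclusive}


# Hypothetical read-to-gene assignments
hits = ["gene_a"] * 4 + ["gene_b"] * 3 + ["gene_c"] * 10
expressed = expressed_genes(hits)

# Hypothetical candidate list; keep only candidates with expression evidence
candidates = {"gene_b", "gene_c"}
expressed_candidates = candidates & expressed
print(sorted(expressed), sorted(expressed_candidates))
```

With exactly 3 hits, `gene_b` falls below the cut-off, since the criterion is strictly more than 3. Applied to the figures above, the same criterion gives 11923 of 31393 predicted genes (about 38%) and 61 of 74 effector candidates (about 82%).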

(iii) We established a 5' tag sequencing method for expression profiling based on the Clontech SMART cDNA kit. Sample barcoding enabled us to run three samples per lane of a flow cell. We took samples at 0, 12, 24 and 72 hpi and at 7 dpi. This provides enough resolution to distinguish early from late effectors and allows us to rule out genes expressed during sporulation.
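
Sample barcoding implies a demultiplexing step before the tags from each sample can be counted. A minimal sketch follows, assuming the barcode sits at the start of each read and must match exactly. The barcode sequences, their assignment to time points and the function name `demultiplex` are hypothetical.

```python
def demultiplex(reads, barcodes):
    """Split reads into per-sample bins by their leading barcode (sketch).

    `barcodes` maps barcode sequence -> sample name. The barcode is trimmed
    from each assigned read; reads with no matching barcode go to "unassigned".
    """
    bins = {name: [] for name in barcodes.values()}
    bins["unassigned"] = []
    for read in reads:
        for barcode, name in barcodes.items():
            if read.startswith(barcode):
                bins[name].append(read[len(barcode):])
                break
        else:
            # no barcode matched this read
            bins["unassigned"].append(read)
    return bins


# Three hypothetical barcodes, one per sample sharing a lane
barcodes = {"ACGT": "0hpi", "TGCA": "12hpi", "GATC": "24hpi"}
reads = ["ACGTGGGTTTAA", "TGCACCCAAATT", "GATCTTTGGGCC", "NNNNAAAACCCC"]
bins = demultiplex(reads, barcodes)
print({name: len(tags) for name, tags in bins.items()})
```

Exact matching discards reads with a sequencing error in the barcode. Tolerating one mismatch is a common refinement, provided the barcodes differ from each other at enough positions.
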
Exploitation Route Illumina methods are now widely adopted for sequencing pathogen genomes and the genes that pathogens express in their hosts.
Sectors Agriculture, Food and Drink

URL http://www.ncbi.nlm.nih.gov/pubmed/21750662