Resequencing Arabidopsis thaliana

Lead Research Organisation: European Bioinformatics Institute
Department Name: Ensembl Group

Abstract

Variation in traits such as flowering time, height and leaf colour between plants of the same species can be explained in part by differences between their DNA sequences, and this fact can be used to identify genes responsible for many phenotypes of importance in agriculture. One way of doing this requires a genetic reference population, which is a set of inbred lines of plants of the same species (i.e. varieties of plants that breed true and contain little or no genetic variation within each variety but which differ between varieties) whose genome sequences are known, at least approximately, and on which the phenotype of interest, such as flowering time, is measured. Then by correlating the differences observed between the phenotypes measured across the lines with differences between their DNA sequences, it is possible to find DNA changes that may be responsible for the phenotypes, and hence identify the responsible genes. Because all flowering plants have a common ancestor and share similar genes, an understanding of gene function in one plant species can often be translated to another. Therefore by working with a simple model plant, the thale cress Arabidopsis thaliana, which is easy to grow and has a short generation time, it is possible to discover gene function and then apply this information to agriculturally important crops, to improve yields to the benefit of mankind. We have developed a reference population of 763 Arabidopsis inbred lines, shortly to be expanded to over 1000 lines. They have been bred by repeatedly crossing 19 existing varieties of this plant that were collected from the wild and from across the world. The lines have then been inbred (a process called 'selfing') for several generations until each line has a fixed DNA sequence which is a random mosaic of the 19 founders. Each line is a different mosaic.
We have begun to use these lines to find the genes responsible for traits such as flowering time, but in order to make the best use of them we need to know the genome sequence of each line. Fortunately we don't need to sequence each of the 763 lines, which would be costly. Instead we can infer their sequences from the 19 founder genomes because we know the mosaic structure of the 763. Recent technological improvements make it possible to sequence genomes much more cheaply and quickly. The genome of Arabidopsis thaliana is about 120 million bases long and can now be sequenced in about a day. We propose to sequence the genomes of 17 founders (the other two genomes are already sequenced) and make this data publicly available. We will develop software and statistical methods so that DNA variation between the 19 genomes can be used to help identify functionally important variations in the 763 lines. We have already distributed the lines by depositing their seeds in the A. thaliana stock centre so that others can use this resource. The genomes of each of the founders will also be of interest for studies of evolution and population genetics. They will be annotated by the Ensembl Plants team at EBI and the annotations displayed on the Ensembl genome browser.
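
As an illustration of how a line's sequence could be inferred from its mosaic structure, the following is a minimal sketch rather than the project's software. The founder names (F1 to F3), SNP positions, alleles and segment boundaries are all hypothetical, and the sketch assumes each line is fully inbred so that one founder contributes each segment.

```python
from bisect import bisect_right

# Hypothetical founder alleles at four SNP positions (illustrative data only).
founder_alleles = {
    "F1": {100: "A", 250: "C", 400: "G", 900: "T"},
    "F2": {100: "G", 250: "C", 400: "A", 900: "T"},
    "F3": {100: "A", 250: "T", 400: "G", 900: "C"},
}

# Hypothetical mosaic of one inbred line: (segment start, founder), sorted by
# start. Each segment runs until the start of the next one.
mosaic = [(0, "F1"), (300, "F2"), (800, "F3")]


def founder_at(mosaic, position):
    """Return the founder whose segment covers the given position."""
    starts = [start for start, _ in mosaic]
    index = bisect_right(starts, position) - 1
    if index < 0:
        raise ValueError(f"position {position} lies before the first segment")
    return mosaic[index][1]


def impute_line(mosaic, founder_alleles, positions):
    """Copy, at each position, the allele of the founder covering it."""
    return {
        pos: founder_alleles[founder_at(mosaic, pos)][pos] for pos in positions
    }


imputed = impute_line(mosaic, founder_alleles, [100, 250, 400, 900])
print(imputed)
```

In this toy example positions 100 and 250 fall in the F1 segment, 400 in the F2 segment and 900 in the F3 segment, so the line inherits the corresponding founder allele at each site.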

Technical Summary

This project will resequence the genomes of 17 accessions of Arabidopsis thaliana using the Solexa platform. The accessions are the founders of a panel of ~1000 recombinant inbred lines (HSRILs), of which 763 have been bred and which we will deposit in the A. thaliana seed stock centre as a public resource (BBSRC grant BB/D016029/1). We are genotyping the lines and founders in Autumn 2007 using 1536 Illumina SNPs. As the genome of each HSRIL is a mosaic of the 19 founders, we can impute the mosaic from the SNP genotypes using a Hidden Markov Model (1). Knowledge of the mosaic structures combined with the sequences of the 19 founders will enable us to impute the sequence of each of the 763 lines; hence we can perform whole genome association and test every imputed SNP for association with any phenotype measured across the lines. Using the Solexa sequencing platform, millions of 35 bp reads can be generated in one run at low cost. Results of resequencing Bur-0 by Detlef Weigel's lab suggest that, with 3 runs of a Solexa instrument (or 1.5 paired-read runs per genome), over 80% of the genome and of the SNPs/short indels will be recovered accurately. All the data will be made publicly available, through Ensembl and by adapting the GSCANDB database (2). This was developed for human and mouse whole genome association mapping, and will be modified to display the genome scans of the A. thaliana mapping panel, incorporating gene annotations to identify candidate genes and link into external Arabidopsis genome resources. It will also provide a mechanism for collaborators to publish their genome scans.
(1) Mott R, Talbot CJ, Turri MG, Collins AC, & Flint J (2000) Proc Natl Acad Sci U S A 97, 12649-12654
(2) Taylor M, Valdar W, Kumar A, Flint J, & Mott R (2007) Bioinformatics 23, 1545-1549
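
To make the mosaic imputation step more concrete, here is a simplified sketch of a hidden Markov model whose hidden states are founders and whose observations are a line's SNP genotypes. It is not the method of reference (1): it uses Viterbi decoding to return a single most likely founder path, treats the line as fully inbred (one allele per SNP), and uses a constant switch probability between adjacent markers. The founder names, genotypes, `switch_prob` and `error_prob` values are all assumptions chosen for illustration.

```python
import math


def viterbi_mosaic(observed, founders, switch_prob=0.05, error_prob=0.01):
    """Most likely founder at each marker, given one inbred line's alleles.

    observed: list of alleles (0/1) for the line, one per marker.
    founders: dict mapping founder name to its list of alleles (needs >= 2).
    switch_prob and error_prob are illustrative assumptions, not fitted values.
    """
    names = list(founders)
    k = len(names)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (k - 1))

    def log_emit(name, i):
        # An allele matching the founder is likely; a mismatch is a rare error.
        match = founders[name][i] == observed[i]
        return math.log(1.0 - error_prob if match else error_prob)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # Initialise with a uniform prior over founders at the first marker.
    score = {n: math.log(1.0 / k) + log_emit(n, 0) for n in names}
    back = []
    for i in range(1, len(observed)):
        new_score, pointers = {}, {}
        for n in names:
            best_prev = max(names, key=lambda p: score[p] + log_trans(p, n))
            pointers[n] = best_prev
            new_score[n] = (
                score[best_prev] + log_trans(best_prev, n) + log_emit(n, i)
            )
        score = new_score
        back.append(pointers)

    # Trace back from the best final state to recover the founder path.
    state = max(names, key=lambda n: score[n])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical genotypes at eight markers for three founders and one line.
founders = {
    "F1": [0, 0, 0, 0, 0, 0, 0, 0],
    "F2": [1, 1, 1, 1, 1, 1, 1, 1],
    "F3": [0, 1, 0, 1, 0, 1, 0, 1],
}
observed = [0, 0, 0, 0, 1, 1, 1, 1]
path = viterbi_mosaic(observed, founders)
print(path)
```

With these toy inputs the decoded path follows F1 for the first four markers and F2 for the last four, since a single founder switch explains the data better than several genotyping errors. A probabilistic treatment that reports the posterior probability of each founder at each locus would use the forward-backward algorithm instead of Viterbi decoding.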

Publications

International Arabidopsis Informatics Consortium (2010) An international bioinformatics infrastructure to underpin the Arabidopsis community. in The Plant Cell

Kersey PJ (2010) Ensembl Genomes: extending Ensembl across the taxonomic space. in Nucleic Acids Research

Kinsella RJ (2011) Ensembl BioMarts: a hub for data retrieval across taxonomic space. in Database: The Journal of Biological Databases and Curation

 
Description Our partners on the project sequenced 19 strains of Arabidopsis thaliana, and used the data to gain insights into Arabidopsis biology and population genomics. The work also served as a contribution to the wider 1001 genomes project to develop an extensive catalogue of natural variation in this species. EMBL-EBI developed a database to hold the results of this work (and some of the other emerging Arabidopsis resequencing and phenotyping data), and has made this information available through web and programmatic interfaces as part of the Ensembl Plants project.
Exploitation Route The data itself is not directly usable in non-academic contexts, but the project has enabled the deployment of similar software for variation data sets from crop plants, including rice, barley, maize and grape. This data, especially if correlated with phenotypic data, can be directly exploited when breeding for crop improvement. The research has generated information about sequence differences between Arabidopsis strains, which is useful for the study of genomic evolution, and for linking genotype to phenotype and (by extension) to an understanding of molecular function and its implications. Ensembl Plants provides easy access to this data.
Sectors Agriculture, Food and Drink, Environment

URL http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index
 
Description Our role in this grant was to provide data access for the Arabidopsis resequencing and variation data.
First Year Of Impact 2010
Sector Agriculture, Food and Drink, Environment
Impact Types Economic

 
Title Arabidopsis databases and releases 
Description Variation data from the Arabidopsis 1001 genomes project, including the data generated directly by this project, have been made available through Ensembl Plants, an integrative database providing access to genome-scale data from a wide variety of plant species.
Type Of Material Database/Collection of data 
Year Produced 2009 
Provided To Others? Yes  
Impact The development of new infrastructure at EBI (funded through other sources) to accommodate plant variation data. 
URL http://plants.ensembl.org/Arabidopsis_thaliana/Info/Annotation#variation