Resequencing Arabidopsis thaliana

Lead Research Organisation: University of Oxford
Department Name: Wellcome Trust Centre for Human Genetics

Abstract

Variation in traits such as flowering time, height and leaf colour between plants of the same species can be explained in part by differences between their DNA sequences, and this fact can be used to identify genes responsible for many phenotypes of importance in agriculture. One way of doing this requires a genetic reference population, which is a set of inbred lines of plants of the same species (i.e. varieties of plants that breed true and contain little or no genetic variation within each variety but which differ between varieties) whose genome sequences are known, at least approximately, and on which the phenotype of interest, such as flowering time, is measured. Then, by correlating the differences observed between the phenotypes measured across the lines with differences between their DNA sequences, it is possible to find DNA changes that may be responsible for the phenotypes, and hence identify the responsible genes. Because all flowering plants have a common ancestor and share similar genes, an understanding of gene function in one plant species can often be translated to another. Therefore, by working with a simple model plant, the thale cress Arabidopsis thaliana, which is easy to grow and has a short generation time, it is possible to discover gene function and then apply this information to agriculturally important crops, to improve yields to the benefit of mankind. We have developed a reference population of 763 Arabidopsis inbred lines, shortly to be expanded to over 1000 lines. They have been bred by repeatedly crossing 19 existing varieties of this plant that were collected from the wild and from across the world. The lines have been inbred (a process called 'selfing') for several generations until each line has a fixed DNA sequence which is a random mosaic of the 19 founders. Each line is a different mosaic.
We have begun to use these lines to find the genes responsible for traits such as flowering time, but in order to make the best use of the data we need to know the genome sequence of each line. Fortunately we don't need to sequence each of the 1000 lines, which would be too costly. Instead we can infer their sequences from the 19 founder genomes because we know the mosaic structure of each line. Recent technological improvements make it possible to sequence genomes much more cheaply and quickly. The genome of Arabidopsis thaliana is about 120 million bases long and can now be sequenced in about a day. We propose to sequence the genomes of 17 founders (the other two genomes are already sequenced) and make these data publicly available. We will develop software and statistical methods so that DNA variation between the 19 genomes can be used to help identify functionally important variations in the lines. We have already distributed the lines by depositing their seeds in the A. thaliana stock centre so that others can use this resource. The genomes of each of the founders will also be of interest for studies of evolution and population genetics. They will be annotated by the Ensembl Plants team at EBI and the annotations displayed on the Ensembl genome browser.
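
The idea of inferring a line's sequence from the founder genomes and its mosaic structure can be illustrated with a minimal sketch. It is not the project's software: the function name, the data layout and the toy alleles are assumptions made purely for illustration. A mosaic is represented as a sorted list of (start, end, founder) segments, and each variant position in the line simply takes the allele of the founder that covers it.

```python
from bisect import bisect_right

def impute_alleles(mosaic, founder_alleles, positions):
    """Impute a line's alleles from its founder mosaic (illustrative sketch).

    mosaic: list of (start, end, founder) segments, sorted by start,
            non-overlapping, with inclusive coordinates.
    founder_alleles: dict mapping founder name -> {position: allele}.
    positions: variant positions to impute.
    Returns {position: allele or None}; None means the position is not
    covered by any segment or the founder has no call there.
    """
    starts = [segment[0] for segment in mosaic]
    imputed = {}
    for pos in positions:
        # Find the last segment whose start is <= pos.
        i = bisect_right(starts, pos) - 1
        if i < 0 or pos > mosaic[i][1]:
            imputed[pos] = None
            continue
        founder = mosaic[i][2]
        imputed[pos] = founder_alleles[founder].get(pos)
    return imputed

# Toy data (invented): one line whose chromosome derives from two founders.
mosaic = [(1, 1000, "Col-0"), (1001, 5000, "Bur-0")]
founder_alleles = {
    "Col-0": {500: "A", 2000: "G"},
    "Bur-0": {500: "T", 2000: "C"},
}
result = impute_alleles(mosaic, founder_alleles, [500, 2000, 6000])
print(result)
```

In this toy case position 500 falls in the Col-0 segment and position 2000 in the Bur-0 segment, while position 6000 lies outside the mosaic and is left uncalled.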

Technical Summary

This project will resequence the genomes of 17 accessions of Arabidopsis thaliana using the Solexa platform. The accessions are the founders of a panel of ~1000 recombinant inbred lines (HSRILs), of which 763 have been bred and which we will deposit in the A. thaliana seed stock centre as a public resource (BBSRC grant BB/D016029/1). We are genotyping the lines and founders in Autumn 2007 using 1536 Illumina SNPs. As the genome of each HSRIL is a mosaic of the 19 founders, we can impute the mosaic from the SNP genotypes using a Hidden Markov Model (1). Knowledge of the mosaic structures combined with the sequences of the 19 founders will enable us to impute the sequence of each HSRIL; hence we can perform whole genome association and test every imputed SNP for association with any phenotype measured across the lines. Using the Solexa sequencing platform, millions of 35bp reads can be generated in one run at low cost. Results of resequencing Bur-0 by Detlef Weigel's lab suggest that, with 3 runs of a Solexa instrument (or 1.5 paired-read runs per genome), over 80% of the genome and of the SNPs/short indels will be recovered accurately. All the data will be made publicly available, through Ensembl and by adapting the GSCANDB database (2). This was developed for human and mouse whole genome association mapping, and will be modified to display the genome scans of the A. thaliana mapping panel, incorporating gene annotations to identify candidate genes and link into external Arabidopsis genome resources. It will also provide a mechanism for collaborators to publish their genome scans.
(1) Mott R, Talbot CJ, Turri MG, Collins AC, & Flint J (2000) Proc Natl Acad Sci U S A 97, 12649-12654
(2) Taylor M, Valdar W, Kumar A, Flint J, & Mott R (2007) Bioinformatics 23, 1545-1549
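
To make the mosaic imputation step more concrete, the following is a simplified sketch of a Hidden Markov Model in which the hidden state at each SNP is the founder of origin. It is not the method of reference (1): it assumes haploid 0/1 genotypes, a single switch probability between adjacent SNPs, a single genotyping error rate, and it returns only the most likely path (Viterbi decoding). All names and parameter values are hypothetical.

```python
import math

def viterbi_mosaic(line_genotypes, founder_genotypes,
                   switch_prob=0.01, error_rate=0.01):
    """Most likely founder at each SNP for one inbred line (toy model).

    line_genotypes: list of 0/1 alleles observed in the line.
    founder_genotypes: dict founder -> list of 0/1 alleles (same SNP order).
    Requires at least two founders. switch_prob and error_rate are
    assumed values, not estimates from real data.
    """
    founders = list(founder_genotypes)
    k = len(founders)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (k - 1))

    def log_emit(founder, i):
        # The observed allele matches the founder allele unless there is an error.
        match = founder_genotypes[founder][i] == line_genotypes[i]
        return math.log(1.0 - error_rate if match else error_rate)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # Initialise with a uniform prior over founders.
    score = {f: math.log(1.0 / k) + log_emit(f, 0) for f in founders}
    back = []
    for i in range(1, len(line_genotypes)):
        new_score, pointers = {}, {}
        for f in founders:
            best_prev = max(founders, key=lambda p: score[p] + log_trans(p, f))
            new_score[f] = score[best_prev] + log_trans(best_prev, f) + log_emit(f, i)
            pointers[f] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(founders, key=lambda f: score[f])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy example with three hypothetical founders and eight SNPs.
founder_genotypes = {
    "F1": [0, 0, 0, 0, 1, 1, 1, 1],
    "F2": [1, 1, 1, 1, 0, 0, 0, 0],
    "F3": [1, 0, 1, 0, 1, 0, 1, 0],
}
line_genotypes = [0, 0, 0, 0, 0, 0, 0, 0]
path = viterbi_mosaic(line_genotypes, founder_genotypes)
print(path)
```

In the toy example the line matches F1 over the first four SNPs and F2 over the last four, so the decoded mosaic has a single breakpoint. A probabilistic reconstruction (for example forward-backward posterior probabilities for each founder) would be a natural alternative to a single best path when marker density is low relative to the number of founders.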

Publications

Description Individuals within a species usually have different genomes, and this affects their physical appearance, behaviour and survival. However, it is usually assumed that these individual genetic differences are small, so that we can transfer detailed information about a single typical genome, known as the reference, across to other individuals in order to predict the changes we are likely to see due to the presence of DNA variation.

In this study we use the plant Arabidopsis thaliana to show that this assumption is false. We sequenced the genomes of 18 individuals (known as accessions) using next-generation sequencing. We also measured the expression of genes in each genome, and carefully mapped the encoding of the genes in the DNA from each accession back to the reference. We found that the reference encoding of the genes did not translate well to the other accessions, predicting that the majority of genes would be non-functional in at least one accession. In fact, because of changes to the way gene sequences are spliced before they are translated into protein, most of these apparently important alterations were ameliorated.

Our study shows how important it is to annotate the genomes of individuals in order to predict the genes accurately.

The results of this study, along with other work, underpinned a successful application to BBSRC for a £3.2M LoLa grant, BB/T002182/1 "What determines protein abundance in plants?", awarded to Rothamsted Research, UCL and Cambridge. This project will start in 2020 and uses the genome data collected in the current award.
Exploitation Route Other groups are using the software and algorithms we developed to perform similar studies in other species, both plant and animal.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Environment, Other

URL http://mtweb.cs.ucl.ac.uk/mus/www/19genomes/
 
Description Our findings were published in "Multiple reference genomes and transcriptomes for Arabidopsis thaliana" (2011) Gan X, Mott R. Nature. 477:419-423; PMC4856438. They attracted a News and Views article. Our later paper "Genomic Rearrangements in Arabidopsis Considered as Quantitative Traits" (2017) Imprialou M, Mott R. Genetics. 205(4):1425-1441; PMC5378104 uses the data from this study. This paper was one of the Genetics journal's highlights for 2017.
First Year Of Impact 2011
Sector Digital/Communication/Information Technologies (including Software), Education
 
Description What determines protein abundance in plants?
Amount £3,354,456 (GBP)
Funding ID BB/T002182/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 10/2020 
End 10/2025
 
Title Multiple reference genomes and transcriptomes for Arabidopsis thaliana 
Description DNA sequencing data are deposited in the European Nucleotide Archive (www.ebi.ac.uk/ena/) under accession number ERP000565.
Type Of Material Database/Collection of data 
Year Produced 2011 
Provided To Others? No  
Impact No actual impacts realised to date 
URL http://www.ebi.ac.uk/ena/
 
Description 19 genomes of Arabidopsis thaliana 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This webpage contains resources relating to the 19 genomes project, in which the genome sequences, transcriptomes and protein annotations of 19 accessions of the plant Arabidopsis thaliana are described. These genomes are the founders of the MAGIC genetic reference population of recombinant inbred lines, and contribute to the 1001 Arabidopsis genomes project.

No actual impacts realised to date
Year(s) Of Engagement Activity 2011, 2012, 2013, 2014
URL http://mus.well.ox.ac.uk/19genomes/