16 ERA-CAPS: 1001 Genomes Plus

Lead Research Organisation: Royal Botanic Gardens Kew
Department Name: Science Directorate Office

Abstract

Understanding how genetic variation translates into phenotypic variation, and how this translation depends on the environment, is a major challenge for modern biology. It is fundamental to human genetics and agriculture, as well as evolutionary biology. Thanks to advances in technology, it is now possible to start answering this question by sequencing entire populations and connecting this information to phenotypic data, whether this be public health records, crop yield data, or the ability to withstand stress in a controlled experiment or in nature.
There is, however, an important aspect that is often glossed over in all these (often highly publicized) efforts: we are still far from fully describing genetic variation on a population scale. The "next-generation" sequencing methods that have made it economically feasible to screen large numbers of individuals (the almost mythical "$1000 Human Genome") do not actually produce complete genome sequences - they produce massive numbers of very short sequence fragments that must be aligned to a reference genome in order to identify variants. Because of this, only simple variants (single nucleotide and very short insertion/deletion polymorphisms) are reported, and the results are invariably biased with respect to what is present or missing in the reference genome. Large or complex structural variants, as well as simple variants inside complex variants, are generally missed completely.
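The reference bias described above can be illustrated with a toy sketch. It is not a real aligner: it assumes that a read "maps" only if it matches the reference exactly, whereas real aligners tolerate mismatches and partial matches. All sequences, names and parameters here are hypothetical.

```python
def shred(genome, read_len, step):
    """Cut a genome string into overlapping fixed-length 'short reads'."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]


def map_reads(reads, reference):
    """Toy mapping rule (an assumption): a read maps only if it occurs
    verbatim in the reference."""
    mapped, unmapped = [], []
    for read in reads:
        (mapped if read in reference else unmapped).append(read)
    return mapped, unmapped


# Hypothetical sequences: the sampled genome carries a 20 bp insertion
# that is absent from the reference.
left = "ACGTTGCAAGCTTAGGCTAA"
right = "CCGGTTAGCATCGATTACGG"
insertion = "T" * 20

reference = left + right
sample = left + insertion + right

reads = shred(sample, read_len=10, step=5)
mapped, unmapped = map_reads(reads, reference)

print(f"{len(mapped)} reads map to the reference, {len(unmapped)} do not")
```

In this toy case every read that overlaps the insertion is left unmapped, so a reference-based analysis of the mapped reads alone would contain no trace of the inserted sequence.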
It is currently not known how serious this problem is, for the simple reason that finding out requires completely assembling a large number of genomes, and comparing the results to data generated using standard methods. This is the objective of the 1001G+ proposal. Long-read sequencing has now advanced to a stage where generating nearly complete genomes for large samples is feasible - at least for organisms with relatively small genomes. Building on our success with the "1001 Genomes Project", we will assemble at least 50 genomes from a diverse collection of Arabidopsis thaliana strains, annotate them with transcriptome and epigenome information, and develop tools to make the results available to the community. This will go a long way toward answering the question of what is hidden in the part of the genome we currently cannot see - certainly in A. thaliana, but our results (and the tools and concepts we develop to find, interpret, and share complete information on sequence variants) will pave the way for similar studies in organisms with larger genomes, where the hidden part is likely to be relatively larger, and perhaps even more important.
The project brings together a team of researchers with complementary skills, considerable management expertise and a strong track record of collaborating to deliver results for the community. In addition, regular meetings with leaders of complementary efforts in other organisms will ensure the broader relevance of the project.

Technical Summary

Based on long-read sequencing technology, we will assemble genomes from at least 50 geographically diverse accessions of A. thaliana to a standard similar to or exceeding that of the original Col-0 reference genome. Our initial plan is to use PacBio sequencing and the Canu assembler, but we will adapt to emerging technologies. We will identify all types of variants distinguishing these genomes, including SNPs, small indels, TEs, and large structural variants (SVs) such as inversions, transpositions and duplications. We will then use the aligned genomes to build a pan-genome graph, which describes the relationship between genomes at multiple scales, including the frequency of different sequence variants. With the genome graph in hand, we will confidently type all common SVs in the global A. thaliana population, using the short-read data already available from the 1001 Genomes Project.
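As a minimal sketch of what a pan-genome graph records, the example below represents each assembled genome as a path through shared sequence segments and counts how many accessions use each segment and each adjacency. The accession names, segment labels and the function `build_segment_graph` are hypothetical and chosen only for illustration; this is not the data model of the vg tools or of any graph built in the project.

```python
from collections import defaultdict


def build_segment_graph(paths):
    """Build a toy pan-genome graph from per-accession segment paths.

    Returns (node_freq, edge_freq): the fraction of accessions whose path
    visits each segment, and the fraction that traverse each adjacency.
    """
    node_counts = defaultdict(int)
    edge_counts = defaultdict(int)
    for segments in paths.values():
        # Count each segment and adjacency at most once per accession.
        for seg in set(segments):
            node_counts[seg] += 1
        for edge in set(zip(segments, segments[1:])):
            edge_counts[edge] += 1
    n = len(paths)
    node_freq = {seg: count / n for seg, count in node_counts.items()}
    edge_freq = {edge: count / n for edge, count in edge_counts.items()}
    return node_freq, edge_freq


# Hypothetical accessions; letters stand for aligned sequence segments.
paths = {
    "acc1": ["A", "B", "C", "D"],
    "acc2": ["A", "C", "D"],            # segment B deleted
    "acc3": ["A", "B", "X", "C", "D"],  # segment X inserted
    "acc4": ["A", "B", "C", "D"],
}

node_freq, edge_freq = build_segment_graph(paths)
for seg in sorted(node_freq):
    print(seg, node_freq[seg])
```

In this toy graph, an adjacency such as ("A", "C") that skips segment B corresponds to a deletion allele, and its frequency across paths is the kind of population-level information that could then be typed in further accessions from short reads.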
We will annotate the fully assembled genomes and use the new SV information to describe much more accurately the causes and consequences of variation in genome-wide gene expression and DNA methylation. We will first do this in the accessions with platinum standard genome assemblies, but will subsequently use the genome graph from these assemblies to describe and interpret transcriptome and methylome variation in the entire set of 1001 Genomes accessions, for a true species-wide description of major drivers of functional diversification.
We will develop effective computational tools for storage and on-the-fly analysis of SVs across a large number of individual genomes. An important component will be new approaches to the dynamic, interactive display of structural variants and their functional impact on gene structure and function. We will explore the use of the vg tools package to provide an interface to the graph, and integration of such tools within the Ensembl Plants interface maintained at EMBL-EBI.

Planned Impact

Genomic variation lies at the heart of biology: it explains differences between individuals and species and is the cause of disease and neo-functionalisation; its emergence and spread is the evolutionary process, while its distribution provides powerful insight into current function and historical change. As genome sequencing costs fall, it is increasingly likely that we will obtain high-quality sequences for many individuals in a species and be able to piece together a comprehensive understanding of population-wide variation for the first time, including larger structural variants which have not been accessible to earlier techniques using short-read sequencing. It is increasingly anticipated that we will move to a model of reference graphs to represent the sequences present in a population, and the order in which those sequences are found. But most work on genome graphs to date has been proof-of-concept, and has not yet been applied comprehensively.
Arabidopsis is a widely used model species with a small and well-annotated reference genome. Moreover, an extensive catalogue of short variants found in this species already exists. In this project, we will generate at least 50 high-quality reference genomes from different cultivars of Arabidopsis, and use these as the basis for exploring models for annotation, storage, and display of pan-genomic data including large-scale structural variants.
Beneficiaries of this project will include academics working on Arabidopsis, who will be able to understand the basis of phenotypes in cultivars of interest in terms of mutations in their genomes. The project will also benefit anyone who wants to carry out a genome-wide screen for genes associated with a phenotype, providing a high-quality marker panel for use in GWAS experiments, even for loci not present in the current Columbia-0 reference. The new genomes are also likely to form the basis of a new reference graph for Arabidopsis, which will underpin future reference annotation maintained by databases such as TAIR and Araport.
Secondly, the resource will more generally provide insight into the mechanisms of evolution and adaptation that have allowed Arabidopsis to grow in many, diverse environments, and inform our understanding of these processes in other species.
Thirdly, the generation of genuinely high-quality sequences for a reasonably small genome will provide the perfect model for testing and developing pipelines for the annotation, modelling and visualisation of pan-genomes. As sequencing costs fall, it will become increasingly normal to have this information available for many species, yet at present scientists mostly have only prototypes, without information about overall genomic structure (the "bag of genes" model) or gene-level graphs. The new Arabidopsis genomes are perfect for testing graph-based models and visualisations on true genome-scale data, and will inform future attempts to do something similar for more difficult (larger, more diverse) genomes.
