16 ERA-CAPS: 1001 Genomes Plus

Lead Research Organisation: Royal Botanic Gardens
Department Name: Science Directorate Office

Abstract

Understanding how genetic variation translates into phenotypic variation, and how this translation depends on the environment, is a major challenge for modern biology. It is fundamental to human genetics and agriculture, as well as evolutionary biology. Thanks to advances in technology, it is now possible to start answering this question by sequencing entire populations and connecting this information to phenotypic data, whether this be public health records, crop yield data, or the ability to withstand stress in a controlled experiment or in nature.
There is, however, an important aspect that is often glossed over in all these (often highly publicized) efforts: we are still far from fully describing genetic variation on a population scale. The "next-generation" sequencing methods that have made it economically feasible to screen large numbers of individuals (the almost mythical "$1000 Human Genome") do not actually produce complete genome sequences - they produce massive numbers of very short sequence fragments that must be aligned to a reference genome in order to identify variants. Because of this, only simple variants (single nucleotide and very short insertion/deletion polymorphisms) are reported, and the results are invariably biased with respect to what is present or missing in the reference genome. Large or complex structural variants, as well as simple variants inside complex variants are generally missed completely.
It is currently not known how serious this problem is, for the simple reason that finding out requires completely assembling large numbers of genomes, and comparing the result to data generated using standard methods. This is the objective of the 1001G+ proposal. Long-read sequencing has now advanced to a stage where generating nearly complete genomes for large samples is feasible - at least for organisms with relatively small genomes. Building on our success with the "1001 Genomes Project", we will assemble at least 50 genomes from a diverse collection of Arabidopsis thaliana strains, annotate them with transcriptome and epigenome information, and develop tools to make the results available to the community. This will go a long way toward answering the question of what is hidden in the part of the genome we currently cannot see - certainly in A. thaliana, but our results (and the tools and concepts we develop to find, interpret, and share complete information on sequence variants) will pave the way for similar studies in organisms with larger genomes, where the hidden part is likely to be relatively larger, and perhaps even more important.
The project brings together a team of researchers with complementary skills, considerable management expertise and a strong track record of collaborating to deliver results for the community. In addition, regular meetings with leaders of complementary efforts in other organisms will ensure the broader relevance of the project.

Technical Summary

Based on long-read sequencing technology, we will assemble genomes from at least 50 geographically diverse accessions of A. thaliana to a standard similar to or exceeding that of the original Col-0 reference genome. Our initial plan is to use PacBio sequencing and the Canu assembler, but we will adapt to emerging technologies. We will identify all types of variants distinguishing these genomes, including SNPs, small indels, TEs, and large structural variants (SVs) such as inversions, transpositions and duplications. We will then use the aligned genomes to build a pan-genome graph, which describes the relationship between genomes at multiple scales, including the frequency of different sequence variants. With the genome graph in hand, we will confidently type all common SVs in the global A. thaliana population, using the short-read data already available from the 1001 Genomes Project.
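As a rough illustration of what a pan-genome graph records, the following minimal Python sketch models each accession as a path through shared sequence segments and derives segment frequencies from those paths. It is not the project's pipeline and does not use vg; the segment names, sequences and accession labels other than Col-0 are invented for illustration.

```python
from collections import Counter

# Hypothetical toy data: segments are graph nodes, and each accession is a
# path (an ordered walk) through them. All names and sequences are invented.
segments = {
    "s1": "ACGT",
    "s2": "G",         # one SNP allele
    "s3": "T",         # the alternative SNP allele
    "s4": "CCGA",
    "s5": "TTTTGGGG",  # an insertion carried by only some accessions
    "s6": "AGC",
}

paths = {
    "Col-0": ["s1", "s2", "s4", "s6"],
    "acc_A": ["s1", "s3", "s4", "s5", "s6"],
    "acc_B": ["s1", "s2", "s4", "s5", "s6"],
    "acc_C": ["s1", "s2", "s4", "s6"],
}


def segment_frequencies(paths):
    """Fraction of accessions whose path visits each segment at least once."""
    counts = Counter()
    for walk in paths.values():
        counts.update(set(walk))
    n = len(paths)
    return {seg: counts[seg] / n for seg in counts}


def edge_counts(paths):
    """Number of times each directed edge (consecutive segment pair) is used."""
    counts = Counter()
    for walk in paths.values():
        counts.update(zip(walk, walk[1:]))
    return counts


def spell(name):
    """Reconstruct the linear sequence of one accession from its path."""
    return "".join(segments[s] for s in paths[name])


freqs = segment_frequencies(paths)
edges = edge_counts(paths)
```

In this toy graph, segments shared by every path form the common backbone, while low-frequency segments such as `s3` and `s5` correspond to variants carried by a subset of accessions. An edge such as `("s4", "s6")` that skips `s5` represents the allele without the insertion.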
We will annotate the fully assembled genomes and use the new SV information to describe much more accurately the causes and consequences of variation in genome-wide gene expression and DNA methylation. We will first do this in the accessions with platinum standard genome assemblies, but will subsequently use the genome graph from these assemblies to describe and interpret transcriptome and methylome variation in the entire set of 1001 Genomes accessions, for a true species-wide description of major drivers of functional diversification.
We will develop effective computational tools for storage and on-the-fly analysis of SVs across a large number of individual genomes. An important component will be new approaches to the dynamic, interactive display of structural variants and their functional impact on gene structure and function. We will explore the use of the vg tools package to provide an interface to the graph, and integration of such tools within the Ensembl Plants interface maintained at EMBL-EBI.
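To make the idea of on-the-fly SV analysis concrete, here is a small Python sketch of one possible query: finding the structural variants that overlap a region of interest (for example a gene) and reporting how many accessions carry each. This is an assumption-laden illustration, not the tooling developed in the project; the `StructuralVariant` record, the coordinates and the accession names are all hypothetical, and a linear scan is used where a real tool would likely use an interval index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    """Hypothetical SV record; coordinates are 0-based, half-open [start, end)."""
    chrom: str
    start: int
    end: int
    kind: str            # e.g. "DEL", "INV", "DUP"
    carriers: frozenset  # names of accessions carrying the variant


def overlaps(sv, chrom, start, end):
    """True if the SV shares at least one base with the query interval."""
    return sv.chrom == chrom and sv.start < end and start < sv.end


def svs_in_region(svs, chrom, start, end):
    """Linear scan for SVs overlapping a region (fine for a small example)."""
    return [sv for sv in svs if overlaps(sv, chrom, start, end)]


def carrier_frequency(sv, n_accessions):
    """Fraction of the sampled accessions that carry the SV."""
    return len(sv.carriers) / n_accessions


# Invented example data for illustration only.
example_svs = [
    StructuralVariant("Chr1", 1200, 1800, "DEL", frozenset({"acc_A", "acc_B"})),
    StructuralVariant("Chr1", 5000, 9000, "INV", frozenset({"acc_C"})),
    StructuralVariant("Chr2", 1500, 1600, "DUP", frozenset({"acc_A"})),
]

hits = svs_in_region(example_svs, "Chr1", 1000, 2000)
```

With four sampled accessions, `carrier_frequency(hits[0], 4)` gives the frequency of the one deletion overlapping the queried region. Half-open coordinates are used so that adjacent intervals do not count as overlapping.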

Planned Impact

Genomic variation lies at the heart of biology: it explains differences between individuals and species and is the cause of disease and neo-functionalisation; its emergence and spread is the evolutionary process, while its distribution provides powerful insight into current function and historical change. As genome sequencing costs fall, it is increasingly likely that we will obtain high-quality sequences for many individuals in a species, and be able to piece together a comprehensive understanding of population-wide variation for the first time, including larger, structural variants which have not been accessible to earlier techniques utilising short-read sequencing. It is increasingly anticipated that we will move to a model of reference graphs to represent the sequences present in a population, and the order in which those sequences are found. But most work on genome graphs to date has been proof-of-concept, and has not yet been applied comprehensively.
Arabidopsis is a widely used model species with a small and well-annotated reference genome. Moreover, an extensive catalogue of short variants found in this species already exists. In this project, we will generate at least 50 high-quality reference genomes from different cultivars of Arabidopsis, and use these as the basis for exploring models for annotation, storage, and display of pan-genomic data including large-scale structural variants.
Beneficiaries of this project will include academics working on Arabidopsis, who will be able to understand the basis of phenotypes in cultivars of interest in terms of mutations in their genomes. The project will also benefit anyone who wants to carry out a genome-wide screen for genes associated with a phenotype, by providing a high-quality marker panel of use in GWAS experiments, even for loci not present in the current reference Columbia-0. The new genomes are also likely to form the basis of a new reference graph for Arabidopsis, which will underpin future reference annotation maintained by databases such as TAIR and Araport.
Secondly, the resource will more generally provide insight into the mechanisms of evolution and adaptation that have allowed Arabidopsis to grow in many, diverse environments, and inform our understanding of these processes in other species.
Thirdly, the generation of genuinely high-quality sequences for a reasonably small genome will provide the perfect model for testing and developing pipelines for the annotation, modelling and visualisation of pan-genomes. As sequencing costs fall, it will become increasingly normal to have this information available for many species, yet at present, scientists have mostly only prototypes, without information about overall genomic structure ("the bag of genes" model) or gene-level graphs. The new Arabidopsis genomes are perfect for testing graph-based models and visualisations on true genome-scale data, and will inform future attempts to do something similar for more difficult (larger, more diverse) genomes.

Publications

Eizenga JM (2020) Pangenome Graphs. In: Annual Review of Genomics and Human Genetics

 
Description The project, an ERA-CAPS award partnered with awards to collaborators in Germany and Austria, has developed and analysed a set of genome sequences from multiple strains of Arabidopsis thaliana, resulting in an improved understanding of the structural variability of the genome in this species. Findings are currently being prepared for publication.

At Kew, we have contributed to the data analysis, and have also worked on the specific challenge of visualising these data. We first assembled an informal consortium of people working on other species with an interest in the problem, developed a new conceptual framework for the simultaneous representation and visualisation of multiple genomes, and implemented a tool (Pantograph) supporting this framework. The tool was tested as a web-based platform to display genome sequences from the novel coronavirus 2019-nCoV, the causative agent of COVID-19, and development continues to support the scalable visualisation of larger genomes such as Arabidopsis.
Exploitation Route The tool, once complete, will be usable by research communities working on any species or data set, not only the Arabidopsis data set that is the targeted application in the grant proposal (for example, it will also be usable for human genomic data, and has already been used to explore genome variation in the novel coronavirus 2019-nCoV). More generally, the research undertaken in the project should significantly change our understanding of the process of genome variation, and provide us with new ways of thinking about it and new measurements to describe it.
Sectors Digital/Communication/Information Technologies (including Software), Healthcare

URL https://graphgenome.org
 
Description Pantograph Collaboration. 
Organisation Computomics
Country Germany 
Sector Private 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation Eberhard Karls University of Tübingen
Country Germany 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation IN-PART Publishing Ltd.
Country United Kingdom 
Sector Private 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation Karlsruhe Institute of Technology
Country Germany 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation Lipscomb University
Country United States 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation Pwani University
Country Kenya 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation Swiss Institute of Bioinformatics (SIB)
Country Switzerland 
Sector Charity/Non Profit 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation University of California, Santa Cruz
Country United States 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation University of Göttingen
Country Germany 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation University of Northern Colorado
Country United States 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation University of Rome Tor Vergata
Country Italy 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019
 
Description Pantograph Collaboration. 
Organisation University of Tokyo
Country Japan 
Sector Academic/University 
PI Contribution Our research team conceived and assembled the partnership, and have been the main drivers of its subsequent activities. Within the partnership, we focused on development of the visualisation elements.
Collaborator Contribution Other partners concentrated on development of the algorithms for constructing graphs from genomic sequences, and contributed to the overall development and design of the workflow.
Impact A prototype version of the Pantograph pan-genome viewer has been developed and made publicly available, and populated with data from SARS-CoV-2 genomes.
Start Year 2019