Beyond a single reference: Building high quality graph genomes capturing global diversity

Lead Research Organisation: University of Edinburgh
Department Name: Roslin Institute

Abstract

Most species show substantial genetic diversity between individuals and populations. However, much of this diversity is missing from, and therefore inaccessible using, current reference genomes. Almost all references are derived from just one or a small handful of individuals, which are collapsed into a single pseudo-haploid representation, meaning hundreds of megabases of pan-genome DNA sequence are missing from most mammalian reference sequences. This includes regions likely to be associated with important phenotypes such as environmental adaptation and disease tolerance. Not only can these regions not be studied in current analyses reliant upon these genomes, but several studies have highlighted how reliance on these haploid references deleteriously biases analyses, even in the regions of the genome that are present. Reference mapping biases impact analyses as fundamental as genetic variant calling and gene expression studies, which ultimately means they are likely to be deleteriously affecting thousands of studies a year.
The emerging field of genome graphs aims to mitigate these issues by incorporating the diversity observed across individuals into a single graph representation of the species' pan-genome. This ensures all genomic regions can be captured and mitigates issues such as mapping biases by incorporating all known alleles and haplotypes as alternative routes through the graph. Despite their advantages, few high-quality graph genomes are currently available, primarily because the generation, annotation and visualisation of graph genomes are challenging, creating barriers to their wider use. The aim of this project is to drive forward the use of graph genomes by addressing these issues. By producing reusable, containerised pipelines for generating and working with genome graphs, we will enable researchers to rapidly generate and update graph genomes for their species of interest. We will use these pipelines and data from previous BBR projects to generate and make available the first high-quality cattle graph genome resource, encompassing the spectrum of genetic variants from large structural variants across sub-species to single nucleotide variants within breeds. To ensure graph genomes can be widely accessed, the third and final resource will be a new portal for viewing richly annotated genome graphs.

By facilitating the rapid creation of freely and publicly accessible graph genomes that are compatible with relevant downstream alignment and variant calling software, enabling their visualisation, and developing a new cattle graph genome, we expect this project to make a significant contribution to livestock research, ranging from studies mapping genetic loci linked to economically important traits to those investigating the evolution of species. Additionally, the pipelines developed will be immediately transferable to the production of graph genomes for other species, significantly extending the impact of project outputs.

Technical Summary

Despite the advantages of graph genomes, they remain largely unused in livestock research. They are difficult to construct and almost no high-quality, pre-compiled and freely accessible graph genomes currently exist. Constructing a comprehensive graph genome that incorporates the spectrum of genetic variation, from large structural changes to single nucleotide variants (SNVs), can involve over fifty distinct analysis steps. This is in contrast to current pseudo-haploid reference genomes, which can be downloaded with a single click. If graph genomes are to be more widely used, reducing the biases in downstream analyses, these barriers need to be overcome. To address these issues, we propose to generate reusable, efficient and easy-to-use pipelines for creating mammalian graph genomes, using Nextflow and Docker. These pipelines will be made freely available and fed directly into various genome projects for domesticated species. Using these pipelines we will construct and make available a high-quality, richly annotated cattle graph genome incorporating the spectrum of changes from large structural variants down to breed-specific SNVs. This will largely eliminate the barriers to using graph genomes in research on cattle, one of the most widely accessed species on the Ensembl genome browser. Given the advantages of using graph genomes, from more accurate variant calling to the reduction in reference biases in allele-specific expression studies, we expect such resources to have substantial downstream impacts. Improving the calling of large and small genetic variants will improve the ability to meaningfully map the genetic basis of economically important traits. As there are few current resources for visualising richly annotated graph genomes, we will also develop a graph genome browser built upon JBrowse2. This will allow users to visualise annotations along the alternative sequences present in the graph genome backbone.
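
To make the idea of alternative sequences in a graph concrete, the following is a minimal illustrative sketch in Python, not part of the proposed pipelines. The node names, sequences and the `haplotypes` helper are all hypothetical. It shows a toy graph holding one SNV and one insertion, and enumerates every haplotype sequence that can be spelled by walking the graph from start to end.

```python
from typing import Dict, List

# Hypothetical toy graph: node id -> sequence segment (all values invented).
nodes: Dict[str, str] = {
    "n1": "ACGT",
    "n2": "A",          # reference allele at an SNV site
    "n3": "G",          # alternative allele at the same site
    "n4": "TTC",
    "n5": "GGGAAACCC",  # insertion absent from the linear reference
    "n6": "TGA",
}

# Directed edges: node id -> nodes that may follow it.
edges: Dict[str, List[str]] = {
    "n1": ["n2", "n3"],
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": ["n5", "n6"],  # either take the insertion or skip it
    "n5": ["n6"],
    "n6": [],
}

def haplotypes(start: str, end: str) -> List[str]:
    """Return the sorted sequences spelled by every path from start to end."""
    results: List[str] = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            results.append("".join(nodes[n] for n in path))
            continue
        for nxt in edges[node]:
            stack.append((nxt, path + [nxt]))
    return sorted(results)

# The linear reference corresponds to just one walk through the graph.
reference = "".join(nodes[n] for n in ["n1", "n2", "n4", "n6"])
all_haplotypes = haplotypes("n1", "n6")
```

In this toy case the single reference walk is one of four possible haplotypes; reads carrying the alternative allele or the insertion have an exact path to align to, which is the intuition behind reduced mapping bias.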

Planned Impact

Reference genomes are a core foundation of modern biological research. They provide the backbone for variant calling, genome assembly, RNA-seq and other sequencing analyses. They are used to annotate genes and functional elements and provide a common frame of reference for their locations. The variant alleles of an individual are recorded in population studies with respect to those found in the reference. Therefore, the ubiquity and fundamental importance of reference genomes means that these resources will have a wide range of both short- and long-term beneficiaries across academia, industry and ultimately the wider public.

Key beneficiaries of the cattle graph genome, and the graph genomes generated for other species using the reusable pipelines, will in the short term include UK and global academics and breeding companies mapping genes linked to economically important traits and diseases. Important production traits and diseases have already been linked to structural variants in cattle, sheep and pigs, but these are currently difficult to detect and assay. The improved variant calling and better representation of structural variants will improve the ability to map such functional loci for breeding and potentially gene editing studies across species. Graph genomes will allow the full diversity of genetic variants to be more accurately tested against key phenotypes, and will inform the interpretation of genetic association and population genetic study results by providing candidate functional structural variants in relevant regions.

The immediate downstream beneficiaries of this will be livestock holders. The mature UK livestock breeding industry is well placed to exploit improved genomic resources, and the economic benefits of improving local breeds through genetics are clear. Incorporating a novel assembly of the Holstein-Friesian, the UK's most common breed, into the graph genome will further ensure its relevance to the UK. Better representation across breeds of immune loci, such as the MHC, will substantially improve genetic association studies of disease, which is still a major barrier to productively rearing livestock.

Long-term beneficiaries will extend far beyond just farmers. The current use of insecticides and acaricides has significant impacts on the environment and soil fertility, for example, and reducing their use by understanding alternative mechanisms of reducing disease burden could have substantial longer-term environmental benefits. Furthermore, exploiting variants linked to drought tolerance, methane emissions and environmental adaptation could enable the development of less resource-intensive but productive breeds and help address some of the key challenges likely to arise from climate change.

A key challenge for graph genomes will be breaking the inertia built up from the use of current single, haploid genomes. To address this, we propose UK-based training courses; to complement them, in conjunction with BecA (Biosciences Eastern and Central Africa) and CTLGH, we will use this resource in our ongoing training courses and workshops for African students and scientists.
