Beyond a single reference: Building high quality graph genomes capturing global diversity

Lead Research Organisation: University of Edinburgh
Department Name: Roslin Institute

Abstract

Most species show substantial amounts of genetic diversity between individuals and populations. However, large amounts of this genetic diversity are missing from, and therefore inaccessible using, current reference genomes. Almost all references are derived from just one or a small handful of individuals, which are collapsed into a single pseudo-haploid representation, meaning hundreds of megabases of pan-genome DNA sequence are missing from most mammalian reference sequences. This includes regions likely to be associated with important phenotypes such as environmental adaptation and disease tolerance. Not only can these regions not be studied in current analyses reliant upon these genomes, but several studies have highlighted how reliance on these haploid references deleteriously biases analyses, even in the regions of the genome that are present. Reference mapping biases impact analyses as fundamental as genetic variant calling and gene expression studies, which ultimately means they are likely to be deleteriously affecting thousands of studies a year.
The emerging field of genome graphs aims to mitigate these issues by incorporating the diversity observed across individuals into a single graph representation of the species' pan-genome. This ensures all genomic regions can be captured and mitigates issues such as mapping biases by incorporating all known alleles and haplotypes as alternative routes through the graph. Despite their advantages, few high-quality graph genomes are currently available, primarily because the generation, annotation and visualisation of graph genomes are challenging, creating barriers to their wider use. The aim of this project is to drive forward the use of graph genomes by addressing these issues. By producing reusable, containerised pipelines for generating and working with genome graphs, we will enable researchers to rapidly generate and update graph genomes for their species of interest. We will use these pipelines and data from previous BBR projects to generate and make available the first high-quality cattle graph genome resource, encompassing the spectrum of genetic variants from large structural variants across sub-species to single nucleotide variants within breeds. To ensure graph genomes can be widely accessed, the third and final resource will be a new portal for viewing richly annotated genome graphs.

By facilitating the rapid creation of freely and publicly accessible graph genomes that are compatible with relevant downstream alignment and variant calling software, enabling their visualisation, and developing a new cattle graph genome, we expect this project to make a significant contribution to livestock research, ranging from studies mapping genetic loci linked to economically important traits to those seeking to understand the evolution of species. Additionally, the pipelines developed will be immediately transferable to the production of graph genomes for other species, significantly extending the impact of project outputs.

Technical Summary

Despite the advantages of graph genomes, they remain largely unused in livestock research. They are difficult to construct and almost no high-quality, pre-compiled and freely accessible graph genomes currently exist. Constructing a comprehensive graph genome that incorporates the spectrum of genetic variation, from large structural changes to single nucleotide variants (SNVs), can involve over fifty distinct analysis steps. This is in contrast to current pseudo-haploid reference genomes, which can be downloaded with a single click. If graph genomes are to be more widely used, reducing the biases in downstream analyses, these barriers need to be overcome. To address these issues, we propose to generate reusable, efficient and easy-to-use pipelines using Nextflow and Docker for creating mammalian graph genomes. These pipelines will be made freely available and fed directly into various genome projects for domesticated species. Using these pipelines we will construct and make available a high-quality, richly annotated cattle graph genome incorporating the spectrum of changes from large structural variants down to breed-specific SNVs. This will largely eliminate the barriers to using graph genomes in research on cattle, one of the most widely accessed species on the Ensembl genome browser. Given the advantages of using graph genomes, from more accurate variant calling to the reduction in reference biases in allele-specific expression studies, we expect such resources to have substantial downstream impacts. Improving the calling of large and small genetic variants will improve the ability to meaningfully map the genetic basis of economically important traits. As there are few current resources for visualising richly annotated graph genomes, we will also develop a graph genome browser built upon JBrowse2. This will allow users to visualise annotations along the alternative sequences present in the graph genome backbone.
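
The idea of alternative sequences being held as routes through a shared graph can be illustrated with a minimal sketch. The class, method names, node identifiers and toy sequences below are all hypothetical and chosen for illustration only; they are not the project's pipelines or the data model of any particular graph genome tool. Each node holds a stretch of sequence, and each haplotype is stored as an ordered walk over nodes, so sequence absent from a single linear reference remains addressable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SequenceGraph:
    """Toy sequence graph: nodes carry sequence, paths are haplotype walks."""

    nodes: Dict[int, str] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)
    paths: Dict[str, List[int]] = field(default_factory=dict)

    def add_path(self, name: str, walk: List[int]) -> None:
        """Register a haplotype as a walk and add the edges it traverses."""
        for node_id in walk:
            if node_id not in self.nodes:
                raise KeyError(f"unknown node {node_id}")
        self.edges.update(zip(walk, walk[1:]))
        self.paths[name] = list(walk)

    def spell(self, name: str) -> str:
        """Return the linear sequence spelled by a named haplotype path."""
        return "".join(self.nodes[node_id] for node_id in self.paths[name])

    def non_reference_bases(self, reference: str) -> int:
        """Count bases on nodes that the chosen reference path never visits."""
        on_reference = set(self.paths[reference])
        return sum(
            len(seq) for node_id, seq in self.nodes.items()
            if node_id not in on_reference
        )


# Hypothetical toy data: one SNV bubble (nodes 2/3) and one insertion (node 5).
graph = SequenceGraph(
    nodes={1: "ACGT", 2: "G", 3: "A", 4: "TTCA", 5: "GGGATC", 6: "CC"}
)
graph.add_path("reference", [1, 2, 4, 6])
graph.add_path("breed_b", [1, 3, 4, 5, 6])

print(graph.spell("reference"))
print(graph.spell("breed_b"))
print(graph.non_reference_bases("reference"))
```

In this sketch the second haplotype differs from the reference path by one substituted base and one inserted segment, and the final call reports how much sequence a single linear reference would omit.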

Planned Impact

Reference genomes are a core foundation of modern biological research. They provide the backbone for variant calling, genome assembly, RNA-seq and other sequencing analyses. They are used to annotate genes and functional elements and provide a common frame of reference for their locations. The variant alleles of an individual are recorded in population studies with respect to those found in the reference. Therefore, the ubiquity and fundamental importance of reference genomes means that these resources will have a wide range of both short- and long-term beneficiaries across academia, industry and ultimately the wider public.

Key beneficiaries of the cattle graph genome, and the graph genomes generated for other species using the reusable pipelines, will in the short term include UK and global academics and breeding companies mapping genes linked to economically important traits and diseases. Important production traits and diseases have already been linked to structural variants in cattle, sheep and pigs, but these are currently difficult to detect and assay. Improved variant calling and better representation of structural variants will improve the ability to map such functional loci for breeding and potentially gene editing studies across species. Graph genomes will allow the full diversity of genetic variants to be more accurately tested against key phenotypes and will inform the interpretation of genetic association and population genetic study results by providing candidate functional structural variants in relevant regions.

The immediate downstream beneficiaries of this will be livestock holders. The mature UK livestock breeding industry is well placed to exploit improved genomic resources and the economic benefits of improving local breeds through genetics are clear. Incorporating a novel assembly of the Holstein-Friesian, the UK's most common breed, into the graph genome will further ensure its relevance to the UK. Better representation across breeds of immune loci, such as the MHC, will substantially improve genetic association studies of disease, which is still a major barrier to productively rearing livestock.

Long-term beneficiaries will extend far beyond just farmers. For example, the current use of insecticides and acaricides has significant impacts on the environment and soil fertility, and reducing their use by understanding alternative mechanisms of reducing disease burden could have substantial longer-term environmental benefits. Furthermore, exploiting variants linked to drought tolerance, methane emissions and environmental adaptation could enable the development of less resource-intensive but productive breeds and help address some of the key challenges likely to arise from climate change.

A key challenge for graph genomes will be breaking the inertia built up from the use of current single, haploid genomes. To address this, and to complement our proposed UK-based training courses, we will use this resource in our ongoing training courses and workshops for African students and scientists, run in conjunction with BecA (Biosciences Eastern and Central Africa) and CTLGH.

Publications

Talenti A (2022) A cattle graph genome incorporating global breed diversity in Nature Communications

Talenti A (2021) nf-LO: A Scalable, Containerized Workflow for Genome-to-Genome Lift Over in Genome Biology and Evolution

 
Description We have created and made available the first cattle graph genome capturing global breed diversity and created new software enabling users to lift annotations between any pair of genomes, which will be particularly useful with the ever growing number of genome assemblies being created within and across species.
Exploitation Route Graph genomes, new genome assemblies and the tools for working with them have a diverse range of uses across studies, in particular for calling genetic variants that are missed when using a single reference genome. This work is therefore expected to feed into a diverse range of downstream work on the improvement of livestock.
Sectors Agriculture, Food and Drink

URL https://www.bomabrowser.com/
 
Description Data from the project was used to support a patent application
First Year Of Impact 2022
Sector Agriculture, Food and Drink
Impact Types Economic

 
Title Boran genome assembly 
Description We have generated a Boran HiFi assembly scaffolded with Bionano optical mapping data 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? No  
Impact We are using this to investigate the genetic basis of heritable tolerance to East Coast fever observed among this animal's pedigree. 
 
Title Cattle genome assemblies for Ankole and NDama breeds 
Description High-quality reference genome assemblies for two cattle breeds generated from PacBio and Illumina sequencing data and Bionano optical mapping data. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact The data has been submitted for publication 
URL https://www.bomabrowser.com/cattle.html
 
Description Member of bovine pangenome advisory committee 
Organisation U.S. Department of Agriculture USDA
Country United States 
Sector Public 
PI Contribution The bovine pangenome consortium was set up to generate new cattle genome assemblies from 100+ breeds. We are contributing expertise, samples and genome assemblies to the consortium
Collaborator Contribution The other members of the BPC are also contributing expertise and assemblies.
Impact The BPC has only recently been set up.
Start Year 2020
 
Title BOmA (Bovine Omic Atlas) 
Description BOmA is a genome browser tailored for viewing cattle omic data, including data being generated alongside or as part of this award. Data currently on the browser span both water buffalo and cattle and include, for example, genotypes from 420 global cattle breeds and optical mapping, ATAC-seq, RNA-seq and RRBS data for various breeds. The first version of the browser is available at https://www.bomabrowser.com/ and we are currently updating it to support the visualisation of graph genomes. 
Type Of Technology Webtool/Application 
Year Produced 2019 
Impact The browser has already been used to prioritise candidate functional sites, for example in regions putatively linked to trypanosome and T. parva tolerance. 
URL https://www.bomabrowser.com/
 
Title nf-LO: A scalable, containerised workflow for genome-to-genome lift over 
Description New genome assemblies are increasingly available but often come with few associated resources, limiting the range of studies that can be performed. A workaround is to lift over annotations from better annotated genomes. Generating the data needed to perform a liftover, however, is computationally and labour intensive, and only a limited number are currently available in public databases. We present nf-LO (nextflow-LiftOver), an easy-to-use and scalable Nextflow pipeline for performing liftovers between species. The workflow ships its dependencies through containers, and is easy to implement and scale on a wide range of systems. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact This software is being used to annotate new cattle assemblies we are generating 
URL https://github.com/evotools/nf-LO