Beyond a single reference: Building high quality graph genomes capturing global diversity

Lead Research Organisation: University of Edinburgh

Department Name: Roslin Institute

Abstract

Most species show substantial genetic diversity between individuals and populations. However, much of this diversity is missing from current reference genomes and is therefore inaccessible to analyses that rely on them. Almost all references are derived from just one or a small handful of individuals, collapsed into a single pseudo-haploid representation, meaning hundreds of megabases of pan-genome DNA sequence are missing from most mammalian reference sequences. This includes regions likely to be associated with important phenotypes such as environmental adaptation and disease tolerance. Not only can these regions not be studied in analyses reliant upon these genomes, but several studies have highlighted how reliance on these haploid references deleteriously biases analyses, even in the regions of the genome that are present. Reference mapping biases affect analyses as fundamental as genetic variant calling and gene expression studies, which ultimately means they are likely to be deleteriously affecting thousands of studies a year.
The emerging field of genome graphs aims to mitigate these issues by incorporating the diversity observed across individuals into a single graph representation of the species' pan-genome. This ensures all genomic regions can be captured and mitigates issues such as mapping biases by incorporating all known alleles and haplotypes as alternative routes through the graph. Despite their advantages, few high-quality graph genomes are currently available, primarily because the generation, annotation and visualisation of graph genomes are challenging, creating barriers to their wider use. The aim of this project is to drive forward the use of graph genomes by addressing these issues. By producing reusable, containerised pipelines for generating and working with genome graphs, we will enable researchers to rapidly generate and update graph genomes for their species of interest. We will use these pipelines and data from previous BBR projects to generate and make available the first high-quality cattle graph genome resource, encompassing the spectrum of genetic variants from large structural variants across sub-species to single nucleotide variants within breeds. To ensure graph genomes can be widely accessed, the third and final resource will be a new portal for viewing richly annotated genome graphs.
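
To illustrate what "alternative routes through the graph" means, the following is a minimal sketch in Python and not the project's software. All node IDs, sequences, path names and the example read are hypothetical values chosen for illustration. It shows a toy graph with one single nucleotide variant and one insertion, and a read carrying the non-reference alleles that matches a haplotype route exactly but does not match the linear reference.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy graph: each node holds a short sequence segment.
NODES: Dict[int, str] = {
    1: "ACGT",   # shared flanking sequence
    2: "A",      # reference allele of a single nucleotide variant
    3: "G",      # alternative allele of the same variant
    4: "TTCA",   # shared sequence
    5: "GGGCC",  # insertion absent from the linear reference
    6: "TA",     # shared flanking sequence
}

# Directed edges between segments; the graph is acyclic in this sketch.
EDGES: Set[Tuple[int, int]] = {
    (1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6),
}

# Named routes through the graph (assumed names, for illustration only).
PATHS: Dict[str, List[int]] = {
    "reference": [1, 2, 4, 6],
    "haplotype_b": [1, 3, 4, 5, 6],
}


def path_is_valid(path: List[int]) -> bool:
    """Check that every consecutive pair of nodes is joined by an edge."""
    return all((a, b) in EDGES for a, b in zip(path, path[1:]))


def spell(path: List[int]) -> str:
    """Concatenate node sequences to recover the sequence of a route."""
    return "".join(NODES[node] for node in path)


def all_routes(source: int, sink: int) -> List[List[int]]:
    """Enumerate every route from source to sink by depth-first search."""
    if source == sink:
        return [[sink]]
    routes: List[List[int]] = []
    for a, b in sorted(EDGES):
        if a == source:
            routes.extend([source] + rest for rest in all_routes(b, sink))
    return routes


def paths_containing(read: str) -> List[str]:
    """Return the names of stored paths whose sequence contains the read exactly."""
    return sorted(name for name, path in PATHS.items() if read in spell(path))


if __name__ == "__main__":
    read = "TGTTCAGG"  # hypothetical read carrying both non-reference alleles
    print(spell(PATHS["reference"]))
    print(spell(PATHS["haplotype_b"]))
    print(paths_containing(read))
    print(len(all_routes(1, 6)))
```

In this sketch the read is found only along the haplotype route, which is the situation in which alignment to a single linear reference would introduce mismatches or fail. Real graph genome tools use indexed alignment rather than exact substring search; the substring test here is a deliberate simplification.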

By facilitating the rapid creation of freely and publicly accessible graph genomes that are compatible with relevant downstream alignment and variant calling software, enabling their visualisation, and developing a new cattle graph genome, we expect this project to make a significant contribution to livestock research, ranging from studies mapping genetic loci linked to economically important traits to those seeking to understand the evolution of species. Additionally, the pipelines developed will be immediately transferable to the production of graph genomes for other species, significantly extending the impact of project outputs.

Technical Summary

Despite the advantages of graph genomes, they remain largely unused in livestock research. They are difficult to construct and almost no high-quality, pre-compiled and freely accessible graph genomes currently exist. Constructing a comprehensive graph genome that incorporates the spectrum of genetic variation, from large structural changes to single nucleotide variants (SNVs), can involve over fifty distinct analysis steps. This is in contrast to current pseudo-haploid reference genomes, which can be downloaded with a single click. If graph genomes are to be more widely used, reducing the biases in downstream analyses, these barriers need to be overcome. To address these issues, we propose to generate reusable, efficient and easy-to-use pipelines, built with Nextflow and Docker, for creating mammalian graph genomes. These pipelines will be made freely available and fed directly into various genome projects for domesticated species. Using these pipelines we will construct and make available a high-quality, richly annotated cattle graph genome incorporating the spectrum of changes from large structural variants down to breed-specific SNVs. This will largely eliminate the barriers to using graph genomes in research on cattle, one of the most widely accessed species on the Ensembl genome browser. Given the advantages of using graph genomes, from more accurate variant calling to the reduction of reference biases in allele-specific expression studies, we expect such resources to have substantial downstream impacts. Improving the calling of large and small genetic variants will improve the ability to meaningfully map the genetic basis of economically important traits. As there are currently few resources for visualising richly annotated graph genomes, we will also develop a graph genome browser built upon JBrowse2. This will allow users to visualise annotations along the alternative sequences present in the graph genome as well as along its backbone.

Planned Impact

Reference genomes are a core foundation of modern biological research. They provide the backbone for variant calling, genome assembly, RNA-seq and other sequencing analyses. They are used to annotate genes and functional elements and provide a common frame of reference for their locations. In population studies, the variant alleles of an individual are recorded with respect to those found in the reference. The ubiquity and fundamental importance of reference genomes therefore mean that these resources will have a wide range of both short- and long-term beneficiaries across academia, industry and ultimately the wider public.

Key beneficiaries of the cattle graph genome, and the graph genomes generated for other species using the reusable pipelines, will in the short term include UK and global academics and breeding companies mapping genes linked to economically important traits and diseases. Important production traits and diseases have already been linked to structural variants in cattle, sheep and pigs, but these variants are currently difficult to detect and assay. Improved variant calling and better representation of structural variants will improve the ability to map such functional loci for breeding and potentially gene editing studies across species. Graph genomes will allow the full diversity of genetic variants to be tested more accurately against key phenotypes, and will inform the interpretation of genetic association and population genetic study results by providing candidate functional structural variants in relevant regions.

The immediate downstream beneficiaries of this will be livestock holders. The mature UK livestock breeding industry is well placed to exploit improved genomic resources, and the economic benefits of improving local breeds through genetics are clear. Incorporating into the graph genome a novel assembly of the Holstein-Friesian, the UK's most common breed, will further ensure its relevance to the UK. Better representation across breeds of immune loci, such as the MHC, will substantially improve genetic association studies of disease, which is still a major barrier to productively rearing livestock.

Long-term beneficiaries will extend far beyond just farmers. For example, the current use of insecticides and acaricides has significant impacts on the environment and soil fertility, and reducing their use by understanding alternative mechanisms of reducing disease burden could have substantial longer-term environmental benefits. Furthermore, exploiting variants linked to drought tolerance, methane emissions and environmental adaptation could enable the development of less resource-intensive but productive breeds and help address some of the key challenges likely to arise from climate change.

A key challenge for graph genomes will be breaking the inertia built up from the use of current single, haploid genomes. To address this, and to complement our proposed UK-based training courses, we will use this resource in our ongoing training courses and workshops for African students and scientists, run in conjunction with BecA (Biosciences Eastern and Central Africa) and CTLGH.

Funded Value:

£436,525

Funded Period:

Nov 2020 - Oct 2023

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/T019468/1

Principal Investigator:

James Prendergast

Research Subject:

Omic sciences & technologies (44%)

Tools, technologies & methods (55%)

Research Topic:

Bioinformatics (33%)

Functional genomics (22%)

Genomics (22%)

eScience (22%)

Organisations

People	ORCID iD
James Prendergast (Principal Investigator)
Alan Archibald (Co-Investigator)
Liam Morrison (Co-Investigator)	http://orcid.org/0000-0002-8304-9066
Tim Connelley (Co-Investigator)
Andrea Talenti (Researcher Co-Investigator)	http://orcid.org/0000-0003-1309-3667

Publications

Andrea Talenti (2022) A cattle graph genome incorporating global breed diversity

Gundappa M (2023) High performance imputation of structural and single nucleotide variants in Atlantic salmon using low-coverage whole genome sequencing

Powell J (2023) Profiling the immune epigenome across global cattle breeds. in Genome biology

Smith TPL (2023) The Bovine Pangenome Consortium: democratizing production and accessibility of genome assemblies for global cattle breeds and other bovine species. in Genome biology

Talenti A (2022) A cattle graph genome incorporating global breed diversity.

Talenti A (2021) nf-LO: A Scalable, Containerized Workflow for Genome-to-Genome Lift Over. in Genome biology and evolution

Talenti A (2023) Continent-wide genomic analysis of the African buffalo (Syncerus caffer)

Talenti A (2021) A cattle graph genome incorporating global breed diversity

Talenti A (2021) nf-LO: A scalable, containerised workflow for genome-to-genome lift over

Key Findings

Description	We have created and made available the first cattle graph genome capturing global breed diversity and created a suite of new software tools. These include nf-LO (https://github.com/evotools/nf-LO), which enables users to lift annotations between any pair of genomes and has already been used across a range of projects, including the human telomere-to-telomere project. We have also generated a pipeline for creating ancestral genomes from graph genomes as part of our nSPECTRa workflow (https://github.com/evotools/nSPECTRa), and are developing an as-yet-unreleased pipeline for using graph genomes in imputation analyses that we expect to release this year.
Exploitation Route	Graph genomes, new genome assemblies and the tools for working with them have a diverse range of uses across studies, in particular for calling genetic variants missed when using a single reference genome. This work is therefore expected to feed into a diverse range of downstream work on the improvement of livestock. Not only are the software tools and resources that we have generated publicly available to the wider community, we are also directly contributing our genomes and resources to the major cattle pangenome efforts, including the long read consortium and the Bovine Pangenome Consortium. These efforts are expected to further leverage the outcomes of this grant for the benefit of the community.
Sectors	Agriculture, Food and Drink
URL	https://www.bomabrowser.com/

Impact Summary

Description	Data from the project was used to support a patent application
First Year Of Impact	2022
Sector	Agriculture, Food and Drink
Impact Types	Economic

Research Databases and Models

Title	Boran genome assembly
Description	We have generated a Boran HiFi assembly scaffolded with Bionano optical mapping data
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	No
Impact	We are using this to investigate the genetic basis of heritable tolerance to East Coast fever observed among this animal's pedigree.

Title	Cattle genome assemblies for Ankole and NDama breeds
Description	High-quality reference genome assemblies for two cattle breeds generated from PacBio and Illumina sequencing data and Bionano optical mapping data.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	The data has been submitted for publication
URL	https://www.bomabrowser.com/cattle.html

Collaboration

Description	Member of bovine pangenome advisory committee
Organisation	U.S. Department of Agriculture USDA
Country	United States
Sector	Public
PI Contribution	The Bovine Pangenome Consortium (BPC) was set up to generate new cattle genome assemblies from 100+ breeds. We are contributing expertise, samples and genome assemblies to the consortium.
Collaborator Contribution	The other members of the BPC are also contributing expertise and assemblies.
Impact	The BPC has only recently been set up.
Start Year	2020

Software and Technical Products

Title	BOmA (Bovine Omic Atlas)
Description	BOmA is a genome browser tailored for viewing cattle omic data, including data generated alongside or as part of this award. Data currently on the browser span both water buffalo and cattle and include, for example, genotypes from 420 global cattle breeds and optical mapping, ATAC-seq, RNA-seq and RRBS data for various breeds. The first version of the browser is available at https://www.bomabrowser.com/ and we are currently updating it to support the visualisation of graph genomes.
Type Of Technology	Webtool/Application
Year Produced	2019
Impact	The browser has already been used to prioritise candidate functional sites, for example in regions putatively linked to trypanosome and T. parva tolerance.
URL	https://www.bomabrowser.com/


Title	evotools/CattleGraphGenomePaper: Code for Talenti et al. A cattle graph genome incorporating global breed diversity.
Description	This release contains the code used for the analyses in Talenti et al. A cattle graph genome incorporating global breed diversity.
Type Of Technology	Software
Year Produced	2021
URL	https://zenodo.org/record/5749431


Title	evotools/CattleGraphGenomePaper: Code for Talenti et al. A cattle graph genome incorporating global breed diversity.
Description	This release contains the code used for the analyses in Talenti et al. A cattle graph genome incorporating global breed diversity.
Type Of Technology	Software
Year Produced	2021
URL	https://zenodo.org/record/5749432


Title	evotools/nSPECTRa: Release 1.0.0
Description	Nextflow workflow to compute the mutation spectra
Type Of Technology	Software
Year Produced	2024
Open Source License?	Yes
Impact	https://www.biorxiv.org/content/10.1101/2023.12.02.569698v1
URL	https://zenodo.org/doi/10.5281/zenodo.10784678


Title	nf-LO: A scalable, containerised workflow for genome-to-genome lift over
Description	New genome assemblies are increasingly available but often come with few associated resources, limiting the range of studies that can be performed. A workaround is to lift over annotations from better-annotated genomes. Generating the data needed to perform a liftover, however, is computationally and labour intensive, and only a limited number of liftovers are currently available in public databases. We present nf-LO (nextflow-LiftOver), an easy-to-use and scalable Nextflow pipeline for performing liftovers between species. The workflow ships its dependencies through containers, and is easy to implement and scale on a wide range of systems.
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	This software is being widely used, including as part of the generation of the human telomere-to-telomere genome (Rhie, Arang, et al. "The complete sequence of a human Y chromosome." Nature 621.7978 (2023): 344-354.).
URL	https://github.com/evotools/nf-LO