A high-throughput platform for assembling genome data: the Saphyr

Lead Research Organisation: Queen Mary University of London
Department Name: Sch of Biological and Chemical Sciences

Abstract

Since the completion of the human genome project in 2003, after 13 years of work and billions of dollars, the ability to produce genomic data has accelerated at phenomenal rates. Advances in sequencing technologies mean that just 20 years on we can now sequence a genome in hours, not years, at the cost of a few hundred dollars. As such, genome sequence data can be produced by small and large labs alike, and numerous regional, national and global consortia have been newly launched with the goal of sequencing millions of species worldwide in the coming years.

Unfortunately, while the explosion of thousands of genome datasets has created new opportunities, for example screening genomes for variants that correlate with ecologically important traits such as disease resistance, it has also created equally steep challenges. A particular problem concerns the nature of genomic datasets; even after deep sequencing, most "genomes" actually comprise large numbers of non-contiguous fragments that cannot be stitched together into chromosomes. This incomplete assembly obstructs inferences of linkage, synteny, and genotype-phenotype association, and is especially severe for non-model plants, fungi and animals that are characterised by large complex genomes.

The Saphyr platform (Bionano Genomics) overcomes this obstacle by directly imaging megabase-length DNA molecules that carry a fluorescent barcode. DNA is treated with an enzyme that deposits fluorescent marks at specific sites, and then millions of large DNA molecules are passed through proprietary nanochannels and their images are converted to graphical barcodes. Using software, barcoded molecules are compared against each other and, based on matching patterns of fluorescent marks, assembled into 'optical contigs'; these are structural scaffolds, independent of sequence, that represent very large regions. Since these contigs easily span breaks in sequence assembly, they allow the generation of chromosome-length 'hybrid scaffolds' made up of optical and sequence contigs, where the short-read sequence assemblies have been treated with the enzyme in silico so that they exhibit the same barcode patterns as the optical molecules.
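The in silico step described above can be illustrated with a minimal Python sketch. It is not the vendor's software: the function names are hypothetical, and the recognition motif passed in the usage note is an assumption chosen for illustration. The sketch scans a sequence contig for a motif on both strands and reports the distances between successive sites, which is the kind of barcode pattern that would be compared with an optical map.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def label_positions(seq, motif):
    """Return sorted 0-based start positions of the motif on either strand.

    Hypothetical helper: a site on the reverse strand is reported at the
    forward-strand coordinate where its reverse complement begins.
    """
    seq = seq.upper()
    motif = motif.upper()
    positions = set()
    # A set of patterns avoids double counting when the motif is palindromic.
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            positions.add(start)
            # Step by one base so that overlapping sites are also found.
            start = seq.find(pattern, start + 1)
    return sorted(positions)


def barcode(seq, motif):
    """Return distances in bases between consecutive label sites."""
    sites = label_positions(seq, motif)
    return [later - earlier for earlier, later in zip(sites, sites[1:])]
```

For example, `barcode(contig, "CTTAAG")` would return the list of inter-site distances for a contig held in the string `contig`, with the motif given here only as an assumed example. A real pipeline would also have to allow for imaging resolution, missing or spurious labels, and molecule stretch, none of which this sketch models.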

Optical mapping is considered an essential step in producing high quality genomes, and features in the pipelines used by many large sequencing consortia, including the newly-launched Darwin Tree of Life Project, which aims to capture the genomic information of the ~66,000 UK species. The Saphyr is the only high-throughput optical mapping platform, yet there is currently no dedicated facility for NERC-funded research working at the interface of genomics and environmental science.

We will establish a NERC-oriented next-generation optical mapping facility at QMUL, underpinned by our technical support for isolating ultra-high molecular weight DNA from a range of biological samples and species. This high-throughput facility will meet current demand for generating high-quality genomes, and will thus enhance the UK's technical capacity in the fast-developing field of environmental genomics. Specifically, the facility will enable researchers to (1) include genome assemblies in current and future projects, (2) play more active roles in large genome networks, and (3) exploit the platform's versatility to tackle novel research questions regarding the genome-wide distributions of specific DNA motifs and features.

At an institutional level, the Saphyr will allow us to expand our research portfolio and address fundamental questions in ecological and evolutionary genomics in diverse non-model organisms, which range from trees and bacteria to mammals and annelids. Moreover, the Saphyr will support the training of specialised staff, strengthen our capacity to establish collaborations with other UK research institutions, and thus emerge as a common technical resource for the environmentally oriented research community.

Planned Impact

The impact:

The Saphyr is the only high-throughput optical mapping platform for genome assembly. Its installation at QM, underpinned by our expertise in ultra-high molecular weight DNA, will provide a new level of capacity in genomics research, both locally and among the NERC community. It will allow us to generate gold-standard, chromosome-scale assemblies for hundreds to thousands of individuals, regardless of genome size. Due to its unsurpassed throughput and cost effectiveness, the Saphyr is in demand. Our facility will help remove current bottlenecks in ecological genomics research in the UK, and will offer new opportunities to study biological phenomena at unprecedented scales, such as the role of structural and epigenetic variants in ecological responses to environmental change. It will also enable UK-based biologists to participate more actively in global large-scale genome projects. We will measure impact by tracking numbers of users (QM, regional, UK, overseas), and resulting publications and grants.

Societal & economic benefits:

By enabling rapid genome assembly, the Saphyr offers wide-ranging potential benefits. For example, at QM, Prof R Buggs is using genomics to uncover the genetic basis of resistance to diseases in trees, including ash dieback, which has decimated the UK's ash population with an estimated cost of ~£15 billion. At a national level, the Darwin Tree of Life Project aims to capture the genomic information of ~66,000 UK species, a highly ambitious challenge in ecological science that has wide anticipated societal and economic impact. Similar initiatives concern specific taxa, such as the 'Genome 10K', focusing on 10,000 vertebrate species [1]. Such consortia contribute to the US-led Earth BioGenome Project, aimed at sequencing 1.5 million species [2]. The success of these endeavours, and of individual projects, rests on high-quality genome assemblies, which can be generated rapidly by the Saphyr.
[1] Journal of Heredity, 100, 659-674.
[2] Nature, 2/11/2018, doi: 10.1038/d41586-018-07279-z

Actual and potential beneficiaries:

Research
Actual and potential beneficiaries will be direct users of the Saphyr, both internal and external to QM, who require genome assemblies for their research. Several such users have active funding. With the publication of these research findings, the asset will have broader impact, benefiting researchers working in fields of ecological and evolutionary genomics from across the global community.

Collaborations
A Saphyr based at QM and made available to the NERC and wider community will enhance the competitiveness of grant applications that either need genome assemblies, or which aim to exploit the Saphyr's other potential uses (e.g. identifying structural rearrangements, or genome-wide mapping of specific DNA motifs). As such, the facility will encourage new collaborations, especially between researchers with direct access and those based overseas.

Data
Direct data outputs will be the optical genomic maps and the resulting genome assemblies. Such datasets are, by convention, deposited in public sequence repositories, and so actual and potential beneficiaries will include other researchers who can utilize these data in their own research. Indirect outputs will be any data generated using genome assemblies; e.g. loci identified in genome-wide association studies of disease resistance in trees, or population frequencies of structural rearrangements. As such, beneficiaries will likely encompass researchers from broad fields, and other stakeholders such as funding agencies and policy makers.

Networks
A Saphyr will allow UK users to take more active roles in large research networks that require optical mapping, including major genome consortia. Rossiter is part of the Bat1K consortium aimed at sequencing 1000 bat genomes, which currently uses a Saphyr in Germany, while Buggs and Elphick are involved in consortia focused on trees and starfish, respectively.

Publications

Description London Next Generation Sequencing Symposium 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Presentation at the London Next Generation Sequencing Symposium 2022, held at the Crick Institute on 24th May 2022.

Part of the London Genomics Network (UK). The London Genomics Network is formed of 17 laboratories based in 8 academic institutions around London. These laboratories provide genomics support and expertise to researchers in their own institutions and the wider scientific community.
Year(s) Of Engagement Activity 2022
URL https://londongenomicsnetwork.org/member/