Underpinning UK Bioscience Research with high-throughput single molecule sequencing

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

The genomics revolution is entering its third phase with the advent of technologies that can generate single molecule DNA sequencing data extending 100s of kilobases. Long DNA sequence reads can be used to generate high quality reference genomes, to better understand gene expression, and to sequence and sort DNA in complex mixtures such as environmental samples of soil or water. The Pacific Biosciences Sequel II System is a new long-read DNA sequencing platform that generates extremely large amounts of data at an affordable cost. A key feature of Sequel II is that it can generate highly accurate long reads, by reading a DNA molecule multiple times to correct errors. This grant will acquire the Sequel II platform, the first in Scotland, embed this within Edinburgh Genomics, and allow us to offer low-cost, high quality long read sequencing as a service to the biological and biotechnology research community. We will use the investment to support researchers who want to generate genomic data for diverse topics such as agriculture, biotechnology, genetics, immunology and synthetic biology.

Technical Summary

The PacBio Sequel II is an evolution of the original Sequel single molecule sequencer. The improved technology and flow cell chemistry in the Sequel II will allow EdGe to generate highly accurate sequence reads of long DNA and RNA molecules. Single molecule, long read technology has benefits over short read technologies (e.g. Illumina), where assembly or mapping often results in ambiguities. The longer reads generated by single molecule sequencing do not need local assembly; they are more confidently assigned to the correct location. The complexity of genomes (DNA) and transcriptomes (RNA) can best be addressed with longer reads. For example, full length RNA molecules, which usually have many variants per gene, can be identified and compared between samples (letter of support from Richard Kuo). When assembling chromosome level sequences, long reads help to resolve issues such as repetitive regions and structural variation. The advantages of the Sequel II over the Sequel are its greatly increased throughput, due to its larger flow cell with 8 million zero-mode waveguides (ZMWs), and its longer movie times for data collection. The increased number of reads results in a lower cost per base (less than half the cost of Sequel). This makes research more cost effective, allowing more sequence depth, more samples (for additional replicates or biological questions), or both. For example, 40-50 bacterial samples can be sequenced for de novo assembly on a single SMRT cell. However, the biggest improvement in Sequel II is its accuracy. 'HiFi' reads are generated in circular consensus sequencing mode, where the same molecule is sequenced several times. This increases accuracy from 86% (single pass accuracy) to >99.9%, allowing detection of single nucleotide variants and helping to generate highly accurate phased genome and transcriptome assemblies. The improved accuracy of the Sequel II will allow hard-to-sequence regions, like the MHC (co-I Dr Connelley), to be accurately sequenced at scale.
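As a rough illustration of why repeated passes over the same molecule raise accuracy, the sketch below uses a deliberately simplified model that is not part of the proposal: each pass calls a given base correctly with the 86% single pass accuracy quoted above, errors are assumed independent between passes, and the consensus is assumed to be a plain majority vote. Real circular consensus software uses a statistical consensus model rather than a majority vote, so the function names and the numbers this produces are illustrative assumptions only.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n passes is correct at one base.

    Toy assumptions: each pass is correct with probability p, passes are
    independent, and n is odd so that a tie cannot occur.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number of passes")
    q = 1.0 - p
    # Sum the binomial probabilities of having more correct than incorrect passes.
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(n // 2 + 1, n + 1))


def passes_needed(p: float, target: float, max_passes: int = 99) -> int:
    """Smallest odd number of passes whose majority-vote accuracy reaches target."""
    for n in range(1, max_passes + 1, 2):
        if majority_accuracy(p, n) >= target:
            return n
    raise ValueError("target not reached within max_passes")


if __name__ == "__main__":
    single_pass = 0.86  # single pass accuracy quoted in the Technical Summary
    for n in (1, 3, 5, 7, 9):
        print(f"{n} passes: {majority_accuracy(single_pass, n):.4f}")
    print("passes for >= 99.9%:", passes_needed(single_pass, 0.999))
```

Under these assumptions the per-base consensus accuracy rises quickly with the number of passes, which is the intuition behind sequencing the same molecule several times.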

Planned Impact

The initial impact of this investment will be in high quality biological research published in peer reviewed journals, conference proceedings and invited lectures. Often, high impact work is also the subject of print and broadcast media articles, which broaden the reach of the work to inform the general public and policy makers. The Sequel II will allow more studies, at lower costs, and with better results than previous technologies. Because Edinburgh Genomics (EdGe) is an open access Genome Facility, the impact of the investment will be seen across the University of Edinburgh but also in other institutes in the UK, and in collaboration with scientists abroad. The expected lifespan of the sequencer is around five years, and it is expected that more than 100 projects will benefit from access to the sequencer.

These studies will have broad societal impact as well as basic scientific value. The research excellence afforded by this new technology will promote invaluable real world benefits to areas such as food security (through animal health, crop disease resistance), health and medical science (in disease areas such as cardiovascular, inflammation, cancer and reproductive biology and exciting technologies like stem cell therapy), and our ecosystems (with environmental research in oceans and forests).

Students, postdocs and PIs from the University of Edinburgh and from across the UK will benefit from our new wet and dry lab long read sequencing course. This will train 48 scientists per year in the process of using long read sequence data in biological research. Additional bioinformatics-only workshops will be offered to a larger number (<100) of attendees per year. These 'upskilled' scientists will feed back into more high quality long read biological research across agricultural, medical and environmental biology.

Through public engagement, the science that this investment facilitates will be brought to a large audience and will feed back into further valuable research as its value becomes clearer and public opinion demands more from our scientists. Influence on governmental and funding policies will increase the speed and scope of science's impact on our lives.

Publications

Description In total we have successfully sequenced 39 samples for 11 different PIs: 6 PIs from the University of Edinburgh and 5 PIs from five different external academic institutions. These are:



· 20 samples from 4 PIs in the University who are Co-Investigators in this grant for the purchase of the PacBio sequencer:

· Daniel MacQueen, 8 samples for HiFi WGS - structural variant analysis of Oyster populations.

· James Prendergast, 6 samples for HiFi WGS - Cattle genome de novo assembly.

· Mick Watson, 5 samples for HiFi WGS - metagenomics analysis.

· Tim Connelley, 1 sample for HiFi amplicon sequencing.



· 3 samples are from 2 PIs in the University of Edinburgh (who are not Co-Investigators in the grant):

· 1 PI from the Roslin Institute, 2 samples for HiFi RNA sequencing (Teloprime kit) - Sheep transcriptome analysis

· 1 PI from the Wellcome Centre for Cell Biology, 1 sample for HiFi WGS.



· 16 samples are from 5 PIs from 5 different academic institutions outside the University of Edinburgh.

· University of Glasgow, 1 sample for HiFi WGS - European lamprey genome sequencing.

· Weizmann Institute of Science, 7 samples for HiFi IsoSeq - Cacti RNA; and 4 samples for HiFi WGS - Cacti genome assembly.

· University of Birmingham, 1 sample for HiFi amplicon sequencing.

· Royal Botanic Garden Edinburgh (RBGE), 1 sample for HiFi IsoSeq - Transcriptome sequencing of Streptocarpus rexii.

· Imperial College London, 2 samples for CLR WGS - De novo Genome assembly and methylome analysis of Streptococcus pyogenes sub-lineages.



I have provided names only for the PIs who are Co-Investigators in this grant, as the Researchfish T&Cs say not to include information that cannot be made public. We cannot disclose users' information publicly, but I don't think that applies to the Co-Investigators of the grant.



In addition, the Facility has another 15 samples currently being processed, all for HiFi whole genome sequencing. Ten of them are from 4 PIs in the University of Edinburgh. Two of these PIs are Co-Investigators of the PacBio grant: James Prendergast and Alex Twyford, with three cattle and four Scottish thistle samples respectively, both for de novo genome sequencing and assembly. The other 5 samples, from outside UoE, are from the RBGE (4 samples) and the University of Kent (1 sample).



We have also confirmed another 3 projects (two of them from the University), accounting for a total of 17 samples, for which we are waiting for the samples to be sent. Of the two University projects still awaiting samples, one is from your group, for 6 mouse samples for WGS.
Exploitation Route We provide a sequencing service, so every project completed will allow research groups to generate new genomic insights for many species, owing to the nature of the sequencing technology (high accuracy, long-read, single-molecule sequencing). In summary, the outcomes of this funding will make it possible for researchers to obtain high quality reference genomes for many species, generate pan-genomes, study large structural variants in genomes, fully characterize transcriptomes (defining true transcription start sites and allowing complete annotation of genomes), achieve higher resolution (up to species level) in microbiome and metagenomics studies, and carry out methylation studies of genomes.

In particular, from the projects done so far, the data generated with the PacBio Sequel IIe sequencer will allow others to pursue:

- The study of large structural variants of oyster populations

- Generation of cattle high quality reference genome and pan-genomes

- Metagenomics analysis at the species level from environmental samples

- Full transcriptome characterization and annotation of sheep tissues

- Creation of a European lamprey high quality reference genome

- Generation of several cacti high quality reference genomes and full transcriptome annotations

- Characterization of high quality reference genomes for several plants such as Streptocarpus sp. and Cirsium vulgare

- Bacterial high quality reference genomes and methylation analysis (e.g. Streptococcus pyogenes)
Sectors Agriculture, Food and Drink, Healthcare, Pharmaceuticals and Medical Biotechnology