Adaptive sampling ('Read Until') methods in optimised nanopore sequencing technologies

Lead Research Organisation: European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

Over the last three decades, DNA sequencing has become a key technology across and beyond the life sciences. Indeed, few areas of biological research remain untouched by either the direct use of the technology or knowledge that is derived from others' work in which sequencing has been used. The technology has advanced rapidly. In the mid-2000s, a second generation of sequencing technologies, quite unlike the first, brought a step change in the rate at which sequencing machines could operate, and a corresponding vast reduction in the cost. These technologies now dominate and have led to a wealth of new and impactful scientific findings, not least as the core sequencing technology behind many thousands of animal, plant, fungal and bacterial projects. We are now on the cusp of a third wave of technology, 'nanopore' sequencing, again quite unlike those that precede it, that promises similar game-changing advances. In the 'Adaptive Sampling' project, we recognise the potential of nanopore sequencing and focus on a particular, as yet under-explored, feature of the technology that promises very significant impact.
Nanopore sequencing uses microscopic pores that can be engineered and organised onto a surface. The pores allow DNA molecules to pass through one at a time from one side of the surface to the other. As they transit, the pores provide a direct read-out of the bases (A, C, G and T) that pass the inner surface of the pore. The user places a mixture of DNA molecules (fragments of a whole genome) above the pore, which then captures the end of a DNA molecule and starts to draw it through, reading its sequence as it goes. The control of the system is so refined that, if desired, a DNA molecule can be rejected from a pore before it has been fully sequenced and the capture process can start again rapidly.
A key challenge for all sequencing platforms is that some parts of genomes are 'difficult to sequence' and others are not. Because of this, to be certain that a genome sequencing experiment has captured all parts of a genome, the user must set the experiment up to read the genome many times (often 30 times), so that the difficult regions are read at least once. With Adaptive Sampling, we plan to overcome this obstacle with software that will rapidly read the early sequence from a pore, and make a decision about whether the part of the genome that is emerging from the pore has been read already or is yet to be read. Based on this, a decision can be made whether to reject the DNA molecule from the pore or to carry on reading to the end. The time saving to be achieved by avoiding re-sequencing in this way will be substantial, leading to far more cost-effective, rapid and 'targeted' sequencing.
While our technology will be useful broadly, we will work specifically with five example challenges, in which the tools will be useful. These cover detection and identification of infectious bacteria, the study of agricultural livestock, investigation of crop plant genomes, work on farmed fish to understand responses to disease-causing species and the analysis of communities of microbial species in the environment.
There is substantial novelty in this approach. In previous work on Ebola virus, we have shown that rejecting reads using a prototype of our software has potential. What we now propose will be the first example, to the best of our knowledge, of a sequencing approach in which data analysis (previously something that happened after sequencing was completed) has direct impact on the way in which the physical sequencing machine itself is operated during a sequencing experiment.
As part of the project, aiming at the broadest possible benefit to the research community, we plan to publish the software and hold two workshops in which we disseminate what we have developed to technologists, genomics laboratories, research scientists and industry.

Technical Summary

We propose to develop algorithms to enable adaptive sampling of DNA in real time by exploiting a unique property of nanopore sequencers: data are streamed from nanopores, and the Oxford Nanopore Technologies MinION device allows a specific molecule to be ejected from a nanopore at any time, regardless of how completely it has been read. For this, two linked, but distinct, problems must be solved: the DNA molecule (represented by changes in current flow) must be mapped rapidly to a reference, and an accept/reject decision must be made based on accumulated previous mapping events. We will address both of these problems using five model cases of direct relevance to BBSRC science:
1. Rapid even coverage in bacterial genome sequencing (e.g. pathogen identification in food-borne disease)
2. Even coverage in diploid genome resequencing (e.g. marker and variant discovery in livestock welfare and breeding)
3. Sequencing of genomic regions of interest that are recalcitrant to conventional sequencing (e.g. in crop plant genomics)
4. Maximising discovery and quantification of low-abundance transcripts (e.g. in fish pathogen response transcriptomics)
5. Coordination of multi-sample sequencing in complex mixtures (e.g. in comparative metagenomics studies)
To achieve rapid matching of early read data to reference sequence, we will explore several indexing/pre-computing strategies, including Fast Fourier Transform of streamed data; wavelet transform of the stream followed by indexing; and discretisation of the signal followed by suffix tree or FM-index processing. This tool would run on the laptop local to the sequencer. In contrast, the logical process for accepting or rejecting specific reads will be managed by an external server system running appropriate pipelines on the minoTour MinION analysis platform. Templates will be generated for minoTour allowing experienced users to generate pipelines for further specific use cases.
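
As an illustration only of the "discretisation then index" idea, the following minimal Python sketch quantises a current trace into a small number of levels and looks up the first k levels of a read in a pre-computed index. It is not the project's implementation: a plain dictionary of k-tuples stands in for the suffix tree or FM-index, and the function names, the number of levels, the signal range and the toy reference are all assumptions made for the example.

```python
from collections import defaultdict


def discretise(signal, n_levels=4, lo=0.0, hi=1.0):
    """Map each current sample to an integer level in [0, n_levels)."""
    width = (hi - lo) / n_levels
    levels = []
    for x in signal:
        level = int((x - lo) / width)
        # Clamp samples that fall outside the assumed [lo, hi) range.
        levels.append(min(max(level, 0), n_levels - 1))
    return levels


def build_index(reference_levels, k):
    """Pre-compute a lookup from every k-tuple of levels to its positions."""
    index = defaultdict(list)
    for i in range(len(reference_levels) - k + 1):
        index[tuple(reference_levels[i:i + k])].append(i)
    return index


def map_prefix(read_signal, index, k):
    """Return candidate reference positions for the first k levels of a read."""
    levels = discretise(read_signal)
    if len(levels) < k:
        return []  # not enough signal yet to attempt a match
    return index.get(tuple(levels[:k]), [])


# Toy reference, already expressed as expected levels (hypothetical data).
reference_levels = [0, 1, 2, 3, 0, 2, 1, 3, 3, 0]
index = build_index(reference_levels, k=3)
hits = map_prefix([0.30, 0.60, 0.80, 0.10], index, k=3)  # hits == [1]
```

Exact matching of quantised levels is used here purely for brevity; real signal is noisy, and a practical matcher would need to tolerate mismatches.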

Planned Impact

The application of sequencing technologies underpins much of biological research today. Our approach, adaptive sampling in nanopore-based sequencing, serves to eliminate coverage bias and focus resolving power and thus has numerous beneficiaries. Within the broad UK and global academic and applied science communities these methods will benefit both those already using, and those yet to use, sequencing methods.
The direct impacts of our work will be delivered as an enabling software technology that allows broad use of adaptive sampling. During the project we will specifically demonstrate the technology in five areas of biological research and application, each of which represents a challenge area for current sequencing approaches. These are the rapid sequencing of bacterial pathogens for identification, typing and resistance profiling purposes (demonstrating coverage control in bacterial genome sequencing), marker and variant discovery in livestock resequencing (even coverage in diploid genome sequencing), access to regions that are difficult to sequence in higher plants, particularly the crop species (targeted genomic region sequencing), pathogen response transcriptome characterisation and profiling in farmed fish species (low-abundance transcript sequencing) and comparative metagenomics (coverage/focus control in multi-sample sequencing). We expect direct impact on groups of researchers who use sequencing approaches in these areas, including, but not limited to, those who have expressed support for the project (see letters of support).

Through the capacity to eliminate coverage bias, sequencing costs will be reduced, making sequencing available to areas of research and application for which cost remains prohibitive (such as deep population biology of crops, the discovery of low frequency variant alleles for livestock breeding programmes and the profiling of expression in non-model species). Through the ability to focus on defined regions, adaptive sampling will bring powerful methods to areas such as ecology and biodiversity (barcoding, whole-ecosystem analysis, occurrences and abundance), environmental sensing (water safety, environmental health, sentinel markers for pollution and climate change), food chain control (food species/breed/line validation, forensic tracking), border and trade control (invasive species, illegal trade in controlled species), bioenergy (investigation of new species, yield improvement), public health (environmental and zoonotic pathogen sinks, epidemiology of anti-microbial drug resistance) and animal health (surveillance, outbreak detection, transmission control).

The UK has long been established at the forefront of sequencing technology and the application of adaptive sampling methods to nanopore technologies will serve to continue this trend.
 
Description We have delivered an integrated software product that brings together the individual components constructed and tuned during the project. This final system comprises three major components: the system that communicates with and controls pores; a "fast loop" matching server that maps emerging sequences to reference with a given "mask", sending decisions back to the control module; and the "slow loop" that generates and updates the mask based on the accumulated full sequences.
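
The interplay of the fast loop, the slow loop and the mask can be pictured with the minimal Python sketch below. It is an illustration under stated assumptions, not the delivered system: the mask is a plain list of booleans, the slow loop uses a simple "below target depth" rule rather than the project's strategy model, unmapped reads are kept by assumption, and all function names are hypothetical.

```python
def fast_loop_decision(position, mask):
    """Decide on one emerging read from its mapped start position."""
    if position is None:
        return "accept"  # assumption: reads that cannot be mapped are kept
    return "accept" if mask[position] else "reject"


def record_read(coverage, start, length):
    """Add one completed read to the accumulated per-position coverage."""
    for pos in range(start, min(start + length, len(coverage))):
        coverage[pos] += 1


def slow_loop_update(mask, coverage, target_depth):
    """Regenerate the mask in place: a position stays wanted until it
    reaches the target depth (a deliberately simple stand-in rule)."""
    for pos, depth in enumerate(coverage):
        mask[pos] = depth < target_depth


# Toy run on a six-position "genome" with a target depth of 2.
coverage = [0] * 6
mask = [True] * 6
record_read(coverage, start=0, length=3)
record_read(coverage, start=0, length=3)
slow_loop_update(mask, coverage, target_depth=2)

decision_covered = fast_loop_decision(1, mask)    # region already at depth
decision_uncovered = fast_loop_decision(4, mask)  # region still needed
```

Keeping the per-read decision to a single mask look-up is what lets the fast loop answer quickly, while the more expensive recomputation of the mask happens separately on completed reads.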

Control system: this has been developed by the Nottingham team and implemented as a significant extension and refactoring of the MinoTour system. Based on high-performance read calling software that emerged from Oxford Nanopore (Scrappie) during the project, we made and enacted a decision to base call to produce the fast loop matching entity; MinoTour receives emerging signal via the MinKNOW API, calls bases, despatches sequences to the fast loop server and sends decisions received as block/unblock control messages again via MinKNOW. Published software from the project is available at https://github.com/LooseLab under minotourapp, minotourcli, ru and read_until_api_v2.

Fast loop server: this component has been developed by the EMBL-EBI group, as a Java server that runs rapid sequence matching in C using the BWA aligner, which shows optimal performance for short sequences. In the system, mask look-up and mask update leverage memory-mapped mask files to reduce file opening overheads (Java NIO) and overall the system supports thousands of parallel operations giving a wide margin. Mask update is supported with a RESTful interface provided to the slow loop. The fast loop server performs effectively within laptop-scale compute on a human genome data set. Published software from the project is available from https://github.com/EGA-archive/ont_readuntil_server.
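
To illustrate why a memory-mapped mask file avoids repeated file-opening overheads, here is a minimal Python sketch using the standard `mmap` module in place of Java NIO. The one-byte-per-position file layout, the `MappedMask` class and the `write_mask` helper are assumptions made for the example and do not describe the server's actual file format.

```python
import mmap
import os
import tempfile


def write_mask(path, mask_bits):
    """Write one byte per genomic position: 1 = wanted, 0 = not wanted."""
    with open(path, "wb") as fh:
        fh.write(bytes(1 if bit else 0 for bit in mask_bits))


class MappedMask:
    """Keep the mask file mapped so look-ups and updates need no re-open."""

    def __init__(self, path):
        self._fh = open(path, "r+b")
        self._mm = mmap.mmap(self._fh.fileno(), 0)  # map the whole file

    def wanted(self, pos):
        return self._mm[pos] == 1

    def update(self, pos, wanted):
        self._mm[pos] = 1 if wanted else 0

    def close(self):
        self._mm.close()
        self._fh.close()


# Toy usage with a four-position mask in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mask.bin")
    write_mask(path, [True, False, True, True])
    mask_file = MappedMask(path)
    before = mask_file.wanted(1)
    mask_file.update(1, True)  # e.g. an update arriving from the slow loop
    after = mask_file.wanted(1)
    mask_file.close()
```

Each look-up is then a single indexed read into the mapped region, with the operating system handling paging of the underlying file.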

Slow loop: the EMBL-EBI group has designed an access-optimised data structure for the mask managed by the slow loop that is sufficiently performant in the fast loop server. "BOSS-RUNS", developed by both the Nottingham and EMBL-EBI groups and published as a pre-print, provides a mathematical and algorithmic framework for defining the "value" of sequence reads that cover certain genomic positions, and the value of likely reads (https://www.biorxiv.org/content/10.1101/2020.02.07.938670v2.full). Simulations of whole bacterial genome sequencing under uniform and biased coverage, reduced representation bacterial genome sequencing from input whole bacterial DNA (multi-locus sequence typing and core genome multi-locus sequence typing), and yeast haploid and diploid whole genome sequencing provide a strategy for the use of Read Until in these scenarios. Published software is available at https://bitbucket.org/nicofmay/readuntilstrategy/src/master/.

Further demonstration of our system, published as pre-print, covers a number of Read Until sequencing scenarios: resolving specific chromosomes from human genome sequencing, resolving low-abundance genomes from metagenomics sequencing and selective sequencing of a panel of cancer genes (https://www.biorxiv.org/content/10.1101/2020.02.03.926956v2.full).

Our original proposal for the Read Until project set out five challenge areas that we would address. We have directly addressed challenge areas 1 (even coverage in bacterial sequencing) and 2 (even coverage in diploid genome sequencing) in the simulations we describe above. Our real-world sequencing demonstrations described above are challenging cases for which opportunities arose; they relate to, but are not directly the same as, original challenge areas 3 (sequencing of recalcitrant regions) and 4 (discovery and quantitation of low abundance transcripts).

In a recent addition (in the last year), the project's ReadFish software (https://github.com/LooseLab/readfish) has been integrated into ONT's MinKNOW control software to enable adaptive sampling simply and easily for end users.

Finally, we have again seen rapid growth in ONT data published in the European Nucleotide Archive. With a total of 81,418 ONT-sequenced libraries, of which 67,972 (83%) have arrived since March 2020, data presented from EMBL-EBI's services come from some 400 submitting centres and cover more than 2,000 species and/or metagenomics biomes. Data submissions are supported through documentation (e.g. https://ena-docs.readthedocs.io/en/latest/submit/fileprep/reads.html) and the ENA's email help desk.
Exploitation Route We expect impacts of value to the UK and international bioscience community, through the delivery of software components that enable and empower those using nanopore sequencing. We have completed the integration of the Read Until technical system and have published all software. We have demonstrated through simulation and in the laboratory challenging cases in which the system performs well. As the publications on these outputs emerge (currently available as pre-prints) we expect uptake by the broader community. Future community work envisaged includes support for the publication of Read Until sequencing strategy choices alongside published sequence data.
Sectors Agriculture, Food and Drink,Environment,Healthcare,Manufacturing, including Industrial Biotechnology,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology

 
Description The work in Read Until has provided software and expertise that have contributed to the scientific value and overall usefulness of nanopore-based sequencing across and beyond life science. Since the project was completed, the focus of software and service development work has built on Read Until inputs, alongside those from other projects. Active services remain in place at the European Nucleotide Archive that support data publication, access and reuse that were improved through the Read Until project. These services form an open global and permanent foundation for scientists to share and access nanopore data through the International Nucleotide Sequence Database Collaboration.
First Year Of Impact 2019
Sector Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment,Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
Impact Types Policy & public services

 
Title BulkVIS 
Description BulkVIS is a tool for detailed analysis of raw signal data during Nanopore sequencing. This tool enables identification of longer reads than have previously been reported and more detailed understanding of how nanopore sequencing occurs. 
Type Of Material Technology assay or reagent 
Year Produced 2018 
Provided To Others? Yes  
Impact The identification of the longest molecule sequenced to date. https://www.bbc.co.uk/news/science-environment-46046024 
URL https://github.com/LooseLab/bulkvis
 
Title MinoTour version 1 
Description MinoTour is a complete laboratory information management system for Nanopore sequencing. It also includes customisable real time analysis. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact This is a revision of a previously available tool and feeds into several of our other projects. 
URL https://github.com/looselab/minotourapp
 
Title Minotour Client 
Description This is a python tool to upload data into our minoTour application. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact This is feeding into many of our existing projects. 
URL https://github.com/LooseLab/minotourcli
 
Title Read Until API updates 
Description We have overhauled the Oxford Nanopore Read Until API 
Type Of Material Technology assay or reagent 
Year Produced 2020 
Provided To Others? Yes  
Impact This tool will be partially integrated into Oxford Nanopore Technologies' own tools. 
URL https://www.github.com/looselab/read_until_api_v2
 
Title European Nucleotide Archive 
Description Repository and database of record for sequence data. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact Foundation for sequence-based science across applications, platforms and taxonomies. 
URL https://www.ebi.ac.uk/ena/browser/home
 
Title European Nucleotide Archive - support for Oxford Nanopore Technologies data types 
Description The European Nucleotide Archive continues to support data flows from generators of Oxford Nanopore Technologies data, including those using software and tools originating from the Read Until project. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact Users worldwide have open access to comprehensive deposition, search and access services around Oxford Nanopore Technologies data in INSDC databases. 
URL https://www.ebi.ac.uk/ena/browser/home
 
Description The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome. 
Organisation National Institutes of Health (NIH)
Department National Human Genome Research Institute (NHGRI)
Country United States 
Sector Public 
PI Contribution I have been contributing expertise, time and sequencing data to the activities of the telomere-to-telomere consortium. The goal of this consortium is to sequence the first human genome from telomere-to-telomere. Our expertise through the Long Read Club has been exploited to enable this goal.
Collaborator Contribution Other partners have generated sequencing data, analysed and assembled reads and presented this work.
Impact No outputs to date.
Start Year 2019
 
Description The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome. 
Organisation University of California, Santa Cruz
Country United States 
Sector Academic/University 
PI Contribution I have been contributing expertise, time and sequencing data to the activities of the telomere-to-telomere consortium. The goal of this consortium is to sequence the first human genome from telomere-to-telomere. Our expertise through the Long Read Club has been exploited to enable this goal.
Collaborator Contribution Other partners have generated sequencing data, analysed and assembled reads and presented this work.
Impact No outputs to date.
Start Year 2019
 
Title Read Until - mathematical model software - EMBL-EBI 
Description Software used in mathematical model development and simulation - EMBL-EBI 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Software related to Read Until used to develop mathematical model, for simulations and to define sequencing strategies - EMBL-EBI 
URL https://bitbucket.org/nicofmay/readuntilstrategy/src/master/
 
Title Read Until software - EMBL-EBI 
Description Read Until component - A Python-based Short Loop / Short Read Masked Match Server - EMBL-EBI 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This software has impact as a component of the Read Until system 
URL https://github.com/EGA-archive/ont_readuntil_server
 
Title Read Until software - Python3 Read Until API implementation - Nottingham 
Description Python3 Read Until API implementation - a component of the Read Until system. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This has impact as a component of the Read Until system. 
URL https://github.com/LooseLab/read_until_api_v2
 
Title Read Until software - Read Until scripts - Nottingham 
Description Read Until scripts components of Read Until system 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This has impact as a component of the Read Until system. 
URL https://github.com/LooseLab/ru
 
Title Read Until software - minotourcli - Nottingham 
Description minotourcli component of Read Until system 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact This software has impact as part of the Read Until system. 
URL https://github.com/LooseLab/minotourcli
 
Title minotour v 1 
Description Minotour is a real time set of tools for analysis of nanopore data. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This is being used across a number of our projects. 
URL http://minotour.nottingham.ac.uk
 
Description Grand Challenges in Genomics - Invited Panel Speaker - Joint meeting of the NHGRI/Wellcome Trust, London, Feb 2019 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Grand Challenges in Genomics was a meeting to discuss the next ten years of Genomics and the ways in which both NHGRI and the Wellcome Trust should target investment and funding in the future.
Year(s) Of Engagement Activity 2019
 
Description Long Read Club 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Long Read Club is an informal grouping of users interested in exploring long read sequencing technologies in all their guises. We are raising awareness of methods, best practice and experience. This is being done through a website, Twitter account and YouTube channel. Over 900 people have signed up to the email list, nearly 700 follow the Twitter account and over 130 have subscribed to the YouTube channel.
Year(s) Of Engagement Activity 2019
URL http://youtube.com/c/longreadclub
 
Description Matt Loose Presentation at Nanopore Community Day, Oslo, NO 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Matt Loose presented at a community event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity 2018
 
Description Matt Loose presentation and instruction at Porecamp Nanopore Training Course, Birmingham, UK 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Matt Loose presented and served as an instructor at a nanopore sequencing community event.
Year(s) Of Engagement Activity 2017
 
Description Matt Loose presentation at London Calling 2017, London, UK 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Matt Loose presented at a large event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity 2017
 
Description Matt Loose presentation at Nanopore Community Day, Utrecht, NL 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Matt Loose presented at a community event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity 2017
 
Description Matt Loose presentation at Nanopore Day, Cambridge, UK 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Matt Loose presented at a community event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity 2017
 
Description Matt Loose presentation at Viapath Symposium, London, UK 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Matt Loose presented at a community event targeting those using nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity 2017
 
Description Matt Loose presented and instructed at Texas A&M Porecamp Training course, Texas, US 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Matt Loose presented and served as an instructor at a nanopore sequencing community event.
Year(s) Of Engagement Activity 2017
 
Description Oxford Nanopore - Basecalling Consensus Hackathon - Invited Contributor - July (2018) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact An invitation only hackathon to investigate questions around base calling and sequence consensus.
Year(s) Of Engagement Activity 2018
 
Description Singapore Genome Centre - Porecamp Singapore Training Course - Lead Instructor and Keynote - Sept (2018) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Porecamp is an instructional course for using nanopore sequencing in the lab and the field. It is open to all and serves to increase the uptake of nanopore sequencing globally.
Year(s) Of Engagement Activity 2018
 
Description University of British Columbia - Porecamp Training Course - Lead Instructor and Keynote - May (2018) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Porecamp is a training course to encourage uptake of Nanopore sequencing in the field and laboratory.
Year(s) Of Engagement Activity 2018