Adaptive sampling ('Read Until') methods in optimised nanopore sequencing technologies

Lead Research Organisation: European Bioinformatics Institute

Department Name: Sequence Database Group

Abstract

Over the last three decades, DNA sequencing has become a key technology across and beyond the life sciences. Indeed, few areas of biological research remain untouched by either the direct use of the technology or knowledge that is derived from others' work in which sequencing has been used. The technology has advanced rapidly. In the mid-2000s, a second generation of sequencing technologies, quite unlike the first, brought a step change in the rate at which sequencing machines could operate, and a corresponding vast reduction in the cost. These technologies now dominate and have led to a wealth of new and impactful scientific findings, not least as the core sequencing technology behind many thousands of animal, plant, fungal and bacterial projects. We are now on the cusp of a third wave of technology, 'nanopore' sequencing, again quite unlike those that precede it, that promises similar game-changing advances. In the 'Adaptive Sampling' project, we recognise the potential of nanopore sequencing and focus on a particular, as yet under-explored, feature of the technology that promises very significant impact.
Nanopore sequencing uses microscopic pores that can be engineered and organised onto a surface. The pores allow DNA molecules to pass through one at a time from one side of the surface to the other. As they transit, the pores provide a direct read-out of the bases (A, C, G and T) that pass the inner surface of the pore. The user places a mixture of DNA molecules (fragments of a whole genome) above the pore, which then captures the end of a DNA molecule and starts to draw it through, reading its sequence as it goes. The control of the system is so refined that, if desired, a DNA molecule can be rejected from a pore before it has been fully sequenced and the capture process can start again rapidly.
A key challenge for all sequencing platforms is that some parts of genomes are 'difficult to sequence' and others are not. Because of this, to be certain that a genome sequencing experiment has captured all parts of a genome, the user must set the experiment up to read the genome many times (often 30 times), so that the difficult regions are read at least once. With Adaptive Sampling, we plan to overcome this obstacle with software that will rapidly read the early sequence from a pore and decide whether the part of the genome that is emerging from the pore has been read already or is yet to be read. Based on this, a decision can be made either to reject the DNA molecule from the pore or to carry on reading to the end. The time saved by avoiding re-sequencing in this way will be substantial, enabling far more cost-effective, rapid and 'targeted' sequencing.
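The accept/reject step described above can be illustrated with a minimal Python sketch. It is not the project's software: the class name CoverageTracker, the fixed-width bins and the rule "reject once a bin has reached the target depth" are assumptions made purely for illustration.

```python
from collections import defaultdict


class CoverageTracker:
    """Hypothetical helper: tracks how often each genome bin has been kept."""

    def __init__(self, target_depth=30, bin_size=1000):
        self.target_depth = target_depth  # assumed target, echoing "often 30 times"
        self.bin_size = bin_size          # assumed bin width in bases
        self.kept = defaultdict(int)      # bin index -> number of reads kept so far

    def decide(self, mapped_position):
        """Return 'accept' to keep reading or 'reject' to eject the molecule."""
        if mapped_position is None:
            # The early sequence could not be placed on the reference;
            # this sketch keeps reading rather than guessing.
            return "accept"
        bin_index = mapped_position // self.bin_size
        if self.kept[bin_index] >= self.target_depth:
            # This part of the genome has already been read often enough.
            return "reject"
        self.kept[bin_index] += 1
        return "accept"
```

In this sketch a molecule whose early sequence maps to an already well-covered bin is ejected, freeing the pore for a molecule from a less well-covered region.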
While our technology will be useful broadly, we will work specifically with five example challenges, in which the tools will be useful. These cover detection and identification of infectious bacteria, the study of agricultural livestock, investigation of crop plant genomes, work on farmed fish to understand responses to disease-causing species and the analysis of communities of microbial species in the environment.
There is substantial novelty in this approach. In previous work on Ebola virus, we have shown that rejecting reads using a prototype of our software has potential. What we now propose will be the first example, to the best of our knowledge, of a sequencing approach in which data analysis (previously something that happened after sequencing was completed) has direct impact on the way in which the physical sequencing machine itself is operated during a sequencing experiment.
As part of the project, aiming at the broadest possible benefit to the research community, we plan to publish the software and hold two workshops in which we disseminate what we have developed to technologists, genomics laboratories, research scientists and industry.

Technical Summary

We propose to develop algorithms to enable adaptive sampling of DNA in real time by exploiting two unique properties of nanopore sequencers: data are streamed from nanopores, and the Oxford Nanopore Technologies MinION device allows a specific molecule to be ejected from a nanopore at any time, regardless of how completely it has been read. For this, two linked but distinct problems must be solved: the DNA molecule (represented by changes in current flow) must be mapped rapidly to a reference, and an accept/reject decision must be made based on accumulated previous mapping events. We will address both of these problems using five model cases of direct relevance to BBSRC science:
1. Rapid even coverage in bacterial genome sequencing (e.g. pathogen identification in food-borne disease)
2. Even coverage in diploid genome resequencing (e.g. marker and variant discovery in livestock welfare and breeding)
3. Sequencing of genomic regions of interest that are recalcitrant to conventional sequencing (e.g. in crop plant genomics)
4. Maximising discovery and quantification of low-abundance transcripts (e.g. in fish pathogen response transcriptomics)
5. Coordination of multi-sample sequencing in complex mixtures (e.g. in comparative metagenomics studies)
To achieve rapid matching of early read data to reference sequence we will explore several indexing/pre-computing strategies, including Fast Fourier Transform of streamed data; wavelet transform of the stream followed by indexing; and discretisation of the signal followed by suffix tree or FM-index processing. This tool would run on the laptop local to the sequencer. In contrast, the logical process for accepting or rejecting specific reads will be managed by an external server system running appropriate pipelines on the minoTour MinION analysis platform. Templates will be generated for minoTour allowing experienced users to generate pipelines for further specific use cases.
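Of the indexing strategies listed above, FM-index matching is the simplest to illustrate. The Python sketch below is an assumption-laden toy, not the project's implementation: the function names build_fm_index and find are hypothetical, the suffix array is built by naive sorting (adequate only for short references), and the symbols could equally be base calls or discretised signal levels.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, first-column offsets, occurrence table)."""
    text += "$"  # sentinel that sorts before A, C, G and T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, first, occ


def find(pattern, index):
    """Return the sorted start positions of exact matches, via backward search."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)
    for c in reversed(pattern):
        if c not in first:
            return []
        # Narrow the suffix-array interval to suffixes prefixed by c + (matched so far).
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


reference_index = build_fm_index("ACGTACGTTAGC")
print(find("ACG", reference_index))  # prints [0, 4]
```

The index is computed once per reference, after which each query takes a number of steps proportional to the query length rather than the reference length, which is the property that makes this family of methods attractive for matching the first few hundred symbols of a read.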

Planned Impact

The application of sequencing technologies underpins much of biological research today. Our approach, adaptive sampling in nanopore-based sequencing, serves to eliminate coverage bias and focus resolving power and thus has numerous beneficiaries. Within the broad UK and global academic and applied science communities these methods will benefit both those already using, and those yet to use, sequencing methods.
The direct impacts of our work will be delivered as an enabling software technology that allows broad use of adaptive sampling. During the project we will specifically demonstrate the technology in five areas of biological research and application, each of which represents a challenge area for current sequencing approaches. These are the rapid sequencing of bacterial pathogens for identification, typing and resistance profiling purposes (demonstrating coverage control in bacterial genome sequencing), marker and variant discovery in livestock resequencing (even coverage in diploid genome sequencing), access to regions that are difficult to sequence in higher plants, particularly the crop species (targeted genomic region sequencing), pathogen response transcriptome characterisation and profiling in farmed fish species (low-abundance transcript sequencing) and comparative metagenomics (coverage/focus control in multi-sample sequencing). We expect direct impact on groups of researchers who use sequencing approaches in these areas, including, but not limited to, those who have expressed support for the project (see letters of support).

Through the capacity to eliminate coverage bias, sequencing costs will be reduced, making sequencing available to areas of research and application for which cost remains prohibitive (such as deep population biology of crops, the discovery of low frequency variant alleles for livestock breeding programmes and the profiling of expression in non-model species). Through the ability to focus on defined regions, adaptive sampling will bring powerful methods to areas such as ecology and biodiversity (barcoding, whole-ecosystem analysis, occurrences and abundance), environmental sensing (water safety, environmental health, sentinel markers for pollution and climate change), food chain control (food species/breed/line validation, forensic tracking), border and trade control (invasive species, illegal trade in controlled species), bioenergy (investigation of new species, yield improvement), public health (environmental and zoonotic pathogen sinks, epidemiology of anti-microbial drug resistance) and animal health (surveillance, outbreak detection, transmission control).

The UK has long been established at the forefront of sequencing technology and the application of adaptive sampling methods to nanopore technologies will serve to continue this trend.

Funded Value:

£310,680

Funded Period:

Mar 17 - Feb 20

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/N018877/1

Principal Investigator:

Guy Cochrane

Research Subject:

Omic sciences & technologies (63%)

Tools, technologies & methods (28%)

Research Topic:

Bioinformatics (21%)

Genomics (49%)

Tools for the biosciences (7%)

Transcriptomics (14%)

Organisations

People	ORCID iD
Guy Cochrane (Principal Investigator)
Ewan Birney (Co-Investigator)

Publications

Harrison PW (2021) The European Nucleotide Archive in 2020. in Nucleic acids research

Jain M (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. in Nature biotechnology

Koren S (2019) Reply to 'Errors in long-read assemblies can critically affect protein prediction'. in Nature biotechnology

Loose M (2018) Finding the Needle: Targeted Nanopore Sequencing and CRISPR-Cas9. in The CRISPR journal

Payne A (2020) Nanopore adaptive sequencing for mixed samples, whole exome capture and targeted panels

Weilguny L (2020) Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design

Weilguny L (2023) Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design. in Nature biotechnology

Workman R (2018) Nanopore native RNA sequencing of a human poly(A) transcriptome

Key Findings
Impact Summary
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Description	We have delivered an integrated software product that brings together the individual components constructed and tuned during the project. This final system comprises three major components: the control system that communicates with and controls pores; a "fast loop" matching server that maps emerging sequences to a reference under a given "mask" and sends decisions back to the control module; and the "slow loop" that generates and updates the mask based on the accumulated full sequences.
Control system: this has been developed by the Nottingham team and implemented as a significant extension and refactoring of the MinoTour system. High-performance base calling software (Scrappie) emerged from Oxford Nanopore during the project, and on that basis we decided to base call the emerging signal and use the resulting sequence as the entity matched in the fast loop. MinoTour receives emerging signal via the MinKNOW API, calls bases, despatches sequences to the fast loop server and relays the decisions it receives as block/unblock control messages, again via MinKNOW. Published software from the project is available at https://github.com/LooseLab under minotourapp, minotourcli, ru and read_until_api_v2.
Fast loop server: this component has been developed by the EMBL-EBI group as a Java server that runs rapid sequence matching in C using the BWA aligner, which shows optimal performance for short sequences. Mask look-up and mask update use memory-mapped mask files (Java NIO) to reduce file-opening overheads, and overall the system supports thousands of parallel operations, giving a wide margin. Mask update is supported through a RESTful interface provided to the slow loop. The fast loop server performs effectively within laptop-scale compute on a human genome data set. Published software from the project is available from https://github.com/EGA-archive/ont_readuntil_server.
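The memory-mapped mask idea can be sketched in a few lines. The project's server is described as using Java NIO; the Python below uses the standard mmap module instead, and the one-byte-per-position layout (1 = still wanted, 0 = no longer wanted), the file name and the names create_mask and Mask are all assumptions for illustration rather than the published format.

```python
import mmap
import os
import tempfile


def create_mask(path, length, default=1):
    """Write a mask file with one byte per reference position (assumed layout)."""
    with open(path, "wb") as handle:
        handle.write(bytes([default]) * length)


class Mask:
    """Hypothetical mask backed by a memory-mapped file, opened once and reused."""

    def __init__(self, path):
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), 0)  # 0 maps the whole file

    def wanted(self, position):
        """Fast-loop look-up: is this reference position still wanted?"""
        return self._map[position] == 1

    def update(self, start, end, value):
        """Slow-loop update: set every position in [start, end) to value."""
        self._map[start:end] = bytes([value]) * (end - start)

    def close(self):
        self._map.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "mask.bin")
create_mask(path, 1000)
mask = Mask(path)
mask.update(100, 200, 0)  # positions 100..199 are no longer wanted
print(mask.wanted(50), mask.wanted(150))  # prints: True False
mask.close()
```

Because the file is mapped once and then indexed like an in-memory array, each look-up avoids the cost of opening and seeking within the file, which is the overhead the description above refers to.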
Slow loop: the EMBL-EBI group has designed an access-optimised data structure for the mask managed by the slow loop that is sufficiently performant in the fast loop server. "BOSS-RUNS", developed and published as a pre-print by the Nottingham and EMBL-EBI groups, provides a mathematical and algorithmic framework for defining the "value" of sequence reads that cover certain genomic positions, and the value of likely reads (https://www.biorxiv.org/content/10.1101/2020.02.07.938670v2.full). Simulations of whole bacterial genome sequencing under uniform and biased coverage, reduced representation bacterial genome sequencing from input whole bacterial DNA (multi-locus sequence typing and core genome multi-locus sequence typing), and yeast haploid and diploid whole genome sequencing provide a strategy for the use of Read Until in these scenarios. Published software is available at https://bitbucket.org/nicofmay/readuntilstrategy/src/master/. Further demonstration of our system, published as a pre-print, covers a number of Read Until sequencing scenarios: resolving specific chromosomes from human genome sequencing, resolving low-abundance genomes from metagenomics sequencing and selective sequencing of a panel of cancer genes (https://www.biorxiv.org/content/10.1101/2020.02.03.926956v2.full). Our original proposal for the Read Until project set out five challenge areas that we would address. We have directly addressed challenge areas 1 (even coverage in bacterial sequencing) and 2 (even coverage in diploid genome sequencing) in the simulations we describe above. Our real-world sequencing demonstrations described above are challenging cases for which opportunities arose; they relate to, but are not directly the same as, original challenge areas 3 (sequencing of recalcitrant regions) and 4 (discovery and quantitation of low-abundance transcripts).
In a recent addition (in the last year), the project's ReadFish software (https://github.com/LooseLab/readfish) has been integrated into ONT's MinKNOW control software to enable adaptive sampling simply and easily for end users. Finally, we have again seen rapid growth in ONT data published in the European Nucleotide Archive. With a total of 81,418 ONT-sequenced libraries, of which 67,972 (83%) have arrived since March 2020, data presented from EMBL-EBI's services come from some 400 submitting centres and cover more than 2,000 species and/or metagenomics biomes. Data submissions are supported through documentation (e.g. https://ena-docs.readthedocs.io/en/latest/submit/fileprep/reads.html) and the ENA's email help desk.
Exploitation Route	We expect impacts of value to the UK and international bioscience community, through the delivery of software components that enable and empower those using nanopore sequencing. We have completed the integration of the Read Until technical system and have published all software. We have demonstrated through simulation and in the laboratory challenging cases in which the system performs well. As the publications on these outputs emerge (currently available as pre-prints) we expect uptake by the broader community. Future community work envisaged includes support for the publication of Read Until sequencing strategy choices alongside published sequence data.
Sectors	Agriculture, Food and Drink; Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology


Description	The work in Read Until has provided software and expertise that have contributed to the scientific value and overall usefulness of nanopore-based sequencing across and beyond life science. Since the project was completed, the focus of software and service development work has built on Read Until inputs, alongside those from other projects. Active services remain in place at the European Nucleotide Archive that support data publication, access and reuse that were improved through the Read Until project. These services form an open global and permanent foundation for scientists to share and access nanopore data through the International Nucleotide Sequence Database Collaboration.
First Year Of Impact	2019
Sector	Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
Impact Types	Policy & public services


Title	BulkVIS
Description	BulkVIS is a tool for detailed analysis of raw signal data during Nanopore sequencing. This tool enables identification of longer reads than have previously been reported and more detailed understanding of how nanopore sequencing occurs.
Type Of Material	Technology assay or reagent
Year Produced	2018
Provided To Others?	Yes
Impact	The identification of the longest molecule sequenced to date. https://www.bbc.co.uk/news/science-environment-46046024
URL	https://github.com/LooseLab/bulkvis


Title	MinoTour version 1
Description	MinoTour is a complete laboratory information management system for Nanopore sequencing. It also includes customisable real time analysis.
Type Of Material	Improvements to research infrastructure
Year Produced	2019
Provided To Others?	Yes
Impact	This is a revision of a previously available tool and feeds in to several of our other projects.
URL	https://github.com/looselab/minotourapp


Title	Minotour Client
Description	This is a Python tool to upload data into our minoTour application.
Type Of Material	Improvements to research infrastructure
Year Produced	2019
Provided To Others?	Yes
Impact	This is feeding in to many of our existing projects.
URL	https://github.com/LooseLab/minotourcli


Title	Read Until API updates
Description	We have overhauled the Oxford Nanopore Read Until API
Type Of Material	Technology assay or reagent
Year Produced	2020
Provided To Others?	Yes
Impact	This tool will be partially integrated into Oxford Nanopore Technologies' own tools.
URL	https://www.github.com/looselab/read_until_api_v2


Title	European Nucleotide Archive
Description	Repository and database of record for sequence data.
Type Of Material	Database/Collection of data
Provided To Others?	Yes
Impact	Foundation for sequence-based science across applications, platforms and taxonomies.
URL	https://www.ebi.ac.uk/ena/browser/home


Title	European Nucleotide Archive - support for Oxford Nanopore Technologies data types
Description	The European Nucleotide Archive continues to support data flows from generators of Oxford Nanopore Technologies data, including those using software and tools originating from the Read Until project.
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes
Impact	Global users have open access to comprehensive deposition, search and access services around Oxford Nanopore Technologies data in INSDC databases.
URL	https://www.ebi.ac.uk/ena/browser/home


Description	The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome.
Organisation	National Institutes of Health (NIH)
Department	National Human Genome Research Institute (NHGRI)
Country	United States
Sector	Public
PI Contribution	I have been contributing expertise, time and sequencing data to the activities of the Telomere-to-Telomere consortium. The goal of this consortium is to sequence the first human genome from telomere to telomere. Our expertise through the Long Read Club has been exploited to enable this goal.
Collaborator Contribution	Other partners have generated sequencing data, analysed and assembled reads and presented this work.
Impact	No outputs to date.
Start Year	2019


Description	The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome.
Organisation	University of California, Santa Cruz
Country	United States
Sector	Academic/University
PI Contribution	I have been contributing expertise, time and sequencing data to the activities of the Telomere-to-Telomere consortium. The goal of this consortium is to sequence the first human genome from telomere to telomere. Our expertise through the Long Read Club has been exploited to enable this goal.
Collaborator Contribution	Other partners have generated sequencing data, analysed and assembled reads and presented this work.
Impact	No outputs to date.
Start Year	2019


Title	Read Until - mathematical model software - EMBL-EBI
Description	Software used in mathematical model development and simulation - EMBL-EBI
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	Software related to Read Until, used to develop the mathematical model, to run simulations and to define sequencing strategies - EMBL-EBI
URL	https://bitbucket.org/nicofmay/readuntilstrategy/src/master/


Title	Read Until software - EMBL-EBI
Description	Read Until component - A Python-based Short Loop / Short Read Masked Match Server - EMBL-EBI
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	This software has impact as a component of the Read Until system
URL	https://github.com/EGA-archive/ont_readuntil_server


Title	Read Until software - Python3 Read Until API implementation - Nottingham
Description	Python3 Read Until API implementation - a component of the Read Until system.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	This has impact as a component of the Read Until system.
URL	https://github.com/LooseLab/read_until_api_v2


Title	Read Until software - Read Until scripts - Nottingham
Description	Read Until scripts components of Read Until system
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	This has impact as a component of the Read Until system.
URL	https://github.com/LooseLab/ru


Title	Read Until software - minotourcli - Nottingham
Description	minotourcli component of Read Until system
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	This software has impact as part of the Read Until system.
URL	https://github.com/LooseLab/minotourcli


Title	minotour v 1
Description	Minotour is a real time set of tools for analysis of nanopore data.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	This is being used across a number of our projects.
URL	http://minotour.nottingham.ac.uk


Description	Grand Challenges in Genomics - Invited Panel Speaker - Joint meeting of the NHGRI/Wellcome Trust, London, Feb 2019
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Policymakers/politicians
Results and Impact	Grand Challenges in Genomics was a meeting to discuss the next ten years of Genomics and the ways in which both NHGRI and the Wellcome Trust should target investment and funding in the future.
Year(s) Of Engagement Activity	2019


Description	Long Read Club
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Long Read Club is an informal grouping of users interested in exploring long read sequencing technologies in all their guises. We are raising awareness of methods, best practice and experience. This is being done through a website, Twitter account and YouTube channel. Over 900 people have signed up to the email list, nearly 700 follow the Twitter account and over 130 have subscribed to the YouTube channel.
Year(s) Of Engagement Activity	2019
URL	http://youtube.com/c/longreadclub


Description	Matt Loose Presentation at Nanopore Community Day, Oslo, NO
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Matt Loose presented at a community event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity	2018


Description	Matt Loose presentation and instruction at Porecamp Nanopore Training Course, Birmingham, UK
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Matt Loose presented and served as an instructor at a nanopore sequencing community event.
Year(s) Of Engagement Activity	2017


Description	Matt Loose presentation at London Calling 2017, London, UK
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Matt Loose presented at a large event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity	2017


Description	Matt Loose presentation at Nanopore Community Day, Utrecht, NL
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Matt Loose presented at a community event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity	2017


Description	Matt Loose presentation at Nanopore Day, Cambridge, UK
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Matt Loose presented at a community event targeting those using Nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity	2017


Description	Matt Loose presentation at Viapath Symposium, London, UK
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Matt Loose presented at a community event targeting those using nanopore methods and those developing applications around these.
Year(s) Of Engagement Activity	2017


Description	Matt Loose presented and instructed at Texas A&M Porecamp Training course, Texas, US
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Matt Loose presented and served as an instructor at a nanopore sequencing community event.
Year(s) Of Engagement Activity	2017


Description	Oxford Nanopore - Basecalling Consensus Hackathon - Invited Contributor - July (2018)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	An invitation only hackathon to investigate questions around base calling and sequence consensus.
Year(s) Of Engagement Activity	2018


Description	Singapore Genome Centre - Porecamp Singapore Training Course - Lead Instructor and Keynote - Sept (2018)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Porecamp is an instructional course for using nanopore sequencing in the lab and the field. It is open to all and serves to increase the uptake of nanopore sequencing globally.
Year(s) Of Engagement Activity	2018


Description	University of British Columbia - Porecamp Training Course - Lead Instructor and Keynote - May (2018)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Porecamp is a training course to encourage uptake of Nanopore sequencing in the field and laboratory.
Year(s) Of Engagement Activity	2018