BlobToolKit: Identification and analysis of non-target data in all Eukaryotic genome projects

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences


Genomics has become one of the cornerstones of biology. Knowing an organism's genome sequence immediately allows us to work out what kinds of biology it is able to do, and acts as a platform upon which we can build experiments to test, for example, the dynamics of gene activity during stress or disease. If genomes are the cornerstones, genome databases are the libraries built from these data that allow science to collaborate and build upon its successes. Genome sequencing is getting easier, as technologies improve by leaps and bounds: new, high throughput sequencers and advanced computing. The human genome cost $3 billion to sequence the first time round: now it would cost about $15,000. This reduction in cost has opened up genome sequencing to many research projects on new species, and there are now about 30,000 bacterial genomes and 3,000 eukaryotic genomes in public databases.

When genomes are contaminated, the genome databases, the reference libraries, are also contaminated, and the scientific process becomes muddied: errors can be made that affect many later steps in understanding the natural world, or exploiting it for bioscience. Obviously no scientist knowingly submits contaminated genome data to the central databases, but as genome sequencing projects become more common, more and more contaminated data are getting into the databases of record.

How does contamination happen? Organisms live in environments with other species, and it is often not possible or not advisable to separate these before making DNA to be sequenced. For example, most animals have bacteria in their guts, and getting rid of these before extracting DNA from a whole specimen of a tiny species is difficult. Similarly, plants naturally have communities of fungi and bacteria growing in and on their leaves and roots. In the case of symbiotic organisms, where the interaction is very intimate, the specimen is indivisible. The genomes of the different contributing species will be mixed up in the raw sequence data generated from such samples.

We propose to build a set of computational tools, BlobToolKit, that will identify contaminants. BlobToolKit will be useful both during the process of making new genomes for the first time (where it will separate out the different organisms in the mix of raw sequence data), and during reanalyses of existing genome assemblies.

BlobToolKit will be made freely available as a standalone program, as a service on the internet, and as a system that will be plugged into the big public databases to report on possible contamination. The project, a collaboration between the University of Edinburgh and the European Bioinformatics Institute, aims, within 3 years, to have identified all the problems in "legacy" genomes already submitted to public databases, and to have in place a system that prevents further contamination happening.

BlobToolKit reports will be provided as part of the submission process to those scientists reporting genome assemblies, ensuring the exposure of our technology to its users. We will further promote BlobToolKit by publication of our results in open access journals, presentations and workshops at relevant meetings, discussion with standards organisations, delivering training workshops to interested groups of scientists, and maintaining a rich resource of training and tutorial materials on the web. Our aim is to steer the scientific community to a culture in which contamination in genome assembly is understood and expected, and freely available and versatile software tools are known that can assist in the flagging and prevention of contamination in the public record.

Technical Summary

Many next generation genome datasets derive from a mixture of taxa - either because the mixture is a biologically relevant unit (symbionts, organisms with associated metabiomes), or because the sample was, or became, contaminated. Separation of reads into bins corresponding to distinct organisms is essential for analysis, as mixed assemblies result in erroneous inferences - e.g. of species physiology, horizontal gene transfer, and holobiont biology. Unfortunately, public databases are already contaminated by wrongly taxonomically assigned sequences.

We propose to develop BlobToolKit, based on our successful Blobtools, to both clean the existing public databases and to ensure that future submissions are correctly annotated. BlobToolKit will use a range of algorithms to delineate distinct sequence and read bins in next generation data, and use these separate bins for independent analyses. BlobToolKit will include an interactive visualisation platform that will facilitate exploration of assembly data, and thus the generation of high-quality assemblies.

BlobToolKit use modes will be delivered by distinct packaging of the core software:
* It will be used by researchers assembling de novo, as part of high-quality assembly pipelines - delivered through a command line version, accessible through an API.
* It will be used by authors, editors, reviewers and database curators as a quality check before submission, acceptance of manuscripts or accessioning - delivered through a cloud-based Galaxy instance.
* It will be used by databases to display interactive graphical reports on accessioned genomes and support data reuse - delivered through API integration with databases.

Core development will be carried out in Edinburgh, and integration with service delivery through the European Nucleotide Archive will be delivered from the EMBL-EBI, Hinxton. We will develop training and outreach materials to promote uptake of BlobToolKit in the research community.

Planned Impact

We and others have identified a critical issue of contamination and misattribution in genomic sequences in the public databases. To rectify this legacy problem and to reduce its impact on future data submissions we propose a toolkit, BlobToolKit, that aids producers and users in identifying and correctly classifying such data.

How will BlobToolKit impact science and industry?
This work will have impact beyond the purely academic sphere of those generating genome sequences. In particular we envisage impacts in:
* Clinical science and delivery, as pathogens and other possibly harmful species will be correctly identified;
* Food production, where the improvement of methods of fermentation by microbes such as in brewing and cheese manufacture, requires access to accurately attributed sequence data;
* Crop science, as data relevant to invasive and pathogenic species will be available for monitoring, control and eradication programmes;
* Livestock health, as data relevant to emerging threats to the production of crop and livestock species from novel or imported pathogens will be available for monitoring and eradication programmes;
* Biofuel species development, where yield optimisation depends upon a clear mechanistic understanding of the genomics of the species to hand and its relatives, free from contaminant sequence;
* Drug discovery, where the process of initial lead definition will not be fatally misled by misattributed sequence;
* Bioprospecting, where correct linkage between sequences and the organisms they derive from will speed identification of useful bioactives;
* Biotechnology, where the engineering of synthetic pathways requires accurate identification of genomic material and its attribution to the correct sources.
We also recognise that SMEs are beginning to generate genome assemblies for target species, and BlobToolKit will aid these in generating high-quality data on which future investment can be based. The toolkit will be available under an appropriate open software license, permitting installation on local servers as well as on private cloud computing systems.

How will prospective users become informed about BlobToolKit?
By embedding BlobToolKit in standalone, cloud, and database-proximal versions, and by developing novel interactive visualisations, we will ensure that it has wide uptake and open availability. We will deliver BlobToolKit-enabled assessments of public data via a plugin to the ENA web data services. This will reach tens of thousands of data users per year. By flagging sequences with suspect annotation, we will improve the sequence search results, and interpretation of downloaded data, for many tens of thousands more.

Overall, the toolkit will serve to correct the scientific record at source, and provide an independent measure of data quality and reliability for future reuse.

Ultimately we hope that BlobToolKit will become part of the hidden but essential infrastructure that supports UK and global bioscience, whether academic or commercial. "Users" will realise that the data they are using has been screened by BlobToolKit, and will expect BlobToolKit stamps of credibility on data they access and exploit.



Related Projects

Project Reference | Relationship | Related To | Start | End | Award Value
BB/P024238/1 | | | 30/06/2017 | 29/06/2019 | £354,128
BB/P024238/2 | Transfer | BB/P024238/1 | 30/06/2019 | 29/06/2020 | £119,462
Description We have developed new ways of analysing complex mixtures of genomes that result from assembly of metagenomic samples. These include interactive viewers and data models for holding complex annotation information concerning likely taxonomic attribution of contigs within an assembly. The toolkit is now mature and published and BlobToolKit views of published assemblies are now available for over half of the genomes represented in INSDC databases. We have identified suspected, but previously unexplored, issues with the public databases that house sequence data. Researchers have submitted genome sequences where the data that is claimed to be from one species is actually from more than one, either because the original specimen was infected by a parasite or pathogen, or where there was contamination during the DNA sequencing process. This has revealed for the first time the genomes of some exciting parasites of animals: a parasite related to malaria in a primate genome dataset, and a set of related parasites in a number of bird genome datasets. These discoveries serve to "clean up" the scientific record, and to open new avenues of research.
Exploitation Route BlobToolKit is in use across the world, and views and screenshots from BTK analyses appear in online presentations and in publications, often without explicit attribution (i.e. the tool has become an invisible part of the genome science "process", which was exactly our aim). The toolkit is published open access and thus the programs are available to all for reuse. The toolkit we have developed is already in use worldwide, and is helping to prevent contamination occurring in the future. Others are exploring additional measures of genome integrity that could be added to the core analysis modes in our toolkit.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Environment, Healthcare

Title Computing Infrastructure Upgrades 2018-19 
Description In 2018 we upgraded the Blaxter lab compute cluster to be a cloud-based system and added RAM and compute nodes, to a total of 1024 nodes and just over 6 TB RAM. There is also a 0.33 PB disk farm.
Type Of Material Technology assay or reagent 
Year Produced 2018 
Provided To Others? Yes  
Impact The compute cluster is now used by seven research groups in the School of Biological Sciences, funded by NERC, BBSRC, ERC and Wellcome Trust 
Title BlobToolKit 1.0 
Description BlobToolKit is a complete refactoring of blobtools with a design focussing on a client-server interface, new agile databasing, interactive viewing and download. It has been deployed on local and Embassy cloud platforms. It produces reports of genome assembly quality that are easily interpreted, and embeddable in other applications and services.
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact BlobToolKit has been used to screen ~400 of the 2000 genomes deposited in ENA for contamination, and reports generated for incorporation in ENA presentations of these data. 
Title Blobtools 
Description The Python program analyses genome assemblies to generate data that can be used to filter contaminants and other complex mixtures. It produces both tabular and graphical output. The goal of many genome sequencing projects is to provide a complete representation of a target genome (or genomes) as underpinning data for further analyses. However, it can be problematic to identify which sequences in an assembly truly derive from the target genome(s) and which are derived from associated microbiome or contaminant organisms. We present BlobTools, a modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets. Using guanine+cytosine content of sequences, read coverage in sequencing libraries and taxonomy of sequence similarity matches, BlobTools can assist in primary partitioning of data, leading to improved assemblies, and screening of final assemblies for potential contaminants. Through simulated paired-end read datasets, containing a mixture of metazoan and bacterial taxa, we illustrate the main BlobTools workflow and suggest useful parameters for taxonomic partitioning of low-complexity metagenome assemblies.
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Blobtools is in use worldwide. 
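
The Blobtools description above names three per-sequence signals: GC content, read coverage and the taxonomy of similarity matches. The following is a minimal sketch of how such a table could be assembled for a handful of contigs. It is not the Blobtools implementation: the names `ContigRecord`, `gc_content`, `build_table` and `flag_candidates`, and the "no-hit" label, are hypothetical, and coverage and taxon values are assumed to have been computed elsewhere (for example from read mapping and similarity searches).

```python
from dataclasses import dataclass


@dataclass
class ContigRecord:
    """One row of an illustrative per-contig table (hypothetical layout)."""
    name: str
    length: int
    gc: float        # fraction of unambiguous bases that are G or C
    coverage: float  # mean read depth, supplied by the caller
    taxon: str       # taxon label of the best similarity match, supplied by the caller


def gc_content(sequence: str) -> float:
    """Return G+C as a fraction of A, C, G and T bases (ambiguous bases ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def build_table(contigs, coverage, taxa):
    """Combine sequences with externally computed coverage and taxon labels."""
    return [
        ContigRecord(
            name=name,
            length=len(seq),
            gc=gc_content(seq),
            coverage=coverage.get(name, 0.0),  # assumption: missing means no mapped reads
            taxon=taxa.get(name, "no-hit"),    # assumption: missing means no similarity match
        )
        for name, seq in contigs.items()
    ]


def flag_candidates(records, target_taxon):
    """List contigs whose taxon label differs from the target (unassigned ones are skipped)."""
    return [r.name for r in records if r.taxon not in (target_taxon, "no-hit")]
```

In this sketch a contig is only flagged when it carries a taxon label that disagrees with the target; in practice GC content and coverage would be inspected alongside the label before anything is removed.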
Title Blobtools v1.1 
Description Blobtools is a standalone toolkit that allows users to screen genome and other assemblies for potential contaminants. Version 1.1 of blobtools is an upgrade release that supports Python 3.7 and has additional enhancements in terms of speed and outputs.
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Blobtools is gaining wide traction in genome assembly communities. 
Title Drl/Blobtools: Blobtools V1.0 
Description major release 
Type Of Technology Software 
Year Produced 2017 
Description BlobTools2 is the latest iteration of the BTK pipeline with improved visualisation, analytic and download functionality. Similar to BlobTools v1, BlobTools2 is a command line tool designed to aid genome assembly QC and contaminant/cobiont detection and filtering. In addition to supporting interactive visualisation, a motivation for this reimplementation was to provide greater flexibility to include new types of information, such as BUSCO results and BLAST hit distributions. BlobTools2 supports command-line filtering of datasets, assembly files and read files based on values or categories assigned to assembly contigs/scaffolds through the blobtools filter command. Interactive filters and selections made using the BlobToolKit Viewer can be reproduced on the command line and used to generate new, filtered datasets which retain all fields from the original dataset. BlobTools2 is built around a file-based data structure, with data for each field contained in a separate JSON file within a directory (BlobDir) containing a single meta.json file with metadata for each field and the dataset as a whole. Additional fields can be added to an existing BlobDir using the blobtools add command, which parses an input to generate one or more additional JSON files and updates the dataset metadata. Fields are treated as generic datatypes: Variable (e.g. GC content, length and coverage) and Category (e.g. taxonomic assignment based on BLAST hits), alongside Array and MultiArray datatypes to store information such as start, end, NCBI taxid and bitscore for a set of BLAST hits to a single sequence. Support for new analyses can be added to BlobTools2 by creating a new Python module with an appropriate parse function.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact BTK is now used as the standard production viewer for Darwin Tree of Life and other major genome sequencing projects. 
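
The file-based layout described above (one JSON file per field, plus a single meta.json describing the dataset) can be illustrated with a small sketch. This is an assumption-laden toy, not the BlobTools2 code or its actual file format: the function names `create_blobdir`, `add_field`, `load_field` and `filter_identifiers`, and the JSON keys `records`, `fields`, `id`, `type` and `values`, are all hypothetical.

```python
import json
from pathlib import Path


def create_blobdir(path, identifiers):
    """Create a directory holding an identifiers file and an empty meta.json."""
    blobdir = Path(path)
    blobdir.mkdir(parents=True, exist_ok=True)
    (blobdir / "identifiers.json").write_text(json.dumps({"values": list(identifiers)}))
    meta = {"records": len(identifiers), "fields": []}  # hypothetical metadata keys
    (blobdir / "meta.json").write_text(json.dumps(meta, indent=2))
    return blobdir


def add_field(blobdir, field_id, datatype, values):
    """Write one field to its own JSON file and register it in meta.json."""
    blobdir = Path(blobdir)
    meta_path = blobdir / "meta.json"
    meta = json.loads(meta_path.read_text())
    if len(values) != meta["records"]:
        raise ValueError(f"{field_id}: expected {meta['records']} values, got {len(values)}")
    (blobdir / f"{field_id}.json").write_text(json.dumps({"values": list(values)}))
    # Replace any earlier entry for this field so that re-adding is idempotent.
    meta["fields"] = [f for f in meta["fields"] if f["id"] != field_id]
    meta["fields"].append({"id": field_id, "type": datatype})
    meta_path.write_text(json.dumps(meta, indent=2))


def load_field(blobdir, field_id):
    """Read the values stored for one field."""
    return json.loads((Path(blobdir) / f"{field_id}.json").read_text())["values"]


def filter_identifiers(blobdir, field_id, keep):
    """Return identifiers of records whose value in field_id satisfies keep()."""
    identifiers = load_field(blobdir, "identifiers")
    values = load_field(blobdir, field_id)
    return [name for name, value in zip(identifiers, values) if keep(value)]
```

A possible use: create a directory for three contigs, add a "gc" field of type "variable" and a "phylum" field of type "category", then call `filter_identifiers(blobdir, "gc", lambda v: v < 0.5)` to select the low-GC contigs. Keeping each field in its own file means a new analysis can be attached without rewriting existing data, which is the flexibility the description emphasises.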
Description 2018 Edinburgh Genomics Training Workshops
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Edinburgh Genomics delivered a rich portfolio of training courses in 2018, ranging from a one-day Introduction to Linux (delivered three times) through to an intensive one-week "spring school" in Bioinformatics for Genomics. The courses were delivered in Edinburgh, and had from 12 to 50 participants. The participants came from across the UK HEI sector, including undergraduates, postgraduates and postdoctoral researchers, and the courses also attracted overseas attendees (mainly from other European countries). The training strand is funded from student fees and from BBSRC and NERC dedicated sources. The training strand employs a full-time Training Manager who both administers the scheme and develops and delivers courses.
Year(s) Of Engagement Activity 2018
Description Bioinformatics Training Workshops, Buenos Aires and La Plata, Argentina
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We presented two week-long workshops under the auspices of CONICET and the University of La Plata, on bioinformatics tools for next generation genomics, including Blobtools, GenomeHubs and related topics.
Year(s) Of Engagement Activity 2018