BlobToolKit: Identification and analysis of non-target data in all Eukaryotic genome projects

Lead Research Organisation: The Wellcome Trust Sanger Institute
Department Name: Research Directorate


Genomics has become one of the cornerstones of biology. Knowing an organism's genome sequence immediately allows us to work out what kinds of biology it is able to do, and acts as a platform upon which we can build experiments to test, for example, the dynamics of gene activity during stress or disease. If genomes are the cornerstones, genome databases are the libraries built from these data that allow science to collaborate and build upon its successes. Genome sequencing is getting easier as the underlying technologies, new high-throughput sequencers and advanced computing, improve by leaps and bounds. The human genome cost $3 billion to sequence the first time round: now it would cost about $15,000. This reduction in cost has opened up genome sequencing to many research projects on new species, and there are now about 30,000 bacterial genomes and 3,000 eukaryotic genomes in public databases.

When genomes are contaminated, the genome databases, the reference libraries, are also contaminated, and the scientific process becomes muddied: errors can be made that affect many later steps in understanding the natural world, or exploiting it for bioscience. Obviously no scientist knowingly submits contaminated genome data to the central databases, but as genome sequencing projects become more common, more and more contaminated data are getting into the databases of record.

How does contamination happen? Organisms live in environments with other species, and it is often not possible or not advisable to separate these before making DNA to be sequenced. For example, most animals have bacteria in their guts, and getting rid of these before extracting DNA from a whole specimen of a tiny species is difficult. Similarly, plants naturally have communities of fungi and bacteria growing in and on their leaves and roots. In the case of symbiotic organisms, where the interaction is very intimate, the specimen is indivisible. The genomes of the different contributing species will be mixed up in the raw sequence data generated from such samples.

We propose to build a set of computational tools, BlobToolKit, that will identify contaminants. BlobToolKit will be useful both during the process of making new genomes for the first time (where it will separate out the different organisms in the mix of raw sequence data), and during reanalyses of existing genome assemblies.

BlobToolKit will be made freely available as a standalone program, as a service on the internet, and as a system that will be plugged into the big public databases to report on possible contamination. The project, a collaboration between the University of Edinburgh and the European Bioinformatics Institute, aims, within 3 years, to have identified all the problems in "legacy" genomes already submitted to public databases, and to have in place a system that prevents further contamination happening.

BlobToolKit reports will be provided as part of the submission process to those scientists reporting genome assemblies, ensuring the exposure of our technology to its users. We will further promote BlobToolKit by publishing our results in open access journals, giving presentations and workshops at relevant meetings, holding discussions with standards organisations, delivering training workshops to interested groups of scientists, and maintaining a rich resource of training and tutorial materials on the web. Our aim is to steer the scientific community to a culture in which contamination in genome assembly is understood and expected, and in which freely available, versatile software tools for flagging and preventing contamination in the public record are widely known.

Technical Summary

Many next generation genome datasets derive from a mixture of taxa - either because the mixture is a biologically relevant unit (symbionts, organisms with associated metabiomes), or because the sample was, or became, contaminated. Separation of reads into bins corresponding to distinct organisms is essential for analysis, as mixed assemblies result in erroneous inferences - e.g. of species physiology, horizontal gene transfer, and holobiont biology. Unfortunately, public databases are already contaminated by wrongly taxonomically assigned sequences.

We propose to develop BlobToolKit, based on our successful Blobtools, both to clean the existing public databases and to ensure that future submissions are correctly annotated. BlobToolKit will use a range of algorithms to delineate distinct sequence and read bins in next generation data, and use these separate bins for independent analyses. BlobToolKit will include an interactive visualisation platform that will facilitate exploration of assembly data, and thus the generation of high-quality assemblies.

BlobToolKit use modes will be delivered by distinct packaging of the core software:
* It will be used by researchers assembling de novo, as part of high-quality assembly pipelines - delivered through a command line version, accessible through an API;
* It will be used by authors, editors, reviewers and database curators as a quality check before submission, acceptance of manuscripts or accessioning - delivered through a cloud-based Galaxy instance;
* It will be used by databases to display interactive graphical reports on accessioned genomes and support data reuse - delivered through API integration with databases.

Core development will be carried out in Edinburgh, and integration with service delivery through the European Nucleotide Archive will be delivered from the EMBL-EBI, Hinxton. We will develop training and outreach materials to promote uptake of BlobToolKit in the research community.

Planned Impact

We and others have identified a critical issue in the public databases: genomic sequences that are contaminated and attributed to the wrong organism. To rectify this legacy problem and to reduce its impact on future data submissions we propose a toolkit, BlobToolKit, that aids producers and users in identifying and correctly classifying such data.

How will BlobToolKit impact science and industry?
This work will have impact beyond the purely academic sphere of those generating genome sequences. In particular we envisage impacts in:
* Clinical science and delivery, as pathogens and other possibly harmful species will be correctly identified;
* Food production, where the improvement of methods of fermentation by microbes, such as in brewing and cheese manufacture, requires access to accurately attributed sequence data;
* Crop science, as data relevant to invasive and pathogenic species will be available for monitoring, control and eradication programmes;
* Livestock health, as data relevant to emerging threats to crop and livestock production from novel or imported pathogens will be available for monitoring and eradication programmes;
* Biofuel species development, where yield optimisation depends upon a clear mechanistic understanding of the genomics of the species to hand and its relatives, free from contaminant sequence;
* Drug discovery, where the process of initial lead definition will not be fatally misled by misattributed sequence;
* Bioprospecting, where correct linkage between sequences and the organisms they derive from will speed identification of useful bioactives;
* Biotechnology, where the engineering of synthetic pathways requires accurate characterisation of genomic material and its attribution to the correct source.
We also recognise that SMEs are beginning to generate genome assemblies for target species, and BlobToolKit will aid these in generating high-quality data on which future investment can be based. The toolkit will be available under an appropriate open software license, permitting installation on local servers as well as on private cloud computing systems.

How will prospective users become informed about BlobToolKit?
By delivering BlobToolKit in standalone, cloud, and database-proximal versions, and by developing novel interactive visualisations, we will ensure that it has wide uptake and open availability. We will deliver BlobToolKit-enabled assessments of public data via a plugin to the ENA web data services. This will reach tens of thousands of data users per year. By flagging sequences with suspect annotation, we will improve sequence search results, and the interpretation of downloaded data, for many tens of thousands more.

Overall, the toolkit will serve to correct the scientific record at source, and provide an independent measure of data quality and reliability for future reuse.

Ultimately we hope that BlobToolKit will become part of the hidden but essential infrastructure that supports UK and global bioscience, whether academic or commercial. "Users" will realise that the data they are using has been screened by BlobToolKit, and will expect BlobToolKit stamps of credibility on data they access and exploit.

Related Projects

Project Reference   Relationship   Related To     Start        End          Award Value
BB/P024238/1        -              -              30/06/2017   29/06/2019   £354,128
BB/P024238/2        Transfer       BB/P024238/1   30/06/2019   29/06/2020   £119,462
Description We have identified suspected, but previously unexplored, issues with the public databases that house sequence data. Researchers have submitted genome sequences where the data that is claimed to be from one species is actually from more than one, either because the original specimen was infected by a parasite or pathogen, or where there was contamination during the DNA sequencing process. This has revealed for the first time the genomes of some exciting parasites of animals: a parasite related to malaria in a primate genome dataset, and a set of related parasites in a number of bird genome datasets. These discoveries serve to "clean up" the scientific record, and to open new avenues of research.
Exploitation Route The toolkit we have developed is already in use worldwide, and is helping to prevent contamination occurring in the future. Others are exploring additional measures of genome integrity that could be added to the core analysis modes in our toolkit.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Healthcare

Title GoaT (Genomes on a Tree)
Description GoaT uses Elasticsearch to return, for any taxon, an estimate of its genome size and karyotype. It aggregates data currently scattered across a disparate (and previously undiscoverable) series of publications and datasets. It also estimates values for parent nodes (genera, families, etc.) in the taxonomic tree, and for species for which no measurements are available. It has an open API.
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact GoaT is being used across the Darwin Tree of Life and Earth BioGenome Project to deliver estimates that support genome sequencing efforts.
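
The record above says that GoaT estimates values for parent nodes and for species without measurements, but it does not say how. The sketch below illustrates one plausible approach under stated assumptions: a parent's estimate is the median of the direct measurements beneath it, and an unmeasured species inherits the estimate of its nearest ancestor that has measured descendants. The taxon names, values and the `estimate` function are hypothetical and are not taken from GoaT itself.

```python
from statistics import median

# Hypothetical toy taxonomy (child -> parent); names are illustrative only.
PARENT = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_x",
    "genus_x": "family_y",
    "species_d": "genus_z",
    "genus_z": "family_y",
}

# Made-up direct genome size measurements, in megabases.
MEASURED = {"species_a": 100.0, "species_b": 140.0, "species_d": 300.0}


def children_of(node):
    """Return the direct children of a node in the toy taxonomy."""
    return sorted(child for child, parent in PARENT.items() if parent == node)


def descendant_values(node):
    """Collect direct measurements from the node or from anything below it."""
    if node in MEASURED:
        return [MEASURED[node]]
    values = []
    for child in children_of(node):
        values.extend(descendant_values(child))
    return values


def estimate(node):
    """Return (value, source), where source is direct, descendant, ancestor or none."""
    if node in MEASURED:
        return MEASURED[node], "direct"
    current = node
    while current is not None:
        values = descendant_values(current)
        if values:
            # Assumption: summarise measurements with the median.
            source = "descendant" if current == node else "ancestor"
            return median(values), source
        current = PARENT.get(current)  # walk up towards the root
    return None, "none"


if __name__ == "__main__":
    for taxon in ("species_a", "genus_x", "species_c", "family_y"):
        print(taxon, estimate(taxon))
```

Recording the source of each value alongside the value keeps direct measurements distinguishable from inferred ones, which matters when the estimates are used to plan sequencing.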
Description The latest iteration of the BTK pipeline, with improved visualisation, analytic and download functionality. Like BlobTools v1, BlobTools2 is a command-line tool designed to aid genome assembly QC and contaminant/cobiont detection and filtering. In addition to supporting interactive visualisation, a motivation for this reimplementation was to provide greater flexibility to include new types of information, such as BUSCO results and BLAST hit distributions. Through the blobtools filter command, BlobTools2 supports command-line filtering of datasets, assembly files and read files based on values or categories assigned to assembly contigs/scaffolds. Interactive filters and selections made using the BlobToolKit Viewer can be reproduced on the command line and used to generate new, filtered datasets that retain all fields from the original dataset. BlobTools2 is built around a file-based data structure: the data for each field are stored in a separate JSON file within a directory (BlobDir), which also contains a single meta.json file with metadata for each field and for the dataset as a whole. Additional fields can be added to an existing BlobDir using the blobtools add command, which parses an input to generate one or more additional JSON files and updates the dataset metadata. Fields are treated as generic datatypes: Variable (e.g. GC content, length and coverage) and Category (e.g. taxonomic assignment based on BLAST hits), alongside Array and MultiArray datatypes that store information such as start, end, NCBI taxid and bitscore for a set of BLAST hits to a single sequence. Support for new analyses can be added to BlobTools2 by creating a new Python module with an appropriate parse function.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact BTK is now used as the standard production viewer for Darwin Tree of Life and other major genome sequencing projects.
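
The BlobDir layout described above (one JSON file per field plus a single meta.json) can be illustrated with a small, self-contained sketch. Everything about the schema here is an assumption made for illustration: the `values` key, the `records` and `fields` entries in meta.json, and the helper names `write_toy_blobdir`, `add_field` and `filter_contigs` are hypothetical and are not the actual BlobTools2 file format or API.

```python
import json
import tempfile
from pathlib import Path


def write_toy_blobdir(root):
    """Create a tiny BlobDir-like directory using a simplified, assumed schema."""
    fields = {
        "identifiers": ("identifier", ["contig_1", "contig_2", "contig_3"]),
        "gc": ("variable", [0.41, 0.67, 0.39]),
        "length": ("variable", [52000, 8100, 120500]),
        "phylum": ("category", ["Nematoda", "Proteobacteria", "Nematoda"]),
    }
    meta = {
        "records": 3,
        "fields": [{"id": fid, "type": ftype} for fid, (ftype, _) in fields.items()],
    }
    (root / "meta.json").write_text(json.dumps(meta))
    for fid, (_, values) in fields.items():
        # One JSON file per field, values stored in contig order.
        (root / f"{fid}.json").write_text(json.dumps({"values": values}))


def load_field(root, field_id):
    """Read the values of a single field from its own JSON file."""
    return json.loads((root / f"{field_id}.json").read_text())["values"]


def add_field(root, field_id, field_type, values):
    """Add a new field file and register it in meta.json."""
    meta_path = root / "meta.json"
    meta = json.loads(meta_path.read_text())
    if len(values) != meta["records"]:
        raise ValueError("field must have one value per record")
    (root / f"{field_id}.json").write_text(json.dumps({"values": values}))
    meta["fields"].append({"id": field_id, "type": field_type})
    meta_path.write_text(json.dumps(meta))


def filter_contigs(root, min_length, keep_category):
    """Return identifiers passing a Variable filter and a Category filter."""
    ids = load_field(root, "identifiers")
    lengths = load_field(root, "length")
    phyla = load_field(root, "phylum")
    return [
        ident
        for ident, length, phylum in zip(ids, lengths, phyla)
        if length >= min_length and phylum == keep_category
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        blobdir = Path(tmp)
        write_toy_blobdir(blobdir)
        add_field(blobdir, "coverage", "variable", [30.2, 210.5, 28.9])
        print(filter_contigs(blobdir, 10000, "Nematoda"))
```

Keeping each field in its own file means that adding a new analysis only writes one new file and touches meta.json, leaving existing fields unchanged, which is consistent with the flexibility the description attributes to the design.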