BlobToolKit: Identification and analysis of non-target data in all Eukaryotic genome projects

Lead Research Organisation: Wellcome Sanger Institute
Department Name: Research Directorate

Abstract

Genomics has become one of the cornerstones of biology. Knowing an organism's genome sequence immediately allows us to work out what kinds of biology it is able to do, and acts as a platform upon which we can build experiments to test, for example, the dynamics of gene activity during stress or disease. If genomes are the cornerstones, genome databases are the libraries built from these data that allow science to collaborate and build upon its successes. Genome sequencing is getting easier, as technologies improve by leaps and bounds: new, high-throughput sequencers and advanced computing. The human genome cost $3 billion to sequence the first time round: now it would cost about $15,000. This reduction in cost has opened up genome sequencing to many research projects on new species, and there are now about 30,000 bacterial genomes and 3,000 eukaryotic genomes in public databases.

When genomes are contaminated, the genome databases, the reference libraries, are also contaminated, and the scientific process becomes muddied: errors can be made that affect many later steps in understanding the natural world, or exploiting it for bioscience. Obviously no scientist knowingly submits contaminated genome data to the central databases, but as genome sequencing projects become more common, more and more contaminated data are getting into the databases of record.

How does contamination happen? Organisms live in environments with other species, and it is often not possible or not advisable to separate these before making DNA to be sequenced. For example, most animals have bacteria in their guts, and getting rid of these before extracting DNA from a whole specimen of a tiny species is difficult. Similarly, plants naturally have communities of fungi and bacteria growing in and on their leaves and roots. In the case of symbiotic organisms, where the interaction is very intimate, the specimen is indivisible. The genomes of the different contributing species will be mixed up in the raw sequence data generated from such samples.

We propose to build a set of computational tools, BlobToolKit, that will identify contaminants. BlobToolKit will be useful both during the process of making new genomes for the first time (where it will separate out the different organisms in the mix of raw sequence data), and during reanalyses of existing genome assemblies.

BlobToolKit will be made freely available as a standalone program, as a service on the internet, and as a system that will be plugged into the big public databases to report on possible contamination. The project, a collaboration between the University of Edinburgh and the European Bioinformatics Institute, aims, within 3 years, to have identified all the problems in "legacy" genomes already submitted to public databases, and to have in place a system that prevents further contamination happening.

BlobToolKit reports will be provided as part of the submission process to those scientists reporting genome assemblies, ensuring the exposure of our technology to its users. We will further promote BlobToolKit by publication of our results in open access journals, presentations and workshops at relevant meetings, discussion with standards organisations, delivering training workshops to interested groups of scientists, and maintaining a rich resource of training and tutorial materials on the web. Our aim is to steer the scientific community towards a culture in which contamination in genome assemblies is understood and expected, and in which freely available, versatile software tools that help to flag and prevent contamination in the public record are widely known.

Technical Summary

Many next generation genome datasets derive from a mixture of taxa - either because the mixture is a biologically relevant unit (symbionts, organisms with associated metabiomes), or because the sample was, or became, contaminated. Separation of reads into bins corresponding to distinct organisms is essential for analysis, as mixed assemblies result in erroneous inferences - e.g. of species physiology, horizontal gene transfer, and holobiont biology. Unfortunately, public databases are already contaminated by wrongly taxonomically assigned sequences.

We propose to develop BlobToolKit, based on our successful Blobtools, to both clean the existing public databases and to ensure that future submissions are correctly annotated. BlobToolKit will use a range of algorithms to delineate distinct sequence and read bins in next generation data, and use these separate bins for independent analyses. BlobToolKit will include an interactive visualisation platform that will facilitate exploration of assembly data, and thus the generation of high-quality assemblies.
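The idea of delineating sequence bins can be illustrated with a minimal sketch. This is not the BlobToolKit implementation: it assumes each assembled contig already carries a mean read coverage and a best-hit taxon label (both hypothetical inputs here), computes GC content, groups contigs by label, and lists those whose label differs from the target taxon. All names in the block (Contig, gc_fraction, bin_contigs, non_target, the "no-hit" label) are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    coverage: float  # mean read depth (hypothetical input)
    taxon: str       # best-hit taxon label (hypothetical input); "no-hit" if none


def gc_fraction(sequence: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def bin_contigs(contigs):
    """Group contigs by taxon label, keeping (name, GC fraction, coverage)."""
    bins = defaultdict(list)
    for c in contigs:
        bins[c.taxon].append((c.name, gc_fraction(c.sequence), c.coverage))
    return dict(bins)


def non_target(contigs, target_taxon):
    """Names of contigs labelled with a taxon other than the target.

    Contigs without any hit are left out: they cannot be assigned either way.
    """
    return [c.name for c in contigs if c.taxon not in (target_taxon, "no-hit")]


# Toy data, invented for illustration only.
contigs = [
    Contig("ctg1", "ATGCATGCAT", 50.0, "Nematoda"),
    Contig("ctg2", "GGCCGGCCGC", 400.0, "Proteobacteria"),
    Contig("ctg3", "ATATATGCNN", 48.0, "no-hit"),
]
bins = bin_contigs(contigs)
suspects = non_target(contigs, "Nematoda")
```

In a real analysis the GC fraction and coverage of each contig would be plotted against each other and coloured by taxon, so that bins with distinct composition or depth stand out visually; the grouping above only prepares the values for such a view.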

BlobToolKit use modes will be delivered by distinct packaging of the core software:
* It will be used by researchers assembling de novo, as part of high-quality assembly pipelines - delivered through a command line version, accessible through an API.
* It will be used by authors, editors, reviewers and database curators as a quality check before submission, acceptance of manuscripts or accessioning - delivered through a cloud-based Galaxy instance.
* It will be used by databases to display interactive graphical reports on accessioned genomes and support data reuse - delivered through API integration with databases.

Core development will be carried out in Edinburgh, and integration with service delivery through the European Nucleotide Archive will be delivered from the EMBL-EBI, Hinxton. We will develop training and outreach materials to promote uptake of BlobToolKit in the research community.

Planned Impact

We and others have identified a critical issue with contamination in sequence attribution in genomic sequences in the public databases. To rectify this legacy problem and to reduce its impact on future data submissions we propose a toolkit, BlobToolKit, that aids producers and users in identifying and correctly classifying such data.

How will BlobToolKit impact science and industry?
This work will have impact beyond the purely academic sphere of those generating genome sequences. In particular we envisage impacts in:
* Clinical science and delivery, as pathogens and other possibly harmful species will be correctly identified;
* Food production, where the improvement of methods of fermentation by microbes such as in brewing and cheese manufacture, requires access to accurately attributed sequence data;
* Crop science, as data relevant to invasive and pathogenic species will be available for monitoring, control and eradication programmes;
* Livestock health, as data relevant to emerging threats to crop and livestock production from novel or imported pathogens will be available for monitoring and eradication programmes;
* Biofuel species development, where yield optimisation depends upon a clear mechanistic understanding of the genomics of the species to hand and its relatives, free from contaminant sequence;
* Drug discovery, where the process of initial lead definition will not be fatally misled by misattributed sequence;
* Bioprospecting, where correct linkage between sequences and the organisms they derive from will speed identification of useful bioactives;
* Biotechnology, where the engineering of synthetic pathways requires accurate identification and characterisation of genomic material, and its attribution to the correct sources.
We also recognise that SMEs are beginning to generate genome assemblies for target species, and BlobToolKit will aid these in generating high-quality data on which future investment can be based. The toolkit will be available under an appropriate open software license, permitting installation on local servers as well as on private cloud computing systems.

How will prospective users become informed about BlobToolKit?
By embedding BlobToolKit in standalone, cloud, and database-proximal versions, and by developing novel interactive visualisations, we will ensure that it has wide uptake and open availability. We will deliver BlobToolKit-enabled assessments of public data via a plugin to the ENA web data services. This will reach tens of thousands of data users per year. By flagging sequences that carry suspect annotation, we will improve the sequence search results, and interpretation of downloaded data, for many tens of thousands more.

Overall, the toolkit will serve to correct the scientific record at source, and provide an independent measure of data quality and reliability for future reuse.

Ultimately we hope that BlobToolKit will become part of the hidden but essential infrastructure that supports UK and global bioscience, whether academic or commercial. "Users" will realise that the data they are using has been screened by BlobToolKit, and will expect BlobToolKit stamps of credibility on data they access and exploit.

Publications

Caurcel C (2021) MolluscDB: a genome and transcriptome database for molluscs in Philosophical Transactions of the Royal Society B: Biological Sciences

 
Description We have identified suspected, but previously unexplored, issues with the public databases that house sequence data. Researchers have submitted genome sequences in which the data claimed to be from one species are actually from more than one, either because the original specimen was infected by a parasite or pathogen, or because there was contamination during the DNA sequencing process. Our analyses have revealed for the first time the genomes of some exciting parasites of animals: a parasite related to malaria in a primate genome dataset, and a set of related parasites in a number of bird genome datasets. These discoveries serve to "clean up" the scientific record, and to open new avenues of research.
Exploitation Route The toolkit we have developed is already in use worldwide, and is helping to prevent contamination occurring in the future. Others are exploring additional measures of genome integrity that could be added to the core analysis modes in our toolkit.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Healthcare

URL https://blobtoolkit.genomehubs.org
 
Title BlobToolKit Analysis Resource 
Description https://blobtoolkit.genomehubs.org/viewer This resource offers analysis of 10,000 of the 13,000 eukaryotic genome sequences available in public databases (INSDC) including blobplots, contamination screening, BUSCO analyses and much more. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact This toolkit is used globally to quality check and report on genome sequences, particularly within the Earth BioGenome Project network of networks. BTK analyses are summarised in, and linked to from, the genome assembly pages in the ENA database.
URL https://blobtoolkit.genomehubs.org/viewer
 
Title GoaT Genomes on a Tree 
Description GoaT uses Elasticsearch to return, for any taxon, an estimate of its genome size and karyotype. It serves to aggregate data currently available in a disparate (and previously undiscoverable) series of publications and datasets. It also estimates values for parent nodes (genera, families, etc.) in the taxonomic tree, and for species for which no measurements are available. It has an open API.
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact GoaT is being used across the Darwin Tree of Life and Earth BioGenome Project to deliver estimates that support genome sequencing efforts.
URL http://goat.genomehubs.org
 
Title blobtoolkit/blobtoolkit: 4.0.6 
Description Commits:
* d2be1d5: make failed mv return true in release action (Richard Challis) #136,#103
* 103a715: add option to skip running windowmasker for large assemblies (Richard Challis) #136,#103
* 4b7d196: update path to lib functions (Richard Challis) #136,#103
* 43f12f4: Bump version: 4.0.5 → 4.0.6 (Richard Challis) #136,#103
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact BlobToolKit continues to be developed and maintained. We are adding new features (for example an interactive plotting library for features along chromosomal scaffolds) and dealing with bug reports and feature requests. We have also worked to pipeline the toolkit into Snakemake, are developing Nextflow versions, and are keeping the Docker/Singularity images up to date. While we choose not to track downloads, the frequency of citation of the toolkit's V3 core paper (345 direct citations, 4500 views) and the version 2 paper (1277 downloads, >8000 views) attests to wide uptake and usage.
URL https://zenodo.org/record/7573430
 
Title https://github.com/blobtoolkit/blobtoolkit 
Description https://github.com/blobtoolkit/blobtoolkit is the latest iteration of the BTK pipeline with improved visualisation, analytic and download functionality. Similar to BlobTools v1, BlobTools2 is a command line tool designed to aid genome assembly QC and contaminant/cobiont detection and filtering. In addition to supporting interactive visualisation, a motivation for this reimplementation was to provide greater flexibility to include new types of information, such as BUSCO results and BLAST hit d 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact BTK is now used as the standard production viewer for Darwin Tree of Life and other major genome sequencing projects. 
URL https://github.com/blobtoolkit/blobtoolkit
 
Title https://github.com/genomehubs 
Description Genomes on a Tree (GoaT) is built using GenomeHubs 2.0 to present genome-relevant metadata for all eukaryotic taxa across the tree of life. Metadata in GoaT include genome assembly attributes, genome sizes, C values, and chromosome numbers from multiple sources. The main goals of the GoaT platform are to serve as a centralized source of genome-relevant metadata for the global community and to operate as the sequencing tracking system for the Earth BioGenome Project Network.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact GoaT is now the core data service behind progress tracking for the Earth BioGenome Project, the Darwin Tree of Life Project and many other large-scale biodiversity genomics initiatives.
URL https://goat.genomehubs.org/