BlobToolKit: Identification and analysis of non-target data in all Eukaryotic genome projects

Lead Research Organisation: University of Edinburgh

Department Name: Sch of Biological Sciences

Abstract

Genomics has become one of the cornerstones of biology. Knowing an organism's genome sequence immediately allows us to work out what kinds of biology it is able to do, and acts as a platform upon which we can build experiments to test, for example, the dynamics of gene activity during stress or disease. If genomes are the cornerstones, genome databases are the libraries built from these data that allow science to collaborate and build upon its successes. Genome sequencing is getting easier, as technologies improve by leaps and bounds: new, high throughput sequencers and advanced computing. The human genome cost $3 billion to sequence the first time round: now it would cost about $15,000. This reduction in cost has opened up genome sequencing to many research projects on new species, and there are now about 30,000 bacterial genomes and 3,000 eukaryotic genomes in public databases.

When genomes are contaminated, the genome databases, the reference libraries, are also contaminated, and the scientific process becomes muddied: errors can be made that affect many later steps in understanding the natural world, or exploiting it for bioscience. Obviously no scientist knowingly submits contaminated genome data to the central databases, but as genome sequencing projects become more common, more and more contaminated data are getting into the databases of record.

How does contamination happen? Organisms live in environments with other species, and it is often not possible or not advisable to separate these before making DNA to be sequenced. For example, most animals have bacteria in their guts, and getting rid of these before extracting DNA from a whole specimen of a tiny species is difficult. Similarly, plants naturally have communities of fungi and bacteria growing in and on their leaves and roots. In the case of symbiotic organisms, where the interaction is very intimate, the specimen is indivisible. The genomes of the different contributing species will be mixed up in the raw sequence data generated from such samples.

We propose to build a set of computational tools, BlobToolKit, that will identify contaminants. BlobToolKit will be useful both when new genomes are assembled for the first time (where it will separate out the different organisms in the mix of raw sequence data) and during reanalyses of existing genome assemblies.

BlobToolKit will be made freely available as a standalone program, as a service on the internet, and as a system that will be plugged into the big public databases to report on possible contamination. The project, a collaboration between the University of Edinburgh and the European Bioinformatics Institute, aims, within 3 years, to have identified all the problems in "legacy" genomes already submitted to public databases, and to have in place a system that prevents further contamination happening.

BlobToolKit reports will be provided as part of the submission process to those scientists reporting genome assemblies, ensuring the exposure of our technology to its users. We will further promote BlobToolKit by publication of our results in open access journals, presentations and workshops at relevant meetings, discussion with standards organisations, delivering training workshops to interested groups of scientists, and maintaining a rich resource of training and tutorial materials on the web. Our aim is to steer the scientific community to a culture in which contamination in genome assembly is understood and expected, and freely available and versatile software tools are known that can assist in the flagging and prevention of contamination in the public record.

Technical Summary

Many next generation genome datasets derive from a mixture of taxa - either because the mixture is a biologically relevant unit (symbionts, organisms with associated metabiomes), or because the sample was, or became, contaminated. Separation of reads into bins corresponding to distinct organisms is essential for analysis, as mixed assemblies result in erroneous inferences - e.g. of species physiology, horizontal gene transfer, and holobiont biology. Unfortunately, public databases are already contaminated by wrongly taxonomically assigned sequences.

We propose to develop BlobToolKit, based on our successful Blobtools, to both clean the existing public databases and to ensure that future submissions are correctly annotated. BlobToolKit will use a range of algorithms to delineate distinct sequence and read bins in next generation data, and use these separate bins for independent analyses. BlobToolKit will include an interactive visualisation platform that will facilitate exploration of assembly data, and thus the generation of high-quality assemblies.
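
As a rough illustration of why separate bins matter, the sketch below totals assembly span per taxonomic label and reports the share that falls outside the target taxon. It is not part of BlobToolKit: the function names, the `(length, taxon)` record layout and the example labels are all assumptions made for this example.

```python
from collections import defaultdict


def span_by_taxon(records):
    """Sum contig lengths per taxon label.

    `records` is an iterable of (length_in_bp, taxon_label) pairs;
    this layout is a hypothetical stand-in for real assembly metadata.
    """
    totals = defaultdict(int)
    for length, taxon in records:
        totals[taxon] += length
    return dict(totals)


def non_target_fraction(records, target_taxon):
    """Fraction of total assembly span not assigned to `target_taxon`."""
    totals = span_by_taxon(records)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return (total - totals.get(target_taxon, 0)) / total


# Made-up example: a 10 kb "assembly" with two non-target contigs.
example = [
    (9000, "Arthropoda"),
    (500, "Proteobacteria"),
    (500, "no-hit"),
]
print(span_by_taxon(example))
print(non_target_fraction(example, "Arthropoda"))
```

A summary like this only flags that a mixture exists; deciding whether the non-target span is a contaminant or a biologically relevant cobiont still needs inspection.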

BlobToolKit use modes will be delivered by distinct packaging of the core software:
* It will be used by researchers assembling de novo, as part of high-quality assembly pipelines - delivered through a command line version, accessible through an API.
* It will be used by authors, editors, reviewers and database curators as a quality check before submission, acceptance of manuscripts or accessioning - delivered through a cloud-based Galaxy instance.
* It will be used by databases to display interactive graphical reports on accessioned genomes and support data reuse - delivered through API integration with databases.

Core development will be carried out in Edinburgh, and integration with service delivery through the European Nucleotide Archive will be delivered from the EMBL-EBI, Hinxton. We will develop training and outreach materials to promote uptake of BlobToolKit in the research community.

Planned Impact

We and others have identified a critical issue with contamination and misattribution of genomic sequences in the public databases. To rectify this legacy problem and to reduce its impact on future data submissions we propose a toolkit, BlobToolKit, that aids producers and users in identifying and correctly classifying such data.

How will BlobToolKit impact science and industry?
This work will have impact beyond the purely academic sphere of those generating genome sequences. In particular we envisage impacts in:
* Clinical science and delivery, as pathogens and other possibly harmful species will be correctly identified;
* Food production, where improving methods of microbial fermentation, such as those used in brewing and cheese manufacture, requires access to accurately attributed sequence data;
* Crop science, as data relevant to invasive and pathogenic species will be available for monitoring, control and eradication programmes;
* Livestock health, as data relevant to emerging threats to crop and livestock production from novel or imported pathogens will be available for monitoring and eradication programmes;
* Biofuel species development, where yield optimisation depends upon a clear mechanistic understanding of the genomics of the species to hand and its relatives, free from contaminant sequence;
* Drug discovery, where the process of initial lead definition will not be fatally misled by misattributed sequence;
* Bioprospecting, where correct linkage between sequences and the organisms they derive from will speed identification of useful bioactives;
* Biotechnology, where the engineering of synthetic pathways requires accurate identification and characterisation of genomic material and its attribution to the correct source.
We also recognise that SMEs are beginning to generate genome assemblies for target species, and BlobToolKit will aid these in generating high-quality data on which future investment can be based. The toolkit will be available under an appropriate open software license, permitting installation on local servers as well as on private cloud computing systems.

How will prospective users become informed about BlobToolKit?
By embedding BlobToolKit in standalone, cloud, and database-proximal versions, and by developing novel interactive visualisations, we will ensure that it has wide uptake and open availability. We will deliver BlobToolKit-enabled assessments of public data via a plugin to the ENA web data services. This will reach tens of thousands of data users per year. By flagging sequences whose annotation is suspect, we will improve sequence search results, and the interpretation of downloaded data, for many tens of thousands more.

Overall, the toolkit will serve to correct the scientific record at source, and provide an independent measure of data quality and reliability for future reuse.

Ultimately we hope that BlobToolKit will become part of the hidden but essential infrastructure that supports UK and global bioscience, whether academic or commercial. "Users" will realise that the data they are using has been screened by BlobToolKit, and will expect BlobToolKit stamps of credibility on data they access and exploit.

Funded Value:

£354,127

Funded Period:

Jul 17 - Jun 19

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/P024238/1

Principal Investigator:

Mark Blaxter

Research Subject:

Tools, technologies & methods (98%)

Research Topic:

Bioinformatics (56%)

Environmental Informatics (42%)

Organisations

University of Edinburgh (Lead Research Organisation)

People	ORCID iD
Mark Blaxter (Principal Investigator)

Publications

Adkins P (2022) The genome sequence of the grey top shell, Steromphala cineraria (Linnaeus, 1758) in Wellcome Open Research

Ashworth M (2023) The genome sequence of the thick-headed fly, Myopa tessellatipennis (Motschulsky, 1859) in Wellcome Open Research

Aunin E (2021) The complete genome sequence of Eimeria tenella (Tyzzer 1929), a common gut parasite of chickens in Wellcome Open Research

Beltran T (2019) Comparative Epigenomics Reveals that RNA Polymerase II Pausing and Chromatin Domain Organization Control Nematode piRNA Biogenesis in Developmental Cell

Bishop G (2021) The genome sequence of the small tortoiseshell butterfly, Aglais urticae (Linnaeus, 1758) in Wellcome Open Research

Blaxter M (2023) The genome sequence of the crab hacker barnacle, Sacculina carcini (Thompson, 1836) in Wellcome Open Research

Boyes D (2021) The genome sequence of the poplar hawk-moth, Laothoe populi (Linnaeus, 1758) in Wellcome Open Research

Boyes D (2021) The genome sequence of the swallow prominent, Pheosia tremula (Clerck, 1759) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Yellow-line Quaker, Agrochola macilenta (Hübner, 1809) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Shuttle-shaped Dart, Agrotis puta (Hübner, 1803) in Wellcome Open Research

Key Findings
Research Databases and Models
Research Tools and Methods
Software and Technical Products
Engagement Activities


Description	We have developed new ways of analysing complex mixtures of genomes that result from assembly of metagenomic samples. These include interactive viewers and data models for holding complex annotation information concerning likely taxonomic attribution of contigs within an assembly. The toolkit is now mature and published and BlobToolKit views of published assemblies are now available for over half of the genomes represented in INSDC databases. We have identified suspected, but previously unexplored, issues with the public databases that house sequence data. Researchers have submitted genome sequences where the data that is claimed to be from one species is actually from more than one, either because the original specimen was infected by a parasite or pathogen, or where there was contamination during the DNA sequencing process. This has revealed for the first time the genomes of some exciting parasites of animals: a parasite related to malaria in a primate genome dataset, and a set of related parasites in a number of bird genome datasets. These discoveries serve to "clean up" the scientific record, and to open new avenues of research.
Exploitation Route	BlobToolKit is in use across the world and is helping to prevent contamination occurring in the future. Views and screenshots from BTK analyses appear in online presentations and in publications, often without explicit attribution (i.e. the tool has become an invisible part of the genome science "process", which was just our aim). The toolkit is published open access and thus the programs are available to all for reuse. Others are exploring additional measures of genome integrity that could be added to the core analysis modes in our toolkit.
Sectors	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment,Healthcare
URL	http://blobtoolkit.genomehubs.org/


Title	Computing Infrastructure Upgrades 2018-19
Description	In 2018 we upgraded the Blaxter lab compute cluster to be a cloud-based system and added RAM and compute nodes, to a total of 1024 nodes and just over 6 TB of RAM. There is also a 0.33 PB disk farm.
Type Of Material	Technology assay or reagent
Year Produced	2018
Provided To Others?	Yes
Impact	The compute cluster is now used by seven research groups in the School of Biological Sciences, funded by NERC, BBSRC, ERC and Wellcome Trust


Title	BlobToolKit Analysis Resource
Description	https://blobtoolkit.genomehubs.org/viewer This resource offers analysis of 10,000 of the 13,000 eukaryotic genome sequences available in public databases (INSDC) including blobplots, contamination screening, BUSCO analyses and much more.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	This toolkit is used globally to quality check and report on genome sequences, particularly within the Earth BioGenome Project network of networks. BTK analyses are summarised in and linked-to from the genome assembly pages in the ENA database.
URL	https://blobtoolkit.genomehubs.org/viewer


Title	BlobToolKit 1.0
Description	BlobToolKit is a complete refactoring of blobtools with a design focussing on a client-server interface, new agile databasing, interactive viewing and download. It has been deployed on local and Embassy cloud platforms. It produces reports of genome assembly quality that are easily interpreted, and embeddable in other applications and services.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	BlobToolKit has been used to screen ~400 of the 2000 genomes deposited in ENA for contamination, and reports generated for incorporation in ENA presentations of these data.
URL	http://blobtoolkit.genomehubs.org/


Title	Blobtools
Description	The Python program analyses genome assemblies to generate data that can be used to filter contaminants and other complex mixtures. It produces both tabular and graphical output. The goal of many genome sequencing projects is to provide a complete representation of a target genome (or genomes) as underpinning data for further analyses. However, it can be problematic to identify which sequences in an assembly truly derive from the target genome(s) and which are derived from associated microbiome or contaminant organisms. We present BlobTools, a modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets. Using guanine+cytosine content of sequences, read coverage in sequencing libraries and taxonomy of sequence similarity matches, BlobTools can assist in primary partitioning of data, leading to improved assemblies, and screening of final assemblies for potential contaminants. Through simulated paired-end read datasets, containing a mixture of metazoan and bacterial taxa, we illustrate the main BlobTools workflow and suggest useful parameters for taxonomic partitioning of low-complexity metagenome assemblies.
Type Of Technology	Software
Year Produced	2017
Open Source License?	Yes
Impact	Blobtools is in use worldwide.
URL	https://f1000research.com/articles/6-1287/v1
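
The Blobtools description above names three signals used for partitioning: GC content, read coverage and the taxonomy of similarity matches. The sketch below shows, under stated assumptions, how those three values might be tabulated per contig and used for a first-pass split. It does not use the BlobTools code or its file formats; the `Contig` class, the function names and the toy sequences, coverages and taxon labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    coverage: float  # mean read depth, assumed to be computed elsewhere
    taxon: str       # best similarity-match label, assumed to be computed elsewhere


def gc_fraction(sequence: str) -> float:
    """GC fraction over unambiguous bases (N and other codes are ignored)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def partition(contigs, target_taxon):
    """Split contigs into (target, other) lists by taxon label alone."""
    target, other = [], []
    for contig in contigs:
        (target if contig.taxon == target_taxon else other).append(contig)
    return target, other


# Toy data: two low-GC, moderate-coverage contigs and one high-GC,
# high-coverage contig with a bacterial label.
contigs = [
    Contig("ctg1", "ATATATGCAT", 40.0, "Nematoda"),
    Contig("ctg2", "GCGCGCGGCC", 250.0, "Proteobacteria"),
    Contig("ctg3", "ATTTAGCNNA", 38.0, "Nematoda"),
]

# One row per contig: the three values a blob-style plot would display.
blob_table = [
    (c.name, gc_fraction(c.sequence), c.coverage, c.taxon) for c in contigs
]
target, other = partition(contigs, "Nematoda")

for row in blob_table:
    print(row)
print("non-target contigs:", [c.name for c in other])
```

Splitting on the taxon label alone is a deliberate simplification. In practice GC content and coverage help place contigs with no similarity match, which is why all three values are kept together in the table.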


Title	Blobtools v1.1
Description	Blobtools is a standalone toolkit that allows users to screen genome and other assemblies for potential contaminants. Version 1.1 of blobtools is an upgrade release that supports Python 3.7 and has additional enhancements in terms of speed and outputs.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	Blobtools is gaining wide traction in genome assembly communities.
URL	https://github.com/DRL/blobtools


Title	Drl/Blobtools: Blobtools V1.0
Description	major release
Type Of Technology	Software
Year Produced	2017


Title	blobtoolkit/blobtoolkit: 4.0.6
Description	Commits: d2be1d5: make failed mv return true in release action (Richard Challis) #136, #103; 103a715: add option to skip running windowmasker for large assemblies (Richard Challis) #136, #103; 4b7d196: update path to lib functions (Richard Challis) #136, #103; 43f12f4: Bump version: 4.0.5 → 4.0.6 (Richard Challis) #136, #103
Type Of Technology	Software
Year Produced	2023
Open Source License?	Yes
Impact	BlobToolKit is continuing to be developed and maintained. We are adding new features (for example an interactive plotting library for features along chromosomal scaffolds) and dealing with bug and feature requests. We have also worked to pipeline the toolkit in Snakemake, are developing Nextflow versions, and are keeping the Docker/Singularity images up to date. While we choose not to track downloads, the frequency of citation of the toolkit's V3 core paper (345 direct citations, 4500 views) and the version 2 paper (1277 downloads, >8000 views) attests to wide uptake and usage.
URL	https://zenodo.org/record/7573430


Title	https://github.com/blobtoolkit/blobtoolkit
Description	https://github.com/blobtoolkit/blobtoolkit is the latest iteration of the BTK pipeline with improved visualisation, analytic and download functionality. Similar to BlobTools v1, BlobTools2 is a command line tool designed to aid genome assembly QC and contaminant/cobiont detection and filtering. In addition to supporting interactive visualisation, a motivation for this reimplementation was to provide greater flexibility to include new types of information, such as BUSCO results and BLAST hit d
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	BTK is now used as the standard production viewer for Darwin Tree of Life and other major genome sequencing projects.
URL	https://github.com/blobtoolkit/blobtoolkit


Description	2018 Edinburgh Genomics Training Workshops
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Edinburgh Genomics delivered a rich portfolio of training courses in 2018, ranging from a one-day Introduction to Linux (delivered three times) through to an intensive one-week "spring school" in Bioinformatics for Genomics. The courses were delivered in Edinburgh, and had from 12 to 50 participants. The participants came from across the UK HEI sector, including undergraduates, postgraduates and postdoctoral researchers, and the courses also attracted overseas attendees (mainly from other European countries). The training strand is funded from student fees and from BBSRC and NERC dedicated sources. The training strand employs a full-time Training Manager who both administers the scheme and develops and delivers courses.
Year(s) Of Engagement Activity	2018


Description	Bioinformatics Training Workshops, Buenos Aires and La Plata, Argentina
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	We presented two week-long workshops under the auspices of CONICET and the University of La Plata, on bioinformatics tools for next generation genomics, including Blobtools, GenomeHubs and related topics.
Year(s) Of Engagement Activity	2018