Blobtoolkit: Identification and analysis of non-target data in all Eukaryotic genome projects

Lead Research Organisation: European Bioinformatics Institute
Department Name: OMICs

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Planned Impact

We and others have identified a critical issue with contamination and misattribution of genomic sequences in the public databases. To rectify this legacy problem and to reduce its impact on future data submissions, we propose a toolkit, BlobToolKit, that aids producers and users in identifying and correctly classifying such data.

How will BlobToolKit impact science and industry?
This work will have impact beyond the purely academic sphere of those generating genome sequences. In particular we envisage impacts in:
* Clinical science and delivery, as pathogens and other possibly harmful species will be correctly identified;
* Food production, where the improvement of microbial fermentation methods, such as in brewing and cheese manufacture, requires access to accurately attributed sequence data;
* Crop science, as data relevant to invasive and pathogenic species will be available for monitoring, control and eradication programmes;
* Livestock health, as data relevant to emerging threats to crop and livestock production from novel or imported pathogens will be available for monitoring and eradication programmes;
* Biofuel species development, where yield optimisation depends upon a clear mechanistic understanding of the genomics of the species to hand and its relatives, free from contaminant sequence;
* Drug discovery, where the process of initial lead definition will not be fatally misled by misattributed sequence;
* Bioprospecting, where correct linkage between sequences and the organisms they derive from will speed identification of useful bioactives;
* Biotechnology, where the engineering of synthetic pathways requires accurate characterisation of genomic material and its attribution to the correct sources.
We also recognise that SMEs are beginning to generate genome assemblies for target species, and BlobToolKit will aid these in generating high-quality data on which future investment can be based. The toolkit will be available under an appropriate open software license, permitting installation on local servers as well as on private cloud computing systems.

How will prospective users become informed about BlobToolKit?
By embedding BlobToolKit in standalone, cloud, and database-proximal versions, and by developing novel interactive visualisations, we will ensure that it has wide uptake and open availability. We will deliver BlobToolKit-enabled assessments of public data via a plugin to the ENA web data services. This will reach tens of thousands of data users per year. By flagging sequences with suspect annotation, we will improve sequence search results, and the interpretation of downloaded data, for many tens of thousands more.

Overall, the toolkit will serve to correct the scientific record at source, and provide an independent measure of data quality and reliability for future reuse.

Ultimately we hope that BlobToolKit will become part of the hidden but essential infrastructure that supports UK and global bioscience, whether academic or commercial. "Users" will realise that the data they are using has been screened by BlobToolKit, and will expect BlobToolKit stamps of credibility on data they access and exploit.

Publications

Burgin J (2023) The European Nucleotide Archive in 2022. in Nucleic acids research

Cummins C (2022) The European Nucleotide Archive in 2021. in Nucleic acids research

Harrison PW (2021) The European Nucleotide Archive in 2020. in Nucleic acids research

 
Description By March 2018, we had developed a concept and plan for the improvement and enrichment of services at the European Nucleotide Archive (ENA), through the integration of BlobToolKit tools in alignment with objectives 2 ("REPORT MODE") and 3 ("THE PUBLIC RECORD", "GATEKEEPING THE PUBLIC DATABASES"). Following the recruitment of a Bioinformatician (Edward Richards) to work on the project in April 2018, we have progressed in delivering on our plan in the current reporting year.

We have carried out an assessment of public data in ENA to prepare for processing by BlobToolKit. While BlobToolKit can provide some useful visualisations for assemblies without raw data, the combination of an assembly and the raw read data from which it was generated yields the most sensitive results. At the time of analysis, ENA held some 7,719 eukaryotic assemblies covering 343 species, of which 248 (72%) were associated with a single set of raw reads and were thus appropriate for the deepest BlobToolKit analysis. The remainder were associated either with no reads or with more than one set of reads, with no clarity as to which reads were actually used in the assembly, and were hence not optimal for BlobToolKit.

This observation led us to promote a standard tagging system (the "RUN_REF" attribute) to allow users in future to assert a precise relationship between reads used in an assembly and the assembly itself.
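The triage described above can be illustrated with a minimal sketch. It is not the code used in the project: the record layout, the identifiers and the `run_ref` field name are all hypothetical, chosen only to show how an explicit reads-to-assembly assertion (in the spirit of the "RUN_REF" attribute) would resolve an otherwise ambiguous case. The final line simply checks the 72% figure quoted in the text.

```python
from collections import Counter


def classify(read_sets, run_ref=None):
    """Categorise an assembly by the read sets linked to it.

    read_sets: list of read-set identifiers linked to the assembly.
    run_ref:   optional explicit assertion of the read set actually used
               (hypothetical field, loosely modelled on a RUN_REF tag).
    """
    if run_ref is not None and run_ref in read_sets:
        return "single"      # the explicit tag removes the ambiguity
    if not read_sets:
        return "none"        # assembly-only analysis is still possible
    if len(read_sets) == 1:
        return "single"      # suitable for the deepest analysis
    return "ambiguous"       # several read sets, unclear which was used


# Made-up records for illustration; these are not real ENA accessions.
records = {
    "asm_A": {"read_sets": ["run_1"]},
    "asm_B": {"read_sets": []},
    "asm_C": {"read_sets": ["run_2", "run_3"]},
    "asm_D": {"read_sets": ["run_4", "run_5"], "run_ref": "run_5"},
}

counts = Counter(
    classify(rec["read_sets"], rec.get("run_ref")) for rec in records.values()
)

# Proportion quoted in the text: 248 of 343 species, as a whole percentage.
share = round(100 * 248 / 343)
```

In this toy data set, `asm_D` would be ambiguous without its `run_ref` value, which is the situation the tagging proposal is meant to avoid.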

At EMBL-EBI, we have developed familiarity with BlobToolKit and tested it in a number of scenarios. We have fed back suggestions, including a concept for a summary statistic that could be used to draw users to the tool from ENA records.

Since March 2019, we have completed the deployment of BlobToolKit in ENA services. The system is fully automated: first, new assembly data arriving in ENA flow into an EBI Embassy cloud-based instance of BlobToolKit; second, the resultant BlobToolKit analysis output is despatched to a further Embassy cloud virtual machine that provides a "Notebook server", which renders the data into HTML with various BlobToolKit visualisations; finally, Notebooks are called dynamically and presented in the ENA Browser (https://www.ebi.ac.uk/ena/browser/home) directly alongside ENA assembly data as users search and browse these records. BlobToolKit visualisations are currently available for 3,400 ENA assemblies; see https://www.ebi.ac.uk/ena/browser/view/GCA_000697235.1 for an example.

March 2023 update: In March 2023 there were 6,716 ENA assemblies for which BlobToolKit analyses were available. (Search "XREF source" for "BlobToolKit" on https://www.ebi.ac.uk/ena/browser/xref to return the complete list with links.)
Exploitation Route We expect that there has been impact across genome-enabled biosciences. Enrichment of public data available to the scientific community from the European Nucleotide Archive under this project has led to more reliable and more informative assembly data sets. These will likely have impacted the academic community (e.g. in providing a more trustworthy starting point for mechanistic molecular biology investigations), the public sector (e.g. in the more accurate profiling of pathogen circulation) and the private sector (e.g. in the discovery and adaptation of biochemistries of biotechnological importance).
Sectors Agriculture, Food and Drink; Education; Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology

 
Title European Nucleotide Archive - embedded BlobToolKit report views 
Description Embedded views of BlobToolKit reports on genome assembly quality, available for many eukaryotic genomes, e.g. https://www.ebi.ac.uk/ena/browser/view/GCA_000150275.1?show=blobtoolkit.
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact Users of genome assembly data around the world have access to genome quality visualisations directly alongside primary INSDC records.
URL https://www.ebi.ac.uk/ena/browser/view/GCA_000150275.1?show=blobtoolkit
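As a small convenience, the link pattern shown in this record can be generated for other assembly accessions. The sketch below assumes that the `?show=blobtoolkit` pattern seen in the example URL applies to any assembly with a BlobToolKit report; that is an inference from the single example here, not a documented interface, and the function name is our own.

```python
from urllib.parse import quote, urlencode

# Base path taken from the example URL in this record.
ENA_VIEW = "https://www.ebi.ac.uk/ena/browser/view/"


def blobtoolkit_view_url(accession):
    """Build an ENA Browser link that opens the embedded BlobToolKit view.

    Assumes the 'show=blobtoolkit' query parameter seen in the example
    URL works for other assembly accessions as well.
    """
    query = urlencode({"show": "blobtoolkit"})
    return f"{ENA_VIEW}{quote(accession)}?{query}"


url = blobtoolkit_view_url("GCA_000150275.1")
```

Whether a report is actually displayed depends on BlobToolKit output being available for the assembly in question.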
 
Description Tree of Life Programme 
Organisation The Wellcome Trust Sanger Institute
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution This partnership concerns the linked BBSRC award to Mark Blaxter covering development of the BlobToolKit concept and software. Contributions to date from my team relate to system design and planning for how the technology can be integrated into the European Nucleotide Archive. Amended 3.2022: Mark Blaxter and team have moved from the University of Edinburgh to the WT Sanger Institute, with whom we continue the collaboration.
Collaborator Contribution The Blaxter lab is responsible for the provision of the original Blobtools software, and for its extension and adaptation to user workflows. Work so far includes the design processes around the core software architecture (re-definition of inputs and outputs and definition of the Application Programming Interface) and the prototyping of the "visualiser" component.
Impact BlobToolKit genome quality reports continue to be displayed for eukaryotic genome assemblies served from our European Nucleotide Archive services.
Start Year 2020
 
Description Tree of Life Programme 
Organisation University of Edinburgh
Country United Kingdom 
Sector Academic/University 
PI Contribution This partnership concerns the linked BBSRC award to Mark Blaxter covering development of the BlobToolKit concept and software. Contributions to date from my team relate to system design and planning for how the technology can be integrated into the European Nucleotide Archive. Amended 3.2022: Mark Blaxter and team have moved from the University of Edinburgh to the WT Sanger Institute, with whom we continue the collaboration.
Collaborator Contribution The Blaxter lab is responsible for the provision of the original Blobtools software, and for its extension and adaptation to user workflows. Work so far includes the design processes around the core software architecture (re-definition of inputs and outputs and definition of the Application Programming Interface) and the prototyping of the "visualiser" component.
Impact BlobToolKit genome quality reports continue to be displayed for eukaryotic genome assemblies served from our European Nucleotide Archive services.
Start Year 2020
 
Description Invited talk at JOBIM 2019, Nantes, July 2019, covering inter alia BlobToolKit engineering into ENA; Cochrane. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited plenary presentation on ENA, including BlobToolKit integration into the system, at JOBIM (Journées Ouvertes Biologie, Informatique et Mathématiques) 2019, the annual national French bioinformatics meeting; Guy Cochrane.
Year(s) Of Engagement Activity 2019
URL https://jobim2019.sciencesconf.org/
 
Description Invited talk at the First ELIXIR-Greece All-Hands meeting, Athens, September 2019, covering inter alia BlobToolKit engineering into ENA; Cochrane. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited plenary presentation on ENA, including BlobToolKit integration into the system, at the First ELIXIR-Greece All-Hands meeting, Athens, September 2019; Guy Cochrane.
Year(s) Of Engagement Activity 2019
URL https://www.elixir-greece.org/node/239