BBR GenomeHubs - agile genome databasing for neglected organisms of agricultural, development and biodiversity importance

Lead Research Organisation: The Wellcome Trust Sanger Institute
Department Name: Research Directorate

Abstract

Building the first draft of the human genome cost around £2.5 billion. New sequencing technologies mean the cost of resequencing a human genome has fallen more than a million-fold. This reduction in cost also transforms genomics approaches to many other biological questions. Genomics is now commonly applied to diverse goals from crop and livestock improvement, through pathogen and parasite surveillance, to biodiversity assessment. Many research communities are now able to generate reference genomes for their target species, compare genomes across suites of related species and sequence many individuals of the same species to investigate how variation between genome sequences affects biology.
With these benefits come the challenges of managing a deluge of data, of analysing the data to answer questions, and of making the data and results available to others. For raw sequence data, deposition in "databases of record" (internationally-supported systems that collect, collate and store data for posterity) is standard. However, many discoveries are based on intensively analysed data: raw sequence is "assembled" to predict the whole genome sequence, genes are predicted in this genome sequence, and their functions are inferred by a range of annotation tools. Capturing these analyses in databases of record is strongly encouraged, but is technically difficult.
For a few species, researchers have developed dedicated genome exploration databases that collect and collate not only sequence but also annotation and functional data, and present it in a way that facilitates integration. These databases require considerable expertise and effort to set up, maintain and keep current with the latest scientific developments. Thus, for the majority of species, and especially species of interest to the developing world, dedicated databases do not exist and communities lack the resources to plug this gap.
During a previous BBR project, we developed an approach to genome databasing, named GenomeHubs, that removes the barriers to creating and maintaining a dedicated genomics resource for any species group. We do this by greatly simplifying the process of importing data into, and hosting an instance of, the most comprehensive genome database platform, Ensembl. Using the carefully-engineered Ensembl system, we have developed tools that standardise data from diverse sources, run automated analyses, import analysis results back into the database and visualise the genome and annotations through a web interface.
In this proposal we will develop GenomeHubs further to make it straightforward for researchers to run all the steps to assemble, annotate and run standard analyses on any genome or set of genomes and share these results with the wider community. We will add new analyses and visualisations and we will help users through collaboration and training in the setup and use of GenomeHubs.
This application is being made in tandem with one to the BBSRC BBR Global Challenges Research Funding call, which will work with Lower and Middle Income Country (LMIC) scientists to develop and exploit GenomeHubs for their needs. Genomics is being increasingly applied to problems of the developing world, in particular improvement of crop plants and local farm animals, understanding and combating infectious disease, and biodiversity conservation. This project will work very closely with the GCRF GenomeHubs outreach project, bringing the technology to LMIC researchers and supporting their use of GenomeHubs. We will link research communities, promote data sharing and enhance the pooling of resources and understanding to solve shared problems. We will develop collaborations with key scientists in LMICs who will act as Ambassadors for GenomeHubs, and collaborate closely with LMIC researchers to develop new code, new visualisations and new analytic tools for GenomeHubs to meet their requirements.

Technical Summary

Genome databasing is critical in ensuring that the costly results of genome-scale analyses are available to the research community. Richly-featured genome database solutions are also substrates for novel research activity: aggregating data across projects and species, and asking and answering important questions. Several independent solutions are available for genome databasing. These are tailored to fit their research communities, e.g. the human and model organism communities have rich database tools for dense data analysis. We propose to leverage investment in one of these, the Ensembl database and data visualisation system, to allow communities working on non-model, less well-resourced organisms to benefit from this high-quality toolset.
We have developed routines that make the establishment and population of an Ensembl database much easier than previously, developed new visualisations and engineered a simple-to-manage data sharing/enquiry system called GenomeHubs. We now propose to develop GenomeHubs in several ways, under the guidance and feedback of the several communities we collaborate with. Specifically, we will track changes in the underlying Ensembl codebase, ensuring that GenomeHubs remain current. We will develop routines for deposition of data, collated and normalised in GenomeHubs, into the public databases of record, through the ENA. We will develop new visualisation and data interrogation toolkits that use the underpinning Ensembl database structure to extract new composite data types, build new views on the data, and federate searches across database instances. We will provide Galaxy and virtual machine instances of the GenomeHubs pipelines so that users can access them easily. We will build and support our user communities through workshops, and encourage other developers to build plugins and other developments of the GenomeHubs code.

Planned Impact

The genomics revolution is transforming the impact of genetic approaches to understanding the functioning of organisms, and exploiting genomics information is an essential component of the transformation of basic knowledge into societal impact. However, fragmentation of data outputs from groundbreaking projects makes synthesis difficult, and makes transferring inferences drawn from one species to another problematic: every finding risks becoming an anecdote and not part of a coherent narrative. This existing problem and future risk define the areas in which we hope our GenomeHubs project will have impact.
We expect GenomeHubs to have a lasting impact on the practice of genomics studies. This impact is predicated on the ease with which research communities will be able to collaborate to build portals for their high-dimensional data, and thus make it accessible for reuse and reanalysis. In the absence of the GenomeHubs system it is very likely that the status quo, in which raw data (mostly) reach databases of record but the majority of analysed data are lost to science and unavailable to society, will continue to the detriment of society and science.

To assure the impact of our project we propose a program of outreach and education components that will make the community aware of the problem, elicit debate about the likely solutions, and deliver training in the practice of data integration and reuse.
For academic research teams, we will offer support in establishing and maintaining GenomeHubs tailored to their needs. We will build skills in gathering and interpreting community needs, and in delivering tailored solutions. We will build robust and agile systems such that researchers can establish new GenomeHub databases with the minimum of background knowledge in systems administration or bioinformatics. We will assist academics in using GenomeHubs by developing training materials and running week-long summer schools in genome assembly, annotation and databasing, with support for PhD students to attend free of charge.
For SMEs, we will offer a system they can be secure in installing and running to analyse their own data, mirror from other sites to explore public data, or collaborate with academic researchers to merge datasets and reap the benefits of shared analyses. We envision impacts in the areas of pest and pathogen control (understanding the comparative genomics and systems biology of loci involved in pathogenesis, and of loci that may be targets for control strategies), in crop species improvement (both plant and animal), in understanding the effects of interventions such as chemical treatment on organisms, and in developing monitoring tools for adverse and beneficial effects of interventions.
For state and NGO stakeholders, we will offer a validated source of comparative and integrated data on biodiversity and on organisms' responses to change and challenge, and a platform on which they can base policy and intervention decisions. We will also ensure the best outcome for public investment in genomics, by making it more likely that funded projects deposit data openly for others to reuse. The loss of scientific "capital", in the form of inaccessible knowledge derived from raw data, is significant, and GenomeHubs embody a strong statement that this capital is valuable.
Training delivered through summer schools and other methods will be accessible to researchers, students, and scientists in SME, NGO and governmental organisations. The training will show these users how to best exploit the GenomeHubs and the data they contain to promote their own agendas.

Publications

Caurcel C (2021) MolluscDB: a genome and transcriptome database for molluscs. in Philosophical transactions of the Royal Society of London. Series B, Biological sciences

Stevens L (2020) The Genome of Caenorhabditis bovis. in Current biology : CB

Yarra T (2021) A Bivalve Biomineralization Toolbox. in Molecular biology and evolution

Related Projects

Project Reference   Relationship   Related To     Start        End          Award Value
BB/R015325/1                                      30/09/2018   29/06/2019   £362,520
BB/R015325/2        Transfer       BB/R015325/1   31/05/2020   30/08/2022   £311,965
 
Description We are trying to "democratise" genomics, to make the process available to all wherever they work. The GenomeHubs mantra is to generate systems that are robust, easy to install, and of the highest technical quality. We have been refocussing our efforts to try to integrate a wide diversity of kinds of data, and most recently have developed "Genomes on a Tree", a service for the global biodiversity genomics community that assists in coordination and delivery of projects. The Earth BioGenome Project has proposed sequencing the genomes of all species on earth in the next decade, and Genomes on a Tree and GenomeHubs are ready to meet the many challenges this massive, distributed project will raise.
Exploitation Route The GenomeHubs approach is open - our software is free for others to use and modify. We expect that others will build on our foundations.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Environment, Healthcare

URL https://goat.genomehubs.org
 
Title GoaT Genomes on a Tree 
Description GoaT uses Elasticsearch to return, for any taxon, an estimate of its genome size and karyotype. It serves to aggregate data currently available in a disparate (and previously undiscoverable) series of publications and datasets. It also estimates values for parent nodes (genera, families, etc.) in the taxonomic tree, and for species for which no measurements are available. It has an open API.
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact GoaT is being used across the Darwin Tree of Life and Earth BioGenome Project to deliver estimates to back up genome sequencing efforts.
URL http://goat.genomehubs.org
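
The description above says that GoaT estimates values for parent nodes in the taxonomic tree and for species that lack measurements, but it does not say how. The following is a minimal sketch of one way such an estimate could work, not GoaT's actual method: it assumes the estimate is the median of all direct measurements found beneath a node, and that an unmeasured species borrows the estimate of its nearest ancestor that has one. The taxon names, the genome sizes and the function names are all hypothetical.

```python
from statistics import median

# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_x",
    "genus_x": "family_y",
    "species_d": "genus_z",
    "genus_z": "family_y",
}

# Hypothetical direct measurements of genome size, in megabases.
MEASURED = {"species_a": 100.0, "species_b": 120.0, "species_d": 300.0}


def children_of(parent_map):
    """Invert the child -> parent map into parent -> list of children."""
    kids = {}
    for child, parent in parent_map.items():
        kids.setdefault(parent, []).append(child)
    return kids


def descendant_values(node, kids, measured):
    """Collect direct measurements from a node and everything beneath it."""
    values = [measured[node]] if node in measured else []
    for child in kids.get(node, []):
        values.extend(descendant_values(child, kids, measured))
    return values


def estimate(node, parent_map, measured):
    """Return (value, source) for a taxon.

    source is "direct" for a measured taxon, "descendant" when the value is
    the median of measurements beneath the taxon, and "ancestor" when it is
    borrowed from the nearest ancestor that has measurements beneath it.
    """
    if node in measured:
        return measured[node], "direct"
    kids = children_of(parent_map)
    current = node
    while current is not None:
        values = descendant_values(current, kids, measured)
        if values:
            source = "descendant" if current == node else "ancestor"
            return median(values), source
        current = parent_map.get(current)  # walk up; None above the root
    return None, "none"


if __name__ == "__main__":
    for taxon in ("species_a", "genus_x", "species_c", "family_y"):
        print(taxon, estimate(taxon, PARENT, MEASURED))
```

Returning the source alongside the value keeps direct measurements distinguishable from inferred ones, which matters when the estimates are used to plan sequencing work.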
 
Title https://github.com/blobtoolkit/blobtoolkit 
Description https://github.com/blobtoolkit/blobtoolkit is the latest iteration of the BTK pipeline with improved visualisation, analytic and download functionality. Similar to BlobTools v1, BlobTools2 is a command line tool designed to aid genome assembly QC and contaminant/cobiont detection and filtering. In addition to supporting interactive visualisation, a motivation for this reimplementation was to provide greater flexibility to include new types of information, such as BUSCO results and BLAST hit distributions. BlobTools2 supports command-line filtering of datasets, assembly files and read files based on values or categories assigned to assembly contigs/scaffolds through the blobtools filter command. Interactive filters and selections made using the BlobToolKit Viewer can be reproduced on the command line and used to generate new, filtered datasets which retain all fields from the original dataset. BlobTools2 is built around a file-based data structure, with data for each field contained in a separate JSON file within a directory (BlobDir) containing a single meta.json file with metadata for each field and the dataset as a whole. Additional fields can be added to an existing BlobDir using the blobtools add command, which parses an input to generate one or more additional JSON files and updates the dataset metadata. Fields are treated as generic datatypes: Variable (e.g. GC content, length and coverage) and Category (e.g. taxonomic assignment based on BLAST hits), alongside Array and MultiArray datatypes that store information such as start, end, NCBI taxid and bitscore for a set of BLAST hits to a single sequence. Support for new analyses can be added to BlobTools2 by creating a new Python module with an appropriate parse function.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact BTK is now used as the standard production viewer for Darwin Tree of Life and other major genome sequencing projects. 
URL https://github.com/blobtoolkit/blobtoolkit
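
The file-based structure described above (one JSON file per field, plus a single meta.json for the dataset) can be illustrated with a small sketch. It does not use the BlobTools2 code or reproduce its real file schema: the key names ("values", "fields", "records", "id", "type"), the function names and the example data are assumptions made for illustration. It shows how a field can be added to an existing directory and how a filter on one Variable field can produce a new dataset that retains every field.

```python
import json
import tempfile
from pathlib import Path


def write_blobdir(path, fields):
    """Write one JSON file per field plus a meta.json (assumed layout)."""
    path.mkdir(parents=True, exist_ok=True)
    records = len(next(iter(fields.values()))["values"])
    meta = {"records": records, "fields": []}
    for field_id, field in fields.items():
        (path / f"{field_id}.json").write_text(json.dumps({"values": field["values"]}))
        meta["fields"].append({"id": field_id, "type": field["type"]})
    (path / "meta.json").write_text(json.dumps(meta))


def read_field(path, field_id):
    """Load the per-record values stored for a single field."""
    return json.loads((path / f"{field_id}.json").read_text())["values"]


def add_field(path, field_id, field_type, values):
    """Add a new field file and register it in the dataset metadata."""
    meta_path = path / "meta.json"
    meta = json.loads(meta_path.read_text())
    if len(values) != meta["records"]:
        raise ValueError("a field needs exactly one value per record")
    (path / f"{field_id}.json").write_text(json.dumps({"values": values}))
    meta["fields"].append({"id": field_id, "type": field_type})
    meta_path.write_text(json.dumps(meta))


def filter_blobdir(src, dest, field_id, minimum):
    """Keep records whose Variable field is >= minimum, retaining all fields."""
    meta = json.loads((src / "meta.json").read_text())
    keep = [i for i, v in enumerate(read_field(src, field_id)) if v >= minimum]
    fields = {}
    for field in meta["fields"]:
        values = read_field(src, field["id"])
        fields[field["id"]] = {"type": field["type"], "values": [values[i] for i in keep]}
    write_blobdir(dest, fields)
    return len(keep)


def demo():
    """Build a three-contig dataset, add a Category field, filter by length."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dest = Path(tmp) / "assembly", Path(tmp) / "filtered"
        write_blobdir(src, {
            "length": {"type": "variable", "values": [500, 12000, 3000]},
            "gc": {"type": "variable", "values": [0.31, 0.42, 0.55]},
        })
        add_field(src, "phylum", "category", ["Nematoda", "Nematoda", "Proteobacteria"])
        kept = filter_blobdir(src, dest, "length", 1000)
        files = sorted(p.name for p in dest.iterdir())
        return kept, read_field(dest, "gc"), read_field(dest, "phylum"), files


if __name__ == "__main__":
    print(demo())
```

Keeping each field in its own file means a new analysis only has to write one file and append one metadata entry, which is consistent with the flexibility the description gives as a motivation for the reimplementation.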
 
Title https://github.com/genomehubs 
Description Genomes on a Tree (GoaT) is built using GenomeHubs 2.0 to present genome-relevant metadata for all Eukaryotic taxa across the tree of life. Metadata in GoaT include genome assembly attributes, genome sizes, C values, and chromosome numbers from multiple sources. The main goals of the GoaT platform are to serve as a centralized source of genome-relevant metadata for the global community and to operate as the sequencing tracking system for the Earth BioGenome Project Network.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact GoaT is now the core data service behind progress tracking for the Earth BioGenome Project, Darwin Tree of Life Project and many other large-scale biodiversity genomics initiatives.
URL https://goat.genomehubs.org/