BBR GenomeHubs - agile genome databasing for neglected organisms of agricultural, development and biodiversity importance

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

Building the first draft of the human genome cost around £2.5 billion. New sequencing technologies mean the cost of resequencing a human has reduced over a million-fold. This reduction in cost also transforms genomics approaches to many other biological questions. Genomics is now commonly applied to diverse goals from crop and livestock improvement, through pathogen and parasite surveillance, to biodiversity assessment. Many research communities are now able to generate reference genomes for their target species, compare genomes across suites of related species and sequence many individuals of the same species to investigate how variation between genome sequences affects biology.
With these benefits come the challenges of managing a deluge of data, of analysing the data to answer questions, and of making the data and results available to others. For raw sequence data, deposition in "databases of record" (internationally-supported systems that collect, collate and store data for posterity) is standard. However, many discoveries are based on intensively analysed data - raw sequence is "assembled" to predict the whole genome sequence, genes are predicted in this genome sequence, and their functions are inferred by a range of annotation tools. Capturing these analyses in databases of record is strongly encouraged, but is technically difficult.
For a few species, researchers have developed dedicated genome exploration databases that collect and collate not only sequence but also annotation and functional data, and present it in a way that facilitates integration. These databases require considerable expertise and effort to set up, maintain and keep current with the latest scientific developments. Thus, for the majority of species, and especially species of interest to the developing world, dedicated databases do not exist and communities lack the resources to plug this gap.
During a previous BBR project, we developed an approach to genome databasing, named GenomeHubs, that removes the barriers to creating and maintaining a dedicated genomics resource for any species group. We do this by greatly simplifying the process of importing data into, and hosting an instance of, the most comprehensive genome database platform, Ensembl. Using the carefully-engineered Ensembl system, we have developed tools that standardise data from diverse sources, run automated analyses, import analysis results back into the database and visualise the genome and annotations through a web interface.
In this proposal we will develop GenomeHubs further to make it straightforward for researchers to run all the steps to assemble, annotate and run standard analyses on any genome or set of genomes and share these results with the wider community. We will add new analyses and visualisations and we will help users through collaboration and training in the setup and use of GenomeHubs.
This application is being made in tandem with one to the BBSRC BBR Global Challenges Research Funding call, which will work with Lower and Middle Income Country (LMIC) scientists to develop and exploit GenomeHubs for their needs. Genomics is being increasingly applied to problems of the developing world, in particular improvement of crop plants and local farm animals, understanding and combating infectious disease, and biodiversity conservation. This project will work very closely with the GCRF GenomeHubs outreach project, bringing the technology to LMIC researchers and supporting their use of GenomeHubs. We will link research communities, promote data sharing and enhance the pooling of resources and understanding to solve shared problems. We will develop collaborations with key scientists in LMICs who will act as Ambassadors for GenomeHubs, and collaborate closely with LMIC researchers to develop new code, new visualisations and new analytic tools for GenomeHubs to meet their requirements.

Technical Summary

Genome databasing is critical in ensuring that the costly results of genome-scale analyses are available to the research community. Richly-featured genome database solutions are also substrates for novel research activity - aggregating data across projects and species, and asking and answering important questions. Several independent solutions are available for genome databasing. These are tailored to fit their research communities - e.g. the human and model organism communities have rich database tools for dense data analysis. We propose to leverage investment in one of these - the Ensembl database and data visualisation system - to allow communities working on non-model, less well-resourced organisms to benefit from this high-quality toolset.
We have developed routines that make the establishment and population of an Ensembl database much easier than previously, developed new visualisations and engineered a simple-to-manage data sharing/enquiry system called GenomeHubs. We now propose to develop GenomeHubs in several ways, under the guidance and feedback of the several communities we collaborate with. Specifically, we will track changes in the underlying Ensembl codebase, ensuring that GenomeHubs remain current. We will develop routines for deposition of data, collated and normalised in GenomeHubs, into the public databases of record, through the ENA. We will develop new visualisation and data interrogation toolkits that use the underpinning Ensembl database structure to extract new composite data types, build new views on the data, and federate searches across database instances. We will provide Galaxy and virtual machine instances of the GenomeHubs pipelines so that users can access them easily. We will build and support our user communities through workshops, and encourage other developers to build plugins and other developments of the GenomeHubs code.

Planned Impact

The genomics revolution is transforming the impact of genetic approaches to understanding the functioning of organisms, and exploiting genomics information is an essential component of the transformation of basic knowledge into societal impact. However, fragmentation of data outputs from groundbreaking projects makes synthesis difficult, and transferring inferences drawn from one species to another problematic - every finding risks becoming an anecdote and not part of a coherent narrative. This existing problem and future risk define the areas in which we hope our GenomeHubs project will have impact.
We expect GenomeHubs to have a lasting impact on the practice of genomics studies. This impact is predicated on the ease with which research communities will be able to collaborate to build portals for their high-dimensional data, and thus make it accessible for reuse and reanalysis. In the absence of the GenomeHubs system it is very likely that the current status quo - of raw data (mostly) making it to databases of record, but the majority of analysed data being lost to science, and unavailable to society - will continue to the detriment of society and science.

To assure the impact of our project we propose a program of outreach and education components that will make the community aware of the problem, elicit debate about the likely solutions, and deliver training in the practice of data integration and reuse.
For academic research teams, we will offer support in establishing and maintaining GenomeHubs tailored to their needs. We will build skills in gathering and interpreting community needs, and in delivering tailored solutions. We will build robust and agile systems such that researchers can establish new GenomeHub databases with the minimum of background knowledge in systems administration or bioinformatics. We will assist academics in using GenomeHubs by developing training materials and running week-long summer schools in genome assembly, annotation and databasing, with support for PhD students to attend free of charge.
For SMEs, we will offer a system they can be secure in installing and running to analyse their own data, mirror from other sites to explore public data, or collaborate with academic researchers to merge datasets and reap the benefits of shared analyses. We envision impacts in the areas of pest and pathogen control (understanding the comparative genomics and systems biology of loci involved in pathogenesis, and of loci that may be targets for control strategies), in crop species improvement (both plant and animal), in understanding the effects of interventions such as chemical treatment on organisms, and in developing monitoring tools for adverse and beneficial effects of interventions.
For state and NGO stakeholders, we will offer a validated source of comparative and integrated data on biodiversity, on organisms' responses to change and challenge, and a platform on which they can base policy and intervention decisions. We will also ensure best outcome for public investment in genomics, by making it more likely that funded projects deposit data openly for others to reuse. The loss of scientific "capital", in the form of inaccessible knowledge derived from raw data, is significant, and GenomeHubs embody a strong statement that this capital is valuable.
Training delivered through summer schools and other methods will be accessible to researchers, students, and scientists in SME, NGO and governmental organisations. The training will show these users how to best exploit the GenomeHubs and the data they contain to promote their own agendas.

Publications

 
Description We are trying to "democratise" genomics, to make the process available to all wherever they work. The GenomeHubs mantra is to generate systems that are robust, easy to install, and of the highest technical quality. We have been refocussing our efforts to try to integrate a wide diversity of kinds of data, and most recently have developed "Genomes on a Tree", a service for the global biodiversity genomics community that assists in coordination and delivery of projects. The Earth BioGenome Project has proposed sequencing the genomes of all species on earth in the next decade, and Genomes on a Tree and GenomeHubs are ready to meet the many challenges this massive, distributed project will raise.
Exploitation Route The GenomeHubs approach is open - our software is free for others to use and modify. We expect that others will build on our foundations.
Sectors Agriculture, Food and Drink,Environment,Healthcare

URL https://goat.genomehubs.org/preview
 
Title Computing Infrastructure Upgrades 2018-19 
Description In 2018 we upgraded the Blaxter lab compute cluster to be a cloud-based system and added RAM and compute nodes to a total of 1024 nodes and just over 6 TB of RAM. There is also a 0.33 PB disk farm.
Type Of Material Technology assay or reagent 
Year Produced 2018 
Provided To Others? Yes  
Impact The compute cluster is now used by seven research groups in the School of Biological Sciences, funded by NERC, BBSRC, ERC and Wellcome Trust 
 
Title MolluscDB 
Description MolluscDB is a PartiGene database covering the transcriptomes of a number of mollusc species. 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact MolluscDB has been used in several published works, including our own on Lymnaea stagnalis pond snails.
URL http://www.nematodes.org/NeglectedGenomes/MOLLUSCA/index.html
 
Title BlobToolKit 1.0 
Description BlobToolKit is a complete refactoring of blobtools with a design focussing on a client-server interface, new agile databasing, interactive viewing and download. It has been deployed on local and Embassy cloud platforms. It produces reports of genome assembly quality that are easily interpreted, and embeddable in other applications and services.
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact BlobToolKit has been used to screen ~400 of the 2000 genomes deposited in ENA for contamination, and reports generated for incorporation in ENA presentations of these data. 
URL http://blobtoolkit.genomehubs.org/
 
Title GoaT Genomes on a Tree 
Description GoaT uses Elasticsearch to return, for any taxon, an estimate of its genome size and karyotype. It serves to aggregate data currently available in a disparate (and previously undiscoverable) series of publications and datasets. It also estimates values for parent nodes (genera, families, etc.) in the taxonomic tree, and for species for which no measurements are available. It has an open API.
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact GoaT is being used across the Darwin Tree of Life and Earth BioGenome Project to deliver estimates to back up genome sequencing efforts.
URL http://goat.genomehubs.org
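
The idea described above, estimating a value for a parent node from its measured descendants and for an unmeasured species from its nearest informative ancestor, can be illustrated with a minimal sketch. This is not GoaT's implementation: the toy taxonomy, the genome sizes, the function names and the use of the median as the summary statistic are all assumptions made for illustration only.

```python
from statistics import median

# Hypothetical toy taxonomy: child -> parent (names are illustrative only).
PARENT = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_x",
    "species_d": "genus_y",
    "genus_x": "family_z",
    "genus_y": "family_z",
}

# Hypothetical direct measurements of genome size, in megabases.
MEASURED = {"species_a": 100.0, "species_b": 120.0, "species_d": 300.0}


def descendants_measured(taxon, parent, measured):
    """Collect direct measurements for every measured taxon at or below `taxon`."""
    values = []
    for name, value in measured.items():
        node = name
        # Walk from the measured taxon up to the root, looking for `taxon`.
        while node is not None:
            if node == taxon:
                values.append(value)
                break
            node = parent.get(node)
    return values


def estimate(taxon, parent, measured):
    """Return (value, source) for a taxon.

    A direct measurement wins. Otherwise use the median of measured
    descendants (an assumed summary statistic), and failing that, fall
    back to the nearest ancestor that has any measured descendants.
    """
    if taxon in measured:
        return measured[taxon], "direct"
    node = taxon
    while node is not None:
        values = descendants_measured(node, parent, measured)
        if values:
            source = "descendants" if node == taxon else f"ancestor:{node}"
            return median(values), source
        node = parent.get(node)
    return None, "none"
```

In this sketch each estimate carries a label recording where it came from, so that a direct measurement can be told apart from a value inferred from relatives.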
 
Description 2018 Edinburgh Genomics Training Workshops
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Edinburgh Genomics delivered a rich portfolio of training courses in 2018, ranging from a one-day Introduction to Linux (delivered three times) through to an intensive "spring school" in Bioinformatics for Genomics (one week). The courses were delivered in Edinburgh, and had from 12 to 50 participants. The participants came from across the UK HEI sector, including undergraduates, postgraduates and postdoctoral researchers, and the courses also attracted overseas attendees (mainly from other European countries). The training strand is funded from student fees and from BBSRC and NERC dedicated sources. The training strand employs a full-time Training Manager who both administers the scheme and develops and delivers courses.
Year(s) Of Engagement Activity 2018
 
Description Bioinformatics Training Workshops, Buenos Aires and La Plata, Argentina
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We presented two week-long workshops under the auspices of CONICET and the University of La Plata, on bioinformatics tools for next generation genomics, including Blobtools, GenomeHubs and related topics.
Year(s) Of Engagement Activity 2018