BBR GenomeHubs - agile genome databasing for neglected organisms of agricultural, development and biodiversity importance

Lead Research Organisation: University of Edinburgh

Department Name: Sch of Biological Sciences

Abstract

Building the first draft of the human genome cost around £2.5 billion. New sequencing technologies mean the cost of resequencing a human has reduced over a million-fold. This reduction in cost also transforms genomics approaches to many other biological questions. Genomics is now commonly applied to diverse goals from crop and livestock improvement, through pathogen and parasite surveillance, to biodiversity assessment. Many research communities are now able to generate reference genomes for their target species, compare genomes across suites of related species and sequence many individuals of the same species to investigate how variation between genome sequences affects biology.
With these benefits come the challenges of managing a deluge of data, of analysing the data to answer questions, and of making the data and results available to others. For raw sequence data, deposition in "databases of record" (internationally-supported systems that collect, collate and store for posterity) is standard. However, many discoveries are based on intensively analysed data: raw sequence is "assembled" to predict the whole genome sequence, genes are predicted in this genome sequence, and their functions are inferred by a range of annotation tools. Capturing these analyses in databases of record is strongly encouraged, but is technically difficult.
For a few species, researchers have developed dedicated genome exploration databases that collect and collate not only sequence but also annotation and functional data, and present it in a way that facilitates integration. These databases require considerable expertise and effort to set up, maintain and keep current with the latest scientific developments. Thus, for the majority of species, and especially species of interest to the developing world, dedicated databases do not exist and communities lack the resources to plug this gap.
During a previous BBR project, we developed an approach to genome databasing, named GenomeHubs, that removes the barriers to creating and maintaining a dedicated genomics resource for any species group. We do this by greatly simplifying the process of importing data into, and hosting an instance of, the most comprehensive genome database platform, Ensembl. Using the carefully-engineered Ensembl system, we have developed tools that standardise data from diverse sources, run automated analyses, import analysis results back into the database and visualise the genome and annotations through a web interface.
In this proposal we will develop GenomeHubs further to make it straightforward for researchers to run all the steps to assemble, annotate and run standard analyses on any genome or set of genomes and share these results with the wider community. We will add new analyses and visualisations and we will help users through collaboration and training in the setup and use of GenomeHubs.
This application is being made in tandem with one to the BBSRC BBR Global Challenges Research Funding call, which will work with Lower and Middle Income Country (LMIC) scientists to develop and exploit GenomeHubs for their needs. Genomics is being increasingly applied to problems of the developing world, in particular improvement of crop plants and local farm animals, understanding and combating infectious disease, and biodiversity conservation. This project will work very closely with the GCRF GenomeHubs outreach project, bringing the technology to LMIC researchers and supporting their use of GenomeHubs. We will link research communities, promote data sharing and enhance the pooling of resources and understanding to solve shared problems. We will develop collaborations with key scientists in LMICs who will act as Ambassadors for GenomeHubs, and collaborate closely with LMIC researchers to develop new code, new visualisations and new analytic tools for GenomeHubs to meet their requirements.

Technical Summary

Genome databasing is critical in ensuring that the costly results of genome-scale analyses are available to the research community. Richly-featured genome database solutions are also substrates for novel research activity - aggregating data across projects and species, and asking and answering important questions. Several independent solutions are available for genome databasing. These are tailored to fit their research communities - e.g. the human and model organism communities have rich database tools for dense data analysis. We propose to leverage investment in one of these - the Ensembl database and data visualisation system - to allow communities working on non-model, less well-resourced organisms to benefit from this high-quality toolset.
We have developed routines that make the establishment and population of an Ensembl database much easier than previously, developed new visualisations and engineered a simple-to-manage data sharing/enquiry system called GenomeHubs. We now propose to develop GenomeHubs in several ways, under the guidance and feedback of the several communities we collaborate with. Specifically, we will track changes in the underlying Ensembl codebase, ensuring that GenomeHubs remain current. We will develop routines for deposition of data, collated and normalised in GenomeHubs, into the public databases of record, through the ENA. We will develop new visualisation and data interrogation toolkits that use the underpinning Ensembl database structure to extract new composite data types, build new views on the data, and federate searches across database instances. We will provide Galaxy and virtual machine instances of the GenomeHubs pipelines so that users can access them easily. We will build and support our user communities through workshops, and encourage other developers to build plugins and other developments of the GenomeHubs code.

Planned Impact

The genomics revolution is transforming the impact of genetic approaches to understanding the functioning of organisms, and exploiting genomics information is an essential component of the transformation of basic knowledge into societal impact. However, fragmentation of data outputs from groundbreaking projects makes synthesis difficult, and makes transferring inferences drawn from one species to another problematic - every finding risks becoming an anecdote and not part of a coherent narrative. This existing problem and future risk define the areas in which we hope our GenomeHubs project will have impact.
We expect GenomeHubs to have a lasting impact on the practice of genomics studies. This impact is predicated on the ease with which research communities will be able to collaborate to build portals for their high-dimensional data, and thus make it accessible for reuse and reanalysis. In the absence of the GenomeHubs system it is very likely that the status quo - of raw data (mostly) making it to databases of record, but the majority of analysed data being lost to science, and unavailable to society - will continue to the detriment of society and science.

To assure the impact of our project we propose a program of outreach and education components that will make the community aware of the problem, elicit debate about the likely solutions, and deliver training in the practice of data integration and reuse.
For academic research teams, we will offer support in establishing and maintaining GenomeHubs tailored to their needs. We will build skills in gathering and interpreting community needs, and in delivering tailored solutions. We will build robust and agile systems such that researchers can establish new GenomeHub databases with the minimum of background knowledge in systems administration or bioinformatics. We will assist academics in using GenomeHubs by developing training materials and running week-long summer schools in genome assembly, annotation and databasing, with support for PhD students to attend free of charge.
For SMEs, we will offer a system they can be secure in installing and running to analyse their own data, mirror from other sites to explore public data, or collaborate with academic researchers to merge datasets and reap the benefits of shared analyses. We envision impacts in the areas of pest and pathogen control (understanding the comparative genomics and systems biology of loci involved in pathogenesis, and of loci that may be targets for control strategies), in crop species improvement (both plant and animal), in understanding the effects of interventions such as chemical treatment on organisms, and in developing monitoring tools for adverse and beneficial effects of interventions.
For state and NGO stakeholders, we will offer a validated source of comparative and integrated data on biodiversity and on organisms' responses to change and challenge, and a platform on which they can base policy and intervention decisions. We will also ensure the best outcome for public investment in genomics, by making it more likely that funded projects deposit data openly for others to reuse. The loss of scientific "capital", in the form of inaccessible knowledge derived from raw data, is significant, and GenomeHubs embody a strong statement that this capital is valuable.
Training delivered through summer schools and other methods will be accessible to researchers, students, and scientists in SME, NGO and governmental organisations. The training will show these users how to best exploit the GenomeHubs and the data they contain to promote their own agendas.

Funded Value:

£362,519

Funded Period:

Sep 18 - Jun 19

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/R015325/1

Principal Investigator:

Mark Blaxter

Research Subject:

Omic sciences & technologies (70%)

Tools, technologies & methods (28%)

Research Topic:

Bioinformatics (28%)

Functional genomics (28%)

Genomics (42%)

Organisations

University of Edinburgh (Lead Research Organisation)

People	ORCID iD
Mark Blaxter (Principal Investigator)

Publications

Author Name

Title Publication Date Published

Ashworth M (2023) The genome sequence of the thick-headed fly, Myopa tessellatipennis (Motschulsky, 1859) in Wellcome Open Research

Beltran T (2019) Comparative Epigenomics Reveals that RNA Polymerase II Pausing and Chromatin Domain Organization Control Nematode piRNA Biogenesis in Developmental Cell

Blaxter M (2023) The genome sequence of the crab hacker barnacle, Sacculina carcini (Thompson, 1836) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Elbow-stripe Grass-veneer, Agriphila geniculea (Haworth, 1811) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Yellow-line Quaker, Agrochola macilenta (Hübner, 1809) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Lunar Hornet, Sesia bembeciformis (Hübner, 1806) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Birch Marble, Apotomis betuletana (Haworth, 1811) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Buff Ermine, Spilarctia lutea (Hufnagel, 1766) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Mother Shipton moth, Euclidia mi (Clerck, 1759) in Wellcome Open Research

Boyes D (2023) The genome sequence of the Common Plume moth, Emmelina monodactyla (Linnaeus, 1758) in Wellcome Open Research

Related Projects

Project Reference	Relationship	Related To	Start	End	Award Value
BB/R015325/1			30/09/2018	29/06/2019	£362,520
BB/R015325/2	Transfer	BB/R015325/1	31/05/2020	30/08/2022	£311,965

Key Findings
Research Databases and Models
Research Tools and Methods
Software and Technical Products
Engagement Activities


Description	We are trying to "democratise" genomics, to make the process available to all wherever they work. The GenomeHubs mantra is to generate systems that are robust, easy to install, and of the highest technical quality. We have been refocussing our efforts to try to integrate a wide diversity of kinds of data, and most recently have developed "Genomes on a Tree", a service for the global biodiversity genomics community that assists in coordination and delivery of projects. The Earth BioGenome Project has proposed sequencing the genomes of all species on earth in the next decade, and Genomes on a Tree and GenomeHubs are ready to meet the many challenges this massive, distributed project will raise.
Exploitation Route	The GenomeHubs approach is open - our software is free for others to use and modify. We expect that others will build on our foundations.
Sectors	Agriculture, Food and Drink; Environment; Healthcare
URL	https://goat.genomehubs.org/preview


Title	Computing Infrastructure Upgrades 2018-19
Description	In 2018 we upgraded the Blaxter lab compute cluster to be a cloud-based system and added RAM and compute nodes to a total of 1024 nodes and just over 6 TB RAM. There is also a 0.33 PB disk farm.
Type Of Material	Technology assay or reagent
Year Produced	2018
Provided To Others?	Yes
Impact	The compute cluster is now used by seven research groups in the School of Biological Sciences, funded by NERC, BBSRC, ERC and Wellcome Trust


Title	Data from: Genomic architecture and introgression shape a butterfly radiation
Description	We probe the history of rapidly radiating Heliconius butterflies by means of 20 new genome assemblies and employ them to investigate the genomic architecture of gene flow among lineages. By developing a test to distinguish incomplete lineage sorting from introgression, we demonstrate that histories of loci that differ from the species tree arose mostly through introgression. Moreover, these loci are underrepresented in low recombination and gene-rich regions, consistent with the purging of introgressed alleles tightly linked with incompatibility loci. Additionally, our analysis identifies an inversion that captures a color pattern switch locus which was transferred between lineages via introgression and is convergent with a similar rearrangement in another part of the genus. This analysis of multiple de novo genome sequences enables an improved understanding of the importance of introgression and selective processes in adaptive radiation.
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes
URL	https://datadryad.org/stash/dataset/doi:10.5061/dryad.b7bj832


Title	MolluscDB
Description	MolluscDB is a PartiGene database covering the transcriptomes of a number of mollusc species.
Type Of Material	Database/Collection of data
Provided To Others?	No
Impact	MolluscDB has been used in several published works, including our own on Lymnaea stagnalis pond snails.
URL	http://www.nematodes.org/NeglectedGenomes/MOLLUSCA/index.html


Title	BlobToolKit 1.0
Description	BlobToolKit is a complete refactoring of blobtools with a design focussing on a client-server interface, new agile databasing, interactive viewing and download. It has been deployed on local and Embassy cloud platforms. It produces reports of genome assembly quality that are easily interpreted, and embeddable in other applications and services.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	BlobToolKit has been used to screen ~400 of the 2000 genomes deposited in ENA for contamination, and reports have been generated for incorporation in ENA presentations of these data.
URL	http://blobtoolkit.genomehubs.org/


Title	GoaT Genomes on a Tree
Description	GoaT uses Elasticsearch to return, for any taxon, an estimate of its genome size and karyotype. It serves to aggregate data currently available in a disparate (and previously undiscoverable) series of publications and datasets. It also estimates values for parent nodes (genera, families, etc.) in the taxonomic tree, and for species for which no measurements are available. It has an open API.
Type Of Technology	Webtool/Application
Year Produced	2020
Open Source License?	Yes
Impact	GoaT is being used across the Darwin Tree of Life and Earth BioGenome Project to deliver estimates to back up genome sequencing efforts.
URL	http://goat.genomehubs.org
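
The idea of estimating values for parent nodes and for unmeasured species can be illustrated with a small sketch. This is not GoaT's actual method or API: the taxon names, the genome sizes, the choice of the median as the summary statistic and the function names are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical toy taxonomy: child -> parent (names are illustrative only).
PARENT = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_x",
    "species_d": "genus_y",
    "genus_x": "family_z",
    "genus_y": "family_z",
}

# Hypothetical direct genome-size measurements in megabases (made-up values).
MEASURED = {"species_a": 400.0, "species_b": 500.0}


def children_of(parent_map):
    """Invert a child -> parent map into parent -> [children]."""
    kids = {}
    for child, parent in parent_map.items():
        kids.setdefault(parent, []).append(child)
    return kids


def values_below(node, kids, measured):
    """Collect every direct measurement found at or below a node."""
    values = [measured[node]] if node in measured else []
    for child in kids.get(node, []):
        values.extend(values_below(child, kids, measured))
    return values


def estimate(taxon, parent_map, measured):
    """Return (value, source) for a taxon.

    A direct measurement is returned as-is. Otherwise the median of the
    measurements under the taxon itself, or under its nearest ancestor
    that has any, is used (an assumed rule, chosen for simplicity).
    """
    if taxon in measured:
        return measured[taxon], "direct"
    kids = children_of(parent_map)
    node = taxon
    while node is not None:
        values = values_below(node, kids, measured)
        if values:
            source = "descendants" if node == taxon else f"ancestor:{node}"
            return median(values), source
        node = parent_map.get(node)
    return None, "none"
```

Calling `estimate("genus_x", PARENT, MEASURED)` summarises the measured species in that genus, while `estimate("species_d", PARENT, MEASURED)` climbs to the nearest ancestor with data. Returning the source alongside the value keeps direct measurements distinguishable from inferred ones.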


Description	2018 Edinburgh Genomics Training Workshops
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Edinburgh Genomics delivered a rich portfolio of training courses in 2018, ranging from a one-day Introduction to Linux (delivered three times) through to an intensive "spring school" in Bioinformatics for Genomics (one week). The courses were delivered in Edinburgh, and had from 12 to 50 participants. The participants came from across the UK HEI sector, including undergraduates, postgraduates and postdoctoral researchers, and the courses also attracted overseas attendees (mainly from other European countries). The training strand is funded from student fees and from BBSRC and NERC dedicated sources. The training strand employs a full-time Training Manager who both administers the scheme and develops and delivers courses.
Year(s) Of Engagement Activity	2018


Description	Bioinformatics Training Workshops, Buenos Aires and La Plata, Argentina
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	We presented two week-long workshops under the auspices of CONICET and the University of La Plata, on bioinformatics tools for next-generation genomics, including Blobtools, GenomeHubs and related topics.
Year(s) Of Engagement Activity	2018