Building a genome analytic resource for the lepidopteran community

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

Genome sequencing and large-scale population genomic analysis have suddenly become affordable. The explosion of data presents tremendous opportunity for ground-breaking research based on integration of data from independently-organized, community-driven genome projects, but this in turn requires shared database resources. For model organisms, genome databasing efforts grew with the research communities, and there are mature portals for deep investigation across many large-scale datasets - the databases themselves have become a substrate for (meta-) research of high impact. For communities new to genomic (and population genomic) approaches, the need for accessible databases is even more pressing, as researchers are less likely to be fluent in the peculiar languages of genomics and in high-throughput bioinformatics. Here we propose the founding of a community database for lepidopteran genomics, LepBase, to meet the needs of the growing community of researchers using genomics to understand Lepidoptera as crop pests, as potentially invasive species, as developmental models, and as key taxa for understanding the interplay between ecology, genomics, evolution and speciation. While initially focussed on the available lepidopteran genomes, the project will meet the challenge of future genomic riches (over 20 genomes 'in the pipeline') by building a platform that focuses on the needs of the lepidopteran research community.
The challenge of integrating newly developed genomic resources across taxa is not a new one, and several computational frameworks exist to support such endeavors (such as the ENSEMBL project, and the GMOD ecosystem of tools). Central, aggregative database efforts, such as ENSEMBL Genomes, provide an effective and powerful, one-size-fits-all solution to genome warehousing. Coordinating with smaller research communities to directly implement clade-specific resources is overwhelming the resources of institutions that have a mandate to generate integrated genomic databases. ENSEMBL now advocates a multi-tiered approach to the aggregation, integration, and dissemination of the rapidly increasing wealth of genomic information arising from community-driven genome projects so that species-level genomic resources can flow 'upstream' into the pan-genome database.
The goals of our project are: to develop a community-wide, comparative database for the Lepidoptera using the ENSEMBL platform; to institute effective tools for ongoing community annotation of emerging genomes; to forge close links with ENSEMBL Genomes to ensure upload of lepidopteran genomes into the global resource; to implement new modes of data visualisation and analysis in the ENSEMBL framework to meet community needs; and to provide training in genomics to the community of lepidopteran researchers. The LepBase database will also be a working model of community-driven databases that drive not only clade-specific research programmes but also enable the flow of knowledge from species-specific genome projects into a comprehensive framework.
The project will be based in the Blaxter bioinformatics and genomics group in Edinburgh, in association with the GenePool Genomics Facility (currently engaged in sequencing butterfly and moth species), with project partners in the Jiggins Heliconius research group in Cambridge and Dasmahapatra in York, and the support of lepidopteran researchers worldwide. Initial focus will be on the genus Heliconius, for which a complete genome sequence and abundant annotation, transcriptome and resequencing data already exist. The database will be rapidly extended into silkmoth, Bicyclus, Danaus and other species. The resource will be overseen by a Scientific Advisory Board drawn from across the range of lepidopteran researchers, and will aim for financial sustainability beyond the tenure of the award through development of a 'subscription' model of funding from research partners.

Technical Summary

Top-tier databases such as ENSEMBL Genomes do not have the resource, domain-specific expertise and reach to nurture high-quality databasing of emerging genomes. It is thus proposed that focused Tier 2 databases are established that act as community aggregative databases, delivering focused support to their user groups, and also feed quality controlled data up to the central aggregative databases such as ENSEMBL Genomes. Here we propose the establishment of a Tier 2 database for Lepidoptera, LepBase, that will capitalize on the leading position of UK research groups (largely funded by BBSRC) in the rapidly expanding field of lepidopteran genomics.

We will develop a range of tools and resources that will benefit wider research communities using Lepidoptera as model species or where better-organised lepidopteran genomic data can make a difference. Code and pipelines developed during the project are likely to be of much wider utility, and LepBase will serve as a model of Tier 2 aggregative genome databases.

We will first install and test the ENSEMBL code base, and develop 'standard' ENSEMBL instances for Heliconius melpomene and other published lepidopteran genomes. We will use both the community supplied annotations and standardized optimal annotation pipelines within ENSEMBL to deliver richly annotated genome portals.

We will use these first genomes as exemplars to write, test and deploy code for ENSEMBL for several novel modalities of data, including population genetic measures, geospatial analyses and clade-specific orthology and synteny.

We will install and deploy a community annotation portal (CAP) that will allow experts in the communities to comment on, vary and add annotation.

We will expand into additional genomes as they become available to us, and promote the resource to the lepidopteran community and interested external stakeholders (public and industrial) through meetings, visits, training workshops and web-based media.

Planned Impact

This proposal aims to deliver a common internet-access portal onto the many lepidopteran genomes being generated, and to develop new tools to interrogate these genomes in the wider context of the whole order.
The main beneficiaries will be lepidopteran genomics researchers, who will have a unified portal in which to contextualise and analyse their own data. We will engage this community by direct communication, encouraging groups to submit data to the project, and to assist us in getting their genome sequences represented. This will be achieved by attendance at community meetings (the Kansas Arthropod Genomics Workshop) and through blogs and twitter feeds from the project. We will maintain a project blog, describing the architecture of the site, the decisions made in development, successes and problems and prospects. The corporate twitter feed will be used to communicate database updates and improvements, and to pass on important news from the world of lepidopteran genomics.

A second group of beneficiaries will be lepidopteran biologists in general. A wide range of specialisms use Lepidoptera as target organisms, from neurobiology, through evolutionary genetics, to behavioural ecology. We will engage this community by making the portal easy to use for non-genomics specialists, providing data summaries of utility to research teams focused on one or a few genes and pathways. Again, our blog and twitter feed will be used to keep this community informed, and we will make sure our team has representation at the key meetings and workshops where lepidopteran research is presented (for example the annual Evolution meetings).

A third key stakeholder group are the companies and research teams who are developing new tools to combat lepidopteran pests. Our database will be useful in defining possible drug and biocontrol targets, and in revealing the diversity and conservation of these targets across the order. Similarly, the biotechnology industry has keen interest in biomaterials from Lepidoptera - such as silks and new semiochemicals. The development of pathway-oriented annotations and the breadth of species collected in LepBase will permit more rational selection of lead enzymes and products. We will keep these organisations and individuals informed of developments through blog and twitter feeds, and presentations at relevant meetings.

Arthropod genomics is burgeoning, and the i5k initiative is coordinating a hoped-for 5000 genomes (in the first instance). The wider arthropod genomics community will benefit directly from use of LepBase, and also from our pushing LepBase genomes and updates into ENSEMBL Genomes. We will ensure that the i5k site and the wider arthropod genomics community is kept informed through the blog, twitter feed and direct emailing to interest groups.

As a model Tier 2 database, LepBase will be of interest to those developing similar systems for their taxa of interest. We will open our code development and ideas to colleagues running similar initiatives worldwide, and make sure we keep up with their work. Our code will, we hope, be integrated into the core ENSEMBL codebase, but meanwhile (and in addition) we will make it available on GitHub.

We will strive to publish the database in the annual NAR Databases issue, highlighting updates and enhancements. Other publications, in open access journals, will also communicate to our key audiences.

The general public has strong interest in butterflies and moths as charismatic species. We will maintain general interest pages on the web presence of LepBase describing our work, and make available for download factsheets describing each species and the core biology the genome is revealing. These will be made available to butterfly farms, natural history museums and other interested parties. The database will be publicised at open days and science fairs in the three home institutions as available.

Publications

 
Title Transmissions exhibition 
Description The Blaxter lab collaborated closely with artists-in-residence (see http://www.ascus.org.uk/ciie-micro-residency-artists-announced/) in the Centre for Immunity, Infection and Evolution to inspire and be part of the final exhibition "Transmissions". Mark Blaxter appears in the film work produced by Anne Milne, and the work of the lab inspired Jo Hodges and Robbie Coleman to produce a piece dedicated to the lab.
Type Of Art Artwork 
Year Produced 2014 
Impact 'Transmissions' was showcased to the general public within a group exhibition 'Parallel Perspectives' in Summerhall as part of the Edinburgh International Science Festival 2015 art programme, How The Light Gets In. This exhibition of work subsequently travelled to LifeSpace, Dundee, returning to Edinburgh to showcase at the Tent Gallery, Edinburgh College of Art.
URL http://www.ascus.org.uk/ciie-micro-residency-2/
 
Description We have built a new model for sharing genomic data in the community of scientists that work on butterflies and moths. This development allows these scientists to place their work in the rich context of all other work on these important species, and is being used across the world. This method of sharing and collaborating is helping to build resource in labs, identifying new scientific avenues, helping with the identification of new targets for combatting pests and making sure data are discoverable and accessible.
Exploitation Route We are now recovering "ownership" of LepBase, as a proposed transfer to the Smithsonian Institution in the US has not been possible due to federal computing restrictions.
Sectors Agriculture, Food and Drink; Environment

URL https://lepbase.org
 
Description BBSRC Project Grant (Genome Databasing)
Amount £671,655 (GBP)
Funding ID BB/K020161/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 11/2013 
End 11/2016
 
Title Computing Infrastructure Upgrades 2018-19 
Description In 2018 we upgraded the Blaxter lab compute cluster to be a cloud-based system and added RAM and compute nodes to a total of 1024 nodes and just over 6 TB RAM. There is also a 0.33 PB disk farm.
Type Of Material Technology assay or reagent 
Year Produced 2018 
Provided To Others? Yes  
Impact The compute cluster is now used by seven research groups in the School of Biological Sciences, funded by NERC, BBSRC, ERC and Wellcome Trust 
 
Title LepBase - lepidoptera genome database 
Description LepBase is a tier two ENSEMBL database for the genomes of all Lepidoptera.
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact N/A 
URL http://lepbase.org/
 
Title MolluscDB 
Description MolluscDB is a PartiGene database covering the transcriptomes of a number of mollusc species. 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact MolluscDB has been used in several published works, including our own on Lymnaea stagnalis pond snails.
URL http://www.nematodes.org/NeglectedGenomes/MOLLUSCA/index.html
 
Title ensembl.tardigrades.org 
Description A database for tardigrade genome analysis 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact The database is being used in investigations of tardigrade extremophile biology.
URL http://ensembl.tardigrades.org
 
Title nematod.es 
Description nematod.es is a collection housing nematode genome data in a set of BADGER genome exploration environments. 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact The database has been accessed thousands of times, largely for browsing and data download. The existence of the database has allowed us to build many fruitful collaborations. 
URL http://www.nematod.es
 
Title ngenomes 
Description ngenomes is a database for the display of genome assemblies from the Blaxter lab. It uses the GenomeHubs/LepBase ENSEMBL code.
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact Colleagues globally have used the database for identification of target loci, exploration of relationships and download of data. 
URL http://ensembl.ngenomes.org
 
Description Heliconius genome consortium 
Organisation University of Cambridge
Department Department of Zoology
Country United Kingdom 
Sector Academic/University 
PI Contribution Offering genomic sequencing and resequencing technologies; close involvement in experimental design.
Collaborator Contribution Joining the consortium has allowed GenePool to develop custom targeted resequencing technologies and associated bioinformatics skills, and has made us much more visible across this area of science.
Impact The Heliconius melpomene genome has been assembled and annotated. A manuscript describing this work is submitted for publication. On the basis of the GenePool's involvement in this consortium we have coordinated a ladybird genome consortium with 6 partner laboratories, and currently have 2 genomes in sequencing.
Start Year 2010
 
Title Blobtools v1.1 
Description Blobtools is a standalone toolkit that allows users to screen genome and other assemblies for potential contaminants. Version 1.1 of Blobtools is an upgrade release that supports Python 3.7 and has additional enhancements in terms of speed and outputs.
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Blobtools is gaining wide traction in genome assembly communities. 
URL https://github.com/DRL/blobtools
 
Title EasyMirror/EasyImport 
Description These tools simplify the roll-out of customised ENSEMBL databases.
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact We have been inundated with requests for support and assistance in delivering these tools, and with praise for their ease of use and timeliness. 
URL http://www.genomehubs.org
 
Title TAGC-plots and Blobsplorer tools for genomics 
Description Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratisation of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programmes. Here we present an approach to extracting from mixed DNA sequence data subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimised assembly. We also present a tool, blobsplorer, that aids exploration and selection of subsets from GC/coverage/taxonomy annotated datasets. Partitioning the data in this way "rescues" poorly assembled genomes, and reveals unexpected symbionts and commensals in eukaryotic genome projects. The TAGC-plot pipeline script is available from http://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/blobsplorer. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact Blobsplorer/TAGC plots are now in wide use in genomics. The toolkit has been featured in several courses and publications. 
URL http://github.com/blaxterlab/blobology
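
The numerical part of the partitioning described above (GC proportion and read coverage) can be illustrated with a minimal sketch. This is not the code of the blobology pipeline or of Blobsplorer: the Contig class, the function names and the GC and coverage thresholds are all hypothetical, chosen only for illustration, and the biological indicator (best-matching sequence in annotated databases) is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    coverage: float  # mean read depth, assumed to have been computed elsewhere


def gc_fraction(sequence: str) -> float:
    """Proportion of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def assign_bin(contig: Contig, gc_range: tuple, min_coverage: float) -> str:
    """Label a contig 'target' when GC and coverage both fall in the chosen window."""
    low, high = gc_range
    gc = gc_fraction(contig.sequence)
    if low <= gc <= high and contig.coverage >= min_coverage:
        return "target"
    return "other"


def partition(contigs, gc_range=(0.30, 0.45), min_coverage=20.0):
    """Split contig names into two bins using hypothetical, user-chosen thresholds."""
    bins = {"target": [], "other": []}
    for contig in contigs:
        bins[assign_bin(contig, gc_range, min_coverage)].append(contig.name)
    return bins


if __name__ == "__main__":
    example = [
        Contig("c1", "ATGCATGCAT", 50.0),  # moderate GC, high coverage
        Contig("c2", "GCGCGCGCAT", 50.0),  # high GC, possibly a different organism
        Contig("c3", "ATGCATGCAT", 5.0),   # moderate GC, low coverage
    ]
    print(partition(example))
```

In practice the window would be chosen by inspecting a GC/coverage plot of the draft assembly rather than fixed in advance, and the reads contributing to each bin would then be reassembled separately.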
 
Description Bioinformatics Training Workshops, Buenos Aires and La Plata, Argentina
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We presented two week-long workshops under the auspices of CONICET and the University of La Plata, on bioinformatics tools for next generation genomics, including Blobtools, GenomeHubs and related topics.
Year(s) Of Engagement Activity 2018
 
Description Blaxter group - presentations and outreach 2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The Blaxter group presented work at a wide range of national and international conferences, including PopGroup, the Arthropod Genomics Workshop, The C. elegans International Meeting, The Hydra Helminthology meeting, The European Society for Nematology, The UK Genome Science meeting, and others. At many of these venues, in addition to offering platform or poster presentations, we also presented workshops or training activities.
Year(s) Of Engagement Activity 2016
 
Description Blaxter group presentations and outreach 2015 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Mark Blaxter and research team communication and outreach 2015

Globodera genomics and blobtools software
25/02/2015 JHI Postgraduate Student Competition 2015 James Hutton Institute, Aberdeen, UK A tale of Two Peaks: Analysing Genomic Data from Potato Cyst Nematodes Talk
26/03/2015 JHI Cell and Molecular Sciences (CMS) seminar James Hutton Institute, Invergowrie, Dundee, UK Frustration and happiness : (De)-constructing parasite genomes Talk
16/06/2015 JHI Dundee effector consortium (DEC) meeting 2015 Birnam Arts and Conference Centre, Birnam, UK Variation within the Globodera pallida species complex: preliminary results Talk
03/09/2015 Molecular and Cellular Biology of Helminth Parasites IX Bratsera Hotel, Hydra, Greece Inter- and intra-specific analyses of the effector complement in potato cyst nematodes Poster
18/09/2015 UoE Postgraduate Poster Day University of Edinburgh, Edinburgh, UK Inter- and intra-specific analyses of the effector complement in potato cyst nematodes Poster
26/09/2015 Edinburgh University Doors Open Day University of Edinburgh, Edinburgh, UK Potato Cyst Nematodes (PCN) - Nematode parasites of potatoes Poster
30/11/2015 NextGenBug University of Edinburgh, Edinburgh, UK Blobtools: Blobology 2.0 Talk
01/12/2015 UK pollinator genomics meeting Roslin Institute, Edinburgh, UK Bees and Blobs Talk

LepBase
06/03/2015 EMARES Cambridge, UK The Bicyclus Genome Project Talk
06/03/2015 EMARES Cambridge, UK An introduction to Lepbase Talk
17/06/2015 Arthropod Genomics Manhattan, Kansas, USA Lepbase - A multi genome database for the Lepidoptera Poster
24/07/2015 10th Heliconius Meeting Gamboa, Panama Lepbase - A multi genome database for the Lepidoptera (API demonstration) Workshop
24/07/2015 10th Heliconius Meeting Panama Lepbase - A multi genome database for the Lepidoptera Poster
26/07/2015 10th Heliconius Meeting Panama Lepbase Workshop Talk
04/09/2015 Edinburgh Bioinformatics Edinburgh, UK Lepbase - A multi genome database for the Lepidoptera Talk
26/09/2015 Open Doors Day "Make a butterfly" interactive exhibition
26/09/2015 Edinburgh University Doors Open Day Edinburgh, UK Lepbase Multiple Sequence Alignments game Poster+Game
28/10/2015 NextgenBUG Dundee, UK Lepbase - an Ensembl (and more) for the Lepidoptera Talk

Nematode genomics
24/06/2015 20th International C. elegans Meeting Los Angeles, USA A new evolutionary framework for the phylum Nematoda: a case study of HOX cluster evolution Poster
24/06/2015 20th International C. elegans Meeting Los Angeles, USA Caenorhabditis Genomes Project Workshop (organiser and chair) Talk
24/06/2015 20th International C. elegans Meeting Los Angeles, USA Current status of the CGP in Edinburgh Talk

Meloidogyne genomics
10-14 August 2015 ESEB Lausanne, Switzerland Genomic consequences of hybridization and the loss of meiotic recombination in Root-knot nematodes Poster
15-18 December 2015 PopGroup Edinburgh, UK Genomic consequences of hybridization and the loss of meiotic recombination in Root-knot nematodes Talk
23 February 2016 NextGenBug Edinburgh, UK Genomics of Root-knot nematodes Talk
Year(s) Of Engagement Activity 2015
 
Description Blaxter lab workshops 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Blaxter lab took our software products and research tools to various venues (Arthropod Genomics, UK Genome Science meeting, Butterfly Genomics) to present as workshops, training events or interactive sessions
Year(s) Of Engagement Activity 2016
 
Description Press releases and website 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact We have engaged actively with the University press office to promote press coverage of our research outcomes, particularly major publications (which have had coverage in national and international newspapers) and in blogs and other online media. We have also promoted major new initiatives such as additional core funding of the Edinburgh genomics facility.

Increased visibility of Edinburgh Genomics within the community; requests for comment by funders and government on matters pertaining to genomics.
Year(s) Of Engagement Activity 2009,2010,2011,2012,2013,2014,2015,2016