Building a genome analytic resource for the lepidopteran community

Lead Research Organisation: University of Cambridge
Department Name: Zoology


Genome sequencing and large-scale population genomic analysis have suddenly become affordable. The explosion of data presents a tremendous opportunity for ground-breaking research based on the integration of data from independently organized, community-driven genome projects, but this in turn requires shared database resources. For model organisms, genome databasing efforts grew with the research communities, and there are mature portals for deep investigation across many large-scale datasets; the databases themselves have become a substrate for high-impact (meta-)research. For communities new to genomic (and population genomic) approaches, the need for accessible databases is even more pressing, as researchers are less likely to be fluent in the peculiar languages of genomics and high-throughput bioinformatics. Here we propose the founding of a community database for lepidopteran genomics, LepBase, to meet the needs of the growing community of researchers using genomics to understand Lepidoptera as crop pests, as potentially invasive species, as developmental models, and as key taxa for understanding the interplay between ecology, genomics, evolution and speciation. While initially focussed on the available lepidopteran genomes, the project will meet the challenge of future genomic riches (over 20 genomes 'in the pipeline') by building a platform that focuses on the needs of the lepidopteran research community.
The challenge of integrating newly developed genomic resources across taxa is not a new one, and several computational frameworks exist to support such endeavors (such as the ENSEMBL project and the GMOD ecosystem of tools). Central, aggregative database efforts, such as ENSEMBL Genomes, provide an effective and powerful one-size-fits-all solution to genome warehousing. However, coordinating with smaller research communities to directly implement clade-specific resources is overwhelming the capacity of the institutions that have a mandate to generate integrated genomic databases. ENSEMBL now advocates a multi-tiered approach to the aggregation, integration, and dissemination of the rapidly increasing wealth of genomic information arising from community-driven genome projects, so that species-level genomic resources can flow 'upstream' into the pan-genome database.
The goals of our project are: to develop a community-wide, comparative database for the Lepidoptera using the ENSEMBL platform; to institute effective tools for ongoing community annotation of emerging genomes; to forge close links with ENSEMBL Genomes to ensure upload of lepidopteran genomes into the global resource; to implement new modes of data visualisation and analysis in the ENSEMBL framework to meet community needs; and to provide training in genomics to the community of lepidopteran researchers. The LepBase database will also be a working model of community-driven databases that drive not only clade-specific research programmes but also enable the flow of knowledge from species-specific genome projects into a comprehensive framework.
The project will be based in the Blaxter bioinformatics and genomics group in Edinburgh, in association with the GenePool Genomics Facility (currently engaged in sequencing butterfly and moth species), with project partners in the Jiggins Heliconius research group in Cambridge and the Dasmahapatra group in York, and the support of lepidopteran researchers worldwide. Initial focus will be on the genus Heliconius, for which a complete genome sequence and abundant annotation, transcriptome and resequencing data already exist. The database will be rapidly extended to silkmoth, Bicyclus, Danaus and other species. The resource will be overseen by a Scientific Advisory Board drawn from across the range of lepidopteran researchers, and will aim for financial sustainability beyond the tenure of the award through development of a 'subscription' model of funding from research partners.

Technical Summary

Top-tier databases such as ENSEMBL Genomes do not have the resources, domain-specific expertise and reach to nurture high-quality databasing of emerging genomes. It is thus proposed that focused Tier 2 databases be established to act as community aggregative databases, delivering focused support to their user groups and feeding quality-controlled data up to central aggregative databases such as ENSEMBL Genomes. Here we propose the establishment of a Tier 2 database for Lepidoptera, LepBase, that will capitalize on the leading position of UK research groups (largely funded by BBSRC) in the rapidly expanding field of lepidopteran genomics.

We will develop a range of tools and resources that will benefit wider research communities using Lepidoptera as model species or where better-organised lepidopteran genomic data can make a difference. Code and pipelines developed during the project are likely to be of much wider utility, and LepBase will serve as a model of Tier 2 aggregative genome databases.

We will first install and test the ENSEMBL code base, and develop 'standard' ENSEMBL instances for Heliconius melpomene and other published lepidopteran genomes. We will use both the community supplied annotations and standardized optimal annotation pipelines within ENSEMBL to deliver richly annotated genome portals.

We will use these first genomes as exemplars to write, test and deploy code for ENSEMBL for several novel modalities of data, including population genetic measures, geospatial analyses and clade-specific orthology and synteny.

We will install and deploy a community annotation portal (CAP) that will allow experts in the communities to comment on, vary and add annotation.

We will expand into additional genomes as they become available to us, and promote the resource to the lepidopteran community and interested external stakeholders (public and industrial) through meetings, visits, training workshops and web-based media.

Planned Impact

This proposal aims to deliver a common internet-access portal onto the many lepidopteran genomes being generated, and to develop new tools to interrogate these genomes in the wider context of the whole order.
The main beneficiaries will be lepidopteran genomics researchers, who will have a unified portal in which to contextualise and analyse their own data. We will engage this community by direct communication, encouraging groups to submit data to the project and to assist us in getting their genome sequences represented. This will be achieved by attendance at community meetings (the Kansas Arthropod Genomics Workshop) and through blogs and Twitter feeds from the project. We will maintain a project blog describing the architecture of the site, the decisions made in development, and the project's successes, problems and prospects. The corporate Twitter feed will be used to communicate database updates and improvements, and to pass on important news from the world of lepidopteran genomics.
A second group of beneficiaries will be lepidopteran biologists in general. A wide range of specialisms use Lepidoptera as target organisms, from neurobiology through evolutionary genetics to behavioural ecology. We will engage this community by making the portal easy to use for non-genomics specialists, and by providing data summaries of utility to research teams focused on one or a few genes and pathways. Again, our blog and Twitter feed will be used to keep this community informed, and we will also make sure our team has representation at the key meetings and workshops where lepidopteran research is presented (for example the annual Evolution meetings).
A third key stakeholder group comprises the companies and research teams who are developing new tools to combat lepidopteran pests. Our database will be useful in defining possible drug and biocontrol targets, and in revealing the diversity and conservation of these targets across the order. Similarly, the biotechnology industry has a keen interest in biomaterials from Lepidoptera, such as silks and new semiochemicals. The development of pathway-oriented annotations and the breadth of species collected in LepBase will permit more rational selection of lead enzymes and products. We will keep these organisations and individuals informed of developments through blog and Twitter feeds, and presentations at relevant meetings.
Arthropod genomics is burgeoning, and the i5k initiative is coordinating a hoped-for 5000 genomes (in the first instance). The wider arthropod genomics community will benefit directly from use of LepBase, and also from our pushing LepBase genomes and updates into ENSEMBL Genomes. We will ensure that the i5k site and the wider arthropod genomics community are kept informed through the blog, Twitter feed and direct emailing to interest groups.
As a model Tier 2 database, LepBase will be of interest to those developing similar systems for their taxa of interest. We will open our code development and ideas to colleagues running similar initiatives worldwide, and make sure we keep up with their work. Our code will, we hope, be integrated into the core ENSEMBL codebase, but meanwhile (and in addition) we will make it available on GitHub.
We will strive to publish the database in the annual NAR Databases issue, highlighting updates and enhancements. Other publications, in open access journals, will also communicate to our key audiences.
The general public has strong interest in butterflies and moths as charismatic species. We will maintain general interest pages on the web presence of LepBase describing our work, and make available for download factsheets describing each species and the core biology the genome is revealing. These will be made available to butterfly farms, natural history museums and other interested parties. The database will be publicised at open days and science fairs in the three home institutions as available.


Description We have established a publicly available and widely used resource for mining lepidopteran genome data
Exploitation Route Our model for community ENSEMBL databases has considerable potential for rolling out to other communities outside the Lepidoptera.

The LepBase resource is being extended to handle variant data and other forms of information.
Sectors Agriculture, Food and Drink,Environment,Manufacturing, including Industrial Biotechnology

Description The grant has provided a comprehensive web tool for the analysis of lepidopteran genomes. There is considerable interest in using such resources for developing agricultural tools for control of lepidopteran crop pests. We do not yet have documented evidence that our resource has been used in this way, but we will try to provide further information on this. In addition, we have developed software tools that allow local ENSEMBL databases to be established by other groups, which will greatly facilitate the use of these resources for hosting genome databases for any organism. There is already interest in using these for other taxa such as Molluscs.
First Year Of Impact 2015
Sector Agriculture, Food and Drink,Education,Environment
Impact Types Societal

Title LepBase 
Description LepBase is a community resource providing lepidopteran genome sequences in a genome browser based on the ENSEMBL format, with tools for comparative genome analysis. It is the first community-level database established outside the EBI to use the ENSEMBL genome browser, and as such provides a model for future community genomics databases.
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact The resource is being widely used by the lepidopteran genome community for functional and research applications 
Title ENSEMBL Plugin and other code 
Description Code for local installation of ENSEMBL databases 
Type Of Technology Software 
Year Produced 2015 
Impact Other local ENSEMBL databases established for taxa such as Molluscs 
Title Lepbase Genome Database 
Description A web database for interrogation of genome sequences across the Lepidoptera (moths and butterflies), a large group of insects that includes many pest species 
Type Of Technology Webtool/Application 
Year Produced 2015 
Impact This is now a widely used resource for the genomics community in our field