NextGenPartiGene: next generation transcriptome assembly annotation and exploitation toolkit

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

Biologists have access to ever improving toolkits with which to ask probing questions of the natural world. One revolutionary development that has taken place over the last forty years is the advent of DNA sequencing. We now have the ability to decipher the genome sequence (or 'genetic blueprint') of any organism, and from this work out how they tick. About five years ago, this genomics revolution stepped up a gear, with the introduction of DNA sequencing technologies that increased the rate of genome sequencing, and reduced the cost, many, many fold. These 'next generation' technologies have suddenly made it possible for many researchers to start using genome sequencing in their work. However, as with any new technology, new solutions bring new problems. In the case of genome sequencing it is a 'rich person's' problem: researchers now can generate hundreds to thousands of times as much data as they used to, in a small fraction of the time, but they do not have the computer tools to process and understand it. The reduced cost of sequencing also means that many researchers who now can afford to use this technology do not have the long training required in computing to successfully analyse the floods of data. We propose to develop a set of easy-to-use tools, which we call NextGenPartiGene, using 'next generation' computing frameworks, that will alleviate this problem. We are focussing on the problem of working out what genes an organism is using (or 'expressing'), and what it is that these genes are likely to be doing. By sampling only the expressed genes of an organism (or a part of an organism, such as a leaf or a particular tissue type) it is possible to build up a detailed picture of the kinds of biochemical pathways the organism is running (what it can eat and what wastes it produces), and how experimental interventions change these pathways. 
We will build the NextGenPartiGene toolkit using an emerging model for such projects: the idea that much of the hard work is done by a server computer, running clever programmes behind the scenes, and that this server is driven by a client, accessed through a standard web browser. By building this client-server toolkit, we will be able to guide researchers with vast amounts of next-generation sequencing data down the best-practice, tried-and-tested paths to full and fruitful analysis. This means they will be able to extract maximum information from their data, and maximum value from their research funding. We will release the NextGenPartiGene tools as open-access software, so that others are both free to use it, and free to modify and improve it to fit their needs.

Technical Summary

Next generation sequencing technologies have qualitatively changed the way we acquire and analyse transcriptomes by making it possible to generate vast amounts of sequence data very cheaply. As the sequencing effort required to generate transcriptome-scale data has decreased, the bioinformatics effort required to analyse and annotate them has grown proportionally bigger. The combined effects of increased affordability of sequencing and decentralization of sequencing facilities mean that the bulk of the burden of analysis falls on researchers who are not bioinformatics specialists. These conditions create a need for a user-friendly, robust transcriptome analysis package that can handle the volume of data produced by next-gen technologies. Our existing transcriptomics pipeline, PartiGene, is designed for last-generation sequencing technologies and written using last-generation programming techniques. We propose to develop and release a complete replacement, NextGenPartiGene, which will be built on modern programming technology and will incorporate best-practice transcriptome analysis. NextGenPartiGene will run completely within a web browser, allowing data sharing to be built in as a core feature, and will combine third party applications (for assembly and annotation) with custom visualization tools to provide a complete transcriptomics analysis and data mining workflow. NextGenPartiGene will be built using the Grails web framework, allowing rapid development and straightforward deployment, and, where possible, will use parallelization to take advantage of multiple processor cores and speed up analysis. The database schema will be designed from scratch to cope with the expected volumes of data, and will take advantage of the full-text indexing integrated in PostgreSQL 8.3 to offer comprehensive searching of annotations.

Planned Impact

NextGenPartiGene is envisaged as an enabling tool. The beneficiaries of this research will be mainly, in the first instance, academics and small to medium enterprise companies using next generation sequencing approaches in the analysis of novel species or novel treatments of well-studied species. By building efficient, fit-for-purpose and open-access tools, we will promote best practice across the field. As we are releasing the software openly, it will not bring direct financial (i.e. intellectual property rights) benefit to ourselves or to the University, but it will facilitate the exploitation of these tools by such users. By taking on the weight of construction and testing of usable software, we release such beneficiaries to better produce the outcomes they are qualified to deliver, be they improved biological understanding, or better exploitation of a biological or biotechnological resource. In particular, the need for discovery, development and testing of new crop organisms, whether they are animals, plants, fungi or other eukaryotes, for goals of biofuels production, ecological remediation and food security assurance, will be aided by more efficient and trustworthy bioinformatics tools. Genomics and transcriptomics are now a first port of call in development of novel organisms for exploitation, whether to understand their basic biology and biochemistry, to unravel the mechanisms behind desirable traits, or to develop genetic markers for assisted breeding programmes. NextGenPartiGene can be a key resource in achieving these goals. In particular, by reducing the time and resource needed to turn raw data into mineable databases, it will increase the efficiency and productivity of next generation transcriptomics approaches across the board. Our tools will also promote data sharing between users, thus giving them enhanced ability to fruitfully cooperate on shared projects. 
By offering a unified solution, collaborating institutions, organisations and companies can either open their analyses to outside scrutiny (via the open API of the NextGenPartiGene suite or the web browser), or simply merge datasets produced independently (because the underlying data structure will be the same).

Publications

 
Title Transmissions exhibition 
Description The Blaxter lab collaborated closely with artists-in-residence (see http://www.ascus.org.uk/ciie-micro-residency-artists-announced/) in the Centre for Immunity, Infection and Evolution to inspire and be part of the final exhibition "Transmissions". Mark Blaxter appears in the film work produced by Anne Milne, and the work of the lab inspired Jo Hodges and Robbie Coleman to produce a piece dedicated to the lab. 
Type Of Art Artwork 
Year Produced 2014 
Impact 'Transmissions' was showcased to the general public within a group exhibition 'Parallel Perspectives' in Summerhall as part of the Edinburgh International Science Festival 2015 art programme, How The Light Gets In. This exhibition of work subsequently travelled to LifeSpace, Dundee, returning to Edinburgh to showcase at the Tent Gallery, Edinburgh College of Art. 
URL http://www.ascus.org.uk/ciie-micro-residency-2/
 
Description The AfterParty web application suite is available for beta testing at the AfterParty web site. The tool incorporates all of the core functionality planned, and additionally has new visualisation and data integration tools that make the platform very adaptable and useful. The tool is already in use by a number of research groups across the UK.
URL http://afterparty.bio.ed.ac.uk
 
Description BBSRC Project Grant (Genome Databasing)
Amount £671,655 (GBP)
Funding ID BB/K020161/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 11/2013 
End 11/2016
 
Title MolluscDB 
Description MolluscDB is a PartiGene database covering the transcriptomes of a number of mollusc species. 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact MolluscDB has been used in several published works, including our own on Lymnaea stagnalis pond snails. 
URL http://www.nematodes.org/NeglectedGenomes/MOLLUSCA/index.html
 
Title NEMBASE4 
Description NEMBASE is a database analysing the transcriptomes of 67 different species of nematode, including many important parasites of humans, animals and plants. Nematode parasites are of major importance in human health and agriculture, and free-living species deliver essential ecosystem services. The genomics revolution has resulted in the production of many small datasets of expressed sequence tags (ESTs) from a phylogenetically wide range of nematode species, but these are not easily compared. NEMBASE4 presents a single portal onto extensively functionally annotated transcriptomes from over sixty species of nematodes, including plant and animal parasites and free-living taxa. Using the PartiGene suite of tools, we have assembled the ESTs publicly available for each species into a high-quality set of putative transcripts. These transcripts have been translated to produce a protein sequence resource, and each annotated with functional information derived from comparison to well-studied nematode species such as Caenorhabditis elegans and also other non-nematode resources. By cross-comparing the sequences within NEMBASE4, we have also generated a protein family assignment for each translation. The data are presented in an openly-accessible, interactive database. We have used NEMBASE4 to examine the uniqueness of the transcriptomes of major clades of parasitic nematodes, identifying lineage-restricted genes that may underpin particular parasitic phenotypes, and to identify nematode-unique protein families that may be developed as drug targets. 
Type Of Material Database/Collection of data 
Year Produced 2010 
Provided To Others? Yes  
Impact NEMBASE gene models and analyses are exported widely to other analysis groups including WORMBASE and Nematode.net 
URL http://www.nematodes.org/nembase4/
 
Title TardiBase 
Description TardiBase houses data and analyses relating to the genome and transcriptome of Hypsibius dujardini, a limnetic tardigrade. 
Type Of Material Database/Collection of data 
Year Produced 2011 
Provided To Others? Yes  
Impact TardiBase and the data within it have been the basis of a number of publications. 
URL http://www.tardigrades.org
 
Title AfterParty transcriptome analysis tool 
Description Background Next-generation DNA sequencing technologies have made it possible to generate transcriptome data for novel organisms quickly and cheaply, to the extent that the effort required to annotate and publish a new transcriptome is greater than the effort required to sequence it. Often, following publication, details of the annotation effort are only available in summary form, hindering subsequent exploitation of the data. To promote best-practice in annotation and to ensure that data remain accessible, we have written afterParty, a web application that allows users to assemble, annotate and publish novel transcriptomes using only a web browser. Results afterParty is a robust web application that implements best-practice transcriptome assembly, annotation, browsing, searching, and visualization. Users can turn a collection of reads (from Roche 454 chemistry) or assembled contigs (from any sequencing chemistry, including Illumina Solexa RNA-Seq) into a searchable, browsable transcriptome resource and quickly make it publicly available. Contigs are functionally annotated based on similarity to known sequences and protein domains. Once assembled and annotated, transcriptomes derived from multiple species or libraries can be compared and searched. afterParty datasets can either be created using the existing afterParty server, or using local instances that can be easily built using a virtual machine. afterParty includes powerful visualization tools for transcriptome dataset exploration and uses a flexible annotation architecture which will allow additional types of annotation to be added in the future. Conclusions afterParty's main use case scenario is one in which a working biologist has generated a large volume of transcribed sequence data and wishes to turn it into a useful resource that has some durability. 
By reducing the effort, bioinformatics skills, and computational resources needed to annotate and publish a transcriptome, afterParty will facilitate the annotation and sharing of sequence data that would otherwise remain unavailable. A typical metazoan transcriptome containing several tens of thousands of contigs can be annotated in a few minutes of interactive time and a few days of computational time. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact AfterParty now has a community of users worldwide. 
URL http://afterparty.bio.ed.ac.uk
 
Title Badger genome exploration environment 
Description Summary: High quality draft genomes are now easy to generate, as sequencing and assembly costs have dropped dramatically. However, building a user friendly, searchable website and database for newly annotated genome data is not straightforward. Here we present Badger, a lightweight and easy-to-install genome exploration environment designed for next generation, non-model organism genomes. Availability: Badger is released under the GPL and is available at http://badger.bio.ed.ac.uk/. We show two working examples: (1) a test dataset included with the source code and (2) a collection of four filarial nematode genomes. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact Badger has been used extensively by the Blaxter lab and others to present genome data to the world. 
URL http://badger.bio.ed.ac.uk
 
Title TAGC-plots and Blobsplorer tools for genomics 
Description Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratisation of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programmes. Here we present an approach to extracting from mixed DNA sequence data subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimised assembly. We also present a tool, blobsplorer, that aids exploration and selection of subsets from GC/coverage/taxonomy annotated datasets. Partitioning the data in this way "rescues" poorly assembled genomes, and reveals unexpected symbionts and commensals in eukaryotic genome projects. The TAGC-plot pipeline script is available from http://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/blobsplorer. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact Blobsplorer/TAGC plots are now in wide use in genomics. The toolkit has been featured in several courses and publications. 
URL http://github.com/blaxterlab/blobology
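
The partitioning idea described for TAGC-plots (grouping draft contigs by GC proportion and read coverage) can be illustrated with a minimal sketch. This is not the blobology pipeline itself: the `Contig` class, the function names and the two cut-off values are hypothetical, chosen only for illustration, and the taxonomic (best-match) indicator mentioned in the description is left out. Coverage values are assumed to have been computed upstream by mapping reads to the draft assembly.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    coverage: float  # mean read depth; assumed to be computed upstream


def gc_fraction(sequence: str) -> float:
    """Proportion of G and C among unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def assign_bin(contig: Contig, gc_split: float = 0.5, cov_split: float = 20.0) -> str:
    """Label a contig by which side of each (hypothetical) cut-off it falls on."""
    gc_label = "highGC" if gc_fraction(contig.sequence) >= gc_split else "lowGC"
    cov_label = "highCov" if contig.coverage >= cov_split else "lowCov"
    return f"{gc_label}/{cov_label}"


def partition(contigs, gc_split: float = 0.5, cov_split: float = 20.0) -> dict:
    """Group contig names into bins keyed by their GC/coverage label."""
    bins = {}
    for contig in contigs:
        label = assign_bin(contig, gc_split, cov_split)
        bins.setdefault(label, []).append(contig.name)
    return bins
```

In practice the cut-offs would be chosen by inspecting a GC/coverage plot of the assembly rather than fixed in advance, and each resulting bin (together with the reads that contribute to its contigs) would then be reassembled separately.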
 
Description Blaxter group - presentations and outreach 2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The Blaxter group presented work at a wide range of national and international conferences, including PopGroup, the Arthropod Genomics Workshop, The C. elegans International Meeting, The Hydra Helminthology meeting, The European Society for Nematology, The UK Genome Science meeting, and others. At many of these venues, in addition to offering platform or poster presentations, we also presented workshops or training activities.
Year(s) Of Engagement Activity 2016
 
Description Blaxter group presentations and outreach 2015 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Mark Blaxter and research team communication and outreach 2015

Globodera genomics and blobtools software
25/02/2015 JHI Postgraduate Student Competition 2015 James Hutton Institute, Aberdeen, UK A tale of Two Peaks: Analysing Genomic Data from Potato Cyst Nematodes Talk
26/03/2015 JHI Cell and Molecular Sciences (CMS) seminar James Hutton Institute, Invergowrie, Dundee, UK Frustration and happiness : (De)-constructing parasite genomes Talk
16/06/2015 JHI Dundee effector consortium (DEC) meeting 2015 Birnam Arts and Conference Centre, Birnam, UK Variation within the Globodera pallida species complex: preliminary results Talk
03/09/2015 Molecular and Cellular Biology of Helminth Parasites IX Bratsera Hotel, Hydra, Greece Inter- and intra-specific analyses of the effector complement in potato cyst nematodes Poster
18/09/2015 UoE Postgraduate Poster Day University of Edinburgh, Edinburgh, UK Inter- and intra-specific analyses of the effector complement in potato cyst nematodes Poster
26/09/2015 Edinburgh University Doors Open Day University of Edinburgh, Edinburgh, UK Potato Cyst Nematodes (PCN) - Nematode parasites of potatoes Poster
30/11/2015 NextGenBug University of Edinburgh, Edinburgh, UK Blobtools: Blobology 2.0 Talk
01/12/2015 UK pollinator genomics meeting Roslin Institute, Edinburgh, UK Bees and Blobs Talk

LepBase
06/03/2015 EMARES Cambridge, UK The Bicyclus Genome Project Talk
06/03/2015 EMARES Cambridge, UK An introduction to Lepbase Talk
17/06/2015 Arthropod Genomics Manhattan, Kansas, USA Lepbase - A multi genome database for the Lepidoptera Poster
24/07/2015 10th Heliconius Meeting Gamboa, Panama Lepbase - A multi genome database for the Lepidoptera (API demonstration) Workshop
24/07/2015 10th Heliconius Meeting Panama Lepbase - A multi genome database for the Lepidoptera Poster
26/07/2015 10th Heliconius Meeting Panama Lepbase Workshop Talk
04/09/2015 Edinburgh Bioinformatics Edinburgh, UK Lepbase - A multi genome database for the Lepidoptera Talk
26/09/2015 Open Doors Day "Make a butterfly" interactive exhibition
26/09/2015 Edinburgh University Doors Open Day Edinburgh, UK Lepbase Multiple Sequence Alignments game Poster+Game
28/10/2015 NextgenBUG Dundee, UK Lepbase - an Ensembl (and more) for the Lepidoptera Talk

Nematode genomics
24/06/2015 20th International C. elegans Meeting Los Angeles, USA A new evolutionary framework for the phylum Nematoda: a case study of HOX cluster evolution Poster
24/06/2015 20th International C. elegans Meeting Los Angeles, USA Caenorhabditis Genomes Project Workshop (organiser and chair) Talk
24/06/2015 20th International C. elegans Meeting Los Angeles, USA Current status of the CGP in Edinburgh Talk

Meloidogyne genomics
10-14 August 2015 ESEB Lausanne, Switzerland Genomic consequences of hybridization and the loss of meiotic recombination in Root-knot nematodes Poster
15-18 December 2015 PopGroup Edinburgh, UK Genomic consequences of hybridization and the loss of meiotic recombination in Root-knot nematodes Talk
23 February 2016 NextGenBug Edinburgh, UK Genomics of Root-knot nematodes Talk
Year(s) Of Engagement Activity 2015
 
Description Blaxter lab workshops 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Blaxter lab took our software products and research tools to various venues (Arthropod Genomics, UK Genome Science meeting, Butterfly Genomics) to present as workshops, training events or interactive sessions
Year(s) Of Engagement Activity 2016
 
Description Press releases and website 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact We have engaged actively with the University press office to promote press coverage of our research outcomes, particularly major publications (which have had coverage in national and international newspapers) and in blogs and other online media. We have also promoted major new initiatives such as additional core funding of the Edinburgh genomics facility.

Increased visibility of Edinburgh Genomics within the community; requests for comment by funders and government on matters pertaining to genomics.
Year(s) Of Engagement Activity 2009,2010,2011,2012,2013,2014,2015,2016