NextGenPartiGene: next generation transcriptome assembly annotation and exploitation toolkit

Lead Research Organisation: University of Edinburgh

Department Name: Sch of Biological Sciences

Abstract

Biologists have access to ever-improving toolkits with which to ask probing questions of the natural world. One revolutionary development that has taken place over the last forty years is the advent of DNA sequencing. We now have the ability to decipher the genome sequence (or 'genetic blueprint') of any organism, and from this work out how it ticks. About five years ago, this genomics revolution stepped up a gear, with the introduction of DNA sequencing technologies that increased the rate of genome sequencing, and reduced the cost, many, many fold. These 'next generation' technologies have suddenly made it possible for many researchers to start using genome sequencing in their work. However, as with any new technology, new solutions bring new problems. In the case of genome sequencing it is a 'rich person's' problem: researchers can now generate hundreds to thousands of times as much data as they used to, in a small fraction of the time, but they do not have the computer tools to process and understand it. The reduced cost of sequencing also means that many researchers who can now afford to use this technology do not have the long training required in computing to successfully analyse the floods of data. We propose to develop a set of easy-to-use tools, which we call NextGenPartiGene, using 'next generation' computing frameworks, that will alleviate this problem. We are focussing on the problem of working out what genes an organism is using (or 'expressing'), and what it is that these genes are likely to be doing. By sampling only the expressed genes of an organism (or a part of an organism, such as a leaf or a particular tissue type) it is possible to build up a detailed picture of the kinds of biochemical pathways the organism is running (what it can eat and what wastes it produces), and how experimental interventions change these pathways.
We will build the NextGenPartiGene toolkit using an emerging model for such projects: the idea that much of the hard work is done by a server computer, running clever programmes behind the scenes, and that this server is driven by a client, accessed through a standard web browser. By building this client-server toolkit, we will be able to guide researchers with vast amounts of next-generation sequencing data down the best-practice, tried-and-tested paths to full and fruitful analysis. This means they will be able to extract maximum information from their data, and maximum value from their research funding. We will release the NextGenPartiGene tools as open-access software, so that others are both free to use it, and free to modify and improve it to fit their needs.

Technical Summary

Next generation sequencing technologies have qualitatively changed the way we acquire and analyse transcriptomes by making it possible to generate vast amounts of sequence data very cheaply. As the sequencing effort required to generate transcriptome-scale data has decreased, the bioinformatics effort required to analyse and annotate them has grown proportionally. The combined effects of increased affordability of sequencing and decentralization of sequencing facilities mean that the bulk of the burden of analysis falls on researchers who are not bioinformatics specialists. These conditions create a need for a user-friendly, robust transcriptome analysis package that can handle the volume of data produced by next-gen technologies. Our existing transcriptomics pipeline, PartiGene, is designed for last-generation sequencing technologies and written using last-generation programming techniques. We propose to develop and release a complete replacement, NextGenPartiGene, which will be built on modern programming technology and will incorporate best-practice transcriptome analysis. NextGenPartiGene will run completely within a web browser, allowing data sharing to be built in as a core feature, and will combine third party applications (for assembly and annotation) with custom visualization tools to provide a complete transcriptomics analysis and data mining workflow. NextGenPartiGene will be built using the Grails web framework, allowing rapid development and straightforward deployment, and where possible will use parallelization to take advantage of multiple processor cores and speed up analysis. The database schema will be designed from scratch to cope with the expected volumes of data, and will take advantage of the full-text indexing integrated in PostgreSQL 8.3 to offer comprehensive searching of annotations.
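
To make the idea of full-text searching of annotations concrete, the following is a minimal sketch of an in-memory inverted index over annotation text. It is an illustration only: it is not the NextGenPartiGene schema and not PostgreSQL's text search, which does considerably more (for example stemming and ranking). The contig identifiers, annotation strings and function names are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(annotations):
    """Map each token to the set of contig IDs whose annotation contains it."""
    index = defaultdict(set)
    for contig_id, description in annotations.items():
        for token in tokenize(description):
            index[token].add(contig_id)
    return index


def search(index, query):
    """Return the contig IDs whose annotation contains every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical annotations, invented for this example.
annotations = {
    "contig_001": "heat shock protein 70",
    "contig_002": "cytochrome c oxidase subunit I",
    "contig_003": "small heat shock protein",
}

index = build_index(annotations)
hits = search(index, "heat shock")
```

Building the index once and intersecting token sets at query time is what makes searching fast relative to scanning every annotation for each query; a database-backed index applies the same principle at much larger scale.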

Planned Impact

NextGenPartiGene is envisaged as an enabling tool. The beneficiaries of this research will be mainly, in the first instance, academics and small to medium enterprise companies using next generation sequencing approaches in the analysis of novel species or novel treatments of well-studied species. By building efficient, fit-for-purpose and open-access tools, we will promote best practice across the field. As we are releasing the software openly, it will not impact in terms of direct financial (i.e. intellectual property rights) benefit to ourselves or to the University, but it will facilitate the exploitation of these tools by such users. By taking the weight of construction and testing of usable software we release such beneficiaries to better produce the outcomes they are qualified to, be they improved biological understanding, or better exploitation of a biological or biotechnological resource. In particular, the need for discovery, development and testing of new crop organisms, whether they are animals, plants, fungi or other eukaryotes, for goals of biofuels production, ecological remediation and food security assurance, will be aided by more efficient and trustworthy bioinformatics tools. Genomics and transcriptomics are now a first port of call in development of novel organisms for exploitation, whether to understand their basic biology and biochemistry, to unravel the mechanisms behind desirable traits, or to develop genetic markers for assisted breeding programmes. NextGenPartiGene can be a key resource in achieving these goals. In particular, by reducing the time and resource needed to turn raw data into mineable databases, it will increase the efficiency and productivity of next generation transcriptomics approaches across the board. Our tools will also promote data sharing between users, thus giving them enhanced ability to fruitfully cooperate on shared projects.
By offering a unified solution, NextGenPartiGene will allow collaborating institutions, organisations and companies either to open their analyses to outside scrutiny (via the open API of the NextGenPartiGene suite or the web browser) or simply to merge datasets produced independently (because the underlying data structure will be the same).

Funded Value:

£123,949

Funded Period:

Aug 11 - Jan 13

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/I023585/1

Principal Investigator:

Mark Blaxter

Research Subject:

Omic sciences & technologies (13%)

Tools, technologies & methods (52%)

Research Topic:

Bioinformatics (39%)

Transcriptomics (13%)

eScience (13%)

Organisations

University of Edinburgh (Lead Research Organisation)

People	ORCID iD
Mark Blaxter (Principal Investigator)
Martin Jones (Researcher Co-Investigator)

Publications

Elsworth B (2013) Badger--an accessible genome exploration environment. in Bioinformatics (Oxford, England)

Jones M (2013) afterParty: turning raw transcriptomes into permanent resources. in BMC bioinformatics

Davison A (2016) Formin Is Associated with Left-Right Asymmetry in the Pond Snail and the Frog. in Current biology : CB

Kumar S (2013) Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots. in Frontiers in genetics

Sadd BM (2015) The genomes of two key bumblebee species with primitive eusocial organization. in Genome biology

Quintana JF (2015) Extracellular Onchocerca-derived small RNAs in host nodules and blood. in Parasites & vectors

Blaxter M (2015) The evolution of parasitism in Nematoda. in Parasitology

Artistic and Creative Products
Key Findings
Further Funding
Research Databases and Models
Software and Technical Products
Engagement Activities


Title	Transmissions exhibition
Description	The Blaxter lab collaborated closely with artists-in-residence (see http://www.ascus.org.uk/ciie-micro-residency-artists-announced/) in the Centre for Immunity, Infection and Evolution to inspire and be part of the final exhibition "Transmissions". Mark Blaxter appears in the film work produced by Anne Milne, and the work of the lab inspired Jo Hodges and Robbie Coleman to produce a piece dedicated to the lab.
Type Of Art	Artwork
Year Produced	2014
Impact	'Transmissions' was showcased to the general public within a group exhibition 'Parallel Perspectives' in Summerhall as part of the Edinburgh International Science Festival 2015 art programme, How The Light Gets In. This exhibition of work subsequently travelled to LifeSpace, Dundee, returning to Edinburgh to showcase at the Tent Gallery, Edinburgh College of Art.
URL	http://www.ascus.org.uk/ciie-micro-residency-2/


Description	The AfterParty web application suite is available for beta testing at the AfterParty web site. The tool incorporates all of the core functionality planned, and additionally has new visualisation and data integration tools that make the platform very adaptable and useful. The tool is already in use by a number of research groups across the UK.
URL	http://afterparty.bio.ed.ac.uk


Description	BBSRC Project Grant (Genome Databasing)
Amount	£671,655 (GBP)
Funding ID	BB/K020161/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	11/2013
End	11/2016


Title	MolluscDB
Description	MolluscDB is a PartiGene database covering the transcriptomes of a number of mollusc species.
Type Of Material	Database/Collection of data
Provided To Others?	No
Impact	MolluscDB has been used in several published works, including our own on Lymnaea stagnalis pond snails.
URL	http://www.nematodes.org/NeglectedGenomes/MOLLUSCA/index.html


Title	NEMBASE4
Description	NEMBASE is a database analysing the transcriptomes of 67 different species of nematode, including many important parasites of humans, animals and plants. Nematode parasites are of major importance in human health and agriculture, and free-living species deliver essential ecosystem services. The genomics revolution has resulted in the production of many small datasets of expressed sequence tags (ESTs) from a phylogenetically wide range of nematode species, but these are not easily compared. NEMBASE4 presents a single portal onto extensively functionally annotated transcriptomes from over sixty species of nematodes, including plant and animal parasites and free-living taxa. Using the PartiGene suite of tools, we have assembled the ESTs publicly available for each species into a high-quality set of putative transcripts. These transcripts have been translated to produce a protein sequence resource, and each annotated with functional information derived from comparison to well-studied nematode species such as Caenorhabditis elegans and also other non-nematode resources. By cross-comparing the sequences within NEMBASE4, we have also generated a protein family assignment for each translation. The data are presented in an openly-accessible, interactive database. We have used NEMBASE4 to examine the uniqueness of the transcriptomes of major clades of parasitic nematodes, identifying lineage-restricted genes that may underpin particular parasitic phenotypes, and identify nematode-unique protein families that may be developed as drug targets.
Type Of Material	Database/Collection of data
Year Produced	2010
Provided To Others?	Yes
Impact	NEMBASE gene models and analyses are exported widely to other analysis groups including WormBase and Nematode.net.
URL	http://www.nematodes.org/nembase4/


Title	TardiBase
Description	TardiBase houses data and analyses relating to the genome and transcriptome of Hypsibius dujardini, a limnetic tardigrade.
Type Of Material	Database/Collection of data
Year Produced	2011
Provided To Others?	Yes
Impact	TardiBase and the data within it have been the basis of a number of publications.
URL	http://www.tardigrades.org


Title	AfterParty transcriptome analysis tool
Description	Background: Next-generation DNA sequencing technologies have made it possible to generate transcriptome data for novel organisms quickly and cheaply, to the extent that the effort required to annotate and publish a new transcriptome is greater than the effort required to sequence it. Often, following publication, details of the annotation effort are only available in summary form, hindering subsequent exploitation of the data. To promote best-practice in annotation and to ensure that data remain accessible, we have written afterParty, a web application that allows users to assemble, annotate and publish novel transcriptomes using only a web browser. Results: afterParty is a robust web application that implements best-practice transcriptome assembly, annotation, browsing, searching, and visualization. Users can turn a collection of reads (from Roche 454 chemistry) or assembled contigs (from any sequencing chemistry, including Illumina Solexa RNA-Seq) into a searchable, browsable transcriptome resource and quickly make it publicly available. Contigs are functionally annotated based on similarity to known sequences and protein domains. Once assembled and annotated, transcriptomes derived from multiple species or libraries can be compared and searched. afterParty datasets can either be created using the existing afterParty server, or using local instances that can be easily built using a virtual machine. afterParty includes powerful visualization tools for transcriptome dataset exploration and uses a flexible annotation architecture which will allow additional types of annotation to be added in the future. Conclusions: afterParty's main use case scenario is one in which a working biologist has generated a large volume of transcribed sequence data and wishes to turn it into a useful resource that has some durability. By reducing the effort, bioinformatics skills, and computational resources needed to annotate and publish a transcriptome, afterParty will facilitate the annotation and sharing of sequence data that would otherwise remain unavailable. A typical metazoan transcriptome containing several tens of thousands of contigs can be annotated in a few minutes of interactive time and a few days of computational time.
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	AfterParty now has a community of users worldwide.
URL	http://afterparty.bio.ed.ac.uk


Title	Badger genome exploration environment
Description	Summary: High quality draft genomes are now easy to generate, as sequencing and assembly costs have dropped dramatically. However, building a user-friendly, searchable website and database for new annotated genome data is not straightforward. Here we present Badger, a lightweight and easy-to-install genome exploration environment designed for next generation, non-model organism genomes. Availability: Badger is released under the GPL and is available at http://badger.bio.ed.ac.uk/. We show two working examples: (1) a test dataset included with the source code and (2) a collection of four filarial nematode genomes.
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	Badger has been used extensively by the Blaxter lab and others to present genome data to the world.
URL	http://badger.bio.ed.ac.uk


Title	TAGC-plots and Blobsplorer tools for genomics
Description	Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratisation of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programmes. Here we present an approach to extracting from mixed DNA sequence data subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimised assembly. We also present a tool, blobsplorer, that aids exploration and selection of subsets from GC/coverage/taxonomy annotated datasets. Partitioning the data in this way "rescues" poorly assembled genomes, and reveals unexpected symbionts and commensals in eukaryotic genome projects. The TAGC-plot pipeline script is available from http://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/blobsplorer.
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	Blobsplorer/TAGC plots are now in wide use in genomics. The toolkit has been featured in several courses and publications.
URL	http://github.com/blaxterlab/blobology
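
The partitioning idea described in the TAGC-plots entry above (binning contigs by GC proportion, read coverage and best-matching taxon) can be sketched as follows. This is a simplified illustration under stated assumptions, not the blobology pipeline itself: the `Contig` record, the `partition` function, the thresholds and the example sequences are all hypothetical, and coverage and taxon labels are assumed to have been computed beforehand by read mapping and a similarity search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    name: str
    sequence: str
    coverage: float  # mean read depth (assumed precomputed from a read mapping)
    taxon: str       # best-matching taxon label (assumed precomputed)


def gc_fraction(sequence):
    """Proportion of G and C among the unambiguous bases A, C, G and T."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def partition(contigs, gc_range, min_coverage, taxa):
    """Split contigs into (kept, rejected) using GC, coverage and taxon filters."""
    low, high = gc_range
    kept, rejected = [], []
    for contig in contigs:
        in_bin = (
            low <= gc_fraction(contig.sequence) <= high
            and contig.coverage >= min_coverage
            and contig.taxon in taxa
        )
        (kept if in_bin else rejected).append(contig)
    return kept, rejected


# Toy data, invented for this example.
contigs = [
    Contig("contig_1", "ATGCATATTAAT", 42.0, "Nematoda"),         # low GC, good coverage
    Contig("contig_2", "GCGCGGCCGCAT", 310.0, "Proteobacteria"),  # high GC, different taxon
    Contig("contig_3", "ATATGCATAAAT", 3.0, "Nematoda"),          # coverage too low
]

kept, rejected = partition(
    contigs, gc_range=(0.1, 0.5), min_coverage=10.0, taxa={"Nematoda"}
)
```

In practice the bin boundaries would be chosen by inspecting a GC-coverage plot rather than fixed in advance, and the reads contributing to the kept contigs would then be reassembled separately.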


Description	Blaxter group - presentations and outreach 2016
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	The Blaxter group presented work at a wide range of national and international conferences, including PopGroup, the Arthropod Genomics Workshop, The C. elegans International Meeting, The Hydra Helminthology meeting, The European Society for Nematology, The UK Genome Science meeting, and others. At many of these venues, in addition to offering platform or poster presentations, we also presented workshops or training activities.
Year(s) Of Engagement Activity	2016


Description	Blaxter group presentations and outreach 2015
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Mark Blaxter and research team communication and outreach 2015. Entries list date; event; location; title; type.
Globodera genomics and blobtools software:
25/02/2015; JHI Postgraduate Student Competition 2015; James Hutton Institute, Aberdeen, UK; A tale of Two Peaks: Analysing Genomic Data from Potato Cyst Nematodes; Talk
26/03/2015; JHI Cell and Molecular Sciences (CMS) seminar; James Hutton Institute, Invergowrie, Dundee, UK; Frustration and happiness: (De)-constructing parasite genomes; Talk
16/06/2015; JHI Dundee effector consortium (DEC) meeting 2015; Birnam Arts and Conference Centre, Birnam, UK; Variation within the Globodera pallida species complex: preliminary results; Talk
03/09/2015; Molecular and Cellular Biology of Helminth Parasites IX; Bratsera Hotel, Hydra, Greece; Inter- and intra-specific analyses of the effector complement in potato cyst nematodes; Poster
18/09/2015; UoE Postgraduate Poster Day; University of Edinburgh, Edinburgh, UK; Inter- and intra-specific analyses of the effector complement in potato cyst nematodes; Poster
26/09/2015; Edinburgh University Doors Open Day; University of Edinburgh, Edinburgh, UK; Potato Cyst Nematodes (PCN) - Nematode parasites of potatoes; Poster
30/11/2015; NextGenBug; University of Edinburgh, Edinburgh, UK; Blobtools: Blobology 2.0; Talk
01/12/2015; UK pollinator genomics meeting; Roslin Institute, Edinburgh, UK; Bees and Blobs; Talk
LepBase:
06/03/2015; EMARES; Cambridge, UK; The Bicyclus Genome Project; Talk
06/03/2015; EMARES; Cambridge, UK; An introduction to Lepbase; Talk
17/06/2015; Arthropod Genomics; Manhattan, Kansas, USA; Lepbase - A multi genome database for the Lepidoptera; Poster
24/07/2015; 10th Heliconius Meeting; Gamboa, Panama; Lepbase - A multi genome database for the Lepidoptera (API demonstration); Workshop
24/07/2015; 10th Heliconius Meeting; Panama; Lepbase - A multi genome database for the Lepidoptera; Poster
26/07/2015; 10th Heliconius Meeting; Panama; Lepbase Workshop; Talk
04/09/2015; Edinburgh Bioinformatics; Edinburgh, UK; Lepbase - A multi genome database for the Lepidoptera; Talk
26/09/2015; Open Doors Day; "Make a butterfly" interactive exhibition
26/09/2015; Edinburgh University Doors Open Day; Edinburgh, UK; Lepbase Multiple Sequence Alignments game; Poster+Game
28/10/2015; NextgenBUG; Dundee, UK; Lepbase - an Ensembl (and more) for the Lepidoptera; Talk
Nematode genomics:
24.06.2015; 20th International C. elegans Meeting; Los Angeles USA; A new evolutionary framework for the phylum Nematoda: a case study of HOX cluster evolution; Poster
24.06.2015; 20th International C. elegans Meeting; Los Angeles USA; Caenorhabditis Genomes Project Workshop (organiser and chair); Talk
24.06.2015; 20th International C. elegans Meeting; Los Angeles USA; Current status of the CGP in Edinburgh; Talk
Meloidogyne genomics:
10-14 August 2015; ESEB; Lausanne-Switzerland; Genomic consequences of hybridization and the loss of meiotic recombination in Root-knot nematodes; poster
15-18 December 2015; PopGroup; Edinburgh-UK; Genomic consequences of hybridization and the loss of meiotic recombination in Root-knot nematodes; talk
23 February 2016; NextGenBug; Edinburgh-UK; Genomics of Root-knot nematodes; talk
Year(s) Of Engagement Activity	2015


Description	Blaxter lab workshops
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The Blaxter lab took our software products and research tools to various venues (Arthropod Genomics, UK Genome Science meeting, Butterfly Genomics) to present as workshops, training events or interactive sessions
Year(s) Of Engagement Activity	2016


Description	Press releases and website
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	We have engaged actively with the University press office to promote press coverage of our research outcomes, particularly major publications (which have had coverage in national and international newspapers) and in blogs and other online media. We have also promoted major new initiatives such as additional core funding of the Edinburgh Genomics facility. Increased visibility of Edinburgh Genomics within the community; requests for comment by funders and government on matters pertaining to genomics.
Year(s) Of Engagement Activity	2009,2010,2011,2012,2013,2014,2015,2016