Opening gene expression data to the research community

Lead Research Organisation: James Hutton Institute
Department Name: Cell & Molecular Sciences

Abstract

Gene expression and its regulation are the foundation of plant development, organ-specific differences and responses to the environment. RNA-Seq is a high-throughput sequencing technology that has become the primary platform for the study of gene- and transcript-level expression. Public data archives store vast volumes of raw RNA-Seq data, but these data require specialist analysis skills and large-scale computational resources to be of value to biologists and are thus an underutilised resource. Since our first attempt at accessing and processing quantitative gene expression data from publicly archived RNA-Seq samples, the number of barley RNA-Seq datasets has increased more than fivefold, which brings challenges in the scale of processing and visualising large numbers of datasets. We propose to build on our existing barley gene expression database and website, EORNA, to provide a scalable, highly automated system for the discovery, retrieval and quantification of barley RNA-Seq data. Comparative visualisation of transcript-level expression data will be coupled to improved curation of the experimental metadata. We will update and expand our current EORNA database with all available public barley RNA-Seq data, quantified against the latest state-of-the-art barley pan-transcriptome reference dataset. We will provide scalable transcript-level expression plots with direct access to gene sequence information and annotation through an enhanced and easily searchable website. Finally, we will make the entire system generic and provide it as a free resource to the wider scientific community so that researchers working on other organisms can establish their own species-specific databases. Together, this will create an essential gene and transcript discovery resource for barley researchers and breeders, and for the wider scientific community.

Technical Summary

Public sequence archives contain vast numbers of datasets that hold enormous potential for reuse in studies beyond those they were generated for. Archived RNA-Seq data have great potential for adding value to research projects when they are converted into quantitative expression data in an easily accessible format. We have previously established a database resource, EORNA (https://ics.hutton.ac.uk/eorna/index.html), based on barley RNA-Seq data retrieved from the European Nucleotide Archive (ENA). Here we propose a completely new iteration of the EORNA database that is based on a high degree of automation and scalability, together with a generic version that the wider community can use for other taxa. We will implement approaches for the automated discovery, retrieval and analysis of samples from public archives using programmatic access based on the REST API implemented at the ENA, and will combine this with mapping reads to the upcoming barley pan-transcriptome reference transcript dataset using the latest decoy-based read mapping implemented in Salmon. A new website will be constructed to allow easy access to transcript expression analysis data using the LAMP stack (Linux, Apache, MySQL and Perl). The website will allow users to interrogate the data through three entry points: a BLAST search of the reference barley assembly or the predicted transcripts; a keyword search of the BLAST annotation derived from rice and Arabidopsis thaliana; and a direct string search using transcript, gene or contig identifiers. We will apply a new, more scalable technology for visualisation of expression values than that used in the original EORNA database. We will then make available to the community all code developed, along with the database schema, and bundle this into a distribution that the wider community can benefit from, especially the numerous researchers working on non-model organisms for which curated resources such as EORNA are unavailable.
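
As an illustration only, the discovery and quantification steps described above might be sketched as follows. This is not the project's implementation: the ENA Portal API endpoint, the query syntax and the field names are assumptions based on the public ENA search service, 4513 is assumed to be the NCBI taxonomy identifier for Hordeum vulgare, and all function names are hypothetical. The sketch only builds the search URL, parses a tab-separated response and assembles a Salmon command line; it does not download or run anything.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed public ENA Portal API search endpoint.
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
# Assumed NCBI taxonomy ID for barley (Hordeum vulgare).
BARLEY_TAXON_ID = 4513


def build_ena_query_url(taxon_id=BARLEY_TAXON_ID):
    """Build a search URL for all RNA-Seq runs under a taxon (hypothetical helper)."""
    params = {
        "result": "read_run",
        "query": f'tax_tree({taxon_id}) AND library_strategy="RNA-Seq"',
        "fields": "run_accession,library_layout,fastq_ftp",
        "format": "tsv",
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"


def parse_ena_tsv(text):
    """Turn a tab-separated ENA response into a list of dicts, one per run."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def salmon_quant_command(run, index_dir, out_root, threads=4):
    """Assemble a `salmon quant` command for one run (hypothetical helper).

    Decoy awareness is a property of the index, so it is assumed that
    index_dir was built beforehand with `salmon index ... -d decoys.txt`.
    FASTQ files are referred to by file name, assuming prior download.
    """
    files = [f.rsplit("/", 1)[-1] for f in run["fastq_ftp"].split(";") if f]
    mate1 = [f for f in files if f.endswith("_1.fastq.gz")]
    mate2 = [f for f in files if f.endswith("_2.fastq.gz")]

    cmd = ["salmon", "quant", "-i", index_dir, "-l", "A", "-p", str(threads)]
    if mate1 and mate2:
        cmd += ["-1", mate1[0], "-2", mate2[0]]   # paired-end reads
    elif len(files) == 1:
        cmd += ["-r", files[0]]                   # single-end reads
    else:
        raise ValueError(f"unexpected FASTQ layout for {run['run_accession']}")
    cmd += ["-o", f"{out_root}/{run['run_accession']}"]
    return cmd
```

In a pipeline of this shape, the URL would be fetched on a schedule, newly listed run accessions compared against those already in the database, and only the new runs passed on for download and quantification.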
