A Reference Transcript Database for improved analysis of RNA-seq data from barley

Lead Research Organisation: University of Dundee
Department Name: School of Life Sciences

Abstract

The term 'gene expression' refers to the biological process by which a gene gives rise to a protein. In eukaryotes, gene expression is complex. The DNA sequence of the gene is first copied into a precursor messenger RNA (pre-mRNA) by the process of transcription, and the pre-mRNA is then subjected to several processing steps to form a mature messenger RNA (mRNA) that is the template for synthesis of the corresponding protein. The post-transcriptional processing steps can generate different mRNA transcripts from the same gene (i.e. transcript isoforms), effectively modulating individual transcript abundance and potentially protein function. Having multiple transcript isoforms from a single gene is problematic in terms of 1) defining the expression levels of individual transcript isoforms and how they change under different conditions, and 2) determining their characteristics - such as whether they encode protein isoforms or not. As gene expression data is widely used to derive biological inference, for example, by grouping genes according to common patterns of expression, failure to take account of the relative abundance of alternative transcripts will unavoidably generate false conclusions. In this project, we focus on the development of a resource/tool that will allow the accurate detection and quantification of mRNA transcript isoforms in barley. The tool will enable high resolution analysis of dynamic changes in gene expression at the individual transcript level and, as a recognised and accessible reference, will help unify and structure such analyses across a research community.

One of the main approaches scientists use to associate genes with functions is to monitor patterns of gene expression: i.e. where and when genes are switched on or off, and at what level. Current approaches provide an overall measure of gene expression by counting the frequency of occurrence of very specific sequences that correspond to a given mRNA relative to the whole population of mRNAs in a particular sample and transforming these counts into relative abundance levels. However, these methods are unable to distinguish the abundance of individual isoform variants, in particular those that determine protein levels, structures and activities. We call the tool developed in this project a 'Reference Transcript Database' or RTD. The RTD is effectively a library of all of the transcript isoforms that exist in a diverse range of tissues from a single organism. By using the RTD in gene expression studies we can identify and determine the abundance of different transcript isoforms easily and quickly, and these can be used in subsequent functional analyses.
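To illustrate why gene-level measures can hide isoform-level changes, the following minimal sketch converts read counts into relative abundance (transcripts per million, TPM) and compares transcript-level and gene-level values. It is not the project's pipeline: the transcript names, lengths and counts are invented toy data, and the helper functions `tpm` and `gene_totals` are hypothetical names used only for this example.

```python
from collections import defaultdict


def tpm(counts, lengths):
    """Convert raw read counts to transcripts per million (TPM)."""
    # Normalise each count by transcript length, then scale so values sum to 1e6.
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    return {t: 1e6 * rate / total for t, rate in rates.items()}


def gene_totals(transcript_tpm, transcript_to_gene):
    """Sum transcript-level TPM values into one value per gene."""
    totals = defaultdict(float)
    for transcript, value in transcript_tpm.items():
        totals[transcript_to_gene[transcript]] += value
    return dict(totals)


# Toy data (assumed for illustration): geneA has two isoforms, geneB has one.
transcript_to_gene = {"geneA.1": "geneA", "geneA.2": "geneA", "geneB.1": "geneB"}
lengths = {"geneA.1": 1000, "geneA.2": 1000, "geneB.1": 1000}
control = {"geneA.1": 900, "geneA.2": 100, "geneB.1": 1000}
stress = {"geneA.1": 100, "geneA.2": 900, "geneB.1": 1000}

tpm_control = tpm(control, lengths)
tpm_stress = tpm(stress, lengths)
genes_control = gene_totals(tpm_control, transcript_to_gene)
genes_stress = gene_totals(tpm_stress, transcript_to_gene)

# The gene-level totals for geneA match between conditions, while the
# dominant isoform switches from geneA.1 to geneA.2.
print("gene level:", genes_control["geneA"], genes_stress["geneA"])
print("control isoforms:", tpm_control["geneA.1"], tpm_control["geneA.2"])
print("stress isoforms:", tpm_stress["geneA.1"], tpm_stress["geneA.2"])
```

In this toy case a gene-level analysis would report no change for geneA, whereas a transcript-level analysis against a reference set of isoforms would detect the switch.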

We focus this project on the crop plant, barley, a model for the small grain Triticeae cereals that include wheat and rye. The RTD will allow effects on global and specific gene expression to be easily analysed at the transcript level in plants subjected to a range of conditions or treatments, improving our community's ability to explore and understand a wide range of biological processes. The RTD will be refreshed and maintained longer term by the barley and computational sciences groups at the James Hutton Institute.

Technical Summary

Post-transcriptional mRNA processing, particularly Alternative Splicing (AS), generates multiple transcript isoforms per gene. AS occurs in up to 70% of intron-containing plant genes. AS isoforms can either be targeted for degradation or can encode proteins with different functions. In Arabidopsis, a combination of ultra-deep RNA-seq and new computational methods of analysis generated transcript-specific expression datasets that allow us to interpret the contribution of individual transcript isoforms to overall patterns of gene expression. The new methods required the development of a Reference Transcript Dataset (RTD) - ultimately a library of all transcript isoforms present within the cells and tissues of an organism. We generated AtRTD2 comprising over 82k non-redundant transcripts for the 34k Arabidopsis genes. The accurate quantification of individual transcripts and AS events signals a step change in plant transcriptome analysis.

Here, a barley RTD will be constructed from full-length transcript datasets generated by PacBio Iso-seq, supplemented and error-corrected by deep Illumina paired-end RNA-seq data. We will sequence RNA from twenty tissues, including plants exposed to biotic and abiotic stress. The RTD will be made immediately available to the barley research community to allow unified analysis/re-analysis of new/existing RNA-seq data and to aid the design of new experiments (e.g. time-courses of infection or abiotic stress). Transcript-specific data will identify genes regulated at the level of transcription, AS, or both. We will identify novel genes and mechanisms of regulation which contribute to the complex transcriptome re-programming responsible for the response of a plant to environmental or developmental cues. The new data will provide novel insights into genes/transcripts that control phenotypes and, where appropriate, causal variants that can be used to develop genetic markers for use in crop improvement.

Planned Impact

We envisage two significant primary impacts. The first will be on the ability of barley researchers exploring gene function by RNA-seq based expression analysis to more accurately analyse and interpret their data. The second will be on anyone who refers to the reference barley genome sequence, because the RTD will be a key informational resource for experimentally supported genome annotation. As such, the primary beneficiaries and users will be the research sector, both academic and industrial. The first version of the barley RTD will be generated and made available to the community before the end of this 24-month project.

We believe the main challenge to maximising impact will be to raise awareness of the value of the RTD and promote its adoption by the research community. While the appended letters of support demonstrate community support and awareness, we are conscious that RTD development needs to be done quickly so that research groups can use, plan and design RNA-seq experiments with an RTD and transcript isoform analysis pipeline firmly in mind.

The main Impact Objectives are therefore to:
1) Inform the barley community of the value of the use of the RTD well ahead of a primary release, allowing groups to design and plan RNA-seq experiments with this in mind.
2) Inform the barley community of the value of transcript-isoform specific expression data for identifying genes regulated by post-transcriptional processes such as AS.
3) Release the RTD to the community as soon as possible through standard communication channels including community websites and social media.

To achieve these objectives:
1. The PI/Co-I will ensure community awareness by contacting barley research group leaders with details of the project and how it will benefit them and have the Co-I describe the development and advantages of the Arabidopsis AtRTD2 at a meeting of barley researchers early in the programme.
2. The PIs/Co-Is will present regular updates of progress at national and international conferences and meetings (e.g. Monogram, PAG) as well as invited seminars.
3. The initial barley RTD will be released to collaborating groups (see letters of support) for validation as soon as possible and subsequently made widely available on our websites prior to publication. As an RTD is not a static entity, we will release versioned updates (with change logs) over time (updated RTDs will be essential for our own research as well as that of the community).
4. We will train and mentor the PDRA and encourage their participation in public engagement activities.

Publications

Barakate A (2020) Barley Anther and Meiocyte Transcriptome Dynamics in Meiotic Prophase I. in Frontiers in plant science

Simpson CG (2019) High-Resolution RT-PCR Analysis of Alternative Barley Transcripts. in Methods in molecular biology (Clifton, N.J.)

Twardziok S (2018) The Barley Genome

 
Description The work has shown that there can be many different RNA molecules expressed from a single gene. Some of these are the product of alternative splicing, some arise from variation at the transcription start or end sites, others are probably stochastic variation from 'time of sampling' or aberrant transcripts destined for degradation and turnover, and some reflect a combination of these. The data we have collected allows many of these different RNA species to be incorporated into 'expression analysis', revealing different layers of regulation operating in complex biological systems as a result of tissue specificity, developmental change or response to environment. The dataset 'BaRTV1.0' was an important tool for understanding how, where and when genes are switched on or off and how different variants are induced or repressed according to biological variables (time, tissue etc). We are near to releasing an updated version of the RTD, BaRTV2.0, which has been assembled from a combination of short read Illumina sequence data and long read single molecule PacBio data from 20 different tissues from a 'typical' 2-row spring barley cultivar called Barke. BaRTV2.0 is a significant improvement over BaRTV1.0 and will imminently be available for download for use in quantifying transcripts identified in RNA-seq experiments, and for annotating new barley genome sequences. While we have reanalysed many other RNA-seq datasets with BaRTV1, we are currently updating the results (published as a database called EoRNA) based on analysis with BaRTV2.18. Reanalysis of existing RNA-seq datasets with BaRTV2.0 could also form the basis of excellent honours student projects.
Exploitation Route The reference transcript database is an evolving resource. BaRTV2.18 is an up-to-date and improved version of V1 based on long read PacBio Iso-seq data and a more comprehensive collection of deep Illumina short read data (i.e. BaRTV2.0). We also included methods for accurate definition of transcript start and end sites and of alternative splice variants. The resource is already being used by the barley research community for analysis of RNA-seq data.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Education; Environment

URL https://doi.org/10.1101/2021.09.10.459729
 
Description The Reference Transcript Dataset has been used by ourselves and is being increasingly used by several groups to streamline their analysis of gene expression. V1 and V2 have collectively been cited 47 times since first release, and the dataset is central to use of the 3-D RNA-seq app developed in-house at JHI, which has been cited over 55 times. The different versions have been, and are being, used for experimentally supported barley reference genome annotation, and the approaches developed have been applied to analysis of a pan-transcriptome study. The dataset has been instrumental in the development and annotation of V1.0 of a community Pan-Genome resource and is currently being used to support annotation of a V2.0 pan-genome containing ~80 diverse reference quality barley genome sequences.
First Year Of Impact 2019
Sector Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Education; Environment
Impact Types Economic

 
Description An automated pipeline for construction of Reference Transcript Datasets (RTD) to enable rapid and accurate gene expression analysis in plant species
Amount £316,936 (GBP)
Funding ID BB/S020160/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 10/2019 
End 09/2021
 
Description UUKI Rutherford fund - strategic partner grants
Amount £150,000 (GBP)
Funding ID RF-2018-30 
Organisation Universities UK 
Sector Academic/University
Country United Kingdom
Start 03/2018 
End 03/2019
 
Title BaRT 
Description A database containing a reference set of transcripts expressed from the barley cultivar Morex. Used for rapid and accurate analysis of RNA-seq data 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact Rapid and accurate analysis of RNA-seq data using alignment-free methods 
URL https://ics.hutton.ac.uk/barleyrtd/index.html
 
Description Barley Pan Genome 
Organisation IPK Gatersleben
Country Germany 
Sector Private 
PI Contribution Provide a reference quality sequence of the cultivar Golden Promise
Collaborator Contribution Reference Quality sequences of other barley genotypes (consortium effort)
Impact No outcomes yet
Start Year 2017
 
Description Barley Yield associated Networks (BARN) 
Organisation IPK Gatersleben
Country Germany 
Sector Private 
PI Contribution BARN is an ERA-CAPS collaborative award with three partners. We will provide a Reference Transcript Dataset and RNA-seq information from 2 tissues from 200 barley cultivars. We will jointly analyse the resulting data.
Collaborator Contribution Each partner has common and specific tasks. The Reference Transcript Dataset and RNA-seq information will be used to interrogate expression in a further 2 tissues from 200 barley cultivars. We will also survey sequence all 200 lines and build cultivar-specific RTDs to assist analysis. The partners will jointly analyse the resulting data.
Impact Too early
Start Year 2018
 
Description Barley Yield associated Networks (BARN) 
Organisation University of Minnesota
Country United States 
Sector Academic/University 
PI Contribution BARN is an ERA-CAPS collaborative award with three partners. We will provide a Reference Transcript Dataset and RNA-seq information from 2 tissues from 200 barley cultivars. We will jointly analyse the resulting data.
Collaborator Contribution Each partner has common and specific tasks. The Reference Transcript Dataset and RNA-seq information will be used to interrogate expression in a further 2 tissues from 200 barley cultivars. We will also survey sequence all 200 lines and build cultivar-specific RTDs to assist analysis. The partners will jointly analyse the resulting data.
Impact Too early
Start Year 2018
 
Description BRIDGE 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact Presentation of BaRTV1.0 to participants of the German BRIDGE project (IPK/MIPS/Industry)
Year(s) Of Engagement Activity 2019
 
Description BaRTV1.0 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact Presentation at the Barley Away Days in Dunkeld, Feb 2020. Attended by a wide range of national/international scientists, students and stakeholders.
Year(s) Of Engagement Activity 2020
 
Description SAB 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact BBSRC requested we establish a Science Advisory Board (SAB) for our Barley Reference Transcript Database (RTD) project. After appointing three members, we extended the remit of this SAB to cover a range of related projects and to gather their expert feedback more widely. We received written feedback from Mario Caccamo and Philippa Borrill at our first meeting last April. The SAB will meet again in April 2020, and we have extended its composition to include Ian Bancroft (who couldn't make the first meeting) and representation from the EBI (Bruno Contreras) and the barley Pan Genome Consortium (Nils Stein). We will continue to engage with the SAB beyond this current project to provide advice on the related awards (e.g. the ERA-CAPS project BARN).
Year(s) Of Engagement Activity 2019