Construction and refinement of Reference Transcript Dataset annotations for fast and accurate transcript quantification in barley and potato

Lead Research Organisation: University of Dundee
Department Name: School of Life Sciences

Abstract

In the model plant Arabidopsis, we have taken a new and successful approach to analysing gene expression from high-throughput sequencing of RNA (RNA-seq). The high incidence of alternative splicing (AS) in plants (found in >60% of intron-containing genes) requires methods to distinguish and quantify AS variants or isoforms. We have adopted an approach used by scientists analysing gene expression and AS in human cancers. The approach uses programs such as SALMON, which do not require mapping of reads to genomes but instead use a Reference Transcript Dataset (RTD) to quantify transcripts. We have generated an RTD for Arabidopsis (Zhang et al., 2015, 2016), which involved the development of in-house pipelines for removing redundancy in the RTD. Using this system with RNA-seq data from a time-course of plants transferred from normal to cold temperatures has already demonstrated transcript-specific expression, including AS responses to cold, and identified new genes involved in the cold response. The PhD project will develop computational pipelines for the construction of RTDs in crop plants (potato and barley), which will be tested with RNA-seq datasets. This will require the development of algorithms for the pipelines and for various downstream analyses. The ability to generate transcript-specific and allele-specific expression data will greatly enhance our ability to analyse gene expression and identify key genes in plant/crop processes such as abiotic and biotic stress responses.

Publications


Studentship Projects

Project Reference | Relationship | Related To | Start | End | Student Name
BB/M010996/1 | | | 01/10/2015 | 30/09/2023 |
1785562 | Studentship | BB/M010996/1 | 01/09/2016 | 04/11/2020 | Juan Carlos Juan Entizne
 
Description Among the most significant achievements from my award is the development of software tools for the analysis of transcriptomic data. These programs allow fast and accurate analysis of high-throughput sequencing data. They will be useful to any researchers who wish to analyze the transcriptome of poorly annotated organisms.

As of now, I have met the objective of generating an improved transcriptome annotation for the Double-Monoploid potato cultivar (called StRTD, version 01March2020), which, although still a prototype, already shows a considerable improvement in the number of transcripts and genes annotated (in comparison with the currently available transcriptome annotation for potato). I have also created a preliminary high-quality transcriptome annotation for lettuce (LsRTD, version 01Jan2020). I am currently working on validating these novel transcriptome annotations.

I have also created a program that takes transcriptome annotations as input, generates more "biologically accurate" translations, and identifies transcriptional characteristics present in the annotated transcripts.
Exploitation Route Because the programs are modular and independent of one another, the outcomes from this award can easily be taken forward by future Ph.D. students and/or postdocs.

For example, new protocols to integrate data from other novel sequencing technologies (e.g. PacBio) can be developed and applied in parallel with, or subsequently to, these programs, thereby generating even more accurate and/or complete transcriptome annotations.

Furthermore, since I am designing these programs to be user-friendly, they can be put to use by any researcher who wishes to generate a high-quality, comprehensive transcriptome annotation to study almost any organism they are interested in.
Sectors Agriculture, Food and Drink; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Description Construction and refinement of Reference Transcript Dataset annotations for fast and accurate transcript quantification in barley and potato
Amount £58,000 (GBP)
Funding ID 1785562 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 09/2016 
End 08/2020
 
Title AutoRTD 
Description To accurately measure differential gene expression and alternative splicing, it is necessary to quantify expression levels of individual transcripts and genes. To achieve this, it is necessary to have a complete, non-redundant transcriptome which lists the possible transcripts. This is called a Reference Transcript Dataset (RTD). AutoRTD is a program which takes RNA-sequencing data and assembles transcripts. It uses multiple assemblers and incorporates many quality control filters at different stages of the assembly process in order to remove false transcripts. This prototype is currently being tested and will be released on GitHub shortly.
Type Of Material Technology assay or reagent 
Year Produced 2019 
Provided To Others? No  
Impact The tool is being used to generate RTDs for potato, lettuce and barley as tests. This will greatly increase the accuracy of the transcriptomes for these species compared to what is currently available. 
 
Title TranSuite 
Description A major problem in characterising transcripts is mis-annotation of open reading frames (ORFs). Current programs find the longest ORF, and in many cases this is incorrect. TranSuite takes transcripts (e.g. newly assembled transcripts in Reference Transcript Datasets), identifies the authentic translation start site of each gene, and then translates all the transcripts of that gene from that AUG. The program also deals with specific exceptions. It then characterises the features of all the transcripts in terms of their coding capacity and whether they are likely to be degraded by nonsense-mediated decay. The program will be released soon on GitHub.
Type Of Material Technology assay or reagent 
Year Produced 2019 
Provided To Others? No  
Impact TranSuite is being used to assess transcript assemblies in Arabidopsis, potato and barley as tests before release. It compares very favourably with other, less capable programs that are in current use.
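The nonsense-mediated decay (NMD) assessment mentioned in the TranSuite description can be illustrated with a minimal Python sketch. It assumes the widely used "50-nucleotide rule" heuristic, in which a stop codon lying more than about 50 nt upstream of the last exon-exon junction marks a likely NMD target. The record does not state which criteria TranSuite actually applies, so the function name, arguments and threshold below are hypothetical.

```python
from typing import Sequence


def is_likely_nmd_target(exon_lengths: Sequence[int],
                         stop_codon_end: int,
                         min_distance: int = 50) -> bool:
    """Apply the 50-nt rule to one spliced transcript (illustrative only).

    exon_lengths   -- exon lengths in transcript order (5' to 3')
    stop_codon_end -- 0-based transcript index just past the stop codon
    min_distance   -- assumed threshold in nucleotides (hypothetical default)
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no exon-exon junction at all.
        return False
    # Transcript coordinate of the last exon-exon junction.
    last_junction = sum(exon_lengths[:-1])
    # A stop codon far enough upstream of that junction is "premature".
    return last_junction - stop_codon_end > min_distance


# Three exons of 100, 200 and 150 nt: the last junction sits at position 300.
print(is_likely_nmd_target([100, 200, 150], stop_codon_end=120))  # True
print(is_likely_nmd_target([100, 200, 150], stop_codon_end=400))  # False
```

A real classifier would probably combine several signals, so this single rule should be read only as an example of the kind of feature such a program can report.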
 
Description Collaboration on understanding the role of UPF1 in Arabidopsis 
Organisation Central European Institute of Technology (CEITEC)
Country Czech Republic 
Sector Academic/University 
PI Contribution The lab of Karel Riha has examined the functions of UPF1 in Arabidopsis. It is involved in nonsense-mediated decay but also in plant defence responses. We performed RNA-seq analysis, which entailed JC Entizne (PhD student) using the prototype of his transcript assembly program to identify novel transcripts that are affected by UPF1.
Collaborator Contribution The Riha lab have added extensive data on the role of UPF1 in translation and in the expression of TNL defence genes.
Impact We have a manuscript which is about to be submitted
Start Year 2018
 
Description Collaboration to use AutoRTD as part of a new pipeline to build RTDs for barley data 
Organisation James Hutton Institute
Department Cell and Molecular Sciences
Country United Kingdom 
Sector Public 
PI Contribution AutoRTD is being used to develop a new pipeline for rapid construction of RTDs using barley RNA-seq data as test data.
Collaborator Contribution The collaborators are providing barley RNA-seq data and computational expertise in developing the pipeline.
Impact None as yet
Start Year 2019
 
Description Collaboration to use AutoRTD to generate preliminary RTD for lettuce 
Organisation University of York
Department Biological Sciences
Country United Kingdom 
Sector Academic/University 
PI Contribution To test the prototype of AutoRTD, we have constructed an RTD from lettuce. This provides information for modifying and improving the program.
Collaborator Contribution The collaborators provided the RNA-seq data for construction of the RTD.
Impact No outputs as yet
Start Year 2019
 
Title Program for accurate translations of transcripts by fixing starting codon 
Description This program finds, for a group of transcripts belonging to a gene, the start codon that translates into the longest ORF. It then uses that start codon to try to translate all the related transcripts (transcripts coming from the same gene).
Type Of Technology Software 
Year Produced 2018 
Impact Most translation programs translate the longest ORF of each transcript. However, this is not completely biologically accurate. For example, such an approach ignores the presence of premature termination codons (PTCs): transcripts containing a PTC are biologically expected to be terminated early rather than translated into their longest possible version.
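The gene-level start codon idea described in this entry can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program itself: transcripts are given as forward-strand exon lists in 0-based, half-open genomic coordinates, ORFs are returned as nucleotide strings, and every function name and the toy sequence are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
Exons = List[Tuple[int, int]]  # 0-based, half-open, forward strand only


def spliced(genome: str, exons: Exons) -> Tuple[str, List[int]]:
    """Spliced sequence plus the genomic coordinate of each of its bases."""
    seq = "".join(genome[s:e] for s, e in exons)
    coords = [i for s, e in exons for i in range(s, e)]
    return seq, coords


def orf_end(seq: str, start: int) -> int:
    """Index just past the last sense codon read in frame from `start`."""
    i = start
    while i + 3 <= len(seq) and seq[i:i + 3] not in STOP_CODONS:
        i += 3
    return i


def longest_orf(seq: str) -> Optional[Tuple[int, int]]:
    """(start, end) of the longest ATG-initiated ORF in one sequence."""
    best = None
    pos = seq.find("ATG")
    while pos != -1:
        end = orf_end(seq, pos)
        if best is None or end - pos > best[1] - best[0]:
            best = (pos, end)
        pos = seq.find("ATG", pos + 1)
    return best


def gene_start_codon(genome: str, transcripts: Dict[str, Exons]) -> Optional[int]:
    """Genomic position of the ATG that gives the longest ORF in the gene."""
    best_len, best_pos = 0, None
    for exons in transcripts.values():
        seq, coords = spliced(genome, exons)
        orf = longest_orf(seq)
        if orf and orf[1] - orf[0] > best_len:
            best_len, best_pos = orf[1] - orf[0], coords[orf[0]]
    return best_pos


def orfs_from_shared_start(genome: str, transcripts: Dict[str, Exons]) -> Dict[str, Optional[str]]:
    """Read every transcript of the gene from the same genomic ATG."""
    atg = gene_start_codon(genome, transcripts)
    orfs: Dict[str, Optional[str]] = {}
    for name, exons in transcripts.items():
        seq, coords = spliced(genome, exons)
        start = coords.index(atg) if atg in coords else -1
        if start < 0 or seq[start:start + 3] != "ATG":
            orfs[name] = None  # the shared ATG is absent or split by splicing
        else:
            orfs[name] = seq[start:orf_end(seq, start)]
    return orfs


# Toy gene: t2 retains a 4-nt intron that introduces an in-frame stop codon.
genome = "CC" + "ATG" + "AAA" + "TAGG" + "GGATGCCCCCCTAAGTGA" + "C"
transcripts = {"t1": [(0, 8), (12, 31)], "t2": [(0, 31)]}
print(orfs_from_shared_start(genome, transcripts))
# {'t1': 'ATGAAAGGATGCCCCCCTAAG', 't2': 'ATGAAA'}
```

In this toy case a per-transcript longest-ORF search on t2 would instead select the downstream ATG (ATGCCCCCC), hiding the premature stop that the shared start codon exposes.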
 
Title Quality assessment of transcript models assembled from RNA-seq data for the identification of chimeric, fragmentary and redundant transcripts
Description This program takes multiple transcriptome annotations as input and cross-references the transcript models. It then applies a series of criteria to classify conflicting transcript models as chimeric, fragmentary or redundant. Finally, the program merges the accepted (non-conflicting) models into a single transcriptome annotation.
Type Of Technology Software 
Year Produced 2018 
Impact This program allows the integration of transcriptome information coming from multiple sources, either previously curated annotations or newly assembled ones. The annotation generated by this program contains only high-confidence models, which allows accurate differential gene and transcript expression analysis with other bioinformatic tools that depend on annotated transcriptomes.
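The classify-then-merge step described in this entry might look roughly like the Python sketch below. The record does not list the actual criteria, so the three rules used here are assumptions chosen for illustration: a model is "redundant" if its intron chain is identical to an accepted model's, a "fragment" if its intron chain is a contiguous part of a longer accepted chain, and "chimeric" if its span overlaps more than one known, non-overlapping gene locus. All names and data structures are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def is_subchain(short: Tuple[Interval, ...], long: Tuple[Interval, ...]) -> bool:
    """True if `short` occurs as a contiguous run inside the longer chain."""
    n = len(short)
    return 0 < n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1))


def classify(model: dict, accepted: List[dict], gene_spans: Dict[str, Interval]) -> str:
    """Label one model; `model` has keys 'id', 'span' and 'introns'."""
    start, end = model["span"]
    overlapping = [g for g, (s, e) in gene_spans.items() if start < e and s < end]
    if len(overlapping) > 1:
        return "chimeric"
    chains = [m["introns"] for m in accepted]
    if model["introns"] and model["introns"] in chains:
        return "redundant"
    if any(is_subchain(model["introns"], chain) for chain in chains):
        return "fragment"
    return "accepted"  # mono-exonic models always land here in this sketch


def merge(annotations: List[List[dict]], gene_spans: Dict[str, Interval]):
    """Merge several annotations, keeping only non-conflicting models."""
    accepted: List[dict] = []
    rejected: Dict[str, str] = {}
    # Visit longer intron chains first so fragments meet full-length models.
    models = sorted((m for ann in annotations for m in ann),
                    key=lambda m: -len(m["introns"]))
    for m in models:
        label = classify(m, accepted, gene_spans)
        if label == "accepted":
            accepted.append(m)
        else:
            rejected[m["id"]] = label
    return accepted, rejected


gene_spans = {"G1": (0, 1000), "G2": (2000, 3000)}
ann_a = [{"id": "a1", "span": (0, 1000),
          "introns": ((100, 200), (400, 500), (700, 800))}]
ann_b = [
    {"id": "b1", "span": (50, 950), "introns": ((100, 200), (400, 500), (700, 800))},
    {"id": "b2", "span": (300, 900), "introns": ((400, 500), (700, 800))},
    {"id": "b3", "span": (600, 2500), "introns": ((700, 800), (1000, 2000))},
    {"id": "b4", "span": (2000, 3000), "introns": ((2200, 2400),)},
]
accepted, rejected = merge([ann_a, ann_b], gene_spans)
print([m["id"] for m in accepted])  # ['a1', 'b4']
print(rejected)  # {'b1': 'redundant', 'b2': 'fragment', 'b3': 'chimeric'}
```

Comparing intron chains rather than exact exon boundaries is a design choice made for this sketch, so that models differing only in their transcript ends are still treated as the same isoform.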
 
Description 3D RNA-seq training workshops at University of Leeds (October) and Nottingham (November) - presented by Runxuan Zhang, Wenbin Guo, JC Entizne. 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Study participants or study members
Results and Impact We have developed 3D RNA-seq in order to analyse the RNA-seq data generated in time-courses of Arabidopsis plants exposed to cold. 3D RNA-seq is a program for RNA-seq data analysis designed for use by biologists with minimal bioinformatics experience. It is an easy-to-use tool that provides accurate differential gene and transcript expression and differential alternative splicing analysis. It can be used for RNA-seq data from eukaryotes and has been successfully used with plants (Arabidopsis, potato, barley etc.) and animals (human, mouse etc.). It won the University of Dundee School of Life Sciences Best Innovation award in 2019. 3D RNA-seq was launched in May 2019 with publication in bioRxiv and has had great success, with nearly 2,500 users. To encourage uptake in the UK, we have run training courses at the University of Leeds (supported by GARNet) and the University of Nottingham. In addition, we have trained individuals from human and medical sciences.
Year(s) Of Engagement Activity 2019
 
Description Presentation of my computational tool for the creation of RTDs at the James Hutton Institute and at the School of Life Sciences (University of Dundee)
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact My project is leading to the development of a computational tool for the creation of high-quality transcriptome annotations. I presented my tool both at the James Hutton Institute and at the University of Dundee. I also presented an improved transcriptome annotation for the Double-Monoploid potato cultivar generated with my tool. The audiences at both the James Hutton Institute and the University of Dundee expressed interest in my tool and requested further information about when it will be available for use and to which organisms it can be applied.
Year(s) Of Engagement Activity 2018