An automated pipeline for construction of Reference Transcript Datasets (RTD) to enable rapid and accurate gene expression analysis in plant species

Lead Research Organisation: James Hutton Institute
Department Name: Information & Computational Sciences

Abstract

A gene is the basic physical and functional unit of the genome. Genes are turned on and off at different times of development and in response to external and internal signals. Protein-coding genes are copied (transcribed) into precursor messenger RNAs (pre-mRNAs), which are then processed in different ways into mRNAs that can be translated into proteins. A goal of biological research is to understand how genes work by measuring changes in gene expression. This is achieved by estimating the abundances of all of the transcripts produced at any particular time or condition.

The current technology to measure gene and transcript expression is RNA sequencing (RNA-seq), which, by sequencing millions of transcripts, allows RNA levels to be measured on a genome-wide scale. The two main platforms are Illumina, which generates short reads (currently 75 to 250 bp), and PacBio/Nanopore single molecule sequencing, which produces full-length transcript reads. To measure gene expression, Illumina short reads are often mapped to the genome and assembled into transcripts, which is an inaccurate process. PacBio/Nanopore have high sequencing error rates and do not generate sufficient depth of coverage of genes. These technologies, both in terms of chemistry and computational analyses, continue to advance at a rapid pace, but a combination of the platforms is currently the best approach to generate RNA-seq data. In addition, the fastest and most accurate programs for computational quantification of transcript and gene expression require a comprehensive catalogue of transcripts, which we call a Reference Transcript Dataset (RTD).

Over the last four years, we developed an RTD for Arabidopsis (AtRTD2) based on extensive Illumina short read sequences. Through a series of iterations, we developed the computational methods to identify and retain high confidence transcripts while removing false transcripts. AtRTD2 greatly increased the accuracy of the quantification allowing, for example, identification of novel transcription and splicing factors in response to cold. The challenge now is to translate this knowledge and experience to other plant and crop (and animal) species. Currently, transcript sequence catalogues for most plant species are incomplete, missing large numbers of transcripts, and for those with RNA-seq data, out-of-date analysis procedures have produced large numbers of false transcripts.

From developing AtRTD2, we have a prototype pipeline for constructing an RTD. The key features are multiple quality control filters which remove mis-assembled transcripts, redundant transcripts, chimaeric transcripts and transcript fragments. These multiple, iterative steps are currently individually coded and, while the pipeline can be used, it takes up to 12 months to generate an RTD and requires the full-time expertise of a bioinformatician.
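The quality control filters are described here only at a high level. As a minimal sketch, and not the AtRTD2 implementation itself, two of them (removal of redundant transcripts and of transcript fragments) could be expressed by comparing intron chains. The function names, the representation of a transcript as a tuple of intron coordinates, and the decision to keep all mono-exonic transcripts are assumptions made for illustration; chromosome and strand are ignored for brevity.

```python
from typing import Dict, List, Tuple

Intron = Tuple[int, int]          # (start, end) coordinates of one intron
IntronChain = Tuple[Intron, ...]  # ordered introns of one transcript


def is_contiguous_subchain(short: IntronChain, long: IntronChain) -> bool:
    """Return True if `short` is a strictly shorter consecutive run of `long`."""
    n, m = len(short), len(long)
    if n == 0 or n >= m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))


def filter_transcripts(transcripts: Dict[str, IntronChain]) -> List[str]:
    """Return IDs (in input order) that survive the toy redundancy and fragment filters."""
    # Redundancy: remember the first transcript seen for each intron chain.
    first_id: Dict[IntronChain, str] = {}
    for tid, chain in transcripts.items():
        if chain:
            first_id.setdefault(chain, tid)

    kept: List[str] = []
    for tid, chain in transcripts.items():
        if not chain:
            # Assumption: mono-exonic transcripts have no introns to compare, so keep them.
            kept.append(tid)
        elif first_id[chain] != tid:
            continue  # redundant: same intron chain as an earlier transcript
        elif any(is_contiguous_subchain(chain, other) for other in first_id):
            continue  # fragment: intron chain is contained in a longer transcript
        else:
            kept.append(tid)
    return kept


# Hypothetical toy data, not taken from any real annotation.
example = {
    "t1": ((100, 200), (300, 400), (500, 600)),
    "t2": ((100, 200), (300, 400), (500, 600)),  # same chain as t1
    "t3": ((300, 400), (500, 600)),              # sub-chain of t1
    "t4": ((100, 200), (350, 400)),              # distinct isoform
    "t5": (),                                    # mono-exonic
}
print(filter_transcripts(example))
```

On this toy input the sketch keeps `t1`, `t4` and `t5`. A production filter would also need to handle chimaeric transcripts and mis-assemblies, which this example does not attempt.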

We will develop a fully automated pipeline (RTDBox) which can be used by scientists with basic bioinformatics skills or bioinformaticians with little experience in transcriptomics. The pipeline will also be designed to allow the incremental improvement of the RTD with the automatic incorporation of any new RNA-seq data (Illumina, PacBio, Nanopore). Within the pipeline, we will develop a transcript evaluation suite (TES) which will provide evaluation metrics to help biologists to identify and remove mis-constructed transcripts from assembly programs as well as understand the quality and completeness of the RTD generated. All our experience and expertise will be brought together to make user-friendly software for plant scientists to measure gene expression more accurately and thereby improve the exploration of biological processes across the globe.

Technical Summary

For the majority of plant and crop species, transcript information is incomplete and poorly annotated. AtRTD2 shows the feasibility of building a comprehensive RTD and both Illumina and PacBio/Nanopore are required for complete and comprehensive RTD construction. We have the necessary knowledge and expertise to produce an automated, easy-to-use pipeline for building RTDs and allowing incorporation of new RNA-seq datasets as they arise.

The automated pipeline and software will be designed for use by scientists with basic bioinformatics skills or bioinformaticians with little experience in transcriptomics. RTDBox will be available in several formats, on different platforms, that will provide flexible access: 1) A local Galaxy server will allow users to upload sequence data, run the pipeline and download the RTD directly; 2) The pipeline will be set up on publicly available platforms, such as Cyverse (https://www.cyverse.org/) and GigaGalaxy (http://gigagalaxy.net/); 3) The wrapped pipeline will also be available in the Galaxy Toolshed for download and installation by groups with local Galaxy infrastructure who prefer to keep their data private; 4) The pipeline will also be wrapped in Docker containers so that it can be downloaded and run under Unix. It will have a modular construction covering the major functions: uploading RNA-seq data, quality control and trimming (if needed), read mapping and transcript assembly using different assembly programs. Separate automated pipelines for Illumina short-read and single molecule sequencing will be included along with stringent quality controls such as splice junction assessment (archived through SJ and SJ phase databases). Merging of different assemblies (new and existing) and further quality control to remove redundancy, fragments, etc. are performed in the Transcript Evaluation Suite (TES). TES provides evaluation metrics to help biologists to understand the quality and completeness of the RTD generated.
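The summary does not specify how splice junction assessment is carried out. Purely as an illustrative sketch, one common style of check is to require a recognised donor/acceptor dinucleotide pair and a minimum number of supporting reads for every junction in a transcript. The class, field names, the motif set and the `min_reads` default below are all assumptions, not RTDBox parameters.

```python
from dataclasses import dataclass
from typing import Iterable

# Commonly recognised donor/acceptor dinucleotide pairs (assumed filter set).
CANONICAL_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


@dataclass(frozen=True)
class SpliceJunction:
    chrom: str
    intron_start: int
    intron_end: int
    donor: str             # first two bases of the intron
    acceptor: str          # last two bases of the intron
    supporting_reads: int  # reads spanning this junction


def junction_passes(sj: SpliceJunction, min_reads: int = 3) -> bool:
    """Accept a junction with a recognised motif and enough read support."""
    motif = (sj.donor.upper(), sj.acceptor.upper())
    return motif in CANONICAL_MOTIFS and sj.supporting_reads >= min_reads


def transcript_passes(junctions: Iterable[SpliceJunction], min_reads: int = 3) -> bool:
    """A transcript passes only if every one of its junctions passes.

    A transcript with no junctions (mono-exonic) passes trivially.
    """
    return all(junction_passes(sj, min_reads) for sj in junctions)


# Hypothetical junctions for illustration only.
good = SpliceJunction("chr1", 100, 200, "GT", "AG", 10)
weak = SpliceJunction("chr1", 300, 400, "GT", "AG", 1)
odd = SpliceJunction("chr1", 500, 600, "CT", "AC", 25)

print(transcript_passes([good]))        # True
print(transcript_passes([good, weak]))  # False: too few supporting reads
print(transcript_passes([good, odd]))   # False: unrecognised motif
```

Keeping the junction check separate from the transcript check mirrors the modular construction described above, so the thresholds could be tuned per sequencing platform.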

Planned Impact

The main output of this work will be development and provision of the automated computational pipeline, RTDBox, to construct high quality RTDs for the plant research community and beyond. The major impact will be the uptake of the RTDBox by different plant communities to generate RTDs for different plant species, cultivars or ecotypes. We envisage two significant primary impacts of the pipeline and software:
1. the ability of plant researchers to carry out high quality RNA-seq analysis of gene expression more quickly and accurately to improve understanding of gene regulation and identification of novel genes in biological processes.
2. the means to evaluate the quality of existing and future transcript assemblies. Current literature and databases contain thousands of mis-annotated transcript isoforms with insufficient quality control; the pipeline will permit rapid re-analysis and clean-up of such data as part of processing of a new RTD for analysis of RNA-seq.

The main challenge is to raise awareness of the importance and opportunities of having high quality, comprehensive RTDs. To ensure speedy uptake and exploitation of RTDs, we have three Impact Objectives:
1. Inform the plant community of the value of the use of the RTD well ahead of a primary release of RTDBox allowing groups to design and plan RNA-seq experiments and even apply for funding to make an RTD.
2. Inform the plant community of the value of working at the transcript level for differential expression data analyses including AS and improving accuracy of downstream analyses (e.g. gene and splicing networks).
3. Release the RTDBox to the plant community as soon as possible through a range of platforms for ease of access and monitor uptake.

To achieve these objectives, we have four Impact Activities:
1) Publicise the need and importance of RTDs and encourage the use of the RTDBox in plant communities. The PI/Co-Is will emphasise the benefits of RTDs and the importance of a comprehensive and accurate transcript annotation on downstream analysis at national and international meetings, invited seminars, plant science community newsletters, social media and publications. In particular, we will contact plant science research group leaders in the UK with details of the project and, in a highly interactive way, we will visit the 10-12 main University and Institute plant science departments/groupings in the UK to make presentations on the value and advantages of RTD construction in the 6-9 month period of the grant.

2) Ensuring that potential beneficiaries have the opportunity to engage fully with the research. By the end of the first year, RTDBox will be released on Github, a publicly available Galaxy server and other platforms (e.g. Docker). We will provide a user-friendly graphical user interface and detailed user manuals on how to use RTDBox, and use online methods to monitor access and obtain feedback for improvement. We will commit to maintaining the RTD Galaxy server for at least two years after the project and will try to obtain funding for longer.

3) Release RTDs for tomato, potato and lettuce for improved RNA-seq analysis. We will contact the research groups responsible for genome annotation and resources in tomato, lettuce and potato in preparation for the release of the species RTDs. These RTDs will be made available on other genome browsers and genome resource websites (e.g. IGB, Ensembl and Gramene). We can monitor the downloads for these databases and associated citations for long term success.

4) Public engagement and PDRA career development. We regularly have opportunities for public engagement at the University of Dundee and James Hutton Institute and the PI/Co-I and PDRA will take part. We will provide the PDRA with formal mentoring and appraisal with a focus on supporting career development. JHI has a formal programme of appraisal for PDRAs designed to identify training needs and opportunities to develop a career path.
 
Description Over the last 6 years, we developed an RTD for Arabidopsis (AtRTD2) based on extensive Illumina short-read sequences. Through a series of iterations, we developed computational methods to identify and retain high confidence transcripts while removing false transcripts. AtRTD2 greatly increased the accuracy of the quantification allowing, for example, identification of novel transcription and splicing factors in response to cold. It has now been translated to other plant and crop (and animal) species, such as barley, potato, rice and oil palm. Currently, transcript sequence catalogues for most plant species are incomplete, missing large numbers of transcripts, and for those with RNA-seq data, out-of-date analysis programs have produced large numbers of false transcripts.

In the past year, we have
1) improved and formalized the short-read assembly method and pipeline
2) developed a novel computational method to define transcripts accurately from PacBio Iso-seq data
3) developed an R package that allows PacBio Iso-seq data analysis using the above method
4) developed a software solution that authenticates users by email so that they can access their analyses
5) developed a web interface that allows users to carry out and control the analysis process
Exploitation Route We will develop a fully automated pipeline (RTDBox) that can be used by scientists with basic bioinformatics skills or bioinformaticians with little experience in transcriptomics. The program will also be designed to allow the incremental improvement of the RTD with the automatic incorporation of any new RNA-seq data (Illumina, PacBio, Nanopore). Within the pipeline, we will develop a transcript evaluation suite that will provide evaluation metrics to help biologists to identify and remove mis-constructed transcripts from assembly programs as well as understand the quality and completeness of the RTD generated. All our experience and expertise will be brought together to make user-friendly software for plant scientists to measure gene expression more accurately and thereby improve the exploration of biological processes across the globe.

The short-read pipeline has now been used to construct transcript references for a number of projects, including potato, barley, lettuce, and raspberry. It has also been employed in a barley pan-transcriptome project to construct transcript references for 20 different barley cultivars. RTDBox can be used to generate transcript annotations for fast and accurate quantification using RNA-seq data, and the 3D RNA-seq pipeline developed in my group can then be used for differential gene expression and alternative splicing analysis.
Sectors Agriculture, Food and Drink, Healthcare, Pharmaceuticals and Medical Biotechnology

URL https://rtdbox.hutton.ac.uk/#/
 
Description Australia Partnering Award: International pooling for advanced cereal science - IPAC
Amount £47,766 (GBP)
Funding ID BB/V018299/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 04/2021 
End 03/2024
 
Description Create new opportunities to exploit barley resources and accelerate breeding
Amount £30,612 (GBP)
Funding ID BB/V018906/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 04/2021 
End 03/2025
 
Description SEFARI Workshop: 3D RNA-seq App - A flexible and powerful tool for differential expression and alternative splicing analysis of RNA-seq data for biologists
Amount £9,807 (GBP)
Organisation Scottish Environment, Food and Agriculture Research Institutes Gateway 
Sector Charity/Non Profit
Start 11/2020 
End 11/2020
 
Description The Generation Gap - Mechanisms of maternal control on grain
Amount £88,838 (GBP)
Funding ID BB/W002590/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 09/2021 
End 09/2024
 
Title 3D RNA-seq App - A flexible and powerful tool for differential expression and alternative splicing analysis of RNA-seq data for biologists
Description RNA-sequencing (RNA-seq) analysis of gene expression and alternative splicing should be routine and robust but is often a bottleneck for biologists because of different and complex analysis programs and reliance on specialized bioinformatics skills. We have developed the '3D RNA-seq' App, an R shiny App and web-based pipeline for the comprehensive analysis of RNA-seq data from any organism. It represents an easy-to-use, flexible and powerful tool for analysis of both gene and transcript-level gene expression to identify differential gene/transcript expression, differential alternative splicing and differential transcript usage (3D) as well as isoform switching from RNA-seq data. 3D RNA-seq integrates state-of-the-art differential expression analysis tools and adopts best practice for RNA-seq analysis. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact The program is designed to be run by biologists with minimal bioinformatics experience (or by bioinformaticians) allowing lab scientists to analyse their RNA-seq data. It achieves this by operating through a user-friendly graphical interface that automates the data flow through the programs in the pipeline. The comprehensive analysis performed by 3D RNA-seq is extremely rapid and accurate, can handle complex experimental designs, allows user setting of statistical parameters, visualizes the results through graphics and tables, and generates publication-quality figures such as heat-maps, expression profiles and GO enrichment plots. The manuscript has been cited 14 times in just over one year and >4,400 users globally have used 3D RNA-seq for their RNA-seq analysis, a quarter of whom are returning and regular users who have used our tool on multiple occasions.
URL http://3drnaseq.hutton.ac.uk/
 
Title RTDBox 
Description RTDBox is a computational pipeline that allows scientists to construct a high-quality transcript reference, which enables fast and accurate quantification of gene expression using RNA-seq data. We have established cutting-edge methods for filtering misassembled transcripts from Illumina short-read assemblies and PacBio Iso-seq long reads. We also provide a web interface that allows this analysis to be carried out quickly and easily without coding. It is in a testing phase and will be made available to the public once it has been thoroughly tested.
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? No  
Impact The transcriptome reference plays a key role in gene expression quantification, as an incomplete or misassembled transcriptome often leads to erroneous gene expression quantification. Our new pipeline will allow the rapid construction of a high-quality transcriptome reference by incorporating a range of stringent filters to remove mis-assembled transcripts. For the PacBio long-read pipeline, we also developed a method that defines transcript starts and ends accurately, which not only improves the accuracy of gene expression quantification but also allows the study of transcriptional regulation, such as polyadenylation and alternative splicing.
URL https://rtdbox.hutton.ac.uk/#/
 
Description 3D RNA-seq training workshop 
Organisation Australian National University (ANU)
Country Australia 
Sector Academic/University 
PI Contribution We delivered a 3D RNA-seq training workshop at the Australian National University (https://www.eventbrite.com.au/e/3d-rna-seq-workshop-tickets-556207510637).
Collaborator Contribution The participants provided feedback on how to improve the 3D RNA-seq tool as well as the training.
Impact not available yet
Start Year 2023
 
Description RTDBox will be validated on three crop species: lettuce, tomato and potato 
Organisation University of Cambridge
Country United Kingdom 
Sector Academic/University 
PI Contribution RTDBox is being developed to automate the construction of comprehensive and high-quality transcriptomes for plant species using high-throughput sequencing data. We have budgeted for Illumina short-read sequencing and PacBio sequencing for three exemplar crop species: lettuce (in collaboration with Prof Katherine Denby at the University of York), tomato (in collaboration with Prof David Baulcombe at the University of Cambridge) and potato (in collaboration with Dr Ingo Hein at the University of Dundee). I have contacted all of the above collaborators and informed them of the project schedule so that they are ready to make RNA available for sequencing.
Collaborator Contribution Discussions and plans were made with all collaborators on how to proceed with the generation and preparation of the samples.
Impact no outputs yet
Start Year 2019
 
Description RTDBox will be validated on three crop species: lettuce, tomato and potato 
Organisation University of Dundee
Country United Kingdom 
Sector Academic/University 
PI Contribution RTDBox is being developed to automate the construction of comprehensive and high-quality transcriptomes for plant species using high-throughput sequencing data. We have budgeted for Illumina short-read sequencing and PacBio sequencing for three exemplar crop species: lettuce (in collaboration with Prof Katherine Denby at the University of York), tomato (in collaboration with Prof David Baulcombe at the University of Cambridge) and potato (in collaboration with Dr Ingo Hein at the University of Dundee). I have contacted all of the above collaborators and informed them of the project schedule so that they are ready to make RNA available for sequencing.
Collaborator Contribution Discussions and plans were made with all collaborators on how to proceed with the generation and preparation of the samples.
Impact no outputs yet
Start Year 2019
 
Description RTDBox will be validated on three crop species: lettuce, tomato and potato 
Organisation University of York
Department Department of Biology
Country United Kingdom 
Sector Academic/University 
PI Contribution RTDBox is being developed to automate the construction of comprehensive and high-quality transcriptomes for plant species using high-throughput sequencing data. We have budgeted for Illumina short-read sequencing and PacBio sequencing for three exemplar crop species: lettuce (in collaboration with Prof Katherine Denby at the University of York), tomato (in collaboration with Prof David Baulcombe at the University of Cambridge) and potato (in collaboration with Dr Ingo Hein at the University of Dundee). I have contacted all of the above collaborators and informed them of the project schedule so that they are ready to make RNA available for sequencing.
Collaborator Contribution Discussions and plans were made with all collaborators on how to proceed with the generation and preparation of the samples.
Impact no outputs yet
Start Year 2019
 
Description barley long read analysis for heat stress 
Organisation University of Silesia
Country Poland 
Sector Academic/University 
PI Contribution Using our established short-read pipeline, we are testing and collecting user feedback through a collaboration with a research group from the University of Silesia in Katowice, led by Dr. Agata Daszkowska, to analyse PacBio long-read sequencing data to study heat stress in barley.
Collaborator Contribution The research group from the University of Silesia in Katowice, led by Dr. Agata Daszkowska, has provided advice on how to improve the accessibility of the RTDBox pipeline.
Impact not available yet
Start Year 2022
 
Description common bean RTD 
Organisation IPK Gatersleben
Country Germany 
Sector Private 
PI Contribution Using our established short-read pipeline, we are testing and collecting user feedback through a collaboration with a research group from IPK. Dr. Beate Fraust visited us in Dundee and we trained her to use RTDBox for developing an RTD for common bean.
Collaborator Contribution Feedback was provided on what to improve in RTDBox.
Impact common bean RTD
Start Year 2022
 
Description lettuce RTD 
Organisation University of York
Department Department of Biology
Country United Kingdom 
Sector Academic/University 
PI Contribution My team will use the PacBio Iso-seq and Illumina sequencing data generated from samples harvested in Prof Denby's lab to generate a high-quality lettuce transcriptome using the pipeline established in this project.
Collaborator Contribution Prof Katherine Denby generated the RNAs from diverse tissues and experimental conditions for Illumina and PacBio sequencing.
Impact A high-quality lettuce transcriptome that allows accurate and fast gene quantification using RNA-seq
Start Year 2021
 
Description tomato RTD 
Organisation University of Oxford
Department Department of Plant Sciences
Country United Kingdom 
Sector Academic/University 
PI Contribution We are using the RTDBox we developed to construct a comprehensive and high-quality RTD in tomato.
Collaborator Contribution Sara Lopez Gomollon generated the tomato samples and extracted the RNAs that were sent for sequencing.
Impact n/a
Start Year 2021
 
Title 3D RNA-seq: a powerful and flexible tool for rapid and accurate differential expression and alternative splicing analysis of RNA-seq data for biologists 
Description The '3D RNA-seq' App is an R shiny App and web-based pipeline for the comprehensive analysis of RNA-seq data from any organism. It represents an easy-to-use, flexible and powerful tool for analysis of both gene and transcript-level gene expression to identify differential gene/transcript expression, differential alternative splicing and differential transcript usage (3D) as well as isoform switching from RNA-seq data. 3D RNA-seq integrates state-of-the-art differential expression analysis tools and adopts best practice for RNA-seq analysis. The program is designed to be run by biologists with minimal bioinformatics experience (or by bioinformaticians) allowing lab scientists to analyse their RNA-seq data. It achieves this by operating through a user-friendly graphical interface that automates the data flow through the programs in the pipeline. The comprehensive analysis performed by 3D RNA-seq is extremely rapid and accurate, can handle complex experimental designs, allows user setting of statistical parameters, visualizes the results through graphics and tables, and generates publication-quality figures such as heat-maps, expression profiles and GO enrichment plots.
Type Of Technology Webtool/Application 
Year Produced 2019 
Open Source License? Yes  
Impact The publication has been cited 14 times and >4,400 users globally have used the tool for their RNA-seq analysis, a quarter of whom are regular and returning users who have used it on multiple occasions.
URL http://3drnaseq.hutton.ac.uk
 
Title RTDBox 
Description RTDBox is a computational pipeline that allows scientists to construct a high-quality transcript reference, which enables fast and accurate quantification of gene expression using RNA-seq data. We have established cutting-edge methods for filtering misassembled transcripts from Illumina short-read assemblies and PacBio Iso-seq long reads. We also provide a web interface that allows this analysis to be carried out quickly and easily without coding.
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact The transcriptome reference plays a key role in gene expression quantification, as an incomplete or misassembled transcriptome often leads to erroneous gene expression quantification. Our new pipeline will allow the rapid construction of a high-quality transcriptome reference by incorporating a range of stringent filters to remove mis-assembled transcripts. For the PacBio long-read pipeline, we also developed a method that defines transcript starts and ends accurately, which not only improves the accuracy of gene expression quantification but also allows the study of transcriptional regulation, such as polyadenylation and alternative splicing.
URL https://rtdbox.hutton.ac.uk/#/
 
Description Deliver 3D RNA-seq training workshop at Australian National University 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Wenbin and I delivered a two-day workshop on our software 3D RNA-seq (https://3drnaseq.hutton.ac.uk/app_direct/3DRNAseq/) for the analysis of transcriptomics data.

Sessions were held on the following dates and times:

Wednesday 1 March 2023, 9:00am to 12:00 noon, Seminar Rooms 1 & 2
Thursday 2 March 2023, 9:00am to 12:00 noon, Seminar Rooms 1 & 2
Year(s) Of Engagement Activity 2023
URL https://www.eventbrite.com.au/e/3d-rna-seq-workshop-tickets-556207510637
 
Description poster presentation at RECOMB 2022 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I presented a poster titled "Novel computational methods for high-resolution single molecule sequencing-based transcriptomes in Arabidopsis and barley" at RECOMB 2022, San Diego, engaging with 20+ scientists.
Year(s) Of Engagement Activity 2022
 
Description presentation at the International Conference on Arabidopsis Research (ICAR)
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I presented a talk titled "A high-resolution single molecule sequencing based Arabidopsis transcriptome using novel methods of Iso-seq analysis" at the International Conference on Arabidopsis Research (ICAR), Belfast, 20-24 June 2022.
Year(s) Of Engagement Activity 2022
 
Description transcriptome data analysis training workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We held a transcriptome data analysis training workshop covering the development of high-quality transcript reference datasets for accurate quantification (Zhang et al., 2017; Zhang et al., 2022) and the use of 3D RNA-seq to carry out comprehensive and high-quality gene expression analysis. The 3D RNA-seq app (Guo et al., 2021) was developed at the James Hutton Institute, has over 8,700 users globally and has been cited 49 times by plant, animal and human studies since 2019. The workshop was attended by 18 participants, ranging from students and post-docs to permanent staff at the IPK, including three participants travelling from Poland. All participants had a chance to run through the app with a test dataset and were keen to use it on their own datasets afterwards. Overwhelmingly positive feedback has been received through different channels.
Year(s) Of Engagement Activity 2022
URL https://www.denbi.de/training/1469-3d-rna-seq-a-flexible-and-powerful-tool-for-differential-expressi...