An automated pipeline for construction of Reference Transcript Datasets (RTD) to enable rapid and accurate gene expression analysis in plant species

Lead Research Organisation: The James Hutton Institute
Department Name: Information & Computational Sciences

Abstract

A gene is the basic physical and functional unit of the genome. Genes are turned on and off at different times in development and in response to external and internal signals. Protein-coding genes are copied (transcribed) into precursor messenger RNAs (pre-mRNAs), which are then processed in different ways into mRNAs that can be translated into proteins. A goal of biological research is to understand how genes work by measuring changes in gene expression. This is achieved by estimating the abundances of all of the transcripts produced at any particular time or condition.

The current technology for measuring gene and transcript expression is RNA sequencing (RNA-seq), which, by sequencing millions of transcripts, allows RNA levels to be measured on a genome-wide scale. The two main platforms are Illumina, which generates short reads (currently 75 to 250 bp), and PacBio/Nanopore single molecule sequencing, which produces full-length transcript reads. To measure gene expression, Illumina short reads are often mapped to the genome and assembled into transcripts, which is an inaccurate process. PacBio/Nanopore reads have high sequencing error rates and do not generate sufficient depth of coverage of genes. These technologies, both in terms of chemistry and computational analyses, continue to advance at a rapid pace, but a combination of the platforms is currently the best approach to generating RNA-seq data. In addition, the fastest and most accurate programs for computational quantification of transcript and gene expression require a comprehensive catalogue of transcripts, which we call a Reference Transcript Dataset (RTD).

Over the last four years, we developed an RTD for Arabidopsis (AtRTD2) based on extensive Illumina short read sequences. Through a series of iterations, we developed the computational methods to identify and retain high confidence transcripts while removing false transcripts. AtRTD2 greatly increased the accuracy of the quantification allowing, for example, identification of novel transcription and splicing factors in response to cold. The challenge now is to translate this knowledge and experience to other plant and crop (and animal) species. Currently, transcript sequence catalogues for most plant species are incomplete, missing large numbers of transcripts, and for those with RNA-seq data, out-of-date analysis procedures have produced large numbers of false transcripts.

From developing AtRTD2, we have a prototype pipeline for constructing an RTD. The key features are multiple quality control filters which remove mis-assembled transcripts, redundant transcripts, chimaeric transcripts and transcript fragments. These multiple, iterative steps are currently individually coded and while the pipeline can be used, it will take up to 12 months to generate an RTD and requires the full-time expertise of a bioinformatician.
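As an illustration only, two of the filters named above (removal of redundant transcripts and of transcript fragments) can be sketched as follows. This is a minimal sketch under stated assumptions, not the AtRTD2 or RTDBox implementation: each transcript is represented by its ordered list of introns, "redundant" is taken to mean an identical intron chain, and "fragment" is taken to mean an intron chain that is a contiguous part of a longer one. The function names and the example transcript IDs are hypothetical.

```python
from typing import Dict, List, Tuple

Intron = Tuple[int, int]  # (start, end) genomic coordinates of one intron


def is_fragment(chain: List[Intron], other: List[Intron]) -> bool:
    """True if `chain` is a strictly shorter, contiguous run of `other`."""
    n, m = len(chain), len(other)
    if n == 0 or n >= m:
        return False
    return any(other[i:i + n] == chain for i in range(m - n + 1))


def filter_transcripts(transcripts: Dict[str, List[Intron]]) -> List[str]:
    """Return sorted IDs of transcripts that survive both filters."""
    # Step 1 (assumed definition of redundancy): collapse transcripts with
    # identical intron chains, keeping the first ID seen. Mono-exon
    # transcripts (empty chain) are passed through untouched in this sketch.
    first_with_chain: Dict[Tuple[Intron, ...], str] = {}
    mono_exon: List[str] = []
    for tid, chain in transcripts.items():
        if not chain:
            mono_exon.append(tid)
        else:
            first_with_chain.setdefault(tuple(chain), tid)

    # Step 2 (assumed definition of a fragment): drop any transcript whose
    # intron chain is contained in the chain of a longer transcript.
    chains = [list(c) for c in first_with_chain]
    kept = [
        tid
        for c, tid in first_with_chain.items()
        if not any(is_fragment(list(c), other) for other in chains)
    ]
    return sorted(kept + mono_exon)


# Hypothetical input: t2 duplicates t1, t3 is a fragment of t1.
example = {
    "t1": [(100, 200), (300, 400), (500, 600)],
    "t2": [(100, 200), (300, 400), (500, 600)],
    "t3": [(300, 400)],
    "t4": [(100, 200), (350, 400)],
    "t5": [],
}
print(filter_transcripts(example))
```

A production filter would also need to account for strand, chromosome and transcript ends, which this sketch ignores for brevity.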

We will develop a fully automated pipeline (RTDBox) which can be used by scientists with basic bioinformatics skills or bioinformaticians with little experience in transcriptomics. The pipeline will also be designed to allow incremental improvement of the RTD through the automatic incorporation of any new RNA-seq data (Illumina, PacBio, Nanopore). Within the pipeline, we will develop a transcript evaluation suite (TES) which will provide evaluation metrics to help biologists identify and remove mis-constructed transcripts from assembly programs, as well as understand the quality and completeness of the RTD generated. All our experience and expertise will be brought together to make user-friendly software that allows plant scientists to measure gene expression more accurately, thereby improving the exploration of biological processes across the globe.

Technical Summary

For the majority of plant and crop species, transcript information is incomplete and poorly annotated. AtRTD2 demonstrates the feasibility of building a comprehensive RTD, and both Illumina and PacBio/Nanopore data are required for complete and comprehensive RTD construction. We have the necessary knowledge and expertise to produce an automated, easy-to-use pipeline for building RTDs and allowing incorporation of new RNA-seq datasets as they arise.

The automated pipeline and software will be designed for use by scientists with basic bioinformatics skills or bioinformaticians with little experience in transcriptomics. RTDBox will be available in several formats, on different platforms, to provide flexible access:
1) A local Galaxy server will allow users to upload sequence data, run the pipeline and download the RTD directly;
2) The pipeline will be set up on publicly available platforms, such as Cyverse (https://www.cyverse.org/) and GigaGalaxy (http://gigagalaxy.net/);
3) The wrapped pipeline will also be available in the Galaxy Toolshed for download and installation by groups that have local Galaxy infrastructure and prefer to keep their data private;
4) The pipeline will also be wrapped in Docker containers so that it can be downloaded and run under Unix.

It will have a modular construction covering the major functions: uploading RNA-seq data, quality control and trimming (if needed), read mapping and transcript assembly using different assembly programs. Separate automated pipelines for Illumina short read and single molecule sequencing will be included along with stringent quality controls such as splice junction assessment (archived through SJ and SJ phase databases). Merging of different assemblies (new and existing) and further quality control to remove redundancy, fragments etc. are performed in the Transcript Evaluation Suite (TES). TES provides evaluation metrics to help biologists understand the quality and completeness of the RTD generated.
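To make the idea of splice junction assessment concrete, the sketch below shows one plausible form such a quality control could take. It is an assumption-laden illustration, not the RTDBox method: it accepts a junction only if its intron boundary dinucleotides form one of the commonly recognised pairs (GT-AG, GC-AG, AT-AC) and if it is supported by a minimum number of reads. The `SpliceJunction` type, the function names and the `min_reads` default of 3 are all hypothetical choices made for this example.

```python
from typing import List, NamedTuple


class SpliceJunction(NamedTuple):
    donor: str         # first two bases of the intron, e.g. "GT"
    acceptor: str      # last two bases of the intron, e.g. "AG"
    read_support: int  # number of reads spanning the junction


# Commonly recognised intron boundary dinucleotide pairs.
ACCEPTED_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def junction_passes(sj: SpliceJunction, min_reads: int = 3) -> bool:
    """Accept a junction with a recognised motif and enough read support.

    The threshold `min_reads` is an illustrative assumption.
    """
    motif_ok = (sj.donor.upper(), sj.acceptor.upper()) in ACCEPTED_PAIRS
    return motif_ok and sj.read_support >= min_reads


def transcript_passes(junctions: List[SpliceJunction], min_reads: int = 3) -> bool:
    """A transcript is kept only if every one of its junctions passes.

    A mono-exon transcript (no junctions) passes trivially in this sketch.
    """
    return all(junction_passes(sj, min_reads) for sj in junctions)


# Hypothetical usage: the second junction has too few supporting reads.
candidate = [SpliceJunction("GT", "AG", 12), SpliceJunction("GT", "AG", 1)]
print(transcript_passes(candidate))
```

Requiring every junction to pass is a deliberately strict design choice here: a single unsupported junction is enough to mark the whole transcript as suspect.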

Planned Impact

The main output of this work will be development and provision of the automated computational pipeline, RTDBox, to construct high quality RTDs for the plant research community and beyond. The major impact will be the uptake of the RTDBox by different plant communities to generate RTDs for different plant species, cultivars or ecotypes. We envisage two significant primary impacts of the pipeline and software:
1. the ability of plant researchers to carry out high quality RNA-seq analysis of gene expression more quickly and accurately to improve understanding of gene regulation and identification of novel genes in biological processes.
2. the means to evaluate the quality of existing and future transcript assemblies. Current literature and databases contain thousands of mis-annotated transcript isoforms with insufficient quality control; the pipeline will permit rapid re-analysis and clean-up of such data as part of processing a new RTD for analysis of RNA-seq.

The main challenge is to raise awareness of the importance and opportunities of having high quality, comprehensive RTDs. To ensure speedy uptake and exploitation of RTDs, we have three Impact Objectives:
1. Inform the plant community of the value of using an RTD well ahead of the primary release of RTDBox, allowing groups to design and plan RNA-seq experiments and even apply for funding to make an RTD.
2. Inform the plant community of the value of working at the transcript level for differential expression data analyses, including alternative splicing (AS), and for improving the accuracy of downstream analyses (e.g. gene and splicing networks).
3. Release the RTDBox to the plant community as soon as possible through a range of platforms for ease of access and monitor uptake.

To achieve these objectives, we have four Impact Activities:
1) Publicise the need for and importance of RTDs and encourage the use of RTDBox in plant communities. The PI/Co-Is will emphasise the benefits of RTDs and the importance of a comprehensive and accurate transcript annotation for downstream analysis at national and international meetings, in invited seminars, plant science community newsletters, social media and publications. In particular, we will contact plant science research group leaders in the UK with details of the project and, in a highly interactive way, we will visit the 10-12 main University and Institute plant science departments/groupings in the UK to make presentations on the value and advantages of RTD construction in the 6-9 month period of the grant.

2) Ensure that potential beneficiaries have the opportunity to engage fully with the research. By the end of the first year, RTDBox will be released on GitHub, a publicly available Galaxy server and other platforms (e.g. Docker). We will provide a user-friendly graphical user interface and detailed user manuals on how to use RTDBox, and will use online methods to monitor access and obtain feedback for improvement. We commit to maintaining the RTD Galaxy server for at least two years after the project and will try to obtain funding to maintain it for longer.

3) Release RTDs for tomato, potato and lettuce for improved RNA-seq analysis. We will contact the research groups responsible for genome annotation and resources in tomato, lettuce and potato in preparation for the release of the species RTDs. These RTDs will also be made available on genome browsers and genome resource websites (e.g. IGB, Ensembl and Gramene). We can monitor the downloads from these databases and the associated citations as measures of long-term success.

4) Public engagement and PDRA career development. We regularly have opportunities for public engagement at the University of Dundee and James Hutton Institute and the PI/Co-I and PDRA will take part. We will provide the PDRA with formal mentoring and appraisal with a focus on supporting career development. JHI has a formal programme of appraisal for PDRAs designed to identify training needs and opportunities to develop a career path.

Publications

 
Description Over the last 6 years, we developed an RTD for Arabidopsis (AtRTD2) based on extensive Illumina short read sequences. Through a series of iterations, we developed the computational methods to identify and retain high confidence transcripts while removing false transcripts. AtRTD2 greatly increased the accuracy of the quantification allowing, for example, identification of novel transcription and splicing factors in response to cold. It has now been translated to other plant and crop (and animal) species, such as barley, potato, rice and oil palm. Currently, transcript sequence catalogues for most plant species are incomplete, missing large numbers of transcripts, and for those with RNA-seq data, out-of-date analysis programs have produced large numbers of false transcripts.
Exploitation Route We will develop a fully automated pipeline (RTDBox) which can be used by scientists with basic bioinformatics skills or bioinformaticians with little experience in transcriptomics. The pipeline will also be designed to allow incremental improvement of the RTD through the automatic incorporation of any new RNA-seq data (Illumina, PacBio, Nanopore). Within the pipeline, we will develop a transcript evaluation suite (TES) which will provide evaluation metrics to help biologists identify and remove mis-constructed transcripts from assembly programs, as well as understand the quality and completeness of the RTD generated. All our experience and expertise will be brought together to make user-friendly software that allows plant scientists to measure gene expression more accurately, thereby improving the exploration of biological processes across the globe.
Sectors Agriculture, Food and Drink; Healthcare; Pharmaceuticals and Medical Biotechnology

 
Description RTDBox will be validated on three crop species: lettuce, tomato and potato 
Organisation University of Cambridge
Country United Kingdom 
Sector Academic/University 
PI Contribution RTDBox is being developed to automate the construction of comprehensive, high-quality transcriptomes for plant species using high-throughput sequencing data. We have budgeted for Illumina short-read sequencing and PacBio sequencing for three exemplar crop species: lettuce (in collaboration with Prof Katherine Denby at University of York), tomato (in collaboration with Prof David Baulcombe at University of Cambridge) and potato (in collaboration with Dr Ingo Hein at University of Dundee). I have contacted all of the above collaborators and informed them of the project schedule so that they are ready to make RNA available for sequencing.
Collaborator Contribution Discussions and plans were made with all collaborators on how to proceed with the generation and preparation of the samples.
Impact no outputs yet
Start Year 2019
 
Description RTDBox will be validated on three crop species: lettuce, tomato and potato 
Organisation University of Dundee
Country United Kingdom 
Sector Academic/University 
PI Contribution RTDBox is being developed to automate the construction of comprehensive, high-quality transcriptomes for plant species using high-throughput sequencing data. We have budgeted for Illumina short-read sequencing and PacBio sequencing for three exemplar crop species: lettuce (in collaboration with Prof Katherine Denby at University of York), tomato (in collaboration with Prof David Baulcombe at University of Cambridge) and potato (in collaboration with Dr Ingo Hein at University of Dundee). I have contacted all of the above collaborators and informed them of the project schedule so that they are ready to make RNA available for sequencing.
Collaborator Contribution Discussions and plans were made with all collaborators on how to proceed with the generation and preparation of the samples.
Impact no outputs yet
Start Year 2019
 
Description RTDBox will be validated on three crop species: lettuce, tomato and potato 
Organisation University of York
Department Department of Biology
Country United Kingdom 
Sector Academic/University 
PI Contribution RTDBox is being developed to automate the construction of comprehensive, high-quality transcriptomes for plant species using high-throughput sequencing data. We have budgeted for Illumina short-read sequencing and PacBio sequencing for three exemplar crop species: lettuce (in collaboration with Prof Katherine Denby at University of York), tomato (in collaboration with Prof David Baulcombe at University of Cambridge) and potato (in collaboration with Dr Ingo Hein at University of Dundee). I have contacted all of the above collaborators and informed them of the project schedule so that they are ready to make RNA available for sequencing.
Collaborator Contribution Discussions and plans were made with all collaborators on how to proceed with the generation and preparation of the samples.
Impact no outputs yet
Start Year 2019