Development and benchmarking of improved computational methods for transcript-level expression analysis using RNA-seq data

Lead Research Organisation: University of Manchester
Department Name: Life Sciences

Abstract

After sequencing of the human genome was completed, scientists were surprised to discover that there are far fewer protein-coding genes than had previously been predicted. One reason that an organism as complex as a human can be built from a relatively small number of genes is that each gene can encode more than one protein. An intermediate molecule, messenger RNA (mRNA), carries the information from the genome in the cell nucleus to the ribosomes, which create proteins. These mRNA molecules are also known as transcripts and their full complement is termed the transcriptome. Before they mature, these transcripts are edited to form the templates for different proteins. This editing process is called splicing and the different transcripts that result are called splice variants or isoforms. An additional complexity in the transcriptome arises because each gene has multiple copies (for example 2 in human, 6 in wheat) and these different copies, called alleles, can be expressed differently under different conditions or in different tissues. The transcriptome is therefore a collection of transcripts which includes all the allele-specific gene isoforms that are expressed in the cell, along with other non-coding RNA molecules.

Splicing and allele usage are fundamental ways in which the function of genes can be modulated in a tissue-specific manner. Developing technologies to accurately measure transcript expression is therefore a necessary step towards understanding and modelling cells and tissues. A recently developed experimental technology called RNA-seq gives unprecedented access to data about the transcriptome. Computational methods are required to interpret these data, which are in the form of a list containing millions of short RNA sequence fragments. These fragments are difficult to interpret because, for example, the same fragment could have come from a large number of different gene isoforms. The question is, which one? Computational methods can be used to answer this question and infer the concentration of different gene isoforms in the sample given these data. In this project we will develop a new computational method, implemented in publicly available free software, which uses advanced statistical procedures to solve this problem. An important distinguishing feature of the method is the ability to associate inferred concentrations with a degree of uncertainty which captures technical and biological sources of error as well as the inherent ambiguity in assigning fragments to gene isoforms. We will create benchmark data that allow us to assess the performance of our method and of other published methods, allowing researchers and end-users of different methods to understand their properties. Finally, we will adapt an existing computer program, puma, to work with the processed RNA-seq data in order to identify genes which change between conditions, which have similar expression patterns or which contribute most to the variance in the data.

Technical Summary

RNA-seq technology enables the discovery and quantification of multiple transcripts for each gene, including different gene isoforms and different allelic forms. We propose the development of a Bayesian inference approach for inferring the concentration of different transcripts present in a sample by using a probabilistic model of mapped reads. By using a Bayesian inference approach we will capture the level of inherent uncertainty in our estimates of transcript expression levels due to mapping ambiguity, technical noise, read depth limitations and biological noise. We will include the possibility of discovering unannotated isoforms. The use of a read-level probabilistic model will allow us to incorporate information about read density biases and read mapping quality scores. We will apply the model to quantify allele-specific isoform expression which is particularly challenging in complex genomes such as the hexaploid wheat that can express genes from a set of three diploid genomes. We will develop a transcript-level benchmark dataset for method evaluation in which different gene isoforms are spiked in at known concentrations against a natural background. R-code implementing our methods for transcript-level inference and benchmarking will be disseminated through the Bioconductor project. We will extend the existing puma Bioconductor package for noise propagation in microarray analysis so that the methods there can be applied to transcript-level expression data with an associated multivariate uncertainty distribution.
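
As an illustration of the kind of read-level Bayesian inference described above, the following is a minimal sketch of a Gibbs sampler for a simple mixture model of reads over transcripts. It is not the project's implementation: the function names (gibbs_transcript_expression, summarise), the symmetric Dirichlet prior, the toy likelihood values and all parameter defaults are assumptions made for this example. Each read carries a likelihood of having originated from each transcript; the sampler alternates between assigning reads to transcripts and resampling the transcript proportions, and the spread of the retained samples reflects the uncertainty caused by mapping ambiguity.

```python
import random
from statistics import mean, stdev


def gibbs_transcript_expression(read_likelihoods, n_iter=2000, burn_in=500,
                                alpha=1.0, seed=0):
    """Sample transcript proportions from a toy read-assignment mixture model.

    read_likelihoods[n][m] is the (assumed known) likelihood of read n given
    transcript m; every read needs at least one non-zero entry.
    Returns a list of posterior samples, each a list of proportions.
    """
    rng = random.Random(seed)
    n_transcripts = len(read_likelihoods[0])
    theta = [1.0 / n_transcripts] * n_transcripts
    samples = []
    for it in range(n_iter):
        # Step 1: assign each read to one transcript given current proportions.
        counts = [0] * n_transcripts
        for lik in read_likelihoods:
            weights = [t * l for t, l in zip(theta, lik)]
            z = rng.choices(range(n_transcripts), weights=weights)[0]
            counts[z] += 1
        # Step 2: resample proportions from Dirichlet(alpha + counts),
        # drawn as normalised gamma variates.
        gammas = [rng.gammavariate(alpha + c, 1.0) for c in counts]
        total = sum(gammas)
        theta = [g / total for g in gammas]
        if it >= burn_in:
            samples.append(theta)
    return samples


def summarise(samples):
    """Return (posterior mean, posterior standard deviation) per transcript."""
    return [(mean(col), stdev(col)) for col in zip(*samples)]


# Toy data: 30 reads unique to transcript 0, 2 unique to transcript 1,
# and 60 reads that are equally compatible with transcripts 1 and 2.
reads = [[1.0, 0.0, 0.0]] * 30 + [[0.0, 1.0, 0.0]] * 2 + [[0.0, 1.0, 1.0]] * 60
samples = gibbs_transcript_expression(reads)
for index, (m, s) in enumerate(summarise(samples)):
    print(f"transcript {index}: mean={m:.3f} sd={s:.3f}")
```

In this toy example transcript 0 is supported only by uniquely mapping reads, whereas transcripts 1 and 2 share most of their reads, so the posterior spread for transcripts 1 and 2 is wider than for transcript 0. A realistic model would also need read-specific likelihoods derived from mapping quality and read density bias, as outlined in the summary.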

Planned Impact

Communication and Engagement: We will publish papers in open access peer-reviewed journals so that the academic community is made aware of developments. Software will be implemented as open source Bioconductor packages. A public benchmark will lead to better practice by allowing a publicly available comparison of competing methods on a level playing field. We have close links to TGAC and the other MRC hubs and we will ensure that all of these groups are made aware of the tools developed and their application. The CGR, as a NERC and MRC hub, also works with a large bioinformatics community and will train new users in working with this software.

Collaboration and Co-production: The investigators are also engaged in many other BBSRC projects which can adopt the methodology developed here to add value to those projects. These projects also provide excellent application data for this proposal. Many of these projects involve short read sequencing of economically important species and comparative analysis to model species and we will identify other projects where the software will be deployed and ensure that their feedback is reflected in the development of the software.

Exploitation and Application: As this tool will be deployed primarily for academic research we do not intend to protect its application. It will be made freely available to the user community through a suitable open source license.

Capacity and Involvement: We are involved in supervising BBSRC and MRC funded Ph.D. students who will benefit from this research as they will be directly using the software developed, and we regularly employ sixth form students to undertake research activities in the lab. Both SITRAN and the CGR undertake a wide range of outreach activities to industry, the academic community and the general public and actively engage with the media at local, national and international level.

Impact Activity Deliverables and Milestones: Computational Biology developments will be presented at international conferences. Four key papers and associated software will be published along with a benchmarking website.

Resource for activity: We request £1,200 per paper for open access journal charges and £5,700 for presentation at five leading conferences over three years.

Publications

Papastamoulis P (2014) Improved variational Bayes inference for transcript expression estimation. in Statistical applications in genetics and molecular biology

Papastamoulis P (2017) Bayesian estimation of differential transcript usage from RNA-seq data. in Statistical applications in genetics and molecular biology

Papastamoulis P (2018) A Bayesian model selection approach for identifying differentially expressed transcripts from RNA sequencing data. in Journal of the Royal Statistical Society. Series C, Applied statistics

 
Description We have developed an efficient computer algorithm for determining the concentration of different gene transcripts (mRNA molecules) from RNA sequencing (RNA-seq) data. The algorithm is implemented in the BitSeq package for RNA-seq data analysis, which is freely available for use by the scientific community both as a stand-alone package and as part of the popular R/Bioconductor project. We have benchmarked the method and demonstrated that it is more accurate than competing methods while being comparable in terms of computation time. A recent large-scale independent benchmark (Kanitz et al. Genome Biology 16, 2015) has also shown the new method to perform very well according to a range of assessment criteria. The research paper describing this work (Hensman et al. Bioinformatics 31, 2015) was the most-read article in the journal Bioinformatics in December 2015 and January 2016. The paper also describes an extensive benchmarking study and dataset which are available online. The BitSeq method was applied by our collaborators at the University of Liverpool to model wheat data as part of this collaborative project.

In addition, we have developed new methods for determining whether transcript levels and relative transcript usage have changed between two conditions (Papastamoulis and Rattray, JRSS C 2017 and papers under review), and we have developed extensive benchmarking datasets and code for both transcript quantification and differential expression, which are also freely available online.
Exploitation Route The software we have developed is publicly available and the code is open source and can be used and adapted by academic or industry researchers. Analysis of RNA-Seq data is widespread in biomedical research and therefore our methods can be applied to a broad range of applications. The fast Variational Bayes algorithms that we have described can also be applied to other data-intensive machine learning applications involving clustering and differential analysis of discrete counts-based data.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology

 
Title BitSeqVB variational Bayes extension of the BitSeq package 
Description BitSeq is a package for transcript isoform level expression and differential expression estimation for RNA-seq. Through this project we introduced a much faster Variational Bayes (VB) version of BitSeq. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact The BitSeqVB method was a top-performing method in a recent benchmark (Kanitz, A., Gypas, F., Gruber, A. J., Gruber, A. R., Martin, G., & Zavolan, M., 2015, Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data, Genome Biology, 16(1), 1-26). The paper describing the work (Hensman et al. Bioinformatics 2015) was the most-read article in the journal Bioinformatics in December 2015 and January 2016.
URL http://bitseq.github.io/
 
Title cjBitSeq software 
Description cjBitSeq [1] implements a Bayesian model selection approach in order to simultaneously estimate transcript expression and perform Differential Expression (DE) analysis from RNA-seq data, given two (replicated) samples of biological conditions. The method has also been extended to the special case of Differential Transcript Usage [2]. A hierarchical Bayesian model builds upon the BitSeq [3, 4] framework, and the posterior distribution of transcript expression and differential expression is inferred using Markov Chain Monte Carlo (MCMC). [1] Papastamoulis P. and Rattray M. (2017a). A Bayesian model selection approach for identifying differentially expressed transcripts from RNA-Seq data. Journal of the Royal Statistical Society, Series C. [2] Papastamoulis P. and Rattray M. (2017b). Bayesian estimation of Differential Transcript Usage from RNA-seq data. Statistical Applications in Genetics and Molecular Biology. [3] Glaus P., Honkela A. and Rattray M. (2012). Identifying differentially expressed transcripts from RNA-Seq data with biological variation. Bioinformatics (28): 1721-1728. [4] Papastamoulis P., Hensman J., Glaus P. and Rattray M. (2014). Improved variational Bayes inference for transcript expression estimation. Statistical Applications in Genetics and Molecular Biology (13), vol 2: 213-216.
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Too early to assess impact 
URL https://github.com/mqbssppe/cjBitSeq/wiki