Data-driven hierarchical analysis of de novo transcriptomes

Lead Research Organisation: University of Cambridge
Department Name: Plant Sciences

Abstract

Recent developments in sequencing technology have made it possible to directly sample sequencing reads from the expressed transcripts present in a population of cells. This has provided an unparalleled view of gene expression, including how expression changes in response to certain stimuli, across different conditions, and between different disease states. Yet, the de novo analysis of this data, wherein the genome of the organism being sequenced is unknown, still poses many technical challenges. The view of gene expression obtained from such experiments is typically fractured, incomplete, and difficult to analyze in the manner one would in a reference-based scenario, where a corresponding genome is available. In this project, a suite of mathematical models and software tools will be developed to help close the gap between reference-based and de novo analysis of gene expression. The developed tools will be able to identify and correct a host of errors in the predicted (assembled) transcripts. The methods that are typically used to assess gene expression from sequencing data will be specialized for the de novo context, by explicitly accounting for the incomplete nature of the data. The recovered genes will be compared to known genes in related organisms, and new methods will be developed that will allow better prediction of the function of the genes discovered. The methods developed herein will estimate measures of uncertainty at each stage of analysis, and will include the measured uncertainty in all resulting conclusions. Finally, these novel methods and tools will be applied to the study of C4 photosynthesis - a highly efficient form of photosynthesis. A large number of sequencing datasets centered around the C4 system have been produced, yet the genetic mechanism underlying this trait is still unknown. The methods developed in this project will be used to improve our understanding of the regulatory elements involved in the C4 photosynthesis pathway. 
The project will also include the creation and support of an online community where scientists can seek expert advice on transcriptomic methods (including the software developed herein) and experimental design.

Technical Summary

The goal of this proposal is to develop a novel set of methods, and an integrated set of tools, for the analysis of de novo transcriptomes. There are currently a number of tools that aim to tackle different phases of the de novo transcriptome analysis pipeline (e.g. assembly, clustering, expression quantification, differential expression testing), but none of these provides a well-integrated, principled and efficient approach to this difficult challenge. The methods we propose to develop and validate herein will provide a state-of-the-art pipeline for posing and answering a host of relevant biological questions about how transcripts, genes, and functional modules are differentially expressed and regulated, specifically in the context of organisms for which we lack a reference genome.

Planned Impact

The broader impacts of this proposal center around creating vibrant, inclusive and productive user (and developer) communities around high-quality open-source implementations of the methods proposed herein. As part of aim 1 of the broader impacts, undergraduates at both institutions will be involved in the method development, testing and data analysis, which will provide them with valuable research experience in Computational Biology.
The project collaborators will involve undergraduate students, at both Cambridge and Stony Brook, in the work being carried out in this proposal. At Stony Brook, PI Patro will mentor one or more undergraduates in the Departments of Computer Science and Biology. These departments do not currently offer a Bioinformatics or Computational Biology class at the undergraduate level, so this will provide a unique opportunity for undergraduates at Stony Brook to simultaneously learn about Computational Biology and become involved in research in this field. The students will be involved in collecting and curating the set of multi-modal transcriptomic datasets that will be used for systematic methodological validation. They will also be involved in helping to design and carry out the comparative analysis of different pipelines, and secure co-authorship on the resulting publications.
As we create the methods proposed herein, we will be systematically evaluating them at each step. The suite of synthetic and experimental data we gather, and the evaluation metrics and protocols we develop for assessing the relative accuracy of our pipeline (and others) on this data, will be of interest in their own right. Thus, as part of the broader impact of this work, we propose the creation of a curated collection of data and evaluation tools for assessing the speed and accuracy of de novo transcriptome analysis pipelines. The experimental data we select will be free and publicly available, and we will make the simulated data we generate as part of this suite publicly available as well. The evaluation tools and metrics will be developed using standard tools and practices for reproducible research. Here, the aim will be to make the assessment of new pipelines in our framework easy, so that methods can be evaluated using a common, open, and reproducible benchmark as they become available. We also anticipate that the creation of this benchmark and the evaluation of different pipelines will lead to a set of best practices and preferred methodologies for the analysis of de novo transcriptomes. To this end, we will foster the creation of an online community where researchers can discuss the benchmarking results, propose new tests and metrics, and seek expert advice on transcriptomics methods and experimental design. We anticipate that an open and extensive benchmark will not only help researchers select the best tools and protocols, but will also focus the community on the most pressing and open methodological questions in de novo transcriptome analysis.
The development of high-quality, efficient, open-source implementations of the methods described above for the in-depth analysis of de novo transcriptomes is another of the broader impacts of the proposed work. All of the methods developed as part of this proposal will be made publicly available and distributed via GitHub. The developed software will incorporate integration testing and automated builds. The methods will be well-documented, and tutorials and walk-throughs for different usage scenarios will be made available. Finally, the methods will be supported through the creation of user discussion groups, as was done for Sailfish. The project collaborators have a proven track record of producing and supporting efficient, high-quality and open-source bioinformatics tools, and the unified analysis pipeline and related tools developed as part of this proposal will continue in this tradition.

Description: With our NSF collaborators, we have developed new methods to assemble short reads into contigs, and to quantify the expression of genes from such sequencing approaches. Further, we have used these approaches to provide new insights into patterns of gene expression associated with the evolution and functioning of C4 photosynthesis. Most importantly, we have identified different populations of small RNA in the two important cell types associated with the C4 leaf: bundle sheath and mesophyll cells. Not only have we identified distinct populations in these cell types, but we also have a working hypothesis about how they are synthesised and how they may be targeting C4 genes for regulation. This is the first time that small RNAs have been implicated in the control of the highly productive C4 system.
Exploitation Route: The approaches that we developed can be used by others working with short-read sequencing, and so are applicable to a wide range of biological processes.
Sectors: Agriculture, Food and Drink