A computational platform for the high-throughput identification of short RNAs and their targets in plants

Lead Research Organisation: University of East Anglia
Department Name: Computing Sciences


Most of the RNA molecules in cells are involved in protein production (ribosomal, transfer or messenger RNAs). However, there are RNA molecules with other functions. Two recently discovered classes of such non-coding RNAs are microRNAs (miRNAs) and small interfering RNAs (siRNAs). These regulatory RNA molecules are very short (19-24 nucleotides) and are thus commonly known as short RNAs (sRNAs). Some sRNAs can interact with specific mRNAs because they have partially complementary sequences. As a result of these interactions, the expression of the targeted mRNAs is significantly reduced. Other sRNAs can target the chromosomes and trigger DNA modification. More than a hundred miRNAs have been identified in plant species (Arabidopsis, rice and poplar). However, the total number of sRNAs in plants is much higher: a recent experimental study identified about 75,000 sRNAs in Arabidopsis. In addition, several miRNAs found in one species were absent from the others, suggesting that there are miRNAs which are specific to certain groups of plants. To systematically identify sRNAs in plants, the co-PIs' laboratories have started to use a novel high-throughput sequencing technology (454 pyrosequencing). Initially, they are studying Arabidopsis; later, as the genome sequences become available, tomato and alfalfa sRNAs will be analysed. This novel technology produces about 200,000 sRNA sequences for each sample. Preliminary results from the 454 technology are currently processed manually using standard bioinformatics tools. However, such an analysis is unfeasible for the millions of sRNA sequences that will be derived from future 454 experiments. The main goal of this project is to develop a computational platform dedicated to the analysis of data generated by the high-throughput 454 sRNA sequencing projects. This platform will classify new sRNAs, some of which will be subjected to further experimental work, and search for possible RNA targets. It will initially be tested on 454 data from Arabidopsis, and subsequently on tomato and alfalfa. Later in the project, comparative analysis tools will be incorporated for mutant analysis. New bioinformatics tools and novel sRNAs discovered through this project will be made publicly available. Identifying the full complement of sRNAs in different plant species will allow us to characterise an important and little understood layer of regulation in specific plant traits such as fleshy fruit development and ripening in tomato, and nitrogen fixation in alfalfa.

Technical Summary

A recently discovered layer of gene expression regulation in plants utilizes two types of small, non-coding regulatory RNAs (sRNAs): microRNAs (miRNAs) and small interfering RNAs (siRNAs). sRNAs can target RNA in a sequence-specific manner through base pairing between sRNAs and target RNAs. In plants, RNA targets of sRNAs are usually degraded or, in some cases, translationally suppressed. sRNAs can also target genomic DNA, causing methylation and heterochromatinisation that can lead to transcriptional gene silencing. Recent experimental studies indicate that there may be many thousands of sRNAs in plants, some of which are species specific. The laboratories of the co-PIs have set out to identify the full complement of sRNAs in different crop species. Their progress has recently been facilitated by a new high-throughput pyrosequencing technology (www.454.com), which yields approximately 200,000 sRNA sequences for each sample. The aim of this project is to develop a computational platform to analyse the output from the co-PIs' high-throughput sRNA sequencing projects. This platform will incorporate cutting-edge computational RNA analysis tools, and will allow both the classification of new sRNAs (some of which will be subjected to further experimental work) and the search for possible RNA targets. Identifying the full complement of sRNAs in different plant species will allow us to characterise an important and little understood layer of regulation in specific plant traits such as fleshy fruit development and ripening in tomato, and nitrogen fixation in alfalfa.
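
The sequence-specific targeting described above can be illustrated with a minimal sketch. It is not the project's target-prediction method: it assumes a simple scan in which a site counts as a candidate if it differs from the perfect complement of the sRNA in at most a few positions. The function names and the `max_mismatches` default are hypothetical, and G-U wobble pairs, bulges and position-dependent rules are ignored.

```python
# Watson-Crick partners for RNA bases (G-U wobble pairing is ignored in this sketch).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the RNA sequence that pairs perfectly with `rna`."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(srna, mrna, max_mismatches=3):
    """Scan `mrna` for sites that pair with `srna`.

    Returns a list of (start, mismatches) tuples, one for every window of the
    mRNA that differs from the perfect complement of the sRNA in at most
    `max_mismatches` positions (an illustrative threshold, not a published one).
    """
    site = reverse_complement(srna)
    hits = []
    for start in range(len(mrna) - len(site) + 1):
        window = mrna[start:start + len(site)]
        mismatches = sum(1 for a, b in zip(window, site) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

Calling `find_target_sites(srna, mrna)` with an sRNA and a transcript, both written in the RNA alphabet, returns the candidate positions together with their mismatch counts, so stricter or looser thresholds can be applied afterwards.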


Molnar A (2009) Highly specific gene silencing by artificial microRNAs in the unicellular alga Chlamydomonas reinhardtii. in The Plant journal : for cell and molecular biology

Moxon S (2008) A toolkit for analysing large-scale plant small RNA datasets. in Bioinformatics (Oxford, England)

Pais H (2011) Small RNA discovery and characterisation in eukaryotes using high-throughput approaches. in Advances in experimental medicine and biology

Schwach F (2009) Deciphering the diversity of small RNAs in plants: the long and short of it. in Briefings in functional genomics & proteomics

Description: Two major software products were developed in the course of the project: "The UEA plant sRNA toolkit" and "SiLoDb".

The UEA plant sRNA toolkit is a collection of software tools for the high-throughput analysis of sRNA datasets generated by the sequencing techniques that are now standard in sRNA research (in particular 454 and Illumina). The collection provides a complete computational pipeline for processing 454 (and Illumina) sequencing results from raw data, including the identification, visualisation, and classification of sRNAs in plants, and also supports mutant analysis and the comparison of sRNA data with microarray data. The tools can all be used online at http://srna-tools.cmp.uea.ac.uk/plant/ and require no local installation. Users of the website upload large-scale sequencing data to any of the tools provided and download results in standard formats (spreadsheets, image files, FASTA sequence files). The tools can be used independently and, in addition, the output of some tools can be uploaded to others for more advanced analyses, such as miRNA identification based on a filtered subset of sRNA sequences. All data processing is carried out on the UEA high-performance computing cluster. The tools have been used to identify new sRNAs in plants, which have been detailed in high-impact journals such as Genome Research, meeting Objective 4. Since its release in 2008, the UEA plant sRNA toolkit has been used extensively by sRNA researchers both in the UK and worldwide, with an average of over 12,000 page views per month, 39% of which are from the UK. In addition, an average of 120-200 full analyses per month were performed by external users. We regularly receive emails from users requesting additional tools and features. Moreover, independent publications are now starting to appear in which the UEA plant sRNA toolkit has been used and cited as the main analysis tool.
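
As an illustration of the kind of filtering step that precedes analyses such as miRNA identification, the sketch below keeps reads in the 19-24 nucleotide range quoted in the abstract, collapses identical reads into counts, and writes the result as FASTA. It is a minimal sketch and not the toolkit's code: the function names and the FASTA header layout are hypothetical, reads are assumed to be adapter-trimmed already, and reads containing ambiguous bases are simply discarded.

```python
from collections import Counter


def collapse_reads(reads, min_len=19, max_len=24):
    """Count identical reads, keeping only those in the sRNA size range.

    `reads` is an iterable of adapter-trimmed sequences in the DNA alphabet.
    Reads containing characters other than A, C, G, T are dropped.
    """
    counts = Counter()
    for read in reads:
        seq = read.strip().upper()
        if min_len <= len(seq) <= max_len and set(seq) <= set("ACGT"):
            counts[seq] += 1
    return counts


def to_fasta(counts):
    """Format unique sequences as FASTA, most abundant first.

    The header layout (srna_<rank>_x<count>) is an assumption of this sketch.
    """
    lines = []
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        lines.append(f">srna_{rank}_x{n}")
        lines.append(seq)
    return "\n".join(lines)
```

Collapsing to unique sequences with counts keeps the file small while preserving abundance, which later steps need for expression comparisons.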

SiLoDb is a database of publicly available plant sRNA data that is based on the idea of organising sRNAs into biological units, i.e. sRNA-generating genomic loci. SiLoDb is available at http://sourceforge.net/projects/silodb/. It currently covers the following organisms: Medicago truncatula, Arabidopsis thaliana, Chlamydomonas reinhardtii, Oryza sativa, and Solanum lycopersicum. The database can be searched and browsed in a number of ways, and genomic matches of sRNAs are displayed in a standard genome browser (GMOD GBrowse), which is well known to many biologists. Users can search for specific sRNAs by sequence and by other properties such as length and number of matches to the genome. More importantly, it is also possible to search for genomic sRNA loci using parameters such as genomic region, presence of annotation (such as "known miRNA"), overlap with gene regions, and composition of the sRNAs that match the locus. The main search result shows an overview of sRNA loci and their main properties, including normalised expression levels for the samples available on the site. Two important measures used for sRNA classification are the strand bias and the size distribution of the sRNAs; both are shown graphically to give a quick overview of the subset of loci identified by the search. Further details, such as the exact sequences of sRNAs belonging to the locus, are available by following links in the output. The search results can also be downloaded and further examined in spreadsheet programs such as MS Excel. Information about the publicly available samples that were incorporated into the database is also available, including links back to the NCBI Gene Expression Omnibus (GEO) repository, where whole datasets are available for download.
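
The two classification measures mentioned above can be sketched for a single locus as follows. This is a minimal illustration, not SiLoDb's implementation: strand bias is assumed here to be the fraction of reads on the plus strand, the size distribution is the fraction of reads at each sRNA length, and the function name and input layout are hypothetical.

```python
from collections import Counter


def locus_summary(srnas):
    """Summarise one locus from (sequence, strand, count) tuples.

    `strand` is '+' or '-'. Strand bias is defined in this sketch as the
    fraction of reads on the plus strand; the database may use a different
    definition.
    """
    total = sum(count for _, _, count in srnas)
    if total == 0:
        raise ValueError("locus has no reads")
    plus = sum(count for _, strand, count in srnas if strand == "+")
    sizes = Counter()
    for seq, _, count in srnas:
        sizes[len(seq)] += count
    return {
        "strand_bias": plus / total,
        "size_distribution": {
            length: n / total for length, n in sorted(sizes.items())
        },
    }
```

A locus whose reads fall almost entirely on one strand and cluster at a single length looks different under these measures from one with reads of one size spread over both strands, which is why the pair is useful as a quick overview.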
Exploitation Route: The toolkit and the database can be used to analyse small RNA datasets in plants.
Sectors: Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Pharmaceuticals and Medical Biotechnology

Description: The 2008 Bioinformatics paper introducing the toolkit has been cited over 140 times in Google Scholar as of October 2014. The toolkit has been used for identifying and analysing small RNAs in various organisms including butterfly, grape, Arabidopsis, rice and strawberry. Building on the tools, we have developed the UEA sRNA workbench (http://srna-workbench.cmp.uea.ac.uk/), which has superseded the tools and has been downloaded over 7,000 times since October 2014.
First Year Of Impact: 2008
Sector: Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software)
Impact Types: Societal, Economic