Classification and functional annotation of endogenous siRNAs and other small RNAs

Lead Research Organisation: European Bioinformatics Institute
Department Name: Enright Group

Abstract

The aim of this project is to perform experiments and write computer programs that will aid research into the role of small interfering RNAs (siRNAs) produced within mammalian cells (endogenous siRNAs).

siRNAs are short RNA molecules snipped, by a protein named Dicer, from the end of double-stranded RNA molecules which have bound to each other in the cell. The short double-stranded segment that is released is unwound and one strand is loaded into a structure consisting of a set of other proteins. The siRNA strand acts as the guide for these proteins. The siRNA finds target regions in other RNAs in the cell. If these regions bind to the siRNA along its whole length, one of the proteins in the structure will cut the target and the target will break down. If only a small part at one end of the siRNA binds to the target, the target will be destabilised and its ability to be read by the protein-making machinery will be impeded. In this way siRNAs expressed in a cell are able to control the level of other RNAs and the amount of protein manufactured from the RNA messages.
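The two outcomes described above can be illustrated with a minimal Python sketch. It is a toy model only: the function names, the seven-nucleotide "seed" at guide positions 2-8, and the strict Watson-Crick matching are assumptions made for illustration and are not taken from this project.

```python
# Toy model of the two siRNA outcomes described in the abstract.
# All names and thresholds here are illustrative assumptions.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return rna.translate(RNA_COMPLEMENT)[::-1]


def predicted_outcome(guide: str, target_site: str, seed_length: int = 7) -> str:
    """Classify a guide/target pair; both sequences are written 5'->3'."""
    if len(guide) != len(target_site):
        raise ValueError("guide and target site must be the same length")
    # A site that pairs with the guide along its whole length is the
    # guide's reverse complement: the target is cut.
    if target_site == reverse_complement(guide):
        return "cleavage"
    # Assumed seed: guide positions 2-8 (1-based). Guide index i pairs with
    # target index n-1-i, so the seed faces target[n-1-seed_length : n-1].
    n = len(target_site)
    seed_site = reverse_complement(guide[1:1 + seed_length])
    if target_site[n - 1 - seed_length:n - 1] == seed_site:
        return "destabilisation"
    return "no effect"


guide = "UAGCUUAUCAGACUGAUGUUG"  # hypothetical 21-nt guide strand
print(predicted_outcome(guide, reverse_complement(guide)))
```

Real target recognition tolerates mismatches and wobble pairs, so a practical tool would score pairing rather than require exact matches.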

In mammals, siRNAs were thought to be rare but advances in technology have begun to identify them. However, the most commonly used methods for identifying small RNAs in the cell can only detect them and cannot predict what they do or how they were produced. Many other small RNAs, which are a similar length to siRNAs but may have different functions, have also been found in the same experiments. In order to work out exactly what siRNAs might be doing in cells we need to be able to reliably distinguish them from other small RNAs.

There are ongoing efforts to understand these data. The sequence of nucleotides in an RNA molecule can be used to trace the RNA molecule back to the region in the DNA genome it must have been read from, and some tools exist that can identify some of the classes of small RNA based on this information. More work is required to efficiently and automatically identify which of the small RNAs are endo-siRNAs.
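As a rough illustration of tracing a small RNA back to its genomic origin, the Python sketch below searches both strands of a DNA string for exact matches. The function name and the toy genome are hypothetical, and exact string search stands in for the indexed, mismatch-tolerant aligners that would be used on real sequencing data.

```python
# Minimal sketch: locate a small RNA read on either strand of a DNA sequence.
# Exact matching only; names and example sequences are illustrative assumptions.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def locate_read(read_rna: str, genome_dna: str) -> list[tuple[int, str]]:
    """Return sorted (0-based start, strand) pairs for every exact match."""
    read = read_rna.upper().replace("U", "T")  # compare in the DNA alphabet
    reverse_comp = read.translate(DNA_COMPLEMENT)[::-1]
    hits = []
    for strand, query in (("+", read), ("-", reverse_comp)):
        start = genome_dna.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome_dna.find(query, start + 1)  # allow overlapping hits
    return sorted(hits)


toy_genome = "AAAACGTTGCAAAAGGGG"  # hypothetical sequence
print(locate_read("CGUUGC", toy_genome))
```

A read that maps to many positions, or to both strands of the same locus, is itself a feature that could help separate classes of small RNA.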

The first goal of this proposal is to generate data in the laboratory that will allow me to identify regions of the DNA genome that express dsRNA and which of these regions are cut up by Dicer to release the siRNAs. Once I have identified these regions, I will use computational methods to look for local features in the DNA and features in the short RNAs from the area, to develop computational tools that can identify endo-siRNAs from amongst the other small RNAs sampled from cells. I will also investigate the features around regions that produce other types of small RNA to look for similarities and use these to group short RNAs into sets that are most likely to behave in a similar way.

It has also recently been realised in mammals that siRNAs are able to suppress the expression of RNA from other regions of the genome by targeting the DNA directly. The mechanism by which this occurs is still not entirely understood. Again, by searching for features that surround regions potentially targeted by siRNAs, that I will identify experimentally, I will work to develop a method that will be able to computationally predict regions of the DNA that may be targeted by endogenous siRNAs in this way. Predicted targets would allow experimental biologists to follow up on interesting candidates in their own experiments and aid future research into the roles of individual endogenous siRNAs and this class of small RNAs in general.

There is increasing evidence that the slicing of double stranded RNA by Dicer or the endogenous siRNAs produced may play a role in several developmental processes and diseases. The results generated will aid future research in this field and lead to a better understanding of the processes in the cell that may potentially be relevant for human health and disease. It is also possible that siRNAs could be used as medicines if we can better understand the rules that control the areas of DNA that they target.

Technical Summary

I will generate genome-wide data sets in the laboratory that will provide the basis for developing novel computational tools to detect and, later, predict the function and regulatory targets of endogenous siRNAs (endo-siRNAs). The initial experiments consist of a combination of dsRNA enriched sequencing data, prepared using the J2 antibody, p19 enriched dsRNA siRNA sequencing libraries and PEG enriched small RNA sequencing libraries prepared from wild type and Dicer knockdown backgrounds. These data will then be combined with information from public resources such as Ensembl and ENCODE. Machine learning will be used to identify a set of features associated with Dicer dependent endo-siRNA loci. Clustering techniques will also be used to identify features that can distinguish distinct sets of small RNA producing loci. These analyses will be used as the basis for producing methods for interpreting sRNA sequencing data. These methods will be applied to publicly available sRNA sequencing data for Dicer knockout systems and other tissues and cell lines to assess performance and generate more general expression profiles. I will use endo-siRNA perturbations, ChIP-seq and public Ago-CLIP datasets to identify potential targets of transcriptional gene silencing (TGS). I will examine the attributes of target sites for features that will allow development of a predictive method.
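One simple way to flag candidate Dicer dependent loci from wild type and knockdown libraries is a fold-change filter, sketched below in Python. This is not the method proposed here: the function name, the counts-per-million normalisation, the pseudocount and the threshold are all assumptions chosen for illustration.

```python
import math


def dicer_dependent_loci(
    wild_type: dict[str, int],
    knockdown: dict[str, int],
    min_log2_drop: float = 1.0,   # assumed threshold: at least a 2-fold loss
    pseudocount: float = 1.0,     # assumed; avoids division by zero
) -> dict[str, float]:
    """Return loci whose normalised small RNA abundance falls on Dicer knockdown.

    Inputs map locus name -> raw read count. Output maps each flagged locus
    to its log2 fold change (knockdown over wild type).
    """
    wt_total = sum(wild_type.values())
    kd_total = sum(knockdown.values())
    if wt_total == 0 or kd_total == 0:
        raise ValueError("both libraries must contain reads")
    flagged = {}
    for locus, wt_count in wild_type.items():
        # Scale to counts per million so library depth does not drive the ratio.
        wt_cpm = wt_count / wt_total * 1e6
        kd_cpm = knockdown.get(locus, 0) / kd_total * 1e6
        log2_fc = math.log2((kd_cpm + pseudocount) / (wt_cpm + pseudocount))
        if log2_fc <= -min_log2_drop:
            flagged[locus] = log2_fc
    return flagged


# Hypothetical counts for two loci.
print(dicer_dependent_loci({"locus_a": 500, "locus_b": 500},
                           {"locus_a": 50, "locus_b": 950}))
```

Total-count scaling can mislead when Dicer loss depletes a large share of the library, so a real analysis would likely normalise against Dicer independent RNAs or spike-ins and use replicates with a statistical test.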

There is growing evidence that Dicer processing of dsRNAs, endo-siRNAs and TGS may play a role in a range of developmental and medical systems, including learning, geographic atrophy and breast cancer. This proposal will address a pressing need for tools that can enable research in a rapidly expanding niche of cellular biology.

Planned Impact

Initially this research will benefit those conducting small RNA sequencing experiments across a broad range of biological fields. Frequently, small RNA sequencing is used as a method to profile miRNA expression. The open source tools developed here will allow researchers to profile their sequencing data more thoroughly. Applying these tools to novel and publicly available data will allow the extent to which small RNAs, beyond miRNAs, contribute to the small RNA profile of mammalian systems to be rapidly determined. This will inevitably impact on our understanding of a broad range of pathologies and developmental processes.

The project proposed represents a collaboration between the European Bioinformatics Institute in Cambridge and the Dunn School of Pathology in Oxford. I have considerable experience interacting with multiple laboratories across disciplinary and geographical boundaries. By fostering ties between these two Institutions I hope to encourage more frequent interactions and the development of further projects in the future.

Additionally I will develop tools for the prediction of targets for transcriptional gene silencing in mammals. These tools will facilitate the development of hypotheses by the wider scientific community that will enable the functional annotation of endo-siRNAs and the more rapid exploitation of this novel field for the development of siRNAs as experimental tools and potential therapeutics and diagnostics. As a consequence this software may have considerable impact on the wider understanding of fundamental biological concepts, such as the regulation of cellular epigenetics and splicing, and would also aid commercial biotech and pharmaceutical companies in the development of products.

Through my public engagement efforts I hope to be able to interact both directly and indirectly with individuals with very different skill sets and a wide range of interests. I believe that the project presented here represents a great example of the challenges and prospects currently emerging in biology due to the volume of data being generated and how these can be addressed through interdisciplinary research. I hope this will encourage scientists at every career stage to develop a multidisciplinary approach to biological research and demonstrate the merits of developing both computer and laboratory skill sets as part of their studies.

By conducting this project I will develop skills I deem essential to my future as an academic lab leader. This project will allow me to develop a better understanding of a range of computational techniques and first-hand experience of handling a novel set of experimental protocols. It will also give me the opportunity to further develop my abilities as a project manager. I regard the prospect of running a lab conducting both experimental and computational research as a challenge, and this project will allow me to become more proficient at directly marrying the two environments before taking the next step in my career.
 
Title mirnovo 
Description Mirnovo is a tool designed to predict miRNAs from small RNA sequencing data. It can predict these either with or without an available genomic sequence using machine learning (Random Forests). The tool can be used to make predictions in either plants or animals. 
Type Of Material Data analysis technique 
Year Produced 2017 
Provided To Others? Yes  
Impact Mirnovo is available as either a web-based application or a stand-alone tool. It will be of particular use for miRNA prediction in situations where a fully assembled genome sequence is not yet prepared for the species concerned. 
URL https://github.com/dvitsios/mirnovo
 
Description Early access to relevant RNA-seq data from the Wellcome Trust Sanger Institute 
Organisation The Wellcome Trust Sanger Institute
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution Bioinformatic analysis of RNA-sequencing data.
Collaborator Contribution Providing early access for relevant RNA-sequencing data generated in the laboratory.
Impact This collaboration is multi-disciplinary involving both wet-lab and bioinformatic research. Publication: PMID: 29144233
Start Year 2015
 
Description Experimental partnership with the Sir William Dunn School of Pathology 
Organisation University of Oxford
Department Sir William Dunn School of Pathology
Country United Kingdom 
Sector Academic/University 
PI Contribution Computational expertise for the development of novel methods and analyses.
Collaborator Contribution Experimental and biological expertise, data and access to laboratory facilities for the generation of new samples.
Impact This is a multidisciplinary collaboration between an experimental lab at the Sir William Dunn School of Pathology, Oxford University, and a computational lab at the EMBL - European Bioinformatics Institute.
Start Year 2014
 
Title Continued development of SequenceImp 
Description This pipeline is part of the Kraken suite of tools for the QC and analysis of small RNA sequencing samples. This pipeline itself was initially released and published in 2013. However, additional funding has allowed me to further develop and adapt the software, including the development of a parallel annotation pipeline. As such users can more easily apply the pipeline to a wider selection of species. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Currently, I have applied these more recent developments to my own analyses and applications in-house. The new annotation pipeline is also available online for use by the wider community. 
URL https://github.com/davis-m/SequenceImp
 
Description Mentor at EMBL-EBI/Wellcome Trust bioinformatics summer school 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Worked in a team to design and mentor a project for the EMBL-EBI/Wellcome Trust bioinformatics summer school for a small number of researchers. Introduced them to various bioinformatics tools and computational analyses and in particular tools and resources associated with analysing small RNA sequencing.
Year(s) Of Engagement Activity 2015
 
Description Trainer for small RNA-seq analysis course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact Assistant on a course to train staff and students from Cambridge University and elsewhere in the analysis of small RNA sequencing data. There was much interest in using the tools and techniques taught in relation to future work.
Year(s) Of Engagement Activity 2017
URL https://training.csx.cam.ac.uk/bioinformatics/course/bioinfo-smallRNA