Computational prediction and analysis of long non-coding RNAs

Lead Research Organisation: European Bioinformatics Institute
Department Name: Enright Group

Abstract

The sequencing of the human genome has created a new era in biological research. Understanding our genome and how it is regulated is one of the great challenges for science, and has the potential to help improve lives and our ability to treat diseases. The advent of this genomic age has heralded rapid changes in the field of biology. One surprise from the initial sequencing of the genome was the relative scarcity of genomic regions which can be read to produce proteins via RNA intermediates. Proteins are the building blocks of cells and many important molecular machines are composed of proteins. The non-protein-coding part of the genome was previously dismissed in some circles as largely containing 'junk DNA'. In the last ten years a number of breakthroughs in genome analysis and genome sequencing have shed light on many hitherto unknown aspects of biology being carried out by these non-protein-coding regions.

Novel technologies such as genome tiling arrays and high-throughput RNA sequencing have shown that although large portions of the genome may not be coding for protein sequences, they are still being read as RNA messages. The discovery of small RNA molecules such as small-interfering RNAs and microRNAs illustrated that many of these non-coding messages were being processed within cells and used to regulate other genes (both protein-coding and non-coding). Within testes and oocytes (the germline) another class of small RNAs called piwi-RNAs was discovered and shown to have an important role in protecting the genome as it passes from one generation to the next. Recently, attention has turned to larger non-coding transcripts called long non-coding RNAs (lncRNAs). We know that the genome encodes many long RNA molecules which do not appear to encode proteins. A central dogma of biology has always been that DNA is read into RNA messages which subsequently encode proteins. This elegant view of molecular biology is still largely true, but the last ten years of research have revealed many hidden layers to this view of gene regulation at the level of both DNA and RNA. Discovering how different classes of molecules work together is vital to our understanding of how our genome is regulated and how cells and organisms function, and has tremendous implications for our understanding of development and disease.

In this proposal we aim to build a computational system that will be able to detect candidate lncRNAs from RNA sequence data obtained from experimental samples. We aim to collect, score and characterise these molecules and to present them in a web interface for further analysis. We will use computational biology to attempt to find cases where these molecules may interact with each other, with protein-coding genes or with the genome itself to control gene regulation. Using computers allows us to work with a large quantity of data quickly and efficiently; however, laboratory experiments are required to confirm and expand these results. We will work with a mouse laboratory and a fruit fly (Drosophila melanogaster) laboratory to confirm our findings and to test the importance of these molecules by knocking them out. We will study what happens to these molecules as the embryo develops and as red blood cells develop to see how their spatial and temporal expression is regulated. We will also attempt to discover what other molecules (such as proteins) may be binding to them.

We believe that this project has the potential to greatly increase our understanding of these elusive molecules and of the organisation of the genome, and to assist ourselves and others in elucidating their roles in biology, health and disease.

Technical Summary

Recent advances in genome sequencing and high-throughput functional genomics have shown that the genome is pervasively transcribed. In particular, non-coding RNA has recently come into the limelight as providing a platform for novel layers of gene regulation that have been largely overlooked. Work on microRNAs and piwi-RNAs in particular has shown how the expression of large numbers of molecules (mRNAs and transposons) can be targeted and regulated by very short RNA molecules via a complex system of RNA-binding proteins and other molecules.

This proposal focuses on long non-coding RNAs (lncRNAs), which are >200 nt long and lack a functional open reading frame. While a number of these molecules have been studied over the years, it is only relatively recently that high-throughput sequencing and expression analysis have shown how many non-protein-coding transcripts are being actively transcribed. Many functions have been proposed for these molecules, including antisense regulation and the blocking of regulatory regions. We propose to develop a computational system for the detection, characterisation and functional analysis of these molecules from next-generation sequencing data. This system will process sequence reads from RNA-seq experiments, clean and filter reads and assemble overlapping reads into likely lncRNA transcripts. These candidate molecules will be categorised according to their genomic context and we will attempt to detect cases where they may regulate other transcripts via antisense binding. The computational part of the proposal aims to produce a detailed computational pipeline and web resource for analysis of lncRNAs. The experimental part of the proposal aims to validate these candidates, obtain phenotypic information from knockouts and identify bound protein complexes which may mediate their function. We will assess the developmental profiles of lncRNAs across timecourses from Drosophila embryonic development and mouse erythroid development.
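
The working definition above (>200 nt, no functional open reading frame) can be illustrated with a minimal Python sketch. This is not the project's pipeline: the function names are hypothetical, the 100-codon ORF cutoff is an assumed threshold that does not come from the proposal, and a real system would also use coding-potential scoring and genomic context. Only the >200 nt length criterion is taken from the text.

```python
# Illustrative filter for candidate lncRNAs among assembled transcripts.
MIN_LENGTH_NT = 200    # length criterion stated in the text (>200 nt)
MAX_ORF_CODONS = 100   # ASSUMED cutoff for a "functional" ORF; not from the source

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq: str) -> int:
    """Length in codons (start codon included, stop excluded) of the longest
    ATG-to-stop ORF in the three forward frames. ORFs without a stop codon
    are ignored in this simplified sketch."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i          # first in-frame start codon
            elif codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None           # look for the next ORF in this frame
    return best


def is_candidate_lncrna(seq: str) -> bool:
    """True if the transcript is longer than 200 nt and has no long ORF."""
    return len(seq) > MIN_LENGTH_NT and longest_orf_codons(seq) < MAX_ORF_CODONS


if __name__ == "__main__":
    transcripts = {
        "tx_noncoding": "ACGT" * 60,                   # 240 nt, no ATG
        "tx_coding": "ATG" + "GCT" * 150 + "TAA",      # 151-codon ORF
        "tx_short": "ACGT" * 10,                       # 40 nt
    }
    for name, seq in transcripts.items():
        print(name, is_candidate_lncrna(seq))
```

Scanning only the forward strand assumes strand-specific transcripts; for unstranded assemblies the reverse complement would need to be checked as well.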

Planned Impact

The main impact of this research will be to broaden our knowledge of the genome and its regulatory mechanisms. Our beneficiaries will include biologists, clinicians and scientists working in industry. The development of computational tools and resources has the potential to increase their productivity by making their own scientific analysis quicker, easier and more reliable.

This proposal requires very specific skills and will contribute to the knowledge economy of the United Kingdom through the training of researchers and students in our laboratory and through training courses that we run, including EMBO and Wellcome Trust Advanced courses.

The European Bioinformatics Institute outreach team promotes our science and research to the general public, school children and university students. We participate in school visits and open days in Cambridge and at our campus. Through events such as these we aim to interest children and adults alike in science and its benefits.

It is possible that a breakthrough made during this research would be patentable or possible to commercialise. We have a technology transfer office at EMBL Heidelberg who are in a position to assess these situations and make recommendations where appropriate. We will explore commercialisation opportunities if they arise during this research and do not impact on our open-source, open-data policies described in the case for support.

Long non-coding RNAs have already been shown to play roles in disease. We have a number of collaborations with clinical groups studying heart disease, obesity and cancer. Any finding of medical relevance obtained during this project will be assessed and discussed with clinical groups at Addenbrooke's Hospital (University of Cambridge) and other clinical institutions. We also have links to industry and pharmaceutical companies, and such research may be of real benefit to their own ongoing research programmes. It is likely that any therapeutic or diagnostic outcomes of research such as this would be of great benefit to society at large.

Description We have developed a comprehensive pipeline for the detection of long non-coding RNAs.

We have developed a comprehensive atlas of transcription in the mouse germline with our collaborators.

We have gained novel insights into the function of non-coding transcripts in controlling gene regulation in mouse spermatogenesis.

Our findings are relevant to sperm development in mammalian systems, including human, with possible relevance to diseases such as infertility or germ cell tumours.
Exploitation Route Our pipeline is already being used by other groups to perform their own analysis.
Others may look for clinical relevance of our mouse findings to human disease. Uptake of the pipeline has been rapid and we have recently used this approach to discover a large cohort of transposon-driven lncRNAs during germline development, with an EMBO Reports publication.
Sectors Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://www.ebi.ac.uk/research/enright/software/chimira
 
Description The pipelines and software we have created are already being used by large numbers of scientists in the UK, EU and worldwide. We already have interest from clinical collaborators in the application of our systems to medical samples including paediatric germ cell tumours. We have recently published two significant papers based on this research. The first shows how transposons drive the expression of large numbers of lncRNAs during germline development and also in other tissues. More recently, through a collaboration with the Curie Institute (Alena Shkumatava), we have worked on showing the effect of miRNAs and lncRNAs on the behaviour of zebrafish. The first paper is in EMBO Reports (2017); the second is in Nature Structural & Molecular Biology (2018, in press).
First Year Of Impact 2016
Sector Education, Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Societal, Economic

 
Description UK BioBank - Ethics and Governance Committee
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
Impact I became a member of the UK BioBank ethics and governance committee. This committee works with funders (Wellcome, MRC, BBSRC) to make sure that the data of over 500,000 participants is being used effectively and ethically.
URL https://egcukbiobank.org.uk
 
Description MRC Methodology Fellowship
Amount £155,756 (GBP)
Funding ID MR/L012367/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 05/2015 
End 10/2018
 
Title ChimiRa 
Description An online pipeline for the comprehensive analysis of small RNAs. It is aimed at biologists and operates over the web on EBI servers. It provides end-to-end analysis designed with biologists in mind. This pipeline has enabled a lot of epitranscriptomic research into microRNA modification and editing and has been published recently. 
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact The tool is now widely used (>500 users per year). It is beginning to pick up citations and was the first available tool that was entirely web-based and allowed analysis of microRNA uridylation and other modifications. 
URL http://www.ebi.ac.uk/enright-srv/chimira/
 
Title Kraken Pipeline 
Description This is a comprehensive pipeline for the analysis of small RNA sequencing data. 
Type Of Material Biological samples 
Year Produced 2013 
Provided To Others? Yes  
Impact We have a relatively large group of researchers around the UK and Europe using the tool to facilitate their own analyses. 
URL http://www.ebi.ac.uk/research/enright/software/kraken
 
Title Catalogue of transcription in the Mouse Germline - Spermatogenesis 
Description A collection of: large-scale NGS data for mRNAs; large-scale NGS data for long non-coding RNAs; small RNA NGS data for microRNAs and piwi-RNAs. 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact This will be the most complete catalogue of transcription in the mouse germline. 
URL http://wwwdev.ebi.ac.uk/enright-srv/krakenbot/
 
Description Long non-coding RNA collaborations with 
Organisation Curie Institute Paris (Institut Curie)
Country France 
Sector Academic/University 
PI Contribution Thanks to this award we have begun a successful collaboration with Dr. Alena Shkumatava at the Curie Institute in Paris. We have used the pipelines developed under this award to assess and analyse long non-coding RNAs in her zebrafish models, which has resulted in a Nature Structural & Molecular Biology paper (accepted).
Collaborator Contribution They have investigated the roles of particular microRNA-lncRNA interactions and their effect on behaviour. This has profound implications for non-coding RNAs in higher organisms.
Impact "Defective germline reprogramming rewires the spermatogonial transcriptome". Nature Structural & Molecular Biology 2018 (Accepted)
Start Year 2016
 
Description Long non-coding RNAs in the Mouse Germline 
Organisation European Molecular Biology Laboratory
Department European Molecular Biology Laboratory Monterotondo
Country Italy 
Sector Public 
PI Contribution We are providing all the computational analysis.
Collaborator Contribution They are providing sample preparation, library preparation for sequencing and biological validation.
Impact Multiple publications have resulted from this collaboration, including papers in Nature, Molecular Cell and the Journal of Experimental Medicine.
Start Year 2011
 
Description EMBO Course - RNA Sequencing Analysis 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This is the EMBO course with primary responsibility for teaching a large cohort of postgraduate and undergraduate participants the fundamentals of RNA-seq analysis. We presented and taught our small RNA pipelines and discussed our BBSRC-funded activities for lncRNA research and small RNA analysis.
Year(s) Of Engagement Activity 2013,2014,2015,2016
URL http://www.embo.org/events/practical-courses
 
Description EMBO Course - microRNA Profiling 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact EMBO Workshop on profiling small RNAs. We teach participants how to use our BBSRC-funded pipelines for research into small RNAs and long non-coding RNAs. This event occurs annually at a different European location.
Year(s) Of Engagement Activity 2012,2013,2014,2015,2016
URL http://www.embo.org/events/practical-courses
 
Description Talk - EMBO Symposium on Systems Biology of Long non-coding RNAs 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presented our BBSRC-funded work on long non-coding RNAs at an EMBO workshop at the Weizmann Institute in Rehovot, Israel.
Year(s) Of Engagement Activity 2016
URL http://events.embo.org/16-ncrna/
 
Description Wellcome Trust Course - Functional Genomics and Systems Biology 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Taught participants how to analyse RNA-seq data for small RNAs, mRNAs and long non-coding RNAs. Presented outcomes from our BBSRC-funded research into pipeline development and long non-coding RNAs.

This is a recurring course for which the PI is an organiser and instructor.
Year(s) Of Engagement Activity 2012,2013,2014,2015,2016
URL https://registration.hinxton.wellcome.ac.uk/home.wt