Rfam: the community resource for RNA families

Lead Research Organisation: University of Manchester
Department Name: School of Biological Sciences

Abstract

DNA encodes the genetic information that is transferred from parents to their offspring. When required, DNA is first transcribed into RNA, which is then translated into proteins that do useful work inside cells. But many RNAs do much more than merely act as messengers between genes and proteins. These non-coding RNAs (ncRNAs, so called because they do not "code" for proteins) are found in all living things, and many of them are essential for survival. There are many types of ncRNA; for example, ncRNA is at the heart of the ribosome, the molecular machine that synthesises all proteins in our bodies.

Importantly, when scientists encounter an RNA sequence, they need a reliable tool to identify the RNA and its function. Moreover, the constituent RNA parts must be found whenever a new genome is sequenced. The Rfam database was created to meet these needs. It is an online resource that groups related ncRNAs into families, each represented by a statistical model that allows the detection of other members of the same family. Since its inception in 2002, Rfam has expanded from ~100 families to nearly 3,000 families today, reflecting the growth of the ncRNA field. Rfam has been used the world over in thousands of studies spanning many biological disciplines; for example, it was used to find ncRNAs in important crops such as rice and sugar beet when their genomes were first sequenced. However, Rfam must be kept up to date because new RNAs are constantly being discovered and additional information is being gleaned about already known ncRNAs. We will collaborate with the RNA community to accomplish the following objectives:

(1) We will focus on updating some of the most important RNA families for which at least one 3D structure has been determined. A 3D structure shows which parts of a long RNA sequence are close to each other in 3D space. With this knowledge, we can predict how the sequence may change while still forming the same 3D shape. While Rfam has some of this information, it is not as accurate as what is known from 3D structures. Once 3D data are integrated into Rfam, scientists will be able to write new computer programs that predict RNA 3D structure from sequence.

(2) We will create a complete collection of a type of ncRNA called microRNAs, which are short RNA sequences that control the amounts of different proteins in the body. Since problems with microRNAs are linked to cancer, it is important to be able to discover them in genomes and identify which ones are related. We will collaborate with the miRBase developers at the University of Manchester to synchronise the microRNA families contained within the two databases. Although miRBase is comprehensive, it lacks the tools to maintain the families, whereas Rfam has those tools but an incomplete set of microRNA families. By working together, we will create a single, complete collection of microRNA families to facilitate the discovery of microRNAs in new genomes using Rfam.

(3) We will create more families based on RNAs found in viruses. Many viruses use RNA structures to infect, reproduce, or avoid the host immune response. Rfam has a small number of viral families, mostly dating from a decade ago. We will update them by working with the virologists from the European Viral Bioinformatics Center who have compiled a set of conserved viral RNA structures. Scientists will then be able to use Rfam to detect viruses in sequences and study their RNA structures.

We will also regularly update the Rfam website, respond to user queries, and attend conferences to meet colleagues and share resource developments. Collectively, this work will further enhance the functionality and utility of a powerful resource and cement Rfam's central status in the field of RNA research worldwide.

Technical Summary

Established in 2002, Rfam is a database of RNA families that contains manually curated multiple sequence alignments and covariance models that can be used to find RNAs in genomic sequences. Rfam data has been widely used by the RNA community for genome annotation and algorithm development. In this proposal, we will develop Rfam to address the needs of three important and diverse user communities. First, we will enhance the annotations of all RNA families with known 3D structures by incorporating more accurate consensus secondary structures and pseudoknots based on experimentally determined structures. We will also employ the newly released R-scape software to improve secondary structures based on covariation analysis even in the absence of 3D data. Second, we will collaborate with the miRBase microRNA database to develop a comprehensive set of microRNA precursor families. This will enable miRBase to use Rfam to maintain the microRNA family classification, and Rfam will be able to annotate sequences with microRNA families from miRBase. Third, we will work with the European Viral Bioinformatics Center to expand the coverage of conserved viral RNA structures. By creating a comprehensive set of viral RNA families, we will enable scientists to detect viral sequences (this is particularly applicable to metagenomic datasets) and improve our understanding of viral recombination. Altogether, these efforts will expand the number of families by over 80%. The improvements gained by collaborating closely with these three user communities will benefit Rfam users overall. We will disseminate information about the latest Rfam developments through outreach and training activities, including Docker-based tutorials that simplify access to the Rfam software. These combined developments will enable Rfam to spearhead a global effort aimed at understanding the biological functions and roles of ncRNAs.
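
As a toy illustration of the covariation signal mentioned above, the sketch below scores how strongly two alignment columns vary together using plain mutual information. This is a deliberate simplification and not the statistic or workflow used by R-scape, which applies its own measures and significance testing. The function name `mutual_information` and the five-column `toy_alignment` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length, already aligned RNA strings.
    High values suggest the two columns change together, as expected for
    positions that form a base pair.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    i_counts = Counter(seq[i] for seq in alignment)
    j_counts = Counter(seq[j] for seq in alignment)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = i_counts[a] / n
        p_b = j_counts[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: the first and last columns always form a
# complementary pair (G-C, C-G, A-U, U-A); the middle columns never vary.
toy_alignment = [
    "GAAAC",
    "CAAAG",
    "AAAAU",
    "UAAAA",
]

paired_score = mutual_information(toy_alignment, 0, 4)
invariant_score = mutual_information(toy_alignment, 1, 2)
print(paired_score, invariant_score)
```

In this toy case the compensating columns score well above the invariant ones, which is the kind of evidence that supports a proposed base pair. Real alignments need corrections for phylogeny and sample size, which this sketch omits.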

Planned Impact

Rfam supports researchers working across all BBSRC strategic priorities, primarily data-driven biology and systems approaches to the biosciences. Rfam will be used extensively by the life sciences community, including bioinformaticians, wet-lab researchers, and clinicians. The huge growth in data produced by new sequencing technologies means that it is now more important than ever to empower researchers with tools and resources that help them interpret their data and obtain a complete listing of all the biological entities found within it.

Rfam is the only resource currently capable of identifying a wide range of non-coding RNA homologs in sequence data, which will be of great benefit to scientists analysing newly sequenced genomes and to all model organism databases, from FlyBase and PomBase to the more comprehensive Ensembl and Ensembl Genomes. Moreover, a subset of Rfam models is also used in the field of metagenomics for annotating rRNAs at scale (e.g. MGnify). Many of the resources benefiting from Rfam data are based in the UK, thus contributing to the UK's international reputation as a leader in bioscience.

In addition to benefiting all Rfam users by continuing the development of a widely used community resource, the specific changes proposed in this project will have a beneficial impact on three specialised Rfam user communities. First, Rfam is used for developing and testing new algorithms for RNA 2D and 3D structure prediction. Improving Rfam annotations with information from RNA 3D structures will translate into improvements in the accuracy of software developed using Rfam. Second, thousands of miRBase users will benefit from an enhanced classification of microRNAs powered by Rfam. In addition, the new Rfam microRNA annotations will be used by the resources that rely on Rfam for genome annotation, such as the Ensembl, Ensembl Genomes, and NCBI Eukaryotic Gene Annotation pipelines. Third, the expansion of viral RNA families in Rfam will benefit the European Viral Bioinformatics Center, including its UK members, and the rest of the virology research community. Conserved viral RNA structures are essential for various stages of the viral life cycle, for protection against exonucleases, and for avoiding the immune response (for example, an alternatively folding RNA structure in the 3'-UTR of dengue virus modulates immune reactions in both humans and insects). Having a comprehensive library of viral families in Rfam will enable the detection of these RNA structures in viral and metagenomic sequences.

Rfam data can give scientists a more complete picture of the "parts list" involved in constructing each genome and a better understanding of the roles that ncRNAs play in gene regulation. We have only recently begun to understand the role that ncRNAs play in health and disease. For example, microRNAs are deregulated in cancer, snoRNAs are silenced in Prader-Willi syndrome, and plant microRNAs play important roles in immune responses against viruses. There are also significant research efforts into RNA-based therapeutics, which are promising tools to improve health and welfare. The Innovate UK Medicines Discovery Catapult has an ongoing project to identify novel therapeutic targets, one of which is specifically aimed at RNA families. Rfam is a crucial resource for such studies, allowing similarities in RNAs between organisms to be studied and providing researchers with search tools to identify previously unknown ncRNA homologs.

Title: Rfam database of RNA families
Description: Rfam is a collection of multiple sequence alignments and covariance models representing families of non-protein coding RNA sequences.
Type Of Material: Database/Collection of data
Provided To Others?: Yes
Impact: Rfam is a core RNA bioinformatics resource, used by thousands of RNA researchers around the world.
URL: https://rfam.org/