Rfam: The community resource for RNA families

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

DNA encodes the genetic information that is transferred from parents to their offspring. When required, DNA is first transcribed into RNA, which is then translated into proteins that do useful work inside the cells. But many RNAs do much more than merely act as messengers between genes and proteins. These non-coding RNAs (ncRNAs; because they do not "code" for proteins) are found in all living things, and many of them are essential for survival. There are many types of ncRNAs; for example, ncRNA is at the heart of the ribosome, the molecular machine that synthesises all proteins in our bodies.

Importantly, when scientists encounter an RNA sequence, they need a reliable tool to identify this RNA and its function. Moreover, it is necessary to find the constituent RNA parts whenever a new genome is sequenced. The Rfam database was created to meet these needs: it is an online resource that groups related ncRNAs into families, each represented by a statistical model that allows the detection of other members of the same family. Since its inception in 2002, Rfam has expanded from ~100 families to nearly 3,000 families today, reflecting the growth of the ncRNA field. Rfam has been used the world over in thousands of studies spanning many biology disciplines. For example, Rfam was used to find ncRNAs in important crops like rice and sugar beet when their genomes were first sequenced. However, it is important to keep Rfam up to date because new RNAs are constantly being discovered and additional information is gleaned about already known ncRNAs. We will collaborate with the RNA community to accomplish the following objectives:

(1) We will focus on updating some of the most important RNA families for which at least one 3D structure has been determined. The 3D structure can show us which parts of a long RNA sequence are close to each other in 3D space. With this knowledge, we can predict how the sequence may change while still forming the same 3D shape. While Rfam has some of this information, it is not as accurate as what is known from 3D structures. By integrating 3D data into Rfam, scientists will be able to write new computer programs that can predict RNA 3D structure from sequence.

(2) We will create a complete collection of a type of ncRNA called microRNAs, which are short RNA sequences that control the amounts of different proteins in the body. Since problems with microRNAs are linked to cancer, it is important to be able to discover these in genomes and identify which ones are related. We will collaborate with the miRBase developers at the University of Manchester to synchronise the microRNA families contained within the two databases. Although miRBase is complete, it does not have the tools to maintain the families, while the opposite holds true for Rfam. By working together, we will create a single, complete collection of microRNA families so as to facilitate the discovery of microRNAs in new genomes using Rfam.

(3) We will create more families based on RNAs found in viruses. Many viruses use RNA structures to infect, reproduce, or avoid the host immune response. Rfam has a small number of viral families, mostly dating from a decade ago. We will update them by working with the virologists from the European Virus Bioinformatics Center who have compiled a set of conserved viral RNA structures. Scientists will then be able to use Rfam to detect viruses in sequences and study their RNA structures.

We will also regularly update the Rfam website, respond to user queries, and attend conferences to meet colleagues and share resource developments. Collectively, this work will further enhance the functionality and utility of a powerful resource and cement Rfam's central status in the field of RNA research worldwide.

Technical Summary

Established in 2002, Rfam is a database of RNA families that contains manually curated multiple sequence alignments and covariance models that can be used to find RNAs in genomic sequences. Rfam data has been widely used by the RNA community for genome annotation and algorithm development. In this proposal, we will develop Rfam to address the needs of three important and diverse user communities. First, we will enhance the annotations of all RNA families with known 3D structures by incorporating more accurate consensus secondary structures and pseudoknots based on experimentally determined structures. We will also employ the newly released R-scape software to improve secondary structures based on covariation analysis even in the absence of 3D data. Second, we will collaborate with the miRBase microRNA database to develop a comprehensive set of microRNA precursor families. This will enable miRBase to use Rfam to maintain the microRNA family classification, and Rfam will be able to annotate sequences with microRNA families from miRBase. Third, we will work with the European Virus Bioinformatics Center to expand the coverage of conserved viral RNA structures. By creating a comprehensive set of viral RNA families, we will enable scientists to detect viral sequences (this is particularly applicable to metagenomic datasets), as well as improve our understanding of viral recombination. Altogether, these efforts will expand the number of families by over 80%. The improvements gained by collaborating closely with these three user communities will benefit Rfam users overall. We will disseminate information about the latest Rfam developments by engaging in outreach and training activities, including Docker-based tutorials that simplify access to the Rfam software. These combined new developments will enable Rfam to spearhead a global effort aimed at understanding the biological functions and roles of ncRNAs.
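The idea behind covariation analysis can be illustrated with a minimal Python sketch. It is not the statistical test implemented in R-scape, and it uses no Rfam or Infernal code: the function names, the toy alignment, and the simple "compensatory change" rule below are all assumptions made for illustration. Given an ungapped alignment and a consensus secondary structure in dot-bracket notation, it reports the base pairs where both positions change between sequences while canonical pairing is preserved.

```python
# Toy illustration only: a naive check for compensatory base-pair changes.
# This is NOT R-scape's method; names and data here are hypothetical.

# Watson-Crick pairs plus G-U wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(structure):
    """Return sorted (i, j) index pairs from a dot-bracket string (no pseudoknots)."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def covarying_pairs(alignment, structure):
    """Return pairs that stay canonical in every sequence and show at least
    one double (compensatory) substitution between two sequences."""
    result = []
    for i, j in parse_pairs(structure):
        observed = {(seq[i], seq[j]) for seq in alignment}
        if not observed <= CANONICAL:
            continue  # at least one sequence cannot form this pair
        compensatory = any(
            a != c and b != d for (a, b) in observed for (c, d) in observed
        )
        if compensatory:
            result.append((i, j))
    return result


# Hypothetical three-sequence alignment of a small hairpin.
structure = "(((...)))"
alignment = ["GCGAAACGC", "GGGAAACCC", "AUGAAACAU"]

print(covarying_pairs(alignment, structure))  # [(0, 8), (1, 7)]
```

In this toy example the two outer pairs change identity (for instance G-C to A-U) yet remain pairable, which is the kind of signal that supports a proposed base pair, whereas the innermost pair is invariant and so carries no covariation evidence either way.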

Planned Impact

Rfam is a resource that supports researchers working across all BBSRC strategic priorities, but primarily data-driven biology and systems approaches to the biosciences. Rfam will be used extensively by the life sciences community, including bioinformaticians, wet lab researchers, and clinicians. The huge growth in data produced by new sequencing technologies means that it is now more important than ever to empower researchers with tools and resources that help them interpret their data and produce a complete listing of all biological entities found within it.

Rfam is the only resource currently capable of identifying a wide range of non-coding RNA homologs in sequence data, which will be of great benefit to scientists analysing newly sequenced genomes and to all model organism databases, from FlyBase and PomBase to the more diverse Ensembl and Ensembl Genomes. Moreover, a subset of Rfam models is also being used within the field of metagenomics for annotating tRNAs and rRNAs at scale (e.g. MGnify). Many of the resources benefiting from the Rfam data are based in the UK, thus contributing to the UK's international reputation as a leader in bioscience.

In addition to benefiting all Rfam users by continuing the development of a widely used community resource, the specific changes proposed in this project will have a beneficial impact on three specialised Rfam user communities. First, Rfam is used for developing and testing new algorithms for RNA 2D and 3D structure prediction. The improvements in Rfam annotations using the information from RNA 3D structures will translate into improvements in the accuracy of software developed using Rfam. Second, thousands of miRBase users will benefit from an enhanced classification of microRNAs powered by Rfam. In addition, the new Rfam microRNA annotations will be used by the resources that rely on Rfam for genome annotation, such as the Ensembl, Ensembl Genomes, and NCBI Eukaryotic Gene Annotation pipelines. Third, the expansion of viral RNA families in Rfam will benefit the European Virus Bioinformatics Center, including its UK members, and the rest of the virology research community. Conserved viral RNA structures are essential for various stages of the viral life cycle, for protection against exonucleases, and for avoiding the immune response (e.g. an alternatively folding RNA structure in the 3'-UTR of dengue virus modulates immune reactions in both humans and insects). Having a comprehensive library of viral families in Rfam will enable the detection of these RNA structures in viral and metagenomic sequences.

Rfam data can ensure scientists have a more complete picture of the "parts list" involved in constructing each genome and better understand the roles that ncRNAs play in gene regulation. We have only recently begun to understand the role that ncRNAs play in health and disease. For example, microRNAs are deregulated in cancer, snoRNAs are silenced in Prader-Willi syndrome, while plant microRNAs play important roles in immune responses against viruses. There are also significant research efforts into RNA-based therapeutics, which are promising tools to improve health and welfare. The Innovate UK Medicines Discovery Catapult has an ongoing project to identify novel therapeutic targets, one of which is specifically aimed at RNA families. Rfam is a crucial resource for such studies, allowing similarities in RNAs between organisms to be studied and providing researchers with search tools to identify previously unknown ncRNA homologs.

Publications

 
Title Rfam 
Description The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). The families in Rfam break down into three broad functional classes: non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs. These functional RNAs often have a conserved secondary structure which may be better preserved than the RNA sequence. The CMs used to describe each family are a slightly more complicated relative of the profile hidden Markov models (HMMs) used by Pfam. CMs can simultaneously model RNA sequence and structure in an elegant and accurate fashion. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact Rfam has enabled facile annotation of genomes with a large variety of non-coding RNAs. 
URL http://rfam.xfam.org/
 
Description European Virus Bioinformatics Center (Manja Marz) 
Organisation Friedrich Schiller University Jena (FSU)
Department European Virus Bioinformatics Center
Country Germany 
Sector Academic/University 
PI Contribution Rfam is importing the data from the EVBC to create new Rfam families.
Collaborator Contribution The EVBC is the source of new viral RNA families for Rfam.
Impact We are prototyping the import of new RNA families from EVBC to Rfam.
Start Year 2019
 
Description Harvard University (Elena Rivas and Sean Eddy) 
Organisation Harvard University
Country United States 
Sector Academic/University 
PI Contribution Rfam began using the R-scape software developed by Elena Rivas and Sean Eddy at Harvard University. The feedback provided by the Rfam team led to improvements in both R-scape and Infernal.
Collaborator Contribution Elena Rivas and Sean Eddy are involved in the development of Infernal, which is a key piece of software used by Rfam. They also developed R-scape, which is a new tool for evaluating and improving Rfam families.
Impact DOI:10.1093/nar/gkx1038
 
Description NCBI - Eric Nawrocki 
Organisation National Center for Biotechnology Information (NCBI)
Country United States 
Sector Public 
PI Contribution The Rfam team provides feedback about the Infernal software to Dr Eric Nawrocki, who develops Infernal and is based at NCBI. The feedback helps to improve Infernal and guide its development.
Collaborator Contribution Dr Eric Nawrocki is the main developer of the Infernal software that Rfam relies on to identify non-coding RNAs. Dr Nawrocki helps us to use Infernal efficiently and assists with Infernal-related queries sent to the Rfam help desk.
Impact doi:10.1093/nar/gkx1038
Start Year 2014
 
Description Public outreach at a Girlguiding event in Ely College 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Ioanna Kalvari and Anton Petrov participated in a County STEM Day organised by Girlguiding Cambridgeshire East. The event reached ~100 Year 5-8 students who learned about Rfam and participated in the RNA Scanner activity, developed specifically to explain what Rfam is and what RNA families are.
Year(s) Of Engagement Activity 2020
 
Description Rfam poster at RNA UK 2020 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Ioanna Kalvari presented a poster about Rfam at the RNA UK meeting, raising awareness of the resource.
Year(s) Of Engagement Activity 2020