Rfam: The community resource for RNA families

Lead Research Organisation: European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

DNA encodes the genetic information that is transferred from parents to their offspring. When required, DNA is first transcribed into RNA, which is then translated into proteins that do useful work inside the cells. But many RNAs do much more than merely act as messengers between genes and proteins. These non-coding RNAs (ncRNAs; so called because they do not "code" for proteins) are found in all living things, and many of them are essential for survival. There are many types of ncRNAs; for example, ncRNA is at the heart of the ribosome, the molecular machine that synthesises all proteins in our bodies.

Importantly, when scientists encounter an RNA sequence, they need a reliable tool to identify this RNA and its function. Moreover, it is necessary to find the constituent RNA parts whenever a new genome is sequenced. The Rfam database was created to meet these needs: it is an online resource that groups related ncRNAs into families, each represented by a statistical model that allows the detection of other members of the same family. Since its inception in 2002, Rfam has expanded from ~100 families to nearly 3,000 families today, reflecting the growth of the ncRNA field. Rfam has been used the world over in thousands of studies spanning many biological disciplines; for example, Rfam was used to find ncRNAs in important crops like rice and sugar beet when their genomes were first sequenced. However, it is important to keep Rfam up to date because new RNAs are constantly being discovered and additional information is gleaned about already known ncRNAs. We will collaborate with the RNA community to accomplish the following objectives:

(1) We will focus on updating some of the most important RNA families for which at least one 3D structure has been found. The 3D structure can show us which parts of a long RNA sequence are close to each other in 3D space. With this knowledge, we can predict how the sequence may change while still forming the same 3D shape. While Rfam has some of this information, it is not as accurate as what is known from 3D structures. Integrating 3D data into Rfam will enable scientists to write new computer programs that can predict RNA 3D structure from sequence.

(2) We will create a complete collection of a type of ncRNA called microRNAs, which are short RNA sequences that control the amounts of different proteins in the body. Since problems with microRNAs are linked to cancer, it is important to be able to discover these in genomes and identify which ones are related. We will collaborate with the miRBase developers at the University of Manchester to synchronise the microRNA families contained within the two databases. Although miRBase is complete, it lacks the tools to maintain the families, whereas Rfam has the tools but an incomplete collection. By working together, we will create a single, complete collection of microRNA families to facilitate the discovery of microRNAs in new genomes using Rfam.

(3) We will create more families based on RNAs found in viruses. Many viruses use RNA structures to infect, reproduce, or avoid the host immune response. Rfam has a small number of viral families, mostly dating from a decade ago. We will update them by working with virologists from the European Virus Bioinformatics Center who have compiled a set of conserved viral RNA structures. Scientists will then be able to use Rfam to detect viruses in sequences and study their RNA structures.

We will also regularly update the Rfam website, respond to user queries, and attend conferences to meet colleagues and share resource developments. Collectively, this work will further enhance the functionality and utility of a powerful resource and cement Rfam's central status in the field of RNA research worldwide.

Technical Summary

Established in 2002, Rfam is a database of RNA families that contains manually curated multiple sequence alignments and covariance models that can be used to find RNAs in genomic sequences. Rfam data has been widely used by the RNA community for genome annotation and algorithm development. In this proposal, we will develop Rfam to address the needs of three important and diverse user communities. First, we will enhance the annotations of all RNA families with known 3D structures by incorporating more accurate consensus secondary structures and pseudoknots based on experimentally determined structures. We will also employ the newly released R-scape software to improve secondary structures based on covariation analysis even in the absence of 3D data. Second, we will collaborate with the miRBase microRNA database to develop a comprehensive set of microRNA precursor families. This will enable miRBase to use Rfam to maintain the microRNA family classification, and Rfam will be able to annotate sequences with microRNA families from miRBase. Third, we will work with the European Virus Bioinformatics Center to expand the coverage of conserved viral RNA structures. By creating a comprehensive set of viral RNA families, we will enable scientists to detect viral sequences (this is particularly applicable to metagenomic datasets), as well as improve our understanding of viral recombination. Altogether, these efforts will expand the number of families by over 80%. The improvements gained by collaborating closely with these three user communities will benefit Rfam users overall. We will disseminate information about the latest Rfam developments by engaging in outreach and training activities, including Docker-based tutorials that use containers to simplify access to the Rfam software. These combined new developments will enable Rfam to spearhead a global effort aimed at understanding the biological functions and roles of ncRNAs.
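
The covariation idea mentioned in the summary can be illustrated with a toy example. The following is a minimal sketch and not the statistical test that R-scape performs: it reads the paired columns from a dot-bracket consensus structure and lists, for each pair, which canonical base pairs occur in the alignment. The alignment, the structure, and the function names are all invented for illustration.

```python
# Toy illustration of covariation in an RNA alignment.
# NOT the R-scape method: it only lists canonical pairs seen per column pair.

CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def paired_columns(structure):
    """Return sorted (i, j) column index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), idx))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def covariation_summary(alignment, structure):
    """Summarise the base pairs observed at each paired column."""
    summary = []
    for i, j in paired_columns(structure):
        observed = [seq[i] + seq[j] for seq in alignment]
        canonical = [pair for pair in observed if pair in CANONICAL]
        summary.append({
            "columns": (i, j),
            # More than one distinct pair means the two columns changed
            # together while keeping the ability to base-pair.
            "distinct_pairs": sorted(set(canonical)),
            "canonical_fraction": len(canonical) / len(observed),
        })
    return summary


# Hypothetical data: a 3-base-pair stem closing a 4-nucleotide loop.
TOY_STRUCTURE = "(((....)))"
TOY_ALIGNMENT = [
    "GCAGAAAUGC",
    "GGAGAAAUCC",
    "GUAGCAAUAC",
    "GCGGAAACGC",
]

if __name__ == "__main__":
    for row in covariation_summary(TOY_ALIGNMENT, TOY_STRUCTURE):
        print(row)
```

In this toy alignment the middle stem position appears as C-G, G-C, and U-A in different sequences: the sequence changes but the pairing is kept, which is the kind of signal that supports a proposed secondary structure.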

Planned Impact

Rfam is a resource that supports researchers working across all BBSRC strategic priorities, primarily data-driven biology and systems approaches to the biosciences. Rfam will be used extensively by the life sciences community, including bioinformaticians, wet lab researchers, and clinicians. The huge growth in data produced by new sequencing technologies means that it is now more important than ever to empower researchers with tools and resources that help them interpret their data and obtain a complete listing of all biological entities found within it.

Rfam is the only resource currently capable of identifying a wide range of non-coding RNA homologs in sequence data, which will be of great benefit to scientists analysing newly sequenced genomes and to all model organism databases, from FlyBase and PomBase to the more diverse Ensembl and Ensembl Genomes. Moreover, a subset of Rfam models is also being used within the field of metagenomics for annotating tRNAs and rRNAs at scale (e.g. MGnify). Many of the resources benefiting from the Rfam data are based in the UK, thus contributing to the UK's international reputation as a leader in bioscience.

In addition to benefiting all Rfam users by continuing the development of a widely used community resource, the specific changes proposed in this project will have a beneficial impact on three specialised Rfam user communities. First, Rfam is used for developing and testing new algorithms for RNA 2D and 3D structure prediction. The improvements in Rfam annotations based on RNA 3D structure information will translate into improved accuracy of software developed using Rfam. Second, thousands of miRBase users will benefit from an enhanced classification of microRNAs powered by Rfam. In addition, the new Rfam microRNA annotations will be used by the resources that rely on Rfam for genome annotation such as Ensembl, Ensembl Genomes, and NCBI Eukaryotic Gene Annotation pipelines. Third, the expansion of viral RNA families in Rfam will benefit the European Virus Bioinformatics Center, including its UK members, and the rest of the virology research community. Conserved viral RNA structures are essential for various stages of the viral life cycle, for protection against exonucleases, and for avoiding the immune response (e.g. an alternatively folding RNA structure in the 3'-UTR of dengue virus modulates immune reactions in both humans and insects). Having a comprehensive library of viral families in Rfam will enable the detection of these RNA structures in viral and metagenomic sequences.

Rfam data can ensure scientists have a more complete picture of the "parts list" involved in constructing each genome and better understand the roles that ncRNAs play in gene regulation. We have only recently begun to understand the role that ncRNAs play in health and disease. For example, microRNAs are deregulated in cancer, snoRNAs are silenced in Prader-Willi syndrome, while plant microRNAs play important roles in immune responses against viruses. There are also significant research efforts into RNA-based therapeutics, which are promising tools to improve health and welfare. The Innovate UK Medicines Discovery Catapult has an ongoing project to identify novel therapeutic targets, one of which is specifically aimed at RNA families. Rfam is a crucial resource for such studies, allowing similarities in RNAs between organisms to be studied and providing researchers with search tools to identify previously unknown ncRNA homologs.

Publications

 
Description Rfam is a database of non-coding RNA families. Each family is composed of a manually curated sequence alignment, called the seed, a secondary structure and a covariance model. Rfam is commonly used for annotating genomes with non-coding RNAs, including resources such as Ensembl. In this grant we aimed to improve Rfam by increasing its size and improving families. Our specific aims were to add 3D structures to Rfam families, synchronise miRBase and Rfam, and produce important viral RNA families. Over the course of the grant we have achieved these goals. To begin with, we have increased the number of families from 2,791 to 4,108.
Objective 1 was to integrate 3D structure information into Rfam. The 3D structure of molecules is valuable for understanding their function and interactions with other molecules. It is difficult to solve 3D structures of RNA, which makes each structure both valuable and informative. In this aim we planned to connect the limited but rich world of RNA structures with the larger world of RNA alignments. This would help RNA scientists from both communities access more, and more informative, data. We have developed tooling to annotate all PDB structures with the matching Rfam family every week. This information is then used to align the structures into Rfam seed alignments automatically. We then manually review each alignment and update the Rfam family as needed. Using this tooling, we have found 133 families with 3D structures and aligned at least 1 structure into 35 families. These improved seed alignments now include the secondary structure observed in the 3D structure along with the sequence. Additionally, we have provided this information back to PDBe, which uses our data to annotate all RNA structures. These annotations are visible on every PDBe page for RNAs.
Objective 2 was to synchronise Rfam and miRBase. miRBase is the authoritative resource for miRNAs, which have been shown to play an essential regulatory role in eukaryotes. While Rfam had a collection of miRNA families, these were not the same as the curated miRBase families. We sought to correct this mismatch and projected we would create ~1,500 Rfam families. However, we have created or updated ~1,700 families as of release 14.9. Additionally, we developed tooling which has allowed us to continue updating and synchronising families. We now have a nearly complete synchronisation between the two resources and will be able to keep Rfam up-to-date with miRBase in the future.
Objective 3 was to extend the coverage of viral families in Rfam. We have added Rfam families from 3 different viral families, Coronaviridae, Flaviviridae, and Hepacivirus, which produced 36 new Rfam families. Notably, we built special alignments for Sarbecovirus, which includes the SARS-CoV-2 virus responsible for the recent pandemic. In response to the pandemic, Rfam produced a special release, 14.2, which included these families, in collaboration with the Marz group and the European Virus Bioinformatics Center. In addition to SARS-CoV-2, we have families for other important human pathogens such as Hepatitis C.
Exploitation Route The results of this work are available from the Rfam website and can be reused by the scientific community in their own work. For example, new genomic sequences can be annotated with all types of ncRNAs using Rfam.
Sectors Agriculture, Food and Drink; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL https://rfam.org/
 
Description Rfam has been developed and maintained since 2002 and has had scientific and economic impact. A SureChEMBL search identified over 330 patents that mention Rfam, assigned to a variety of organisations including the University of Manchester, Monsanto, Agilent Technologies, Procter and Gamble, MIT, University of California, and others. For example, EP-3929295-A1 describes a method to modulate the functionality of RNA fragments in cells.
First Year Of Impact 2006
Sector Healthcare; Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title Rfam 
Description The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). The families in Rfam break down into three broad functional classes: non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs. These functional RNAs typically have a conserved secondary structure, which may be better preserved than the RNA sequence. The CMs used to describe each family are a slightly more complicated relative of the profile hidden Markov models (HMMs) used by Pfam. CMs can simultaneously model RNA sequence and structure in an elegant and accurate fashion. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact Rfam has enabled facile annotation of genomes with a large variety of non-coding RNAs. 
URL http://rfam.xfam.org/
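
Rfam distributes its seed alignments in Stockholm format, which carries the aligned sequences together with the consensus secondary structure on a `#=GC SS_cons` line. As a rough sketch of how the components described in this entry sit together in one record, the snippet below parses a tiny record. The record itself (family name, sequence names, sequences) is made up for illustration, and `parse_stockholm` is a hypothetical helper, not part of any Rfam or Infernal tool; it handles only the few line types shown.

```python
# Minimal Stockholm reader for illustration only.
# Handles #=GF annotations, the #=GC SS_cons line, and sequence lines.

TOY_RECORD = """\
# STOCKHOLM 1.0
#=GF ID   toy_hairpin
#=GF DE   Made-up hairpin for illustration
seq1         GCAGAAAUGC
seq2         GGAGAAAUCC
#=GC SS_cons <<<....>>>
//
"""


def parse_stockholm(text):
    """Return (annotations, sequences, ss_cons) for one Stockholm record."""
    annotations, sequences, ss_cons = {}, {}, ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":  # end-of-record marker
            break
        fields = line.split(None, 2)
        if fields[0] == "#=GF" and len(fields) == 3:
            annotations[fields[1]] = fields[2]
        elif fields[0] == "#=GC" and len(fields) == 3 and fields[1] == "SS_cons":
            ss_cons += fields[2]
        elif not line.startswith("#"):
            name, aligned = line.split(None, 1)
            # Concatenate so that interleaved blocks are joined per sequence.
            sequences[name] = sequences.get(name, "") + aligned.strip()
    return annotations, sequences, ss_cons


if __name__ == "__main__":
    ann, seqs, ss = parse_stockholm(TOY_RECORD)
    print(ann["ID"], len(seqs), ss)
```

Every aligned sequence has the same length as the `SS_cons` string, so each structure character annotates one alignment column.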
 
Description Bohdan Schneider 
Organisation Academy of Sciences of the Czech Republic
Country Czech Republic 
Sector Academic/University 
PI Contribution We have provided an analysis of the size and scope of RNA alignment data available to researchers interested in producing an AlphaFold for RNA
Collaborator Contribution Bohdan has provided an analysis of how RNA structures have changed over time and the quality of existing structures.
Impact We have drafted a paper describing several of the issues facing creating an AlphaFold for RNA. We hope this paper will push the RNA science community to develop more data and better methods for structure prediction.
Start Year 2022
 
Description European Virus Bioinformatics Center (Manja Marz) 
Organisation Friedrich Schiller University Jena (FSU)
Department European Virus Bioinformatics Center
Country Germany 
Sector Academic/University 
PI Contribution Rfam is importing the data from the EVBC to create new Rfam families.
Collaborator Contribution The EVBC is the source of new viral RNA families for Rfam.
Impact We are prototyping the import of new RNA families from EVBC to Rfam.
Start Year 2019
 
Description Harvard University (Elena Rivas and Sean Eddy) 
Organisation Harvard University
Country United States 
Sector Academic/University 
PI Contribution Rfam began using the R-scape software developed by Elena Rivas and Sean Eddy at Harvard University. The feedback provided by the Rfam team led to improvements in both R-scape and Infernal.
Collaborator Contribution Elena Rivas and Sean Eddy are involved in the development of Infernal, which is a key piece of software used by Rfam. They also developed R-scape, a new tool for evaluating and improving Rfam families.
Impact DOI:10.1093/nar/gkx1038
 
Description Marta Szachniuk 
Organisation Poznan University of Technology
Country Poland 
Sector Academic/University 
PI Contribution We have provided an analysis of the size and scope of RNA alignment data available to researchers interested in producing an AlphaFold for RNA
Collaborator Contribution Marta has provided an analysis of how RNA structure prediction has fared over time as well as general guidance on the paper.
Impact We have drafted a paper describing several of the issues facing creating an AlphaFold for RNA. We hope this paper will push the RNA science community to develop more data and better methods for structure prediction.
Start Year 2022
 
Description NCBI - Eric Nawrocki 
Organisation National Center for Biotechnology Information (NCBI)
Country United States 
Sector Public 
PI Contribution The Rfam team provides feedback about the Infernal software to Dr Eric Nawrocki, who develops Infernal and is based at NCBI. The feedback helps to improve Infernal and guide its development.
Collaborator Contribution Dr Eric Nawrocki is the main developer of the Infernal software that Rfam relies on to identify non-coding RNAs. Dr Nawrocki helps us to use Infernal efficiently and assists with Infernal-related queries sent to the Rfam help desk.
Impact doi:10.1093/nar/gkx1038
Start Year 2014
 
Description University of Manchester (Sam Griffiths-Jones) 
Organisation University of Manchester
Country United Kingdom 
Sector Academic/University 
PI Contribution Rfam provided Prof Griffiths-Jones and the miRBase team with a set of curated microRNA families based on the data submitted by miRBase.
Collaborator Contribution Prof Griffiths-Jones and the miRBase team submitted microRNA data to Rfam in order to create new families or update the existing ones.
Impact New and updated microRNA Rfam families have resulted from this collaboration.
Start Year 2020
 
Title R2DT v1.1 
Description This is release v1.1 of R2DT, a framework for the visualisation of RNA secondary structure using templates. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact This tool allows for the visualization of RNA secondary structures in familiar, easy-to-read layouts. Unlike other software, it works for both large and small RNAs and produces a consistent and familiar diagram. It has been used in several publications, has been used in RNAcentral to visualize over 25 million RNAs, and has been integrated into a variety of other websites such as FlyBase. 
URL https://zenodo.org/record/4700588
 
Description 6th Meeting on Regulating with RNA in Bacteria and Archaea 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros presented a poster at the Meeting on Regulating with RNA in Bacteria and Archaea.
Year(s) Of Engagement Activity 2022
 
Description EBI Summerfest 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact We participated in EBI's Summerfest, where an estimated 30-50 members of the Wellcome Genome Campus community and the EMBL leadership came to explore the outreach activities available on campus. Afterward we received interest in bringing our outreach activities to broader audiences.
Year(s) Of Engagement Activity 2022
 
Description ELIXIR Europe tweet "ELIXIR CDR corona virus genomes MGnify" 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Tweet from the official ELIXIR Europe account highlighting the work carried out by PI Dr Rob Finn and his microbiome informatics team, which utilised the MGnify resource workflows to identify coronavirus genomes in ELIXIR Core Data Resources, such as the ENA (DOI 10.1093/bib/bbaa232).
Year(s) Of Engagement Activity 2020
URL https://twitter.com/ELIXIREurope/status/1323597075007840256
 
Description ELIXIR News "Identification of coronaviruses genomes in public datasets" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The ongoing SARS-CoV-2 pandemic highlighted the need to understand all aspects of coronavirus biology, including their prevalence and diversity in animal hosts and the environment. Given the pressing need for greater knowledge around this topic, researchers within the Microbiome Informatics Team (PI Dr Rob Finn) at EMBL- European Bioinformatics Institute (EMBL-EBI) repurposed existing MGnify infrastructure to generate a pipeline that detects and characterises coronaviruses from metavirome and metatranscriptomic datasets. This pipeline identified a complete SARS-CoV-2 genome from a human lung sample collected in Wuhan, China, at the start of the pandemic - demonstrating proof of concept (DOI 10.1093/bib/bbaa232).
Year(s) Of Engagement Activity 2020
URL https://elixir-europe.org/news/identification-coronaviruses-genomes-public-datasets
 
Description Meet the Scientist 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Two members of the Rfam team participated in Meet the scientist, where over 100 school children had the chance to interact with scientists and learn about how to enter a scientific field. A smaller group, around 20, had in person discussions with the team members.
Year(s) Of Engagement Activity 2022
 
Description Poster Presentation at RNA Society 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We presented a poster at the 2022 RNA Society conference. This is the leading conference in RNA science and we were able to reach a broad audience of RNA scientists.
Year(s) Of Engagement Activity 2022
 
Description Poster at Hidden Life of ncRNA (EMBL conference) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros-Palacios presented a virtual poster describing the latest developments in the Rfam database at Hidden Life of ncRNA, a major international conference.
Year(s) Of Engagement Activity 2020
 
Description Poster presentation at 6th Meeting on Regulating with RNA in Bacteria and Archaea 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This is a conference for researchers interested in bacteria, who differ from the RNA scientists we typically speak to. We were able to reach a different audience than we usually do. They were interested and excited to use Rfam in their work.
Year(s) Of Engagement Activity 2022
 
Description Presentation at computation approaches to RNA structure and function 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The latest developments in Rfam were presented to a group of expert users. Additionally, we solicited feedback on the future direction for Rfam and which projects would be most impactful.
Year(s) Of Engagement Activity 2022
 
Description Presentation at the International Society of Biocuration meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros presented a poster on Rfam and the process of curating secondary structures of RNAs with 3D information.
Year(s) Of Engagement Activity 2021
 
Description Presentation on RNAcentral and Rfam as Resources for exploring RNA 3D structure 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Blake Sweeney gave a talk on using RNAcentral and Rfam to explore 3D structures.
Year(s) Of Engagement Activity 2021
 
Description Public outreach at a Girlguiding event in Ely College 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Ioanna Kalvari and Anton Petrov participated in a County STEM Day organised by Girlguiding Cambridgeshire East. The event reached ~100 Year 5-8 students who learned about Rfam and participated in the RNA Scanner activity, developed specifically to explain what Rfam is and what RNA families are.
Year(s) Of Engagement Activity 2020
 
Description Rfam poster at RNA UK 2020 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Ioanna Kalvari presented a poster about Rfam at the RNA UK meeting, increasing the awareness about the resource.
Year(s) Of Engagement Activity 2020
 
Description Rfam webinar 2021 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros, Sam Griffiths-Jones, Eric Nawrocki, and Anton Petrov ran a webinar introducing Rfam and Rfam families.
Year(s) Of Engagement Activity 2021
 
Description Structural bioinformatics training course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros delivered training on how to use Rfam to an international audience.
Year(s) Of Engagement Activity 2021