Rfam: Towards a sustainable resource for understanding the genomic functional ncRNA repertoire

Lead Research Organisation: European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

In molecular biology, the central dogma says that genes encoded in a genome are transcribed into RNA, which is then translated into the proteins that carry out the main processes of the cell. However, RNA is not just an intermediate step between genes and proteins. RNA also performs a number of tasks that are essential for life: for example, the ribosome (the machine responsible for synthesizing proteins from RNA) is an RNA-based machine, and RNA plays important roles in regulating the levels of other genes. RNAs that carry out such roles without being translated into protein are known as non-coding RNAs (ncRNAs).

RNA research has lagged behind that of proteins, in part due to the difficulties in working with RNA experimentally and computationally. The field of RNA biology is poorly served with resources that can aid research when compared with protein science. Rfam is one of the largest and most authoritative sources of ncRNA information, and provides a central portal covering a wide variety of ncRNA types. We use statistical models to group related non-coding RNAs into families. We then provide information on their function, as well as tools which other scientists can use to discover related non-coding RNAs in their samples of interest. A primary use of our database is to identify ncRNAs in DNA sequences. This allows scientists to map the positions of ncRNAs and study how ncRNAs have evolved between related organisms, giving clues to their function. We aim to facilitate this further by providing families of ncRNAs from organisms which have had their entire genome sequenced. These organisms are generally those which are of interest to scientists because of their role in disease (e.g. pathogenic bacteria), their economic importance (e.g. bread wheat, a major source of human nutrition), or because they occupy an important biological niche (e.g. humans). We will also provide researchers with tools and training to build their own RNA families, allowing them to study RNAs which are of particular interest to them.

Not only is it important to be able to identify an ncRNA, it is also important for us to tell our users what the function of that ncRNA is. To this end, we are improving the functional annotation of our RNA families by using structured language terms that are easily parseable by both humans and computers. This means that our large data sets can be mined quickly, allowing researchers to build up a picture of how ncRNAs interact with the rest of the cell's components and understand more about the roles ncRNAs play in biological systems.

All our information is freely available via the Rfam website and as a downloadable database. We also export our data to other resources, such as databases concerned with a specific organism, and more general RNA databases such as RNAcentral.

Technical Summary

This proposal concerns the Rfam database and associated web portal, which uses covariance models to describe RNA families, and annotates these families with functional information. We will continue to create new families and examine our coverage of the RNA sequence database, RNAcentral, to identify ncRNAs which are not covered in Rfam, and use this information to direct new family building. We will also update and improve our functional annotation of ncRNAs by attaching Gene Ontology terms to families, and using software tools to automatically propagate our annotations to the Gene Ontology Consortium. This will improve functional annotation for ncRNAs, and exporting the annotations to the Gene Ontology Consortium will propagate them to a wide range of resources, ensuring their maximum utility. To deal with the data deluge that risks hampering many bioinformatic resources, we will move to producing family alignments based on sequences from completed genomes only. This will result in smaller families which are more biologically relevant, as the absence of a match in related organisms will represent a true gene loss and not incomplete sequence data. We will produce new visualisation tools using technology such as BioJS to take advantage of this new information. To increase the sustainability of our resource, we will develop software tools and associated training materials to allow users to build their own covariance models, and submit them to us for propagation throughout the community.
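
The reasoning behind restricting alignments to completed genomes can be illustrated with a minimal sketch. It is not part of the Rfam pipeline: the function name, genome names, family names and hits are all hypothetical, and the labels only show why a missing hit is informative in a complete genome but inconclusive in a partial one.

```python
# Toy illustration: how to read a missing family hit depends on genome completeness.
# All genome names, family names and hits below are hypothetical.

def classify_absences(hits, complete, family):
    """Return {genome: label} for one family.

    hits:     dict mapping genome name -> set of family names found in it
    complete: dict mapping genome name -> True if the genome is fully sequenced
    """
    labels = {}
    for genome, families in hits.items():
        if family in families:
            labels[genome] = "present"
        elif complete[genome]:
            # No hit in a complete genome: consistent with a real gene loss.
            labels[genome] = "absent (candidate gene loss)"
        else:
            # No hit in a partial assembly: the gene may simply be unsequenced.
            labels[genome] = "absent (inconclusive)"
    return labels


hits = {
    "genome_A": {"family_X", "family_Y"},
    "genome_B": {"family_Y"},
    "genome_C": set(),
}
complete = {"genome_A": True, "genome_B": True, "genome_C": False}

for genome, label in classify_absences(hits, complete, "family_X").items():
    print(genome, label)
```

If every genome in the set is complete, the inconclusive category disappears, which is the benefit the paragraph above describes.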

Planned Impact

Rfam is a resource that contributes to researchers involved in all BBSRC strategic priorities, but primarily food, nutrition and health, and data-driven biology. It will be used extensively by the life sciences community, including bioinformaticians, wet-lab researchers and clinicians. The huge growth in data produced by new sequencing technologies means that it is now more important than ever that researchers have access to tools to help them interpret their data. Rfam is the only resource currently capable of identifying a wide range of ncRNA homologs in sequence data and therefore plays a key role in both data-driven and systems biology. Our move to genome-centric annotation means that Rfam will provide comprehensive annotation of ncRNAs in many organisms, which will be of great benefit to many model organism resources, such as PomBase and FlyBase, and to the more comprehensive Ensembl and Ensembl Genomes. ncRNA information is frequently missing from genome annotation; Rfam's data can ensure scientists have a more complete picture of the "parts list" involved in constructing each genome. Like Rfam itself, many of the resources benefiting from Rfam's data are based in the UK, thus contributing to the UK's international reputation as a leader in bioscience.

We are only recently beginning to understand the role ncRNAs play in health and disease. For example, microRNAs are deregulated in cancer, snoRNAs are silenced in Prader-Willi syndrome and plant microRNAs play important roles in immune responses against viruses. There are significant research efforts into RNA-based therapeutics, which are promising tools to improve health and welfare. Rfam is a crucial resource for such work, allowing similarities in RNAs between organisms to be studied and providing researchers with search tools to identify previously unknown ncRNA homologs.

A major aim of this proposal is to develop tools that allow researchers to create their own Rfam families, thus creating a set of community-based curators for Rfam. Historically, family building has been a specialised job performed by experienced Rfam curators; however, the software is maturing to a point where family creation by non-specialists is feasible. Thus, a major impact of our work will be the transfer of knowledge and skills to a wide range of RNA researchers, together with bioinformatic tools they can use to further their work. This approach also enables researchers to develop a more multidisciplinary approach to the understanding of RNA function.

Publications

Kalvari I (2018) Non-Coding RNA Analysis Using the Rfam Database. in Current protocols in bioinformatics

 
Description We improved the quality and quantity of Rfam data by building 568 new RNA families (23% growth) and bringing the total number of RNA families in Rfam to 3,016. The Rfam database is now more sustainable and ready to scale with the growth of the available genomic data due to the transition to annotating a non-redundant and regularly updated set of reference genomes. We made the data easier to access by adding to the website a new text search as well as creating a public MySQL database. We engaged with the scientific community at multiple scientific meetings and interacted with the general public at regional and local events, such as meetings with schools and science festivals.
Exploitation Route Our users use the data we freely provide to design new synthetic molecules, guide experiments and understand the evolution, structure and function of non-coding RNAs. Rfam is a critical part of the RNA bioinformatic toolkit and is widely used for finding non-coding RNAs in newly sequenced genomes.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Energy; Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
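
The family counts in the description above can be cross-checked with a short calculation. This is only a sketch, assuming the 23% growth is measured relative to the number of families that existed before the 568 new ones were added; the variable names are illustrative.

```python
# Figures taken from the description above.
new_families = 568
total_after = 3016

# Families that existed before the new ones were added.
total_before = total_after - new_families

# Growth relative to the earlier total (an assumption about how the 23% was defined).
growth_pct = 100 * new_families / total_before

print(total_before)
print(round(growth_pct, 1))
```

Under that assumption the earlier total is 2,448 families and the growth rounds to 23%, consistent with the reported figure.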

 
Description The Rfam database has been developed since 2002. According to the SureChEMBL database, Rfam is mentioned in over 200 patents. For example, patent WO-2012069613-A1 uses Rfam in a new method for selecting a competent oocyte or a competent embryo by determining the expression level of specific microRNA species in a body fluid or in cumulus cells.
First Year Of Impact 2006
Sector Healthcare,Pharmaceuticals and Medical Biotechnology
 
Description A comprehensive platform for the functional annotation of non-coding RNA genes and gene families
Amount £939,339 (GBP)
Funding ID 218302/Z/19/Z 
Organisation Wellcome Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 03/2020 
End 03/2024
 
Title Rfam 
Description The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). The families in Rfam break down into three broad functional classes: non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs. Typically, these functional RNAs have a conserved secondary structure which may be better preserved than the RNA sequence. The CMs used to describe each family are a slightly more complicated relative of the profile hidden Markov models (HMMs) used by Pfam. CMs can simultaneously model RNA sequence and structure in an elegant and accurate fashion. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact Rfam has enabled facile annotation of genomes with a large variety of non-coding RNAs. 
URL http://rfam.xfam.org/
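
The statement above that secondary structure may be better preserved than sequence can be made concrete with a toy example. The sketch below is not a covariance model and does not use Infernal; the alignment, the dot-bracket structure and all function names are invented for illustration. It compares per-column sequence identity with the fraction of sequences that can still form a base pair (Watson-Crick or G-U wobble) at each paired position.

```python
# Toy alignment and consensus structure; both are invented for illustration.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(structure):
    """Return (i, j) column pairs from a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.append((stack.pop(), j))
    return pairs


def column_identity(alignment, col):
    """Fraction of sequences sharing the most common residue in a column."""
    residues = [seq[col] for seq in alignment]
    return max(residues.count(r) for r in set(residues)) / len(residues)


def pair_conservation(alignment, i, j):
    """Fraction of sequences whose residues at columns i and j can base-pair."""
    return sum((seq[i], seq[j]) in CANONICAL for seq in alignment) / len(alignment)


structure = "(((....)))"
alignment = [
    "GCAUUCGUGC",
    "CGGAUCGCCG",
    "AUCUUCGGAU",
    "UAGAUCGCUA",
]

for i, j in paired_columns(structure):
    print(i, j,
          column_identity(alignment, i),
          column_identity(alignment, j),
          pair_conservation(alignment, i, j))
```

In this toy data no paired column has sequence identity above 0.5, yet every sequence can still form all three base pairs. Compensatory changes of this kind are the signal that a model of both sequence and structure can capture and a sequence-only profile cannot.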
 
Description Bohdan Schneider 
Organisation Academy of Sciences of the Czech Republic
Country Czech Republic 
Sector Academic/University 
PI Contribution We have provided an analysis of the size and scope of RNA alignment data available to researchers interested in producing an AlphaFold for RNA.
Collaborator Contribution Bohdan has provided an analysis of how RNA structures have changed over time and the quality of existing structures.
Impact We have drafted a paper describing several of the issues facing creating an AlphaFold for RNA. We hope this paper will push the RNA science community to develop more data and better methods for structure prediction.
Start Year 2022
 
Description Harvard University (Elena Rivas and Sean Eddy) 
Organisation Harvard University
Country United States 
Sector Academic/University 
PI Contribution Rfam began using the R-scape software developed by Elena Rivas and Sean Eddy at Harvard University. The feedback provided by the Rfam team led to improvements in both R-scape and Infernal.
Collaborator Contribution Elena Rivas and Sean Eddy are involved in the development of Infernal, which is a key piece of software used by Rfam. They also developed R-scape, a new tool for evaluating and improving Rfam families.
Impact DOI:10.1093/nar/gkx1038
 
Description Marta Szachniuk 
Organisation Poznan University of Technology
Country Poland 
Sector Academic/University 
PI Contribution We have provided an analysis of the size and scope of RNA alignment data available to researchers interested in producing an AlphaFold for RNA.
Collaborator Contribution Marta has provided an analysis of how RNA structure prediction has fared over time as well as general guidance on the paper.
Impact We have drafted a paper describing several of the issues facing creating an AlphaFold for RNA. We hope this paper will push the RNA science community to develop more data and better methods for structure prediction.
Start Year 2022
 
Description NCBI - Eric Nawrocki 
Organisation National Center for Biotechnology Information (NCBI)
Country United States 
Sector Public 
PI Contribution The Rfam team provides feedback about the Infernal software to Dr Eric Nawrocki, who develops Infernal and is based at NCBI. The feedback helps to improve Infernal and guide its development.
Collaborator Contribution Dr Eric Nawrocki is the main developer of the Infernal software that Rfam relies on to identify non-coding RNAs. Dr Nawrocki helps us to use Infernal efficiently and assists with Infernal-related queries sent to the Rfam help desk.
Impact doi:10.1093/nar/gkx1038
Start Year 2014
 
Title R2DT v1.1 
Description This is release v1.1 of R2DT, a framework for the visualisation of RNA secondary structure using templates. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact This tool allows RNA secondary structures to be visualised in familiar, easy-to-read layouts. Unlike other software, it works for both large and small RNAs and produces a consistent and familiar diagram. It has been used in several publications, has been used in RNAcentral to visualise over 25 million RNAs, and has been integrated into a variety of other websites such as FlyBase. 
URL https://zenodo.org/record/4700588
 
Description 6th Meeting on Regulating with RNA in Bacteria and Archaea 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros presented a poster at the Meeting on Regulating with RNA in Bacteria and Archaea.
Year(s) Of Engagement Activity 2022
 
Description EBI Summerfest 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact We participated in EBI's Summerfest, where an estimated 30-50 members of the Wellcome Genome Campus community and the EMBL leadership came to explore the outreach activities available on campus. Afterward we received interest in bringing our outreach activities to broader audiences.
Year(s) Of Engagement Activity 2022
 
Description Meet the Scientist 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Two members of the Rfam team participated in Meet the Scientist, where over 100 school children had the chance to interact with scientists and learn about how to enter a scientific field. A smaller group of around 20 had in-person discussions with the team members.
Year(s) Of Engagement Activity 2022
 
Description Meet the Scientist event organised by Social Mobility Foundation 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Dr Anton Petrov participated in a Meet the Scientist event organised by the Social Mobility Foundation. The activity reached ~20 school students who learned about non-coding RNA and careers in research.
Year(s) Of Engagement Activity 2018
 
Description Presentation at the International Society of Biocuration meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros presented a poster on Rfam and the process of curating secondary structures of RNAs with 3D information.
Year(s) Of Engagement Activity 2021
 
Description Presentation on RNAcentral and Rfam as Resources for exploring RNA 3D structure 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Blake Sweeney gave a talk on using RNAcentral and Rfam to explore 3D structures.
Year(s) Of Engagement Activity 2021
 
Description Public engagement at Curious Nature event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Dr Anton Petrov participated in the Curious Nature event at Wellcome Genome Campus, reaching ~25 school students (11-12 years old) who learned about RNA. The activities involved 3D-printed RNA structures and a bespoke model of tRNA secondary structure.
Year(s) Of Engagement Activity 2018
 
Description Rfam poster at ISMB in Chicago 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We presented a poster showcasing the latest Rfam developments at a major international conference in Chicago, USA.
Year(s) Of Engagement Activity 2018
 
Description Rfam talk at RNA Society in Berkeley 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We presented the latest progress of the Rfam project to an international audience at the RNA Society meeting in Berkeley, USA.
Year(s) Of Engagement Activity 2018
 
Description Rfam webinar 2021 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros, Sam Griffiths-Jones, Eric Nawrocki, and Anton Petrov ran a webinar introducing Rfam and Rfam families.
Year(s) Of Engagement Activity 2021
 
Description Rfam workshop in Benasque 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We held an Rfam workshop at the Benasque RNA meeting which brings together RNA experts from all over the world. The workshop helped us engage with a target audience of scientists who will be contributing data to Rfam.
Year(s) Of Engagement Activity 2018
 
Description Structural bioinformatics training course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nancy Ontiveros delivered training on how to use Rfam to an international audience.
Year(s) Of Engagement Activity 2021