Rfam: Towards a sustainable resource for understanding the genomic functional ncRNA repertoire

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

In molecular biology, the central dogma says that genes encoded in a genome are transcribed into RNA, which is then translated into the proteins that carry out the main processes of the cell. However, RNA is not just an intermediate step between genes and proteins. RNA itself performs a number of tasks that are essential for life: for example, the ribosome (the machine responsible for synthesizing proteins from RNA) is an RNA-based machine, and RNA plays important roles in regulating the levels of other genes. RNAs that function in the cell without being translated into protein are known as non-coding RNAs (ncRNAs).

RNA research has lagged behind that of proteins, in part due to the difficulties of working with RNA experimentally and computationally. Compared with protein science, the field of RNA biology is poorly served with resources that can aid research. Rfam is one of the largest and most authoritative sources of ncRNA information, and provides a central portal covering a wide variety of ncRNA types. We use statistical models to group related non-coding RNAs into families. We then provide information on their function, as well as tools which other scientists can use to discover related non-coding RNAs in their samples of interest. A primary use of our database is to identify ncRNAs in DNA sequences. This allows scientists to map the positions of ncRNAs and study how ncRNAs have evolved between related organisms, giving clues to their function. We aim to facilitate this further by providing families of ncRNAs from organisms which have had their entire genome sequenced. These organisms are generally those which are of interest to scientists because of their role in disease (e.g. pathogenic bacteria), their economic importance (e.g. bread wheat, a major source of human nutrition), or because they occupy an important biological niche (e.g. humans). We will also provide researchers with tools and training to build their own RNA families, allowing them to study RNAs which are of particular interest to them.

Not only is it important to be able to identify an ncRNA, it is also important for us to tell our users what the function of that ncRNA is. To this end, we are improving the functional annotation of our RNA families by using structured language terms that are easily parseable by both humans and computers. This means that our large data sets can be mined quickly, allowing researchers to build up a picture of how ncRNAs interact with the rest of the cell's components and understand more about the roles ncRNAs play in biological systems.

All our information is freely available via the Rfam website and as a downloadable database. We also export our data to other resources, such as databases concerned with a specific organism, and more general RNA databases such as RNAcentral.

Technical Summary

This proposal concerns the Rfam database and associated web portal, which uses covariance models to describe RNA families, and annotates these families with functional information. We will continue to create new families and examine our coverage of the RNA sequence database, RNAcentral, to identify ncRNAs which are not covered in Rfam, and use this information to direct new family building. We will also update and improve our functional annotation of ncRNAs by attaching Gene Ontology terms to families, and using software tools to automatically propagate our annotations to the Gene Ontology Consortium. This will improve functional annotation for ncRNAs, and exporting the annotations to the Gene Ontology Consortium will propagate them to a wide range of resources, ensuring their maximum utility. To deal with the data deluge that risks hampering many bioinformatic resources, we will move to producing family alignments based on sequences from completed genomes only. This will result in smaller families which are more biologically relevant, as the absence of a match in related organisms will represent a true gene loss and not incomplete sequence data. We will produce new visualisation tools using technology such as BioJS to take advantage of this new information. To increase the sustainability of our resource, we will develop software tools and associated training materials to allow users to build their own covariance models, and submit them to us for propagation throughout the community.

Planned Impact

Rfam is a resource that contributes to researchers involved in all BBSRC strategic priorities, but primarily food, nutrition and health, and data-driven biology. It will be used extensively by the life sciences community, including bioinformaticians, wet-lab researchers and clinicians. The huge growth in data produced by new sequencing technologies means that it is now more important than ever that researchers have access to tools to help them interpret their data. Rfam is the only resource currently capable of identifying a wide range of ncRNA homologs in sequence data and therefore plays a key role in both data-driven and systems biology. Our move to genome-centric annotation means that Rfam will provide comprehensive annotation of ncRNAs in many organisms, which will be of great benefit to many model organism resources, such as PomBase and FlyBase, and to the more comprehensive Ensembl and Ensembl Genomes. ncRNA information is frequently missing from genome annotation; Rfam's data can ensure scientists have a more complete picture of the "parts list" involved in constructing each genome. Like Rfam itself, many of the resources benefiting from Rfam's data are based in the UK, thus contributing to the UK's international reputation as a leader in bioscience.

We are only beginning to understand the role ncRNAs play in health and disease. For example, microRNAs are deregulated in cancer, snoRNAs are silenced in Prader-Willi syndrome and plant microRNAs play important roles in immune responses against viruses. There are significant research efforts into RNA-based therapeutics, which are promising tools to improve health and welfare. Rfam is a crucial resource for such work, allowing similarities in RNAs between organisms to be studied and providing researchers with search tools to identify previously unknown ncRNA homologs.

A major aim of this proposal is to develop tools to allow researchers to create their own Rfam families, thus creating a set of community-based curators for Rfam. Historically, family building has been a specialised job performed by experienced Rfam curators; however, the software is maturing to a point where family creation by non-specialists is feasible. Thus, a major impact of our work will be the transfer of knowledge and skills to a wide range of RNA researchers, together with bioinformatic tools they can use to further their work. This approach also enables researchers to develop a more multidisciplinary approach to the understanding of RNA function.

Publications

Kalvari I (2018) Non-Coding RNA Analysis Using the Rfam Database. in Current protocols in bioinformatics

 
Description We improved the quality and quantity of Rfam data by building 568 new RNA families (23% growth) and bringing the total number of RNA families in Rfam to 3,016. The Rfam database is now more sustainable and ready to scale with the growth of the available genomic data due to the transition to annotating a non-redundant and regularly updated set of reference genomes. We made the data easier to access by adding to the website a new text search as well as creating a public MySQL database. We engaged with the scientific community at multiple scientific meetings and interacted with the general public at regional and local events, such as meetings with schools and science festivals.
Exploitation Route Users draw on the data we freely provide to design new synthetic molecules, guide experiments and understand the evolution, structure and function of non-coding RNAs. Rfam is a critical part of the RNA bioinformatic toolkit and is widely used for finding non-coding RNAs in newly sequenced genomes.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Energy; Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Description The Rfam database has been in development since 2002. According to the SureChEMBL database, Rfam is mentioned in over 200 patents. For example, patent WO-2012069613-A1 uses Rfam in a new method for selecting a competent oocyte or a competent embryo by determining the expression level of specific microRNA species in a body fluid or in cumulus cells.
First Year Of Impact 2006
Sector Healthcare,Pharmaceuticals and Medical Biotechnology
 
Title Rfam 
Description The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). The families in Rfam break down into three broad functional classes: non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs. These functional RNAs typically have a conserved secondary structure, which may be better preserved than the RNA sequence. The CMs used to describe each family are a slightly more complicated relative of the profile hidden Markov models (HMMs) used by Pfam. CMs can simultaneously model RNA sequence and structure in an elegant and accurate fashion. 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact Rfam has enabled facile annotation of genomes with a large variety of non-coding RNAs. 
URL http://rfam.xfam.org/
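
The idea in the description above, that secondary structure can be better preserved than sequence, can be illustrated with a minimal sketch. This is not how covariance models are implemented in Infernal; it only shows the kind of covariation signal that a CM can capture and a sequence-only profile cannot. The alignment, the structure string and all function names below are hypothetical toy examples, not Rfam data.

```python
from collections import Counter

# Hypothetical toy alignment of a hairpin: 4 base pairs closing a 4-nt loop.
ALIGNMENT = [
    "GGCAGAAAUGCC",
    "CAGUGAAAACUG",
    "AUCGGAAACGAU",
    "UCACGAAAGUGA",
]
# Assumed consensus secondary structure in dot-bracket notation.
STRUCTURE = "((((....))))"

# Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(structure):
    """Return sorted (i, j) index pairs for matching brackets."""
    stack, pairs = [], []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            pairs.append((stack.pop(), idx))
    return sorted(pairs)


def column_identity(alignment, col):
    """Fraction of sequences carrying the most common residue in a column."""
    counts = Counter(seq[col] for seq in alignment)
    return counts.most_common(1)[0][1] / len(alignment)


def pair_conservation(alignment, pairs):
    """Fraction of (sequence, pair) combinations that form a canonical pair."""
    hits = sum((seq[i], seq[j]) in CANONICAL for seq in alignment for i, j in pairs)
    return hits / (len(alignment) * len(pairs))


pairs = parse_pairs(STRUCTURE)
stem_columns = [col for pair in pairs for col in pair]
mean_stem_identity = sum(column_identity(ALIGNMENT, c) for c in stem_columns) / len(stem_columns)

print(f"mean sequence identity in stem columns: {mean_stem_identity:.2f}")
print(f"fraction of canonical base pairs:       {pair_conservation(ALIGNMENT, pairs):.2f}")
```

In this toy case the stem columns look poorly conserved when each column is scored on its own, yet every sequence can still form all four base pairs, because the paired positions change together.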
 
Description Harvard University (Elena Rivas and Sean Eddy) 
Organisation Harvard University
Country United States 
Sector Academic/University 
PI Contribution Rfam began using the R-scape software developed by Elena Rivas and Sean Eddy at Harvard University. The feedback provided by the Rfam team led to improvements in both R-scape and Infernal.
Collaborator Contribution Elena Rivas and Sean Eddy are involved in the development of Infernal, which is a key piece of software used by Rfam. They also developed R-scape, a new tool for evaluating and improving Rfam families.
Impact DOI:10.1093/nar/gkx1038
 
Description NCBI - Eric Nawrocki 
Organisation National Center for Biotechnology Information (NCBI)
Country United States 
Sector Public 
PI Contribution The Rfam team provides feedback about the Infernal software to Dr Eric Nawrocki, who develops Infernal and is based at NCBI. The feedback helps to improve Infernal and guide its development.
Collaborator Contribution Dr Eric Nawrocki is the main developer of the Infernal software that Rfam relies on to identify non-coding RNAs. Dr Nawrocki helps us to use Infernal efficiently and assists with Infernal-related queries sent to the Rfam help desk.
Impact doi:10.1093/nar/gkx1038
Start Year 2014
 
Description Meet the Scientist event organised by Social Mobility Foundation 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact Dr Anton Petrov participated in a Meet the Scientist event organised by the Social Mobility Foundation. The activity reached ~20 school students who learned about non-coding RNA and careers in research.
Year(s) Of Engagement Activity 2018
 
Description Public engagement at Curious Nature event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Dr Anton Petrov participated in the Curious Nature event at the Wellcome Genome Campus, reaching ~25 school students (11-12 years old) who learned about RNA. The activities involved 3D-printed RNA structures and a bespoke model of tRNA secondary structure.
Year(s) Of Engagement Activity 2018
 
Description Rfam poster at ISMB in Chicago 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We presented a poster showcasing the latest Rfam developments at a major international conference in Chicago, USA.
Year(s) Of Engagement Activity 2018
 
Description Rfam talk at RNA Society in Berkeley 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We presented the latest progress of the Rfam project to an international audience at the RNA Society meeting in Berkeley, USA.
Year(s) Of Engagement Activity 2018
 
Description Rfam workshop in Benasque 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We held an Rfam workshop at the Benasque RNA meeting which brings together RNA experts from all over the world. The workshop helped us engage with a target audience of scientists who will be contributing data to Rfam.
Year(s) Of Engagement Activity 2018