Rfam: Towards a sustainable resource for understanding the genomic functional ncRNA repertoire
Lead Research Organisation:
European Bioinformatics Institute
Department Name: Sequence Database Group
Abstract
In molecular biology, the central dogma says that genes encoded in a genome code for RNA, which is then translated into the proteins carrying out the main processes of the cell. But RNA is not just an intermediate step between genes and protein. Instead, RNA is capable of performing a number of tasks that are essential for life. For example, the ribosome (the machine responsible for synthesizing proteins from RNA) is an RNA-based machine, and RNA plays important roles in regulating the levels of other genes. RNAs that perform such functions without being translated into protein are known as non-coding RNAs (ncRNAs).
RNA research has lagged behind protein research, in part because RNA is difficult to work with both experimentally and computationally. Compared with protein science, the field of RNA biology is poorly served with resources that can aid research. Rfam is one of the largest and most authoritative sources of ncRNA information, and provides a central portal covering a wide variety of ncRNA types. We use statistical models to group related non-coding RNAs into families. We then provide information on their function, as well as tools that other scientists can use to discover related non-coding RNAs in their samples of interest. A primary use of our database is to identify ncRNAs in DNA sequences. This allows scientists to map the positions of ncRNAs and study how ncRNAs have evolved between related organisms, giving clues to their function. We aim to facilitate this further by providing families of ncRNAs from organisms which have had their entire genome sequenced. These organisms are generally those which are of interest to scientists because of their role in disease (e.g. pathogenic bacteria), their economic importance (e.g. bread wheat, a major source of human nutrition), or because they occupy an important biological niche (e.g. humans). We will also provide researchers with tools and training to build their own RNA families, allowing them to study RNAs which are of particular interest to them.
Not only is it important to be able to identify an ncRNA, it is also important for us to tell our users what the function of that ncRNA is. To this end, we are improving the functional annotation of our RNA families by using structured language terms that are easily parseable by both humans and computers. This means that our large data sets can be mined quickly, allowing researchers to build up a picture of how ncRNAs interact with the rest of the cell's components and understand more about the roles ncRNAs play in biological systems.
All our information is freely available via the Rfam website and as a downloadable database. We also export our data to other resources, such as databases concerned with a specific organism, and more general RNA databases such as RNAcentral.
Technical Summary
This proposal concerns the Rfam database and associated web portal, which uses covariance models to describe RNA families and annotates these families with functional information. We will continue to create new families, and we will examine our coverage of the RNA sequence database RNAcentral to identify ncRNAs which are not covered in Rfam, using this information to direct new family building. We will also update and improve our functional annotation of ncRNAs by attaching Gene Ontology terms to families, and using software tools to automatically propagate our annotations to the Gene Ontology Consortium. This will improve functional annotation for ncRNAs and, because the annotations are exported to the Gene Ontology Consortium, they will be propagated to a wide range of resources, ensuring their maximum utility. To deal with the data deluge that risks hampering many bioinformatic resources, we will move to producing family alignments based on sequences from completed genomes only. This will result in smaller families which are more biologically relevant, as the absence of a match in related organisms will represent a true gene loss and not incomplete sequence data. We will produce new visualisation tools using technology such as BioJS to take advantage of this new information. To increase the sustainability of our resource, we will develop software tools and associated training materials to allow users to build their own covariance models, and submit them to us for propagation throughout the community.
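As an illustration only, the coverage analysis described in the summary could be sketched as follows. This is a minimal sketch, not the pipeline used by Rfam: the function name `uncovered_by_type`, the toy sequence identifiers and the RNA type labels are all hypothetical, and it assumes that each RNAcentral sequence has already been assigned an RNA type and searched against the existing Rfam families.

```python
from collections import Counter


def uncovered_by_type(sequences, rfam_hits):
    """Rank RNA types by the number of sequences with no Rfam family match.

    sequences: dict mapping a sequence identifier to its RNA type (hypothetical labels).
    rfam_hits: set of sequence identifiers that matched at least one Rfam family.
    Returns a list of (rna_type, count) pairs, most poorly covered type first.
    """
    missing = Counter(
        rna_type
        for seq_id, rna_type in sequences.items()
        if seq_id not in rfam_hits
    )
    return missing.most_common()


# Toy input; the identifiers and types are made up for illustration.
sequences = {
    "seq1": "tRNA",
    "seq2": "snoRNA",
    "seq3": "snoRNA",
    "seq4": "riboswitch",
    "seq5": "snoRNA",
}
rfam_hits = {"seq1", "seq3"}

ranking = uncovered_by_type(sequences, rfam_hits)
print(ranking)
```

In this toy example the RNA types with the most unmatched sequences appear first, which is one simple way such a gap list could be used to prioritise new family building.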
Planned Impact
Rfam supports researchers working across all BBSRC strategic priorities, primarily food, nutrition and health, and data-driven biology. It will be used extensively by the life sciences community, including bioinformaticians, wet-lab researchers and clinicians. The huge growth in data produced by new sequencing technologies means that it is now more important than ever that researchers have access to tools to help them interpret their data. Rfam is the only resource currently capable of identifying a wide range of ncRNA homologs in sequence data and therefore plays a key role in both data-driven and systems biology. Our move to genome-centric annotation means that Rfam will provide comprehensive annotation of ncRNAs in many organisms, which will be of great benefit to many model organism resources, such as PomBase and FlyBase, and to the more comprehensive Ensembl and Ensembl Genomes. ncRNA information is frequently missing from genome annotation; Rfam's data can ensure scientists have a more complete picture of the "parts list" involved in constructing each genome. As with Rfam, many of the resources benefiting from Rfam's data are based in the UK, thus contributing to the UK's international reputation as a leader in bioscience.
We are only recently beginning to understand the role ncRNAs play in health and disease. For example, microRNAs are deregulated in cancer, snoRNAs are silenced in Prader-Willi syndrome and plant microRNAs play important roles in immune responses against viruses. There are significant research efforts into RNA-based therapeutics, which are promising tools to improve health and welfare. Rfam is a crucial resource for such work, allowing similarities in RNAs between organisms to be studied and providing researchers with search tools to identify previously unknown ncRNA homologs.
A major aim of this proposal is to develop tools to allow researchers to create their own Rfam families, thus creating a set of community-based curators for Rfam. Historically, family building has been a specialised job performed by experienced Rfam curators; however, the software is maturing to a point where family creation by non-specialists is feasible. Thus, a major impact of our work will be the transfer of knowledge and skills to a wide range of RNA researchers, together with the provision of bioinformatic tools they can use to further their work. This approach also enables researchers to develop a more multidisciplinary approach to the understanding of RNA function.
Publications
Kalvari I
(2018)
Non-Coding RNA Analysis Using the Rfam Database.
in Current protocols in bioinformatics
Kalvari I
(2018)
Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.
in Nucleic acids research
Kalvari I
(2021)
Rfam 14: expanded coverage of metagenomic, viral and microRNA families.
in Nucleic acids research
Description | We improved the quality and quantity of Rfam data by building 568 new RNA families (23% growth) and bringing the total number of RNA families in Rfam to 3,016. The Rfam database is now more sustainable and ready to scale with the growth of the available genomic data due to the transition to annotating a non-redundant and regularly updated set of reference genomes. We made the data easier to access by adding to the website a new text search as well as creating a public MySQL database. We engaged with the scientific community at multiple scientific meetings and interacted with the general public at regional and local events, such as meetings with schools and science festivals. |
Exploitation Route | Our users use the data we freely provide to design new synthetic molecules, guide experiments and understand the evolution, structure and function of non-coding RNAs. Rfam is a critical part of the RNA bioinformatic toolkit and is widely used for finding non-coding RNAs in newly sequenced genomes. |
Sectors | Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Energy; Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology |
Description | The Rfam database has been developed since 2002. According to the SureChEMBL database, Rfam is mentioned in over 200 patents. For example, patent WO-2012069613-A1 uses Rfam in a new method for selecting a competent oocyte or a competent embryo by determining the expression level of specific microRNA species in a body fluid or in cumulus cells. |
First Year Of Impact | 2006 |
Sector | Healthcare; Pharmaceuticals and Medical Biotechnology |
Description | A comprehensive platform for the functional annotation of non-coding RNA genes and gene families |
Amount | £939,339 (GBP) |
Funding ID | 218302/Z/19/Z |
Organisation | Wellcome Trust |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 03/2020 |
End | 03/2021 |
Title | Rfam |
Description | The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). The families in Rfam break down into three broad functional classes: non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs. These functional RNAs often have a conserved secondary structure which may be better preserved than the RNA sequence. The CMs used to describe each family are a slightly more complicated relative of the profile hidden Markov models (HMMs) used by Pfam. CMs can simultaneously model RNA sequence and structure in an elegant and accurate fashion. |
Type Of Material | Database/Collection of data |
Provided To Others? | Yes |
Impact | Rfam has enabled facile annotation of genomes with a large variety of non-coding RNAs. |
URL | http://rfam.xfam.org/ |
Description | Bohdan Schneider |
Organisation | Academy of Sciences of the Czech Republic |
Country | Czech Republic |
Sector | Academic/University |
PI Contribution | We have provided an analysis of the size and scope of RNA alignment data available to researchers interested in producing an AlphaFold for RNA |
Collaborator Contribution | Bohdan has provided an analysis of how RNA structures have changed over time and the quality of existing structures. |
Impact | We have drafted a paper describing several of the challenges in creating an AlphaFold for RNA. We hope this paper will push the RNA science community to develop more data and better methods for structure prediction. |
Start Year | 2022 |
Description | Harvard University (Elena Rivas and Sean Eddy) |
Organisation | Harvard University |
Country | United States |
Sector | Academic/University |
PI Contribution | Rfam began using the R-scape software developed by Elena Rivas and Sean Eddy at Harvard University. The feedback provided by the Rfam team led to improvements in both R-scape and Infernal. |
Collaborator Contribution | Elena Rivas and Sean Eddy are involved in the development of Infernal, which is a key piece of software used by Rfam. They also developed R-scape, a new tool that allows Rfam families to be evaluated and improved. |
Impact | DOI:10.1093/nar/gkx1038 |
Description | Marta Szachniuk |
Organisation | Poznan University of Technology |
Country | Poland |
Sector | Academic/University |
PI Contribution | We have provided an analysis of the size and scope of RNA alignment data available to researchers interested in producing an AlphaFold for RNA |
Collaborator Contribution | Marta has provided an analysis of how RNA structure prediction has fared over time as well as general guidance on the paper. |
Impact | We have drafted a paper describing several of the challenges in creating an AlphaFold for RNA. We hope this paper will push the RNA science community to develop more data and better methods for structure prediction. |
Start Year | 2022 |
Description | NCBI - Eric Nawrocki |
Organisation | National Center for Biotechnology Information (NCBI) |
Country | United States |
Sector | Public |
PI Contribution | The Rfam team provides feedback about the Infernal software to Dr Eric Nawrocki, who develops Infernal and is based at NCBI. The feedback helps to improve Infernal and guide its development. |
Collaborator Contribution | Dr Eric Nawrocki is the main developer of the Infernal software that Rfam relies on to identify non-coding RNAs. Dr Nawrocki helps us to use Infernal efficiently and assists with Infernal-related queries sent to the Rfam help desk. |
Impact | doi:10.1093/nar/gkx1038 |
Start Year | 2014 |
Title | R2DT v1.1 |
Description | This is release v1.1 of R2DT, a framework for the visualisation of RNA secondary structure using templates. |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | This tool allows for the visualization of RNA secondary structures in familiar, easy-to-read layouts. Unlike other software, it works for both large and small RNAs and produces a consistent and familiar diagram. It has been used in several publications, has been used in RNAcentral to visualize over 25 million RNAs, and has been integrated into a variety of other websites such as FlyBase. |
URL | https://zenodo.org/record/4700588 |
Description | 6th Meeting on Regulating with RNA in Bacteria and Archaea |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Nancy Ontiveros presented a poster at the Meeting on Regulating with RNA in Bacteria and Archaea. |
Year(s) Of Engagement Activity | 2022 |
Description | EBI Summerfest |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | We participated in EBI's Summerfest, where an estimated 30-50 members of the Wellcome Genome Campus community and the EMBL leadership came to explore the outreach activities available on campus. Afterward we received interest in bringing our outreach activities to broader audiences. |
Year(s) Of Engagement Activity | 2022 |
Description | Meet the Scientist |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Schools |
Results and Impact | Two members of the Rfam team participated in Meet the Scientist, where over 100 school children had the chance to interact with scientists and learn about how to enter a scientific field. A smaller group of around 20 had in-person discussions with the team members. |
Year(s) Of Engagement Activity | 2022 |
Description | Meet the Scientist event organised by Social Mobility Foundation |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Schools |
Results and Impact | Dr Anton Petrov participated in a Meet the Scientist event organised by the Social Mobility Foundation. The activity reached ~20 school students who learned about non-coding RNA and careers in research. |
Year(s) Of Engagement Activity | 2018 |
Description | Presentation at the International Society of Biocuration meeting |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Nancy Ontiveros presented a poster on Rfam and the process of curating secondary structures of RNAs with 3D information. |
Year(s) Of Engagement Activity | 2021 |
Description | Presentation on RNAcentral and Rfam as Resources for exploring RNA 3D structure |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Blake Sweeney gave a talk on using RNAcentral and Rfam to explore RNA 3D structures. |
Year(s) Of Engagement Activity | 2021 |
Description | Public engagement at Curious Nature event |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Dr Anton Petrov participated in the Curious Nature event at the Wellcome Genome Campus, reaching ~25 school students (aged 11-12) who learned about RNA. The activities involved 3D-printed RNA structures and a bespoke model of tRNA secondary structure. |
Year(s) Of Engagement Activity | 2018 |
Description | Rfam poster at ISMB in Chicago |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | We presented a poster showcasing the latest Rfam developments at a major international conference in Chicago, USA. |
Year(s) Of Engagement Activity | 2018 |
Description | Rfam talk at RNA Society in Berkeley |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | We presented the latest progress of the Rfam project to the international audience at the RNA Society meeting in Berkeley, USA. |
Year(s) Of Engagement Activity | 2018 |
Description | Rfam webinar 2021 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Nancy Ontiveros, Sam Griffiths-Jones, Eric Nawrocki, and Anton Petrov ran a webinar introducing Rfam and Rfam families. |
Year(s) Of Engagement Activity | 2021 |
Description | Rfam workshop in Benasque |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | We held an Rfam workshop at the Benasque RNA meeting which brings together RNA experts from all over the world. The workshop helped us engage with a target audience of scientists who will be contributing data to Rfam. |
Year(s) Of Engagement Activity | 2018 |
Description | Structural bioinformatics training course |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Nancy Ontiveros delivered training on how to use Rfam to an international audience. |
Year(s) Of Engagement Activity | 2021 |