rfaRm: An R client-side interface to facilitate the analysis of the Rfam database of RNA families

rfaRm is an R package providing a client-side interface for the Rfam database of non-coding RNA and other structured RNA elements. The package facilitates the search of the Rfam database by keywords or sequences, as well as the retrieval of all available information about specific Rfam families, such as member sequences, multiple sequence alignments, secondary structures and covariance models. By providing such programmatic access to the Rfam database, rfaRm enables genomic workflows to incorporate information about non-coding RNA, whose potential cannot be fully exploited just through interactive access to the database. The features of rfaRm are demonstrated by using it to analyze the SARS-CoV-2 genome as an example case.


Introduction
The Rfam database [1] is a collection of families of non-coding RNA and other structured RNA elements. Each family is defined by a multiple sequence alignment of a representative set of family members (seed alignment), a consensus secondary structure, and a covariance model [2], which integrates both multiple sequence alignment information and secondary structure information, and are analogous to the hidden Markov models used in Pfam [3] to determine the consensus sequence of protein families.
The database can be used to identify non-coding RNA elements within a nucleotide sequence of interest by searching it against the Rfam library of covariance models with the Infernal software [4]. Additionally, the database can be used to browse existing RNA families via keyword-based searches or direct access with the accession number or ID of specific families. Different pieces of information can be retrieved for each RNA family, including a descriptive summary, secondary structure information and consensus sequence, amongst many others. All of these functionalities can be accessed through the Rfam web-based interface. However, even though a RESTful API for the database is available, no client-side interface has been implemented so far to allow automated access to it, which makes any medium-or largescale analysis a laborious and time-consuming process. Here, we present a client-side interface to the Rfam database, enabling its programmatic access and therefore expanding the scope of the genomic analysis that can be carried out with the information provided by the Rfam database. The language of choice was R. This choice is based on the large number of tools already available in the Bioconductor project for the analysis of high-throughput genomic data. The software presented here complements these tools and facilitates the integration of the data retrieved via rfaRm within existing genomic workflows [5].

Software features
rfaRm provides two types of functionalities: searches within the Rfam database, and retrieval of data associated to specific RNA families.
In its current version, rfaRm allows two types of searches within the Rfam database: by keyword, and by sequence. In a keyword search, the user can provide a keyword that will be matched against family descriptions and identifiers. Matching families are returned as a list of Rfam accession numbers. In a sequence search, the user submits an RNA sequence and the list of RNA families present in this sequence is returned. While the current implementation of the Rfam web server allows for queries on sequences of up to 10,000 nucleotides, rfaRm imposes no limit on the length of sequences to be analyzed. Instead, if a sequence longer than 10,000 nucleotides is provided as input, it is internally split into smaller, overlapping fragments that are then used to perform individual searches. Found hits are mapped back into the original sequence before being returned to the user.
Furthermore, rfaRm also provides a functionality analogous to the "clan competition" feature employed by the Rfam web server to ensure hit quality. If such functionality is enabled, groups (clans) of related Rfam families are defined. If two hits overlap by a user-defined length (by default, 50% of the length of the shortest hit), and they belong to the same clan, only the hit with best score is kept. rfaRm allows clan competition to be disabled, which in some cases might be preferrable to identify nested non-coding RNA hits.
After identifying a set of RNA families of interest, rfaRm allows to retrieve and plot different data about each family by providing their Rfam accession number or ID. The data that can be retrieved for each family include: a descriptive summary, the consensus sequence and secondary structure (in extended dot-bracket or WUSS notations) (Fig 1A), the covariance model, the seed alignment used to define it, several types of secondary structure plots (Fig 1B), the phylogenetic tree associated with the seed alignment (Fig 1C), the full list of sequence regions belonging to the family (including their GenBank accessions and starting and ending positions), and a list of entries of the PDB database with associated 3D structures for members of the family. Insightful plots of secondary structure and phylogenetic trees can be either directly displayed in R or saved into separate files with a desired format (including SVG to facilitate further edition).

Integration with other packages and software
In order to facilitate the analysis and further manipulation of the data retrieved from Rfam, we designed rfaRm to output all data into standard formats that can be directly read by other software and R packages. All plots can be written as character strings representing SVG objects in XML format (which can be manipulated with the rsvg [6] and magick [7] R packages), in addition to being saved in many commonly used image formats. Consensus sequence and secondary structure in the extended dot-bracket format can be saved to files directly readable by the R4RNA R package [8] (Fig 2). Covariance models are outputted in the Infernal format. Seed multiple sequence alignments can be stored as Biostrings Multiple Alignment R objects (one of the standard formats of the Bioconductor project) [9] or saved into FASTA-or Stockholm-formatted files. Finally, phylogenetic trees can be saved in the New Hampshire Extended format (NHX), which can be read by a variety of software such as the treeio R package [10].

Description of available functions
A short description and an example of usage for each of the functions available in rfaRm is presented here. Further details can be found in the manual and vignette of the package. In the examples presented here, the mitochondrial DNA of Ashbya gossypii is used (RefSeq accession number NC_005789.1). rfamTextSearchFamilyAccession. Purpose. Searches the Rfam database for entries containing a specified keyword in the family ID, summary or description. Arguments • query: string with the keyword to be searched in the Rfam database.

Example of usage
## Search Rfam families associated to the keyword "tRNA" rfamTextSearchFamilyAccession(rfamFamily = "tRNA") rfamSequenceSearch. Purpose. Identifies non-coding RNA in a sequence provided by the user. Arguments • sequence: string with an RNA sequence to be searched against the Rfam database. Should contain only standard RNA symbols (i.e., "A", "U", "G" and "C).
• fragmentsOverlap: when a sequence larger than 10000 nucleotides is provided, it is internally split into smaller fragments before using them to search the Rfam database. This argument controls the number of overlapping bases between consecutive fragments.
• clanCompetitionFilter: logical value indicating if results should be reduced through a clan competition filter, which removes overlapping hits if they belong to Rfam families of the same clan and have an overlap above a certain threshold.
• clanOverlapThreshold: number indicating the minimum overlap between two hits (as a fraction of the smallest hit) to remove the hit with the worst e-value if their families belong to the same Rfam clan.
• rfamFamilyID: string with the Rfam family ID to be converted to a family accession. rfamConsensusSecondaryStructure. Purpose. Retrieves the consensus secondary structure and sequence of the specified Rfam family. Arguments

Example of usage
• rfamFamily: string with the Rfam family accession or ID for which a descriptive summary should be retrieved.
• filename: optional string specifying the name of a file. If provided, the consensus secondary structure and sequence will be saved to the specified file.
• format: string indicating the notation to be used for the RNA secondary structure. It can be either "DB" (extended Dot-Bracket notation; default) or "WUSS" (Washington University Secondary Structure notation).

Example of usage
## Obtain the consensus secondary structure and sequence ## for the Rfam family with accession RF00005 (tRNA) ## in the extended Dot-Bracket format rfamConsensusSecondaryStructure(rfamFamily = "RF00005", format = "DB") rfamSecondaryStructurePlot. Purpose. Plots a diagram of the specified type of the secondary structure of an Rfam family. Arguments • rfamFamily: string with the Rfam family accession or ID for which the secondary structure should be plotted.
• filename: optional string specifying the name of a file. If provided, the plot will be saved to the specified file.

Example of usage
## Generate a diagram of the secondary structure of ## the Rfam family with accession RF00005 (tRNA), colored ## by basepair conservation rfamSecondaryStructurePlot(rfamFamily = "RF00005", plotType = "fcbp") rfamSecondaryStructureXMLSVG. Purpose. Obtain an SVG file (in XML format) with a representation of the secondary structure of the specified Rfam family. Arguments • rfamFamily: string with the Rfam family accession or ID for which an SVG file the secondary structure should be plotted.
• filename: string specifying the path to which the SVG file should be saved.
Example of usage rfamSeedAlignment. Purpose. Retrieves the seed multiple alignment of the specified Rfam family. The seed alignment is used to determine the covariance model defining each Rfam family, and comprises only a subset of all the members of each family. Arguments • rfamFamily: string with the Rfam family accession or ID whose seed alignment should be retrieved.
• filename: optional string specifying a file to which the seed alignment will be saved if provided.

Example of usage
## Obtain the seed alignment of the Rfam family with ## accession RF00005 (tRNA) in the Stockholm format and ## save it to a file rfamSeedAlignment(rfamFamily = "RF00005", filename = "test.stk", format = "stockholm") rfamSeedTree. Purpose. Retrieves the phylogenetic tree of the seed multiple alignment associated to the specified Rfam family. The tree is retrieved in the NHX format (New Hampshire extended) and saved to a file. Arguments • rfamFamily: string with the Rfam family accession or ID for which the phylogenetic tree of the seed alignment should be retrieved.
• filename: string specifying a file to which the phylogenetic tree will be saved.

Example of usage
## Obtain the phylogenetic tree of seed alignment of the ## Rfam family with accession RF00005 (tRNA) and save it ## to a file rfamSeedTree(rfamFamily = "RF00005", filename = "test.nhx") rfamSeedTreeImage. Purpose. Plots the phylogenetic tree of the seed multiple alignment associated to the specified Rfam family. Arguments • rfamFamily: string with the Rfam family accession or ID for which the phylogenetic tree of the seed alignment should be plotted.
• filename: optional string specifying a file to which the plot of the phylogenetic tree will be saved if provided.
• label: string indicating the type of labels that should be added to the plot of the phylogenetic tree. Can be either "species" (for labeling with species names) or "acc" (for labeling with sequence accessions).

Example of usage
## Plot the phylogenetic tree of seed alignment of the ## Rfam family with accession RF00005 (tRNA) labelled with ## species names rfamSeedTreeImage(rfamFamily = "RF00005", label = "species") rfamCovarianceModel. Purpose. Retrieves the covariance model of the specified Rfam family (generated with the Infernal software). Arguments • rfamFamily: string with the Rfam family accession or ID for which the covariance model should be retrieved.
• filename: string specifying a file to which the covariance model will be saved.

rfamSequenceRegions.
Purpose. Retrieves all sequence regions encoding an RNA assigned to be a member of the specified Rfam family. Arguments • rfamFamily: string with the Rfam family accession or ID for which the member sequence regions should be retrieved.
• filename: optional string specifying the name of a file. If provided, the sequence regions will be saved to the specified file in tab-delimited format. rfamPDBMapping. Purpose. Retrieves entries of the PDB database with the experimentally solved 3D structure of members of the specified Rfam family, with correspondences between residues of the PDB structure and positions in the covariance model of the Rfam family. Arguments

Example of usage
• rfamFamily: string with the Rfam family accession or ID for which the matching PDB entries should be retrieved.
• filename: optional string specifying the name of a file. If provided, the matching PDB entries will be saved to the specified file in tab-delimited format.

Case study: Analysis of the SARS-CoV-2 genome
In order to demonstrate the functionalities of our package, we used rfaRm to analyze the reference genome of SARS-CoV-2 (RefSeq accession number NC_045512.2). For this, we first identified the non-coding RNA elements present in the SARS-CoV-2 genome. We then extracted and plotted information concerning the RNA families found to be present in the genome and showed how such information can be further processed with other Bioconductor packages.
Even though the genome is 29,903 bases long, such a sequence can be directly processed by rfaRm thanks to the internal splitting into fragments of 10,000 bases. We chose an overlap between consecutive fragments of 3,000 nucleotides to minimize the risk of missing hits present at the boundaries between fragments. We performed the search with and without clan competition to compare the results. An illustration of these queries implemented using R is provided hereafter: library(rfaRm) ## Read genome from FASTA file library(seqinr) sars_cov_2_genome <-unlist(read.fasta("sars_cov_2_genoma.fasta", seqtype = "DNA", as.string = TRUE)) ## Convert DNA string to RNA string sars_cov_2_RNA_genome <-gsub("t", "u", sars_cov_2_genome, ignore. case = TRUE) ## Search for Rfam families hits in the whole genome without clan competition sars_cov_2_Rfam_hits <-rfamSequenceSearch(sars_cov_2_RNA_genome, fragmentsOverlap = 3000, clanCompetitionFilter = FALSE) length(sars_cov_2_Rfam_hits) ## Search for Rfam families hits in the whole genome with clan competition sars_cov_2_Rfam_hits_2 <-rfamSequenceSearch(sars_cov_2_RNA_genome, fragmentsOverlap = 3000, clanCompetitionFilter = TRUE) length(sars_cov_2_Rfam_hits_2) ## Search for Rfam families hits in the whole genome without clan competition sars_cov_2_Rfam_hits_2 <-rfamSequenceSearch(sars_cov_2_RNA_genome, fragmentsOverlap = 3000, clanCompetitionFilter = FALSE) length(sars_cov_2_Rfam_hits_2) The search without clan competition returned a total of 7 RNA families in the SARS-CoV-2 genome (Table 1).
7. s2m: motif of unknown function found in the 3' untranslated region of astroviruses, coronaviruses and equine rhinoviruses.
On the other hand, the search with clan competition enabled only returned 3 hits, corresponding to Sarbecovirus-5UTR, Corona_FSE and Sarbecovirus-3UTR. bCoV-5UTR and bCoV-3UTR are discarded because they are essentially the same hit as the Sarbecovirus-5UTR and Sarbecovirus-3UTR respectively, with the only difference that the latter are versions found specifically in SARS beta-coronaviruses. The Corona_pk3 and s2m motifs were also omitted, since they are comprised within the larger Sarbecovirus-3UTR.
Knowing the IDs of the RNA families of interest, more detailed information can be easily retrieved through a set of query functions. As an example, we acquired detailed information about the Sarbecovirus-5UTR (with Rfam accession number RF03120).
First, we extracted the consensus sequence and secondary structure for the family: >>>)))).)))))).)))))))))). . .))))))). . . .." Next, we extracted the seed multiple sequence alignment of the family: ## Retrieve seed multiple sequence alignment and save it to a FASTA file rfamSeedAlignment("RF03120", filename = "RF03120_seedAlgn.fasta", format = "fasta") The information from the consensus secondary structure and the multiple seed alignment can be combined and visualized with the R4RNA package (Fig 2): ## Read consensus secondary structure and multiple alignment library(R4RNA) library(Biostrings) secondaryStructureTable <-readVienna("RF03120_cons.txt") seedAlignment <-readBStringSet("RF03120_seedAlgn.fasta") ## Make a helix plot of the consensus secondary structure and annotate it with information ## from the seed alignment plotCovariance(seedAlignment, secondaryStructureTable, grid = TRUE, line = TRUE, arrow = TRUE, legend = FALSE, cex = 3) It is also possible to generate plots of the secondary structure annotated with different types of information by specifying the "format" argument. Possible values are: • rscape: R-scape analysis of the seed alignment of the family • rscape-cyk: secondary structure predicted by R-scape from the seed alignment of the family For example, in order to generate a plot of the secondary structure of the Sarbecovirus-5UTR annotated with sequence conservation (Fig 1B), the following example code can be used: A plot of the phylogenetic tree of the seed alignment can be easily generated and labeled with species names or sequence accession numbers (Fig 1C): ## Generate a plot of the phylogenetic tree of Rfam family RF03120 labeled with species ## names and save it to a GIF file rfamSeedTreeImage("RF03120", filename = "RF03120_tree.gif", label = "species") Additionally, the phylogenetic tree can be retrieved in the New Hampshire Extended (NHX) format. The tree can then be read and processed with other software, such as the treeio R package: ## Save the phylogenetic tree of Rfam family RF03120 to a file in the NHX format rfamSeedTree("RF03120", filename = "RF03120_treeNHX.nhx") ## Read the tree as a treedata object library(treeio) treeioTree <-read.nhx("RF03120_treeNHX.nhx") ## Print a summary of the tree as.

Conclusion
The rfaRm R package provides an easy-to-use client-side interface to the Rfam database, enabling users to access it programmatically and therefore bypassing the limitations of interactive access through the web interface or the requirement to install and search the database locally. Programmatic access allows the identification of non-coding RNA across entire genomes. The package is designed to interoperate with existing software by returning data into formats directly readable by other tools and R packages and is available as part of the Bioconductor project, which facilitates its integration within workflows and pipelines for the analysis of genomic data. We believe the package will provide a useful resource for the community of RNA Bioinformatics, whose interest in the tool has been demonstrated by the considerable number of downloads of the package in spite of its still short lifetime.