Abstract
Despite the increasing number of 3D RNA structures in the Protein Data Bank, the majority of experimental RNA structures lack thorough functional annotations. As the significance of the functional roles played by noncoding RNAs becomes increasingly apparent, comprehensive annotation of RNA function is becoming a pressing concern. In response to this need, we have developed FURNA (Functions of RNAs), the first database for experimental RNA structures that aims to provide a comprehensive repository of high-quality functional annotations. These include Gene Ontology terms, Enzyme Commission numbers, ligand-binding sites, RNA families, protein-binding motifs, and cross-references to related databases. FURNA is available at https://seq2fun.dcmb.med.umich.edu/furna/ to enable quick discovery of RNA functions from their structures and sequences.
Citation: Zhang C, Freddolino L (2024) FURNA: A database for functional annotations of RNA structures. PLoS Biol 22(7): e3002476. https://doi.org/10.1371/journal.pbio.3002476
Academic Editor: Yunsun Nam, The University of Texas Southwestern Medical Center, UNITED STATES OF AMERICA
Received: November 28, 2023; Accepted: June 24, 2024; Published: July 29, 2024
Copyright: © 2024 Zhang, Freddolino. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All necessary data are provided for download in the FURNA database itself, which is the subject of the manuscript. Source code for establishing the FURNA database is available for download from https://github.com/freddolino-lab/furna/. The database itself is available for download from https://doi.org/10.5281/zenodo.11664059 and https://doi.org/10.5281/zenodo.11672037 (split into two files due to size limitations).
Funding: This work was directly supported by NIH R01 AI134678 (to LF). This work also used the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation (2138259, 2138286, 2138307, 2137603, and 2138296). The funders played no role in the study design, data collection, manuscript preparation, or decision to publish.
Competing interests: LF is a paid consultant and scientific advisory board member for Circnova, Inc. Circnova did not support the work in any way, and played no role in the study design, performance, or decision to publish.
Abbreviations: BP, Biological Process; CC, Cellular Component; CCD, Chemical Component Dictionary; EC, Enzyme Commission; GO, Gene Ontology; HDV, hepatitis delta virus; IL, incremental length; MAD, multiwavelength anomalous diffraction; MF, Molecular Function; PDB, Protein Data Bank; PWM, position weight matrix; TPP, thiamine pyrophosphate
Introduction
Advances in experimental RNA structure determination methods, particularly cryo-EM [1], have resulted in over 16,000 RNA chains being deposited into the Protein Data Bank (PDB) database [2]. Despite these strides, functional annotations of experimental RNA structures are largely absent from both the PDB and secondary databases. The PDB database includes only minimal annotations for RNA chains, such as their names, lengths, and species. Downstream databases like NAKB (formerly known as NDB) [3], DNATCO [4], and BGSU RNA [5] offer more annotations for base pairing, backbone torsions, and 3D motifs, yet annotation of the RNAs’ biological roles is still lacking. The MeRNA database [6] was probably the only database dedicated to function (in this case, metal ion binding sites) for experimental RNA structures, but it has long been defunct and covered only 256 RNAs. Meanwhile, recent studies have confirmed that many noncoding RNAs play vital roles in numerous biological events, particularly the regulation of gene expression [7], making RNA structures ideal targets for drug design [8]. This understanding underscores the importance of annotating RNA functions for the RNA biology community.
In contrast to the stark lack of a functional database for RNA structures, several databases to annotate protein functions have already been established. Databases such as PDBsum [9] and SIFTS [10] annotate protein chains in the PDB using Gene Ontology (GO) terms and Enzyme Commission (EC) numbers by mapping PDB chains to UniProt [11] proteins and InterPro [12] families. The PDBbind-CN [13], BindingDB [14], and Binding MOAD [15] databases collect protein–ligand interactions with known affinity data. The PDBe-KB [16] database features ligand-binding sites and posttranslational modification sites for all PDB proteins. The FireDB [17] and IBIS [18] databases curate protein–ligand interaction data in the PDB. Most recently, BioLiP2 [19] was developed as a comprehensive database covering almost all functional aspects of PDB proteins, including GO terms, EC numbers, ligand-binding sites, binding affinities, and cross-references to external databases.
Inspired by BioLiP2, we created FURNA, the first database in the field to offer comprehensive functional annotations for all RNA chains in the PDB database. Function annotations in FURNA include GO terms, EC numbers, Rfam [20] RNA families, RNA motifs for protein binding, species, literature, and cross references to external databases like PDBsum, NAKB, DNATCO, BGSU RNA, ChEMBL [21], DrugBank [22], ZINC [23], and RNAcentral [24]. Unlike protein–ligand interaction databases such as BioLiP, FireDB, and PDBbind-CN, which consider receptor–ligand contacts within each asymmetric unit, FURNA determines RNA–ligand interactions based on the biological assembly (i.e., biounit). This approach situates RNA–ligand interactions within the context of its quaternary structure (i.e., the RNA’s interaction with nucleic acid and protein partners). FURNA is available both as an open-source software package and as a browsable and searchable web service at https://seq2fun.dcmb.med.umich.edu/furna/.
Results
Overall statistics
At the time of writing this manuscript (October 2023), FURNA includes 16154 RNAs involved in 380680 ligand–RNA interactions; the online version of the database is updated on a weekly basis. Among these interactions, 186025, 138245, 31659, 24056, and 695 are interactions with metal ions, proteins, “regular” small molecule compounds excluding metal ions, other RNAs, and DNAs, respectively. Unlike BioLiP, FURNA does not attempt to exclude “biologically irrelevant” RNA-associated molecules from the database apart from removal of water molecules. This is because the biological relevance of ligands, especially metal ions, is less clearly defined for RNAs than for proteins. For example, calcium ions (Ca2+) are usually biologically irrelevant artifacts added for purification and/or crystallization purposes for proteins, but they are used to substitute for magnesium ions (Mg2+) that are critical to maintain the folding of RNAs in pre-catalytic states [25]. Similarly, while potassium ions (K+) are a simple buffer additive for many proteins, they are critical for the folding of many large RNAs, where potassium ions stabilize juxtaposition of nucleotides with large sequence separation by neutralizing charge density [26]. This is why a significant portion (48.9%) of ligand–RNA interactions in FURNA involve metal ions, among which 91.4% are magnesium ions, the most critical ion for RNA folding (Fig 1A). For small molecule ligands that are not metal ions, the 2 most frequent compounds are osmium hexammine (III) and cobalt hexammine (III) (S1 Table), which are crystallization additives used to determine the RNA structure by multiwavelength anomalous diffraction (MAD) phasing [27]. 
Although these compounds are metal-containing coordination complexes with positive charges, FURNA does not consider them to be metal ions because they are not natural ions, do not bind to natural binding sites of biologically relevant metal ions (S1 Fig), and do not consist solely of a single metal atom.
(A) Pie chart for the breakdown of ligand types in ligand–RNA interactions. (B) Venn diagram showing the numbers of RNAs with GO terms in the MF, BP, and/or CC aspects. (C) Top GO terms in RNAs and proteins collected by FURNA. Data for panels (A) and (B) are available under the section “Database statistics” from the home page of the FURNA website.
Among the 10561 RNAs with GO annotations in FURNA, 9288, 5311, and 7014 have annotations in Molecular Function (MF), Biological Process (BP), and Cellular Component (CC) aspects, respectively (Fig 1B). Out of the RNAs with MF terms, 58.0% are rRNAs (denoted by GO:0003735 “structural constituent of ribosome”) and 23.5% are tRNAs (indicated by GO:0030533 “triplet codon-amino acid adaptor activity”) (Fig 1C). This suggests that the distribution of RNA families among experimentally determined RNA structures is highly biased, consistent with MF annotations for RNA-binding proteins where GO:0019843 “rRNA binding” and GO:0000049 “tRNA binding” are among the most common GO terms. It is worth noting that, on average, the similarity of BP GO term annotations between an interacting RNA–protein pair is significantly higher than a random RNA-RNA pair or a random RNA–protein pair (S2 Fig, Mann–Whitney U test p-value <1E-300 and p-value = 1.0E-20, respectively). This suggests that RNA–protein interactions will be useful for RNA BP term prediction, similar to the utility of protein–protein interactions in protein function prediction [28–30].
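The GO annotation similarity between two molecules can be illustrated with a short sketch. It assumes that the similarity is the usual F1-score (harmonic mean of precision and recall) computed on the two sets of GO terms with parent terms already included; the function name go_f1 and the example term sets are hypothetical and are not taken from FURNA.

```python
def go_f1(terms_a: set[str], terms_b: set[str]) -> float:
    """F1-style overlap between two GO term sets (parent terms already included).

    Assumption: F1 = 2 * |A & B| / (|A| + |B|), i.e., the harmonic mean of
    precision and recall when one set is treated as the prediction of the other.
    """
    if not terms_a or not terms_b:
        return 0.0
    shared = len(terms_a & terms_b)
    return 2.0 * shared / (len(terms_a) + len(terms_b))


# Illustrative term sets only; they do not describe a real RNA-protein pair.
rna_terms = {"GO:0008150", "GO:0006412", "GO:0009058"}
protein_terms = {"GO:0008150", "GO:0006412", "GO:0006396"}
similarity = go_f1(rna_terms, protein_terms)  # 2 shared terms out of 3 + 3
```

Identical sets score 1 and disjoint sets score 0, so distributions of this score for interacting and random pairs can be compared directly.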
Web interface
The FURNA website provides 3 primary interfaces: SEARCH, BROWSE, and DOWNLOAD. The functionalities of these interfaces are elaborated upon below.
BROWSE
Each entry in the FURNA database represents 1 RNA chain in the PDB. For each of these RNA chains, the BROWSE interface displays the PDB ID and chain ID, resolution, EC number, GO terms, RNAcentral accessions, Rfam families, species, PubMed citations, and protein-binding motifs found in the ATtRACT [31] database (Fig 2). Additionally, if the RNA chain has a non-water ligand, the BROWSE interface also provides information on the ligand ID, the chain and residue sequence number of the ligand, the ligand-binding nucleotides on the query RNA, and the biological assembly from which the interaction with that ligand was retrieved.
Individual pages, accessed by clicking on the ligand in the last column of the BROWSE table, offer detailed structure and function information of each ligand–RNA interaction. These individual pages include the 3D structures of the RNA chain on its own, the full biological assembly, the RNA–ligand pair, and the local structure of the ligand binding site. These are displayed via 4 JSmol [32] applets, where the second applet uses the mmCIF-formatted structure while the remaining 3 applets use PDB format structures. The reason for this difference is that the full biological assembly may contain more than 99,999 atoms and/or 62 chains, making it impossible to be represented in PDB format [33]. On the other hand, an individual RNA–ligand pair can always be represented by a PDB file, which has a smaller file size than an mmCIF file [34] and is compatible with more bioinformatics tools [33]. Where available, GO terms for Molecular Function, Biological Process, and Cellular Component are presented in 3 directed acyclic graphs created using Graphviz [35], illustrating the relationships among different GO terms. The GO terms, as well as EC numbers, are also listed as tables. Additional information is also provided, including the RNA sequence and secondary structure, resolution, the structure’s name, species, ATtRACT motifs, PubMed citations, and crosslinks to other databases. In case of a small molecule ligand or an ion, the page exhibits the 2D diagram, ligand IDs (including PDB CCD ID, ChEMBL ID, DrugBank ID, and ZINC ID), the chemical formula, ligand name, and linear descriptions of the molecules (Fig 3). For RNA, DNA, or protein ligands, additional details such as the sequence, name, and species, as well as relevant function annotations such as GO terms and EC numbers, are provided when available (S3 Fig). 
In addition to webpages for individual ligand–RNA interactions, FURNA also provides individual webpages for each RNA chain (S4 Fig), which can be viewed by clicking on the first column of the summary table at the BROWSE interface (Fig 2). In addition to showing the 3D structure of the RNA chain on its own and the full biological assembly, it also lists the sequence, secondary structure, resolution, name, species, ATtRACT motifs, PubMed citations, crosslinks to other databases and, whenever available, associated GO terms and EC numbers. If the RNA is a ribozyme with known catalytic active site(s), the active site nucleotides are listed as a table and highlighted in the structure applet (S4 Fig). All ligands for the subject RNA chain are tabulated under the “Interaction partners” section to provide links to view the individual ligand–RNA interaction pages (Fig 3).
SEARCH
The “SEARCH” interface provides 4 methods to explore FURNA: “Search by name,” “Quick sequence search (via BLAST),” “Sensitive sequence search (via Infernal),” and “Search by structure.” Firstly, users can query FURNA using PDB ID, PDB chain ID, ligand ID (as defined by the 3-letter code in the PDB database’s Chemical Component Dictionary), ligand name, RNAcentral accession, Rfam family, EC number, GO term, ATtRACT motif, taxonomy, PubMed ID, or any combination of these. Secondly, FURNA can employ NCBI BLAST+ to search its entries using RNA, DNA, or protein sequences through a local nonredundant sequence database where identical sequences are merged into the same entry. In the search results, both representative hits found in the nonredundant database and members from the same sequence clusters are displayed (Fig 4A). Thirdly, to address the low sensitivity of BLAST searches for nucleic acid sequences, FURNA offers an alternative, more sensitive RNA sequence search option using Infernal (see Materials and methods, Fig 4B and 4C). Lastly, users can search the FURNA database with the tertiary structure of a query RNA (in PDB format) using US-align (see Materials and methods).
DOWNLOAD
All data from FURNA can be downloaded in bulk through the “DOWNLOAD” interface. Functional annotations for each RNA chain and each ligand–RNA interaction are available in tab-separated tables. The FASTA sequences of RNAs, plus those of RNA-binding proteins and RNA-binding DNAs, are also provided. The coordinates of the RNAs and all non-water ligands are supplied in PDB format files. Furthermore, the link to the source code for database curation and website display is also located on this page.
Case study on TPP riboswitches
To illustrate FURNA’s utility in RNA function annotation, we conducted a case study involving the thiamine pyrophosphate (TPP)-binding riboswitches, also known as the THI element or Thi-box riboswitch. This well-known family of riboswitches binds to TPP to regulate the expression of its downstream gene [36,37]. In Escherichia coli, one such riboswitch is located upstream of the hydroxyethylthiazole kinase (thiM) coding sequence [38,39] (Fig 3 and S2 Table). We used the 200 nucleotides immediately upstream of the thiM open reading frame as the query sequence to search FURNA. Unsurprisingly, both BLAST (Fig 4A) and Infernal (Fig 4B) searches of the E. coli TPP riboswitch through FURNA return hits for many TPP riboswitches, including those from E. coli. Similar results can be obtained by searching the region upstream of the thiM gene of Siccibacter turicensis, which also belongs to the Enterobacteriaceae family (S2 Table).
Shown is the FURNA web page for the interaction between thiamine diphosphate and TPP riboswitch (PDB 2gdi chain X, https://seq2fun.dcmb.med.umich.edu/furna/pdb.cgi?pdbid=2gdi&chain=X&lig3=TPP&ligCha=X&ligIdx=1). The structures of the RNA with all interaction partners, the RNA with only the ligand of interest, and the ligand itself without the RNA can be downloaded through the links at the bottom of the “Interaction” section.
(A) Top 5 BLAST search hits. (B) Top 5 Infernal search hits.
Based on gene function and the general prevalence of Thi-box riboswitches, we suspected the presence of riboswitches at several locations in Bacillus subtilis, e.g., one situated upstream of the coding sequence of the HMP/thiamine-binding protein (ykoF) and the other situated upstream of the aminopyrimidine aminohydrolase (tenA, S2 Table). Indeed, the tenA riboswitch has been previously reported [40], whereas a riboswitch upstream of ykoF has not, to our knowledge, been previously reported in the literature, although its presence is indicated in the RNAcentral (https://rnacentral.org/rna/URS000005CA97) and Rfam (https://rfam.org/family/RF00059) databases. When using FURNA to perform a BLAST sequence search of the putative B. subtilis TPP riboswitches (i.e., the 200 nucleotides immediately upstream of the ykoF and tenA open reading frames), no hits are returned, including hits to the E. coli thiM riboswitch. This outcome is not unexpected considering E. coli and B. subtilis are gram-negative and gram-positive bacteria, respectively, and have evolved separately for billions of years. In contrast, a sensitive Infernal search using either of the potential B. subtilis TPP riboswitches does yield hits to other TPP riboswitches, including the E. coli thiM riboswitch (Fig 5). These findings highlight FURNA’s capability for function annotations of low-homology RNAs using its sensitive sequence search option, providing a unified interface for obtaining functional information on a new RNA of interest.
(A) Top 5 Infernal search hits for the ykoF riboswitch. (B) Top 5 Infernal search hits for the tenA riboswitch.
Discussion and conclusions
We introduce FURNA, the first comprehensive structure database for ligand–RNA interactions and RNA function annotations. Compared to existing RNA structure and function databases, FURNA stands out in several ways. Firstly, it is the only database to utilize standard function vocabularies (GO terms and EC numbers) for the annotation of RNA tertiary structures. Secondly, it outlines ligand–RNA interactions based on biological assembly, which enhances the investigational context of interactions within the complete RNA-containing complex. Thirdly, FURNA offers user-friendly database search capabilities at varying levels of sensitivity, ensuring its relevance in annotating even remote RNA homologs. Fourthly, its data curation code is modular and fully open source, thereby simplifying regular data updates and future development. These unique aspects of FURNA position it as a valuable resource for the biological community, aiding in summarizing known RNA biological functions, creating functional hypotheses for poorly characterized RNAs, and developing new algorithms for ligand-RNA docking, virtual screening, and structure-based RNA function annotation. Nonetheless, FURNA does present a challenge in its lack of a clear definition of the biological relevance of ligand–RNA interactions (i.e., distinguishing biologically meaningful ligands from crystallization buffer components), an issue we plan to address in our future work.
Materials and methods
Each entry in FURNA corresponds to 1 RNA chain in the PDB database. To this end, we first download the mmCIF files of all structures containing nucleic acid from the PDB database and split them into individual chains using a modified version of the BeEM tool [33]. An RNA chain is defined by possessing more ribonucleotides than deoxyribonucleotides and amino acids. RNA chains with 10 or more nucleotides become entries in FURNA, but oligo-ribonucleotide fragments with fewer than 10 nucleotides are only included as “ligands” if they bind to an RNA chain with 10 or more nucleotides. The curation of an RNA chain involves several steps: annotating GO terms and EC numbers, mapping RNA-protein binding motifs, extracting RNA–ligand interactions, and assigning RNA secondary structures.
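The chain selection rule described above can be sketched as follows. This is a minimal illustration, not the BeEM-based implementation: the function classify_chain, its labels, and the per-residue type strings are hypothetical, and the sketch assumes that ribonucleotides must outnumber deoxyribonucleotides and amino acids separately and that chain length counts all nucleotides.

```python
from collections import Counter

MIN_ENTRY_LENGTH = 10  # from the text: chains with >= 10 nucleotides become entries


def classify_chain(residue_types: list[str]) -> str:
    """Classify a chain from per-residue types ("rna", "dna", or "protein").

    Returns "entry" for an RNA chain long enough to be a FURNA entry,
    "ligand_candidate" for a shorter RNA chain (kept only if it binds an entry),
    and "not_rna" otherwise.
    """
    counts = Counter(residue_types)
    n_rna, n_dna, n_protein = counts["rna"], counts["dna"], counts["protein"]
    # Assumption: "more ribonucleotides than deoxyribonucleotides and amino acids"
    # is read as two separate comparisons.
    if not (n_rna > n_dna and n_rna > n_protein):
        return "not_rna"
    # Assumption: the length threshold counts every nucleotide in the chain.
    n_nucleotides = n_rna + n_dna
    return "entry" if n_nucleotides >= MIN_ENTRY_LENGTH else "ligand_candidate"


labels = [
    classify_chain(["rna"] * 12),                     # long RNA chain
    classify_chain(["rna"] * 6),                      # short oligoribonucleotide
    classify_chain(["protein"] * 50 + ["rna"] * 3),   # mostly protein
]
```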
Rfam family matching
Rfam families are matched to an RNA chain by 2 approaches. First, we use the Rfam-PDB mapping provided by the Rfam database (https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.pdb.gz). Second, if a PDB chain is not included in the mapping file, we search its RNA sequence against the most current version of the Rfam database (Rfam 14.10, with covariance models for 4170 families, https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz) using Infernal [41]. This Infernal search utilizes the parameters: cmsearch --cpu 4 -Z 549862.597050 --toponly, where the search space size parameter 549862.597050 is the same as that used by the Rfam database. Regardless of which approaches are used, Rfam families for an RNA chain are shown in ascending order of their E-values.
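As a rough illustration of the second approach, the sketch below assembles the cmsearch command line with the parameters quoted above and orders matched families by ascending E-value. The helper names, file paths, and hit values are hypothetical; the --tblout option is a standard Infernal flag added here for convenience, and whether FURNA uses it is an assumption.

```python
def build_cmsearch_command(cm_db: str, fasta: str, tblout: str) -> list[str]:
    """Assemble a cmsearch call using the parameters quoted in the text."""
    return [
        "cmsearch", "--cpu", "4",
        "-Z", "549862.597050",   # search space size, as used by Rfam
        "--toponly",             # search only the given strand
        "--tblout", tblout,      # assumption: tabular output for easy parsing
        cm_db, fasta,
    ]


def rank_families(hits: list[tuple[str, float]]) -> list[str]:
    """Keep the best E-value per family and sort families by ascending E-value."""
    best: dict[str, float] = {}
    for family, evalue in hits:
        if family not in best or evalue < best[family]:
            best[family] = evalue
    return sorted(best, key=best.get)


command = build_cmsearch_command("Rfam.cm", "chain.fasta", "chain.tblout")
# Hypothetical hits for one chain: (Rfam accession, E-value).
ranked = rank_families([("RF00059", 1e-5), ("RF00094", 2200.0), ("RF00059", 1e-20)])
```

The command list can be passed to subprocess.run on a machine where Infernal and the Rfam covariance models are installed.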
This workflow is not perfect, but reflects the best results that can be attained using automated procedures with currently available databases. As an example of a present false negative, PDB 1vc6 chain B is a full-length experimental structure of the hepatitis delta virus (HDV) ribozyme, but it is not matched to the Rfam family for HDV ribozyme (RF00094) by either of the 2 approaches mentioned above, because aligning the 1vc6 sequence to the covariance model of RF00094 results in a very high E-value (2,200). This is probably because an approximately 20-nucleotide-long fragment corresponding to the middle of the RF00094 covariance model is absent in the experimental structure (S5 Fig).
Catalytic site annotation
The nucleotides at the active site of ribozymes are annotated based on the Ribocentre [42] database, which to our knowledge is the only curated database for ribozyme active sites available as of this writing. Among all 21 types of ribozymes collected by Ribocentre, detailed active site information on the RNA 3D structure is available only for a subset of 15 types of ribozymes (S3 Table). For each of these 15 types of ribozymes, active site nucleotides are annotated for only 1 representative experimental structure. To extend the Ribocentre database annotation to all members of these 15 ribozyme types, we implemented a template-based approach using US-align [43]. Briefly, the representative experimental structure of a ribozyme type is used as the query by US-align to search through all FURNA RNA chains that share the same Rfam families as the target ribozyme type (S3 Table). FURNA RNAs with significant structure similarity (TM-score ≥ 0.45 [44]) to the query RNA are collected. Additionally, if the query ribozyme has <100 nucleotides, all FURNA RNAs without Rfam families are also searched by US-align to get similar FURNA RNAs with TM-score ≥ 0.45, as the covariance models of the Rfam families for short ribozymes may not match the FURNA RNA sequence with significant E-values as shown in the HDV ribozyme example mentioned in the previous paragraph. Nucleotides on the FURNA RNAs that are aligned to the active sites on the query ribozyme structure are marked as putative active site members.
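The template-based transfer step can be sketched as below. This is a simplified illustration that assumes the structure alignment has already been computed (for example by US-align) and is available as pairs of aligned positions; the function name and the input layout are hypothetical.

```python
TM_SCORE_CUTOFF = 0.45  # similarity threshold quoted in the text


def transfer_active_sites(
    aligned_pairs: list[tuple[int, int]],
    template_sites: list[int],
    tm_score: float,
) -> list[int]:
    """Map active site positions from a template ribozyme onto a target RNA.

    aligned_pairs holds (template_position, target_position) for aligned
    nucleotides. Template sites without an aligned partner are skipped.
    """
    if tm_score < TM_SCORE_CUTOFF:
        return []  # structures are too dissimilar to transfer annotations
    template_to_target = dict(aligned_pairs)
    return sorted(
        template_to_target[pos] for pos in template_sites if pos in template_to_target
    )


alignment = [(0, 2), (1, 3), (2, 4), (4, 6)]  # template position 3 is unaligned
putative_sites = transfer_active_sites(alignment, [1, 3, 4], tm_score=0.62)
```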
GO term and EC number annotation
We employ 2 complementary strategies to obtain GO terms for an RNA chain. First, we transfer the GO terms related to each Rfam family (http://current.geneontology.org/ontology/external2go/rfam2go) of the query RNA. Second, we map RNA chains to RNAcentral sequences based on the mapping file provided by RNAcentral (http://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/pdb.tsv). If the RNAcentral entry has GO terms in the Gene Ontology Annotation (GOA) project (http://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz), we also transfer these GO terms to the FURNA entry. We utilize Graphviz [35] to plot the directed acyclic graphs showcasing the relationships among an RNA’s GO terms (including their parent terms). For the subset of RNAs with annotated catalytic activities, we convert their GO terms to EC numbers using the EC2GO mapping (https://www.ebi.ac.uk/GOA/EC2GO). For RNA-binding proteins, their UniProt accessions, GO terms, and EC numbers are directly obtained through the SIFTS [10] database.
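Including parent terms amounts to walking up the GO graph from each directly annotated term. The sketch below shows one way to do this on a toy parent map; the function name and the three-term map are illustrative only, and a real run would load parent relations from the GO ontology file.

```python
def with_ancestors(terms: set[str], parents: dict[str, list[str]]) -> set[str]:
    """Return the given GO terms plus all terms reachable via parent links."""
    seen: set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, []))
    return seen


# Toy excerpt of the Molecular Function aspect (child -> parents).
toy_parents = {
    "GO:0003735": ["GO:0005198"],  # structural constituent of ribosome
    "GO:0005198": ["GO:0003674"],  # structural molecule activity
}
expanded = with_ancestors({"GO:0003735"}, toy_parents)
```

The expanded set is what would be drawn as nodes in the graph, with one edge per child-parent relation.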
RNA-protein binding motif mapping
To identify RNA motifs corresponding to known recognition sites for RNA-binding proteins, we download the position weight matrices (PWMs) for all 1,583 protein-binding motifs from the latest ATtRACT database (version 0.99β). These motifs and the query RNAs collected by FURNA are grouped by species. Here, we extract the species information of an RNA chain from the respective mmCIF file, specifically from records such as “gene_src_ncbi_taxonomy_id,” “ncbi_taxonomy_id,” “pdbx_gene_src_ncbi_taxonomy_id,” or “pdbx_ncbi_taxonomy_id.” For any species that has at least 1 ATtRACT motif and 1 FURNA RNA chain, we download its transcriptome from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/) to determine its background distribution of the 4 nucleotide types (A, C, G, and U). This background distribution is computed using the fasta-get-markov program of the MEME suite [45]. Subsequently, this background file is used by the FIMO program [46] of the MEME suite when it searches the motif PWMs against the FURNA RNAs with the parameters --norc --bfile, so that motifs are matched only against the given strand of each RNA using the species-specific background.
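To make the notion of a background distribution concrete, the sketch below computes single-nucleotide frequencies from a set of transcript sequences. It is a standalone illustration of a zero-order background, not a reimplementation of fasta-get-markov; the Markov order used by FURNA is not stated in the text, and the function name is hypothetical.

```python
from collections import Counter


def background_frequencies(sequences: list[str]) -> dict[str, float]:
    """Zero-order background: the fraction of A, C, G, and U over all sequences."""
    counts: Counter[str] = Counter()
    for seq in sequences:
        # Transcriptome FASTA files often use T; treat it as U for RNA.
        for base in seq.upper().replace("T", "U"):
            if base in "ACGU":
                counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/U characters found")
    return {base: counts[base] / total for base in "ACGU"}


background = background_frequencies(["ACGU", "AATT"])
```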
Ligand–RNA interaction extraction
For each query RNA included in the FURNA database, we gather its interaction partners from the mmCIF format biological assembly file (i.e., biounit) that contains the pertinent RNA chain. As an example, the asymmetric unit of PDB 1a9n (the spliceosomal U2B″-U2A′ protein complex bound to a fragment of U2 small nuclear RNA) contains 6 chains, comprising 4 protein chains (Chains A, B, C, and D) and 2 RNA chains (Chains Q and R). This PDB entry corresponds to 2 different biological assemblies: assembly 1 includes chains A, B, and Q; assembly 2 includes chains C, D, and R. Consequently, to extract ligand–RNA interactions for 1a9n chain R, we only consider assembly 2.
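The 1a9n example can be written out as a small lookup. The dictionary below restates the chain composition given in the text; the function names are hypothetical, and in practice the composition would be read from the biological assembly mmCIF files.

```python
def assemblies_for_chain(assemblies: dict[int, list[str]], chain_id: str) -> list[int]:
    """Return the IDs of the biological assemblies that contain the chain."""
    return [aid for aid, chains in assemblies.items() if chain_id in chains]


def candidate_partners(assemblies: dict[int, list[str]], aid: int, chain_id: str) -> list[str]:
    """Chains in the same assembly that may interact with the query chain."""
    return [chain for chain in assemblies[aid] if chain != chain_id]


# Assembly composition of PDB 1a9n as described in the text.
assemblies_1a9n = {1: ["A", "B", "Q"], 2: ["C", "D", "R"]}
selected = assemblies_for_chain(assemblies_1a9n, "R")
partners = candidate_partners(assemblies_1a9n, selected[0], "R")
```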
Starting from the biological assembly file selected for a query RNA, we employ a modified version of the BeEM program [33] to split it into different chains. For each chain, we further split the macromolecule part and the small molecule parts, where the former and latter are labeled by numerical values and a period (“.”), respectively, in the “label_seq_id” record of the mmCIF file. Next, we collect all non-water ligands from all chains in the mmCIF file, including small molecules and metal ions, proteins, DNAs, and RNAs (excluding the query RNA). For each query RNA-ligand pair, we calculate all inter-molecular atomic contacts, i.e., atom pairs within the sum of the van der Waals radii plus 0.5 Å, among non-hydrogen atoms. We label a nucleotide on the query RNA as a ligand-binding residue if it has 2 or more inter-molecular atomic contacts with a ligand. We group any collection of 2 or more ligand-binding residues for the same query RNA–ligand pair into a binding site. Ligands without a binding site are excluded.
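The contact criteria above can be sketched in a few lines. This is a minimal illustration, not the FURNA implementation: the van der Waals radii are Bondi values chosen here as an assumption (the radii table used by FURNA is not given in the text), the atom tuples are a hypothetical input layout, and a distance equal to the cutoff is counted as a contact.

```python
import math
from collections import defaultdict

# Assumption: Bondi van der Waals radii in angstroms for a few elements.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "P": 1.80, "MG": 1.73}


def binding_residues(rna_atoms, ligand_atoms, margin=0.5, min_contacts=2):
    """Return RNA residue numbers with at least min_contacts contacts to the ligand.

    rna_atoms: iterable of (residue_number, element, (x, y, z)).
    ligand_atoms: iterable of (element, (x, y, z)).
    """
    contacts = defaultdict(int)
    for residue, rna_element, rna_xyz in rna_atoms:
        if rna_element == "H":
            continue  # only non-hydrogen atoms are considered
        for lig_element, lig_xyz in ligand_atoms:
            if lig_element == "H":
                continue
            cutoff = VDW_RADII[rna_element] + VDW_RADII[lig_element] + margin
            if math.dist(rna_xyz, lig_xyz) <= cutoff:
                contacts[residue] += 1
    return sorted(r for r, n in contacts.items() if n >= min_contacts)


def has_binding_site(residues, min_residues=2):
    """A binding site needs at least two ligand-binding residues."""
    return len(residues) >= min_residues


# Toy coordinates for illustration; not taken from a real structure.
rna = [
    (10, "O", (0.0, 0.0, 0.0)), (10, "N", (1.0, 0.0, 0.0)),
    (11, "O", (0.0, 3.0, 0.0)), (11, "P", (0.0, 4.0, 0.0)),
    (12, "C", (10.0, 0.0, 0.0)),   # too far from the ligand
    (13, "O", (0.0, 1.5, 3.0)),    # only one contact, so not a binding residue
]
ligand = [("MG", (0.0, 1.5, 0.0))]
residues = binding_residues(rna, ligand)
```

A ligand for which has_binding_site returns False would be dropped, mirroring the exclusion rule stated above.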
For a small molecule ligand, we extract the name, synonyms, chemical formula, and linear descriptions (including SMILES, InChI, and InChIKey) from the Chemical Component Dictionary (CCD) provided by the PDB database. We perform mappings from PDB ligand IDs (i.e., CCD IDs) to ligand IDs in the ChEMBL, DrugBank, and ZINC databases using the UniChem database [47]. For protein ligands, we retrieve their GO terms, EC numbers, species, and UniProt accessions from the SIFTS [10] database. For DNA ligands, we retrieve the species from the mmCIF file of the asymmetric unit, similar to how we obtain species information for RNA chains.
RNA secondary structure assignment
FURNA assigns RNA secondary structures in dot-bracket format for canonical base pairs (Watson–Crick pairs and G:U wobble pairs) in the experimental 3D structure, using 2 complementary methods: CSSR [48] and DSSR [49]. CSSR is our recently developed method optimized for coarse-grained RNA structures. It can assign secondary structures even when the nucleotides have missing atoms. Conversely, DSSR only functions when all atoms of the nucleobase are present and its RMSD to the standard nucleobase conformation [50] is less than 0.28 Å. Due to these stringent requirements, DSSR-assigned secondary structures might have missing positions compared to the input RNA. To ensure consistency between DSSR input and output, we utilize Arena [51] to fill in missing atoms and rectify unphysical nucleobase conformations for all RNA chains before we execute the DSSR assignment. For an RNA-RNA interaction involving 2 RNA chains, we assign secondary structures to both the individual RNAs and the RNA pair.
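For readers unfamiliar with the notation, the sketch below converts a list of detected base pairs into a dot-bracket string while keeping only canonical pairs. It does not reproduce CSSR or DSSR, which detect the pairs from 3D coordinates; the function name is hypothetical, and the sketch assumes 0-based positions with nested (pseudoknot-free) pairs.

```python
CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"),   # Watson-Crick
    ("G", "C"), ("C", "G"),   # Watson-Crick
    ("G", "U"), ("U", "G"),   # wobble
}


def to_dot_bracket(sequence: str, pairs: list[tuple[int, int]]) -> str:
    """Write canonical base pairs (i < j, 0-based, nested) as a dot-bracket string."""
    structure = ["."] * len(sequence)
    for i, j in pairs:
        if (sequence[i], sequence[j]) in CANONICAL_PAIRS:
            structure[i] = "("
            structure[j] = ")"
    return "".join(structure)


# The U-U candidate pair (3, 5) is non-canonical and is left unpaired.
dot_bracket = to_dot_bracket("GCAUUUUGC", [(0, 8), (1, 7), (2, 6), (3, 5)])
```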
Infernal database construction
For users to perform sensitive Infernal searches of query RNA sequences through FURNA, a database in the Infernal [41] format must be preconstructed. To accomplish this, we first obtain a nonredundant set of RNAs, which is generated by collapsing multiple FURNA RNAs with identical sequences into 1 entry. For each RNA in the nonredundant set, the CSSR-assigned secondary structure in dot-bracket format is collected. Pseudoknots present in the secondary structures are removed by an incremental length (IL) approach, where nonconflicting paired regions are added one by one, starting with the longest paired region [52]. Subsequently, the secondary structure and sequence are converted by the “cmbuild” tool of the Infernal package into an uncalibrated Infernal format covariance model. This covariance model is then calibrated by the “cmcalibrate” tool of the Infernal package. The calibrated covariance models for all nonredundant FURNA RNA chains are concatenated into the Infernal format database. This database can be utilized by the “cmscan” tool of the Infernal package, allowing a user to perform Infernal searches of query RNA sequences through FURNA.
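The pseudoknot removal step can be sketched as follows. This is one reading of the incremental length idea, under the assumption that a paired region is a run of stacked base pairs and that a region conflicts with the kept set when any of its pairs crosses a kept pair; the function names are hypothetical and the details may differ from the method in [52].

```python
def stems_from_pairs(pairs: list[tuple[int, int]]) -> list[list[tuple[int, int]]]:
    """Group base pairs (i < j) into stems, i.e., runs of stacked pairs."""
    remaining = set(pairs)
    stems = []
    for i, j in sorted(pairs):
        if (i, j) not in remaining:
            continue  # already absorbed into an earlier stem
        stem = []
        while (i, j) in remaining:
            remaining.remove((i, j))
            stem.append((i, j))
            i, j = i + 1, j - 1
        stems.append(stem)
    return stems


def crosses(p: tuple[int, int], q: tuple[int, int]) -> bool:
    """True if the two base pairs interleave, which is what makes a pseudoknot."""
    (i, j), (k, l) = p, q
    return i < k < j < l or k < i < l < j


def remove_pseudoknots(pairs: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Add stems from longest to shortest, skipping any that cross kept pairs."""
    kept: list[tuple[int, int]] = []
    for stem in sorted(stems_from_pairs(pairs), key=len, reverse=True):
        if not any(crosses(p, q) for p in stem for q in kept):
            kept.extend(stem)
    return sorted(kept)


# A 3-pair stem crossed by a 2-pair stem: the shorter one is dropped.
knotted = [(0, 10), (1, 9), (2, 8), (4, 14), (5, 13)]
nested = remove_pseudoknots(knotted)
```

The remaining pairs are nested and can be written as a single-bracket-type dot-bracket string, which is the form needed as input for building a covariance model.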
US-align database construction
Since conducting a tertiary structure search of all RNA chains in FURNA is more time-consuming than a sequence search, 2 procedures are implemented to reduce the size of the structure database used for US-align search. First, the nonredundant set of RNAs with nonidentical sequences is isolated, from which the coordinates of the C3’ atoms are extracted. The exclusion of atoms other than C3’ does not affect US-align, which only considers C3’ atoms for RNA structure alignment. Second, we utilize the qTMclust tool [43] from the US-align package to cluster the structures of the nonredundant RNAs. This results in a set of representative RNA structures with a pairwise TM-score [44] less than 0.5. These representative RNA structures form the US-align database. When a user carries out an RNA structure query through the FURNA website, this query structure will be searched using US-align through the database of representative structures to report the top 100 hits with the highest TM-scores. Meanwhile, the RNAs belonging to the same structure clusters will also be reported.
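The idea of a representative set with pairwise TM-score below 0.5 can be illustrated with a simple greedy pass. This is not the qTMclust algorithm, whose ordering and heuristics are not described in the text; the function name, the toy scores, and the assumption of a symmetric precomputed TM-score are all hypothetical.

```python
def pick_representatives(ids, tm_score, cutoff=0.5):
    """Greedy clustering: a structure joins the first representative it matches.

    ids: structure identifiers in the order they should be processed.
    tm_score: function (id_a, id_b) -> TM-score between the two structures.
    Returns a dict mapping each representative to the members of its cluster.
    """
    clusters: dict[str, list[str]] = {}
    for sid in ids:
        for rep in clusters:
            if tm_score(sid, rep) >= cutoff:
                clusters[rep].append(sid)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Toy symmetric scores for three structures.
toy_scores = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.3,
    frozenset(("b", "c")): 0.2,
}
clusters = pick_representatives(
    ["a", "b", "c"], lambda x, y: toy_scores[frozenset((x, y))]
)
```

Searching only the representatives and then reporting the members of each matching cluster mirrors the two-stage reporting described above.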
Supporting information
S1 Fig. Examples of binding sites for biologically relevant cations versus phasing additives.
Shown are X-ray structures of group I intron P4-P6 domains from Tetrahymena thermophila in complex with Mg2+ and cobalt hexammine (III). (A) Structure determined with Mg2+ (yellow spheres) and cobalt hexammine (III) (magenta spheres), PDB 1gid chain A. (B) Structure determined with Mg2+ (blue spheres) only, PDB 6d8o chain A. (C) Overlap of the 2 structures. Magenta arrows in panels (A) and (B) indicate cobalt hexammine (III) binding sites, which are completely different from Mg2+ binding sites.
https://doi.org/10.1371/journal.pbio.3002476.s001
(TIF)
S2 Fig. Distributions of F1-scores for the similarities of GO annotations between interacting versus random RNA-RNA pairs or interacting versus random RNA-protein pairs.
Here, the F1-score of GO annotations between molecule A and molecule B is calculated as $F_1 = 2\,|GO_A \cap GO_B| \,/\, (|GO_A| + |GO_B|)$, where $GO_A$ and $GO_B$ are the sets of GO terms (including parent terms) annotated to molecule A and molecule B, respectively. The solid horizontal bars inside each violin show the mean F1-score. The values above the violin plots show the p-values of Wilcoxon rank sum tests between adjacent violins. We observe that, just as for proteins, inter-molecular interactions provide a substantial amount of information regarding BP and CC terms, but not MF terms (as might be expected, since interacting pairs should co-localize in the cell and be involved in the same pathway but will typically not have the same function at the molecular level). P-values <0.001 and between 0.001 and 0.01 are marked by *** and **, respectively.
https://doi.org/10.1371/journal.pbio.3002476.s002
(TIF)
S3 Fig. Web page for protein–RNA interaction between human spliceosomal U2B” protein (PDB 1a9n chain B) and U2 small nuclear RNA (PDB 1a9n chain Q, https://seq2fun.dcmb.med.umich.edu/furna/pdb.cgi?pdbid=1a9n&chain=Q&lig3=protein&ligCha=B&ligIdx=0).
https://doi.org/10.1371/journal.pbio.3002476.s003
(TIF)
S4 Fig. Web page for a hammerhead ribozyme chain (PDB 3zp8 chain A, https://seq2fun.dcmb.med.umich.edu/furna/pdb.cgi?pdbid=3zp8&chain=A).
https://doi.org/10.1371/journal.pbio.3002476.s004
(TIF)
S5 Fig. Multiple sequence alignment for the experimental structure of HDV ribozyme (PDB 1vc6 chain B, first row) and representative sequences of the Rfam family RF00094 “Hepatitis delta virus ribozyme” used to build the RF00094 covariance model.
The covariance model region that is absent in the experimental structure is highlighted by a black box.
https://doi.org/10.1371/journal.pbio.3002476.s005
(TIF)
S1 Table. Top small molecule compounds in FURNA, excluding monatomic ions.
https://doi.org/10.1371/journal.pbio.3002476.s006
(DOCX)
S2 Table. The TPP riboswitches used for our case study.
https://doi.org/10.1371/journal.pbio.3002476.s007
(DOCX)
S3 Table. Annotations of active site nucleotides for the 15 types of ribozymes currently covered by the Ribocentre database.
https://doi.org/10.1371/journal.pbio.3002476.s008
(DOCX)
References
- 1. Ma H, Jia X, Zhang K, Su Z. Cryo-EM advances in RNA structure determination. Signal Transduct Target Ther. 2022;7(1):58. Epub 20220223. pmid:35197441; PubMed Central PMCID: PMC8864457.
- 2. Berman H, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nat Struct Biol. 2003;10(12):980. pmid:14634627.
- 3. Berman HM, Lawson CL, Schneider B. Developing Community Resources for Nucleic Acid Structures. Life (Basel). 2022;12(4). Epub 20220406. pmid:35455031; PubMed Central PMCID: PMC9031032.
- 4. Cerny J, Bozikova P, Maly M, Tykac M, Biedermannova L, Schneider B. Structural alphabets for conformational analysis of nucleic acids available at dnatco.datmos.org. Acta Crystallogr D Struct Biol. 2020;76(Pt 9):805–13. Epub 20200817. pmid:32876056; PubMed Central PMCID: PMC7466747.
- 5. Petrov AI, Zirbel CL, Leontis NB. Automated classification of RNA 3D motifs and the RNA 3D Motif Atlas. RNA. 2013;19(10):1327–40. Epub 20130822. pmid:23970545; PubMed Central PMCID: PMC3854523.
- 6. Stefan LR, Zhang R, Levitan AG, Hendrix DK, Brenner SE, Holbrook SR. MeRNA: a database of metal ion binding sites in RNA structures. Nucleic Acids Res. 2006;34(Database issue):D131–4. pmid:16381830; PubMed Central PMCID: PMC1347421.
- 7. Fernandes JCR, Acuna SM, Aoki JI, Floeter-Winter LM, Muxel SM. Long Non-Coding RNAs in the Regulation of Gene Expression: Physiology and Disease. Noncoding RNA. 2019;5(1). Epub 20190217. pmid:30781588; PubMed Central PMCID: PMC6468922.
- 8. Matsui M, Corey DR. Non-coding RNAs as drug targets. Nat Rev Drug Discov. 2017;16(3):167–179. pmid:27444227.
- 9. Laskowski RA, Jablonska J, Pravda L, Varekova RS, Thornton JM. PDBsum: Structural summaries of PDB entries. Protein Sci. 2018;27(1):129–34. Epub 20171027. pmid:28875543; PubMed Central PMCID: PMC5734310.
- 10. Dana JM, Gutmanas A, Tyagi N, Qi G, O’Donovan C, Martin M, et al. SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res. 2019;47(D1):D482–D489. pmid:30445541; PubMed Central PMCID: PMC6324003.
- 11. UniProt C. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49(D1):D480–D489. pmid:33237286; PubMed Central PMCID: PMC7778908.
- 12. Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2021;49(D1):D344–D354. pmid:33156333; PubMed Central PMCID: PMC7778928.
- 13. Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics. 2015;31(3):405–12. Epub 20141009. pmid:25301850.
- 14. Gilson MK, Liu T, Baitaluk M, Nicola G, Hwang L, Chong J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016;44(D1):D1045–53. Epub 20151019. pmid:26481362; PubMed Central PMCID: PMC4702793.
- 15. Smith RD, Clark JJ, Ahmed A, Orban ZJ, Dunbar JB, Carlson HA. Updates to Binding MOAD (Mother of All Databases): Polypharmacology Tools and Their Utility in Drug Repurposing. J Mol Biol. 2019;431(13):2423–2433. WOS:000474675300004. pmid:31125569.
- 16. Consortium PD-K. PDBe-KB: collaboratively defining the biological context of structural data. Nucleic Acids Res. 2022;50(D1):D534–D542. pmid:34755867; PubMed Central PMCID: PMC8728252.
- 17. Maietta P, Lopez G, Carro A, Pingilley BJ, Leon LG, Valencia A, et al. FireDB: a compendium of biological and pharmacologically relevant ligands. Nucleic Acids Res. 2014;42(Database issue):D267–72. Epub 20131115. pmid:24243844; PubMed Central PMCID: PMC3965074.
- 18. Shoemaker BA, Zhang D, Tyagi M, Thangudu RR, Fong JH, Marchler-Bauer A, et al. IBIS (Inferred Biomolecular Interaction Server) reports, predicts and integrates multiple types of conserved interactions for proteins. Nucleic Acids Res. 2012;40(Database issue):D834–40. Epub 20111118. pmid:22102591; PubMed Central PMCID: PMC3245142.
- 19. Zhang C, Zhang X, Freddolino PL, Zhang Y. BioLiP2: an updated structure database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2023. Epub 20230731. pmid:37522378.
- 20. Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 2021;49(D1):D192–D200. pmid:33211869; PubMed Central PMCID: PMC7779021.
- 21. Mendez D, Gaulton A, Bento AP, Chambers J, De Veij M, Felix E, et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 2019;47(D1):D930–D940. pmid:30398643; PubMed Central PMCID: PMC6323927.
- 22. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46(D1):D1074–D1082. pmid:29126136; PubMed Central PMCID: PMC5753335.
- 23. Irwin JJ, Tang KG, Young J, Dandarchuluun C, Wong BR, Khurelbaatar M, et al. ZINC20-A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J Chem Inf Model. 2020;60(12):6065–73. Epub 20201029. pmid:33118813; PubMed Central PMCID: PMC8284596.
- 24. Consortium RNAcentral. RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res. 2021;49(D1):D212–D20. Epub 2020/10/28. pmid:33106848; PubMed Central PMCID: PMC7779037.
- 25. Marcia M, Pyle AM. Visualizing Group II Intron Catalysis through the Stages of Splicing. Cell. 2012;151(3):497–507. WOS:000310529300008. pmid:23101623.
- 26. Rozov A, Khusainov I, El Omari K, Duman R, Mykhaylyk V, Yusupov M, et al. Importance of potassium ions for ribosome structure and function revealed by long-wavelength X-ray diffraction. Nat Commun. 2019;10(1):2519. Epub 20190607. pmid:31175275; PubMed Central PMCID: PMC6555806.
- 27. Cate JH, Doudna JA. Metal-binding sites in the major groove of a large ribozyme domain. Structure. 1996;4(10):1221–1229. pmid:8939748.
- 28. Zhang C, Freddolino PL, Zhang Y. COFACTOR: improved protein function prediction by combining structure, sequence and protein-protein interaction information. Nucleic Acids Res. 2017;45(W1):W291–W9. Epub 2017/05/05. pmid:28472402; PubMed Central PMCID: PMC5793808.
- 29. Zhang C, Zheng W, Freddolino PL, Zhang Y. MetaGO: Predicting Gene Ontology of Non-homologous Proteins Through Low-Resolution Protein Structure Prediction and Protein Protein Network Mapping. J Mol Biol. 2018;430(15):2256–2265. WOS:000437815400011. pmid:29534977.
- 30. You R, Yao S, Xiong Y, Huang X, Sun F, Mamitsuka H, et al. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 2019;47(W1):W379–W387. pmid:31106361; PubMed Central PMCID: PMC6602452.
- 31. Giudice G, Sanchez-Cabo F, Torroja C, Lara-Pezzi E. ATtRACT-a database of RNA-binding proteins and associated motifs. Database (Oxford). 2016;2016. Epub 2016/04/09. pmid:27055826; PubMed Central PMCID: PMC4823821.
- 32. Hanson RM, Prilusky J, Renjian Z, Nakane T, Sussman JL. JSmol and the Next-Generation Web-Based Representation of 3D Molecular Structure as Applied to Proteopedia. Isr J Chem. 2013;53(3–4):207–216. WOS:000317859800011.
- 33. Zhang C. BeEM: fast and faithful conversion of mmCIF format structure files to PDB format. BMC Bioinformatics. 2023;24(1):260. pmid:37340457.
- 34. Zhang C, Pyle AM. PDC: a highly compact file format to store protein 3D coordinates. Database (Oxford). 2023;2023. pmid:37010520; PubMed Central PMCID: PMC10069377.
- 35. Ellson J, Gansner ER, Koutsofios E, North SC, Woodhull G. Graphviz and dynagraph—Static and dynamic graph drawing tools. Math Vis. 2004:127–48. WOS:000186345600006.
- 36. Serganov A, Nudler E. A decade of riboswitches. Cell. 2013;152(1–2):17–24. pmid:23332744; PubMed Central PMCID: PMC4215550.
- 37. Wachter A. Riboswitch-mediated control of gene expression in eukaryotes. RNA Biol. 2010;7(1):67–76. Epub 20100101. pmid:20009507.
- 38. Serganov A, Polonskaia A, Phan AT, Breaker RR, Patel DJ. Structural basis for gene regulation by a thiamine pyrophosphate-sensing riboswitch. Nature. 2006;441(7097):1167–71. Epub 20060521. pmid:16728979; PubMed Central PMCID: PMC4689313.
- 39. Edwards TE, Ferre-D’Amare AR. Crystal structures of the thi-box riboswitch bound to thiamine pyrophosphate analogs reveal adaptive RNA-small molecule recognition. Structure. 2006;14(9):1459–1468. pmid:16962976.
- 40. Mironov AS, Gusarov I, Rafikov R, Lopez LE, Shatalin K, Kreneva RA, et al. Sensing small molecules by nascent RNA: a mechanism to control transcription in bacteria. Cell. 2002;111(5):747–756. pmid:12464185.
- 41. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29(22):2933–5. Epub 2013/09/07. pmid:24008419; PubMed Central PMCID: PMC3810854.
- 42. Deng J, Shi Y, Peng X, He Y, Chen X, Li M, et al. Ribocentre: a database of ribozymes. Nucleic Acids Res. 2023;51(D1):D262–D268. pmid:36177882; PubMed Central PMCID: PMC9825448.
- 43. Zhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods. 2022;19(9):1109–15. Epub 20220829. pmid:36038728.
- 44. Gong S, Zhang C, Zhang Y. RNA-align: quick and accurate alignment of RNA 3D structures based on size-independent TM-scoreRNA. Bioinformatics. 2019;35(21):4459–61. Epub 2019/06/05. pmid:31161212; PubMed Central PMCID: PMC6821192.
- 45. Bailey TL, Johnson J, Grant CE, Noble WS. The MEME Suite. Nucleic Acids Res. 2015;43(W1):W39–49. Epub 20150507. pmid:25953851; PubMed Central PMCID: PMC4489269.
- 46. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–8. Epub 20110216. pmid:21330290; PubMed Central PMCID: PMC3065696.
- 47. Chambers J, Davies M, Gaulton A, Hersey A, Velankar S, Petryszak R, et al. UniChem: a unified chemical structure cross-referencing and identifier tracking system. J Chem. 2013;5(1):3. Epub 20130114. pmid:23317286; PubMed Central PMCID: PMC3616875.
- 48. Zhang C, Pyle AM. CSSR: assignment of secondary structure to coarse-grained RNA tertiary structures. Acta Crystallogr D Struct Biol. 2022;78(Pt 4):466–71. Epub 20220311. pmid:35362469; PubMed Central PMCID: PMC8972804.
- 49. Lu XJ, Bussemaker HJ, Olson WK. DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res. 2015;43(21):e142. Epub 2015/07/18. pmid:26184874; PubMed Central PMCID: PMC4666379.
- 50. Arnott S. Polynucleotide secondary structures: an historical perspective. Oxford, UK: Oxford University Press; 1999. p. 1–38.
- 51. Perry ZR, Pyle AM, Zhang C. Arena: Rapid and Accurate Reconstruction of Full Atomic RNA Structures From Coarse-grained Models. J Mol Biol. 2023;435(18):168210. Epub 20230720. pmid:37479079.
- 52. Smit S, Rother K, Heringa J, Knight R. From knotted to nested RNA structures: a variety of computational methods for pseudoknot removal. RNA. 2008;14(3):410–6. Epub 20080129. pmid:18230758; PubMed Central PMCID: PMC2248259.