A Dynamic 3D Graphical Representation for RNA Structure Analysis and Its Application in Non-Coding RNA Classification

With the development of new technologies in transcriptome and epigenetics, RNAs have been identified to play more and more important roles in life processes. Consequently, various methods have been proposed to assess the biological functions of RNAs and thus classify them functionally, among which comparative study of RNA structures is perhaps the most important one. To measure the structural similarity of RNAs and classify them, we propose a novel three dimensional (3D) graphical representation of RNA secondary structure, in which an RNA secondary structure is first transformed into a characteristic sequence based on chemical property of nucleic acids; a dynamic 3D graph is then constructed for the characteristic sequence; and lastly a numerical characterization of the 3D graph is used to represent the RNA secondary structure. We tested our algorithm on three datasets: (1) Dataset I consisting of nine RNA secondary structures of viruses, (2) Dataset II consisting of complex RNA secondary structures including pseudo-knots, and (3) Dataset III consisting of 18 non-coding RNA families. We also compare our method with other nine existing methods using Dataset II and III. The results demonstrate that our method is better than other methods in similarity measurement and classification of RNA secondary structures.


Introduction
As a bridge of information transmission, RNAs carry a wide variety of functions in biological systems, including performing catalytic function, regulating gene expression, and carrying genetic information [1]. In living cells, the single-stranded RNAs do not remain in a linear form. Instead, their bases always fold into pairs that lead to the formation of RNA secondary structures [2]. Since the three dimensional (3D) structures and functions of RNAs are mostly determined by their secondary structures [3], it is important to account for both primary sequences and secondary structures in understanding the functional similarities among RNAs, especially for non-coding RNAs (ncRNAs) and RNAs with pseudo-knots.
Recently, ncRNAs have become a focus for both computational and experimental researches [4]. Though ncRNAs do not encode proteins, they are involved in a lot of cellular processes [5]. For example, ncRNAs play an important role in chromosome maintenance and segregation [6] and have also been implicated in neurological diseases and various cancers [7,8]. Specifically, microRNAs are endogenous, short, non-coding RNA molecules that are directly involved in the posttranscriptional regulation of gene expression. Dysregulation of microRNAs is usually associated with diseases [9,10]. Moreover, a few computational methods have been developed to detect causal genes of diseases at the whole-genome level [11,12].
It is known that the functions of ncRNAs are mostly determined by their structures [5]. Though sequences can be well aligned by techniques like seed technique [13,14], predicting the secondary structure of ncRNAs is very difficult [15]. Novel and effective methods to accurately evaluate the structural similarities of ncRNAs are highly demanded, especially for those with special structures like pseudo-knot. As a kind of RNA structures formed by stem nesting, pseudo-knot is responsible for a few important biological activities such as virus infiltration [16].
Historically, dynamic programming based algorithms with various scoring functions have been widely used to measure the similarities among RNA secondary structures [17][18][19]. For example, based on the alignment of a string representation of RNA secondary structure, a score function and a distance function were established to represent insertion, deletion, and substitution of bases in the compared structures [17,18]. In addition, a tree representation of the RNA secondary structure elements [19] and base pairing probability matrices [20,21] were also proposed to measure the similarities among RNA structures. However, methods based on dynamic programming are computationally inefficient, which makes it hard to predict RNAs with complex secondary structures like pseudo-knot.
In order to measure RNA similarity more efficiently, various alternative techniques have been tested. For example, a novel 2D graphical representation of RNA secondary structure was proposed in [22]. However, this method may cause the loss of information due to its nonuniqueness in representing an RNA. Jeffery introduced a chaos game representation of DNA sequences [23], based on which Li et al. proposed a non-degenerative 2D graphical representation of RNA secondary structure to solve the information loss problem [24]. On the ground of sequence and base chemical information, two similar 3D representation methods were proposed [23,25]. However, they are space-demanding, especially for long RNAs. As a high dimension representation scheme, a 4D method was developed to resolve the problem of structure degeneracy and information loss, but it is not good for visualization [26]. In addition, a novel wavelet-based graphical representation method was used to classify non-coding RNA secondary structures [27]. However, the data obtained by this method is redundant because each base is characterized by three vectors.
In this paper, we proposed a novel dynamic 3D graphical representation of RNA secondary structures based on their sequences, chemical properties, and structural information. We evaluated our algorithm on three sets of RNA secondary structures including 9 viral RNAs, 33 RNAs with complex secondary structures, and 120 non-coding RNAs. A comparison study showed that our novel method outperformed other nine methods in deciphering RNA similarity.

Materials and Methods Data
We adopted three RNA datasets in this study: (1) Dataset I consisting of 3'-terminus of RNAs from nine viruses reported by Bol (see Fig 1) [28]; (2) Dataset II containing two kinds of RNA secondary structures: 17 complex RNA secondary structures retrieved from RNase P database [29] and 16 RNA secondary structures with pseudo-knots retrieved from Pseud Base (Pseud Base++) [30] (see S1 Fig); (3) Dataset III including 60 non-coding RNA secondary structures from RNA STRAND database (RNA STRAND v2.0) [31,32], and 60 non-coding RNA sequences randomly selected from 18 non-coding RNA families in Rfam database (see S2 Fig).

3D graphical representation of RNA secondary structures
The secondary structure of an RNA was constructed by four free bases A, G, C, and U, as well as base pairs formed by bonds between A-U, G-C, and G-U. For convenience, A, G, C, and U located in the base pairs were denoted by A', G', C', and U', respectively. In this way, we can use a sequence consisting of A, G, C, U and A', G', C', U' to represent an RNA secondary structure. The sequence is called the "characteristic sequence" of the RNA secondary structure. We provided a software to generate such sequence (see S1 Software) from any RNA secondary structure.
Based on their chemical properties, the four bases A, C, G, and U can be divided into three groups: According to the base classification scheme (i), (ii), and (iii), a characteristic sequence can be represented by three 3D graphs through three maps φ 1 , φ 2 , and φ 3 respectively (see Table 1 i , and U 0 i are the cumulative occurrence numbers of A, G, C, U, A 0 , G 0 , C 0 , and U 0 , respectively in the characteristic sequence from the first base to the i-th base, i = 1, 2, . . . n with n being the length of the characteristic sequence. Specifically, let G = g 1 g 2 g 3 . . .g n be the characteristic sequence of an RNA secondary structure, where n is the length of G. Each base g i can be mapped into three dots φ 1 (g i ) = (x 1i , y 1i , z 1i ), φ 2 (g i ) = (x 2i , y 2i , z 2i ), and φ 3 (g i ) = (x 3i , y 3i , z 3i ) where i = 1, 2, . . ., n. By connecting dots in φ 1 (g i ) = (x 1i , y 1i , z 1i ) (i = 1, 2, . . ., n) in order, we can obtain an M-K curve to represent the characteristic sequence G. Similarly, by connecting dots in φ 2 (g i ) (i = 1, 2, . . ., n) and φ 3 (g i ) (i = 1, 2, . . ., n) respectively, we can obtain R-Y and W-S curves. Taking the secondary structure of TSV-3 as an example, we plotted its M-K, R-Y, and W-S curves in Fig 2. Obviously, the x, y, and z coordinates for each of the three curves are all dynamic. The points projected onto the X-Y plane in our method are moving around the unit circle by a certain length and the z coordinate is changed by the content of bases.

Properties of the method
In this section, we introduced 3 properties showing 3 theoretical advantages of our method including non-degenerative (no information loss), easily reflecting the content of stem and loop, and the distribution of base frequencies. The proof of the properties were provided in S1 Text.
Property 1. The mapping between an RNA secondary structure and its dynamic 3D graphical representation is one to one, and the mapping on the X-Y plane is non-degenerative. Thus no information is lost.
Property 2. Our dynamic 3D graphical representation can easily reflect the content of bases and proportion of stem and loop structures.
This property shows that a few information on the base distribution and compositions of RNA secondary structure, such as the proportion of stem and loop structures, can be intuitively reflected by the dynamic 3D graphical representation. It is of note that the proportion of stem and loop structures is extremely important for RNA secondary structure prediction [33]. Property 3. For a characteristic sequence of the RNA secondary structure, let the frequen- where n is the length of the characteristic sequence, then z 1 3 indicate the distribution of base frequencies in the whole sequence.

Characterizing RNA secondary structures by 36-D vectors
For an RNA secondary structure, we have three sets of points (x 1i , y 1i , z 1i ), (x 2i , y 2i , z 2i ), and (x 3i , y 3i , z 3i ), i = 1, 2, . . ., n, where n is the length of the structure. We draw the geometrical center of the three curves to construct a 36-dimensional vector denoted by The 36-dimensional vector was adopted as a descriptor to characterize RNA secondary structures. It is of note that according to the 3 properties in the previous section and the Table 1. The definition of three maps φ 1 , φ 2 , and φ 3 . Let G = g 1 g 2 g 3 . . .g n be the characteristic sequence of TSV-3, where n is the length of G. Each base g i can be mapped into three dots φ 1 (g i ) = (x 1i , y 1i , z 1i ), φ 2 (g i ) = (x 2i , y 2i , z 2i ), and φ 3 (g i ) = (x 3i , y 3i , z 3i ), where i = 1, 2, . . ., n. By connecting dots in φ 1 (g i ) = (x 1i , y 1i , z 1i ) (i = 1, 2, . . ., n) in order, we can obtain an M-K curve to represent the characteristic sequence G. Similarly, by connecting dots in φ 2 (g i ) = (x 2i , y 2i , z 2i ) and φ 3 (g i ) = (x 3i , y 3i , z 3i ) respectively, we can obtain R-Y and W-S curves.
doi:10.1371/journal.pone.0152238.g002 compactness and uniqueness of the three curves for a given RNA secondary structure, this dynamic graphical representation scheme can reflect the distribution of bases in the characteristic sequence and has small degeneracy. Thus, we adopted it to compute the similarity between RNA secondary structures. Specifically, for any two RNA secondary structures, we first constructed their 36D representative vectors, and then calculated the similarity between the two vectors by the quotient between (1) the Euclidean distance between their end-points (in graph) and (2) the cosine of the angle between the two vectors. Clearly, the smaller is the quotient, the more similar are the two RNA secondary structures.

Results and Discussion
Similarities between nine RNA secondary structures of virus  Table 2.
As Table 2 shows, the smallest entries are associated with pairs (AVII, LRMV-3), (LRMV-3, EMV-3), and (AVII, EMV-3), which indicates that AVII, LRMV-3, and EMV-3 are more similar to each other. On the other hand, APMV-3, AlMV-3, and PDV-3 show great dissimilarity with others. This is consistent with the results reported by Liao et al. [34,35], Yao et al. [22], Li et al. [24], and Bai et al. [36]. For a better view of our results, we constructed a phylogenetic tree (see Fig 3) for the nine RNA structures based on the 9×9 similarity/dissimilarity matrix using UPGMA method in MEGA5.1. The results indicate that the 36D vectors can catch some intrinsic characteristics of RNA secondary structures.

Further test on Dataset II and III
To further evaluate the sensitivity of this algorithm in measuring the similarities among complex RNA secondary structures, we applied it to Dataset II, whose characteristic sequences were shown in S1 Table. Based on [37,38], the 17 complex RNA secondary structures shown in S1 We plotted the similarities among the 33 RNA secondary structures (based on their 36D vector representations) in Fig 4. There are 11 branches corresponding to the six classes of RNase P structures and five classes of pseudo-knot structures respectively. Thus, our method perfectly separated the RNA classes in Dataset II.
By a similar process, we constructed a phylogenetic tree (Fig 5) for the secondary structures of 60 ncRNAs in Dataset III, whose characteristic sequences were shown in S2 Table. The phylogenetic tree presents clearly 18 branches corresponding to the 18 non-coding RNA families with only one mis-clustering, i.e., RF00001.Methanolobus.tindarius was classified into the cluster RF00374. The results show that our approach performs well in comparing the secondary structures of non-coding RNAs.

Comparison with other methods on Dataset II and III
We compared our similarity measure with other nine popular RNA comparison methods, including a similarity metric based on the wavelet decomposition of the TV-Curve of ncRNA [27], LZ complexity by Liu et al. [39], five 3D graphical representations of RNA secondary structures proposed by Liao et al. [34], Zhu et al. [25], Feng et al. [40], Liu et al. [41], and Luo et al. [42], and two different 2D graphical representations of RNA secondary structures provided by Li et al. [24] and Yao et al. [43] respectively. We applied the nine methods into Dataset II and III, and the RNA similarities measured by these methods were shown as UPGMA trees in S3-S20 Figs.
As can be seen, the method based on wavelet decomposition [27] failed in separating RNase P from others (see S3 Fig) and mis-classified RF00024.AF221913.1 and RF00001. Thermococcus.celer (see S4 Fig). Similarly, as shown in S5-S18 Figs, the five 3D and two 2D graphical representations performed weakly in comparing 33 RNA secondary structures in Dataset II. In addition, except for the method provided by Luo et al. [42], the other six methods also presented weak classifications for the 18 non-coding RNA families in Dataset III. Specifically, the method by Liu et al. [39] cannot distinguish RNAs with pseudo-knots from Pseud Base in Dataset II (see S19 and S20 Figs). An easy comparison in S3 Table showed that our method outperformed the above nine approaches. Our 36D geometrical center vector may capture some intrinsic characteristics of the non-coding RNA and pseudo-knot Table 2. The similarity/dissimilarity matrix for nine RNA secondary structures in Fig 1 based on RNA 36D vector representation.   species  AlMV-3  APMV-3  AVII  CiLRV-3  CVV-3  EMV-3  LRMV-3  PDV-3  TSV-3 AlMV   structures. Moreover, the points projected onto the X-Y plane in our method are dynamic. This is different from Liao [34] and Zhu [25], in which they fixed x and y coordinates. These dynamic x and y coordinates may provide more "order" information along the secondary structure than the fixed ones.

Conclusion
Based on the chemical property and order of bases, we first proposed a dynamic 3D graphical visualization scheme for RNA secondary structures. We then extracted digital features of the dynamic 3D graphs to compute the similarity between two RNA secondary structures. The method was applied to ncRNAs from the Rfam database and achieved good classification results. In the end, we compared our method with other nine popular approaches and showed that our method outperformed them on RNAs with pseudo-knot and non-coding RNAs. In the future, it will be interesting to extend this method to capture more features of RNA secondary structure, and interpret the information carried by the graphical representation. The additional features and information will be useful in developing a more efficient method to measure RNA structure similarity.
Supporting Information  Table. The comparison between our method and the other nine algorithms.
(DOC) S1 Text. The proof of three properties. (DOC)