Fig 1.
Secondary structure and module.
In (1) we show an RNA and its secondary structure with non-canonical interactions. Base pair interactions in blue are local (both nucleotides involved are in the same or in adjacent SSEs) while the ones in red are long range interactions (between two distant SSEs). The canonical base pair interactions are represented with double lines. We highlighted the loops in the structure with green dotted lines. Loops A and C are hairpins, loops D and E are interior loops, and loop B is a multi-loop. In (2) we show an instance of a module found in the RNA secondary structure in (1). On the right is the base pair pattern that characterizes this module and on the left is the sequence profile of this module (i.e. the nucleotide sequences of the corresponding parts of RNAs this module has been observed in). The first sequence in the profile, for instance, corresponds to the RNA displayed in (1).
Fig 2.
From 3D structure to directed edge-labelled graph.
In this figure we illustrate, with a simple RNA structure, the transition from the 3D structure (a) to the RNA 2D structure graph (b) and finally to the directed edge-labelled graph (c). Each edge label of the directed edge-labelled graph is a pair whose first element represents the type of interaction (using the same symbols as in the RNA 2D structure graph) while the second denotes the local (blue) vs. long-range (red) property of the interaction (using the same colors as in the RNA 2D structure graph). Moreover, the set of edge labels forms a directed proper edge-coloring, as illustrated in the last panel (d), where each geometric type of interaction has been assigned a distinct color. Note that panel (d) only illustrates that the edge labels form a proper edge-coloring; our method does not actually replace the labels with colors.
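The edge-label structure described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `is_proper_edge_coloring`, the triple-based edge encoding and the label strings are all assumptions, and "directed proper edge-coloring" is taken here to mean that no two edges leaving the same node carry the same label.

```python
from collections import defaultdict

def is_proper_edge_coloring(edges):
    """Return True if no two edges leaving the same node share a label.

    `edges` is an iterable of (source, target, label) triples, where a
    label is an (interaction_type, range) pair. This encoding is an
    assumption made for the sketch.
    """
    labels_seen = defaultdict(set)
    for source, _target, label in edges:
        if label in labels_seen[source]:
            return False
        labels_seen[source].add(label)
    return True

# Hypothetical toy graph; the label strings are placeholders.
edges = [
    (0, 1, ("backbone", "local")),
    (1, 2, ("backbone", "local")),
    (0, 2, ("canonical", "local")),
    (2, 0, ("canonical", "local")),
    (1, 3, ("non-canonical", "long-range")),
]
proper = is_proper_edge_coloring(edges)
# Adding a second "backbone" edge leaving node 0 breaks the property.
clash = is_proper_edge_coloring(edges + [(0, 3, ("backbone", "local"))])
```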
Fig 3.
Impact of proper edge-coloring on graph-matching.
This figure displays parts of two graphs (G on the left and H on the right) in which the nodes 0 and a are already matched together. The next step is to match their neighbours. In the generic case, all permutations have to be tested. By contrast, in the example displayed, the colors of the edges limit the options to a single one.
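The effect shown in this figure can be sketched as follows. The code assumes that each neighbourhood is stored as a mapping from edge color to neighbour, which is only possible because a proper edge-coloring guarantees one edge per color at each node; the function name and the toy data are hypothetical.

```python
from math import factorial

def match_neighbours(g_edges, h_edges):
    """Pair the neighbours of two already-matched nodes by edge color.

    g_edges and h_edges map each edge color to the neighbour reached
    through the edge of that color. Colors present on one side only
    are ignored.
    """
    return {
        g_edges[color]: h_edges[color]
        for color in g_edges
        if color in h_edges
    }

# Hypothetical neighbourhoods of node 0 in G and node a in H.
g_edges = {"red": 1, "green": 2, "blue": 3}
h_edges = {"red": "b", "green": "c", "blue": "d"}

pairs = match_neighbours(g_edges, h_edges)

# Without colors, every permutation of the neighbours is a candidate.
candidates_without_colors = factorial(len(g_edges))
```

With three neighbours on each side, the uncolored case has 3! = 6 candidate assignments, while the colors leave exactly one.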
Fig 4.
Illustration of the extension process.
This figure illustrates the extension process from a “starting point” (here ((g0, h0), (g0, h0)), in blue). We first consider the neighbors of g0 and h0 (in purple). Thanks to the PEC, there is only one way to match them. We then consider the neighbors of g1 and h1 (in green). We match g5 and h5 but discover that their neighborhoods are not compatible. At this point the behaviours of the three algorithms differ. This discovery implies that the matching cannot be extended to cover all of G, so the Graph Isomorphism and Subgraph Isomorphism algorithms abandon it and move on to another “starting point”. The All Maximal Common Subgraphs algorithm, by contrast, takes note of this discrepancy and keeps extending the matching nevertheless. This extension outputs a maximal common subgraph of G and H, and a new branch is created to explore the alternative solution suggested by the discrepancy found.
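A simplified sketch of a label-guided extension is given below. It is not the authors' algorithm: it assumes graphs stored as `node -> {edge label: neighbour}`, it follows outgoing edges only, it records a conflict whenever a label would force a pair that contradicts the current matching, and it does not create the new branches described above. All names are hypothetical.

```python
from collections import deque

def extend(G, H, start):
    """Greedily extend a matching from the pair `start` = (g, h).

    Returns the matching (dict from G nodes to H nodes) and the list
    of conflicting pairs that were noted and skipped.
    """
    g0, h0 = start
    matching = {g0: h0}
    used_h = {h0}
    conflicts = []
    queue = deque([start])
    while queue:
        g, h = queue.popleft()
        for label, g_next in G[g].items():
            h_next = H[h].get(label)
            if h_next is None:
                continue  # no edge with this label on the H side
            if g_next in matching or h_next in used_h:
                # Already matched: either consistent, or a conflict.
                if matching.get(g_next) != h_next:
                    conflicts.append((g_next, h_next))
                continue
            matching[g_next] = h_next
            used_h.add(h_next)
            queue.append((g_next, h_next))
    return matching, conflicts

# Hypothetical toy graphs: G is a triangle, H replaces one edge.
G = {0: {"x": 1, "y": 2}, 1: {"x'": 0, "z": 2}, 2: {"y'": 0, "z'": 1}}
H = {
    "a": {"x": "b", "y": "c"},
    "b": {"x'": "a", "z": "d"},
    "c": {"y'": "a"},
    "d": {"z'": "b"},
}
matching, conflicts = extend(G, H, (0, "a"))
```

In this toy case the extension matches all three nodes of G and notes one conflict, the pair (2, "d"), which a branching algorithm could explore as an alternative.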
Fig 5.
Exploration tree with backtracking.
This figure displays the exploration tree representing, a posteriori, the relations between the different branches created. In this tree, the root is a starting point (i.e. the nodes that are already matched at the start of an exploration) and each leaf is a different maximal common subgraph. Each path from the root to a leaf describes an exploration. For instance, the node (14,20) of the exploration tree corresponds to the action of matching node 14 of G to node 20 of H. All the leaves in the right subtree have matched 14 to 20 and all the ones in the left subtree have not. Note that only the nodes with a left child are represented; all other nodes have been collapsed since they carry no information about the exploration process. The first exploration always produces the rightmost maximal common subgraph. In this example, the first exploration encountered two conflicts and the algorithm thus produced two new branches, which were instructed not to add (24,26) and not to add (14,20), respectively. The first of the two produced another maximal common subgraph without any trouble, but the second encountered another conflict, and so on.
Fig 6.
Simplified display of the full pipeline.
The RNA 2D structure graphs given as input are pre-processed for the sake of optimization. Each pair of graphs in the pre-processed data is then given to the maximal common subgraphs algorithm as input and the output is post-processed into partial sets of RINs. All partial sets of RINs are finally merged into the complete set of RINs, which is the output of the whole pipeline.
Fig 7.
Examples of structures to illustrate the three RIN classes.
The three graphs displayed inside the Venn diagram are subgraphs of Fig 1 with the same SSE annotations (SSEs D, C and E are shown as colored areas). Graph #1 is valid for all three classes. Graph #2 spans 3 SSEs and so cannot be a valid RINabc. Graph #3 does not contain long-range interactions and thus is only valid for class RINa.
Table 1.
Summary of the relation between the rules and the three RIN classes.
Fig 8.
Distribution of RINs in Dataset 2.92.
Numbers of distinct RINs (in blue) and of all their occurrences (in green) as a function of the number of SSEs they span in Dataset 2.92.
Table 2.
For each RINab we compute how the number of SSEs covered varies between its occurrences. A value of 0 means that all occurrences span the same number of SSEs, while ±1 (resp. ±2) means that the RINab can span two (resp. three) different numbers of SSEs.
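The caption does not state how the variation is computed. One reading, sketched below as an assumption, is the difference between the largest and the smallest number of SSEs observed over the occurrences of a given RIN; the identifiers and the toy data are hypothetical.

```python
from collections import Counter

def sse_variation(sse_counts):
    """Spread of the number of SSEs over the occurrences of one RIN.

    Assumption: the variation is max - min of the observed counts.
    """
    return max(sse_counts) - min(sse_counts)

# Hypothetical data: RIN id -> number of SSEs spanned by each occurrence.
occurrences = {
    "rin_1": [2, 2, 2],
    "rin_2": [2, 3, 2],
    "rin_3": [2, 3, 4],
}

variation = {rin: sse_variation(counts) for rin, counts in occurrences.items()}

# Tally of how many RINs show each level of variation.
tally = Counter(variation.values())
```

Under this reading, `rin_1` has a variation of 0, `rin_2` of 1 (two different numbers of SSEs) and `rin_3` of 2 (three different numbers).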
Fig 9.
Distribution of RINs in Dataset 2.92.
Numbers of distinct RINs (in red) and of all their occurrences (in rose) as a function of the number of long-range interactions they contain in Dataset 2.92.
Fig 10.
Numbers of distinct RINs (in blue) and of all their occurrences (in green) as a function of the number of SSEs they span in Dataset 2.92.
Table 3.
Variation in the number of SSEs over the occurrences of the same RINa.
(Cf. Table 2). These numbers show that variation in the number of SSEs among the occurrences of a given RINa is both uncommon and limited, slightly more so than with RINab (82% of RINa with no variation vs 78% of RINab).
Table 4.
Summary of numbers of unique RINs found in the different classes with the total numbers of occurrences.
Please note that this table also displays the numbers for the RINa class in Dataset 3.137 that we will present in section 3.4.
Table 5.
This table displays the runtimes of the previous method (CaRNAval) and of our method (other rows) for the different classes of structures extracted. The values were measured with the Linux time command and are real times, i.e. the wall-clock time elapsed between the start and the end of the execution. All runs were performed on the same machine.
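The Linux time command reports real (wall-clock), user and sys times. The distinction between wall-clock time and CPU time can be illustrated in Python with the standard-library clocks `time.perf_counter` and `time.process_time`; the helper `measure` below is a hypothetical name introduced for this sketch and is not part of the pipeline.

```python
import time

def measure(task):
    """Run `task` and return (wall_clock_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

# Sleeping consumes wall-clock time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.2))
```

For a task that mostly waits, the wall-clock time is much larger than the CPU time; for a single-threaded, compute-bound task the two are close.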
Table 6.
Number of occurrences found in Dataset 2.92 and Dataset 3.137 for 5 structures of interest.
The 5 structures of interest are denoted using both their name in the literature (first column) and their ID in our database (second column). Note that this is the same ID as the one displayed in CaRNAval.
Fig 11.
Distribution of RINa in Dataset 3.137.
Numbers of distinct RINa (in red) and of all their occurrences (in rose) as a function of the number of long-range interactions they contain in Dataset 3.137.
Fig 12.
Distribution of RINa in Dataset 3.137.
Numbers of distinct RINa (in blue) and of all their occurrences (in green) as a function of the number of SSEs they span in Dataset 3.137.
Fig 13.
The figure displays three 3D structures of ribosomal RNAs: 4Y4O (chain: 2A), 5J7L (chain: DA) and 6SPB (chain: A). The colored parts correspond to the 3 largest RINa found in Dataset 3.137: RINa#1984 in red, RINa#1983 in cyan and RINa#1982 in lime green. The overlap of two RINa is colored in indigo. Additional information about those RINa and their overlap is provided in Table 7.
Table 7.
Additional information on the 3 largest RINa found in Dataset 3.137.
The colors correspond to the ones used in Fig 13. The values for the overlaps correspond to the number of nodes shared between the RINa. The RNA chains are denoted using the name of the file (e.g. 4Y4O) plus the name of the chain (e.g. 2A).
Fig 14.
Kink-turn found in Dataset 3.137 with our method.