Finding recurrent RNA structural networks with fast maximal common subgraphs of edge-colored graphs

RNA tertiary structure is crucial to its many non-coding molecular functions. RNA architecture is shaped by its secondary structure, composed of stems (stacked canonical base pairs) enclosing loops. While stems are precisely captured by free-energy models, loops composed of non-canonical base pairs are not, nor are the distant interactions linking together those secondary structure elements (SSEs). Databases of conserved 3D geometries (a.k.a. modules) not captured by energetic models are leveraged for structure prediction and design, but computational complexity has limited their study to local elements, namely loops. Representing the RNA structure as a graph has recently made it possible to extend this work to pairs of SSEs, uncovering a hierarchical organization of these 3D modules, at great computational cost. Systematically capturing recurrent patterns on a large scale remains a major challenge in the study of RNA structures. In this paper, we present an efficient algorithm to compute maximal isomorphisms in edge-colored graphs. We extend this algorithm into a framework well suited to identifying RNA modules, and fast enough to considerably generalize previous approaches. To exhibit the versatility of our framework, we first reproduce results identifying all common modules spanning more than 2 SSEs, in a few hours instead of weeks. The efficiency of our new algorithm is demonstrated by computing the maximal modules between any pair of entire RNAs in the non-redundant corpus of known RNA 3D structures. We observe that the biggest modules our method uncovers compose large shared substructures spanning hundreds of nucleotides and base pairs between the ribosomes of Thermus thermophilus, Escherichia coli, and Pseudomonas aeruginosa.


Management of exceptions to the proper edge-coloring
The annotation method applied to RNA structures obtained from biological experiments sometimes produces nodes involved in two base pairs of the same type. The edge labels of RNA 2D structure graphs containing such nodes do not form a proper edge colouring, which is problematic for our graph-matching algorithms. Those violations might be artifacts of the interaction prediction method, which is based on distance thresholds between the nucleobases' atoms. However, it is possible that those double interactions are actually biologically relevant, and we thus propose a solution to handle them. From our observations, such nodes are very uncommon (a dozen nodes out of the fifty-two thousand nodes of the complexes from the non-redundant RNA database maintained on RNA3DHub [1]). Moreover, in all the cases we observed, the nucleobase forming the same interaction twice was forming it with two other nucleobases that are consecutive in the backbone. As a consequence, we handle those exceptions by duplicating any connected component containing a problematic node as follows. Let a, b, c ∈ N such that {a, b}, {a, c} ∈ E_l and b < c, and let κ be the connected component containing them. We create κ^!_a, κ^−_a, κ^+_a and κ^*_a with:
To avoid duplicating results, if the same structure is found in several versions of κ, we only keep one occurrence. The conserved occurrence is picked according to the following
This order prioritizes the connected component our solution impacted the least, so the fact that an occurrence is reported as found in κ^* implies that the triangular structure was required. We also do not search for common structures between different versions of the same component, as they represent the same nucleotides of the same RNA.
Please note that we have been considering connected components in this section because f′ may disconnect the RNA 2D structure graphs.
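As an illustration of how such exceptions can be located, the following minimal sketch flags the nodes that carry two incident edges with the same label and extracts the connected component containing each of them. It is not the implementation used in the pipeline. It assumes an undirected graph given as (u, v, label) triples, integer node identifiers that follow the backbone order, and non-backbone interactions only; the function names and the labels "cWW" and "tHS" are hypothetical choices made for this example. It does not build the duplicated versions of the component.

```python
from collections import defaultdict, deque


def improper_nodes(edges):
    """Map each node that breaks the proper edge colouring to its repeated labels.

    `edges` is an iterable of (u, v, label) triples (assumed undirected).
    Returns {node: {label: sorted partners}} for labels used at least twice.
    """
    partners = defaultdict(lambda: defaultdict(set))
    for u, v, label in edges:
        partners[u][label].add(v)
        partners[v][label].add(u)
    violations = {}
    for node, by_label in partners.items():
        repeated = {lab: sorted(nb) for lab, nb in by_label.items() if len(nb) > 1}
        if repeated:
            violations[node] = repeated
    return violations


def connected_component(edges, start):
    """Return the set of nodes reachable from `start` (breadth-first search)."""
    adjacency = defaultdict(set)
    for u, v, _ in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen


# Toy input (hypothetical): node 5 forms the same interaction type twice.
edges = [(5, 20, "cWW"), (5, 21, "cWW"), (6, 19, "cWW"), (30, 40, "tHS")]
violations = improper_nodes(edges)
for a, repeated in violations.items():
    for label, (b, c, *_) in repeated.items():
        # In the cases reported above, b and c were consecutive in the backbone.
        consecutive = (c - b == 1)
        component = connected_component(edges, a)
        print(a, label, (b, c), consecutive, sorted(component))
```

Only the components returned for flagged nodes would need to be duplicated; all other components can be passed to the matching algorithm unchanged.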

Gathering of partial results
The core of the process of transforming a set of maximal common subgraphs (mcsgs) into a collection of recurrent interaction networks (RINs) relies on applying the filtering function f_RIN to each mcsg in order to obtain the RINs inside it. We thus obtain a set of RINs from each mcsg found. We merge those sets so that identical RINs (i.e. those whose canonical graphs are isomorphic) are merged into a single RIN combining all their occurrences (without duplicates). However, the set of RINs found in the mcsgs is not identical to the set of RINs found in the dataset, which is the one we are seeking. To obtain the latter, we have to correct two issues in the former. In such a case, we consider that a does not provide any additional information to the collection and thus that it needs to be removed for the sake of the readability of the collection of RINs.
The presence of such RINs in the set of RINs found in the mcsgs is due to the counterintuitive fact that a maximal common subgraph may contain a "non-maximal" RIN i.e. a
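The merging step described in this section (identical RINs collapsed into one entry that accumulates all occurrences without duplicates) can be sketched as follows. This is only an illustration under stated assumptions: each RIN is given as a list of (u, v, label) edges together with its occurrences, an occurrence is a hashable (RNA identifier, node tuple) pair, and `canonical_key` is a hypothetical brute-force canonical form that is only practical for very small graphs. The removal of non-maximal RINs is not covered.

```python
from itertools import permutations


def canonical_key(edges):
    """Brute-force canonical form of a small edge-labelled undirected graph.

    Two graphs receive the same key exactly when a label-preserving
    isomorphism exists between them. Cost is factorial in the node count.
    """
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    best = None
    for perm in permutations(range(len(nodes))):
        rename = dict(zip(nodes, perm))
        relabelled = tuple(sorted(
            (min(rename[u], rename[v]), max(rename[u], rename[v]), label)
            for u, v, label in edges
        ))
        if best is None or relabelled < best:
            best = relabelled
    return best


def merge_rins(rins_per_mcsg):
    """Merge per-mcsg RIN sets into {canonical key: set of occurrences}."""
    merged = {}
    for rins in rins_per_mcsg:
        for edges, occurrences in rins:
            # Using a set drops occurrences reported by several mcsgs.
            merged.setdefault(canonical_key(edges), set()).update(occurrences)
    return merged


# Toy input (hypothetical): the same two-edge RIN found in two different mcsgs.
from_mcsg_1 = [([(1, 2, "cWW"), (2, 3, "tHS")],
                [("rna1", (1, 2, 3)), ("rna2", (7, 8, 9))])]
from_mcsg_2 = [([(12, 11, "tHS"), (10, 11, "cWW")],
                [("rna2", (7, 8, 9)), ("rna3", (10, 11, 12))])]
merged = merge_rins([from_mcsg_1, from_mcsg_2])
for key, occurrences in merged.items():
    print(key, sorted(occurrences))
```

Keying the dictionary on a canonical form makes the isomorphism test a plain equality check, so each new RIN is merged in a single lookup instead of being compared with every RIN collected so far.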

Parallel computing
The pipeline has been designed to support parallel computing, as the problem lends itself well to it. Indeed, the production of each set of maximal common subgraphs of two RNA 2D structure graphs is independent and can be processed separately. Thus we divide the work among n cores, between a master process and n
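Because each pair of graphs can be processed independently, the comparisons can be handed to a pool of workers. The sketch below illustrates that idea with Python's standard `concurrent.futures` module; it does not reproduce the master/worker organization of the pipeline. The function `shared_edges` is a hypothetical placeholder for the actual maximal common subgraph computation, and the graph names and edge sets are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def shared_edges(graph_a, graph_b):
    """Placeholder comparison (NOT an mcsg algorithm): common labelled edges."""
    return graph_a & graph_b


def compare_all_pairs(graphs, compare, executor):
    """Submit one independent task per unordered pair of graphs.

    `graphs` maps a name to a graph object, `compare` is any two-argument
    function, and `executor` is any concurrent.futures.Executor.
    Returns {(name_a, name_b): result of compare}.
    """
    pairs = combinations(sorted(graphs), 2)
    futures = {
        (a, b): executor.submit(compare, graphs[a], graphs[b]) for a, b in pairs
    }
    # Collect results once every task has been submitted.
    return {pair: future.result() for pair, future in futures.items()}


# Toy input (hypothetical): graphs as sets of (u, v, label) edges.
graphs = {
    "rnaA": {(1, 2, "cWW"), (2, 3, "tHS")},
    "rnaB": {(1, 2, "cWW"), (4, 5, "tHS")},
    "rnaC": {(2, 3, "tHS")},
}
with ThreadPoolExecutor(max_workers=2) as executor:
    results = compare_all_pairs(graphs, shared_edges, executor)
for pair, common in sorted(results.items()):
    print(pair, sorted(common))
```

A thread pool keeps the example self-contained. To spread CPU-bound comparisons over n cores, `ProcessPoolExecutor(max_workers=n)` from the same module can be passed instead, provided the comparison function is defined at module level so that it can be pickled.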