Figures
Abstract
In this article, we present an efficient and concise OpenMP implementation of the Nussinov RNA folding algorithm, a well-known representative of non-serial polyadic dynamic programming (NPDP). Our goal is to develop an optimized implementation that can serve as a template for related dynamic programming applications. The proposed code is derived from a detailed analysis of manual implementations, emphasizing the separation of problematic and non-problematic instances and structuring computations in a way analogous to matrix multiplication. This design enables the semi-automatic extraction of data locality using tools based on Presburger arithmetic—techniques widely employed in classical loop transformations and advanced source-to-source compilers grounded in the polyhedral model. In the experimental evaluation, we assess the performance of our implementation on modern massively parallel AMD and Intel processors with 64, 128, and 192 threads. Our approach leverages cache-aware tiling, parallelism, and explicit vectorization to maximize computational efficiency, achieving performance that surpasses both automatically generated compiler-based solutions and manually tuned implementations on the evaluated platforms. Specifically, our implementation achieves execution times up to two orders of magnitude faster than polyhedral code, while also outperforming unvectorized manual approaches—being at least 30 × faster than array transposition–based methods and at least 5 × faster than the tiled sparsified Four Russians variant. Additionally, our results indicate that CPU implementations do not exhibit significantly worse performance compared to their corresponding GPU counterparts. These results demonstrate the importance of leveraging Advanced Vector Extensions (AVX) to fully exploit the capabilities of modern multi-core processors, particularly those in the AMD Epyc family.
Citation: Gruzewski M, Palkowski M (2026) Cache-efficient and vectorized parallel dynamic programming for RNA folding. PLoS One 21(5): e0349146. https://doi.org/10.1371/journal.pone.0349146
Editor: Ramon Antonio Rodriges Zalipynis, HSE University, RUSSIAN FEDERATION
Received: July 25, 2025; Accepted: April 24, 2026; Published: May 20, 2026
Copyright: © 2026 Gruzewski, Palkowski. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: We have ensured full compliance with the data availability requirements. All data underlying the findings, including complete experimental results, are publicly available. In addition, we have provided the full source code of our implementation as well as the implementations of the related methods, together with detailed instructions on how to compile and run them. The repository also includes scripts and guidelines to reproduce the experiments and validate the results. All materials are available on GitHub and have been archived on Zenodo to ensure long-term accessibility and reproducibility (see Data Avalibility section after the Abstract). https://doi.org/10.5281/zenodo.19417382 https://github.com/markpal/plosone/. The additional data are within the manuscript and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Dynamic programming (DP) is a method used in algorithm design to solve problems by breaking them down into smaller overlapping subproblems, solving each subproblem once, and storing the results to avoid redundant computations. It is especially useful for optimization problems where the solution to a larger problem depends on the solutions to smaller ones, a property known as optimal substructure [1]. Classic examples include the knapsack problem, matrix chain multiplication, and longest common subsequence.
Non-serial polyadic dynamic programming (NPDP) is a more general and abstract class of dynamic programming. Unlike standard dynamic programming, which typically combines two subproblems in a fixed, serial order, non-serial polyadic DP allows the combination of multiple subproblems in flexible and non-sequential ways. The term “polyadic” refers to combining more than two subproblems at once, while “non-serial” indicates that there is no strict left-to-right or sequential dependency between subproblems [2,3]. This framework includes a wide range of complex problems, such as optimal triangulation of polygons, matrix chain multiplication, optimal binary search trees, and often relies on specialized optimization techniques like Knuth’s optimization to achieve better performance [4–6].
The algorithms mentioned above share the same structural pattern as many bioinformatics approaches, particularly those based on the Nussinov RNA folding algorithm [7–10]. In these methods, for a given pair of indices i, j, the solution involves examining possible partitions within that interval by summing the results of subproblems across the row and column of the dynamic programming table, continually scanning all valid splits. In the case of Nussinov, this same dependency grid appears in the prediction of the secondary structure of the RNA, where the algorithm computes the best possible base pairings by seeking solutions to smaller subproblems. This dependency structure is also preserved in more advanced Nussinov-style algorithms in bioinformatics, such as the McCaskill algorithm to calculate the probability of base pairs [7] and MEA (Maximum Expected Accuracy) [11], and the Zuker algorithm for minimizing free energy [12], which all depend on traversing and combining DP entries in similar polyadic non-serial patterns (see S1 Table).
A typical parallelization strategy for such problems involves the execution of independent tasks along the diagonals of the dynamic programming table. This ordering respects the data dependencies that arise from reading values across rows and columns. Such diagonal processing enables concurrency without violating correctness. In compiler theory, this transformation is known as loop skewing, a fundamental loop transformation technique that restructures iteration space to expose parallelism. The DP table constructed in these algorithms is usually triangular in shape, reflecting the fact that solutions are only defined for entries where i < j, which is common in problems that involve pairing, chaining, or segmentation over sequences. One of the early works by Rizk and Lavinière [13] proposed an algorithm for processing along diagonals for both CPUs and GPUs. Achieving high-performance NPDP code that effectively exploits data locality and vectorization remains a challenging task [14–16].
Our goal is to develop a high-performance implementation of the Nussinov RNA folding algorithm, applicable to similar dynamic programming problems. The proposed solution leverages multicore CPUs and vectorization to achieve performance exceeding both automatic compiler optimizations and some manual implementations.
The paper is organized as follows. The Introduction presents the Nussinov NPDP algorithm with attention to data locality, briefly reviews polyhedral tools and manual methods, and outlines existing limitations and motivations. The next section introduces our approach based on dependency reduction, classification of problem instances, and automated code generation. The experimental section evaluates runtime efficiency on three multicore systems and compares results with related works. Finally, the conclusion summarizes findings and discusses future research directions.
Nussinov’s algorithm
Nussinov proposed a dynamic programming algorithm for RNA folding in 1978 [8], which maximizes the number of non-crossing matchings between complementary bases of an RNA sequence of length N. Let be an RNA sequence, where
is a nucleotide. Nussinov’s dynamic programming recurrence for an N × N matrix S is given below.
1 for (int i = N – 1; i >= 0; i--) {
2 for (int j = i + 1; j < N; j++) {
3 for (int k = i; k < j; k++) {
4 S[i][j] = max(S[i][k] + S[k + 1][j], S[i][j]); // S1
5 }
6 S[i][j] = max(S[i][j], S[i + 1][j-1] + paired(seqq[i], seqq[j])); // S2
7 }
8 }
Here, S(i, j) defines the maximum number of base-pair matches of over the region 1 ≤ i ≤ N, and
is a function that returns 1 if (xi, xj) is an AU, GC, or GU pair, and 0 otherwise.
Listing 1 represents the triply nested affine loops with two statements accessing the two-dimensional array S implementing Nussinov’s algorithm.
Listing 1. The Nussinov algorithm code.
The iteration space (i, j) of the program loop forms the upper right triangle of Nussinov’s dynamic programming array. The loop indices are defined over the domain (i, j, k):
Locality improvement
The Nussinov algorithm and the family of related algorithms have a time complexity of O(n3) and a space complexity of O(n2). This means that during computation, the same cells in the cost (or pairing) table are visited multiple times, both row-wise and column-wise. To make better use of the cache, it is more efficient to use data blocking (also known as tiling), since the same rows and columns are repeatedly accessed [17]. By computing an entire block of target table entries at once rather than a single cell at a time, memory locality and cache utilization are improved, regardless of whether the computation is performed on a CPU or a GPU.
Manual loop tiling
Rizk et al. [13] employed a piecewise-linear scheduling approach, as used in our work, to generate efficient GPU implementations for RNA folding. In [17], they extended this work to accommodate GPU blocks scanned diagonally. However, they do not provide a method for deriving this schedule from the program’s dependence structure [18].
Mullapudi and Bondhugula [2] have investigated automatic techniques for tiling codes that extend beyond the scope of standard methods. Their approach leverages dynamic tile scheduling instead of generating a static schedule, and it is applicable to Nussinov’s algorithm. At present, we lack a precise characterization of the specific domains where each technique proves most effective [18].
Li, Ranka, and Sahni [19] provided an efficient GPU algorithm that employs a matrix multiplication approach using a CUDA strategy for Nussinov’s RNA folding. In contrast, their CPU implementation is limited to a transpose-based technique that utilizes a lower-triangular array or row scanning. However, as demonstrated in [16], this strategy proves inefficient for larger RNA strands. Furthermore, we also did not have access to the original GPU implementation of this achievement.
Zhao and Sahni developed three cache-efficient algorithms without increasing the memory requirement, ByRow, ByRowSegment, and ByBox for Nussinov’s RNA folding [20]. They showed that presented techniques based on a simple LRU cache model give better run time and energy performance than Li’s approach. Unfortunately, the code was limited to serial.
An efficient implementation has been demonstrated using the Four Russians (4R) strategy, which eliminates redundant computations by exploiting the fact that Nussinov’s array grows only until its last element [21,22]. 4R allows for a reduction in computational complexity by using a memoization strategy. However, as shown in [16], this work focuses only on the Nussinov algorithm, and no code for a parallel CPU version has been released. Moreover, if the NPDP algorithm produces an array in which elements increase and decrease, as in probabilistic RNA folding [7] - Venkatachalam’s precomputation cannot be applied directly. In 2023, Tchendji et al. [23] refined the technique by integrating a loop tiling strategy [9] with the sparsified Four Russians approach [21], implementing it using a portable OpenCL solution for both CPUs and GPUs for Nussinov’s code. Nonetheless, subsequent studies [16] indicate that this solution does not match the efficiency of CUDA-based implementations such as those proposed by Venkatachalam [22] and Gruzewski [16].
Wonnacott [18] presented a strategy for partitioning the domain of Nussinov’s instructions into problematic and non-problematic tiles. The non-problematic tiles can be executed arbitrarily, with the overall code execution synchronized to the completion of the problematic blocks. The paper demonstrates how to derive such a schedule through dependency analysis. Regrettably, the authors did not provide a CPU-based parallel implementation.
Source-to-source polyhedral compilers
Simultaneously with the development of manually implemented blocked code for NPDP algorithms, automatic source-to-source polyhedral compilers emerged. These compilers primarily leverage approaches based on affine transformations, index set splitting, and the extraction of synchronization-free slices via Presburger arithmetic methods developed for code parallelization by Pugh’s team, in conjunction with the Omega Calculator library [24].
Standard operations on relations and sets are used, such as intersection (∩), union (∪), difference (-), composition (∘), domain (dom R), range (ran R), inverse (R−1), relation application (S = R(S): e
iff exists e s.t. e → e
R,e ∈ S). In detail, the description of these operations is presented in papers [25–27].
Based on this mathematical apparatus, the polyhedral model was developed. This framework is used to represent and optimize loop nests and array accesses in programs. In this model, iterations of loops are represented as points in a multidimensional integer space, and their execution domains are defined by polyhedra—geometric constructs characterized by linear inequalities. Data accesses and dependencies are expressed as affine functions of the loop indices, allowing the computation to be modeled in a precise, algebraic form.
This formulation enables a broad spectrum of optimizations, including loop tiling, fusion, parallelization, and reordering, by conceptualizing these operations as transformations on the underlying polyhedra. Because these transformations must adhere to data dependency constraints, the polyhedral model facilitates the application of integer linear programming and Presburger arithmetic techniques to ensure that any transformed schedule remains both correct and optimal. Ultimately, many modern automatic source-to-source compilers rely on the polyhedral model, leveraging its structured and rigorous representation of a program’s computational domain to generate highly efficient and parallelized code.
The Omega project was rewritten in the ISL framework by Sven Verdoolaege, incorporating a PET analyzer and a code generator, as well as standalone transformation engines [27]. Subsequent versions of the Pluto [28], Traco [9], and Dapt [29] compilers relied on this library, differing mainly in the specific loop transformation algorithms they employed.
The Pluto optimizer, when using the Affine Transformation Framework (ATF), encountered difficulties in tiling all loops in the Nussinov algorithm; in particular, it was unable to tile the innermost loop in s1, which is the most performance-critical. It also failed to parallelize the tiled code of the McCaskill algorithm for partition functions computation [10] because dependencies between statements in the strongly connected component (SCC – a maximal mutually reachable statements in dependence graph) prevent permutability in the parallel direction. In simpler terms, within an SCC, the statements are mutually dependent, forming a cycle of dependencies.
It was only with the Traco compiler, which applied tile corrections via transitive closure of the dependence graphs, that the tiling of all Nussinov loops was achieved. Later, the approach to NPDP algorithms was further refined through space-time tiling, thereby avoiding the costly tile correction, and this method was ultimately implemented in the Dapt tool.
Limitations of existing Nussinov code optimization methods and objectives of this work
Despite numerous studies on both manual and automatic optimization of the Nussinov algorithm, several limitations remain. Existing methods lack a universal approach that can be easily applied to similar algorithms. They also do not provide efficient ways to exploit vectorization on modern CPU architectures, nor do they offer high-performance solutions that work across both GPU and CPU platforms. While efficient 4R [22,30] methods exist, their implementations are complex and largely restricted to the Nussinov algorithm on GPUs. Wonnacott’s [18] and Zhao’s [20] approaches produce only serial code, while Li’s implementation for CPU Transpose can be easily adapted to other problems but is not very effective. Polyhedral compilers [9,28,29] and parallel tiled sparsified 4R OpenCL implementations [14] have debatable performance. Hence, this motivated us to propose a solution meeting the following criteria:
- develop a simple and portable codebase that runs efficiently across different multi-core architectures,
- leverage both parallelism and vectorization to improve performance,
- ensure the approach is easily adaptable to similar NPDP computations, providing a flexible solution with short parallel code.
Finally, we will present an OpenMP [31] implementation that can be easily translated to other multi-threading libraries, demonstrating the portability and adaptability of our approach.
Computational strategy
We base our approach on key observations from manual methods. From Wonnacott’s work, we know that focusing on non-problematic instances that are not connected to the updated tile is beneficial. From Li’s work [19], we observed that such instructions (S1) can be organized into an efficient schedule similar to matrix multiplication, which enables execution on ubiquitous parallel platforms.
Our experience with polyhedral compilers [9,16] allows us to use ISL tools for semi-automatic code generation and to automate parts of the process, including dependency management. Ultimately, our goal is to produce high-performance code that enables vectorization, as in classical approaches, so that the performance is comparable to the results achieved in Frid’s and Tchendji’s work [21,23] as well as polyhedral approaches [9,28].
Dependence pattern.
As we mentioned, using the ISL library, we can represent dependencies through relations, just like in Traco and Dapt. Additional transitive paths can be derived from these relations.
The positive transitive closure for a given relation R, R+, is defined as follows [25]:
It describes which vertices in a dependence graph (represented by relation R) are connected directly or transitively with vertex e.
Transitive reduction applied in Presburger arithmetic is an approach to simplify relationships (expressed as logical constraints) while preserving their fundamental properties, such as reachability. Kelly et al. described this approach called Simple Redundant Synchronization Removal in paper [32] as
The composition of the relation R+ and R, or equivalently the application R+(R) [27], gives all paths between two nodes in the graph whose length is greater than or equal to 2. To compute the exact transitive reduction, we first need to compute the exact transitive closure of the union of all dependence graphs. It is worth noting that for Nussinov’s loop dependence relations, ISL [27] only approximates the transitive closure. However, we can use the iterative algorithm presented by Bielecki and Klimek in [33], which is capable of computing transitive relations without approximation for Nussinov’s loop dependences [9]. The tool was implemented within the TRACO compiler. We have placed the exact form of R+ for Nussinov’s loop nest dependencies in S2 Text.
Since some dependencies imply others, transitive reduction is applied to remove redundancies [24]. This leaves only a union of four dependence relations, reduced from seven, which simplifies the subsequent dependence analysis presented in the paper.
Among the presented relations, only one, R1, pertains to dependencies of instruction S1 to S1, and it represents the combined tile blocks in all three dimensions. The remaining dependencies represent the relation of instruction within the same iteration i, j, and the dependency
along the row i and column j. This ensures that loop skewing naturally handles these dependencies by definition. Therefore, within the scope of our interest, only the R1 relation remains.
The relation R1 forms a chain of consecutive instructions , creating a sequence of S1 instructions, which is the only instruction triply nested in the Nussinov loop nest. Sacrificing the third dimension in favor of two-dimensional blocks enables vectorization and facilitates the separation of non-problematic blocks in the subsequent code.
Problematic and non-problematic tiles.
We can now imagine a square block of instructions computed simultaneously for fixed (i, j), with dimensions bb × bb, where the values are stored in individual elements of the block and later reused. The key question becomes: which instructions can be executed in any order, and which must not be reordered, as their execution is determined by dependencies within a specific block?
Each block is described by coordinates (II, JJ), and in order to compute it, we only use instructions that are not directly related to each other by dependence within that block.
Since our focus is on the S1 instruction, we must analyze its control flow. It performs two reads, from positions (i, k) and (k + 1, j), and writes to position (i, j). The write to (i, j) belongs to the currently computed tile; therefore, we are interested only in those instructions that do not directly access this tile.
Based on the analyses presented in the works [2,3,19], we illustrate in Fig 1 the distinction between problematic and non-problematic tiles in the iteration space of Nissunov’s loop nest. The non-problematic tiles, marked in green, are those for which the instructions S1 can be executed independently, i.e., in any order relative to other cells within the current block (highlighted in yellow as the current or update tile). Additionally, two example red cells are highlighted, for which the execution of S1 is required for the current step, as the read references i,k and k + 1,j come from the same update tile. These dependencies are associated with the instructions in the red triangular regions, which form the set of problematic tiles and constitute the potential sources of conflicts related to the execution order of instructions.
In our approach, problematic tiles refer to computations of pairs that interact with elements of the currently processed block. Naturally, this block must still be compared with the final result of the non-problematic tiles; however, their execution can be performed in an arbitrary order. It is worth noting that both problematic and non-problematic tiles contain intra-tile dependencies, which are indicated by red and green arrows, respectively in Fig 1. Problematic tiles additionally contain inner-tile dependencies, whereas non-problematic tiles rely only on values that have already been computed.
If we formulate the dependency DEP := , it becomes clear that either the source or the destination of this relation may belong to the currently computed tile. By analyzing Fig 1, we can observe that these cases correspond only to instructions within the triangular blocks. Therefore, all other blocks will not directly access the memory cells written by the tile (II, JJ).
Wonnacott, in his paper [18], demonstrated that for a given tile (II, JJ), a subset of instructions can be safely parallelized and executed in any order. He referred to these as “almost tileable” or simply non-problematic instructions. These correspond to the second group of instructions in our analysis—those that do not directly depend on or affect the memory cells of the current tile. Consequently, the remaining instructions, which involve dependencies that may include the current tile’s memory region, are referred to as problematic. First, let us define TILE(II,JJ)
Wonnacott identified all “problematic” iterations — those that read from the tile being updated — by taking the intersection of each dependence relation with a relation whose source and sink iterations share the same values of all but the innermost loop [18].
Hence, we define the set of problematic instructions as follows:
that is, all instructions which either depend on the current tile or are depended on by it. It corresponds to relation R1 as a chain of S1 statements, but only those whose source or sink belongs to the current (updated) tile.
Consequently, the set of non-problematic instructions ZN for the current tile is the domain of instruction S1 excluding Z, i.e.,
Within the set of non-problematic instructions, denoted ZN, we can further partition the instructions based on their tile coordinates (II, JJ, KK), where II < KK < JJ, following the domain of the triangular matrix used in the Nussinov algorithm.
Although the DEP relation represents the Read after Read (RAR) dependence, it is possible to determine a path based on the analysis of the R dependency from the Nussinov code. The cells i,k and k + 1,j, although they can be determined independently, are needed at the same time to determine i,j. If there are read-write dependencies and
, we can determine the connecting path through the union of the transitive closure of the relation and its inverse,
. In other words, the path can be found in the undirected graph of dependencies.
A special case of problematic instructions outside the set Z also includes the set of S2 instructions, which simply remain within the 2D blocking domain. Additionally, we must consider the triangular blocks along the first diagonal, since for these the set ZN does not exist (they are defined by constraint ). It is worth noting that for the second diagonal as well, there will be no tiles satisfying the condition ii < kk < jj; only starting from the third diagonal does the set ZN become non-empty.
At the end of this subsection, we should clearly define problematic and non-problematic tiles. A tile is considered problematic if it contains any problematic instructions from Z. Therefore, a non-problematic tile consists only of non-problematic instructions, in this case, only S1.
Result code.
Upon reanalyzing the condition i < k < j, or equivalently 0 ≤ k < j − i, we need to identify which S1 instructions belong to the set of problematic ones. It can be observed that these correspond to two specific regions:
- Instructions located within the triangular block of the current tile, i.e., those for which
,
- And the final part of the current tile where
.
These are the instructions whose dependencies either reach into or are affected by the memory region of the tile being currently computed.
A loop is considered parallel if it does not carry any dependence; that is, for all dependences , the corresponding dependence vector in the scheduled space has a zero component along that loop dimension. In practice, this information is extracted from the affine schedules generated by Pluto [28], although other polyhedral compilers can also be used.
In accordance with the work [16], we formulate the set of tiles (ii, jj) scanned along the diagonals, and generate the code, for example, using the Barvinok tool [34], which accepts interactive commands in ISL.
To generate the code, we use the codegen function from the Barvinok tool, which scans 2D block identifiers and the instructions inside each block. It is presented on Listing 2. The c0 loop sequentially iterates along diagonal lines, and within each block, computations are performed independently according to loop skewing.
Listing 2. The result code for the Nussinov’s RNA folding computation.
1
2 for (int c0 = 0; c0 <= (N – 1) / bb; c0 += 1) // serial loop
3 #pragma omp parallel for
4 for (int c1 = c0; c1 <= N / bb; c1 += 1) {
5 short C[bb][bb] = {0};
6 short _ii = c1 - c0;
7 short _jj = c1;
8
9 // non-problematic tiles
10 for (short kk = _ii + 1; kk < _jj; kk++)
11 for (int row = 0; row < bb; row++)
12 #pragma omp simd
13 for (int col = 0; col < bb; col++)
14 for (int k = 0; k < bb; k++)
15 s1(bb * _ii + row, bb * _jj + col, bb*kk + k – 1, &C[row][col]);
16
17 // square tiles with problematic s1 and s2
18 if (_jj >= _ii + 1) {// c0 >= 1
19 for (int c2 = bb * _ii + bb – 1; c2 >= bb * _ii; c2––)
20 for (int c3 = bb * _jj; c3 <= min(N – 1, bb * _jj + bb – 1); c3++) {
21 for (int c4 = c2; c4 < c2 + bb – 1; c4++)
22 s1(c2, c3, c4); // triangle boundary block
23
24 S[c2][c3] = max(C[c2 - bb * _ii][c3 - bb * _jj], S[c2][c3]);
25
26 for (int c4 = bb * _jj – 1; c4 < c3; c4++)
27 s1(c2, c3, c4); // current block
28
29 s2(c2, c3);
30 }
31 }
32 // triangular first diagonal blocks
33 else if (_jj == _ii) {// c0 = 0
34 for (int c2 = min(N – 2, bb * _jj + bb – 2); c2 >= bb * _jj; c2––)
35 for (int c3 = c2 + 1; c3 <= min(N – 1, bb * _jj + bb – 1); c3++) {
36 for (int c4 = c2; c4 < c3; c4++)
37 s1(c2, c3, c4);
38 s2(c2, c3);
39 }
40 }
41 }
Listing 3. Statements code.
1 inline void s1(int i, int j, int k) {
2 S[i][j] = max(S[i][k] + S[k + 1][j], S[i][j]);
3 }
4
5 inline void s1(int i, int j, int k, short *s) {
6 *s = max(S[i][k] + S[k + 1][j], *s);
7 }
8
9 inline void s2(int i, int j) {
10 S[i][j] = max(S[i][j], S[i + 1][j-1] + paired(seqq[i], seqq[j]));
11 }
The generated code is divided into two parts:
- For c0 = 0, blocks are computed separately because the loop bounds form a triangular shape.
- for c0 ≥ 1, the code generator computes square blocks sequentially. Originally, the code generator computed a loop for all k values. In this case, using the sets Zn, we extract the kk loops and instructions from this nest, compute blocks separately, and create a maximum value array C for each thread.
To enable tiling of the k loop, a thread-local array C is used to compute the max reduction. This design eliminates the intra-tile reduction dependence (R1), allowing the tiling transformation to be applied safely and efficiently.
Some parts of the code still require adjustments, but only on the s1 statement. From the loop in line 19, we need to recompute the c4 loop (corresponding to k) up to the first triangular block, compare it with the value in the C array, and add the elements from the current block. This modification is applied manually, but knowledge of the sets Zn and Z is helpful here. For example, for TILE(1,5) from Fig 1, while
. So, in general, TILE(II,JJ) is combined with triangular tiles (II,II) and (JJ,JJ), called problematic. While Zn for TILE(II,JJ) are pairs of tiles (II, KK) and (KK, JJ), where II < KK < JJ, which represents lines 10–15 in Listing 2.
To avoid a nonlinear expression, we introduce bb as a constant. Next, we exclude the set of non-problematic instructions from the set and proceed to compute the tiles (ii, jj, kk), previously saving the current results in a local array C in the block in cache (with each thread having its own copy). These values must be taken into account when processing the problematic instructions.
To simplify the final code, we provided inline functions representing S1 and S2 statements (Listing 3). The code defines three inlined helper functions used in the Nussinov RNA folding algorithm. Function s1(i, j, k) implements the dynamic programming recurrence relation: it updates S[i][j] with the maximum between its current value and the sum of S[i][k] and S[k + 1][j]. The overloaded version s1(i, j, k, short *s) performs the same computation but stores the result in a temporary variable pointed to by s, which is used to store values in the local C array. The function s2(i, j) handles the special case where positions i and j are paired: it updates S[i][j] based on whether the nucleotides at those positions can form a valid base pair, using the helper paired() function and the current values in the DP matrix.
This code presented on Listing 2 implements a parallel, tiled version of the Nussinov RNA folding algorithm using OpenMP and SIMD. It operates on a blocked dynamic programming matrix with tile size bb, processing diagonals in order to respect data dependencies. The outer loop (c0) runs serially over diagonals, while the inner loop (c1) is parallelized. Each thread computes a tile (ii, jj) of the matrix using a private local array C, ensuring no data races—each thread has its own C. Non-problematic tiles are processed with standard recurrence using s1(), leveraging SIMD for vectorization. Square and triangular tiles (on or near the diagonal) are handled separately due to irregular index ranges, combining s1() and s2() calls. The results are written back to the global matrix S, using max() to update optimal base pairings. To enable loop parallelism and vectorization, we used the #pragma omp parallel for and #pragma omp simd directives from OpenMP [31], respectively.
It is worth noting that the loops iterating over row, column, and k can be freely interchanged and vectorized, regardless of whether they are outer or innermost loops.
We observed that the proposed computational strategy can be applied to similar NPDP problems. In S1 Text, we present a brief analysis of a closely related OBST algorithm based on Knuth’s method.
Experimental study
Experiments were conducted on modern machines, which are equipped with massively parallel units 2 × Intel Xeon Gold 6326, Intel CPU Max 9462, AMD Epyc 9654, and partially with 2 × Intel Platinum 8358.
The first machine includes two Xeon Gold 6326 processors, providing a total of 32 physical cores and 64 threads. Each processor operates at a base frequency of 2.9 GHz, with a boost clock of up to 3.5 GHz. The system is based on the Ice Lake architecture (x86_64), supporting both 32-bit and 64-bit modes, with a 46-bit physical address space.
The memory hierarchy includes 1.5 MB of L1 cache distributed across 32 instances, 40 MB of L2 cache split into 32 instances, and 48 MB of L3 cache divided into two instances. The system is equipped with 131 GB of RAM and runs on Ubuntu 22.04.4 LTS.
For compilation, the Intel ICC compiler 2025.3.0 was used to ensure optimized performance for the experiments with the flags -O3, -qopenmp, and -mavx512 for vectorization. In our comparisons, we did not include the implementations [2,18,20], as they are either unavailable or do not provide parallel versions.
Each experiment was repeated five times, and the reported results correspond to the arithmetic mean. The observed standard deviation is low, and for the largest problem sizes, the differences between runs remain below 5 seconds. Additionally, the relative variability, defined as , is within a few percent for longer execution times. As expected, both the standard deviation and the relative variability increase for shorter execution times, which can be attributed to the limited resolution and inherent noise of timing measurements. The complete statistical analysis is provided in the Data Availability section.
Table 1 presents the results for polyhedral compilers, the Transpose technique [19], and the parallel and sparsified Four Russians (PS4R) with tiling (PTS4R) method [23]. The polyhedral compilers were provided with Listing 1 as input. The Pluto compiler does not tile the innermost loop k, as confirmed for releases 0.11.4, 0.12.8, and 0.13, which results in the poorest performance among the compared techniques (flags ––parallel ––tile were used). Traco applies transitive closure on the dependence graph and can tile all loop nests; however, the use of irregular tiles reduces the achieved speedup. The best-performing automatic compiler is Dapt, whose time-space tiled code achieves strong performance on the Nussinov algorithm. It is worth noting that the OneAPI compiler is unable to vectorize the code generated by Pluto, Traco, and Dapt, and it applies only loop unrolling.
The Transpose technique operates on the lower triangle of the matrix to enable row-wise access. However, the requirement for a double-sized array negatively impacts data locality, which limits its effectiveness—it only outperforms Pluto. The newly introduced PTS4R technique, which builds upon Traco’s loop tiling approach, delivers competitive performance. Nevertheless, untiled parallel sparsification PS4R yields better results overall.
In contrast, our implementation significantly outperforms all of the above methods. It is simple and concise, avoiding the complex loop-bound computations required by polyhedral tools. The code achieves high cache efficiency, and parallelization of the inner loops enables effective vectorization of regular instruction patterns. As a result, it delivers consistent performance improvements, especially at larger problem sizes. We empirically tested various block sizes. In most cases, a block size of 32 provided the best performance, both in our implementation and in those using polyhedral compilers. For the PTS4R method, however, the default block size of 128 proved to be more effective.
We repeated the experiments with a second machine equipped with an Intel Xeon CPU Max 9462 (the Sapphire Rapids HBM architecture), featuring 64 hardware threads, up to 3.5 GHz turbo, a 150 MB L3 cache, a 128 MB L2 cache, a 3 MB L1d cache and a 2 MB L1i cache, as well as 128 GB of on‑package HBM2e and 631 GB of DRAM. The code was compiled with the Intel oneAPI C++ Compiler 2025.0.4 using the flags -O3, -qopt-report = max, -qopt-report-phase = vec, -march = native, -mavx512, and -qopenmp. The resulting .optrpt report indicates that vectorization was successfully enabled for the loop responsible for scanning columns in the non-problematic instance stream, with a vector length of 32 (see S3 Text).
The second machine is equipped with larger local memory, which is reflected in the results. The Dapt compiler starts to outperform both Pluto with 2D tiling and the Transpose technique only beyond input sizes of 10,000 nucleotides. However, techniques based on sparsified Four Russians continue to dominate, with the untiled parallel version showing a clear advantage (Table 2). On the other hand, our implementation demonstrates more stable performance across all input sizes and significantly outperforms the related methods mentioned above.
We extended our research by including a machine, this time from AMD — the AMD Epyc 9654 96-Core Processor, featuring 192 threads, 384MB of shared L3 cache, 1MB of L2 cache per core, and 64KB of L1 cache per core. The processor boosts up to 3.7 GHz, is equipped with 1.5 TiB of RAM, and runs AlmaLinux 8.7. The codes were compiled in the same manner as on the CPU Max machine, using the OneAPI compiler with version 2024.0.2.
Once again, our execution times turned out to be the shortest (Table 3). On AMD architectures, 2D tiling of Pluto and Transpose outperforms 3D tiling because it makes better use of the cache hierarchy and reduces synchronization overhead. 2D tiles create smaller, more cache-friendly blocks of data, while 3D irregular tiles can exceed cache capacity, causing more memory accesses. Additionally, 2D loops tend to generate more regular memory access patterns, which improves SIMD utilization and prefetching, resulting in higher overall performance. Pluto stood out as the fastest among the tested polyhedral compilers. The Transpose implementation by Li’s team [19] performed well, despite using twice the memory, and lagged behind the parallel PS4R method for RNA strands over 15,000 nucleotides. Fig 2 shows that our approach outperforms the best related methods, PS4R and PTS4R, achieving consistent speed-ups ranging from 5× to over 30× across input sizes on various CPU architectures.
Therefore, it can be concluded that the presented implementation is the fastest among CPU-based available solutions for Nussinov’s code, which also suggests potential for strong performance on other OpenMP-based platforms such as ARM. It falls short when compared to the corresponding blocked CUDA version, which was analyzed in [16]. The GPU implementation was optimized to perform loop skewing by launching kernels on diagonal blocks, enabling 2D parallel execution of loops iterating over the rows and columns of the matrices, and storing the output matrix C in shared memory.
In Table 4, we present a comparison of the proposed CPU implementation on Xeon CPU Max 9462 (64 threads), 2 × Platinum 8358 (2 × 64 = 128 threads), and AMD Epyc 9654 (192 threads) with CUDA implementations from the paper [16] on selected NVIDIA cards A100, V100, and RTX 4060. It is clear that, despite the inherently less parallel architecture of the CPU, the use of vectorization and data locality enables the CPU-based code to achieve performance results that are relatively close to those of the more parallelized GPU implementation, especially with the AMD Epyc machine.
In Table 5, we compare the parallel tiled sparsified 4R (PTS4R) OpenCL implementation executed on NVIDIA A100 and RTX 4060 GPUs with our proposed OpenMP implementation on Xeon Gold, CPU Max, and Epyc 9654 processors. The non-tiled PS4R version was not implemented by Tchendji et al. on GPU [14]. The execution time results indicate that the proposed approach consistently outperforms the related method.
Conclusion
In this paper, we presented an optimized implementation of the Nussinov algorithm, characterized by high efficiency, locality, and vectorization. The solution is compact, concise, and tiled across all dimensions, providing a robust and scalable approach to RNA secondary structure prediction. We compared the execution times of our implementation on massively parallel CPU machines from Intel and AMD against both manually optimized solutions and those obtained from automatic compilers based on the polyhedral model. Our solution achieves speedups of at least 5× and up to over 30 × compared to related CPU-dedicated implementations of the Four-Russians method (PS4R and PTS4R) in OpenCL for the Nussinov algorithm that do not utilize Advanced Vector Extensions (AVX). Notably, even on CPU architectures, our implementation achieves speedups of 15× or more compared to the PTS4R GPU implementation.
The proposed approach is relevant to the class of NP-hard problems, and similar dependency structures and memory access patterns can be found in other algorithms within computer science and bioinformatics. Additionally, we demonstrated the dynamic nature of the algorithm, which presents ongoing challenges for optimization. Our work contributes to the field of semi-automatic transformations, as we leveraged ISL-based computations alongside insights from manual solutions available in the literature. In summary, dynamic programming tasks in bioinformatics executed on CPUs must be heavily vectorized to achieve results within a reasonable time.
Future research will aim to extend these methods to more advanced algorithms from the NPDP Benchmark Suite [10] and be dedicated to RNA folding, to further enhance computational efficiency in the bioinformatics domain. Additionally, we plan to investigate energy efficiency and quantify potential energy savings achieved by the proposed approach.
Supporting information
S1 Table. NPDP benchmarks similar to Nussinov’s code.
https://doi.org/10.1371/journal.pone.0349146.s001
(PDF)
S1 Text. Short analysis of the NPDP Knuth algorithm (OBST).
https://doi.org/10.1371/journal.pone.0349146.s002
(PDF)
S2 Text. Exact transitive closure of the dependence graph for the Nussinov loop nest in ISL format.
https://doi.org/10.1371/journal.pone.0349146.s003
(PDF)
S3 Text. Fragment of the Intel oneAPI C+ Compiler report showing loop vectorization in a non-problematic domain.
https://doi.org/10.1371/journal.pone.0349146.s004
(PDF)
References
- 1.
Liu L, Wang M, Jiang J, Li R, Yang G. Efficient Nonserial Polyadic Dynamic Programming on the Cell Processor. In: IPDPS Workshops. Anchorage, Alaska: IEEE; 2011. pp. 460–71.
- 2.
Mullapudi RT, Bondhugula U. Tiling for Dynamic Scheduling. In: Rajopadhye S, Verdoolaege S, editors. Proceedings of the 4th International Workshop on Polyhedral Compilation Techniques. Vienna, Austria; 2014.
- 3.
Wonnacott DG, Strout MM. On the Scalability of Loop Tiling Techniques. In: Proceedings of the 3rd International Workshop on Polyhedral Compilation Techniques (IMPACT). 2013.
- 4. Bielecki W, Blaszynski P, Poliwoda M. 3D parallel tiled code implementing a modified Knuth’s optimal binary search tree algorithm. J Comput Sci. 2021;48:101246.
- 5.
Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. 3rd ed. The MIT Press; 2009.
- 6. Knuth DE. Optimum binary search trees. Acta Informatica. 1972;1(3):270–270.
- 7. McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990;29(6–7):1105–19. pmid:1695107
- 8. Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for loop matchings. SIAM J Appl Math. 1978;35(1):68–82.
- 9. Palkowski M, Bielecki W. Parallel tiled Nussinov RNA folding loop nest generated using both dependence graph transitive closure and loop skewing. BMC Bioinformatics. 2017;18(1):290. pmid:28578661
- 10. Palkowski M, Bielecki W. NPDP benchmark suite for the evaluation of the effectiveness of automatic optimizing compilers. Parallel Comput. 2023;116:103016.
- 11.
Freiburg Bioinformatics Group. Freiburg RNA Tools, Teaching RNA algorithms; 2022. https://rna.informatik.uni-freiburg.de/teaching
- 12. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9(1):133–48. pmid:6163133
- 13.
Rizk G, Lavenier D. GPU Accelerated RNA folding algorithm. Springer Berlin Heidelberg; 2009. pp. 1004–13.
- 14. Tchendji VK, Youmbi FIK, Djamegni CT, Zeutouo JL. A parallel tiled and sparsified four-Russians algorithm for Nussinov’s RNA folding. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(3):1795–806. pmid:36279354
- 15. Thangamani A, Loechner V, Genaud S. A survey of general-purpose polyhedral compilers. ACM Trans Archit Code Optim. 2024;21(4):1–26.
- 16. Gruzewski M, Palkowski M. Cross-platform and polyhedral programming for Nussinov RNA folding. Future Generation Comput Syst. 2025;169:107786.
- 17.
Rizk G, Lavenier D, Rajopadhye S. GPU Accelerated RNA Folding Algorithm. Elsevier; 2011. pp. 199–210.
- 18.
Wonnacott D, Jin T, Lake A. Automatic tiling of "mostly-tileable" loop nests. In: IMPACT 2015: 5th International Workshop on Polyhedral Compilation Techniques. Amsterdam, The Netherlands; 2015.
- 19. Li J, Ranka S, Sahni S. Multicore and GPU algorithms for Nussinov RNA folding. BMC Bioinformatics. 2014;15 Suppl 8(Suppl 8):S1. pmid:25082539
- 20. Zhao C, Sahni S. Cache and energy efficient algorithms for Nussinov’s RNA Folding. BMC Bioinformatics. 2017;18(Suppl 15):518. pmid:29244013
- 21. Frid Y, Gusfield D. An improved Four-Russians method and sparsified Four-Russians algorithm for RNA folding. Algorithms Mol Biol. 2016;11:22. pmid:27499801
- 22. Venkatachalam B, Gusfield D, Frid Y. Faster algorithms for RNA-folding using the Four-Russians method. Algorithms Mol Biol. 2014;9(1):5. pmid:24602450
- 23. Tchendji VK, Youmbi FIK, Djamegni CT, Zeutouo JL. A parallel tiled and sparsified four-Russians algorithm for Nussinov’s RNA folding. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(3):1795–806. pmid:36279354
- 24.
Kelly W, Maslov V, Pugh W, Rosser E, Shpeisman T, Wonnacott D. The Omega Project. http://www.cs.umd.edu/projects/omega/release-1.0.html
- 25.
Kelly W, Maslov V, Pugh W, Rosser E, Shpeisman T, Wonnacott D. The Omega Library interface guide. College Park, MD, USA. 1995.
- 26.
Pugh W, Wonnacott D. An exact method for analysis of value-based array data dependences. In: Sixth Annual Workshop on Programming Languages and Compilers for Parallel Computing. Springer-Verlag; 1993.
- 27.
Verdoolaege S. Integer Set Library - Manual; 2011. Available from: www.kotnet.org/skimo//isl/manual.pdf
- 28. Bondhugula U. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not. 2008;43(6):101–13.
- 29.
Bielecki W, Poliwoda M. Automatic parallel tiled code generation based on dependence approximation. In: Malyshkin V, editor. Parallel computing technologies. Cham: Springer International Publishing; 2021. pp. 260–75.
- 30. Frid Y, Gusfield D. A simple, practical and complete O(n3/log n)-time algorithm for RNA folding using the Four-Russians speedup. Algorithms Mol Biol. 2010;5:13. pmid:20047670
- 31.
OpenMP Architecture Review Board. OpenMP Application Program Interface Version 5.2. 2021. https://www.openmp.org/specifications
- 32. Kelly W, Pugh W, Rosser E, Shpeisman T. Transitive closure of infinite graphs and its applications. Int J Parallel Prog. 1996;24(6):579–98.
- 33.
Wlodzimierz B, Tomasz K, Marek P, Beletska A. An Iterative Algorithm of Computing the Transitive Closure of a Union of Parameterized Affine Integer Tuple Relations. In: COCOA 2010: Fourth International Conference on Combinatorial Optimization and Applications, Lecture Notes in Computer Science. Springer Berlin Heidelberg. 2010. pp. 104–13. https://doi.org/10.1007/978-3-642-17458-2_10
- 34.
Verdoolaege S. Barvinok: User Guide. Version: barvinok-0.36; 2012. Available from: http://garage.kotnet.org/skimo/barvinok/barvinok.pdf