Figure 1.
Schematization of CUDA architecture.
Schematic representation of CUDA threads and memory hierarchy. Left side. Thread organization: a single kernel is launched from the host (the CPU) and is executed in multiple threads on the device (the GPU); threads can be organized in three-dimensional structures named blocks which can, in turn, be organized in three-dimensional grids. The dimensions of blocks and grids are explicitly defined by the programmer. Right side. Memory hierarchy: threads can access data from several memories with different scopes; registers and local memory are private to each thread. Shared memory lets threads belonging to the same block communicate, and has low access latency. All threads can access the global memory, which suffers from high latency but has been cached since the introduction of the Fermi architecture. Texture and constant memories can be read from any thread and feature a cache as well. Figures are taken from Nvidia's CUDA programming guide [58].
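The block and grid organization described above can be illustrated with a minimal Python sketch of how a unique linear index is commonly derived from three-dimensional block and thread coordinates. The function name and the tuple layout are assumptions made for illustration; in CUDA the same quantities are exposed through the built-in variables blockIdx, threadIdx, gridDim and blockDim.

```python
def global_thread_id(block_idx, thread_idx, grid_dim, block_dim):
    """Linearize 3D block/thread coordinates into one global thread index.

    All arguments are (x, y, z) tuples, mirroring CUDA's blockIdx, threadIdx,
    gridDim and blockDim. The function name itself is hypothetical.
    """
    bx, by, bz = block_idx
    tx, ty, tz = thread_idx
    gx, gy, _ = grid_dim
    dx, dy, dz = block_dim
    # Position of the block inside the grid (x varies fastest).
    block_id = bx + by * gx + bz * gx * gy
    # Position of the thread inside its block (x varies fastest).
    local_id = tx + ty * dx + tz * dx * dy
    # Each block contributes dx * dy * dz threads.
    return block_id * (dx * dy * dz) + local_id
```

With a 2×2×1 grid of 4×2×1 blocks, for instance, the 32 threads receive the indices 0 to 31, each exactly once.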
Figure 2.
Simplified scheme of the cuTauLeaping workflow: in phase P1 each thread calculates the τ value for the simulation step; in phase P2, the threads whose τ is “large” perform a tau-leaping step (by executing a set of non-critical reactions and, possibly, one critical reaction); the remaining threads perform a fixed number of SSA steps (where one reaction is executed at each step) during phase P3. The phases are iterated until all threads have reached the end of the simulation, a termination criterion verified during phase P4.
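The control flow of the four phases can be sketched with a toy sequential driver in Python. This is not the cuTauLeaping implementation: the status codes, the threshold, the random stand-in for the step-size calculation and the fixed rate used in the SSA-like steps are all assumptions, chosen only to make the iteration of P1 to P4 concrete.

```python
import random

TERMINATED, SSA, TAU_LEAP = -1, 0, 1  # hypothetical status codes


def run_workflow(n_threads, t_end, threshold=0.05, ssa_steps=100, seed=1):
    """Toy driver: iterate phases P1-P4 until every 'thread' reaches t_end."""
    rng = random.Random(seed)
    t = [0.0] * n_threads
    status = [SSA] * n_threads
    iterations = 0
    while True:
        # P1: each live thread computes a putative step size
        # (a random number stands in for the real tau selection).
        tau = [rng.uniform(0.0, 0.2) for _ in range(n_threads)]
        for th in range(n_threads):
            if status[th] != TERMINATED:
                status[th] = TAU_LEAP if tau[th] >= threshold else SSA
        # P2: threads with a "large" tau perform one leap.
        for th in range(n_threads):
            if status[th] == TAU_LEAP:
                t[th] += tau[th]
        # P3: the remaining threads perform a fixed number of small steps.
        for th in range(n_threads):
            if status[th] == SSA:
                for _ in range(ssa_steps):
                    t[th] += rng.expovariate(1000.0)
        # P4: flag finished threads and check the termination criterion.
        for th in range(n_threads):
            if t[th] >= t_end:
                status[th] = TERMINATED
        iterations += 1
        if all(q == TERMINATED for q in status):
            return t, iterations
```

On the GPU the inner loops over threads disappear, since each phase is a kernel executed by all threads at once; only the outer loop remains on the host.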
Figure 3.
Pseudocode of cuTauLeaping – host side.
Host-side pseudocode of cuTauLeaping. As a first step, the stoichiometric information of the reactions is exploited to pre-calculate the data structures needed by the algorithm; all matrices are flattened during this process. Then, once the support memory areas are allocated (e.g., the chunk of global memory where the system dynamics will be stored), the four phases of cuTauLeaping begin and are repeated until all simulations are completed.
Figure 4.
Pseudocode of cuTauLeaping – kernel for phases P1 and P2.
Device-side pseudocode of the cuTauLeaping kernel that implements the subdivision of threads according to the τ value and the execution of a tau-leaping step. The kernel starts by loading two vectors, which correspond to the current state of the system and to the values of the stochastic constants, respectively, from the global memory areas that contain these data for all threads. Since this information is frequently accessed, it is immediately copied into the faster shared memory as vectors x and c, respectively. The kernel continues by verifying that the running thread has not been flagged with the signal of terminated execution. Then, it calculates the propensity functions of all reactions and accumulates their values; if the accumulated sum is zero, so that no reaction can fire, the remaining time instants at which the dynamics of the system is sampled are set to the current state and the simulation is terminated. The kernel concludes phase P1 by calculating a putative τ value for the tau-leaping step: if τ is smaller than a threshold, the thread is halted and its flag is set to 0, so that it will perform the SSA steps during the next phase. Otherwise, the tau-leaping algorithm is performed by executing a set of non-critical reactions and (possibly) one critical reaction and, if the simulation has overrun one of the sampling time instants, the stored state is determined by linear interpolation.
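The linear interpolation mentioned at the end of this caption can be illustrated as follows. This is a minimal Python sketch under the assumption that the state before and after the leap is available together with the corresponding times; the function and argument names are hypothetical, and the caption does not specify whether the interpolated amounts are rounded.

```python
def interpolate_state(x_prev, x_next, t_prev, t_next, t_sample):
    """Linearly interpolate the system state at a sampling instant.

    x_prev and x_next are the species amounts at times t_prev and t_next,
    with t_prev <= t_sample <= t_next and t_prev < t_next.
    """
    if not (t_prev <= t_sample <= t_next) or t_prev == t_next:
        raise ValueError("t_sample must lie inside a non-empty leap interval")
    # Fraction of the leap elapsed at the sampling instant.
    w = (t_sample - t_prev) / (t_next - t_prev)
    return [a + w * (b - a) for a, b in zip(x_prev, x_next)]
```

For example, a leap from state [10, 0] at time 0 to state [20, 10] at time 2 yields [15.0, 5.0] at the sampling instant 1.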
Figure 5.
Pseudocode of cuTauLeaping – kernel for phase P3.
Device-side pseudocode of the cuTauLeaping kernel that implements the execution of the SSA steps. The kernel starts by loading two vectors, which correspond to the current state of the system and to the values of the stochastic constants, respectively, from the global memory areas that contain these data for all threads. Since this information is frequently accessed, it is immediately copied into the faster shared memory as vectors x and c, respectively. The kernel continues by verifying that the running thread has been flagged with the signal corresponding to SSA. Then, it performs a fixed number of SSA steps (100 in our default setting), where a single reaction is executed at each step, storing the system state at the sampled time instants.
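A fixed number of SSA steps can be sketched in Python with Gillespie's direct method and mass-action propensities. This is a sequential illustration, not the kernel itself: the Michaelis-Menten species ordering [S, E, ES, P], the constants in MM_C and all names below are assumptions, and the storage of the sampled states is omitted.

```python
import math
import random

# Hypothetical Michaelis-Menten setup, species order [S, E, ES, P]:
# R1: S + E -> ES,  R2: ES -> S + E,  R3: ES -> E + P
MM_REACTANTS = [{0: 1, 1: 1}, {2: 1}, {2: 1}]        # species -> order
MM_CHANGES = [{0: -1, 1: -1, 2: 1}, {0: 1, 1: 1, 2: -1}, {1: 1, 2: -1, 3: 1}]
MM_C = [0.0025, 0.1, 5.0]                            # illustrative constants


def propensities(x, c, reactants):
    """Mass-action propensity of each reaction in state x."""
    a = []
    for cj, needs in zip(c, reactants):
        aj = cj
        for i, order in needs.items():
            aj *= math.comb(x[i], order)  # 0 when molecules are insufficient
        a.append(aj)
    return a


def ssa_steps(x, t, c, reactants, changes, n_steps, rng):
    """Execute up to n_steps SSA steps, one reaction per step."""
    x = list(x)
    for _ in range(n_steps):
        a = propensities(x, c, reactants)
        a0 = sum(a)
        if a0 == 0:          # no reaction can fire any more
            break
        t += rng.expovariate(a0)                              # waiting time
        j = rng.choices(range(len(a)), weights=a)[0]          # next reaction
        for i, delta in changes[j].items():
            x[i] += delta
    return x, t
```

A call such as `ssa_steps([100, 50, 0, 0], 0.0, MM_C, MM_REACTANTS, MM_CHANGES, 100, random.Random(42))` returns the state and time reached after at most 100 reactions; the total amount of enzyme (E plus ES) is conserved throughout.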
Figure 6.
Pseudocode of cuTauLeaping – kernel for phase P4.
Device-side pseudocode of the cuTauLeaping kernel that verifies the termination of all simulations. The verification is performed by means of CUDA's hardware-accelerated synchronization and counting features, which make it possible to count the threads of a block that satisfy a specific predicate. By exploiting CUDA's atomic functions, we accumulate the total number of threads that satisfy the termination predicate: if it is equal to the number of threads, the execution of all parallel simulations is completed.
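The two-level counting scheme can be mimicked sequentially in Python. In this sketch the per-block count plays the role of a block-wide predicate count (in CUDA, presumably an intrinsic such as __syncthreads_count) and the running total plays the role of an atomic accumulation (such as atomicAdd); the status code -1 for terminated threads and the function name are assumptions.

```python
def all_terminated(status, block_size, terminated=-1):
    """Return True when every thread satisfies the termination predicate.

    status holds one flag per thread; threads are grouped into consecutive
    blocks of block_size elements, as they would be on the device.
    """
    total = 0
    for start in range(0, len(status), block_size):
        block = status[start:start + block_size]
        # Block-wide count of threads satisfying the predicate.
        block_count = sum(1 for q in block if q == terminated)
        # Accumulation across blocks (an atomic addition on the GPU).
        total += block_count
    return total == len(status)
```

The host only needs to compare the accumulated total with the number of launched threads to decide whether another iteration of the phases is required.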
Figure 7.
Schematization of the flattened representation of the stoichiometric information.
The stoichiometry of chemical reactions is generally represented by (usually sparse) matrices, corresponding to the variation of the species appearing either as reactants or products; however, both tau-leaping and SSA exploit only the non-zero values of these matrices. Each stoichiometric matrix can be pre-processed to identify its non-zero values and discard the remaining ones, thus reducing the number of reading operations required by the two stochastic algorithms. Our strategy to reduce the size of these matrices consists in flattening each matrix into a vector of triples (row index, column index, non-zero value stored at that position). In our implementation, both indices are 0-based and triples are stored in vectors of a CUDA data type that has the advantage of requiring a single instruction to fetch an entry. The top part of this figure shows the values appearing in the 3×4 stoichiometric matrix of reactant species of the Michaelis-Menten model (MM), which consists of 3 reactions over 4 molecular species (see Text S1). Note that only four cells of this matrix have non-zero values; the bottom part of the figure shows the corresponding flattened vector.
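The flattening step can be sketched in a few lines of Python. The function name is hypothetical, and the example matrix assumes the species ordering [S, E, ES, P] and the reactions S + E → ES, ES → S + E, ES → E + P for the Michaelis-Menten model; the actual ordering used in Text S1 may differ.

```python
def flatten(matrix):
    """Flatten a dense matrix into (row, column, value) triples.

    Only non-zero entries are kept; both indices are 0-based.
    """
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v != 0]


# Assumed 3x4 reactant matrix of the Michaelis-Menten model
# (rows: reactions, columns: species S, E, ES, P).
MM_REACTANT_MATRIX = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]

print(flatten(MM_REACTANT_MATRIX))
# -> [(0, 0, 1), (0, 1, 1), (1, 2, 1), (2, 2, 1)]
```

Twelve matrix cells are reduced to four triples, so a loop over the reactants touches only the entries that actually contribute to the propensities and state updates.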
Figure 8.
Comparison of the computational time of CPU tau-leaping and cuTauLeaping.
Comparison between the computational time taken by cuTauLeaping and COPASI CPU tau-leaping to execute different batches of stochastic simulations of the Michaelis-Menten (MM) model (a), the Prokaryotic Gene Network (PGN) model (b), the Schlögl model (c), and the Ras/cAMP/PKA pathway (d) (see Text S1 for the model definitions). For each model, cuTauLeaping becomes more profitable than the CPU counterpart when a certain number of parallel simulations is run, with a break-even that depends on the complexity of the system: for the MM and PGN models, cuTauLeaping is more effective when around parallel simulations are run, while for the Schlögl model and the Ras/cAMP/PKA pathway the break-even is around
simulations. Considering the speedup, the best results achieved with cuTauLeaping – with respect to COPASI – are around 583× for the MM model, 961× for the PGN model, 90× for the Schlögl model, and 25× for the model of the Ras/cAMP/PKA pathway (see also Table 1).
Table 1.
Comparison of computational time of COPASI CPU tau-leaping and cuTauLeaping.
Table 2.
Running times of cuTauLeaping for the simulation of randomly generated synthetic models.
Figure 9.
Frequency distribution of bistable states in the Schlögl model.
Frequency distribution of the molecular amount of molecular species in the Schlögl model, calculated using a total of
parallel simulations executed by cuTauLeaping. (a) Plot of the frequency distribution of
considering
a.u., to detect the bistable switching behavior that takes place in the first time instants of the dynamics; a slightly higher probability of reaching the low steady state can be observed, starting from the initial state of the Schlögl system (described in Text S1). (b) Plot of the frequency distribution of
considering
a.u., to investigate the stability of the two steady states of the system; the heatmap highlights the two stable states (around 100 and 600 molecules of species
), and shows larger stochastic fluctuations around the high steady state.
Figure 10.
Parameter sweep analysis of the Schlögl model.
Results of a PSA-1D on the Schlögl model, in which the value of the stochastic constant is varied in the interval
(the set of reactions and the values of all other parameters are given in Tables 4 and 5 in Text S1). Each frequency distribution is calculated according to
simulations executed by cuTauLeaping, measuring the amount of the molecular species
at the time instant
a.u., considering ten different values of the stochastic constant
within the sweep interval. The figure shows that increasing values of
induce a decrease (increase) in the frequency distribution of
concerning the low (high) steady state, with intermediate values of
characterized by an effective bistable behavior.
Figure 11.
Three-dimensional parameter sweep analysis of the Schlögl model.
Results of a PSA-3D on the Schlögl model, performed by varying the stochastic constants ,
and
in the intervals
,
and
, respectively. The values of the stochastic constants were uniformly sampled in a
three-dimensional lattice; for each sample, we executed 256 simulations with cuTauLeaping (for a total of
simulations) and evaluated the frequency distribution of the amount of the molecular species
at the time instant
a.u. This set of values was then partitioned according to the reached (low or high) stable steady state; in the plot, the red (blue) region corresponds to the parameterizations of the model which yield the high (low) steady state most frequently. The green region represents a set of conditions in which both steady states are reached with equal frequency.
Figure 12.
Bidimensional parameter sweep analysis of the Ras/cAMP/PKA model.
Results of a PSA-2D on the Ras/cAMP/PKA model by varying the amount of GTP in the interval molecules (ranging from a reduced nutrient availability to a normal growth condition), and the amount of Cdc25 in the interval
molecules (ranging from the deletion to a 2-fold overexpression of this GEF protein). The figure shows the amplitude of cAMP oscillations, evaluated as described in [44]; an amplitude value equal to zero corresponds to non-oscillating dynamics. (a) Plot of the results obtained by running
parallel simulations with cuTauLeaping; (b) plot of the results obtained by running
sequential simulations, performed on the CPU. The two batches of parallel and sequential simulations were executed with a comparable computational time.
Table 3.
Tau-leaping data structures residing in CUDA high-performance memories.
Figure 13.
Performance comparison of CPU tau-leaping and cuTauLeaping for a PSA of the Ras/cAMP/PKA model.
Running times of cuTauLeaping and COPASI CPU tau-leaping to execute a PSA-1D of the Ras/cAMP/PKA model, where the stochastic constant was varied in the interval
and a total of
simulations were executed. The plot shows how the computational cost of tau-leaping running on the CPU rapidly increases; this behavior can become prohibitive if several independent simulations need to be executed. By contrast, cuTauLeaping shows a very moderate increase in the running times and outperforms the CPU implementation of tau-leaping.