The authors received a gift from a commercial source, NVIDIA. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.
Conceived and designed the experiments: JC HHH MK. Performed the experiments: WW LFX. Analyzed the data: WW LFX JC HHH MK. Wrote the paper: WW LFX JC HHH MK.
Recent developments in modern computational accelerators such as Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, together with the fact that parallelized code is usually not portable across architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as an Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved more than
Recent developments in the field of high-performance computing have greatly expanded the computational capabilities and applications of Graphics Processing Units (GPUs). Using GPUs to perform computations that are typically handled by a CPU is known as General Purpose computation on GPUs (GPGPU). To name a few application areas, GPUs are now used in the fields of bioinformatics
In this work, we sought to overcome the challenges of CUDA by studying how to achieve both automated parallelization using OpenACC (a directive-based language extension similar to OpenMP) and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as an Intel MIC-architecture accelerator to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case study.
Models of cardiac wave propagation often require a large number of computations at each model node at each time step in order to compute the numerous ionic currents at the node. A computational approach that uses parallel processing on GPUs or on large numbers of CPU cores provides a substantial performance increase over any sequential approach. For example, Neic et al. accelerated cardiac bidomain propagation simulations using a cluster of GPUs and achieved up to
We first used OpenACC to automatically generate CUDA and OpenCL GPU code that runs on NVIDIA GPUs. In addition, we developed CUDA and OpenCL implementations for parallelization on NVIDIA GPUs, as well as an OpenMP implementation of the same model for parallelization on Intel CPUs. In this way we could compare the performance of different parallel computing techniques. By varying the number of threads in the OpenMP implementation, we were able to achieve good speedups on the Intel Xeon Phi accelerator.
The contributions of this paper are twofold: 1) We parallelized a 2D cardiac wave propagation model using both manual and automatic parallelization paradigms, including CUDA, OpenCL, and OpenACC. In particular, auto-parallelizing the model with OpenACC demonstrates that substantial parallelism can be achieved efficiently and with modest programming effort. 2) We applied OpenCL, OpenACC, and OpenMP to the problem of simulating cardiac wave propagation so that parallelism on different architectures could be explored. We found that this approach provided excellent speedups of the model on GPUs and on the Intel MIC-architecture accelerator. Our results indicate that this is a very useful approach for solving computational models of wave propagation in multi-dimensional media using newly available computational accelerators.
GPUs are massively parallel multi-threaded devices capable of executing a large number of active threads concurrently. A GPU consists of multiple streaming multiprocessors, each of which contains multiple scalar processor cores. For example, NVIDIA's Fermi-architecture GPU card Tesla C2050 contains 14 such multiprocessors, each of which contains 32 cores, for a total of 448 cores. In addition, the GPU has several types of memory, most notably the main device memory (global memory) and the on-chip memory shared among all cores of a single multiprocessor (shared memory).
GPUs achieve high-performance computing through the massively parallel processing power of hundreds or even thousands of compute cores. There are two popular programming schemes for GPUs. The first is CUDA (Compute Unified Device Architecture)
The Intel MIC-architecture accelerator card is Intel's most recent accelerator product. A MIC card usually contains 60 or 61 cores, each of which supports 4 hardware threads. One notable feature is its 512-bit wide SIMD vector units, which provide fine-grained vectorization parallelism: a single instruction can operate on 8 adjacent double-precision or 16 single-precision floating-point values. The Intel MIC accelerator achieves high-performance computing through these hardware threads and wide vector registers. In contrast to GPUs, a whole application can execute directly on the MIC accelerator. Programming models for the MIC accelerator include OpenMP, OpenCL, and MPI. In this work, we use OpenMP and OpenCL to exploit the parallelism of one MIC card.
Our goal was to study the computational speedups for simulating cardiac electrical wave propagation that are provided by multiple hardware platforms and several programming schemes. We chose to work with a relatively straightforward 2D implementation of a well-known cardiac action potential model (Beeler-Reuter). This choice simplified porting the code between hardware platforms and programming schemes. Although the model is not as complex as models used in state-of-the-art simulations
In our cardiac wave model, cardiac tissue is modeled as a large geometrical network of nodes that are electrically coupled. The electrical potential of the cardiac cell membrane at each node is represented as a set of partial differential equations (shown in
Top: an image showing cardiac electrical wave propagation as spatial fluctuations of transmembrane potential. A single rotating wave (rotor) is shown. Bottom: action potentials at one node are shown as the temporal variation of transmembrane potential at the node.
Top: an image of transmembrane potential showing complex wave activity. Bottom: action potentials at one node are shown as the temporal variation of transmembrane potential at the node.
The general approach for solving the model is shown in
As shown in
Parallel implementations of N-by-N node cardiac models are relatively straightforward because, once diffusion currents have been computed, there is no data dependency between neighboring nodes within a given timestep. Because of this, the differential equations that represent myocyte electrophysiology (the brgates() and brcurrents() functions) can be evaluated at each node in almost any order.
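As an illustration, the sketch below shows the structure of one timestep in plain C (the array names, the node state type, and the 5-point diffusion stencil are illustrative simplifications, not the exact code of our implementation):

```c
/* One timestep over an N x N grid. brgates() updates the Beeler-Reuter
 * gating variables and brcurrents() evaluates the ionic currents at a node;
 * everything else here is an illustrative placeholder. */
for (int i = 1; i < N - 1; i++) {
    for (int j = 1; j < N - 1; j++) {
        /* diffusion current gathered from the neighboring nodes */
        double i_diff = D * (v[i-1][j] + v[i+1][j] + v[i][j-1] + v[i][j+1]
                             - 4.0 * v[i][j]);
        brgates(&state[i][j], dt);                 /* update gating variables */
        double i_ion = brcurrents(&state[i][j]);   /* total ionic current     */
        v_new[i][j] = v[i][j] + dt * (i_diff - i_ion);
    }
}
```

Because each node reads only values from the previous timestep and writes only its own entry of v_new, the iterations of the double loop are independent and can be executed in any order or in parallel.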
We parallelized the cardiac model using CUDA and OpenCL. In CUDA and OpenCL, multiple threads execute the same instructions, but each thread operates on the data of a different node. Take CUDA for example
The CUDA implementation of our cardiac model is computationally efficient and provides substantial speedups compared to a sequential CPU implementation. However, CUDA code can only run on supported NVIDIA GPUs. To address this limitation, we developed an OpenCL version of the model to support GPU parallelization across platforms. In this work, the OpenCL implementation was tested on an Intel MIC accelerator card in addition to NVIDIA GPUs. The general architecture of our OpenCL implementation is the same as that of the CUDA implementation: one thread is used to solve the equations at one node of the model.
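A minimal CUDA sketch of this one-thread-per-node mapping is shown below (the kernel, array, and type names and the 16 × 16 block size are illustrative, not the exact code of our implementation):

```c
/* Illustrative kernel: one thread updates one node of the n x n grid. */
__global__ void step_kernel(const double *v, double *v_new, NodeState *state,
                            int n, double dt, double D)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row of this node    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column of this node */
    if (i < 1 || i >= n - 1 || j < 1 || j >= n - 1) return;

    int idx = i * n + j;
    double i_diff = D * (v[idx - n] + v[idx + n] + v[idx - 1] + v[idx + 1]
                         - 4.0 * v[idx]);
    brgates(&state[idx], dt);                  /* __device__ functions in the */
    double i_ion = brcurrents(&state[idx]);    /* GPU port of the model       */
    v_new[idx] = v[idx] + dt * (i_diff - i_ion);
}

/* Host-side launch with a 2D grid of 2D thread blocks. */
dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
step_kernel<<<grid, block>>>(d_v, d_v_new, d_state, n, dt, D);
```

The OpenCL version follows the same structure, with the kernel indices obtained from get_global_id() instead of the CUDA block and thread indices.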
Having identified the hotspot of the sequential program, we added OpenACC directives to offload that code block to accelerators such as GPUs and coprocessors.
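A minimal sketch of the resulting code structure is shown below (the array names, sizes, and gang/worker/vector clauses are illustrative examples, not the exact pragmas of our implementation):

```c
/* Illustrative OpenACC sketch: the data region keeps the model state on the
 * accelerator across timesteps, and the loop directives expose the node-level
 * parallelism to the compiler. */
#pragma acc data copyin(state[0:n*n]) copy(v[0:n*n]) create(v_new[0:n*n])
for (int t = 0; t < nsteps; t++) {
    #pragma acc parallel loop gang worker
    for (int i = 1; i < n - 1; i++) {
        #pragma acc loop vector
        for (int j = 1; j < n - 1; j++) {
            int idx = i * n + j;
            double i_diff = D * (v[idx-n] + v[idx+n] + v[idx-1] + v[idx+1]
                                 - 4.0 * v[idx]);
            /* Beeler-Reuter gating-variable and ionic-current updates for
             * node idx go here (omitted for brevity). */
            v_new[idx] = v[idx] + dt * i_diff;
        }
    }
    /* copy v_new back into v for the next timestep (omitted) */
}
```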
In the code block with OpenACC pragmas, “#pragma acc data” specifies which data should be copied to the accelerator (using “copyin”) or both to and from the accelerator (using “copy”). “#pragma acc loop” specifies the loop to be parallelized by OpenACC. The “vector”, “worker”, and “gang” clauses specify thread configurations analogous to the block size and grid size in OpenCL/CUDA. For example, Algorithm 2 shown in
We parallelized the CPU code using OpenMP
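A minimal sketch of the pragma placed around the node-update loops is shown below (update_node() is an illustrative placeholder for the per-node computation); the number of threads is controlled at run time, for example through the OMP_NUM_THREADS environment variable:

```c
/* Illustrative OpenMP sketch: iterations of the outer node loop are divided
 * statically among the available threads. */
#pragma omp parallel for schedule(static)
for (int i = 1; i < n - 1; i++)
    for (int j = 1; j < n - 1; j++)
        update_node(i, j);   /* gates, currents, and diffusion for node (i,j) */
```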
In this section, we report the results of experiments with our CUDA, OpenCL, and OpenACC GPU implementations. We also compare the performance of the GPU implementations with that of the OpenMP implementation. At the end of this section, we report the speedups obtained on the Intel MIC-architecture accelerator.
Our GPU implementations of the model were tested on two different GPU cards: the Fermi-architecture Tesla C2050 and the Kepler-architecture Tesla K20. In the following sections, we refer to them as the Fermi GPU and the Kepler GPU, respectively. CPU-based implementations were also tested on the machines that hosted each GPU card. The machine hosting the Fermi GPU had two Intel E5530 2.4 GHz quad-core Nehalem processors (8 cores total) and 24 GB of memory. The machine hosting the Kepler GPU had two AMD Opteron 6320 2.8 GHz eight-core processors (16 cores total) and 16 GB of memory.
The Fermi GPU had 14 multiprocessors, each with 32 cores (448 cores total) clocked at 1.15 GHz, and 3 GB of global memory. The Kepler GPU had 13 multiprocessors, each containing 192 cores (2496 cores total) clocked at 706 MHz, and 4800 MB of global memory. The peak double-precision floating-point performance of the Fermi GPU and the Kepler GPU was 515 GFlops and 1.17 TFlops, respectively. These hardware configurations were also used for the experiments comparing against the OpenMP implementation.
The computational performance of the GPU implementations (running with the best block size configuration
For the single rotor simulations, we studied grid sizes ranging from
For both CUDA and OpenCL GPU implementations, the Fermi GPU provided speedups of at least
Rotor breakup simulations were used to study performance under greater computational loads. In these tests we set the model parameters to simulate the breakup of a single rotor into multiple rotors, resulting in electrical activity that is much more complex than that of a single rotor (shown in
The hardware used to test the OpenACC implementation was the machine hosting the Kepler GPU. To compile the OpenACC code, we used the HMPP 3.3.3 compiler from CAPS
The comparison of speedups between automatically generated GPU code (CUDA/OpenCL) and hand-written GPU code (CUDA/OpenCL) is shown in
All GPU codes were run on the Kepler GPU.
To provide more perspective on the speedups achieved by our GPU implementations, we conducted additional tests with a parallel CPU implementation of the model using OpenMP. The OpenMP implementation was run on the hardware platform that hosted the Fermi GPU, with the same reentrant activity (one rotor) simulation input files. We achieved an average speedup of
Reentrant activity (one rotor) was simulated.
We used an Intel Xeon Phi 5110P coprocessor to test our hand-written OpenCL code, the OpenCL code automatically generated by OpenACC, and the OpenMP code written for the Intel MIC architecture. The coprocessor contains 60 cores, supports 240 threads, and has 8 GB of memory. The machine hosting the coprocessor had 32 GB of memory and an Intel Xeon E5 CPU clocked at 2.63 GHz. As for software, we used the OpenCL 1.2 library (provided with the Intel ICC compiler v14.0.0) for the OpenACC and OpenCL implementations. The hand-written OpenCL host code was compiled with GCC. We used the same CAPS compiler as above to compile the OpenACC code into OpenCL code that runs on the Intel MIC card. For the OpenMP implementation, we used Intel's ICC compiler version 14.0.0.
Reentrant activity (one rotor) was simulated.
In this section, we first summarize the optimizations applied to the different implementations. We then compare the different programming languages using several metrics. Finally, we compare our results with related work and summarize our findings.
Although it was straightforward to port the cardiac model to run on accelerators, optimizing the code was not trivial.
For the CUDA and OpenCL implementations on GPUs, we applied two other effective optimizations before optimizing the block size.
Eliminating atomic operations. A direct port of the sequential code contained 9 equations representing the updates (writes) from a given node to its 8 neighboring nodes and to itself. Before optimization, atomic operations (in CUDA) and kernel isolation (in OpenCL) were used to avoid racing updates. Atomic operations are expensive, and kernel isolation introduces synchronization overhead. To address this limitation, we devised a neighbor-update-free strategy in which each node “collects” (reads) the update requests from its neighbors and performs the update on itself.
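The difference between the two schemes can be sketched as follows (the index arithmetic and stencil weight w are illustrative; on Fermi-class GPUs, atomicAdd on double-precision values additionally requires a compare-and-swap workaround):

```c
/* Scatter (before optimization): each node writes a contribution into each of
 * its neighbors, so two threads may update the same location concurrently and
 * atomic operations (or isolated kernels) are needed. Shown for one neighbor: */
atomicAdd(&i_diff[idx - 1], w * v[idx]);

/* Gather (neighbor-update-free): each node only reads its neighbors' values
 * and writes to itself, so no atomics or kernel isolation are needed. */
i_diff[idx] = w * (v[idx-1] + v[idx+1] + v[idx-n] + v[idx+n]
                   + v[idx-n-1] + v[idx-n+1] + v[idx+n-1] + v[idx+n+1]
                   - 8.0 * v[idx]);
```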
Coalescing memory accesses. On GPUs, coalesced memory access means that the data requests (e.g. by
We obtained the OpenACC implementation by adding pragmas to a sequential implementation that incorporated the above two optimizations. Together with tuning of the gang, worker, and vector configuration, this yielded good speedups from OpenACC on GPUs.
For the MIC OpenMP implementation, as mentioned above, we applied the loop unswitching optimization in addition to the above optimizations so that the vectorization capability of the MIC card could be better exploited. The loop unswitching strategy improved the speedup results by approximately
For the sequential and OpenMP versions that run on CPUs and serve as baselines, we passed the -O3 optimization flag to the compiler so that these implementations were well optimized and provided strong baselines.
In this section, we compare the CUDA, OpenCL, OpenACC, and OpenMP implementations using metrics such as the number of lines of source code changed relative to the original implementation, the time taken to program, and portability.
Language | Code-Change | Estimated Time to Program | Platforms |
CUDA | 500(+/−) | Weeks | NVIDIA GPUs |
OpenCL | 500(+/−) | Weeks | GPUs, CPUs, MIC accelerator |
OpenACC | 10(+) | Days | GPUs, CPUs, MIC accelerator |
OpenMP | 10(+) | Days | CPUs, MIC accelerator |
The Code-Change column (second column) reports the estimated number of lines of code changed in the kernel computation functions relative to the sequential C code. The “+” symbol after a number means additions only, and “+/−” means additions and deletions. There are about 220 lines of computation code in the sequential CPU implementation. The CUDA and OpenCL implementations needed to replace almost every statement of the original CPU implementation; in addition, they needed to include GPU initialization and data preparation code. The code change was mostly addition and replacement of statements. The OpenACC implementation only incurred about 10 statement additions to the sequential code: the initialization of the GPU, the data initialization for the GPU, and the data movement between GPU and CPU were all handled automatically in the generated CUDA and/or OpenCL code, and the programmer only needed to determine the correct pragmas and associated pragma attributes to add. The OpenMP implementation was likewise obtained by placing only a few pragmas around the parallel code regions.

Considering the difficulty of programming in these languages (third column), CUDA and OpenCL are about the same; either could take an experienced programmer weeks to get the program correct and optimized for GPUs. In our case, it took us months to obtain the well-tuned CUDA version. In contrast, OpenACC and OpenMP programming mostly involved determining where to place the pragmas and which parameters should go with them, and it does not require the programmer to be as experienced as CUDA programming does. It took us less than a week to obtain the efficient OpenACC implementation and the baseline OpenMP implementation.

We also compare the portability of these implementations (fourth column). The CUDA implementation can only run on NVIDIA GPUs, while OpenCL can run across different architectures, including GPUs, CPUs, and accelerators such as the Intel Xeon Phi coprocessor. Since the OpenACC implementation can be compiled to OpenCL code, it can target the same architectures as OpenCL. The OpenMP program cannot run on GPUs; however, it can run on CPUs and on Intel's Xeon Phi coprocessor (accelerator).
Combining the performance results and the above metrics, we can see that the OpenCL implementation achieved the best speedups and portability on GPUs; the OpenACC implementation, which required the least programming effort, also achieved very good speedups on GPUs and the same portability as the OpenCL implementation. For the Intel MIC architecture, the OpenMP implementation was the performance champion. We attribute this in part to performance differences between the compilers and the (OpenCL) libraries used to generate the executables.
The first published simulation of 2D wave propagation in a cardiac arrhythmia model using GPU hardware was performed on an Xbox 360 and resulted in a speedup of more than
Work that has compared OpenACC, OpenCL, and CUDA includes the acceleration of hydrocodes
This paper builds on our previous report
We conclude that emerging software developments such as the OpenACC directive-based programming language make it easier to exploit the application parallelism offered by evolving hardware architectures. Our approach of using OpenACC, OpenCL, and OpenMP to achieve efficient and effective parallelization on different accelerators can be applied broadly to benefit other domains.
Software Package containing all the implementations.
(ZIP)
The authors would like to thank Scott Grauer-Gray, Robert Searles, and William Killian for their help with the experiments.