Figure 1.

Typical NVIDIA GPU architecture.

The GPU comprises a set of Streaming Multiprocessors (SMs). Each SM comprises several Stream Processor (SP) cores, as shown for NVIDIA’s Fermi architecture (a). The programmer controls the GPU resources through the CUDA programming model, shown in (b).

Figure 2.

The sequential pseudo-code of bedpostX.

Figure 3.

Distribution of resources for the CUDA kernel that performs the Levenberg-Marquardt algorithm.

Voxels are assigned to threads of CUDA blocks. Each CUDA block is comprised of threads and processes voxels ( was used in this study).
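
The voxel-to-thread mapping described in this caption can be sketched in plain Python by emulating CUDA's block and thread indices. This is a minimal sketch that assumes one voxel per thread. The function names `launch_config` and `voxel_for_thread` are hypothetical, and the block size of 64 is an illustrative assumption, not a value taken from the study.

```python
import math

def launch_config(num_voxels, threads_per_block):
    """Number of blocks needed so that every voxel gets its own thread."""
    return math.ceil(num_voxels / threads_per_block)

def voxel_for_thread(block_idx, thread_idx, threads_per_block, num_voxels):
    """Emulate blockIdx.x * blockDim.x + threadIdx.x; None means the thread is idle."""
    voxel = block_idx * threads_per_block + thread_idx
    return voxel if voxel < num_voxels else None

# Illustrative values: 4804 voxels (the slice size quoted in later captions)
# and 64 threads per block (an assumption made only for this sketch).
NUM_VOXELS = 4804
THREADS_PER_BLOCK = 64
num_blocks = launch_config(NUM_VOXELS, THREADS_PER_BLOCK)
```

Because the voxel count is rarely a multiple of the block size, the last block usually contains idle threads, which is why the bounds check in `voxel_for_thread` is needed.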

Figure 4.

Distribution of resources for the CUDA kernel that performs the MCMC algorithm.

Each voxel is assigned to more than one thread within a thread block, so that the likelihood calculation is parallelised. Each CUDA block is comprised of threads and processes only 1 voxel ( was used in this study).

Figure 5.

Workflow in the MCMC kernel.

(a) Workflow for a single iteration and a single parameter update, describing how computation tasks are distributed between the threads of a block () in a case with gradient directions. The calculation of the model-predicted signals for the different gradient directions is distributed as evenly as possible between the threads within a thread block. The remaining tasks, which are not computationally demanding, are performed by a leader thread while the other threads wait. (b) Workflow for a thread block of the MCMC kernel that performs all T iterations for all R parameters (i.e. for a voxel). Each block has threads. The threads need to be synchronised at certain steps.
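
The work split within a block can be illustrated with a small Python emulation. It is a sketch under stated assumptions, not the kernel itself: the strided assignment of gradient directions to threads is one common way to spread work evenly, the leader's task is reduced here to summing per-thread partial results, and the names `directions_for_thread` and `block_sum` as well as the sizes (32 threads, 50 directions) are chosen only for illustration.

```python
def directions_for_thread(thread_idx, num_threads, num_directions):
    """Strided assignment: thread t handles directions t, t + B, t + 2B, ..."""
    return list(range(thread_idx, num_directions, num_threads))

def block_sum(per_direction_values, num_threads):
    """Emulate one thread block: per-thread partial sums, then a leader-thread reduction."""
    k = len(per_direction_values)
    # Phase 1 (all threads): each thread accumulates its own share of directions.
    partial = [
        sum(per_direction_values[i] for i in directions_for_thread(t, num_threads, k))
        for t in range(num_threads)
    ]
    # On a GPU, a barrier such as __syncthreads() would separate the two phases.
    # Phase 2 (leader thread only): combine the partial results.
    return sum(partial)

# Hypothetical sizes: 50 gradient directions shared by 32 threads.
counts = [len(directions_for_thread(t, 32, 50)) for t in range(32)]
```

With this assignment no thread handles more than one direction beyond any other, which is what "as evenly as possible" amounts to when the number of directions is not a multiple of the block size.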

Figure 6.

Execution times for the MCMC GPU kernel using different numbers of threads per block.

Results are shown for different numbers K of gradient directions (50, 100 and 200), for a slice of 4804 voxels ( fibres, MCMC iterations (3000 burn-in)).

Table 1.

Major hardware features of the Tesla C2050 and M2090 GPUs.

Table 2.

Theoretical peak performance of the GPU devices and CPU cores used.

Figure 7.

Comparison between CPU and GPU model estimates for the diffusivity d, the baseline signal and the volume fraction of the first fibre , in different brain areas.

(a) A corpus callosum voxel, (b) a centrum semiovale voxel and (c) a grey matter voxel. Each design was run 1000 times on the same data, and for each repeat the mean of the posterior distribution of the respective parameter was recorded. The histograms show the distributions of these means across all 1000 repeats. For each repeat, a burn-in period of 3000 iterations and a thinning period of 25 samples were used for the MCMC.
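
How burn-in and thinning turn a raw chain into a posterior mean can be sketched as follows. This is a generic illustration rather than the bedpostX implementation: the function name `posterior_mean` is hypothetical, the toy chain stands in for real MCMC draws, and the 1250 post-burn-in draws are an arbitrary choice. Only the burn-in of 3000 and the thinning period of 25 come from the caption.

```python
from statistics import mean

def posterior_mean(chain, burn_in, thin):
    """Discard the first `burn_in` draws, keep every `thin`-th draw, and average."""
    kept = chain[burn_in::thin]
    if not kept:
        raise ValueError("chain is too short for the requested burn-in")
    return mean(kept), len(kept)

# Toy chain: 3000 burn-in draws followed by 1250 further draws (1250 is hypothetical).
toy_chain = [float(i) for i in range(3000 + 1250)]
estimate, num_kept = posterior_mean(toy_chain, burn_in=3000, thin=25)
```

Repeating this on the same data with different random seeds gives one posterior mean per repeat, and those means are what the histograms summarise.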

Figure 8.

Comparison of single-core CPU and GPU execution times (in log scale) running the Levenberg-Marquardt algorithm, with speed gains of over two orders of magnitude: (a) as the number of Levenberg-Marquardt iterations is increased, and (b) as the number of voxels per slice is increased.

The execution times for (a) are for a slice of 4804 voxels, with the convergence criterion of the algorithm decreased to allow more iterations. For each case, results are shown for different numbers K of gradient directions (64, 128 and 256) and for estimating fibres.

Figure 9.

Comparison of single-core CPU and GPU execution times (in log scale) running the MCMC algorithm, with speed gains of over two orders of magnitude: (a) as the number of MCMC iterations is increased, and (b) as the number of voxels per slice is increased.

The execution times for (a) are for a slice of 4804 voxels and those for (b) are for 1000 MCMC iterations. For each case, results are shown for different numbers K of gradient directions (64, 128 and 256) and for estimating fibres.

Table 3.

Speed-ups for running bedpostX on a GPU relative to a single-core CPU.

Figure 10.

Total execution times (in log scale) of the bedpostX application on a single-core CPU and on a Tesla C2050 GPU for the whole dataset (30 slices), as the number of fibres L is increased.

Results are shown for different numbers K of gradient directions (64, 128 and 256) and when MCMC iterations were utilised (3000 burn-in iterations).

Table 4.

Speed-ups for running bedpostX on a cluster of GPUs relative to a cluster of CPUs.

Figure 11.

Total execution times (in log scale) of the bedpostX application on a CPU cluster and on 372 Tesla M2090 GPUs processing 102 slices, as the number of fibres L is increased.

Results are shown for different numbers K of gradient directions (64, 128 and 256) and when MCMC iterations were utilised (3000 burn-in iterations).
