Comparison of CPU and GPU bayesian estimates of fibre orientations from diffusion MRI

Background The correct estimation of fibre orientations is a crucial step for reconstructing human brain tracts. Bayesian Estimation of Diffusion Parameters Obtained using Sampling Techniques (bedpostx) is able to estimate several fibre orientations and their diffusion parameters per voxel using Markov Chain Monte Carlo (MCMC) in a whole brain diffusion MRI data, and it is capable of running on GPUs, achieving speed-up of over 100 times compared to CPUs. However, few studies have looked at whether the results from the CPU and GPU algorithms differ. In this study, we compared CPU and GPU bedpostx outputs by running multiple trials of both algorithms on the same whole brain diffusion data and compared each distribution of output using Kolmogorov-Smirnov tests. Results We show that distributions of fibre fraction parameters and principal diffusion direction angles from bedpostx and bedpostx_gpu display few statistically significant differences in shape and are localized sparsely throughout the whole brain. Average output differences are small in magnitude compared to underlying uncertainty. Conclusions Despite small amount of differences in output between CPU and GPU bedpostx algorithms, results are comparable given the difference in operation order and library usage between CPU and GPU bedpostx.

Background Recent human brain MRI data sizes are continuing to increase as large imaging data repositories curate structural, functional and diffusion data with higher spatial resolution, faster temporal sampling, and higher angular [1][2][3]. Performing image analysis on such dataset is computationally expensive, even with clusters of CPUs working simultaneously on a single dataset [4]. By contrast, graphics processing units (GPUs) have a massively parallel structure designed with hundreds of smaller cores optimized to exploit the data level parallelism of certain applications, utilizing simpler instruction sets and distributing them over multiple cores [5,6]. This parallelization can accelerate computationally slow processes such as data visualization, stochastic iteration, and Bayesian simulations including probabilistic tractography [2,[4][5][6][7][8][9][10]. A popular tool in estimating diffusion parameters for whole brain diffusion MRI is available to be run on both CPU or GPU, with GPU algorithm achieving over 100 times speed-up compared to its CPU algorithm [6,11,12]. Despite the GPU's advantages in acceleration, few studies have examined whether there are differences in computational output from the CPU and GPU. In general, checking for output convergence between CPU and GPU results is important for several reasons. First, CPU and GPU both have double-precision capabilities in their compilation and runtime libraries, but the optimization of performance and speed-up of GPU binaries may restrict them to using single-precision libraries which can cause results to be different due to float-point precision differences [13,14]. Secondly, there are differences in the CPU and GPU random number generators and operation orders in implementing Markov Chain Monte Carlo (MCMC) [15][16][17]. For GPU results to be used interchangeably with existing CPU algorithms, the GPU algorithm should produce results that are reproducible and convergent with results obtained by the CPU algorithm. For example, Hernandez-Fernandez et al., compared the mean of a few representative diffusion weighted voxel values in a repeated test between CPU and GPU and found almost identical results [6]. However, their study did not report on CPU/GPU differences in contiguous within-slice voxels or multi-slice brain data. The current study aims to extend these findings by comparing sampled distribution shapes of CPU and GPU Bayesian estimation of diffusion parameters in a whole brain dataset. This paper is organized as follows. Brief introductions of diffusion MRI and Bayesian estimation of diffusion parameters are given. Then, the complete methodology of output comparison technique is described. Results of output comparison are presented for each diffusion parameter type, and then, we give our conclusions and discussions

Diffusion MRI and bayesian estimation
Diffusion MRI (dMRI) is a useful tool in visualizing the white matter connectivity of the brain and is widely used in both research and clinical contexts. dMRI is sensitive to molecular diffusion of water and enhances the anisotropy-the directional dependence-of neuronal white matter fibre tracts, which can be used to create fractional anisotropy maps, mean diffusivity maps and fibre pathways [18,19]. A commonly used method to estimate the fibre orientations and reconstruct the brain tracts in vivo is to use the FMRIB Software Library's (FSL) "Bayesian estimation of diffusion parameters" (bedpostx) and "probabilistic tracking of crossing fibres" (probtrackx) algorithms. In brief, bedpostx employs a Markov Chain Monte Carlo sampling technique to estimate the posterior probability density functions (PDF) of the diffusion parameters utilizing the "ball-and-stick" model which takes into account multiple fibre orientations in a given voxel where appropriate [11,12]. This allows the resolving of within-voxel fibre crossings, which is a common hurdle during the fibre tracking step, by fitting more than one fibre orientation in a given voxel only when it is relevant to do so. This feature is the "automatic relevance determination" (ARD) algorithm in bedpostx [11] which initially sets the additional fibre fractions in secondary orientations to zero with low variance, and iteratively estimates the variance separately so that when the additional fibre orientation is supported by the data, the additional fibre fraction can take a non-zero value with a larger variance. bedpostx uses the Levenberg-Marquardt (L-M) fit to initialize parameters by minimizing the sum of squared model residuals, similar to fitting a diffusion tensor model, then, it proposes a value for each parameter, drawing from Normal proposal distribution, calculates the likelihood term, and accepts or rejects the proposed value based on a Metropolis acceptance criterion. When employing the ball-and-stick model where the isotropic compartment is fitted with a mean value within a voxel (i.e. model = 1), bedpostx gives the following PDF distributions for each voxel as output: diffusivity value (d), baseline signal (S0), weight of each fibre orientation's contribution to anisotropic diffusion signal (stick), also known as fibre fraction values (f 1 , f 2 , etc.), and each fibre orientation's directional angles expressed in polar coordinates (ϕ 1 , θ 1 , ϕ 2 , θ 2 , etc). These PDFs are then randomly sampled by probtrackx to create fibre streamlines through stochastic propagation of multiple particles through the diffusion space [11,12]. Because bedpostx processes each voxel serially in the CPU, an extensive amount of computational time is required to obtain the PDFS, which makes it impractical for utilization in a clinical medical environment where a computing cluster may not be available [20,21]. To alleviate this problem, and to reduce computational time substantially FSL provides a GPU-based parallelized version of bedpostx, called bedpostx_gpu [6]. Here, the L-M initialization and MCMC sampling are parallelized such that multiple voxels are processed simultaneously. Difference in operation order exists between bedpostx and bedpostx_gpu such that, in the GPU, L-M initialization for the entire brain is done first, then MCMC sampling are done for the entire brain, whereas in the CPU, L-M initialization and MCMC sampling are done in sequential order for each voxel. We know of no study to date that has quantitatively examined output similarities and differences between the bedpostx and bedpostx_gpu algorithms in a whole-brain DTI dataset. Further, because the PDF distributions obtained from bedpostx is used in obtaining fibre streamlines in probtrackx and not their mean values, differing distributional shapes between the two algorithms can also cause bias in output fibre tracking using probtrackx. This study aims to compare the output of bedpostx and bedpostx_gpu and report on output PDF distribution (f 1 , f 2 , ϕ 1 , θ 1 , ϕ 2 , θ 2 ) shape difference, magnitude of difference in mean value and underlying uncertainty value.

Computational resources
bedpostx was used for output comparison with the GPU version bedpostx_gpu, both from FSL 6.0.5 package running on Ubuntu 20.04 LTS. The CPU version ran on a workstation with a dual Intel Xeon X5670 2.93 GHz CPU with 6 x 4-GB DDR3-1333 memory, and 24 threads. The GPU version ran on a workstation with one NVIDIA Tesla C2075 with 448 CUDA cores, 6-GB GDDR5 dedicated memory, PCIe x16 bus, CUDA 8.0 with driver version 390.144.

Data
Diffusion data from the MGH-USC Human Connectome Project (HCP) Image & Data Archive portal (https://ida.loni.usc.edu/), subject: mgh1005, was used for running multiple trials of bedpostx and bedpostx_gpu. The full dMRI data consist of directional volumes acquired in multiple shells (b = 0,1000,3000,5000,10000) but for our work, a single shell from the full set was used for analysis: motion and eddy corrected, b = 1000, 64 directional volumes and 6 nondirectional volumes, 1.5mm isotropic, 140x140x96. This was chosen because most clinical and research studies have access to a similar single-shell dMRI acquisition method and the resulting data can still support multi-fibre modeling of bedpostx algorithms. T1-weighted anatomical scan of the same subject was segmented [22] to derive binary masks of grey matter, white matter and cerebrospinal fluid, then co-registered to the diffusion data [23,24]. These masks were used to quantify how many significantly different distributions were localized in each tissue class. Analysis was repeated on a different HCP subject data, mgh1001, to assess reproducibility of comparison results.
In the absence of ground truth data, a whole brain synthetic dMRI data was generated using the ball-and-stick model through Dipy 0.16.0 [25] by the following. First, a single run of bedpostx on the CPU was done on subject mgh1005's dMRI data to produce mean values of diffusion parameters d, S0, f 1 , f 2 , ϕ 1 , θ 1 , ϕ 2 , θ 2 . Then, these mean values were used to recreate a simulated dMRI data with two fibre fractions and mean principal diffusion directions for each voxel. Finally, the simulated dMRI data was used to run bedpostx on CPU and GPU 20 times to generate 1000 samples of diffusion parameters with the goal of comparing the resulting sample mean values against the input mean values used for creating the simulated data. The random number generator seed used for generating the simulated data was different than to those used to generate these samples.

bedpostx PDF creation
Specified bedpostx and bedpostx_gpu input parameters are: 2250 MCMC iterations, of which during the latter 1250 iterations, parameter values were recorded to PDF every 25 iterations, resulting in 50 samples per PDF; monoexponential model (i.e. fit with mean diffusivity) with ARD fitting 2 fibres per voxel where appropriate. 20 trials of bedpostx and bedpostx_gpu were run with different random number generator seeds and output distributions from all trials were merged together to form 1000 samples per parameter PDF for bedpostx and for bed-postx_gpu. Furthermore, to inspect differences in L-M initialization between bedpostx and bed-postx_gpu, 20 trials of each algorithm were run again but with 1 iteration to record 1 sample close to the initializing value.

PDF distribution comparison and statistical analysis
PDF shape was statistically compared via two-sample Kolmogorov-Smirnov (KS) test to derive voxels that have different distributions between CPU and GPU (two-tailed, p < 0.05, uncorrected). Family-wise error rate was controlled by the Bonferroni method [26]. Voxels with significantly different distributions were then further categorized by their KS-scores (S) in 4 different ranges: 0.1-0.2, 0.2-0.3, 0.3-0.4, > 0.4. S scores illustrate the amount of sample deviation (e.g. S = 0.35, 35% of samples differ between two distributions). For f 1 and f 2 , CPU mean values along with absolute difference in mean CPU/GPU values were calculated and averaged for each S range. For angles, mean, standard deviation and median difference in principal diffusion directions (PDD) along with 95 th -percentile cones of angular uncertainties (CAUs) were calculated in voxels with at least one significantly different angle parameter for each pairs (i.e. ϕ 1 OR θ 1 , ϕ 2 OR θ 2 ). Maximum S score between the [ϕ,θ] pair was used when categorizing significantly different angle parameters into different S-ranges.

Effect of mixed fibre fraction and orientation samples near crossing-fibre areas
Fibre fraction parameters (f 1 , f 2 , etc.) and their associated orientations (ϕ 1 θ 1 ,ϕ 2 θ 2, etc.) could be inconsistently associated with the different underlying sub-fibre populations, especially if the fibre fractions are of comparable strength [27]. This can cause differing proportions of fibre fraction and orientation values to be labeled as one group (e.g. f 1 ,ϕ 1 θ 1 ) but labeled as another on the next trial (e.g. f 2 ,ϕ 2 θ 1 ). There is no guarantee that the labeling happens consistently and because we are merging samples from 20 different bedpostx and bedpostx_gpu trials to form the PDF distributions for comparison, it is possible that differences between the two platforms occur due to the this inconsistent labeling of sub-fibre populations. To investigate this effect of mixed fibre fractions and how much it may contribute to CPU and GPU output differences, we swapped f 1 ,ϕ 1 θ 1 and f 2 ,ϕ 2 θ 2 where f 2 > f 1 and ran the same statistical analysis on the swapped samples and compared the results against statistically different unswapped samples.

Difference in L-M initialization
Example L-M initialization difference map is shown in Fig 1 with difference greater than 1% of mean CPU values color coded. Diffusivity and baseline signal (d, S0) have increased difference towards the center of the brain, and f 2 , ϕ 2 , and θ 2 have greater amount of different voxels compared to f 1 , ϕ 1 and θ 1 .

f 1
About 1% of total number of brain voxels (5145 of 436738) had significantly different f 1 distributions. Significantly different voxels were sparsely localized throughout the brain bilaterally. Of the significantly difference voxels, 1% were found outside the three tissue class binary masks, 2% were found in cerebrospinal fluid, 34% were found in grey matter and 63% were found in white matter. The latter were located in long white matter projections, such as corpus callosum, corona radiata, internal capsules and anterior and posterior thalamic radiations ( Fig  2). Number of significant voxels, mean CPU f 1 , absolute average differences in mean f 1 , and absolute average difference in L-M initialization in each S-score regions are summarized in Table 1.
Majority of voxels had S-scores less than 0.3 (5101 out of 5139). Example PDF distribution shapes of CPU and GPU in significant voxels are shown in Fig 3. The largest S-score of 0.788 was found in the body of corpus callosum. Here, both f 1 distributions have peaks near 0.99. CPU data had a sharper peak, with all samples above 0.9. Fewer GPU samples are above 0.9, with the remainder between 0.4 and 0.9. Here, average f 1 initialization across 20 trials by CPU L-M was 0.998 while average f 1 initialization by the GPU L-M across 20 trials was 0.605. Larger average L-M initialization differences were noted for larger S-score. After adjusting for f 2 >f 1 samples by swapping, 740 voxels were no longer significantly different.
Similar results were found on another subject with significantly different f 1 voxels sparsely located throughout the whole brain. 4128 voxels were different with an average difference in magnitude at 0.0257. The magnitude of differences in mean f 1 values in CPU and GPU were 24.23% and 24.25% of simulated f 1 values respectively in the whole brain. Within the whitematter only, the differences were both 8.27% of simulated f 1 values.

f2
31% of total number of brain voxels (137240 out 436738) had significantly different f 2 distributions. Significantly different distributions were localized in grey matter (50%), cerebrospinal fluid (19%) and white matter (18%). For the white matter, they were localized in long white matter projections similar to those identified in f 1 .
89% of significant voxels had CPU or GPU mean f 2 values lower than 0.05, predominantly in areas with grey matter and cerebrospinal fluid, likely the effect of ARD estimating f 2 to zero in both bedpostx and bedpostx_gpu. To focus analysis on areas where f 2 is supported by data, we reported mean CPU f 2 , absolute average differences in mean f 2 , and absolute average difference in L-M initialization in each S-score regions only on areas with mean f 2 from CPU or GPU greater than or equal to 0.05 (Table 2).
This was the same threshold chosen by [11] when looking for secondary fibre orientations supported by ARD (Fig 4). Here, most significantly different voxels were localized in grey/ white matter junctions. Some were sparsely found bilaterally within identifiable structures such as corpus callosum, corona radiata, internal capsule, anterior and posterior thalamic radiations.
The majority of significant f 2 distribution differences had S-scores < 0.3 (13992 out of 14423). One example of a voxel exhibiting a large PDF difference in f 2 is depicted in Fig 5, where S-score = 0.415 which is the same location as the largest S-score found in Fig 4  (magenta). Here, the distribution shows a higher peak near f 2 = 0 in the GPU distribution and lower counts of samples between 0.05 and 0.2. In the CPU, there is a smaller peak near f 2 = 0 with a larger counts of samples between 0.05 to 0.2. The CPU L-M initialization step estimated f 2 = 0.143 averaged across 20 trials while GPU L-M initialization step had f 2 = 0.273 across 20 trials. Similar to f 1 , larger average L-M differences were found for larger S ranges. After adjusting for f 2 >f 1 samples, 836 voxels were no longer significantly different, and of these voxels, 680 were in areas with mean f 2 from CPU or GPU greater than or equal to 0.05.
Similar results were found in another subject where total number of significantly different f 2 voxels with mean f 2 > 0.05 was 15129, and mean magnitude difference at 0.0302. The magnitude of differences in mean f2 values in CPU and GPU were both 35.79% of simulated f 2 values where simulated f 2 > 0.05. Within the whitematter only, the differences were 17.98% and 18.00% of simulated f 2 for CPU and GPU, respectively. ϕ 1 and θ 1 198166 out of 436738 total brain voxels had significantly different ϕ 1 or θ 1 distribution. Significantly different distributions were localized predominantly in areas of grey matter (50%) and cerebrospinal fluid (21%). They were also found in the white matter (15%), with some key white matter structures such as corpus callosum, internal capsules, corona radiata and anterior and posterior thalamic radiations containing significantly different distributions (Fig 6).
Mean and median angle differences, and average 95 th percentile CAUs for each S-score range are tabulated in Table 3.
Again, the majority of these voxels have S < 0.3 (183714 out of 196081). Mean difference in angles of principle diffusion directions in all significantly different voxels was 2.194( stdev = 4.232˚) while the median difference was 1.064˚. In all significantly different ϕ 1 and θ 1 voxels, the average angular difference between the 95 th percentile CAUs for CPU and GPU is 0.013˚(CPU 53.224˚; GPU 53.211˚; see Table 3). Because ϕ 1 and θ 1 parameters are more meaningful in white-matter where anisotropy is higher, angular differences and 95 th percentile CAUs for each S-score range in white-matter only are tabulated in Table 4.
Overall, higher average difference in mean PDD and lower CAUs were found in significantly different voxels confined to the white-matter. An example distributions of ϕ 1 and θ 1 is shown in Fig 7. The depicted distributions in Fig 7 came from a voxel located in the corpus callosum (marked by � in Fig 6.) with S score = 0.501. In this voxel, two sharp peaks were found for CPU ϕ 1 samples whereas only a single peak was found for GPU whose location coincided with one of CPU's peak values. For θ 1 samples, GPU only had one sharp peak, whereas CPU had three peaks, one of which was similar to the GPU. The effect of distribution shape difference on diffusion direction is illustrated in Fig 8. It is noted that while the GPU principal Table 1. Significantly different f 1 distributions. diffusion directions are all estimated similarly, CPU principal diffusion directions have two separate clusters that are located antiparallel to each other. Adjusting for f 2 > f 1 , 18231 voxels became not significantly different and of these, 6755 were in the white-matter.

Average |Mean f1 CPU-Mean f1 GPU| Average |L-M CPU-L-M GPU|
Similar results were found in another subject's data with 190602 significantly different ϕ 1 ,θ 1 voxels with average mean PDD difference 2.559˚, with 95-percentile CAUs of 38.943˚and 38.838˚for CPU and GPU respectively. Compared to simulated data's mean principal direction, both CPU and GPU produced 8.2˚mean difference and 3.9˚median difference in the white matter.  Table 2. Significantly different f 2 distributions. 224337 out of 436738 total brain voxels had significantly different ϕ 2 or θ 2 distributions. Significantly different distributions were localized in grey matter (51%), cerebrospinal fluid (16%) and white matter structures (22%) such as corpus callosum, corona radiata, internal capsule, and the anterior and posterior thalamic radiations (Fig 9). Mean and median angle differences along with 95 th percentile CAUs for each S-score range are tabulated in Table 5. Again, most voxels have S-scores < 0.3 (213437 out of 223309). Overall mean difference in principle directions is 36.776˚with median difference of 29.397˚. Average angular uncertainty is 83.490˚for both CPU and GPU.

Average |Mean f2 CPU-Mean f2 GPU| Average |L-M CPU-L-M GPU|
With ϕ 2 and θ 2 , it is more meaningful to focus on white matter and areas where f 2 > 0.05 (i.e. where ARD has deemed appropriate to fit a second fibre orientation). Mean PDD difference and 95 th -percentile CAUs in the white-matter and f 2 > 0.05 for each S-score range are tabulated in Table 6.
Like ϕ 1 and θ 1 , the CAUs were lower when focusing in on the white matter region. Also, the difference in mean PDD was lower for each S-score range. Distributions of PDDs derived from ϕ 2 , θ 2 samples as well as ϕ 1 , θ 1 in a representative voxel with S = 0.453 for ϕ 2 , θ 2 and S = 0.451 for ϕ 1 , θ 1 near the right dorsolateral frontal are illustrated in Fig 10. There were two clusters of PDD formed by [ϕ 1 , θ 1 ] and [ϕ 2 , θ 2 ] for both CPU and GPU. Here, it was observed that majority of CPU's [ϕ 1 , θ 1 ] samples (N = 600) and GPU's [ϕ 2 , θ 2 ] samples (N = 850) had similar PDD distributions (angular difference between average PDDs of 2.04˚) while CPU's [ϕ 2 , θ 2 ] samples had similar PDD distribution with GPU's [ϕ 1 , θ 1 ] (angular difference between average PDDs of 1.85˚). It was noted that this voxel had mean f 1 = 0.32 and mean f 2 = 0.31 for both CPU and GPU, indicating that two fibre orientations were viable.
For each S-score range, average of mean PDD difference, standard deviation of mean PDD difference, median of mean PDD difference and 95 th -percentile cone of angular uncertainty are tabulated for significantly different voxels found in the white-matter where mean f 2 > 0.05.
In another subject, similar results were found for ϕ 2 ,θ 2 distributions where 213212 voxels were significantly different with mean PDD difference of 37.161˚. In the white matter where mean f2 > 0.05, the mean PDD difference was 7.433˚with 75.2˚95 th -percentile CAUs for both CPU and GPU. Compared to simulated data's mean principal direction, both CPU and GPU produced 18.4˚mean difference and 8.5˚median difference in the white matter where simulated f 2 values were > 0.05

Discussions and conclusions
A total of 2620428 pairs of distributions were created and compared across the whole brain. 78% of those distributions showed no significant difference between CPU and GPU. Of the significantly different distributions, 13% were localized outside the three tissue class binary masks, 50% were localized in grey matter, 18% in cerebrospinal fluid, and 19% in white matter, localized within the corpus callosum in the midline, and bilaterally within the corona radiata, internal capsule, and the anterior and posterior thalamic radiations. When analysis was repeated on another subject, similar results were observed. Significantly different f 1 distributions in a prominent white-matter structure, such as the body of corpus callosum in the midline as displayed in Fig 2 (magenta label), have been noted with more than half of the samples produced by CPU and GPU differing in value. In general, the corpus callosum contains a well-

PLOS ONE
defined fibre bundle in the Left-to-Right orientation and thus we expected a higher f 1 with lower uncertainty. Instead, we observed that CPU estimated all samples to be > 0.9, while the GPU had a larger spread of samples between values of 0.4 and 0.9. Since the corpus callosum in human brain would generally be found near the centre of the whole brain diffusion data, the relative operation order at which this voxel gets processed between the CPU and GPU would have differed greatly (earlier in GPU vs. later in CPU), and this may explain why significant difference was observed even in a prominent structure such as the corpus callosum. We also examined the initialization stages of L-M fit and noted that the results differed between CPU and GPU. For the L-M fit algorithm between CPU and GPU, the difference in operation order of when a particular voxel gets initialized is less impactful since L-M fit simply minimizes sum-of-squared residuals [6]. However, the different CPU and GPU-CUDA math libraries can result in different initialization values. Although GPU-CUDA math operations are double-precision capable [14], increased peak performance (e.g. higher speed-up) is found when single-precision operations are used in their place which may differ in precision compared to CPU math operations [28]. Our findings from Tables 1 and 2 show that larger L-M initialization difference at the start of MCMC sampling results in larger S-scores in significantly different distributions. This suggests that differences in PDF samples appeared to be stemming from a combination of the following: 1) differing starting points after the L-M fit, 2) differing operation order and 3) difference in math library. It is also observed that the L-M initialization difference maps (Fig 1) spatially resemble the inverse of a typical SNR map from a multi-channel MR head-coil [29,30]. Though we did not directly compare the SNR map of this dataset with L-M differences, we note and speculate that the larger difference in L-M initialization values between CPU and GPU that are mostly found near the inner most structure of the brain could be possible due to less SNR typically found in this region of the brain due to the head-coil SNR profile. This could also possibly explain why large difference was even noted in prominent white matter such as body of corpus callosum: the scan orientation is such that this structure is farther away from the head-coil elements. It could be possible to reduce the amount of differences between the two algorithms by acquiring data with higher SNR as much as possible, or to use DWI post-processing steps to estimate and reduce the noise (e.g. MRTrix dwidenoise) from the data before starting the parameter estimation step [31]. In the absence of groundtruth data to compare, we have created simulated ball-and-stick whole brain data using [25] to establish a benchmark error margins in output parameters of CPU and GPU. We found that for all bedpostx parameters, CPU and GPU both produced samples with similar magnitude of differences against the simulated data. This suggests that CPU and GPU would perform similar to each other when estimating against a ground-truth data, if available. We found that S-scores of significantly different distributions were no greater than 0.3 for 94% of significantly different distributions, i.e. 30% or less samples caused the difference. Distributional shape differences were characterized by: a) peak height differences for fibre fractions and b) number of peaks and peak value differences for diffusion direction angles. Larger difference in shape resulted in larger difference in mean values or principle diffusion direction angles. Mean angular differences in principle diffusion directions were 2.751˚and 36.776˚for significantly different [ϕ 1 ,θ 1 ] and [ϕ 2 ,θ 2 ] respectively. Their 95 th percentile CAUs were 53.2˚and 83.5r espectively. We see the effect of larger CAUs, especially for [ϕ 2 ,θ 2 ], stemming from angle samples that are antiparallel to each other or interchangeable angle samples between two viable sub-fibre population. We saw from Fig 8 that between CPU and GPU, one may produce coherent PDD (e.g. GPU samples from Fig 8) while the other produce two clusters of PDD that are located antiparallel to each other (e.g. CPU samples from Fig 8). Also, as depicted from Fig 10, two viable sub-fibre populations in a region where crossing-fibres are present can be mislabeled between CPU and GPU and thus create larger angular differences and uncertainties. When swapping those fibres to align for correct sub-fibre population, the differences between PDDs decreased overall. Algorithms such as probtrackx does tract streamlining with tract propagation constraints that propagate streamlines smoothly, and avoid internal looping or sharp turns. This is achieved by treating antiparallel angles as the same (i.e. multiplying antiparallel angles by -1 prior to propagation), and sampling from fibre-population that has minimal angular difference from previous propagation direction. These constraints would effectively allow consistent tract streamlines to be produced from CPUs and GPUs, despite the difference in PDF distributional shapes in the PDD angles. Previously, Jbabdi et al. have reported similar inconsistency in sub-fibre population labeling in bedpostx and they consolidated the orientations of sub-fibres by performing a swaping operation, similar to our method of swapping the angular samples where f 2 >f 1 [27] which resulted in better aligned mean directions between CPU and GPU. We note that one of the limitations of this current work is that there was no investigation into effect of using multi-shell models while doing the comparisons. It is reasonable to suggest that with higher b-values there will be better angular resolutions which might lead to better agreement between CPU and GPU results. The challenge in investigating this would be that the CPU bedpostx process will take far too long compared to the GPU as much more data need iterative estimation in a linear fashion, and one would require the use of multi-core High Power Computing resources to do this type of investigation in a reasonable amount of time. Still, b = 1000 with a monoexponential model estimation is valuable to investigate between CPU and GPU as most clinical MRI can acquire this type of data with conventional MR system setup and speed-up in bedpostx in the GPU can be most effective in this type of front-line clinical setting. Another limitation of this work is that the randomnumber generator type is not the same between CPU and GPU and thus there is no way of telling how much effect the random number generators have on the differences observed between the two algorithms. The authors have looked at preliminary data where the GPU bedpostx algorithm was modified to use the linear-congruential random number generator to obtain the same amount of samples and when compared against the CPU samples, they appeared to have similar amount and magnitude of difference as this current work, which suggests the effect of random number generator in producing differing results would be small. This would then lead us to believe that sample differences are more attributable to difference in implementation of CPU-GPU precision points, math libraries between CPU and GPU-CUDA and more importantly the operation order: L-M initialization then MCMC sequentially v.s. L-M parallel then MCMC parallel. As DWI data are collected with greater amount of gradient directions, combination of different b-values, higher-resolutions, conventional DWI post-processing steps will require more computational resources to finish processing in a reasonable amount of time, and GPUs can offer qualitatively the same results with minimal quantitative difference compared to underlying uncertainty with excellent speed [4]. In summary, although significant differences were found between outputs of CPU and GPU bedpostx parameter distributions, differences may have limited impact upon stochastic tractography with single-shell DTI data: a) differences were observed in only 22% of total distributions; b) differences were sparsely distributed in major tract areas; c) differences in fibre orientations were small compared to background angular uncertainty. The latter appears to arise from antiparallel angles and random assignment of principle directions in the presence of multiple viable sub-fibre populations which are not problematic in fibre-tracking using ProbtrackX. It appeared that the combination of differences in operation orders between CPU and GPU and math-library differences affects the magnitude of difference between the two algorithms and collecting data with optimized SNR along with offline post-processing steps to reduce noise level may improve output consistency between the two algorithms.