An Effective CUDA Parallelization of Projection in Iterative Tomography Reconstruction

Projection and back-projection are the most computationally intensive parts in Computed Tomography (CT) reconstruction, and are essential to acceleration of CT reconstruction algorithms. Compared to back-projection, parallelization efficiency in projection is highly limited by racing condition and thread unsynchronization. In this paper, a strategy of Fixed Sampling Number Projection (FSNP) is proposed to ensure the operation synchronization in the ray-driven projection with Graphical Processing Unit (GPU). Texture fetching is also used utilized to further accelerate the interpolations in both projection and back-projection. We validate the performance of this FSNP approach using both simulated and real cone-beam CT data. Experimental results show that compare to the conventional approach, the proposed FSNP method together with texture fetching is 10~16 times faster than the conventional approach based on global memory, and thus leads to more efficient iterative algorithm in CT reconstruction.


Introduction
Computed tomography (CT) has become one of the most widely used non-invasive medical imaging systems. As the rapid development of multi-slice CT, 3-D CT has replaced the 2-D CT in radiology routines by providing fast 3-D scanning. Recently, extensive studies have demonstrated that iterative methods, based on an accurate system model, are capable of providing better reconstruction quality than analytical methods, especially under low-dose CT scans [1][2][3][4][5][6][7][8][9][10]. However, due to the high computation cost in iterative reconstructions, FBP (Filtered Backprojection) based analytical reconstructions still take the main horsepower in current clinical reconstruction for 3-D CT [11]. works related to this study are reviewed. The proposed FSNP scheme is explained in detail in section 3. The parallelization to normalized voxel-based back-projection operator is also discussed. In Section 4, the proposed approach is applied to a cone-beam CT system Performances of different parallelization modes were compared in this section. Experiment results show that, with well synchronized operation, the FSNP method is superior in terms of computation efficiency to the conventional sampling strategy with fixed sample intervals, and the utilization of texture memory leads to further improved acceleration performance. This study also shows that the FSNP method works well with voxel-driven back-projection operator to give efficient iterative reconstruction. Table 1 lists the abbreviations used in this paper.

Method
Existing algorithms for projection and back-projection can be roughly classified into three modes: ray-driven, voxel -driven and distance-driven. The ray-driven algorithm is well suited for projection and the voxel-driven algorithm works well in back-projection. However, the raydriven back-projection tends to bring grid artifacts into image domain whereas voxel-driven projection brings grid artifacts to the projections [23]. With a more accurate geometric modeling, the distance-driven methods often lead to better performance than ray-driven projection and voxel-driven back-projection by mapping voxel and detector boundaries into the same axis [24]. Nevertheless, parallelization of the distance-driven method is highly limited by the intensive calculation of intersection areas along each ray.
Although ray-driven projector and pixel-driven back-projector are intrinsically unmatched, Zeng pointed out in [25] that unmatched projection/back-projection pair is sometimes beneficial to reconstructions, for example, a matched projection/back-projection pair with weighted line lengths often produces ring artifacts, whereas an unmatched pair, in which the projector is ray-driven with line-length-weighting but the back-projector is voxel-driven with bilinear interpolation, can effectively remove ring artifacts. Based on this, in this study we choose raydriven projection in implementing the proposed FSNP strategy, and perform reconstructions by combining it with voxel-driven back-projection operator.

Projection
As illustrated in Fig 1, the ray-driven algorithm works by tracing rays through scanning objects. Each ray can be simplified to an ideal line connecting the source and the detector center. Projection values are obtained via integrating sampled values along each projection ray, which is realized by a weighted intensity summation of the voxel intensities in each ray. Different methods have been proposed to calculate the weights for voxels in projection. For instance, one can use intersection length between line and voxels [26][27], or the interpolation from neighboring voxels [27]. The classic intersection length summation based model is given in Fig 1(a). In this mode, the ideal line indexed i passes through pixel j with intensity x j . With the intersection length of line i and pixel j denoted by l ij , the contribution of pixel j to the projection and the summed projection value can be quantified by x j ×l ij and the projection value y i = ∑ j x j ×l ij , respectively. To alleviate the intensive calculation for each intersection length l ij in this projection model, the line summation based model are often approximated by the linear interpolation model given in Fig 1(b). In this model, pixels surrounding each sample point in the ray are used to estimate the interpolated value C p k of point p k , with weights determined by their spatial distances to the sample point. The projection value amounts to y i ¼ X k C p k for detector i. The outline of ray-driven based projection algorithm is given below. Set first sampling point P 0 to S, and set k to 0; Set projection value y i to 0; In ray-driven model, each projection value is calculated by integrating the sample intensities along each projection ray. Calculation of all the ray-driven projection values can be accelerated via CUDA based parallelization. Generally, the parallelization of ray-driven projection includes two options: per ray per thread (PRPT) mode and per ray per block (PRPB) mode. In PRPT mode, each projection value is independently calculated inside one single thread, which allows easy and straight parallelization. But the acceleration efficiency of PRPT mode tends to be lowered by the looping integration in each kernel function. As to the PRPB mode, each projection value is calculated within one block, and each thread in the block calculates the value for one sampling point along this projection ray. The intensive atomic operations for the intensity summing operation in PRPB mode can be effectively accelerated by sum reduction with the shared memory. So, it is generally believed that the PRPB mode is more efficient than the PRPT mode in projection parallelization [28].
The sampling point numbers along different projection lines are not identical due to the varying intersection lengths for different projection lines. The varying sampling numbers will lead to unsynchronized operations in threads or blocks for PRPT or PRPB modes, which result in lowered parallelization efficiency. In this paper, a method of Fixed Sampling Number Projection (FSNP) is proposed to overcome this. As illustrated by Fig 2, this FSNP approach assumes that the X-ray attenuation is negligible outside a pre-specified field of view (FOV). We fix the sampling number along each ray and perform uniform sampling on the line segments inside the FOV, thus allows an easy synchronization for both the PRPT and PRPB modes.
In this FSNP strategy, to save computation cost, the parameters for each projection line (including intersection length, coordinates of intersection points and line direction) should be calculated and stored in memory before the projection operation. However, when the conventional cubic FOV (shown in Fig 2(a)) is used, we have to store these parameters for each projection view respectively. For a CT system with 512×512 detector plane and 360 angles, it takes about 2.5Gb memory to store the parameters for all projection views, which imposes a large saving burden for many GPU devices. To solve this, we define the FOV in cylinder and sphere shapes (as shown in Fig 2(b) and 2(c)) for long object and short object scans, respectively. In this way, for one specific detector, the intersection length keeps the same for different views, and the coordinates of intersection points and line direction can be easily calculated according to the rotational symmetry of FOV. Hence we only have to store the parameters for one view. Furthermore, in order to avoid unnecessary computation on those rays with too short intersections with the FOV, only the rays with longer intersection (with the FOV length) than one preset threshold r T are considered. Normally the FOV should be large enough to ensure that no object data is removed by excluding those rays with small intersection lengths. The pseudo code for the proposed FSNP algorithm is given as follows for both the PRPT and PRPM modes:

Back-projection
As illustrated in the 2-D schematic diagram of the pixel-driven back-projection (Fig 3), a line (ray) connecting the radiation source and voxel center intersects with the detector plane for each pixel. Routinely, the linear interpolation or kernel function convolution can be used to estimate the back-projected value with respect to intersection locations [23][24][25][26][27][28]. Fig 3 also shows that the back-projection operator is in fact a linear interpolation: x j = (p 1 l 2 + p 2 l 1 )/(l 1 + l 2 ), where l 1 + l 2 is the distance between the two neighboring detector centers, and p 1 and p 2 are the corresponding projection values for the neighboring detectors. The outline of the voxeldriven based back-projection algorithm is also given below.
Voxel-driven based back-projection algorithm Set point source position S; Compute the coordinates of each detector center; For j = 1 to J (J is the number of voxels) Compute the geometrical line equation that connects S and the voxel j, noted as line j ; Compute the intersection point of line line j and the detector array, noted as y j ; Compute the back-projection value x j according to the position of y j via interpolation or convolution; End For As can be seen in above algorithm outlines, the back-projection algorithm allows a straight parallelization because the calculation related to each back-projected voxel is independent and can be easily handled by thread based parallelization. We should note that the back-projection operator in iterative reconstructions is normalized to match the projection operator by dividing the projection value by a pre-computed intersection length. The pseudo code for the parallelized back-projection is given as follows:

Experiment Configuration
Experiments with a simulated cone beam CT system were performed with configurations listed in Table 2. The involved abbreviations are listed in Table 2. A 3-D Shepp-Logan phantom is used. The experiments were conducted for low resolution mode and high resolution mode. In low resolution mode, the projection size is 512×512, and the scanning object size is 256×256×256; in high resolution mode, the projection size is 1024×1024, and the scanning object size is 512×512×512. We use sphere FOV shown in Fig 2(c) in the experiment. The implementation runs under a mobile workstation with Intel Core™ 2.40GHz (Quad-Core), 24Gb RAM, NVIDIA GTX780M graphic card (4Gb global memory), and the developing environment is Microsoft Visual Studio 2008 with CUDA version 5.5.

Computation Cost
We evaluated the efficiency of the proposed FSNP method for projection computation in this section. Both PRPB and PRPT modes were implemented. For each mode, both global memory and texture memory versions are considered. The computation time is listed in Table 3 in the unit of millisecond (ms). We can see in Table 3 that the PRPB mode is faster than the PRPT mode in the case when global memory is used, but the projection with PRPT mode becomes more efficient when texture memory is used for interpolation. We can also see that in global memory version, the time cost increases linearly as the sampling point number increases. But for the texture memory version, increasing sampling point number did not result in a linear increment of computation cost. The computational time increases little when sampling point number was doubled for texture based PRPT mode (around 10% for low resolution mode and 7% for high resolution mode). This means that developing projection model with higher resolution is allowed without much extra computation cost for the proposed FSNP approach when using PRPT mode with texture memory. We also compared the computation time of the FSNP method with the conventional FSIP (Fixed Sampling Interval Projection) method in PRPT mode in Table 3. For the FSNP method, the threshold r T is set to 30mm. As to the FSIP method, we adjusted the sampling density to ensure that the average sampling point number is close to the fixed sample number in FSNP method. We can see that the proposed FSNP method is faster than the FSIP method due to the better synchronization.
From Table 3, we can see that, in the texture memory version with PRPT mode, the computational time remains at same level when the sampling point number doubles for both FSNP and FSIP modes. Thus the blob based projection can be used to give better suppression of streak artifacts and artifacts [29][30]. Based on [31], the kernel function of blobs is constructed using generalized Kaiser-Bessel (KB) window b n,a,α : b n;a;a ðrÞ ¼ 1 where r is the radial distance to the origin. I n is the modified Bessel function with order m, α is a parameter controls the shape of the blob, a is the radius. Within our experiment, the parameters were set as follows: n = 2, α = 10.4, a = 0.84 mm (two voxels). In the current CUDA version, only nearest neighbor and linear interpolation modes are supported by texture fetching. Both modes can be applied for the computation of kernel computation. If linear interpolation mode is considered, for each sampling point, the contribution of each nearby voxels is determined by the distances between the voxel centers and the sampling points. The computation of the distances has to be implemented serially in threads with PRPT mode, which implies lowered computation efficiency. We use nearest neighbor instead. A set of discrete blob kernel functions were pre-stored in global memory (100 3 in our experiment), each of the kernel function are with a different translation from the origin. The translation fk x ; k y ; k z g 2 À 1 2 ; 1 2 = Þ; À 1 2 ; 1 2 = Þ; À 1 2 ; 1 . For each sampling point, the closest blob kernel function is chosen for the interpolation. The computation cost with blob based kernel function in Table 4 indicates that the blob based projection can be accelerated by the proposed PRPT method with texture memory. Table 4 also shows that, similar to the linear interpolation based FSNP, the increment of computational time is around 14% to 30% when the sampling number is doubled for the blob based FSNP with PRPT mode.
The above result indicates that even with the costly operation of texture binding, texture memory can be used to improve efficiency in projection parallelization. Table 5 compares the computational cost for back-projection of parallelized voxel-driven algorithm using global and texture memories. Results in Table 5 confirm that the texture memory can also be used to accelerate back-projection.

Projection and Reconstructions
In this section, we perform CT reconstructions with proposed FSNP. Conventional ray-driven and distance driven methods are also considered in the reconstruction for comparison. Root mean square error (RMSE) defined in Eq (2) was used to quantify the difference between the reconstructed 3-D images and original phantom 3-D images: wheref is the phantom image as ground truth, and f denotes the reconstructed image. kf k denotes the L 2 norm calculation. We performed FDK reconstruction with Ram-Lak filter. 3-D Shepp-Logan phantom was used with the system configuration in Table 2 (low resolution mode), and the phantom data was projected into 360 angles. The results are shown in Fig 4. We do not provide the result for the voxel-driven projections because the grid artifacts in such projection mode often results in severe artifacts in FDK reconstruction [26]. With respect to the phantom images in Fig 4(A), obvious streak artifacts can be observed in Fig 4(B), 4(C), 4(D) and 4(E); the combination of linear interpolation based FSNP and voxel-driven back-projection operator can provide better results than the classic ray-driven projector and back-projector pair. We can also see in Fig 4(F) that the blob based FSNP leads to the reconstructions with effective artifact suppression and the lowest MSE values, but at the cost of blurred edges. This is because the simulated projections generated by blob based FSNP are blurred by the weighted summation of neighboring voxels.
We then evaluated the performance of FSNP in iterative reconstruction algorithm. The OSEM (Ordered Subsets Expectation-Maximization) algorithm with 30 subsets and 100 iterations was chosen as the iterative algorithm. Results in Figs 5 and 6 show that the linear interpolation based FSNP together with voxel-driven back-projector can provide results similar to the matched distance-driven pair, both visually and quantitatively. Although the measurements are assumed to be independent form each other in ideal projection model, they are in fact somehow correlated, due to the imperfect collimation and scatter effect. The blob-based kernel takes such correlation into consideration, and thus leads to a more realistic model in characterizing the residual between observed measurements and image projections, especially for differences near the edges. As a result, we may see that different from the results in Fig 4(F), the reconstructed images in Fig 5(D) indicate that the blob based FSNP also leads to a good preservation of edges in addition to artifact suppression. Fig 6 plots the line profile (the vertical blue line in the left image in Fig 5(A)) of the reconstructions in Fig 5, from which we can see that the reconstruction with blob based FSNP results in a smoother profile with a better match of the reference profile than others (zoom 1); the distance driven projection provides blurred edges (zoom 2); and the blob based FSNP has the best performance in recovering peak value (with the best matched profile with phantom data in zoom 3).
The proposed projection methods are also validated by real scan data from a micro CT system. A rat was scanned under 40kV tube voltage and 200mA tube current. The projection sequence contains 360 projections over 360°. The projection image size is 922×748 with pixel size 0.1mm 2 ;   . Nevertheless, we can also see that the improved edge preservation is also accompanied with amplified noise in the reconstructed images.

Discussion
CUDA technology provides a well-established software platform for developers to design parallelized workflow with C-style code, with direct access to the virtual instruction set and GPU memories. In this paper, a strategy termed Fixed Sampling Number Projection (FSNP) is devised to ensure the operation synchronization along projection lines in parallelizing ray- driven projection based on CUDA framework. The conventional cubic FOV is replaced by a rotational symmetric FOV to save computation cost in geometry parameter calculation. Backprojection parallelization is relatively simple because the computation involved in each kernel for each back-projected voxel is independent to the computation in other threads. But as to projection operator, which calculates projection values via line integration of sample intensities, the involved accumulation operation often results in racing condition that limits the computation speed. Such problem can be alleviated by memory management and synchronization optimization.
Although there is a common view that PRPB mode is more efficient than PRPT mode in parallelizing CT projection using CUDA, experiment results have shown that with texture fetching, the computation time for PRPT mode is 43%~67% less than that of the PRPB mode. This advantage is brought by the locality property of texture memory when fixed sampling number is used for each ray. It is also found that the cost in texture binding occupies a large part in the whole computation. For instance, as to the PRPT mode with linear interpolation, in which the computational time for each projection in FSNP was respectively 17.09ms and 18.95ms for the cases with 256 and 512 sampling points per ray, there is 14ms cost for texture binding. For high resolution mode, the texture binding for 512×512×512 objects took about 104ms. Compared to the case with projection operator, the texture binding for back-projection requires less computation time due to the smaller 2-D data size (about 0.9ms and 2.8ms for 512×512 and 1024×1024 projection images, respectively). Despite this, the utilization of texture memory allows building a more accurate model by increasing the sampling points, without significantly increasing computational load.
Although fixing the sampling number in each line leads to more effective parallelization, it also results in inconsistent sampling density for projection rays, due to the different intersection lengths for projection rays within the FOV. For instance, the centered projection ray is often with a higher sampling density for cubic FOV; yet it is with a lower sampling density in the cases of cylinder or sphere FOVs. In the proposed approach, the sampling number is set

Conclusion
This paper proposes an effective parallelization scheme FSNP for the projection in iterative CT reconstruction algorithms. In this FSNP method, the sampling point number on each projection ray is fixed to ensure the synchronization of parallel computing. Texture memory is also used in the FSNP approach to further improve the computation efficiency. Experiment results show that the proposed approach with texture memory is 10~16 times faster than the global memory version in iterative reconstruction. Although it is widely accepted that the PRPB mode is suitable for the parallelization of the projection operator, this study indicates that the PRPT mode is about twice faster than PRPB mode with the proposed approach. We also found that the proposed ray-driven base FSNP method works well with voxel based back-projection operator in reconstructions. By introducing blob based kernel functions into the FSNP method, better reconstruction results than the distance-driven projection and back-projection pair can be obtained by the proposed approach. Better performance in noise suppression can be expected to be obtained by incorporating the noise property into kernel building.