A deep learning reconstruction framework for X-ray computed tomography with incomplete data

As a powerful imaging tool, X-ray computed tomography (CT) allows us to investigate the inner structures of specimens in a quantitative and nondestructive way. Limited by practical implementation conditions, CT with incomplete projections occurs quite often. Conventional reconstruction algorithms do not deal with incomplete data easily: they usually involve complicated parameter selection, are sensitive to noise and are time-consuming. In this paper, we report a deep learning reconstruction framework for incomplete-data CT. It tightly couples the deep learning U-net with a CT reconstruction algorithm in the domain of the projection sinograms. What the U-net estimates is not the artifacts caused by the incomplete data, but the complete projection sinograms. After training, the framework is determined and can reconstruct the final high-quality CT image from a given incomplete projection sinogram. Taking sparse-view and limited-angle CT as examples, the framework has been validated and demonstrated with synthetic and experimental data sets. Embedded with CT reconstruction, it naturally encapsulates the physical imaging model of CT systems and is easily extended to deal with other challenges. This work helps push the application of state-of-the-art deep learning techniques in the field of CT.


Introduction
The invention of X-ray computed tomography (CT) has led to a revolution in many fields such as medical imaging, nondestructive testing and materials science. It overcomes a fundamental limit of radiographic imaging, namely that three-dimensional objects are projected onto a two-dimensional plane and the depth information becomes invisible. Employing a set of two-dimensional projections to reconstruct a three-dimensional object, CT allows us to investigate the inner structures in a quantitative and nondestructive way.
Image reconstruction plays an important role in the development of CT, and many reconstruction algorithms have been proposed over the last decades. Among the deep learning approaches, a residual architecture [30] was proposed for sparse-view CT reconstruction. The input of this architecture is the initial corrupted CT image from FBP or another algorithm. It first estimates the topologically simpler streaking artifacts from the input image and then subtracts the estimated result from the input image to get an artifact-free image. This method is independent of the CT reconstruction itself and works in an indirect mode. Using a multi-scale wavelet transform, its authors extended the work to limited-angle CT reconstruction [31]. Jin et al also proposed a deep convolutional neural network [32] for inverse problems in imaging. It is similarly independent of the CT reconstruction, but the estimated results are the final CT images rather than the artifacts. With dilated convolutions, Pelt et al introduced an architecture [33] that captures features at different image scales and densely connects all feature maps with each other. Their method is also independent of the CT reconstruction, but achieves accurate results with fewer parameters, reducing the risk of overfitting the training data. Yang et al presented a deep convolutional neural network (CNN) method that increases the acquired X-ray tomographic signal during low-dose fast acquisition by improving the quality of the recorded projections. Short-exposure-time projections enhanced with CNNs show signal-to-noise ratios similar to long-exposure-time projections. However, it cannot suppress the artifacts caused by sparse-view or limited-angle scanning. The reported results have demonstrated that these approaches are much faster than the conventional algorithms and also easier to implement.
In this paper, we report a new deep learning reconstruction framework for X-ray CT with incomplete projections. Different from the above-mentioned techniques, it tightly couples the deep learning U-net [14] with the FBP algorithm in the domain of the projection sinograms. The estimated results are not the CT images or the artifacts, but the complete projection sinograms. Embedded with CT reconstruction, it naturally encapsulates the physical imaging model of CT systems and is easily extended to deal with other challenges such as beam-hardening, scattering and noise. During training, the framework first obtains forward projections from the initial reconstruction produced by applying FBP to the original incomplete projection sinograms. Taking the complete projection sinograms as the target, these forward projections are then input into the U-net, whose parameters are obtained by deep learning. After training, the framework is determined and can obtain the final CT image from a given incomplete projection sinogram. Taking sparse-view and limited-angle CT as examples, the framework has been validated with synthetic and experimental data sets. This work helps push the application of deep learning in the field of CT.

Proposed framework
Fig 1 shows the proposed deep learning reconstruction framework for X-ray CT with incomplete projections. It is based on FBP and the deep learning U-net, and is therefore called DLFBP. The framework consists of four parts. The initial FBP reconstruction of the original incomplete sinogram is the first part. The second part is the forward projection operator, which is applied to the initial reconstructed image to obtain a corrupted sinogram. The convolutional neural network U-net is the third part and performs the deep learning. The final FBP reconstruction with the complete sinogram from the third part is the last part of the framework. The third part is the distinctive one and works in the domain of sinograms.
Eq (1) is the well-known two-dimensional equi-spaced fan-beam FBP algorithm, which is adopted to reconstruct the CT image in this article. It also has many extended versions for different CT scanning configurations. In this equation, β(x, y) represents the CT image in the Cartesian coordinate system, U the geometrical weight factor, P(w, ϕ) the projection sinogram, h the inverse Fourier transform of the ramp filter, w the index of the detector channels and ϕ the CT rotation angle. The symbol "∗" denotes convolution.
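For reference, the convolution-backprojection structure of Eq (1) in this notation can be sketched as follows. This is the standard form of the equi-spaced fan-beam FBP (e.g. as given by Kak and Slaney); since the equation body is not reproduced in the text, the source-to-origin distance D and the detector pre-weight shown here are assumptions:

```latex
\beta(x,y)=\int_{0}^{2\pi}\frac{1}{U^{2}(x,y,\phi)}
\left[\Bigl(\frac{D}{\sqrt{D^{2}+w^{2}}}\,P(w,\phi)\Bigr)\ast h(w)\right]_{w=w(x,y,\phi)}
\,\mathrm{d}\phi
```

where w(x, y, ϕ) is the detector coordinate at which the ray passing through (x, y) is recorded at rotation angle ϕ.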
Eq (2) is the forward projection operator. In this equation, P(w, ϕ) represents the forward projection recorded by the wth detector channel at the rotation angle ϕ, β(x, y) the initial reconstructed CT image and l the forward projection path.
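The line-integral projector of Eq (2), P(w, ϕ) = ∫_l β(x, y) dl, can be sketched numerically. The snippet below is a minimal parallel-beam simplification for illustration only; the article uses a fan-beam geometry, and this rotate-and-sum projector is an assumption, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import rotate

def forward_project(image, angles_deg):
    """Minimal parallel-beam projector for Eq (2): rotate the image to each
    view angle and integrate (sum) along the ray direction."""
    sino = np.empty((len(angles_deg), image.shape[1]))
    for i, ang in enumerate(angles_deg):
        # bilinear rotation keeps the object mass approximately constant
        rotated = rotate(image, ang, reshape=False, order=1)
        sino[i] = rotated.sum(axis=0)  # line integrals for this view
    return sino
```

Summing each projection row recovers the total attenuation mass of the object, which is a quick sanity check for any discrete projector.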
Eqs (3) and (4) formulate the U-net. In these two equations, f() represents the extractor that recognizes and characterizes the context of the input X in the encoding path, Λ[] the nonlinear mapping function, F{} the constructor that produces part of the output Ŷ, and W and bias the parameters trained and determined in the neural network. One can refer to [14] for more details.
The parameter W is a vector and is trained with the following steps. Firstly, it is initialized with a Gaussian distribution. Then the mean square error E between the network output and the ground truth is calculated over one batch used for training. Finally E is used to update W via the backward propagation expressed in Eq (6), where t represents the iteration number, j the parameter index in the vector W and η the learning rate. When the learning rate η is large, the parameters in the network are updated substantially, which helps reach convergence quickly. However, learning oscillation may happen if it is too large. In implementation, gradually reducing the learning rate is a good strategy.
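The update rule of Eq (6), W_j(t+1) = W_j(t) − η ∂E/∂W_j, together with a gradually reduced learning rate, can be illustrated on a toy least-squares problem. This is a sketch only; in the actual framework the gradients come from backpropagation through the U-net, and the toy data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # toy batch of inputs
w_true = np.array([1.0, -2.0, 0.5])  # parameters the training should recover
y = X @ w_true

# Gaussian initialization of the parameter vector W
W = rng.normal(scale=np.sqrt(2.0 / X.shape[1]), size=3)
eta = 0.1                            # learning rate
for t in range(200):
    err = X @ W - y
    E = np.mean(err ** 2)            # mean square error over the batch
    grad = 2.0 * X.T @ err / len(y)  # dE/dW_j for every index j
    W = W - eta * grad               # Eq (6) update
    eta *= 0.995                     # gradually reduce the learning rate
```

With a fixed large η the iterates can oscillate around the minimum; the multiplicative decay damps the updates as training proceeds.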
The parameter bias is also trained and determined by using the above method. When the mean square error E reaches the convergence condition, the training stops and a determined network is available for work.

Convolutional filter for down-sampling
Down-sampling is one of the fundamental operations within U-net and is conventionally implemented by the so-called max-pooling operator. It is mainly used to reduce the dimensions of the feature maps and increase the size of the receptive field. The pooling size should strike a balance between these two factors; its determination relies on experience, and the size is usually set to 2. The stride is correspondingly set to 2 to obtain the same dimension reduction. As an example, max-pooling with a 2 × 2 window and a stride of 2 is depicted in Fig 2. In each window, max-pooling extracts the maximum value and discards the others. It runs fast, but some details may be lost in this process. We adopt a convolution filter with a 2 × 2 window and a stride of 2, shown in Fig 2, to replace max-pooling. All values in the window contribute to the down-sampling according to the four window coefficients C11, C12, C21 and C22. This new operation improves the down-sampling accuracy since these four window coefficients are updated iteratively during training.
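The two down-sampling operators can be written compactly. The NumPy sketch below is for illustration; in the framework itself the coefficients C11..C22 are learned during training rather than fixed:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2: keep the largest value per window."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def conv_downsample_2x2(x, C):
    """2x2 strided convolution with coefficients C (Fig 2): every value
    in the window contributes, weighted by C11, C12, C21, C22."""
    h, w = x.shape
    # group pixels into non-overlapping 2x2 windows, then weight and sum
    windows = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    return np.einsum('ijkl,kl->ij', windows, C)
```

With C fixed to 0.25 in every position the convolution filter reduces to average pooling; training moves the coefficients away from that starting point to whatever weighting minimizes the loss.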

Normalization
This framework involves complicated calculations such as image reconstruction, forward projection and convolution, which may introduce non-negligible numerical errors and degrade the training efficiency and accuracy. We therefore apply a normalization operation to the input X, the output Ŷ and the ground truth Y to avoid this problem. This normalization is expressed in Eq (7), in which In represents the normalized image, I the raw image, mean() an operator to obtain the mean value and std() an operator to obtain the standard deviation.
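Eq (7) is the familiar zero-mean, unit-variance normalization, a one-line sketch of which is:

```python
import numpy as np

def normalize(I):
    """Eq (7): subtract the mean and divide by the standard deviation,
    giving a zero-mean, unit-variance image."""
    return (I - I.mean()) / I.std()
```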

Running modes
This framework has two running modes: a training mode and a working mode. The training mode has the following steps: i) A set of incomplete sinograms is first matched with the corresponding complete sinograms into many pairs of training data; each pair consists of an incomplete sinogram and a complete sinogram. ii) These pairs are input into the framework depicted in Fig 1 one at a time, and the network parameters such as W, bias, C11, C12, C21 and C22 are updated iteratively. iii) When all pairs have been used once, an outer learning iteration is completed. iv) Steps ii) and iii) are repeated until the learning converges.
The procedure for working mode is much simpler. When an incomplete sinogram is fed into the framework determined by the training mode, the output of the framework will be a high quality CT image.
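The two running modes can be sketched as a control-flow skeleton. The callables `fbp`, `project`, `unet` and `unet_update` below are hypothetical placeholders standing in for the four parts of Fig 1, not the authors' code:

```python
def train(pairs, unet_update, fbp, project, n_outer=2):
    """Training mode: for each (incomplete, complete) sinogram pair,
    run parts 1-2 of Fig 1 and let the U-net learn from the result."""
    for _ in range(n_outer):                      # outer learning iterations
        for incomplete, complete in pairs:        # one pair at a time
            corrupted = project(fbp(incomplete))  # initial FBP + reprojection
            unet_update(corrupted, complete)      # update W, bias, C11..C22

def work(incomplete, unet, fbp, project):
    """Working mode: complete the sinogram with the trained U-net,
    then reconstruct the final CT image with FBP (part 4)."""
    corrupted = project(fbp(incomplete))
    return fbp(unet(corrupted))
```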

Layer parameters
For the following experiments, the learning framework in Fig 1 has a total of 27 layers. Table 1 lists the corresponding parameters of these layers. In this table, the layer indexes start with the input layer, indicated by the light green block in Fig 1.

Experiments
Sparse-view CT and limited-angle CT are two typical cases with incomplete data. Taking them as examples, this section validates the proposed reconstruction framework through two experiments. The first is based on a synthetic data set: some of the synthetic data was used to train the framework and the rest to test it. The second is based on a real data set: some of the real data was used to train the framework and the rest to test it.

Data preparation
For the synthetic data sets, 300 phantoms are used to obtain fan-beam sinograms with different sampling factors. Each phantom consists of tens of ellipses with random attenuation coefficients, sizes and locations. In the sparse-view CT experiments, the sampling factors are set to 1, 8 and 12, corresponding to 720, 90 and 60 views, respectively. The size of each phantom is 512 × 512 pixels. The fan-beam sinograms are generated with the built-in MATLAB function fanbeam(). The width of all the sinograms is 731 pixels. The sinogram with sampling factor 1 has a size of 720 × 731 pixels and is treated as a complete one; the other sinograms are incomplete. The sinograms of 200 phantoms are used to train the framework and those of the remaining 100 phantoms are used to test it.
Within the framework, for each incomplete sinogram, the initial FBP reconstruction is first executed with Eq (1) to obtain the initial CT image. Then the forward projection operator in Eq (2) is applied to the initial CT image to generate the corresponding corrupted sinogram with a size of 720 × 731 pixels. Next, the iterative deep learning runs to update the network parameters by comparing the corrupted sinogram with the complete sinogram.
For the experimental data sets, 300 female and male head slice images are randomly chosen from the Visible Human Project CT Datasets. 200 images are used to train the network and the remaining 100 images are used to test it. All the operations and procedures are the same as those for the synthetic data sets.
The experimental data sets are available from https://www.nlm.nih.gov/research/visible/visible_human.html. A license agreement for this work was approved by the National Library of Medicine, National Institutes of Health, USA.

Implementation
This framework is implemented with Python 3.5.2 and TensorFlow 1.8. It runs on an Advantech AIMB-785 workstation with an Intel Core i7-6700 CPU and an Nvidia GTX 1080 Ti GPU with 11 GB of memory. The Nadam optimizer [34] is used for the back-propagation of gradients and the update of the learning parameters. Batch normalization (BN) [24] is adopted to accelerate the training. Due to the limit of GPU memory, the batch size is set to 1, which makes the training process unstable. In order to avoid fluctuations of the gradients that may affect the learning, a small learning rate is preferred to control the parameter updates. The initial learning rate is set to 1 × 10−4 and gradually reduced with the exponential decay method expressed by Eq (8), from https://tensorflow.google.cn/versions/r1.10/api_docs/python/tf/train/exponential_decay. The decay rate is 10% and the decay steps 20. int() denotes integer division.
decayed_learning_rate = learning_rate × decay_rate ^ int(global_step / decay_steps)    (8)

The training time for the synthetic and experimental data sets is about 9 hours for 50 outer iterations. When testing, it takes about 1 s to obtain the final CT image.
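The staircase exponential decay of Eq (8) is straightforward to sketch. Note that "the decay rate is 10%" is read here as the learning rate shrinking by 10% per interval, i.e. decay_rate = 0.9; that reading is an assumption:

```python
def decayed_learning_rate(initial_lr, decay_rate, global_step, decay_steps):
    """Eq (8), matching tf.train.exponential_decay with staircase=True:
    lr * decay_rate ** int(global_step / decay_steps)."""
    return initial_lr * decay_rate ** (global_step // decay_steps)
```

With initial_lr = 1e-4, decay_rate = 0.9 and decay_steps = 20, the rate stays at 1e-4 for the first 20 steps and then drops to 9e-5.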

Image evaluation
Peak signal-to-noise ratio (PSNR) is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. Expressed in Eqs (9) and (10), it is commonly used to measure the reconstruction quality of lossy compression codecs. In this article it is adopted to evaluate the quality of the sinograms and CT images. Here, MAX() is an operator to obtain the maximum value, m and n the height and width of the image, G the ground truth image and K the image to be evaluated. The bigger the PSNR, the better the image quality.
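Eqs (9) and (10) combine the mean square error with the peak value of the ground truth; a short sketch:

```python
import numpy as np

def psnr(G, K):
    """Eqs (9)-(10): peak signal-to-noise ratio in dB between the
    ground truth G and the image K under evaluation."""
    G = np.asarray(G, dtype=float)
    K = np.asarray(K, dtype=float)
    mse = np.mean((G - K) ** 2)           # Eq (10): mean square error
    return 10.0 * np.log10(G.max() ** 2 / mse)  # Eq (9)
```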
The averaged PSNR (aPSNR) value is also used to evaluate the performance of the proposed framework. It is the mean value of a group of PSNR values and is calculated by Eq (11).

Comparison with methods based on CT images
As mentioned in the Introduction, most existing deep learning techniques for X-ray CT image processing are independent of the CT reconstruction algorithms: the input is the corrupted CT image and the output is the corrected CT image or the artifact. In contrast, the proposed method combines the CT reconstruction algorithm with deep learning and works in the domain of the projection sinograms; the estimated results are the complete sinograms. It has the potential to provide much better image quality.
A comparison between the proposed method and a typical deep convolutional neural network (DCNN) [32] was conducted to demonstrate its performance. In the implementation of DCNN, all the procedures, and the parameters that are set manually rather than determined by training, were kept the same as those published in the reference [32].

Results
Figs 3 and 4 present the sparse-view results of one of the 100 synthetic phantoms used for testing. Figs 5 and 6 present the sparse-view results of one of the 100 head slices used for testing. They correspond to the cases with sampling factors of 8 and 12, respectively. The incomplete sinograms have sizes of 90 × 731 pixels and 60 × 731 pixels, respectively. All the corrupted sinograms generated by the forward projection have a size of 720 × 731 pixels. All the reconstructed CT images have a size of 512 × 512 pixels after cutting off the surrounding blank region. In order to show the most obvious differences within each group of images, some regions of interest (ROIs), indicated by the yellow boxes, are enlarged for better visualization. In these four figures, it is noticeable that, after training, the artifacts caused by the sparse sampling are suppressed drastically both in the sinograms and in the CT images. The PSNR values also increase significantly. These results demonstrate the validity of the proposed framework. Fig 7 shows the radar maps of the PSNR values of all the 100 synthetic and experimental images for the sparse-view CT tests. Table 2 lists the corresponding aPSNR values. They also confirm the validity of the proposed framework.
It should be pointed out that although DCNN significantly reduces the sparse-sampling artifacts, some of them remain in the images. Moreover, the bigger the sampling factor is, the blurrier the edges of the object become. This may be caused by the fact that its learning mechanism is based on CT images: once an imprecise value is estimated by the DCNN, it directly distorts the object in the final CT image. In contrast, DLFBP is based on sinograms. An estimation error in the sinogram is compensated by the following CT reconstruction, since the final CT image is a weighted sum of the values in the sinogram. As such, DLFBP is tolerant to learning bias and provides better image quality. Figs 8 and 9 present the limited-angle results of one of the 100 synthetic phantoms used for testing. Figs 10 and 11 present the limited-angle results of one of the 100 head slices used for testing. They correspond to the cases with projections within the angular ranges [0°, 120°] and [0°, 90°], respectively. The incomplete sinograms have sizes of 240 × 731 pixels and 180 × 731 pixels, respectively. All the corrupted sinograms generated by the forward projection have a size of 720 × 731 pixels. All the reconstructed CT images have a size of 512 × 512 pixels after cutting off the surrounding blank region. In order to show the most obvious differences within each group of images, some regions of interest (ROIs), indicated by the yellow boxes, are enlarged for better visualization. Fig 12 shows the radar maps of the PSNR values of all the 100 synthetic and experimental images for the limited-angle CT tests. Table 3 lists the corresponding aPSNR values.
The images in Figs 8–11 and the values in Fig 12 and Table 3 show that DLFBP is effective in suppressing the artifacts in limited-angle CT. Moreover, DLFBP recovers more inner structure details, and the edges are much sharper than with DCNN.

Discussion
Down-sampling with convolution filters plays an important role in the improvement of the image quality in the proposed framework. In conventional neural networks, max-pooling is usually adopted to implement down-sampling. It runs fast, but the extracted maximum value may not match the true underlying signal. In contrast, the coefficients in the convolution filters are determined by training the network with the training data set, and better results can be provided by this operation. In order to investigate the effect of the down-sampling method on the learning accuracy, the above experiments were repeated with max-pooling. Table 4 lists the corresponding aPSNR values. In every case, the aPSNR value of the convolution-filter down-sampling is larger than that of max-pooling. Obviously, the former performs better.

Conclusion
In this paper, we reported a deep learning reconstruction framework for incomplete-data CT. It is based on the deep learning technique and the classical FBP reconstruction algorithm. The network parameters are not determined manually but by training in the domain of sinograms. The framework has been validated on sparse-view and limited-angle CT reconstruction with synthetic and experimental data sets. It provides a possible image reconstruction solution for X-ray CT with incomplete data.
In the proposed framework, the U-net neural network is adopted to learn the mapping from incomplete to complete sinograms. The learning accuracy and efficiency are limited by the characteristics of U-net. In the future, this framework could be improved further by replacing U-net with more advanced learning models.