Stacked competitive networks for noise reduction in low-dose CT

Since absorption of X-ray radiation has the possibility of inducing cancerous, genetic and other diseases to patients, researches usually attempt to reduce the radiation dose. However, reduction of the radiation dose associated with CT scans will unavoidably increase the severity of noise and artifacts, which can seriously affect diagnostic confidence. Due to the outstanding performance of deep neural networks in image processing, in this paper, we proposed a Stacked Competitive Network (SCN) approach to noise reduction, which stacks several successive Competitive Blocks (CB). The carefully handcrafted design of the competitive blocks was inspired by the idea of multi-scale processing and improvement the network’s capacity. Qualitative and quantitative evaluations demonstrate the competitive performance of the proposed method in noise suppression, structural preservation, and lesion detection.


Introduction
With the wider application of X-ray computed tomography (CT) in both clinical intervention and diagnosis, the potential radiation dose has been a growing public concern [1]. The most encouraged guide in this field to reduce the radiation dose as low as possible while maintaining the imaging quality sufficient for diagnostic accuracy. A direct way to achieve this purpose is to lower the X-ray tube current or voltage. However, since the imaging procedure is an integration of quantum photons, insufficient photons will unavoidable generate quantum noise and degrade the quality of reconstructed images from traditional analytic method, i.e. filtered back-projection (FBP). The existing solutions can be primarily categorized into projection space filtering, iterative reconstruction and image post-processing. The approaches in all three categories aim to improve the reconstructed CT images from low-dose scans.
Projection space filtering [2] directly operates on raw projection data or log-transformed sinogram before FBP is applied. Although this kind of method has low computational cost, their results may suffer from structure distortion due to the lack of well definition image edges in projection domain. Iterative reconstruction (IR) [3] models the reconstruction problem as an objective function with prior constraints. Different priors have been proposed for dealing a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 with the CT issues of low dose, limited angle and few views, such as nonlocal means (NLM), total variation (TV) and its variants, and sparse representation [4][5] [6] [7][8] [9][10] [11] [12] [13] [14]. Despite promising results obtained by this type of method, their wide application is still circumscribed on account of the difficulty of accessing well-formatted raw data from commercial CT scanners and heavy computational burden. Relatively speaking, post-image processing methods can be applied directly on low-dose CT (LDCT) images and are more convenient to be combined into current CT systems. However, due to the fact that the noise and artifacts cannot be well determined in image domain, it is very hard to achieve satisfactory results by directly transplanting current general image denoising methods. In recent years, extensive attempts based on image post-processing methods have been made to deal with this problem [15] [16]. Most representatively, inspired by the theory of sparse representation, Chen et al. proposed a patch-based fast dictionary learning approach to suppress both mottled noise and streak artifacts [16]. Additionally, the block-matching 3D (BM3D) algorithm has demonstrated to be efficient in several low-level image restoration tasks and was adapted to improve the quality of LDCT images [17].
Recently, deep learning has received overwhelming concerns due to its superior performance in numerous scientific research fields. In the field of medical imaging, several preliminary researches with this idea were proposed [18] [20]. Kang et al. applied the U-Net to the multiscale data of LDCT images after decomposition by the wavelet transform [21]. By combining the idea of residual encoders into traditional CNN, Chen et al. attained promising results in LDCT [22].
In spite of some preliminary results achieved by deep learning for LDCT, the power of a deeper and wider network has not been fully explored. Several studies in the field of computer vision have made considerable progress on constructing deep architecture [23][24] [25]. However, to prevent from discard of meaningful structural details, most neural network models for image restoration, which is considered a low-level task, had limited the depth of network. This property is different from high-level tasks in computer vision, e.g. classification or detection [26] [27], in which max-pooling operation is extensively used to capture high-level features.
In this paper, we expand the frontier of CT analysis by adopting a deep network for processing the complex nonlinearity in LDCT. Inspired by the work of [23], we introduce a new architecture named Stacked Competitive Network (SCN), which is illustrated in Fig 1, into the traditional CNN. Instead of a single stacked CNN, our SCN is comprised of several successive Competitive Blocks (CB). Each CB introduces a multi-scale processing mechanism by increasing the width of the network, which can further improve the ability of the traditional CNN. In the second section, the proposed SCN model is elaborated. In the third section, qualitative and quantitative experiments were performed to evaluate the proposed SCN model. Finally, the conclusion was drawn.

Single competitive block
The main idea of the single competitive block is to optimize the local structure in a CB with a certain sparsity, which can be approximately obtained by possible dense components. It is noteworthy that assuming shift-invariant means the proposed network can be constructed from stacked blocks. For this purpose, what we need to do is to identify the optimal sparse structure and repeat it spatially.
We assume that each feature map from the previous layer relates to regions of the input image, and these feature maps are clustered into different filter banks. However, single-scale filters cannot adequately capture low-level features and texture patterns simultaneously, especially in LDCT images.
Inspired by the work of [23], convolutional filters with different scales were utilized, which means that multi-scale texture and structural features could be captured in the same image region. Furthermore, because pooling layers will result in the loss of spatial information, all of the pooling operations are discarded in our blocks.
Although multi-scale filters can capture richer information of original images, they have shortcomings. First, many redundant feature maps are used repeatedly, which inevitably increases the computational burden. In addition, the introduction of a CB significantly boosts network parameters, which increases the difficult of training.
To solve these problems, a Combination Function (CF) is designed for each block. A feature map is generated as the input of the next layer in the network. The CF can be formulated as follows: and where σ(Á) is the ReLU function; a i-1 is the feature representation of the (i-1) th layer; k represents the scale kernel size of the i th layer; is the element-wise max operation; F can be viewed as a competitive combination function, which can be implemented using the element-wise max operation in the network; and a i is the new feature map constructed by CF, which is used as the input of the next layer. The structure of a single competitive block is shown in Fig 2.

Stacked competitive networks
Since the CNN-based approaches are immune to the impact of the statistical distribution of the artifacts and noise, we model the noise reduction problem for LDCT as follows. Letting X2F m×n be an LDCT image, and letting Y g 2F m×n be the corresponding normal-dose image, the noise-reduction problem can be transformed into the problem of learning a mapping function R directly from the LDCT to NDCT as follows: Generally, R is nonlinear and complicated. To achieve this aim, k single Competitive Blocks (CBs) are stacked as a deep CNN to transform the LDCT image to its corresponding NDCT image. The noise reduction problem can be formulated as the minimization of the following loss function: where F i is the competitive mapping function of the i th layer in the proposed SCN and Θ denotes the parameters of the network. Nonlinear mapping is imposed into the first k−1 blocks to represent the nonlinear relationship between the LDCT and NDCT image features. The competitive block in the last layer, namely, F k , reconstructs the estimated NDCT image Y r .
To avoid over-fitting, a regularization term X k i¼1 kW i k 2 F (a weight decay term) is introduced, which forces to reduce the magnitudes of the weights. The objective function is re-formulated as follows: Due to the large number of parameters, the optimization of L tends to fall into a local minimum. To avoid this situation, first, an unsupervised pre-training process was utilized to initialize the first k−1 layers in a stacked strategy, and we randomly initialize the k th layer; second, we fine-tuned the whole network in a supervised way. The output of this single-hidden-layer network a i = F i (a i-1 ) is used as the input of the next layer. The original LDCT image is used as the input of the first layer, i.e., a 0 = x. After initialization, all of the layers were fine-turned with (Eq 5). As a result, the front layers of a stacked competitive block attempt to catch the low-level features, such as texture patterns in LDCT images, while the higher layers try to seize higherlevel features that contain context information from low-level features.
Owing to the SCN is an end-to-end architecture, once the network is configured, the set of the parameters, Θ should be estimated to build the mapping function R. The estimation can be achieved by minimizing the loss function L(Θ) between the estimated CT images X and the reference NDCT images Y g . Given a set of paired patches P = {(X 1 ,Y 1 ),(X 2 ,Y 2 ),. . .,(X T ,Y T )} where {X t } and {Y t } denote LDCT and NDCT image patches respectively, and T is the total number of training samples. In the training stage, the loss function was first optimized by Adam [28] and later by SGD [29] optimization.

Experimental design and results
For evaluation purpose, we compared our method with five different state-of-the-art approaches: TV-POCS [10], K-SVD [16], BM3D [17], SSCN [20], and KAIST-Net [21]. K-SVD and BM3D are the two powerful image restoration algorithms, which have been successfully employed by LDCT. TV-POCS is a classical IR method constrained by a sparse gradient prior. SSCN is a first CNN-based LDCT image restoration model, which only has singlescale convolutional kernel, e.g., 5×5. We also compared to a recently presented deep CNN model, called KAIST-Net. It can be treated as an advanced variant of the SSCN model aided by multi-scale residual learning. In all the cases, the parameters in TV-POCS, KSVD and BM3D were adjusted to achieve the best result.
Three metrics, including, peak signal-to-noise ratio (PSNR), root mean square error (RMSE), and structural similarity index measure (SSIM), were chosen for quantitative assessment. All experiments were implemented in MATLAB 2017a on a PC (Inter i7-4970 CPU, 32 G RAM and GTX 980TI graphics card). The codes of this project have been released in web page (https://github.com/Wenchao-Du/SCN-for-Image-Denoising).

Data source
Simulated data. TCIA Dataset: This dataset included 7015 NDCT images from 165 patients, which were obtained from The Cancer Imaging Archive (TCIA, https://imaging.nci. nih.gov/ncia/). The image samples were 256×256 pixels. Several representative slices are shown in Fig 3. It can be seen that different body parts were involved to maintain the diversity of data source. The corresponding LDCT images were simulated by introducing Poisson noise into the projection data from the NDCT images. Under the assumption of monoenergetic source, the projection data obey the Poisson distribution that can be formulated as where z i is the measurement along the i th X-ray path, b i is the blank scan factor, r i denotes the electronic noise, and l i is the integral of the X-ray attenuation coefficients. In our simulations, the noise level can be easily controlled with different b i . In the initial experiments, b i was uniformly set to 10 5 photons and denoted as b 0 = b i = 10 5 , i = 1,. . . ..,I. The Siddon ray-driven algorithm was utilized to produce the projection data with a fan-beam geometry. The detector-to-rotation center distance was 400 mm and the rotation center-to-source distance was 400 mm. The physical size of image was 200 mm×200 mm. The length of detector was 413 mm and the detector had 512 bins. Over a 360˚range, 1024 projections were uniformly sampled.
Since directly processing the complete image is knotty, SCN was applied on image patches instead. Usually, deep neural networks require amount of training samples, which will be difficult to obtain in the medical field due to the limitations of patient privacy. Extracting image patches with overlapped sliding windows is an efficient way to enlarge the training set.
In our experiments, 200 NDCT and corresponding simulated LDCT image pairs were randomly chosen for training and 100 images pairs were randomly chosen for testing. The test data is chosen from a different subject than the training set.
Clinical data. Mayo Clinical Dataset: This dataset included 2378 3-mm-thickness routineand quarter-dose CT images from ten patients. The usage of Mayo dataset was provided by Low Dose CT Grand Challenge (http://www.aapm.org/GrandChallenge/LowDoseCT/). The training set composed of a portion of routine-and quarter-dose image pairs. The testing set contained the rest of the image pairs. For comparison, 10-fold cross validation was used in the testing stage: the images from nine patients were involved in the training stage and the rest one was used as testing samples.

Parameter selection
The training data consist of pairs of image patches with size of 100×100 that were extracted from LDCT images with a sliding distance of four pixels. After extracting image patches, the number of training samples reached 10 6 . The network was implemented with Caffe. In our experiments, we evaluated several parameter combinations and finalized the parameter setting as follows. The learning rate was initialized to 10 −2 and gradually decayed to 10 −5 . The convolutional kernels were initialized with random Gaussian distributions with zero mean and standard deviation 0.01. The numbers of filters in the layers were 96 except the last layer, which was set to 1. To avoid patch-alignment issues and heavy computational burden, the competitive blocks were restricted to 3 filters with sizes of 5×5, 3×3 and 1×1. Due to the flexibility of CNN, the proposed SCN can perform on patches with arbitrary size. All of the testing samples were fitted into the network directly without any pre-processing operations.

Experimental results
Simulated data. In order to evaluate the performance of the proposed SCN, two typical slices, which were from thorax and abdomen, were selected. Due the difference of scan protocols between two slices, the noise and artifacts in both images appeared in different degrees. Fig 4 demonstrates the thoracic image processed by different methods. In Fig 4(b), it is obvious that the LDCT image has severe streak artifacts and noise, especially close to the tissues with high attenuation coefficients, e.g. bones. All the approaches showed different abilities on noise and artifact reduction. In Fig 4(c), it is noticed that TV-POCS smoothed some small details in the pulmonary lobes due to its notorious blocky effect. K-SVD and BM3D had a better performance on detail preservation than TV-POCS, but the artifacts radiated from the bones were still obvious. SSCN, KAIST-Net and SCN removed most artifacts and noise and the structural information were maintained better than other methods. In addition, SCN better distinguished the low-contrast regions. Fig 5 shows the magnified the region from different methods, which was indicated by the green box. Obviously, the parts pointed by the blue arrow were blurred in Fig 5(c). The other methods can discriminate these details to various degrees.
In Fig 5(h), the details were well preserved without blurring. In Fig 5(d) and 5(e), the streak artifacts can be observed adjacent to the bones, which are indicated by the red arrow.
Four ROIs, which are indicated by red dotted rectangles in Fig 4(a), were selected for quantitative assessment. The magnified ROIs are shown in Fig 6. It can be noticed that the quantitative results had coherent trend to visual effect. The SCN had the best values in all the metrics for all the ROIs. Fig 7 gives the abdominal image processed by different methods. Since the routine-dose abdominal image (Fig 7(a)) is noisier than routine-dose thoracic image (Fig 4(a)), it will be much more difficult to distinguish the structures due to the severe deterioration in low-dose abdominal image (Fig 7(b)). TV-POCS and K-SVD had limited performance in recovering the details as shown in Fig 7(c) and 7(d). The result in Fig 7(c) is still contaminated by blocky effect. Although BM3D suppressed most noise, the artifacts near the vertebral column are evident. SSCN, KAIST-Net and SCN eliminated most of the artifacts and noise, but the results in Fig 7(f) and 7(g) suffered from mild blurring and this drawback was also mentioned in our previous work [14]. Several regions with detectable structure differences are marked by the red arrows. The red arrow in the middle of the liver point to a structural detail, which was well retained by SCN in Fig 8. Other tissues, such as vertebral column and bones were best maintained in the results of SCN as well.
The quantitative results for the abdominal image processed by different approaches are shown in Table 1. It can be seen that SCN achieved an impressive result on all indices. Table 2 shows the statistical results for all the 100 images in the testing set. The proposed SCN outperformed other competing methods in all of the metrics.
Clinical data. To further evaluate the performance of SCN, a representative slice from Mayo clinical dataset was selected and Fig 9 shows the results. It is easy to be observed that SCN obtained the superior capability of detail reservation and noise reduction. In Fig 10, SCN yielded the best image quality for detail recovery and structure preservation, as highlighted by the red circles in the magnified parts. Extra artifacts were introduced by K-SVD and TV-POCS. BM3D over-smoothed the details. Only CNN based methods can differentiate the contrast-enhanced intercostal vein, which is indicated by the red arrowhead in Fig 10. Although KAIST-Net and SSCN achieved similar performances to SCN, the texture and structural details were better retained in Fig 10(h).
The coronal and sagittal images of the results are demonstrated in Figs 11 and 12. Red circles indicate the regions with visual differences. It is clear that comparing with other methods, SCN can remove most of the noise while maintaining more structure information. Table 3 summarizes the quantitative results of Figs 9 and 10. SCN outperformed the other methods in all the metrics. Table 4 shows the quantitative results of 10-fold cross validation in terms of MEANS±SDs.
Investigation of competitive blocks in SCN. As the proposed SCN model composed of several successive CBs, the impact of CB was investigated. We constructed convolutional networks with the same structure but instead used the convolution operation with single-scale 5 Ã 5 and 3 Ã 3 convolutional kernels; these CNNs are denoted as CNN-5 and CNN-3, respectively. The experiments were performed on TCIA dataset.
Meanwhile, to evaluate the robustness of SCN, training and testing data with different noise levels were generated to produce the quantitative results, which are shown in Table 5. In this table, the training set of CNN-5, CNN-3 and SCN was randomly mixed with various noise levels for b 0 = 10 5 , b 0 = 5×10 5 and b 0 = 5×10 4 . It can be noticed that SCN still achieved the best  performance, which is a powerful evidence for the merit originated from the introduction of CBs. The robustness of SCN for different noise levels is also can be confirmed. We examined the impact of the number of convolutional kernels contained in a single CB. Several SCN networks were constructed and different numbers of convolutional kernels were included in a single CB. These networks are denoted as SCN-2, SCN-3 and SCN-4 respectively. The CBs in SCN-2 contains both 1×1 and 3×3 kernels; the CBs in SCN-3 contain 1×1, 3×3 and 5×5 kernels; the CBs of SCN-4 contains 1×1, 3×3, 5×5 and 7×7 kernels. The experiments were performed on TCIA dataset.    Table 6 shows the quantitative results. In Table 6, the training and testing sets were same as Table 5 for SCN-2, SCN-3 and SCN-4. It can be observed that SCN-4 obtained better performance, which confirmed the effectiveness of multi-scale convolutional kernels and more scales can further improve the performance. However, adding more convolutional kernels will unavoidably increase the training time. To balance the computational cost and performance, we believed that 3 kernels in a single CB was a reasonable choice.

Conclusions
We proposed a novel network structure, aided by stacked competitive blocks, for LDCT image restoration. The primary advantage of this method is the introduction of multi-scale processing. Based on two public databases, our proposed SCN achieved the best performance compared with other competing methods, in terms of noise suppression and structural preservation. Next, we are planning to optimize SCN ulteriorly and try to apply it to the wider range of medical imaging tasks.