Study of low-dose PET image recovery using supervised learning with CycleGAN

PET is a popular medical imaging modality for various clinical applications, including diagnosis and image-guided radiation therapy. Low-dose PET (LDPET) at a minimized radiation dosage is highly desirable in the clinic, since PET imaging involves ionizing radiation and raises concerns about the risk of radiation exposure. However, a reduced dose of radioactive tracer can degrade image quality and compromise clinical diagnosis. In this paper, a supervised deep learning approach combining a generative adversarial network (GAN) with a cycle-consistency loss, a Wasserstein distance loss, and an additional supervised learning loss, named S-CycleGAN, is proposed to establish a non-linear end-to-end mapping model and used to recover LDPET brain images. The proposed model and two recently published deep learning methods (RED-CNN and 3D-cGAN) were applied to 10 testing datasets at 10% and 30% dose levels, and to a series of simulation datasets with embedded lesions of different activities, sizes, and shapes. Besides visual comparisons, six measures (NRMSE, SSIM, PSNR, LPIPS, SUVmax and SUVmean) were evaluated on the 10 testing datasets and 45 simulated datasets. Compared with RED-CNN and 3D-cGAN, our S-CycleGAN approach achieved comparable SSIM and PSNR, slightly higher noise but a better perceptual score and better-preserved image details, and much more accurate SUVmean and SUVmax. Quantitative and qualitative evaluations indicate the proposed approach is accurate, efficient and robust compared to other state-of-the-art deep learning methods.


Introduction
Positron Emission Tomography (PET) is a widely used imaging modality for various clinical applications, such as lesion malignancy assessment, disease staging, and treatment monitoring [1][2][3]. Compared with computed tomography (CT) and magnetic resonance imaging (MRI), PET is a functional imaging technique that detects metabolic processes of the human body [4]. To reach a PET image quality sufficient for diagnostic purposes, a typical dose of injected radioactive tracer usually ranges from 185 to 555 MBq, depending on the PET scanner, protocol, reconstruction method, patient and so on. Since a high gamma radiation dosage in a patient may induce genetic damage and cancer [5][6][7], it inevitably raises concerns about the potential risk of radiation exposure. Thus, it is desirable to reduce the dose of radioactive tracers in PET imaging. However, the major drawback of dose reduction is that higher noise, worse contrast and information loss may appear in the reconstructed images, resulting in inferior image quality and unreliable diagnosis. A series of methods have been proposed to improve image quality for low-dose PET (LDPET) imaging while preserving crucial diagnostic information. These algorithms can be roughly categorized into traditional methods such as iterative reconstruction algorithms [8,9], post-processing methods [10][11][12][13], and deep learning based methods [14][15][16][17][18][19][20][21][22]. In general, these strategies for improving PET image quality are either hardware-oriented or computationally intensive. Besides, an LDPET image contains more complex spatial variations, correlations and statistical noise than the full-dose PET (FDPET) image, which limits the performance of the traditional methods.
Recently, deep learning has drawn a great deal of attention in computer vision and medical image analysis [4][5][6][7][23]. For instance, image classification [24] and face verification [25] can achieve human-level performance. Algorithms based on deep learning have achieved some success in low-dose CT (LDCT) reconstruction and denoising [14][15][16][17][18]. These methods learn a non-linear mapping from an LDCT image to a high-quality CT image to recover missing high-frequency details. In contrast, far fewer deep learning works have been reported for recovering or denoising LDPET images. Xiang et al. [19] proposed a deep auto-context CNN model that synthesized a high-quality image from a 1/4-dose PET image and the corresponding T1-weighted MR image. Xu et al. [20] used a U-Net-like network [26] to recover a full-dose-quality PET image from a 1/200-dose PET image, and applied a multi-slice input strategy to make the network more robust to noise. Wang et al. [21] designed an end-to-end framework based on 3D conditional GANs (3D-cGANs) to estimate the high-quality PET image from the corresponding LDPET image. The 3D convolution operation helps the model avoid the discontinuous cross-like artifacts that usually occur in 2D convolution based models. Kaplan et al. [22] proposed a deep learning model whose loss function takes specific image features into account to denoise 1/10-dose PET images. Chen et al. [27] proposed to combine both PET and MR information to synthesize high-quality and accurate PET images. More recent work from Ouyang et al. [28] suggests that combining a generative adversarial network (GAN) with feature matching in the discriminator can achieve similar performance even without the MR information.
Rather than using a deep learning method as a post-processing tool, Gong et al. [29] proposed a residual convolutional auto-encoder within a machine learning framework to denoise PET images. More recently, Haggstrom et al. [30] took PET sinogram data as the input and directly generated reconstructed PET images, reporting a 100-fold speedup over standard iterative techniques such as ordered subset expectation maximization (OSEM).
In general, physicians use both the maximum SUV (SUV max ) and the mean SUV (SUV mean ) to characterize high uptake regions [31], but SUV max is more often used in practice since SUV mean depends heavily on the selected volume of interest (VOI), while the SUV max value is unique and reproducible within a VOI [32,33]. Inspired by recent advanced neural networks, such as Dense-Net [34], residual CNNs [35], and CycleGAN [36], a cycle Wasserstein regression adversarial training framework, named S-CycleGAN, is proposed and studied for PET brain imaging in this paper. Although good performance in recovering or denoising LDPET images has been reported, the deep learning based methods mentioned above were not evaluated quantitatively for lesion SUVs, which limits their usage in clinical applications. In order to evaluate the clinical performance of our model, we also propose a simulation framework that produces a series of simulation data mimicking complex clinical situations. The S-CycleGAN model was then applied to the clinical and simulated LDPET datasets (10% and 30% of the FDPET dose) and studied both qualitatively and quantitatively.

Methods
The goal of this work is to train a model to learn the non-linear mapping between LDPET and FDPET images. As shown in Fig 1, the proposed network is based on a CycleGAN architecture.
The proposed model includes two generators and two discriminators. We denote by G_AB the mapping from the LDPET domain (A) to the FDPET domain (B), and by G_BA the mapping in the opposite direction. In addition, there are two discriminators, D_A and D_B, which aim to identify whether the output of each generator is real or fake. The generators and discriminators are trained simultaneously. Our proposed network combines four types of loss functions: the adversarial loss (L_adv), cycle-consistency loss (L_cyclic), identity loss (L_identity) and supervised learning loss (L_sup). Therefore, the overall loss is defined by

L_total = L_adv + α L_cyclic + β L_identity + γ L_sup,

where α, β and γ are hyperparameters. Adversarial loss: We employ adversarial losses so that generated image samples obey the empirical distributions in the source and target domains. To improve the training stability of GANs, we apply the 1-Wasserstein distance [37] instead of the original log-likelihood function. The 1-Wasserstein (Earth-Mover, EM) distance is defined as

W(P_r, P_g) = inf_{γ∈Π(P_r, P_g)} E_{(x,y)∼γ}[‖x − y‖],
where Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are P_r and P_g, respectively. Thus, the adversarial objective function L(G_AB, D_B) is defined as

L(G_AB, D_B) = E_{x_B}[D_B(x_B)] − E_{x_A}[D_B(G_AB(x_A))] − λ E_{ỹ}[(‖∇_ỹ D_B(ỹ)‖_2 − 1)^2],
where λ is a regularization parameter that controls the trade-off between the Wasserstein distance and the gradient penalty term, and ỹ is uniformly sampled along straight lines between pairs of points G_AB(x_A) and x_B. The adversarial loss for the reverse direction, L(G_BA, D_A), is defined in a similar way. The final adversarial loss (L_adv) is defined as

L_adv = L(G_AB, D_B) + L(G_BA, D_A).

Cycle consistency loss: We adopt a cycle consistency term, requiring that FDPET and LDPET images can be transformed into each other, as an additional regularization to help the learning of G_AB and G_BA. The cyclic loss is defined by

L_cyclic = E_{x_A}[‖G_BA(G_AB(x_A)) − x_A‖_1] + E_{x_B}[‖G_AB(G_BA(x_B)) − x_B‖_1],

where ‖·‖_1 denotes the l_1-norm. This allows additional information to be shared between LDPET and FDPET images when learning their corresponding generators.
Identity loss: In a real clinical situation, the input to the generator G_AB can be a full-dose image, and we expect the generator not to alter such a clean image (and vice versa for G_BA). Besides, the identity loss provides another regularization in the training procedure and is formulated as

L_identity = E_{x_B}[‖G_AB(x_B) − x_B‖_1] + E_{x_A}[‖G_BA(x_A) − x_A‖_1].

Supervised learning loss: Since we have paired datasets, we can also train our model in a supervised fashion, with a supervision loss defined as

L_sup = E_{(x_A, x_B)}[‖G_AB(x_A) − x_B‖_1 + ‖G_BA(x_B) − x_A‖_1].
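The three l_1-based terms above combine into the weighted total loss in a straightforward way. The following minimal numpy sketch illustrates that combination on toy arrays; the generator arguments are placeholders (identity functions here), not the paper's trained networks, and the default weights follow the hyperparameters reported later in the paper.

```python
import numpy as np

def l1(a, b):
    # Mean absolute difference: the l1-norm term shared by the cyclic,
    # identity, and supervised losses.
    return np.mean(np.abs(a - b))

def combined_l1_losses(x_A, x_B, G_AB, G_BA, alpha=10.0, beta=5.0, gamma=5.0):
    # Cycle consistency: A -> B -> A and B -> A -> B should reproduce the input.
    l_cyclic = l1(G_BA(G_AB(x_A)), x_A) + l1(G_AB(G_BA(x_B)), x_B)
    # Identity: a target-domain input should pass through unchanged.
    l_identity = l1(G_AB(x_B), x_B) + l1(G_BA(x_A), x_A)
    # Supervised: paired LDPET/FDPET patches allow a direct regression term.
    l_sup = l1(G_AB(x_A), x_B) + l1(G_BA(x_B), x_A)
    return alpha * l_cyclic + beta * l_identity + gamma * l_sup

# With identity generators, the cyclic and identity terms vanish and only
# the supervised term (the inter-domain l1 distance) remains.
identity = lambda x: x
x_A = np.zeros((56, 56))
x_B = np.ones((56, 56))
total = combined_l1_losses(x_A, x_B, identity, identity)
```

The adversarial term is omitted here because its gradient penalty requires automatic differentiation through the discriminator.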

Network architecture
Our proposed model, S-CycleGAN, consists of two generator networks, G_AB and G_BA, and two discriminator networks, D_A and D_B. The generator networks take an image from one domain and estimate the corresponding image in the other domain. The discriminator networks aim to differentiate between real and estimated images.
Generative networks: The network architecture of the two generators G_AB and G_BA is illustrated in Fig 2. The basic structure was optimized for LDCT image denoising in [38,39]. To reduce network complexity and adapt to PET images, we set the filter number to 64 instead of 128 in the original model and add a ReLU layer before the model output. As shown in Fig 2, the first two convolution layers use 64 3×3 convolution kernels to produce 64 feature maps, and connect to 6 residual modules, where each module is composed of 3 sets of convolution, batch normalization, and ReLU layers, plus one residual connection with a ReLU layer. Next, a concatenation layer combines the inputs of each module with the output of the last module, followed by two convolution layers with 64 feature maps. Finally, a last convolution layer with a 3×3 kernel, combined with an end-to-end bypass connection and an additional ReLU layer, is used to estimate the FDPET image.
Discriminator: The discriminators take either a real PET image or an estimated one as input, and determine whether the input is real or not. As shown in Fig 3, the discriminator network has 4 stages of convolutions followed by two fully-connected layers, of which the first has 1024 outputs and the last has 1 output. All convolution layers use 4×4 filters, with 64, 128, 256 and 512 filters, respectively. In addition, we use Leaky ReLU activation with slope 0.2 for all layers in the discriminator.

Datasets
We trained our model using human brain datasets. PET/CT images of 109 clinical patients (body weight range 44.3-103 kg) were acquired on the Minfound ScintCare PET/CT 720L scanner with an injection of 370.81±64.38 MBq of 18F-fluorodeoxyglucose (FDG), and we randomly selected 89, 10 and 10 patient datasets for training, validation and testing, respectively. Each scan took about 5 minutes and usually started 45-60 minutes after injection. The reconstruction was performed using the manufacturer-provided software with all physical corrections, including attenuation, scatter, randoms, dead-time and SUV correction. The size of each 3D reconstructed PET image is 192×192×96 with a pixel size of 2.1 mm. FDPET and LDPET images were reconstructed with the same parameters and post filters to ensure comparable spatial resolution in both images. Two different simulated doses, i.e. 10% and 30% of the original scan counts, were generated by randomly discarding events in the FDPET list-mode data. Although the 10% and 30% images were generated from emulated low-count scans, they have quality comparable to actual low-dose scans, as confirmed by recent work [40]. In this way, FDPET and LDPET images are spatially aligned.
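The low-dose emulation by random event discarding can be sketched as a simple binomial thinning of the list-mode event stream. This is a minimal numpy illustration, not the manufacturer's list-mode format; the `event_times` array stands in for arbitrary event records.

```python
import numpy as np

def thin_events(events, dose_fraction, seed=0):
    # Emulate a low-dose acquisition: keep each list-mode event
    # independently with probability `dose_fraction` (e.g. 0.1 or 0.3).
    # Binomial thinning preserves the Poisson statistics of the data
    # at the reduced count level.
    rng = np.random.default_rng(seed)
    keep = rng.random(len(events)) < dose_fraction
    return events[keep]

events = np.arange(1_000_000)        # stand-in for list-mode event records
ld_10 = thin_events(events, 0.10)    # ~10% dose
ld_30 = thin_events(events, 0.30)    # ~30% dose
```

The thinned event lists are then fed to the same reconstruction pipeline as the full-count data, which is what keeps the LDPET and FDPET images spatially aligned.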
In order to evaluate the clinical feasibility of our proposed model, a Monte Carlo simulation framework using GATE [41,42] was carefully designed, as shown in Fig 4. In the first step, lesion maps with different shapes, sizes, and locations were extracted from a few known patients' datasets (different from the above 109 patients), and a patient's attenuation map (μ-map) was generated from the corresponding CT image. Then, these two maps were fed into GATE and simulated with the same system settings as the Minfound ScintCare PET/CT 720L scanner. Finally, the simulated coincidence data of the lesions, combined with the clinical coincidence data of the patient, were reconstructed by the manufacturer-provided software to produce the final PET image. To systematically evaluate model performance, a series of simulation data with various activities, sizes and shapes were produced and reconstructed. In order to reduce statistical variations, each simulation configuration was repeated 3 times, and a total of 45 simulations were used in later quantitative evaluations. The details of those lesions are provided in Table 1.

In order to reduce the computational cost of training, we extracted overlapping patches from LDPET and FDPET images instead of directly feeding entire PET images to the training pipeline. We cropped LDPET and FDPET images into 56×56 patches at the same locations, with a sliding step of 40, for the supervised learning. In total, there are 136,704 and 15,360 patches for training and validation, respectively. Since PET images have a large range of pixel values, we scaled pixel values to [0, 1].
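The patch extraction described above can be sketched as follows. This is a minimal numpy version under the paper's stated parameters (56×56 patches, stride 40, [0, 1] scaling); the border-handling detail (shifting the last window to cover the image edge) is an assumption, since the paper does not specify it.

```python
import numpy as np

def extract_patches(image, patch=56, stride=40):
    # Slide a patch x patch window over the slice with the given stride.
    # The last window in each dimension is shifted so the border is covered.
    h, w = image.shape
    ys = sorted(set(list(range(0, h - patch + 1, stride)) + [h - patch]))
    xs = sorted(set(list(range(0, w - patch + 1, stride)) + [w - patch]))
    return np.stack([image[y:y + patch, x:x + patch]
                     for y in ys for x in xs])

def scale01(image):
    # PET pixel values span a wide range; map each image to [0, 1].
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + 1e-12)

slice_192 = np.random.rand(192, 192)          # one 192x192 PET slice
patches = extract_patches(scale01(slice_192)) # 5 x 5 = 25 patches per slice
```

For a 192×192 slice this yields window origins at 0, 40, 80, 120 and 136 in each dimension, i.e. 25 overlapping patches per slice.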

Evaluation measures
Six measures are used to evaluate the model performance: the normalized root mean square error (NRMSE), structural similarity index (SSIM [43]), peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS) [44], and relative errors (RE) of SUV mean and SUV max. They are defined as follows:

NRMSE = sqrt( (1/N) Σ (x_i − y_i)^2 ) / (max(y) − min(y)),

SSIM(i, j) = (2 μ_x μ_y + C_1)(2 σ_xy + C_2) / ((μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)),

PSNR = 10 log_10( MAX^2 / MSE ),

RE = (V_pred − V_ref) / V_ref × 100%,
where C_1 and C_2 are constants; μ_x, μ_y, σ_x, σ_y, and σ_xy are the means, standard deviations and covariance in the patch centered at pixel (i, j); MAX is the peak intensity of the image; and MSE is the mean square error. The SUV (standardized uptake value) is commonly used as a relative measure of FDG uptake. The basic expression for SUV [32] is

SUV = r / (a_0 / w),

where r is the radioactivity concentration (kBq/ml) measured by the PET scanner within a region of interest (ROI), a_0 is the decay-corrected amount of injected radiolabeled FDG (kBq), and w is the weight of the patient (g), which is used as a surrogate for the distribution volume of the tracer.
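A few of these measures are simple enough to sketch directly; the following numpy functions illustrate NRMSE, PSNR and the basic SUV expression (LPIPS and SSIM are omitted since they depend on a learned network and windowed statistics, respectively). The example values are illustrative, not taken from the paper's data.

```python
import numpy as np

def nrmse(pred, ref):
    # Root mean square error normalized by the reference dynamic range.
    return np.sqrt(np.mean((pred - ref) ** 2)) / (ref.max() - ref.min())

def psnr(pred, ref, data_max=1.0):
    # Peak signal-to-noise ratio in dB, with MAX = data_max.
    mse = np.mean((pred - ref) ** 2)
    return 10.0 * np.log10(data_max ** 2 / mse)

def suv(r_kbq_ml, injected_kbq, weight_g):
    # SUV = tissue activity concentration / (injected activity / body weight).
    return r_kbq_ml / (injected_kbq / weight_g)

ref = np.linspace(0.0, 1.0, 100)   # toy reference signal in [0, 1]
pred = ref + 0.1                   # uniform 0.1 offset: RMSE = 0.1
```

With a uniform 0.1 offset on a [0, 1] reference, NRMSE is 0.1 and PSNR is 20 dB; an activity concentration of 5 kBq/ml with 370 MBq injected into a 70 kg patient gives an SUV of about 0.95.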

Implementation details
In the proposed model, training was performed by minimizing the overall loss function. We utilized the Adam optimizer [45] with β_1 = 0.5 and β_2 = 0.999. We set the learning rate to 2×10^−4 and the hyperparameters to α = 10, β = 5, and γ = 5. The trade-off parameter λ between the Wasserstein distance and the gradient penalty was set to 10, as suggested in [37]. The hyperparameters were largely derived from the original CycleGAN paper; the parameter γ was determined experimentally to balance the noise and the SUV bias in lesion regions. The patch size was set to 56×56 and the mini-batch size to 16. Kernels were initialized randomly from a Gaussian distribution. All experiments were conducted using Keras [46] with a TensorFlow backend on an NVIDIA TITAN GPU. The maximum number of training epochs was set to 200 based on experience, with an early-stopping strategy triggered when the validation loss stopped improving (patience of 5). Training took 7 days on this GPU. Although the training was done on patches, the proposed network can process images of arbitrary sizes. All testing images were simply fed into the network without decomposition, requiring 74 ms of inference time per image slice.
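The early-stopping rule described above (stop once the validation loss has not improved for 5 consecutive epochs, keeping the best model) can be sketched as a small bookkeeping class. This is a generic illustration of the strategy, not the paper's actual Keras callback, and the loss sequence below is invented for demonstration.

```python
class EarlyStopper:
    # Stop when the validation loss has not improved for `patience`
    # consecutive epochs; remember the epoch with the best loss.
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = -1
        self.wait = 0

    def step(self, epoch, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best:
            self.best, self.best_epoch, self.wait = val_loss, epoch, 0
            return False
        self.wait += 1
        return self.wait >= self.patience

stopper = EarlyStopper(patience=5)
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75]  # toy validation curve
stopped_at = next(e for e, l in enumerate(losses) if stopper.step(e, l))
```

Here the best validation loss occurs at epoch 2 and training stops at epoch 7, five epochs later; in practice the weights saved at `best_epoch` are the ones restored for testing.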

Comparison with other methods
To study the effectiveness of our proposed model, we compared it with RED-CNN [15] and 3D-cGAN [21]. The network structures and parameters of these competing methods were set per the suggestions in the original papers and re-implemented in Keras. For a qualitative comparison, sample images of the FDPET predicted by the three deep learning methods, together with the corresponding LDPET and FDPET reconstructions, are shown in Figs 7 and 8 for the 10% and 30% dose levels, respectively. The images estimated by all deep learning methods show better image quality than the low-dose images, providing better noise reduction and recovery of structural details.
The quantitative measures in terms of NRMSE, SSIM and PSNR are shown in Table 2 for the 10 testing patient datasets. All three predicted images have better noise control and structural similarity than the low-dose images, but similar peak signal-to-noise ratios. The RED-CNN and 3D-cGAN models have better NRMSE scores than S-CycleGAN; however, their predicted images suffer from over-smoothing, which may compromise diagnostic performance, as shown in Figs 7 and 8 (indicated by red arrows).
As suggested by Zhang et al. [47], traditional metrics (L2/PSNR, SSIM, FSIM) can disagree with human judgments, so the learned perceptual image patch similarity (LPIPS) metric was proposed to evaluate image quality. The LPIPS measurements between each model prediction and the FDPET are shown in Fig 9. The images estimated by all deep learning methods show better LPIPS scores than the low-dose images, and S-CycleGAN obtains the best score. The average LPIPS scores of LDPET (30% of FDPET), S-CycleGAN, RED-CNN and 3D-cGAN are 0.035, 0.026, 0.031 and 0.031, respectively.

Clinical evaluation for specific VOIs
In the clinic, the mean and maximum SUVs are often used as bases for diagnosis to characterize suspicious high uptakes [31,32]. Therefore, the SUV measures are used to investigate the effectiveness of the proposed method for specific VOIs in both normal and lesion tissues. The datasets were produced by our proposed simulation framework as described in the Datasets section. In this analysis, the mean and maximum SUV biases and deviations were evaluated for all the deep learning models mentioned above. The average biases and standard deviations of SUV mean and SUV max in lesion tissues are shown in Tables 3 and 4 and Tables 5 and 6, respectively. Since SUV max is not critical for normal tissues, only the SUV mean error is shown in Table 7. The results for different lesion sizes and FDG concentrations are also shown in the above tables. As seen in Table 7, all the models have very similar SUV mean values in normal tissues, with biases of less than 5% for both the 10% and 30% dose levels. However, as seen in Tables 3 and 4, RED-CNN and 3D-cGAN have much larger biases than S-CycleGAN in lesion tissues, especially for smaller lesion sizes and lower activities. The average SUV mean biases of S-CycleGAN, RED-CNN and 3D-cGAN for all lesions and activities are -6.4±5.3%, -18.7±11.8% and -20.0±10.8% for the 10% dose level, and -2.8±4.1%, -6.3±6.4% and -9.8±6.0% for the 30% dose level, respectively. It can also be seen that the biases and deviations of SUV mean for the S-CycleGAN model decrease as the lesion size and activity increase in most cases. These observations indicate the good robustness of our proposed model. SUV max deviation: The SUV max results of all three deep learning methods are shown in Tables 5 and 6 for the 10% and 30% dose levels, respectively. Since a single pixel value in the VOI is strongly affected by the statistical properties of the data, the SUV max values in LDPET images have large biases and deviations, especially at the lower dose level.
Our proposed S-CycleGAN model tends to reduce these biases and deviations, but this ability worsens as lesion sizes decrease. The average SUV max biases of S-CycleGAN, RED-CNN and 3D-cGAN for all lesions and activities are -3.7±16.2%, -24.9±11.7% and -28.0±16.5% for the 10% dose level, and -5.2±6.8%, -11.4±11.7% and -14.8±9.8% for the 30% dose level, respectively. These results suggest that the S-CycleGAN method preserves SUV max values better than the other two methods.
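The per-VOI SUV bias evaluation above reduces to comparing mean and maximum values inside a lesion mask between the predicted and full-dose images. This numpy sketch shows that computation; the images, mask and 10% underestimation are invented for illustration.

```python
import numpy as np

def voi_suv_bias(pred, ref, mask):
    # Relative errors (%) of SUVmean and SUVmax inside a VOI mask,
    # taking the full-dose image as the ground truth.
    p, r = pred[mask], ref[mask]
    bias_mean = 100.0 * (p.mean() - r.mean()) / r.mean()
    bias_max = 100.0 * (p.max() - r.max()) / r.max()
    return bias_mean, bias_max

ref = np.full((8, 8), 4.0)            # toy full-dose SUV map
pred = np.full((8, 8), 3.6)           # uniform 10% underestimation
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                 # 4x4 lesion VOI
bias_mean, bias_max = voi_suv_bias(pred, ref, mask)
```

In the paper's evaluation, such biases are averaged over repeated simulations of each lesion configuration to reduce statistical variation.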

Ablation study
Impact of supervised learning loss: The impact of the supervised learning loss was studied for the proposed model. A modified model, named CycleGAN, was trained and tested with all the loss functions except the supervised loss L_sup. Image artifacts of missing structures are observed in about 7% of the slices generated by the CycleGAN model, as indicated by the red and yellow rectangles in Fig 10. Therefore, the use of the supervised learning loss reduces these artifacts and maintains the fidelity of the PET image. Impact of cycle-consistency loss: The effectiveness of the cycle-consistency loss was also studied by comparing S-CycleGAN with the 3D-cGAN model, which does not involve this loss. As shown in Tables 3, 4, 5 and 6, the S-CycleGAN model preserves the SUV mean and SUV max values better than 3D-cGAN, which indicates the effectiveness and necessity of the cycle-consistency loss even though it was originally designed for training on unpaired datasets.

Discussion
In order to systematically evaluate model performance, we designed a novel simulation framework to produce clinical-like data, in which the embedded lesions are extracted from clinical data with realistic structures, sizes, activities and dose levels. Such simulations are helpful for understanding the clinical performance of the proposed method, since it is almost impossible to know the true lesion uptakes in the clinic. Moreover, this method can be extended to other related model performance studies.
Although our model has achieved compelling results, some limitations remain. Our proposed model, S-CycleGAN, requires longer training time than other standard GAN-based and CNN-based methods; future work should consider more efficient architectures. Although this paper mainly focuses on PET brain images, the same model with different hyperparameters has also been applied to PET body images. More results will be presented in the near future once enough PET body datasets are acquired and trained.
Recently published papers [20][21][22] on LDPET image recovery have been extended to even lower doses. However, it is difficult to conclude which approach can reduce the dose more, since different papers use different datasets, acquisition protocols and scanners. In this paper, the training set of 99 patients has 110±23M average coincidence counts. Consequently, our proposed S-CycleGAN model takes the count variation into account during training and can be used over a relatively wide range of dose levels in complicated clinical situations. A recently published paper [48] uses a very similar method, CycleGAN, for LDPET denoising, but did not investigate SUV max or robustness to different count levels. All of these approaches compared structural similarity, noise, signal-to-noise ratio or SUV mean , but none of them evaluated SUV max . SUV max is more often used in clinical practice due to its better reproducibility than SUV mean , since the maximum value within a VOI (or region of interest) is invariant with respect to small spatial shifts [32,33]. With the supervised training mode, SUV mean can be easily preserved, but SUV max cannot. Our systematic study of SUV mean and SUV max demonstrates that the proposed model shows promising results in recovering a high-quality image from an LDPET image. However, smaller lesion sizes and lower activities degrade the performance of all the models compared in this paper.
As shown in Tables 3, 4, 5 and 6, the SUV mean values are relatively easier to preserve than the SUV max values. Our proposed model demonstrates better quantitative results than RED-CNN and 3D-cGAN at both dose levels. When predicted images at the same dose level are compared, the SUV mean values show strong dependence on lesion sizes and activity concentrations. On the other hand, the SUV max values show strong dependence only on lesion sizes, not on activity concentrations. Moreover, the SUV max values still have quite large variations even though 45 simulations are used in the evaluations. These phenomena can be partially explained by two factors. One is image noise caused by the data itself and by the reconstruction/post-processing methods, which can strongly affect SUV max values since they rely on single pixel values. The other is the partial volume effect caused by the finite system spatial resolution and image sampling, which can heavily reduce the accuracy of SUV mean and SUV max values, especially for smaller VOIs or lower activity ratios between the VOI and its surrounding background [49].
Compared to the 30% dose level, the images recovered from the 10% dose level still have good scores in normal tissues in terms of NRMSE, SSIM and PSNR, but much larger biases and deviations of SUV mean and SUV max in lesion tissues for all three deep learning methods. This alerts us to a potential risk: any diagnosis relying on these two indices could change in clinical practice. Therefore, we should be cautious in developing deep learning approaches that could largely change SUV mean and SUV max while reducing dose. For this reason, the 30% dose level is preferred in this study, since it better balances the trade-off between SUV accuracy and dose reduction.

Conclusion
In conclusion, we have introduced a novel deep learning based generative adversarial model with cycle consistency to estimate high-quality images from LDPET images. The proposed S-CycleGAN approach produces image quality comparable to the corresponding FDPET images by suppressing image noise and preserving structural details in a supervised learning fashion. Systematic evaluations further confirm that the S-CycleGAN approach preserves the mean and maximum SUV values better than the other two deep learning methods, and suggest that the amount of dose reduction should be carefully decided according to the acquisition protocols and clinical usage.