Deep reconstruction model for dynamic PET images

Accurate and robust tomographic reconstruction from dynamic positron emission tomography (PET) acquisition data is a difficult problem. Conventional methods, such as the maximum likelihood expectation maximization (MLEM) algorithm for reconstructing the activity distribution based on individual frames, may lead to inaccurate results due to the checkerboard effect and the limited photon counts. In this paper, we propose a stacked sparse auto-encoder based reconstruction framework for dynamic PET imaging. The dynamic reconstruction problem is formulated in a deep learning representation, where the encoding layers extract prototype features, such as edges, so that, in the decoding layers, the reconstructed results are obtained through a combination of those features. Qualitative and quantitative results on Monte Carlo simulation data and real patient data demonstrate the effectiveness of our method.


Introduction
Positron emission tomography (PET) is a nuclear-medicine functional imaging method that evaluates the condition of the body based on the metabolism of living tissue. After the decay of a radio tracer such as 18F-fluoro-2-deoxy-glucose (FDG), the emitted positron annihilates with an electron, generating a pair of 511 keV photons traveling in opposite directions. Once the projection data are obtained by the photon detectors, the static radioactivity distribution can be reconstructed [1]. In contrast to static PET imaging, dynamic PET imaging collects data in a series of frames. In clinical applications, dynamic information has the potential to improve early detection, the characterization of cancer, and the assessment of therapeutic response once the spatial and temporal radioactivity distribution is obtained [2]. To guarantee that rapid changes in tracer activity can be tracked promptly, the time interval between two frames in the earlier part of a scan must be short, which reduces photon accumulation and lowers spatial resolution. Reconstructing high-quality dynamic PET images while simultaneously guaranteeing temporal resolution has therefore become a challenging problem [3].
One popular approach is to reconstruct the radioactivity map for each frame independently using conventional statistical models such as maximum a posteriori (MAP) [4], [5] and penalized weighted least squares (PWLS) [6]. Because of the tradeoff between spatial and temporal resolution, these frame-to-frame strategies may lead to noisy results due to the low signal-to-noise ratio of the data. To solve this problem, an attractive approach is to append smoothness regularization terms to the objective function. Compared with other approaches such as wavelet filtering [7], total variation (TV) regularization is well suited to edge-preserving imaging problems with low signal-to-noise ratio (SNR) or few-view data sets; however, in the presence of noise, TV tends to over-smooth and blur images with natural gradients [8].
In this study, we aimed to reduce noise while preserving key features (such as the boundary of a tumor). Unlike the regularization used in the above methods, we built a deep reconstruction framework for PET images based on the stacked sparse auto-encoder (SAE) and maximum likelihood expectation maximization (MLEM), called MLEM+SAE for short. The SAE model comprised several auto-encoder templates, where each template exploited self-similarity in PET images. To incorporate the information in adjacent frames, temporal features were also learned by a single-layer sparse auto-encoder. The essence of our MLEM+SAE model was to incorporate PET image features automatically and boost the reconstruction accuracy.

Methods
The study was approved by the Research Ethics Committee of Zhejiang University. The patients provided written informed consent.

Dynamic PET imaging model
After injecting the radio tracer into the body, the detected emission data represent a series of photon pairs. The relationship between the detected photon pairs and the radioactivity distribution can be written as

ȳ_q^i = Σ_p G_pq x_p^i,  (1)

where x_p^i is the number of photons emitted from source voxel p for the ith frame, and G_pq is the probability that a photon emitted from voxel p is detected at detector q. The emission datum y_q^i is the number of photons detected at projection detector q for the ith frame, and the overline on ȳ_q^i indicates the expected number of photon detections. For dynamic PET, the whole collection of emission data Y is modeled by

Y = GX + E,  (2)

where Y^T = [y_1, y_2, ..., y_N] is the collection of the whole emission data, X^T = [x_1, x_2, ..., x_N] is the collection of original radioactivity distribution images, and E^T = [e_1, e_2, ..., e_N] is the collection of noise over the total of N frames.
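As an illustration, the measurement model above can be sketched numerically. The dimensions, system matrix, and activity values below are hypothetical placeholders, not the simulated scanner geometry used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: P source voxels, Q detector bins, N time frames.
P, Q, N = 16, 24, 3

# G[q, p]: probability that a photon pair emitted from voxel p is detected
# at detector bin q (columns sum to < 1; the remainder is undetected).
G = rng.uniform(size=(Q, P))
G /= G.sum(axis=0) * 1.25          # leave 20% of emissions undetected

# X[:, i]: radioactivity distribution of the i-th frame.
X = rng.uniform(1.0, 10.0, size=(P, N))

# Expected sinogram counts ȳ^i = G x^i, and Poisson-noisy measurements,
# so that Y = GX + E with E the counting noise.
Y_bar = G @ X
Y = rng.poisson(Y_bar).astype(float)
```

Each column of `Y` is one frame's sinogram; the low counts of short early frames appear here as large relative Poisson noise.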

Dynamic PET reconstruction framework based on SAE
Our whole framework is summarized in Fig 1. Considering that images in different frames follow the same biological metabolic process, we treated the adjacent reconstruction images as prior knowledge to aid the reconstruction of other frames. The whole framework includes two parts: training and reconstruction. During the training step, the series of reconstruction images x_1, x_2, ..., x_N was used as the input of the SAE model. The SAE model combines several encoders and a decoder, using the ground truth of the PET image in a selected ith frame as the label. After training, we obtained the weight parameters W and bias parameters b.
In the reconstruction step, we first reconstructed the dynamic PET images x̄_1, x̄_2, ..., x̄_N from the sinograms y_1, y_2, ..., y_N by MLEM, and then imported the reconstructed images x̄_1, ..., x̄_N into the SAE model with the corresponding parameters W and b trained in the training step. The estimated PET image in the ith frame could then be calculated from the Gaussian-weighted average of the output. The details are elaborated in the following section.
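The initial frame-by-frame MLEM reconstruction can be sketched as follows; the function name, iteration count, and numerical safeguard are our assumptions, since the paper does not give implementation details:

```python
import numpy as np

def mlem(G, y, n_iter=50, eps=1e-12):
    """Frame-by-frame MLEM: G is the (Q, P) system matrix, y a (Q,) sinogram."""
    Q, P = G.shape
    sens = G.sum(axis=0)           # sensitivity image, sum_q G_qp
    x = np.ones(P)                 # uniform positive initial estimate
    for _ in range(n_iter):
        proj = G @ x               # forward projection of the current estimate
        ratio = y / np.maximum(proj, eps)
        x *= (G.T @ ratio) / np.maximum(sens, eps)  # multiplicative EM update
    return x
```

Running `mlem(G, Y[:, i])` for each frame i yields the series x̄_1, ..., x̄_N that the SAE then refines.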
Training data. Training data in the form of a carefully chosen set of images could be used to represent samples of the prior, such as texture for the desired solution. Clearly, such images must be reliable and application specific.
Auto-encoder template and SAE model. Generally speaking, directly performing medical image analysis is challenging because the dimensions of medical images are usually large and the structure of biological tissue is complex. Traditional unsupervised learning methods such as principal component analysis (PCA) [9] and sparse dictionary learning [10] are widely used for data dimension reduction; however, both methods represent only a single-layer feature. To address this problem, we constructed an auto-encoder structure to infer low-level features in a lower dimension; indeed, auto-encoders can be stacked layer by layer to learn intrinsic hierarchical feature representations [11]. Fig 2 (left) shows the auto-encoder template. An auto-encoder is a three-layer network comprising an encoder and a decoder. We first reshaped the gray-level pixels of the two-dimensional reconstruction image into a column vector, considered as the input x, and then obtained the output o through the encoding and decoding processes:

h = σ(W_1 x + b_1),  (3)
o = σ(W'_1 h + b'_1).  (4)

The objective function is

min_{W,b} ||o − x||_2^2 + α_1 ||W||_F^2 + α_2 KL(ρ_W ‖ ρ_0).  (5)

In the above equations, x, h and o represent the input layer, hidden layer and output layer, respectively; σ is the sigmoid function; and W_{1,2}, b_{1,2}, W'_{1,2}, b'_{1,2} are the encoding weights, encoding biases, decoding weights and decoding biases, respectively. α_1 and α_2 are weighting parameters. ||W||_F is the sum of the Frobenius norms of all of the weight matrices, which regularizes the values of the matrix elements. KL(ρ_W ‖ ρ_0) is the Kullback-Leibler divergence penalty between the weight-matrix sparsity ρ_W and the sparsity ρ_0 set in advance:

KL(ρ_W ‖ ρ_0) = ρ_W log(ρ_W / ρ_0) + (1 − ρ_W) log((1 − ρ_W) / (1 − ρ_0)).  (6)

The first term in Eq (5) is the loss penalty; with this term, the auto-encoder template actually learns a function F(W, b) that approximates the identity function.
In other words, after learning the weights and biases, the calculated output is similar to the original input. The encoding step learns features from the original input, whose basic elements are gray-level pixels; this structure compresses the input into a representation in terms of the learned features once the number of hidden nodes is set smaller than that of the input layer. By visualizing the linear weights of the input-to-first-layer weight matrix, also called "filters", the features extracted by the first layer can be displayed directly. For smooth areas, the representation is sparse; on the other hand, the representation can be dense if a certain patch has rich information or a complex structure. Fig 3 shows reconstructions of the brain phantom (a, b) and the visualization of the corresponding filters (d, e), where the features are patches of gradually changing pixels along lines and circular curves. The second term in Eq (5) regularizes the values of the parameters to avoid overfitting, and the last term decides how much of the main feature is retained; the sparsity setting ρ_0 is larger when more features are retained. After initializing all of the parameters, we updated the weights and biases using the back-propagation algorithm [12]. The iteration gradually converged over multiple training samples. The information about the original input image is encoded in the hidden-layer vector, which captures the low-level features of the original image in a lower-dimensional space. Fig 3(f) shows the features of the brain phantom learned with the SAE.
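The objective for a single template can be sketched as follows; the variable names, batch averaging, and the per-hidden-unit form of the sparsity penalty are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_loss(X, W1, b1, W2, b2, alpha1=1e-4, alpha2=0.1, rho0=0.05):
    """One auto-encoder template: loss over a batch of patch vectors X (d, m).
    alpha1, alpha2 are the weighting parameters and rho0 the target sparsity."""
    H = sigmoid(W1 @ X + b1)            # encoding: hidden activations
    O = sigmoid(W2 @ H + b2)            # decoding: reconstruction of X
    mse = np.sum((O - X) ** 2) / X.shape[1]
    frob = np.sum(W1 ** 2) + np.sum(W2 ** 2)   # Frobenius-norm regularizer
    rho = H.mean(axis=1).clip(1e-6, 1 - 1e-6)  # mean activation per hidden unit
    kl = np.sum(rho0 * np.log(rho0 / rho)
                + (1 - rho0) * np.log((1 - rho0) / (1 - rho)))
    return O, mse + alpha1 * frob + alpha2 * kl
```

Back propagation then descends this loss with respect to the weights and biases; the sparsity term pushes most hidden units toward low average activation, which is what makes the learned filters edge-like.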
Similarly, we designed several auto-encoder templates and stacked them together to learn more detailed features, where the input layer of each template is the hidden layer of the previous one. The whole framework is shown on the right of Fig 2. Given the series of reconstruction images x_1, x_2, ..., x_N for dynamic PET imaging, we designed the SAE framework as follows. First, we reshaped all of the adjacent reconstruction images x_1, x_2, ..., x_N into a column vector and considered it the input of the first auto-encoder template. After solving for the parameters W_{1,2} and b_{1,2}, we calculated the hidden layer h_1 by Eq (3) and considered it the input of the second auto-encoder template. After several iterations, we obtained the whole SAE. Next, we added an output layer after the last hidden layer, where the number of output nodes equals the dimension of the ith frame image. Using the back-propagation algorithm with the ground truth of the middle frame i as the label, we computed the final parameters W_{s+1,s+2} and b_{s+1,s+2}.
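Once each template is trained, mapping a multi-frame input to the estimate of the ith frame is just a chain of encodings followed by the output layer. A sketch, with randomly initialized stand-ins for the trained parameters (the layer sizes follow the settings reported later in the paper: 3 frames of 7 × 7 patches, hidden layers of 200 and 100 nodes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_forward(x, layers):
    """Forward pass through the stacked encoders and the final output layer.
    layers: list of (W, b) pairs; the hidden layer of each template feeds
    the next, and the last pair maps to the i-th frame patch estimate."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

# Hypothetical parameters: 3 x (7 x 7) = 147 input nodes, hidden layers of
# 200 and 100 nodes, and a 49-node output (one 7 x 7 patch of frame i).
rng = np.random.default_rng(0)
dims = [147, 200, 100, 49]
layers = [(rng.normal(scale=0.05, size=(dims[i + 1], dims[i])),
           np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
patch_estimate = sae_forward(rng.uniform(size=147), layers)
```

In the trained model, the `layers` list would hold the greedily pre-trained W_{1,2}, b_{1,2} and the supervised W_{s+1,s+2}, b_{s+1,s+2}, rather than random matrices.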

Implementation
Model initialization. The first preparation was to initialize the parameters to be solved in the above auto-encoder templates. Random initialization does not always produce the optimal solution, for the following reasons: larger weights easily lead to a local optimum, while with smaller weights the gradient may vanish during back propagation, a phenomenon called gradient diffusion. To avoid these situations, we adopted the restricted Boltzmann machine (RBM) to initialize the parameters and solved it using the contrastive divergence algorithm [13].
As shown in Fig 4, the restricted Boltzmann machine comprises a visible layer {v_i}, i = 1, 2, ..., N_v and a hidden layer {h_j}, j = 1, 2, ..., N_h. W is the N_v-by-N_h matrix connecting the visible layer to the hidden layer, and the joint probability density of the RBM is

P(v, h) = (1/Z) exp(−E(v, h)),  (7)

where Z is the normalization constant and E(v, h) is the energy function defined as

E(v, h) = −Σ_i b_i v_i − Σ_j c_j h_j − Σ_i Σ_j v_i W_ij h_j,  (8)

where b and c are the biases of the visible layer and hidden layer, respectively. By maximizing the likelihood function, we can train the parameters θ = (W, b, c). First, because there is no connection between nodes in the same layer, we can factor the conditional distributions as

P(h | v) = Π_j P(h_j | v),  P(v | h) = Π_i P(v_i | h),  (9)

and thus obtain the conditional probabilities

P(h_j = 1 | v) = σ(c_j + Σ_i W_ij v_i),  P(v_i = 1 | h) = σ(b_i + Σ_j W_ij h_j).  (10)

Given training samples {v_1, v_2, ..., v_N} that satisfy an independent distribution, we need to maximize the following likelihood function:

max_θ Σ_n log P(v_n) − λ ||W||_F^2,  (11)

where λ is the relative coefficient between the data term and the regularization term. After calculating the gradient of the likelihood function with respect to the parameters, we can solve the problem using the gradient descent algorithm.

Parameter training. In the actual experiment for dynamic PET reconstruction, we did not consider the whole image as the input vector but only a small patch, for the following reasons. First, the dimension of a medical image is very large: assuming the size of one image is l by k, the total number of nodes in the input layer is l × k × N, and too many input nodes make the training process slow and an accurate solution difficult to obtain. Second, because of the limited number of training images, dividing the images into patches increases the sample quantity, avoiding underfitting. Algorithm 1 shows the detailed steps, including:

4. Obtain the reconstruction images x_1, x_2, ..., x_N from the emission data.
5. For each patch in x:
6. Consider the pixels of the patches as the input and initialize the parameters W_{1,2}, b_{1,2} using the RBM.
7. Calculate the hidden layer and consider it the next input of the auto-encoder.
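The contrastive divergence training of the RBM can be sketched as a single CD-1 update for a binary RBM; the learning rate and sampling details are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: (Nv,) visible sample; W: (Nv, Nh); b, c: visible/hidden biases."""
    # Positive phase: sample the hidden units given the data.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.uniform(size=ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(b + h0 @ W.T)
    v1 = (rng.uniform(size=pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)
    # Approximate gradient ascent on the log-likelihood.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```

Iterating `cd1_step` over the training patches yields the W and b used to initialize the corresponding auto-encoder template before back propagation fine-tunes them.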

Experiments
To confirm the accuracy and effectiveness of our method, we conducted several experiments using Monte Carlo simulated data and real patient data. We compared our method (MLEM+SAE for short) with the traditional MLEM algorithm and a total variation (TV) regularization-based method. All of the code was implemented and run in MATLAB R2014a (MathWorks Corporation, USA) on a personal computer with an Intel Core i5 CPU and 8 GB of memory.
In this paper, we used the following three quantitative indexes for comparison:

SNR = 10 log_10 ( Σ_i û_i^2 / Σ_i (u_i − û_i)^2 ),
Bias = (1/n) Σ_i |u_i − û_i| / û_i,
Variance = (1/n) Σ_i ((u_i − ū_n) / ū_n)^2,

where u_i is our reconstruction value at the ith pixel, û_i is the ground-truth value at the ith pixel, ū_n is the mean value, and n is the number of pixels in the region of interest. The SNR reflects the accuracy of the reconstructed image: a higher SNR indicates better results. The bias and variance reflect the difference between the result and the ground truth: a lower bias or variance indicates better results.
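A sketch of the three indexes, assuming common definitions consistent with the description (the exact normalization in the paper's original equations may differ):

```python
import numpy as np

def snr(u, u_true):
    """SNR in dB between reconstruction u and ground truth u_true."""
    return 10.0 * np.log10(np.sum(u_true ** 2) / np.sum((u - u_true) ** 2))

def bias(u, u_true):
    """Mean relative deviation from the ground truth over the ROI."""
    return np.mean(np.abs(u - u_true) / u_true)

def variance(u):
    """Normalized variance of the reconstruction within the ROI."""
    u_mean = np.mean(u)
    return np.mean(((u - u_mean) / u_mean) ** 2)
```

Each function takes the flattened pixel values of a region of interest, so the same code serves the per-ROI comparisons reported below.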

Parameter setting
Before the experiments, we initialized our model with the number of frames N = 3, number of hidden layers s = 2, numbers of nodes in the hidden layers h_1 = 200 and h_2 = 100, and patch size l × k = 7 × 7. The aim of our model is to exploit temporal relations to aid reconstruction. For dynamic PET imaging, the previous and the next frames are the most closely associated with the frame to be reconstructed, so we set N = 3, and the input layer of our model takes the (i − 1)th, ith and (i + 1)th frames. The number of hidden layers in most stacked auto-encoder applications is set to two to four [14] [15] [16]. As our model is a low-level vision task, which emphasizes pixel-level features, an overly deep network is not needed; therefore the number of hidden layers was set to s = 2. The number of nodes in the input layer is decided by the chosen patch size: small patches carry less information, and large patches make the whole framework more difficult to train. By contrast, the total number of nodes in the hidden layers decides the learning ability of the framework: too few nodes lead to underfitting, and too many nodes make training difficult and promote overfitting. Different settings influence the quality of the final results, and all of the parameters were decided by multiple tests. For the SAE framework, we examined two parameters governing its complexity and capability: the total number of nodes in the input layer and in the hidden layers, respectively. Figs 5 and 6 show the influence of the patch size and of the total number of nodes in the hidden layers. The two figures show similar trends: at the beginning, as the patch size and number of nodes increase, the learning ability of the framework becomes stronger, and the corresponding SNR results improve.
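The patch-based input described above can be sketched as follows; the stride and the row-major layout of the patch vector are our assumptions:

```python
import numpy as np

def extract_patches(frames, patch=7, stride=3):
    """Stack co-located patch vectors from N adjacent frames.
    frames: (N, H, W) reconstructed images; returns (num_patches, N*patch*patch),
    so each row is one SAE input vector (e.g. 3 frames x 49 pixels = 147 nodes)."""
    N, H, W = frames.shape
    rows = []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            block = frames[:, r:r + patch, c:c + patch]  # same window in all frames
            rows.append(block.reshape(-1))
    return np.asarray(rows)
```

Overlapping windows (stride smaller than the patch size) multiply the number of training samples per image, which is exactly the underfitting remedy cited in the parameter-training discussion.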
Beyond a certain point, the results for larger patches worsen: the framework overfits, and the corresponding SNR results degrade quickly. Overall, the best patch size was approximately 7 × 7, and the best total number of nodes in the hidden layers was approximately 400.

Accuracy
Monte Carlo simulation. We conducted experiments on the brain phantom [17] and the Zubal phantom [18], using the toolbox GATE [19] to perform the Monte Carlo simulation. The benefit of Monte Carlo simulation is that we could simulate the physical and physiological processes and compare our method with other reconstruction algorithms against the ground truth. In this paper, the simulated dynamic PET scanner was a Hamamatsu SHR74000, and the corresponding radio tracer was 18F-FDG. The sinogram of the brain phantom had 64 × 64 projections, and the number of frames was 18. The sinogram of the Zubal phantom had 128 × 128 projections, and the number of frames was 16. We also obtained data with different counting rates, i.e., with the number of coincidence events set to 5 × 10^4, 1 × 10^5, 5 × 10^5 and 1 × 10^6. During reconstruction, we also applied random correction, attenuation correction, scatter correction and normalization correction. The ground truth and the different ROIs for the brain phantom and the Zubal phantom are shown in Fig 7. The network was trained on 700 sets of Monte Carlo simulated data and tested on 100 sets; each set had more than 20000 patches. Figs 8 and 9 show the reconstruction results by MLEM, MLEM+SAE and TV for different frames of the brain phantom and the Zubal phantom. Figs 10 and 11 display selected areas of the reconstruction results by MLEM and MLEM+SAE. From the images, the reconstruction results of our method demonstrate sharper edges and higher pixel values in the radiation area. The MLEM algorithm introduces much noise due to the checkerboard effect during iterations; by fusing multiple image data, our method provides a clearer and smoother result. The reconstruction results by TV show good performance when the photon counting rates are high, in the 1st, 3rd and 5th frames, but in the 7th and 9th frames there is much noise, which leads to over-smoothing and fuzzy boundaries.
However, the reconstruction process of our method does not smooth away the boundaries, because the SAE learns the features of the edges and corners in the images, preserving more image detail.
Different regions in our simulated data have different physiological and physical properties, which may lead to different reconstruction results. For further analysis, we also performed comparisons among different regions of interest. Fig 12 shows the SNR, bias and variance comparisons among the different regions of interest. From Fig 12, we can conclude that, regardless of the region of interest, our method improves the SNR and suppresses the bias and variance in all of the frames.
Real data. The patient heart data were obtained from a local hospital using a Hamamatsu SHR-22000 whole-body PET scanner, which has 32 crystal rings. In the real patient data reconstruction results, much noise again appears in the results of the MLEM algorithm. By fusing the series of frames, our method produces a cleaner and smoother result; moreover, the high-concentration radioactive region remains bright and distinct.

Robustness
To test the robustness of our results, we set different counting rates, i.e., different numbers of coincidence events during the Monte Carlo simulation. Fig 14 shows the reconstruction results under the different counting settings, and detailed index comparisons are given in Tables 1 and 2. From these figures and tables, we can conclude that the MLEM algorithm produces much more noise when the counting rate is low, and the quality of the reconstructed image improves with increasing counting rate. By learning how to fuse the adjacent reconstruction images, our method can still produce a high-accuracy reconstruction with much less noise and clearer boundaries.

Discussion and conclusion
We have developed an SAE model for dynamic PET reconstruction. Compared with existing methods, our algorithm stands out in three main respects: i) by using multiple frames as the input, much noise is smoothed and suppressed in the non-radioactive and continuously radioactive areas; ii) owing to the multi-layer model for feature learning, our algorithm can recover many more details at boundaries and in complex areas; iii) by processing the image patch by patch, our algorithm can work on dynamic PET emission data regardless of the original image size. To demonstrate the effectiveness of our method, we have evaluated our algorithm with different numerical indexes on Monte Carlo simulation data and real patient data.
One major concern is tissue or patient specificity. Because our parameters are trained in advance, our model may fail to learn the features of new, different test tissues in PET images, which will influence the quality of the reconstructed images. As shown in Fig 15, the top row presents the reconstruction results of the MLEM algorithm for the Zubal phantom, and the bottom row presents the results of MLEM+SAE trained using only patches of the brain phantom. From the figure, our method still produces better results than the traditional MLEM algorithm; however, due to the limitation of the training data, the results are not as good as those in Fig 9, in terms of either noise reduction or edge preservation. Table 3 shows the detailed index comparisons, where our method still achieves a higher signal-to-noise ratio and suppresses the bias and variance. With the development of computing power and the growing availability of medical data, we believe a stronger computational model can be proposed. In contrast to traditional reconstruction algorithms that work on the emission data directly, our method needs an initial training step before reconstruction is performed for multiple frames in the reconstruction step; thus, our method requires much more computational time. Typically, the MLEM algorithm needs 0.8-1.5 seconds to converge, but our method needs approximately 40-50 seconds, including training and reconstruction. When the data and image dimensions increase, we can use more advanced optimization methods, such as accelerated batch gradient descent, or more complicated structures, such as a convolutional neural network. Regarding hardware, the graphics processing unit (GPU) can provide high computational efficiency.