GAN-based medical image small region forgery detection via a two-stage cascade framework

Using generative adversarial networks (GANs) Goodfellow et al. (2014) for data augmentation of medical images is significantly helpful for many computer-aided diagnosis (CAD) tasks. However, a new GAN-based automated tampering attack, CT-GAN Mirsky et al. (2019), has emerged. It can inject lung cancer lesions into CT scans or remove them. Because the tampered region may account for less than 1% of the original image, even state-of-the-art methods struggle to detect the traces of such tampering. This paper proposes a two-stage cascade framework to detect GAN-based medical image small region forgery such as CT-GAN. In the local detection stage, we train the detector network on small sub-images so that interference information in authentic regions does not affect the detector. We use depthwise separable convolutions and residual connections to prevent the detector from over-fitting and enhance its ability to find forged regions through an attention mechanism. The detection results of all sub-images in the same image are combined into a heatmap. In the global classification stage, the gray-level co-occurrence matrix (GLCM) is used to better extract features of the heatmap. Because the shape and size of the tampered region are uncertain, we use an SVM, whose kernel corresponds to hyperplanes in an infinite-dimensional space, for classification. Our method can classify whether a CT image has been tampered with and locate the tampered position. Extensive experiments show that our method achieves better performance than state-of-the-art detection methods.


Introduction
Due to the privacy of medical images, the lack of data has always been a significant problem for machine learning tasks related to medical images. One way to alleviate this problem is the GAN [1], which can generate images that are highly similar to real images. GANs have therefore attracted wide attention in the medical image field, and several studies have used them to generate medical images for data augmentation with gratifying performance. The quality of GAN-generated images is sufficient to confuse radiologists. Therefore, once this technology is used for malicious attacks, it will lead to serious consequences.
Deep convolutional neural networks can detect GAN-generated images [3][4][5], and detection accuracy can be further improved through feature engineering [6][7][8][9][10][11]. To the best of our knowledge, however, there is no detection method designed for GAN-forged medical images. Although there is no specific solution for detecting medical images generated by GAN, some domain-generic methods exist. For example, Frank et al. used the discrete cosine transform (DCT) to detect GAN-generated images [10]. Marra et al., through incremental learning, can detect images from new GANs with only a small number of samples [12]. Cozzolino et al. learn feature extraction through auto-encoders and generalize the model with a small number of samples [13].
CT-GAN [2], a GAN-based attack that is difficult to detect even with state-of-the-art methods, has emerged. It can inject large lung nodules into CT images or remove them. Examples of CT-GAN inject/remove tampering of lung nodules are shown in Fig. 2. The number of large lung nodules is a significant marker of lung cancer. Therefore, CT-GAN can make doctors misjudge the patient's condition, seriously threatening the patient's life. In addition, this attack may also be used to defraud medical insurance or maliciously discredit competitors. New attacks like CT-GAN challenge current detection methods: CT-GAN generates only a minimal area, and the surrounding area is used as a constraint condition to train a conditional generative adversarial network (CGAN).
Figure 1: Overview of our method. We cut out small sub-images from CT slices to train the local detection neural network. Each sub-image is detected and assigned a tampered probability. The detection results are combined according to their positions to generate a heatmap. Then we use GLCM to extract features from the heatmap, which are used for PCA and SVM model training. We use the trained model for global classification. (Step 1: local detection; Step 2: global classification; heatmap features: 40,000, reduced to 256.)
In that case, the generated image will be closer to the real image. We call an attack that uses a CGAN to forge a very small region of an image a GAN-based small region forgery attack. At present, no solution can effectively detect GAN-based small region forgery attacks in medical images. The characteristic of this attack is that the ratio of the region generated by GAN is very small. Some methods, such as Rössler et al. [4], can detect partial generation such as face manipulation. However, because the style, content, and storage format of medical images are very different from normal images and the tampered region is too small, even state-of-the-art detection methods cannot effectively detect GAN-based small region forgery attacks in medical images. It is conceivable that medical image security is facing a considerable threat.
To solve the above-mentioned problem, we propose a novel cascade framework, based on a local detection network and a global classification method, that can detect GAN-based small region forgery attacks in medical images. The first stage is local detection. We crop small sub-images from the CT slice image to train the detector network. The sub-image size is small enough that interference information in authentic regions does not affect the detector. Because the training data has only a single channel and the data size is small, the model easily over-fits. Therefore, we design a lightweight neural network with fewer parameters and use early stopping to prevent over-fitting. After training, the detector can detect the tampered region effectively. Then, we traverse the entire CT slice image sub-image by sub-image. The detection results of all sub-images are combined and output as a heatmap, which indicates which areas may have been tampered with.
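The local detection stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `detect_fn` stands in for the trained local detector, and the rule for combining overlapping windows (maximum here) is an assumption.

```python
import numpy as np

def detect_heatmap(ct_slice, detect_fn, win=32, stride=4):
    """Slide a win x win window over a CT slice and build a heatmap.

    `detect_fn` stands in for the trained local detector: it maps a
    (win, win) sub-image to a tampered probability in [0, 1].
    Overlapping windows are combined by keeping the maximum probability
    (the exact combination rule is an assumption; averaging also works).
    """
    h, w = ct_slice.shape
    heatmap = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            p = detect_fn(ct_slice[y:y + win, x:x + win])
            patch = heatmap[y:y + win, x:x + win]
            np.maximum(patch, p, out=patch)
    return heatmap
```

Regions the detector flags with a high probability then light up in the heatmap that the global classification stage consumes.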
The second stage is global classification. We use the gray-level co-occurrence matrix (GLCM) to extract features from the heatmap and then train PCA and SVM models on these features. Since CT-GAN can adjust the size of the tampered area to a certain extent, this combination yields a global classification that is robust to varying tampered-area sizes.
Compared with methods that use the whole image as input, our method can locate the tampered coordinates, requires less training data, trains faster, and achieves higher accuracy.
The main contributions are as follows:
• A novel cascade framework based on local detection and global classification is proposed to detect and locate the tampered regions caused by CT-GAN attacks, including both injection and removal.
• A local detection network with channel attention, spatial attention, depthwise separable convolution, and residual connections, which can better find the information of small regions in the image and prevent over-fitting.
• A global classification method based on PCA and SVM with the gray-level co-occurrence matrix (GLCM) as input features, which effectively cooperates with local detection to classify medical images.
• Experiments show that for GAN-based small region forgery attacks in medical images like CT-GAN, our method achieves excellent performance.
The rest of this paper is organized as follows. In Section 2, we discuss the background and related work on the detection of GAN-generated images in recent years. In Section 3, we explain our method in detail. In Section 4, we describe our experimental results. In Section 5, we discuss our method, and in Section 6 we draw our conclusions.
2 Background and related work

Medical image
Medical imaging uses particular media that interact with the human body to show the structure of its internal tissues or organs. Digital Imaging and Communications in Medicine (DICOM) is an international standard for medical images and their related information. It is widely used in various radiological diagnostic equipment (X-ray, CT, MR, ultrasound, etc.), and medical images of patients are stored in the DICOM file format. The data used in this paper are mainly CT images in DICOM format. CT equipment scans slices one after another around a certain part of the patient's body. Fig. 3 shows an example of a CT scan. The scanned image is multi-layered; a three-dimensional image can be formed by stacking the slice images along the z-axis.
The definition of medical images such as CT is positively correlated with radiation dose, but high-dose radiation may damage the patient's health, so it is difficult to improve the definition of medical images. Besides, medical images have only one channel, so a GAN can fit the distribution of medical images more easily than that of normal three-channel color images.

Generative adversarial network
Since GAN was proposed by Goodfellow et al., it has been one of the hot spots in the CV field. The GAN model differs from traditional neural network structures: it includes a generative model G and a discriminative model D. G generates a new sample from random noise, and D distinguishes whether the input sample is a real sample. The task of G is to generate images that D cannot distinguish from real ones, while the task of D is to distinguish between the images generated by G and the real images. The two networks compete against each other during training through this min-max game. In this way, G can learn the data distribution of the real samples. Up to now, GAN has derived a large number of variants, such as WGAN [14], PGGAN [15], StyleGAN [16], and so on. These variants are widely used in various CV tasks. (A complete CT scan may include about 300 slice images; only a portion is selected in Fig. 3.)

Application of GAN in medical image
Medical images differ from normal images and carry strong privacy requirements. Even though there are many public data sets such as LIDC-IDRI, DDSM, MIAS, OASIS, etc., medical data sets are still insufficient. Because GAN can effectively alleviate the lack of training data, there is a large body of research applying GANs to medical imaging.

Detect the GAN-generated image
Because of the high performance of GAN, it has gradually become a trend to use deep learning to distinguish whether an image is generated by GAN. Due to their excellent performance in CV tasks, convolutional neural networks (CNNs) such as ResNet [34], XceptionNet [35], and EfficientNet [36] are widely used in various CV fields, including digital image forensics [3,5,7]. Besides, Rössler et al. [4] demonstrated the superior performance of XceptionNet in image source detection.
Using certain features can make the network perform better. One way to distinguish whether an image is generated by GAN is to use GAN fingerprints [8]: owing to its structure, a GAN leaves special fingerprints in the generated image, which can be learned as features through deep learning and then used to identify the source of the image. Some works exploit the shortcomings of GANs to find special features that better distinguish whether an image is generated by GAN. For example, McCloskey and Albright found that the saturated or underexposed pixels of an image are suppressed by the normalization operation of the GAN generator [6], and used this feature to distinguish real camera images from GAN images. Because the statistical characteristics of GAN images differ from those of real images, others use three co-occurrence matrices on the RGB channels as features to distinguish the source of the image [9]. Zhang et al. suppress image content information by converting the image to the YCrCb color space and then use the Scharr operator and the gray-level co-occurrence matrix (GLCM) to obtain edge features, allowing them to detect both GAN images and copy-move forgeries [7]. In addition, some works identify the source of the image from the defects of up-sampling operations in GANs. Frank et al. found that the up-sampling in GAN causes grid-like artifacts in the generated images after a DCT operation [10], which can be used to distinguish the source of the image. Durall et al. found that images generated by GAN cannot reproduce the actual spectral distribution [11], which is also due to the up-sampling operation. Therefore, after using azimuthal integration to extract the spectral features, an SVM or K-Means can distinguish the source of the image without training a deep CNN.

Challenge
The CT-GAN paper also proposed some detection methods that may be useful. Unfortunately, these methods are not suitable for GAN-based small region forgery attacks in medical images like CT-GAN, for the following reasons.
On the one hand, there is a huge difference between medical images and normal images. Medical images show the structure and density of human internal tissues or organs, so they have unique content and style. Medical images, such as CT, MR, X-ray, etc., are all taken with special equipment different from general photographing equipment and are saved according to the DICOM standard. In terms of data format, medical images are single-channel, and the pixel value range of a medical image is about 4096 levels. Compared with a normal gray-scale image ranging from 0 to 255, the pixel value range of medical images is 16 times larger. Therefore, pre-trained models for normal images have little effect on medical images. In addition, methods that need to extract features from three channels of an image, such as [7], which compares features extracted from different channels, are not applicable. The co-occurrence matrix is one of the most effective features for distinguishing whether an image is generated by GAN. However, due to the expanded pixel value range, the cost of computing the co-occurrence matrix becomes unacceptable, so the method using the co-occurrence matrix [9] cannot work either. On the other hand, GAN-based small region forgery attacks are more difficult to detect. Take CT-GAN as an example. CT-GAN is a 3D CGAN whose 3D network references the pix2pix structure; the generator of CT-GAN is a 3D UNet [37]. It cuts a small cuboid from a series of CT slices of the patient, scales it into a small cube of 32³ voxels, and masks the 16³ voxels at the center of the cube to zero. This cube with a masked center is input into the generator as a condition. CT-GAN trains two models, which can generate large or small nodules at the cube's center. It is worth mentioning that the size of a CT image is (512 × 512) pixels, of which the region modified by CT-GAN is less than (32 × 32). That means the minimum number of pixels that have been tampered with is only 1/1024 of the total. Fig. 4 shows a CT image injected with a lung nodule.
Hence, as we can see from Fig. 2 and Fig. 4, each tampering operation by CT-GAN modifies 1/1024 to 1/256 of the pixels of the image. Even when a CT image has been tampered with at four different locations, as in Fig. 4, the tampered pixels account for only about 1%, and generally a CT image does not need to be operated on so frequently. In other words, 99% of an image is interference information. The untampered part acts as a "cover" for the tampered part, which seriously hinders the model from learning the difference between positive and negative images. That is why even state-of-the-art methods struggle to detect the tampering traces of CT-GAN, and why many methods based on statistical characteristics, such as [6,10,11], are ineffective. Worse still, CNNs are not sensitive enough to small tampered regions, so it is difficult to detect such attacks accurately; unfortunately, almost all current detection methods are based on deep CNNs. The current methods for distinguishing whether an image is generated by GAN are aimed at images wholly or mostly generated by GAN; there is no detection model dedicated to GAN-based small region forgery attacks in medical images like CT-GAN for the time being. In Section 4.4, we tried to use the whole CT image as input to train state-of-the-art networks; unfortunately, the results were inferior.
3 Our method

Motivation
Medical images are critical private information and are vitally important to the patient's life. At present, the integrity of medical images faces the threat of GAN-based small region forgery attacks, yet there is no practical method to detect them. There are two main reasons why attacks like CT-GAN are difficult to detect. First, the style, content, and storage format of medical images are very different from normal images. If we convert DICOM images to any other image format, information such as pixel depth or meta-data is lost. Therefore, a model trained on normal images cannot generalize well to CT images, making pre-trained models unusable, and training a new model needs much more data. Unfortunately, the sample data of medical images is very limited due to restrictions on the use of health data under privacy regulations like GDPR or CCPA. Hence, we cannot solve this attack from the perspective of training data. Second, what makes the detection task more challenging is that the tampered region is very small while the entire CT image is large.
As mentioned above, the tampered region is less than (32 × 32) while the entire CT is (512 × 512). This means the ratio of a single tampered region to the original image may be less than 0.4%, which greatly reduces the sensitivity of general CNN detection methods, since the loss of spatial information limits the learning ability of CNNs. Hence, directly detecting the whole image results in very low accuracy. For these reasons, even state-of-the-art methods struggle to detect CT-GAN attacks.
Although no specific method can be implemented directly to detect the GAN-based small region forgery attack in medical images, some research works can still inspire us to design an effective method.
Rössler et al. [4] used a face tracking method to extract the face region of the image. They found that using the extracted facial information as the input of the detector is more accurate than using the entire image directly. This suggests that a neural network can achieve better performance if the classifier focuses on more precise regions. Following this idea, we refer to the common preprocessing method of copy-move forgery detection and make the detector pay more attention to local parts of the image through a sliding window. Specifically, we split the target CT image into many small sub-images to train a local detector, together with a corresponding method that uses the local classification results to determine the global classification result.
Chollet [35] replaced Inception modules with depthwise separable convolutions and proposed XceptionNet. Since XceptionNet makes more efficient use of model parameters than Inception V3 [38], it shows better runtime performance and higher accuracy on large-scale datasets like ImageNet while having fewer parameters than general deep CNNs. This architecture can effectively reduce over-fitting when we cannot collect more data. Therefore, considering these features, we designed our method inspired by XceptionNet to detect GAN-based local tampering attacks.

Overview
Our detection method is divided into two stages: local detection and global classification. The method we propose is outlined below.
In the local detection stage, small sub-images are cut out from CT slices in a planned way to train the local detector neural network. The cropped sub-image is small enough relative to the tampered area that the authentic regions cannot hinder the detector, which can thus focus on learning the difference between real and GAN-generated content. At test time, the tampered area may be hidden anywhere in the image. Therefore, to minimize missed detections, our method examines each sub-image produced by the sliding window and predicts its tampered probability. When all sub-images have been detected, the results are combined according to their positions to generate a heatmap. This heatmap intuitively reflects which areas in the original image may have been tampered with by GAN. In the global classification stage, we use GLCM to extract features from the heatmap, which are used for PCA and SVM model training; GLCM makes the features of the heatmap more prominent. We use the trained model for global classification.
Intuitively, our method lets the neural network observe the details of the image more carefully instead of looking at the overall picture, so it performs better when facing GAN-based small region forgery attacks.

Local detection network architecture
Because our training data is insufficient and our hardware is not powerful, we prefer a lightweight network as the local detector. Depthwise separable convolution reduces the number of required training parameters while maintaining a good training effect. For example, XceptionNet [35] and MobileNet [39] both construct the primary part of the network with depthwise separable convolutions and perform well in image classification. But our classification task does not need such a deep network: because the structure of the training data is simple, a network like XceptionNet or MobileNet would waste computing resources and could lead to network degradation or over-fitting. Therefore, we designed a shallower network as our sub-image classifier based on depthwise separable convolution. Our network structure is shown in Fig. 5. The network's input is a (32 × 32) image matrix, and the features of the image are extracted through a small number of traditional convolutions and a large number of depthwise separable convolutions.
Figure 5: The network architecture. Unless stated otherwise, the stride of each convolution is 1, padding is "SAME", the activation function is ReLU, and each convolution and depthwise separable convolution layer is followed by batch normalization.
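The parameter savings that motivate this design choice can be checked with simple arithmetic: a depthwise separable convolution replaces one k × k × C_in × C_out filter bank with a per-channel k × k depthwise filter followed by a 1 × 1 pointwise convolution.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def sep_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution: one k x k
    depthwise filter per input channel, then a 1 x 1 pointwise
    convolution mixing channels (bias omitted)."""
    return k * k * c_in + c_in * c_out

# e.g. a 3x3 layer mapping 64 -> 128 channels
print(conv_params(3, 64, 128))      # 73728
print(sep_conv_params(3, 64, 128))  # 8768
```

For a typical layer the separable variant needs roughly an order of magnitude fewer weights, which is exactly what helps against over-fitting on small single-channel data.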
The attention mechanism can effectively improve the performance of deep learning models and is often used in copy-move detection and other detail-oriented tamper detection methods. Inspired by Woo et al. [40], we design a simple attention mechanism for our network, shown in Fig. 6. The spatial attention and channel attention can be computed by Eq. (1) and (2). The network places the channel attention module after the convolution blocks with the largest number of channels, where channel attention is most significant. Similarly, because the pooling layers further reduce the size of the feature maps, we place the spatial attention module before the convolution block containing the pooling operation, where the feature map size is largest. After adding the attention modules, the training cost does not increase much, but the global classification performance improves significantly.
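A NumPy sketch of CBAM-style channel and spatial attention in the spirit of Woo et al. [40] is shown below. The exact forms of Eq. (1) and (2) in the paper are not reproduced here, so the MLP shapes and the scalar stand-in for the spatial convolution are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (H, W, C). Squeeze spatial dims by average- and max-pooling,
    pass both descriptors through a shared two-layer MLP
    (w1: C x C/r, w2: C/r x C), and gate each channel with a sigmoid
    weight (CBAM-style sketch)."""
    avg = feat.mean(axis=(0, 1))                       # (C,)
    mx = feat.max(axis=(0, 1))                         # (C,)
    att = sigmoid(np.maximum(avg @ w1, 0) @ w2 +
                  np.maximum(mx @ w1, 0) @ w2)         # (C,)
    return feat * att                                  # broadcast over H, W

def spatial_attention(feat, w):
    """Pool along the channel axis and mix the two maps with scalar
    weights w = (w_avg, w_max), a stand-in for CBAM's 7x7 convolution,
    then gate each spatial position."""
    avg = feat.mean(axis=2)                            # (H, W)
    mx = feat.max(axis=2)                              # (H, W)
    att = sigmoid(w[0] * avg + w[1] * mx)              # (H, W)
    return feat * att[..., None]
```

In the network, the channel gate sits after the widest blocks and the spatial gate before the pooling blocks, matching the placement described above.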
The design of the residual blocks follows ResNet [34]. The number of convolution kernels (channels) is unchanged when the input feature map has the same size as the output, and is doubled when the sizes differ (after a pooling layer). At the end of the network, SELU [41] is used as the activation function in the fully connected layers. The SELU function is given by Eq. (3), where λ and α are two carefully chosen constants; it performs better than ReLU in fully connected layers. With the above structure, our network saves computing resources while maintaining high accuracy.
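For reference, the SELU activation cited as Eq. (3) has the standard form from Klambauer et al. [41]:

```latex
\mathrm{selu}(x) = \lambda
\begin{cases}
x, & x > 0 \\
\alpha\, (e^{x} - 1), & x \le 0
\end{cases}
\qquad \lambda \approx 1.0507,\quad \alpha \approx 1.6733
```

These values of λ and α are chosen so that activations self-normalize toward zero mean and unit variance across layers.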

Global classification method
In our method, the window size is fixed, which differs slightly from the sliding window commonly used in object detection tasks. In object detection, if the window covers only part of the target, the model may not be able to classify it correctly, so the window size must be adjusted and the image traversed multiple times. In our task, however, even if the window covers only part of the GAN-generated region, the model can determine whether it was generated by GAN with high accuracy. Therefore, we only need a small window size to avoid the influence of authentic regions.
First of all, we calculate a series of coordinates as the center coordinates of the sub-images. Then we crop a sub-image of size (32 × 32) at each center coordinate. The cropping is restricted because the practical part of a CT image is only the inside of the circle inscribed in the square frame, and most tampering occurs in this area; including the extra part in the calculation would waste a lot of time and space.
Let the CT image size be CT_size, the sub-image size img_size, and the stride s. We calculate the longitudinal coordinates of all rows as h_i = img_size/2 + i · s, where i = 0, 1, 2, ..., ⌊(CT_size − img_size)/s⌋ (Eq. 4). Then, for a row with y = h and practical width w, the horizontal coordinates are computed in the same way, with j = 0, 1, 2, ..., ⌊(w − img_size)/s⌋ (Eq. 6). Our method records the output result for each sub-image and classifies whether the image has been tampered with according to these results. In addition to the final prediction, our model can also generate a heatmap from the per-sub-image results (see Fig. 4). For areas not covered by the formulas above, we default the tampered probability to 0.
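The coordinate scheme can be sketched as below. The original equation images were lost, so the first-center offset of img_size/2 and the chord-width formula w = 2·sqrt(r² − (h − r)²) for the inscribed circle are our assumptions, chosen to be consistent with the index ranges of Eq. (4) and (6).

```python
import math

def row_centers(ct_size=512, img_size=32, stride=4):
    """Vertical center coordinates h_i = img_size/2 + i*stride
    (our reading of Eq. (4); the published equation image was lost)."""
    n = (ct_size - img_size) // stride
    return [img_size // 2 + i * stride for i in range(n + 1)]

def col_centers(h, ct_size=512, img_size=32, stride=4):
    """Horizontal centers for the row y = h, restricted to the circle
    inscribed in the square slice. The chord width w at height h and
    the starting offset are assumptions consistent with Eq. (6)."""
    r = ct_size / 2
    d2 = r * r - (h - r) ** 2
    if d2 <= 0:
        return []
    w = 2 * math.sqrt(d2)                  # practical width of this row
    x0 = (ct_size - w) / 2 + img_size / 2  # first window center
    n = int((w - img_size) // stride)
    return [x0 + j * stride for j in range(n + 1)]
```

With the defaults (512, 32, 4), each full-width row has 121 centers stepping from 16 to 496, and rows near the top and bottom of the circle contribute fewer centers or none.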
Generally speaking, attackers usually make the tampered region square or round in GAN-based small region forgery attacks. Therefore, after local detection of a tampered slice, a series of sub-images should in theory be judged as positive, so the tampering traces of CT-GAN could be detected through fixed patterns, such as n consecutive sub-images in the vertical and horizontal directions being judged as fake. However, this detection method is not robust: the size of the GAN-tampered area is not fixed, the attacker can adjust the size of the CT-GAN tampered area to a certain extent, and many studies use GAN to generate larger and higher-resolution images. Therefore, we need a flexible global classification method.
Firstly, to make the features more prominent, GLCM is used to extract the texture features of the heatmap. After multiplying the heatmap by 100 and rounding, we calculate the GLCM at distance 1 for the four angles 0°, 45°, 90°, and 135°, obtaining a feature matrix of (100 × 100 × 4). Secondly, a PCA model is trained to reduce the features to 256 dimensions. Thirdly, the reduced features are used to train an SVM model, whose best parameters are found by grid search. Finally, we use the trained PCA + SVM to make the global classification. This method can adapt to different sizes of GAN-tampered areas.
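The quantization and dimensionality-reduction steps can be sketched as follows. This is a stand-in, not the trained pipeline: the SVD-based projection plays the role of the fitted PCA model, and the SVM with grid-searched parameters would then be trained on its output.

```python
import numpy as np

def quantize_heatmap(heatmap, levels=100):
    """Multiply heatmap probabilities by `levels` and round, giving the
    integer image (values 0..levels-1) from which the GLCM is computed."""
    return np.clip(np.rint(heatmap * levels), 0, levels - 1).astype(int)

def pca_reduce(features, dim=256):
    """Project flattened GLCM features (n_samples, 100*100*4 = 40000)
    onto the top `dim` principal components via SVD. This stands in for
    the fitted PCA model; an SVM whose parameters are found by grid
    search is then trained on the reduced features."""
    x = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:min(dim, vt.shape[0])].T
```

In practice the PCA and SVM would be fitted once on the global-classification training set and reused at test time.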

Algorithm 1 Generate the GLCM of heatmap
Input: heatmap — the heatmap matrix of size H × W; a, b — two constants determined by the angle and distance.
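The body of Algorithm 1 did not survive extraction; a plausible reconstruction consistent with its inputs is sketched below, where the offset pair (a, b) encodes one angle/distance combination, e.g. (0, 1) for 0° at distance 1 or (-1, 1) for 45°.

```python
import numpy as np

def heatmap_glcm(q, a, b, levels=100):
    """A possible body for Algorithm 1 (reconstruction, not the
    published listing): count co-occurrences in the quantized heatmap
    `q` between each pixel (y, x) and its neighbor (y + a, x + b)."""
    glcm = np.zeros((levels, levels), dtype=np.int64)
    h, w = q.shape
    for y in range(h):
        for x in range(w):
            yy, xx = y + a, x + b
            if 0 <= yy < h and 0 <= xx < w:
                glcm[q[y, x], q[yy, xx]] += 1
    return glcm
```

Running this for the four offsets and stacking the results yields the (100 × 100 × 4) feature matrix described above.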

1) Data set
We use the source code of CT-GAN to train its inject and remove models on the LUNA16 data set [42] and use the trained models to generate 3540 different CT scan samples. Among them, 1776 scans had lung cancer lesions injected (equivalent to a large-diameter lung nodule), and 1764 scans had lung cancer lesions removed. For each fake sample, we select the tampered slice and the two slices before and after it (five CT slices in total), together with the corresponding five slices before tampering. In the end, 35400 CT slice images were obtained. Because the tampering points are CT slices with lung nodules, we randomly selected about half of the real CT slice images (about 8850) and replaced them with slices at random locations. Among the 35400 CT slice images, 1200 are randomly selected as the test set, 4800 as the training set for global classification, 2000 as the validation set for local detection, and the remaining 27400 as the training set for local detection. We mark this test set as CTGAN-ALL and divide it into two parts according to inject or remove tampering: the CT slice images with large nodules injected, together with real large-nodule images, are marked as CTGAN-INJ; the CT images with large nodules removed, together with real small-nodule images, are marked as CTGAN-REM.
In addition, eight CT scans outside the above data sets were retained. Two of them are real lung CT scans, one with malignant lung cancer lesions and one without, marked as MAL and BEN. One, two, and three large nodules were injected into three scans, marked as INJ1, INJ2, and INJ3, respectively. Similarly, one, two, and three large nodules were removed from the remaining three scans, marked as REM1, REM2, and REM3. Furthermore, we added impulse noise and Gaussian noise to 5000 CT slice images and then used CycleGAN to reduce the noise, obtaining 5000 images modified by CycleGAN. These 10,000 slice images, the images without noise and the images denoised by CycleGAN, are marked as the data set CycleGAN. We mark the images denoised by CycleGAN as the positive class and the images without noise as the negative class. Among them, 8000 images are used as the training set, 1000 as the validation set, and 1000 as the test set.
For each slice image in the test set, we use a (32 × 32) window to traverse the whole CT image (of size (512 × 512)) with a stride of 4 pixels, using Eq. (4) and (6). For each slice image of the training and validation sets, we use one-pixel shifts for data augmentation, so that each slice image generates 25 sub-images. For a fake image (positive class), we mark the coordinates of the injection center point as (0,0) and take the 25 coordinate points in the rectangle from (-2,-2) to (2,2); using these points as centers, we cut out 25 sub-images of size (32 × 32). For a real image (negative class), we take 10 coordinate points in the rectangle from (-2,-2) to (-1,2) and randomly select 20 different coordinates from those calculated by Eq. (4) and (6); taking these points as centers, we cut out 25 sub-images of size (32 × 32). By adding n negative samples corresponding to each positive sample, the model can better learn the difference between images before and after tampering. We found that the model performs better when n = 10.
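The one-pixel-shift augmentation for positive samples can be sketched as follows; the helper name and the (cy, cx) coordinate convention are ours, not the paper's.

```python
import numpy as np

def augment_crops(slice_img, center, size=32):
    """Cut 25 sub-images whose centers are shifted by -2..2 pixels in
    y and x around `center` = (cy, cx), as in the augmentation for
    positive (tampered) samples."""
    cy, cx = center
    half = size // 2
    crops = []
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            y, x = cy + dy, cx + dx
            crops.append(slice_img[y - half:y + half, x - half:x + half])
    return crops
```

For negative samples, a subset of these shifted centers would be mixed with centers drawn from the regular grid of Eq. (4) and (6).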

2) Setup
All experiments were implemented with the TensorFlow 1.13 framework and trained on a single NVIDIA RTX 2080 Ti GPU. The training parameters are as follows. We set the initial learning rate to 0.0005 and use exponential decay, decaying every 600 steps with a decay rate of 0.85. The mini-batch size is 56, the batch normalization decay parameter is 0.95, and the L2 regularization weight decay parameter is 0.0001. We use the Adam optimizer to minimize the cross-entropy loss. Except for the learning rate, the default parameters of the Adam optimizer are used, namely β₁ = 0.9, β₂ = 0.999, ε = 1 × 10⁻⁸. Early stopping is triggered when the accuracy on the validation set does not increase for three consecutive epochs; if it is not triggered, training stops after 30 epochs.
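The schedule and stopping rule can be expressed framework-independently as below. Whether the original used staircase or continuous exponential decay is an assumption; the staircase form is shown.

```python
def learning_rate(step, base_lr=5e-4, decay_rate=0.85, decay_steps=600):
    """Staircase exponential decay matching the setup above: the rate
    is multiplied by 0.85 once every 600 steps."""
    return base_lr * decay_rate ** (step // decay_steps)

def should_stop(val_accuracies, patience=3):
    """Early stopping: stop when validation accuracy has not improved
    for `patience` consecutive epochs."""
    if len(val_accuracies) <= patience:
        return False
    best_before = max(val_accuracies[:-patience])
    return all(a <= best_before for a in val_accuracies[-patience:])
```

In TensorFlow 1.x these correspond to `tf.train.exponential_decay` with `staircase=True` and a manual validation-accuracy check at the end of each epoch.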

3) Evaluation
We regard a tampered slice image as a positive example and a real slice image as a negative example. Because the numbers of positive and negative samples in real scenarios may differ, in addition to accuracy (ACC) we also use precision (P), recall (R), and F1-score (F1) to evaluate the model's performance. The tampering operation of CT-GAN is aimed at 3D medical images: a single tampering operation can easily involve more than 30 slices, and even more if the slice interval is small. If the same area is predicted as tampered in 10 consecutive slices, we can easily judge that this position has been tampered with; however, if the above indicators were calculated per 2D slice image, their values would be unreasonably low. Therefore, when detecting a complete CT scan, this paper also takes the 3D tampered area (a series of slice images) as the unit and counts the indicators in the following way.
For a tampered area, when 9 or more of the 10 consecutive slices that include the tampered central slice (these slices are all positive examples in this experiment) are judged as positive, we consider the tampering trace accurately found and count a true positive; otherwise, the area is regarded as a missed detection and counted as a false negative. Similarly, for a real area, if 9 or more consecutive real slices are judged as positive, or 9 of 10 consecutive real slices are judged as positive, the area is counted as a false positive. Precision, recall, and F1-score are then calculated from these counts.
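The 3D true-positive rule above can be sketched as a small predicate over per-slice predictions. `area_is_detected` is an illustrative name; `preds` is assumed to be a 0/1 list indexed by slice, with `center` the index of the tampered central slice:

```python
def area_is_detected(preds, center, window=10, min_hits=9):
    """A tampered 3D area counts as a true positive when at least min_hits
    of some run of `window` consecutive slice predictions containing the
    tampered central slice are positive; otherwise it is a false negative."""
    n = len(preds)
    lo = max(0, center - window + 1)          # earliest window start containing center
    hi = min(center, n - window)              # latest start that stays in bounds
    for start in range(lo, hi + 1):
        if sum(preds[start:start + window]) >= min_hits:
            return True
    return False
```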

Ablation Study
In order to verify the effectiveness of each module in our method, we conducted ablation studies. Four experiments verify the effectiveness of local detection, the attention mechanism, the SELU activation function, and GLCM feature extraction; in each experiment, we remove one module from our method. When local detection is ablated, the complete slice image (512 × 512) is fed to our network directly for training and prediction. The results are shown in Table 1. They show that using a sliding window to divide sub-images for local detection is very helpful for detecting GAN-based small region forgery attacks like CT-GAN and significantly improves detection performance. The SELU activation function and the attention mechanism are slightly helpful, while GLCM significantly improves performance. When all of these modules are used together, the improvement is largest.

Detection of CT-GAN inject or remove attack
For a general GAN-based small region forgery attack, the attacker may not use the same GAN structure to train two different models as CT-GAN does. Therefore, we divided the training set into two parts, following the same split as the test sets CTGAN-INJ and CTGAN-REM, trained a detector model for each, and then tested the inject and remove models of CT-GAN separately. The results are shown in Table 2. Because other feature extraction methods for detecting GAN-generated images are not suitable for CT-GAN, we use XceptionNet and ResNet50, two state-of-the-art DCNNs, as baselines. Existing studies have shown that XceptionNet performs excellently in GAN forged image detection, with accuracy comparable to the most advanced detection methods. In addition, because both XceptionNet and our local detection network take depthwise separable convolution as the main structure, we also select ResNet50 as another baseline. Finally, inspired by [10], we also tried using the DCT of the sub-image as a feature (Ours-DCT).
This experiment has two settings. In one, the network is used as a local detector with the sliding window and predicts (32 × 32) sub-images; in the other, it is trained and tested in the usual way, with the whole (512 × 512) slice image as input. The results are shown in Table 3. They show that, on the current data set, even state-of-the-art DCNNs such as XceptionNet and ResNet50 reach only about 65% test accuracy and F1-score, meaning they can hardly distinguish CT-GAN tampered images from real images. Their performance improves when used as local detectors but remains unsatisfactory, which may be due to overfitting and network degradation. Our method overfits severely without the sliding window; with it, the model converges faster, and accuracy and F1-score rise to 93%, an improvement of 28 percentage points. When DCT features are used instead, all indicators decline, with accuracy and F1-score of only about 86%.

2) Detect CT scans
In order to test the performance of our method more comprehensively, we compared it with the latest methods on complete CT scans. The model is still trained under the mixed condition of injecting and removing, and the test set is the eight scans mentioned above. Fig. 8 shows several consecutive CT slice images near the tampering center point and the corresponding heatmaps. Table 4 shows the results of this experiment.
The experimental results show that our model can effectively find CT-GAN tampering traces and is more stable than the other methods. Our method can automatically decide whether a scan has been tampered with through a simple strategy: for example, when any n of m consecutive slice images are classified as positive, the scan is considered to have been tampered with by CT-GAN. Therefore, even though a CT scan is three-dimensional while our model is two-dimensional, our model can effectively assist in distinguishing whether a CT scan has been tampered with.
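The n-of-m strategy mentioned above is a one-liner over the sequence of slice-level decisions. The concrete values m = 10, n = 9 below are illustrative, chosen to match the evaluation rule used elsewhere in the paper, not prescribed by it:

```python
def scan_is_tampered(slice_preds, m=10, n=9):
    """Flag a whole scan as tampered when any m consecutive slice-level
    predictions (0/1) contain at least n positives."""
    return any(sum(slice_preds[i:i + m]) >= n
               for i in range(len(slice_preds) - m + 1))
```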
In addition, for CT scans with smaller slice spacing, such as INJ1, INJ3, REM1, and REM2, our method detects more consecutive positive samples (more than 15); when the slice spacing is larger, fewer consecutive positive samples are detected.
Furthermore, many misjudgments occur in places unrelated to lung nodules, such as folds of clothes and calcified muscle tissue, which doctors can easily identify.
1) The correlation of sub-images
In our method, we detect each sub-image independently, without considering the correlation between sub-images. Although the current experimental results meet the requirements of the detection task, we believe that introducing this correlation into the detection task would probably help improve detection accuracy, efficiency, or generalization. Hence, in the future, we plan to conduct more extensive experiments to find the correlation between these sub-images, for example, whether there is hidden information shared by adjacent sub-images in the tampered region and how this hidden information could help improve performance.
2) Efficiency
Since our method divides a medical image into multiple sub-images through a sliding window, the increased number of targets reduces detection efficiency. Thanks to the high processing speed for a single sub-image, the overall speed for a complete lung scan is acceptable. As mentioned before, the main reason for using the sliding window is that the tampered region occupies only a small fraction of a normal image, which causes existing detection methods that treat the target image as a whole to fail. Hence, the proposed method is the better choice from the detection point of view, as it is currently the only detection method for the CT-GAN attack.
When a forgery attack is applied to the whole image, our method can still detect the tampered region with high accuracy, but its efficiency cannot catch up with methods that treat the target as a single unit, such as [7]. Prior knowledge of the size of the tampered region would therefore help in selecting the proper processing method.
3) Generalization
The two most commonly used GAN structures in the medical image field are pix2pix and CycleGAN. CT-GAN uses the pix2pix structure, while works such as [28-33] are based on CycleGAN. Therefore, we chose CycleGAN to construct another data set to examine the performance of our method; this CycleGAN data set is described in Section 4.1. Medical images denoised by CycleGAN can be regarded as entirely generated by CycleGAN. The test results show that our method classifies CycleGAN tampered medical images and real medical images with 99.8% accuracy. In addition, if the global classification used a fixed pattern rather than machine learning, it would be challenging to handle both GAN-based small region forgery and images generated wholly by a GAN.
CycleGAN and pix2pix are the two most commonly used GAN structures in medical image synthesis, and our method can effectively detect images generated by both. However, although our method can detect images generated by the same or a similar GAN, its detection performance on GAN models not seen in the training set is worse. Many studies [3, 6, 7, 9-11, 43] aim to improve the generalization ability of GAN detection methods, which may be achieved by studying the common defects of CNNs or GANs. For example, Wang et al. [43] tested multiple recent image generation models and found that CNN-generated images today share certain common defects, and Chai et al. [3] summarize which parts of GAN-generated face images are most likely to give them away. However, these studies did not take GAN-based small region forgery attacks into consideration; how to combine them with such attacks remains an open problem that we plan to study in the future.
We propose a new method to detect GAN-based small region forgery attacks in medical images. Experiments show that our method can detect CT-GAN tampering traces more accurately than general methods on the same data set. GAN-based small region forgery attacks, especially those targeting medical images, are challenging for existing models that take whole images as input. However, by using a sliding window to train and test a light neural network on sub-images and then making the global classification with PCA and SVM, our method can still distinguish with high accuracy whether a medical image has suffered a GAN-based small region forgery attack.
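The global classification stage rests on GLCM statistics of the heatmap. The following numpy-only sketch shows one way to compute a horizontal-neighbour GLCM and three classic Haralick-style features; it assumes heatmap values lie in [0, 1), and the quantisation level and feature choice are illustrative, not the paper's exact configuration. In practice these feature vectors would then be reduced with PCA and classified with an RBF-kernel SVM (the "hyperplane in an infinite-dimensional space"), e.g. via scikit-learn.

```python
import numpy as np

def glcm_features(heatmap, levels=8):
    """Horizontal-neighbour gray-level co-occurrence matrix of a [0,1)
    heatmap, followed by contrast, energy, and homogeneity statistics."""
    # Quantise the heatmap into `levels` gray levels.
    q = np.floor(heatmap * levels).clip(0, levels - 1).astype(int)
    # Count co-occurrences of horizontally adjacent levels.
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1)
    glcm /= glcm.sum()
    i, j = np.indices((levels, levels))
    contrast = (glcm * (i - j) ** 2).sum()
    energy = (glcm ** 2).sum()
    homogeneity = (glcm / (1.0 + np.abs(i - j))).sum()
    return np.array([contrast, energy, homogeneity])
```

A uniform heatmap, for instance, yields zero contrast and maximal energy and homogeneity, while a heatmap with a bright tampering spot produces off-diagonal co-occurrences that raise contrast.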

Figure 2 :
Figure 2: CT-GAN tampered samples. The first row shows the removal tampering of CT-GAN: a lung nodule is removed from the CT slice image. The second row shows the injection tampering of CT-GAN: a small nodule is tampered into a large nodule.

Figure 3 :
Figure 3: CT scan schematic. (a) is a cartoon schematic of the CT scanning operation. (b) is a part of a CT scan. A complete CT scan may include about 300 slice images; only a portion is shown here.

Figure 4 :
Figure 4: (a) is a tampered CT image; the nodule in the red box was injected by CT-GAN. (b) is the heatmap corresponding to (a), in which the bright red spot corresponds to the injected nodule. Because the preset sliding window stride is greater than 1, the heatmap is smaller than the original CT slice image. To aid observation, we enlarged the heatmap, superimposed it on the CT image, and adjusted their colors.

Figure 6 :
Figure 6: The attention module of our network. The numbers of neurons in the three FC layers of channel attention are C, C/4, and C, where C is the number of channels. The convolution stride is 1 with SAME padding, and the activation function is Sigmoid.

4.4 Compare to state-of-the-art methods
1) Detect CT slices

Figure 8 :
Figure 8: A part of the CT slice images and corresponding heatmaps of two scans. (a) scan INJ1. (b) scan REM1. The first column of each scan is the tampered CT slice image; the second column shows the heatmaps when XceptionNet is the local detector; the third column shows the heatmaps when ResNet50 is the local detector; the fourth column shows the heatmaps of our method.

Table 1 :
The Ablation Study Results of Our Method. "-SW" means sliding windows are not used and the whole image is classified directly, without local detection. "-Attention" means the attention mechanism is not used. "-Selu" means ReLU is used instead of SELU in our network. "-GLCM" means PCA and SVM are applied directly to the heatmap, without GLCM feature extraction.

Table 2 :
The Detection Results of CT-GAN Inject and Remove Attacks. Training and testing of the two were carried out separately. Although the training data is reduced, our model can still detect CT-GAN's inject or remove model with a high F1-score: accuracy and F1-score are about 90% for the inject tampering model and about 97% for the remove tampering model. These results show that our method remains effective against a single tampering model and is more sensitive to the traces of remove tampering.

Table 3 :
The Detection Results of CT-GAN with the State-of-the-art Methods and Ours. "-W" means whole slice image input; "-DCT" means the local detection network is trained with DCT features extracted from the sub-image.

Table 4 :
The Detection Results of Complete CT Scans. "Spacing" means the spacing between two adjacent slice images. "2D" means the indicators are calculated per 2D slice image; "3D" means they are calculated per 3D tampering area. The "2D" and "3D" settings differ only in evaluation, not in detection.