Abstract
The malicious use of deepfake videos seriously threatens information security and causes great harm to society. Deepfake videos are currently generated mainly by deep learning methods and are difficult to recognize with the naked eye, so studying accurate and efficient deepfake video detection techniques is of great significance. Most existing detection methods analyze discriminative information in a single feature domain, from either a local or a global perspective; such single-feature methods have clear limitations in practical applications. In this paper, we propose a deepfake detection method that comprehensively analyzes forged face features: it integrates features from the spatial domain, noise domain, and frequency domain, and uses the Inception Transformer to dynamically learn a mix of global and local information. We evaluate the proposed method on the DFDC, Celeb-DF, and FaceForensics++ benchmark datasets. Extensive experiments verify its effectiveness and good generalization. Despite its small number of parameters and the absence of pre-training, distillation, or ensembling, the proposed method achieves performance competitive with the best existing models. Ablation experiments evaluate the role of each component.
Citation: Ding Y, Bu F, Zhai H, Hou Z, Wang Y (2024) Multi-feature fusion based face forgery detection with local and global characteristics. PLoS ONE 19(10): e0311720. https://doi.org/10.1371/journal.pone.0311720
Editor: Jiachen Yang, Tianjin University, CHINA
Received: June 16, 2024; Accepted: September 23, 2024; Published: October 10, 2024
Copyright: © 2024 Ding et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this paper are all from public datasets: https://www.kaggle.com/c/deepfake-detection-challenge/data; https://github.com/yuezunli/celeb-deepfakeforensics; https://github.com/ondyari/FaceForensics/blob/master/dataset.
Funding: This work was partially supported by the Double First-Class Innovation Research Project for People’s Public Security University of China (No.2023SYL08). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
A deepfake is a ’fake’ image or video generated by ’deep’ learning algorithms. Because deepfake videos are simple to produce, cheap, and quick to spread, both their number and the attention they receive have grown rapidly: it is reported that the number of newly released deepfake videos on mainstream websites and social media platforms worldwide increased more than tenfold from 2017 to 2021 [1]. With the rapid development of deepfake technologies, which provide tools for those who maliciously spread false information, deepfake images and videos are becoming increasingly realistic, making it difficult for the human eye to distinguish real from fake. By blurring the boundary between real and fake, deepfakes can severely undermine social, media, and political trust [2]. Therefore, deepfake detection is of great practical significance.
Detection of deepfake videos is usually treated as a binary classification problem on video frames: frames are extracted from the video under examination, features in the frame images are analyzed to classify each image as real or fake, and the authenticity of the video is inferred from these results. Traditional methods rely on hand-crafted features, which are ineffective against videos generated by deep forgery techniques. Most current detection therefore adopts deep learning: a network architecture is designed and trained through supervised learning to identify forged features and judge video authenticity. CNN-based architectures automatically extract local image features, with each convolution kernel processing the image region within its receptive field. This works well for detecting local inconsistencies in forged images, such as artifacts introduced by tampering operations and inconsistencies along the edges between tampered and untouched regions.
We find that deepfake video detection still faces non-negligible problems in practical applications. First, forgery detection algorithms fail to keep up with forgery techniques. As forgery technology advances, the quality and fidelity of forged content keep improving, rendering existing detection algorithms ineffective. For example, in face forgery videos generated by GANs, artifacts and jitter in local regions have gradually diminished, so detection methods that rely only on mining local anomalies are no longer effective. More forgery cues now take the form of inconsistencies between local regions; for instance, the lighting of synthesized eye and mouth regions may not match the lighting of the surrounding face. To address this, some studies adopted the Vision Transformer (ViT) [3] to analyze deep forgery images from a global perspective by modeling relationships between distant image patches. However, simply porting ViT to the deepfake detection task yields unsatisfactory performance. Moreover, the Transformer’s large model size and complex computational structure demand considerable computational resources and storage, limiting its application in realistic scenarios. Second, most existing forgery detection algorithms generalize poorly: their performance drops dramatically when confronted with unknown forgery methods in practice. Different forgeries tend to leave specific traces or anomaly patterns in different domains; for example, some forgeries may be inconspicuous in the RGB domain yet show significant discriminative properties in the frequency domain or other domains. Extracting a single form of image feature is therefore insufficient to cope with ever-changing forgery methods, which is an important reason for the relatively poor generalization of current algorithms.
To solve these problems, we propose a simple but effective network architecture that combines the advantages of CNN and Transformer networks, extracting both global and local information from the video to judge its authenticity. Specifically, the video is first converted into frames and the face regions are cropped; the face images are then fed into the Multi-Feature Extraction module to obtain spatial-domain, noise, and frequency-domain features. These three kinds of features are fused and fed into the Inception Transformer module [4] to perform real/fake classification. Extensive experiments on commonly used deepfake detection datasets confirm the effectiveness of the proposed method both within and across datasets. The method achieves competitive results compared with existing baseline models. Notably, we use no pre-trained models and no techniques such as distillation, and the number of parameters in our model is much lower than that of existing state-of-the-art models.
Our contributions are as follows. 1) We apply the Inception Transformer to deepfake video detection and verify its effectiveness on several benchmark datasets, suggesting that the joint use of global and local information is well suited to forged-video detection. 2) We design a Multi-Feature Extraction (MFE) module that fuses texture, noise, and frequency features while suppressing semantic information, making the model more sensitive to forgery traces. Experiments show that the MFE module improves deepfake detection performance. 3) We propose a lightweight model, named Mixformer, consisting of the MFE module followed by the Inception Transformer for forged-face detection. The model has less than one-fifth the parameters of the state-of-the-art model, yet its performance remains competitive without pre-training, distillation, or ensembling, indicating that our model is efficient and practical.
The rest of the paper is structured as follows: Section 2 reviews related work on deepfake generation and detection. Section 3 describes the proposed Mixformer model in detail. Section 4 presents the experimental procedures and results. Finally, Section 5 concludes the paper.
2 Related work
2.1 Deepfake faces generation
The term deepfake originated with the user ’deepfakes’, who posted fake videos on the Reddit social network in 2017. With the development of deep learning, there are currently three main approaches to generating deepfake faces: Variational Auto-Encoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models (DM).
An autoencoder consists of an encoder and a decoder; the generated face retains the identity of the source face while carrying the expression of the target face. However, the realism of faces generated by autoencoder networks is limited. GAN-based deepfake face generation results from the adversarial competition between a generator and a discriminator. ProGAN [5] trains the generator and discriminator step by step, progressively deepening the network for face synthesis. The StyleGAN family [6,7] redesigns the generator architecture to control the style of the synthesized face. FSGAN [8] can swap faces in video in real time. FaceShifter [9] encodes the attributes of the target face at multiple levels, uses the generator to adaptively embed identity and attributes for face swapping, and restores occluded regions through a self-supervised approach to generate high-fidelity faces. Diffusion models [10] first gradually add noise to the data through a forward process, then predict the noise added at each step through a reverse process and remove it to progressively restore a noise-free image. Common models include DALL-E 2 [11] and Stable Diffusion [12]. In practice, the face-swapping function of the Roop software has added a Stable Diffusion plug-in.
2.2 Deepfake detection
The increasing number of high-quality deepfake videos poses risks to social security, and many detection methods have been proposed in response. Afchar et al. [13] proposed MesoNet to automatically and efficiently detect fake faces in videos using meso-level image features. Face X-ray [14] detects forged face images via the blending boundary introduced when a forged face is embedded into a real image; the algorithm remains effective even when the tampering technique is unknown. Zhao et al. [15] regard the task as a fine-grained classification problem and propose a multiple spatial attention mechanism to aggregate low-level textural features and high-level semantic features. Liu et al. [16] proposed a residual-learning framework to learn robust, discriminative residual feature maps for detecting forged faces. Wodajo et al. [17] applied a convolutional Vision Transformer to deepfake video detection: a CNN extracts learnable features, which are fed into a ViT that classifies them with an attention mechanism. Because the ViT processes image features rather than raw images, the input dimensionality is drastically reduced and training speed improves. The method was trained on the DFDC dataset and achieved competitive results, but it generalized poorly to the FaceForensics++ FaceShifter subset. A two-branch method [18] feeds images into two EfficientNet-B0 feature extractors separately, passes the resulting features into Vision Transformer encoders at different scales, and fuses the encoder outputs with cross-attention. Trained on the DFDC and FaceForensics++ datasets and tested on DFDC, it achieved state-of-the-art AUC values, but the architecture has many parameters, requires a large amount of training data, and has a high computational cost.
After propagating through various online channels, videos and images are usually compressed several times, making forgery artifacts harder to recognize. Therefore, unlike spatial-domain detection methods [13–18], researchers have sought detection clues from other perspectives such as the frequency domain. For example, Qian et al. [19] found that the detailed artifacts left by forgery methods can be mined effectively in the frequency domain, and their method maintains excellent detection performance even on highly compressed forged images. Wang et al. [20] explore spatial-temporal characteristics in the frequency domain for VIS and NIR scenarios. Furthermore, to fully exploit the rich information in video sequences and images, detection methods using multiple features have been proposed. Liu et al. [21] combine the spatial image with the phase spectrum to capture the up-sampling artifacts of face forgery and improve transferability. Peng et al. [22] use gaze features together with texture and attribute features of video sequences to enhance the representation of spatial-temporal feature differences between real and forged faces.
3 Methods
3.1 Overview
In this section, we first state the motivation for our method and then provide a brief overview of it. As mentioned above, ViT networks have achieved good performance in deepfake detection tasks. ViT-based methods exploit long-distance relations among image patches to capture global information and discover forgery clues. In fact, forgery traces may be global or local. Over-propagation of global information strengthens the low-frequency representation, degrades high-frequency components such as local texture, and weakens the modeling ability of ViT [23]. To address this, a forgery detection network should learn both global and local image information. The Inception Transformer proposed in [4] fulfills this requirement exactly: by incorporating the strength of CNNs in capturing high-frequency information into ViT, it effectively learns comprehensive features at both global and local scales. We therefore choose the Inception Transformer as the backbone network for the deepfake detection task.
Meanwhile, we observe that forgery traces are subtle and vary from case to case, so analyzing them with multiple types of features is a more comprehensive approach. For example, Zhou et al. [24] propose a two-stream network that utilizes both an RGB stream and a noise stream for image tampering detection. Lin et al. [25] propose a novel network to learn and enhance multiple tampering traces, including noise distribution and RGB visual artifacts. Liu et al. [21] combine the spatial image with the phase spectrum.
Motivated by these observations, we propose a novel deepfake detection model, named Mixformer, that exploits forged features of face images in different representation domains. Mixformer consists of two key components: 1) a Multi-Feature Extraction (MFE) module, which integrates features from the spatial, noise, and frequency domains and enables the model to learn forgery traces better; and 2) an Inception Transformer module, which dynamically learns a mix of global and local information. The overall architecture is shown in Fig 1. The two components are described in detail below.
3.2 Multi-feature extraction module
Convolutional networks extract semantic image features, whereas the evidence for forgery identification is subtle and often unrelated to semantics, such as blending boundaries and inconsistencies between forged and genuine regions. For the image forgery detection task, it is therefore not enough to feed the image into a plain convolutional neural network and analyze the extracted features. In this paper, a hybrid feature extraction module is designed to enhance the network’s ability to identify forgery traces by integrating the conventional (spatial), noise, and frequency-domain features of the image.
First, since images from the same source or device share the same noise pattern, noise features can be regarded as an inherent fingerprint of an image. Forgery operations destroy the consistency of the original image and thus usually leave distinctive traces in the noise space [26,27]. Inspired by this, we use the Steganalysis Rich Model (SRM) to extract the high-frequency noise of an image and use it as one of the discriminative features for forgery traces. We use a fixed SRM filter layer [24,28] with the 3 SRM kernels shown in Fig 2. The three channels of the RGB image are each passed through this SRM layer to obtain noise features that reflect the incongruence between genuine and tampered regions.
Fixed SRM filters are limited because they are manually designed. Therefore, to adaptively learn features from the image noise space and further suppress the influence of image content on the tampering traces, we add the constrained convolutional layer BayarConv2D [29], with a kernel size of 5 and 3 filters.
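To make the two noise extractors concrete, the following minimal NumPy sketch applies one fixed SRM kernel to a single channel and applies the Bayar constraint to a convolution kernel. The specific SRM kernel shown and the helper names (`conv2d_same`, `bayar_constrain`) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# One commonly used fixed SRM kernel (second-order residual); the paper's
# layer uses 3 such kernels, one set per RGB channel (assumed here).
SRM_2ND = np.array([[-1,  2, -1],
                    [ 2, -4,  2],
                    [-1,  2, -1]], dtype=np.float32) / 4.0

def conv2d_same(img, kernel):
    """Naive 'same' cross-correlation (as in CNN conv layers), zero padding,
    for a single channel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=np.float32)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def bayar_constrain(kernel):
    """Bayar constraint [29]: centre weight fixed to -1, remaining weights
    normalised to sum to 1, so the layer predicts (and subtracts) the centre
    pixel from its neighbourhood, suppressing image content."""
    k = kernel.copy()
    c = k.shape[0] // 2
    k[c, c] = 0.0
    k = k / k.sum()          # off-centre weights sum to 1
    k[c, c] = -1.0           # centre fixed to -1
    return k
```

Because each SRM kernel sums to zero, smooth (content-only) regions are suppressed and only high-frequency noise residuals survive, which is exactly why these filters highlight tampering traces.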
Second, tampering operations may change the frequency-domain characteristics of an image. Qian et al. [19] adopt frequency-aware decomposition (FAD) to separate the high-frequency, mid-frequency, and low-frequency parts of an image. Inspired by this, we let the network learn the differences between forged and real images in these three parts, further enhancing feature discriminability. Specifically, the image is first transformed to the frequency domain by the DCT, then split into low-frequency, mid-frequency, and high-frequency parts by filtering. The three parts are transformed back to the RGB domain by the inverse DCT and concatenated. Following the original paper [19], the low-frequency sub-band is the first 1/16 of the spectrum, the mid-frequency sub-band spans 1/16 to 1/8, and the high-frequency sub-band is the remaining 7/8.
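As a rough illustration of this decomposition, the NumPy sketch below performs a 2-D DCT, masks three frequency bands, and inverts each band back to the pixel domain. The band geometry (distance from the DC component, normalised along the diagonal) and the function names are simplifying assumptions; the original FAD [19] adds learnable filters on top of the fixed ones.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, so that D @ D.T = I."""
    m = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            m[k, i] = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= np.sqrt(1 / n)
    m[1:] *= np.sqrt(2 / n)
    return m

def fad_split(img, bands=((0, 1 / 16), (1 / 16, 1 / 8), (1 / 8, 1.0))):
    """Split a grayscale image into low/mid/high-frequency components.
    Band edges follow [19]: low = first 1/16 of the spectrum, mid = 1/16
    to 1/8, high = the remaining 7/8."""
    h, w = img.shape
    D_h, D_w = dct_matrix(h), dct_matrix(w)
    F = D_h @ img @ D_w.T                         # 2-D DCT
    # distance from the DC coefficient, normalised to [0, 1)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    r = (yy / h + xx / w) / 2
    parts = []
    for lo, hi in bands:
        mask = (r >= lo) & (r < hi) if hi < 1 else (r >= lo)
        parts.append(D_h.T @ (F * mask) @ D_w)    # inverse DCT of the band
    return parts                                  # [low, mid, high]
```

Because the three masks partition the DCT plane, the band images sum back to the original, so no information is lost; the network simply sees it re-organised by frequency.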
This work starts by extracting frames from the video and cropping them to keep only the face region. Let a face image be denoted X. X is the input to the hybrid feature extraction module and is fed into a classical Conv2D layer, the SRMConv2D layer, the BayarConv2D layer [29], and the FAD module [19], respectively, to obtain the different feature types. These features are then concatenated to obtain the hybrid feature, as expressed in Eq (1):
X_mix = Concat(Conv(X), SRM(X), BayarConv(X), FAD(X)) (1)

where:

X ∈ ℝ^(3×H×W) is an input image tensor with 3 channels (RGB), height H, and width W;

X_mix ∈ ℝ^(C×H×W) is the output hybrid feature;

C is the total number of channels of X_mix, derived from concatenating the features of the Conv, SRM, BayarConv, and FAD operations.
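Eq (1) amounts to a channel-wise concatenation of the four feature groups. A minimal NumPy sketch, with the four extractors passed in as callables (the function name `mfe` and the callable interface are our assumptions):

```python
import numpy as np

def mfe(x, conv, srm, bayar, fad):
    """Eq (1): X_mix = Concat(Conv(X), SRM(X), BayarConv(X), FAD(X)).
    x is a (3, H, W) image; each callable maps (3, H, W) -> (c_i, H, W),
    and the channel counts c_i sum to C."""
    feats = [f(x) for f in (conv, srm, bayar, fad)]
    return np.concatenate(feats, axis=0)
```

In the real model each callable is a learned or fixed layer; the point is only that the four outputs share spatial size H × W so they can be stacked along the channel axis.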
The hybrid feature maps for a real face and a forged face image are shown in Figs 3 and 4, respectively; both faces come from the public Deepfake Detection Challenge dataset. In each figure, the top left shows the RGB image, the remainder of the first two rows shows the convolutional features, the third row shows the noise features (Bayar and SRM features), and the last two rows show the frequency-domain features. As the figures show, the convolutional features mainly capture image content and represent the semantic information of the face well, whereas the Bayar and SRM features mainly capture face noise, and the mid- and high-frequency components reflect image details and edges. Comparing the two feature maps, the blending boundary of the fake face is much more apparent in the noise domain, and the face texture of real and fake faces differs significantly in the mid- and high-frequency domains, which is very difficult to observe in the RGB domain. This demonstrates the necessity of the multi-feature extraction module.
The top left corner displays the RGB image of a real face frame from the public Deepfake Detection dataset; the rest of the first two rows are spatial-domain features, the third row is the noise-domain feature map, and the fourth and fifth rows are the frequency-domain feature maps.
The top left corner displays the RGB image of a forged face frame from the public Deepfake Detection dataset; the rest of the first two rows are spatial-domain features, the third row is the noise-domain feature map, and the fourth and fifth rows are the frequency-domain feature maps.
3.3 Inception transformer module
The authors of [4] suggest that a network for image understanding should capture more high-frequency detail at lower layers and gradually incorporate more low-frequency global information as depth increases, much as humans reach a global understanding of an image by progressively aggregating local information. This idea matches the image forensics task, which expands from scrutinizing subtle local details to global information to judge authenticity comprehensively. Inspired by this, we adopt the Inception Transformer network proposed in that work, feed it the multi-features generated above to learn forgery traces from local to global, and output a binary classification.
The Inception Transformer module consists of four stages, each comprising a patch embedding and iFormer blocks, as shown in Fig 1. The structure of the iFormer block is shown in Fig 5. Like a common Transformer, the iFormer has a feed-forward network (FFN), but it additionally incorporates an Inception Token Mixer (ITM). Layer normalization (LN) is applied before both the ITM and the FFN. For each block, a channel ratio determines the allocation between the high-frequency (HF) and low-frequency (LF) portions, i.e., C_h/C and C_l/C with C_h/C + C_l/C = 1. From lower to higher layers, progressively more channels are assigned to the LF mixer, reducing the channel size of the HF mixer; in the figure, the blue portion (C_h/C) gradually shrinks while the yellow portion (C_l/C) grows from shallow to deep layers. In this way, the iFormer effectively balances the high- and low-frequency components across all layers.
The key component of the iFormer is the Inception Token Mixer, shown in Fig 6. This mixer splits the input features along the channel dimension and feeds the parts into a high-frequency mixer and a low-frequency mixer, respectively. The high-frequency mixer extracts local information through pooling and convolution operations, while the low-frequency mixer is implemented by the self-attention of an ordinary ViT and learns relationships among global information. In this way, the network efficiently captures frequency-specific information on the corresponding channels and can learn features over a wider frequency range than a regular ViT.
Where the orange box indicates the high-frequency mixer part and the blue box indicates the low-frequency mixer part.
To better understand the iFormer block, we can represent it by Eqs (2) and (3):
X_proj = Linear(X_mix) (2)

X_h, X_l = Split(X_proj) (3)

where Split(·) partitions features along the channel dimension. The mixed features X_mix are first linearly projected to obtain the feature map X_proj ∈ ℝ^(N×C). X_proj is then partitioned into X_h ∈ ℝ^(N×C_h) and X_l ∈ ℝ^(N×C_l), where C_h + C_l = C, and X_h and X_l are assigned to the high-frequency mixer and low-frequency mixer of the inception token mixer ITM(·), respectively.
In the high-frequency mixer, the input X_h is split in two along the channel dimension, i.e., X_h = [X_h1, X_h2]. The two parts are fed into two parallel branches: X_h1 passes through a max-pooling layer and a linear layer, while X_h2 passes through a linear layer and a depthwise separable convolution layer. The high-frequency mixer outputs Y_h1 and Y_h2 are thus:

Y_h1 = Linear(MaxPool(X_h1)) (4)

Y_h2 = DwConv(Linear(X_h2)) (5)
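The two high-frequency branches (max pooling followed by a linear layer, and a linear layer followed by a depthwise convolution) can be sketched as follows in NumPy. The linear layers are replaced by identities and the depthwise kernel is a simple averaging kernel, so this is a toy illustration of the data flow, not the trained layer.

```python
import numpy as np

def max_pool3(x):
    """3x3 max pooling, stride 1, 'same' padding, on one (H, W) channel."""
    p = np.pad(x, 1, constant_values=-np.inf)
    H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = p[i:i + 3, j:j + 3].max()
    return out

def dw_conv3(x, k):
    """3x3 depthwise convolution on one channel, stride 1, zero padding."""
    p = np.pad(x, 1)
    H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = (p[i:i + 3, j:j + 3] * k).sum()
    return out

def high_freq_mixer(Xh, k=np.full((3, 3), 1 / 9)):
    """Sketch of the HF mixer: Xh (H, W, C) is split channel-wise into
    Xh1, Xh2; branch 1 applies max pooling (then a linear layer, identity
    here), branch 2 applies a linear layer (identity) then a depthwise
    conv with kernel k."""
    C = Xh.shape[2] // 2
    Xh1, Xh2 = Xh[:, :, :C], Xh[:, :, C:]
    Yh1 = np.stack([max_pool3(Xh1[:, :, c]) for c in range(C)], axis=2)
    Yh2 = np.stack([dw_conv3(Xh2[:, :, c], k)
                    for c in range(Xh2.shape[2])], axis=2)
    return np.concatenate([Yh1, Yh2], axis=2)
```

Both branches preserve spatial resolution, so their outputs can later be concatenated with the low-frequency branch along the channel axis.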
The low-frequency mixer first applies average pooling to reduce the spatial dimension of X_l, then applies common multi-head self-attention, and finally up-samples to restore the original spatial dimension. This design drastically reduces the computational overhead and lets the attention operation focus on global information. The output of the low-frequency mixer is:

Y_l = Upsample(MHSA(AvgPool(X_l))) (6)
Next, the outputs of the high- and low-frequency mixers are concatenated along the channel dimension to obtain Y_c, which passes through a depthwise separable convolution and a cross-channel linear layer to give the output Y of the inception mixer:

Y_c = Concat(Y_h1, Y_h2, Y_l) (7)

Y = Linear(Y_c + DwConv(Y_c)) (8)
The mixer output then passes through layer normalization (LN) and a feed-forward network (FFN). As in the vanilla Transformer encoder, residual connections are used, as shown in Fig 5. Hence the Inception Transformer block is formally defined as:

Y = ITM(LN(X)) + X,  Z = FFN(LN(Y)) + Y (9)

where X denotes the block input and Z the block output.
The proposed Mixformer model consists of four main stages, each a stack of Inception Transformer blocks. After the four stages, the final CLS token is used to produce the binary classification output.
4 Experiments
In this section, we use three publicly available datasets to evaluate the performance of the proposed method on the deepfake detection task. All experiments run on a computer with an NVIDIA GeForce RTX 3080 10 GB GPU. We implement our methods in PyTorch.
4.1 Experimental parameters setting
Table 1 shows the detailed configuration of the proposed model, which uses a four-stage architecture. C_h/C and C_l/C in the table denote the proportions of the high-frequency and low-frequency portions, respectively; from shallow to deep stages, C_h/C decreases gradually while C_l/C increases. In the high-frequency mixer, the kernel size of both the depthwise separable convolution and the max pooling is 3 × 3. The model is trained with a binary cross-entropy loss. Mini-batches of 32 images are normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized face images are augmented before each training pass. An Adam optimizer with a learning rate of 1 × 10⁻⁴ and weight decay of 1 × 10⁻⁷ is used. The model is trained for a total of 50 epochs, with the learning rate reduced by a factor of 0.1 every 15 epochs.
Pool stride denotes the stride of the pooling and upsample layers in the attention branch.
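The preprocessing statistics and the step learning-rate schedule from this configuration can be written compactly (a sketch; `normalize` and `lr_at` are our names, and the schedule simply mirrors the "×0.1 every 15 epochs" rule):

```python
import numpy as np

# ImageNet-style channel statistics from Sec 4.1
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(img):
    """Normalize an (H, W, 3) image with values in [0, 1]."""
    return (img - MEAN) / STD

def lr_at(epoch, base_lr=1e-4, gamma=0.1, step=15):
    """Step LR schedule: start at 1e-4 (written 0.1e-3 in the text) and
    multiply by 0.1 every 15 epochs over the 50-epoch run."""
    return base_lr * gamma ** (epoch // step)
```

In PyTorch the same schedule would typically be expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)`.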
4.2 Datasets
To make the experimental results fair, we use publicly available benchmark datasets for deepfake detection: the Deepfake Detection Challenge (DFDC) [30] and FaceForensics++ (FF++) [31]. The DFDC dataset was released in the deepfake detection challenge hosted by Facebook; it contains 23,564 real videos of 3,426 subjects and 104,500 fake videos generated by eight deepfake techniques, with a total data volume of up to 472 GB. The FF++ dataset contains 1,000 real videos from YouTube and sub-datasets of fake videos generated with five deepfake techniques: Deepfakes, DeepFakeDetection, Face2Face, FaceSwap, and NeuralTextures. The dataset comes in three quality versions: the original without post-processing (FF++RAW), compressed video with a constant rate quantization parameter of 23 (FF++HQ), and compressed video with a constant rate quantization parameter of 40 (FF++LQ). Thus, for each quality version there are 1,000 real videos and 5,000 fake videos; the HQ version is used in this experiment. We also use the Celeb-DF dataset [32], which consists of 590 real celebrity videos from the web and 5,639 fake videos.
Table 2 gives the split sizes of the datasets used in the experiments, i.e., the numbers of videos used for training, validation, and testing. For the DFDC dataset, faces are extracted using the BlazeFace neural face detector [33], MTCNN [34], and the Face Recognition DL library. The face images are stored in JPEG format at a resolution of 224 × 224 with a compression quality of 90%. The training, validation, and test sets contain 172,245, 64,592, and 16,320 face images, respectively, with an almost equal number of real and fake images in every set. We use Albumentations [35] for data augmentation.
The count refers to the number of videos.
4.3 Evaluation metrics
For each video to be detected, 30 facial images are extracted and passed to our trained model, which determines the authenticity of the video from the mean predicted authenticity probability of these images. We evaluate our model using the averages of the accuracy (ACC), AUC, F1 score, and loss value over the test dataset. The AUC [36] is the area under the ROC curve and measures the quality of a binary classification model. The F1 score is computed as follows:
Precision = TP / (TP + FP) (10)

Recall = TP / (TP + FN) (11)

F1 = 2 × Precision × Recall / (Precision + Recall) (12)
where TP (true positives) is the number of forged images predicted as forged, FP (false positives) is the number of real images predicted as forged, and FN (false negatives) is the number of forged images predicted as real. Both AUC and F1 provide a reasonable evaluation of classifier accuracy even under class imbalance.
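With "forged" as the positive class, Eqs (10)-(12) reduce to a few lines (the function name is ours):

```python
def f1_from_counts(tp, fp, fn):
    """Eqs (10)-(12): precision, recall and F1, with 'forged' as the
    positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```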
The loss value indicates how far the model’s predictions are from the target values; we use the logarithmic loss function, as shown in Eq (13).
LogLoss = −(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ] (13)

where n is the number of videos to be detected, ŷ_i denotes the estimated probability that video i is a forgery, and y_i is the label of video i, with a value of 0 for real and 1 for fake.
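Eq (13), combined with the frame-averaging rule above, can be sketched as follows (the clipping constant `eps` is our addition to keep the logarithm finite at probabilities of exactly 0 or 1):

```python
import math

def video_log_loss(y_true, y_prob, eps=1e-15):
    """Eq (13): mean binary log loss over n videos (y=1 means forged).
    Each y_prob entry is the mean forgery probability over the video's
    sampled frames."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```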
4.4 Experimental results and discussion
A. Performance on standard datasets.
In this section, we compare the performance of the proposed method with state-of-the-art models such as CViT [17] and EfficientViT [18]. The results of the comparison models are taken from the original papers or obtained by running their public code.
First, we analyze the test results of each model on the DFDC dataset, shown in Table 3. Here, IncepFormer denotes the model obtained by feeding face images directly into the Inception Transformer network. It achieves 80.06% accuracy, 87.1% AUC, and an 86.52% F1 score, confirming that the Inception Transformer can distinguish real from fake videos fairly accurately. The proposed Mixformer model achieves an AUC of 89.0%, an ACC of 86.73%, and an F1 score of 91.51%, outperforming most other models and demonstrating its effectiveness for deepfake detection. Compared with the state-of-the-art CrossViT model on DFDC, Mixformer’s F1 score is 3.51 percentage points higher, at 91.51%, while using less than one-fifth of the parameters. The proposed model thus combines strong validity and reliability with a lightweight design, giving it broad application prospects and practical value in real situations.
To evaluate the learning ability of the proposed model on different forgery types, we also compared the detection accuracy of each model on the sub-datasets of the FF++ dataset: Deepfakes, Face2Face, FaceSwap, FaceShifter, and NeuralTextures. The results are shown in Table 4. The proposed Mixformer achieves higher detection accuracy than most models, indicating that it learns the forgery methods of the FF++ dataset well; on the FaceShifter sub-dataset in particular, it reaches 98.4% accuracy. Detection accuracy drops on FaceSwap and NeuralTextures because, unlike the other tampering methods, which manipulate every frame of the target sequence, these two methods manipulate only the minimum number of frames required, which lowers the accuracy of frame-based detection.
Validation accuracy (in %).
B. Generalization analysis.
Generalization ability is an important metric in machine learning, which refers to the performance of a model when confronted with new scenarios or data. Generalization ability is especially important in deep forgery detection. As forgery techniques evolve and change, the detection model must be able to adapt to the changes and maintain stable performance on unseen datasets.
To evaluate the generalization ability of the proposed model, we perform cross-dataset evaluation between different datasets. As shown in Table 5, a Mixformer model trained on DFDC and tested on the Deepfakes subset of FF++ achieves an accuracy of 80.2%, suggesting that DFDC shares similar forgery traces with Deepfakes. In contrast, most cross-dataset accuracies between the FF++ subsets are around 50%. This is partly because the amount of training data is too small, which causes the model to overfit, and partly because the forgery methods of these sub-datasets differ considerably.
To further compare the generalization ability of the proposed model with other state-of-the-art models, we evaluate models trained on the DFDC or FF++ dataset by testing them on the Celeb-DF dataset. The results are shown in Table 6. The proposed model achieves the best test performance, with ACC, AUC, and F1 values of 76.71%, 84.1%, and 83%, respectively. This confirms that, when facing unseen tampering methods, the proposed model recognizes tampering traces more accurately and detects them more stably. Meanwhile, the generalization performance of Mixformer improves on the ACC, AUC, and F1 values of IncepFormer by 4.49%, 11%, and 14%, respectively, indicating that the multi-feature extraction module greatly improves the model's generalization. This is because the multi-feature module synthesizes image features across multiple domains, making the extracted features more discriminative. In addition, the model learns both global and local information, which lets it understand tampering behaviors at different levels and further improves its generalization ability.
C. Ablation experiments.
In this section, we use ablation experiments to verify the effectiveness of the multi-feature extraction module and the Inception Transformer module. We quantitatively evaluated Mixformer and its variants: 1) IncepFormer, the Mixformer model without the multi-feature extraction module; 2) IncepFormer + conv2D; 3) IncepFormer + noise domain feature module; and 4) Mixformer. The results are shown in Table 7.
Comparison between different combinations of Mixformer. The results in the table are tested on the DFDC dataset (in %).
The IncepFormer model shows a slight performance degradation with the addition of the conv2D module, indicating that the tampering traces of the DFDC dataset are very hard to detect in the spatial domain alone; this in turn reflects the necessity of introducing diversified features. Comparing Model 2 and Model 3 shows that after adding the noise domain feature module, the AUC value changes little, but the ACC and F1 values both increase by about 1%, indicating that noise domain features help improve deepfake detection ability. After adding the frequency domain feature module to Model 3, the ACC, AUC, and F1 values of the model increase significantly, by 5.43%, 2.1%, and 3.98%, respectively. These consistent improvements validate that the proposed noise-domain and frequency-domain feature modules indeed help in the detection of deepfakes. Noise domain features and frequency domain features are complementary, so their fusion further enhances the performance of the model.
D. Complexity analysis.
To better demonstrate the superiority of the proposed Mixformer, we also report model parameters (Param.) and floating point operations (FLOPs). Recognizing that FLOPs only measure theoretical computational complexity, while actual inference speed in real scenarios is affected by hardware and optimization algorithms, we also provide the actual inference speed of the detection model for reference.
Table 8 reports the comparison of Mixformer with existing detection methods in terms of model parameters, FLOPs, and inference time. Note that the inference times in the table are the average time each model takes to infer on a video in the Celeb-DF test set. The number of parameters of Mixformer is one fifth of that of CrossViT, and although its FLOPs are not the lowest, its actual inference time is the shortest. This demonstrates that the proposed model is more lightweight and suits a broader range of application scenarios.
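The per-video inference times reported in Table 8 amount to simple wall-clock averaging. A framework-agnostic sketch (our illustration, not the paper's benchmarking code; `infer` stands for any model's forward pass on one video):

```python
import time
from statistics import mean

def avg_inference_time(infer, inputs, warmup=2):
    """Average wall-clock seconds per call of `infer` over `inputs`.

    A few warm-up calls absorb one-time costs (model loading,
    GPU kernel compilation) so they do not skew the average.
    """
    for x in inputs[:warmup]:
        infer(x)
    times = []
    for x in inputs:
        start = time.perf_counter()
        infer(x)
        times.append(time.perf_counter() - start)
    return mean(times)

# Example with a stand-in "model": time a trivial function
t = avg_inference_time(lambda x: x * 2, list(range(10)))
```

In practice the timed call would be the full per-video pipeline (face extraction plus classification), since that is what determines deployability.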
5 Conclusions
In this paper, we propose a lightweight Transformer-based model for deepfake detection. The model uses the Inception Transformer as its backbone network because it can analyze both global and local information, which fits the needs of deepfake detection tasks. By fusing spatial, noise, and frequency domain features of the image, the model effectively learns traces of forgery, steadily improving detection performance. On the DFDC and FaceForensic++ datasets, our model uses no pre-training or distillation yet still shows competitive results. On the DFDC dataset, it outperforms the SoTA method when the F1 score is used as the evaluation metric; on the FF++ dataset, its ACC performance is close to that of the SoTA. In cross-dataset evaluation, the proposed method generalizes better than the SoTA, proving its stronger generalization ability. More importantly, the proposed architecture is very lightweight, with less than one fifth of the parameters of the SoTA method. Our study shows that the Inception Transformer can distinguish deepfake videos from real ones, and that fusing features in the spatial, noise, and frequency domains greatly enhances deepfake detection.
Acknowledgments
We would like to thank Fanliang Bu (People's Public Security University of China) for his insightful comments on the manuscript; his guidance and patience have enlightened us not only on this paper but also in our future work.
References
- 1.
Real AI. Top 10 Trends in Deep Synthesis Report(2022). 2022 Feb. https://real-ai.cn/ai-research/legislative-research/deep-synthesis/41.html.
- 2. Chesney B, Citron D. Deep fakes: A looming challenge for privacy, democracy, and national security. Calif Law Rev. 2019;107: 1753–1820.
- 3.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020. http://arxiv.org/abs/2010.11929.
- 4. Si C, Yu W, Zhou P, Zhou Y, Wang X, Yan S. Inception Transformer. Adv Neural Inf Process Syst. 2022;35: 23495–23509. Available: http://arxiv.org/abs/2205.12956.
- 5.
Karras T, Aila T, Laine S, Lehtinen J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. 2017. http://arxiv.org/abs/1710.10196.
- 6.
Karras T, Laine S, Aila T. A Style-Based Generator Architecture for Generative Adversarial Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. https://github.com/NVlabs/stylegan.
- 7.
Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T. Analyzing and Improving the Image Quality of StyleGAN. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. https://github.com/NVlabs/stylegan2.
- 8.
Nirkin Y, Keller Y, Hassner T. FSGAN: Subject Agnostic Face Swapping and Reenactment. In Proceedings of the IEEE International Conference on Computer Vision. 2019; 7184–7193. http://arxiv.org/abs/1908.05932.
- 9.
Li L, Bao J, Yang H, Chen D, Wen F. FaceShifter: Towards High Fidelity And Occlusion Aware Face Swapping. 2019. http://arxiv.org/abs/1912.13457.
- 10. Ho J, Jain A, Abbeel P. Denoising Diffusion Probabilistic Models. Adv Neural Inf Process Syst. 2020;33: 6840–6851. Available: https://github.com/hojonathanho/diffusion.
- 11.
Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M. Hierarchical Text-Conditional Image Generation with CLIP Latents. 2022. http://arxiv.org/abs/2204.06125.
- 12.
Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. pp. 10684–10695. https://github.com/CompVis/latent-diffusion.
- 13. Afchar D, Nozick V, Yamagishi J, Echizen I. MesoNet: a Compact Facial Video Forgery Detection Network. 2018 IEEE international workshop on information forensics and security (WIFS). 2018. pp. 1–7.
- 14.
Li L, Bao J, Zhang T, Yang H, Chen D, Wen F, et al. Face X-ray for More General Face Forgery Detection. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. pp. 5001–5010. https://29a.ch/photo-forensics/#noise-analysis.
- 15.
Zhao H, Zhou W, Chen D, Wei T, Zhang W, Yu N. Multi-attentional Deepfake Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. https://github.com/yoctta/.
- 16. Liu D, Dang Z, Peng C, Zheng Y, Li S, Wang N, et al. FedForgery: Generalized Face Forgery Detection with Residual Federated Learning. IEEE Transactions on Information Forensics and Security. 2023.
- 17.
Wodajo D, Atnafu S. Deepfake Video Detection Using Convolutional Vision Transformer. arXiv preprint arXiv:210211126. 2021. http://arxiv.org/abs/2102.11126.
- 18.
Coccomini D, Messina N, Gennaro C, Falchi F. Combining EfficientNet and Vision Transformers for Video Deepfake Detection. International conference on image analysis and processing. 2022. pp. 219–229.
- 19.
Qian Y, Yin G, Sheng L, Chen Z, Shao J. Thinking in Frequency: Face Forgery Detection by Mining Frequency-aware Clues. European conference on computer vision. 2020. pp. 86–103. http://arxiv.org/abs/2007.09355.
- 20.
Wang Y, Peng C, Liu D, Wang N, Gao X. Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario. IEEE Transactions on Circuits and Systems for Video Technology. 2023. http://arxiv.org/abs/2207.01906.
- 21.
Liu H, Li X, Zhou W, Chen Y, He Y, Xue H, et al. Spatial-Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
- 22. Peng C, Miao Z, Liu D, Wang N, Hu R, Gao X. Where Deepfakes Gaze at? Spatial-Temporal Gaze Inconsistency Analysis for Video Face Forgery Detection. IEEE Transactions on Information Forensics and Security. 2024;19: 4507–4517.
- 23.
Park N, Kim S. How Do Vision Transformers Work? arXiv preprint arXiv:220206709. 2022. http://arxiv.org/abs/2202.06709.
- 24.
Zhou P, Han X, Morariu VI, Davis LS. Learning Rich Features for Image Manipulation Detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018.
- 25. Lin X, Wang S, Deng J, Fu Y, Bai X, Chen X, et al. Image manipulation detection by multiple tampering traces and edge artifact enhancement. Pattern Recognit. 2023;133.
- 26. Fridrich J. Digital image forensics. IEEE Signal Process Mag. 2009;26: 26–37.
- 27. Mahdian B, Saic S. Using noise inconsistencies for blind image forensics. Image Vis Comput. 2009;27: 1497–1503.
- 28.
Rao Y, Ni J. A deep learning approach to detection of splicing and copy-move forgeries in images. 2016 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE; 2016. pp. 1–6.
- 29. Bayar B, Stamm MC. Constrained Convolutional Neural Networks: A New Approach Towards General Purpose Image Manipulation Detection. IEEE Transactions on Information Forensics and Security. 2018;13: 2691–2706.
- 30. Dolhansky B, Bitton J, Pflaum B, Lu J, Howes R, Wang M, et al. The DeepFake Detection Challenge (DFDC) Dataset. 2020.
- 31.
Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M. FaceForensics++: Learning to Detect Manipulated Facial Images. Proceedings of the IEEE/CVF international conference on computer vision. 2019. pp. 1–11.
- 32.
Li Y, Yang X, Sun P, Qi H, Lyu S. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. pp. 3207–3216. https://deepfakedetectionchallenge.ai.
- 33.
Bazarevsky V, Kartynnik Y, Vakunov A, Raveendran K, Grundmann M. BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs. arXiv preprint arXiv:190705047. 2019.
- 34.
Timesler. Pretrained PyTorch face detection (MTCNN) and recognition (InceptionResnet) models. https://github.com/timesler/facenet-pytorch.
- 35. Buslaev A, Iglovikov VI, Khvedchenya E, Parinov A, Druzhinin M, Kalinin AA. Albumentations: Fast and flexible image augmentations. Information (Switzerland). 2020;11.
- 36. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30: 1145–1159.
- 37. Khalifa AH, Zaher NA, Abdallah AS, Fakhr MW. Convolutional Neural Network Based on Diverse Gabor Filters for Deepfake Recognition. IEEE Access. 2022;10: 22678–22686.
- 38. Baek JY, Yoo YS, Bae SH. Generative Adversarial Ensemble Learning for Face Forensics. IEEE Access. 2020;8: 45421–45431.
- 39. Nirkin Y, Wolf L, Keller Y, Hassner T. DeepFake Detection Based on Discrepancies Between Faces and Their Context. IEEE Trans Pattern Anal Mach Intell. 2022;44: 6111–6121. pmid:34185639
- 40.
Zi B, Chang M, Chen J, Ma X, Jiang YG. WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection. MM 2020—Proceedings of the 28th ACM International Conference on Multimedia. Association for Computing Machinery, Inc; 2020. pp. 2382–2390.
- 41. Anas Raza M, Mahmood Malik K, Ul Haq I. HolisticDFD: Infusing spatiotemporal transformer embeddings for deepfake detection. Inf Sci (N Y). 2023;645.