Abstract
Deepfake detection faces increasing challenges owing to the rapid growth of generative models, which has produced massive and diverse Deepfake technologies. Recent advances rely on introducing heuristic features from spatial or frequency domains rather than modeling general forgery features within backbones. To address this issue, we turn to the backbone design with two intuitive priors from spatial and frequency detectors, i.e., learning robust spatial attributes and frequency distributions that are discriminative for real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions for modeling subtle facial differences between real and fake faces. For the frequency components, we propose a Multi-Frequency Aggregator to process different bands of frequency components by adaptively reweighting high-frequency and low-frequency features. Comprehensive experiments on seven popular Deepfake detection benchmarks demonstrate that MkfaNet achieves an AUC of 0.9591 in within-domain evaluations and 0.7963 in cross-domain evaluations, outperforming several state-of-the-art methods while maintaining high computational efficiency. Results confirm that MkfaNet is effective and efficient in detecting forgery, offering enhanced robustness against diverse Deepfake manipulations. Our code is available at https://github.com/GGshawn/MkfaNet.
Citation: Li Z, Tang W, Gao S, Wang Y, Wang S (2026) Multiple contexts and frequencies aggregation network for deepfake detection. PLoS One 21(1): e0337409. https://doi.org/10.1371/journal.pone.0337409
Editor: Feng Ding, Nanchang University, CHINA
Received: October 9, 2024; Accepted: November 6, 2025; Published: January 29, 2026
Copyright: © 2026 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
With the development of generative models, Deepfake technology has made significant progress. Deepfake content encompasses video, audio, and text, and it leverages advanced artificial intelligence techniques such as Variational Autoencoders (VAE) [1], Generative Adversarial Networks (GAN) [2], and Diffusion Models (DM) [3] to achieve unprecedented realism. Unfortunately, these fake visual data can be used for malicious purposes, such as invading personal privacy, spreading misinformation, and undermining people’s trust in digital media [4–6]. Considering that facial deepfakes can potentially have more significant social and ethical implications than synthetic media without facial content, we specifically concentrate on facial deepfake technology in this paper.
To address the potential risks posed by Deepfakes, numerous researchers are working to enhance Deepfake detection technology and strengthen existing detection systems [7–13]. These methods employ various techniques and are generally classified into three types: naive detectors [14,15], spatial detectors [16,17], and frequency detectors [18,19]. Meanwhile, researchers are striving to develop detectors robust to various forms of degradation, such as noise [20–22] and compression [7,23], and, most critically, capable of identifying previously unseen Deepfakes [24,25]. Therefore, enhancing the generalization ability of Deepfake detection models becomes particularly important. Models with strong generalization capabilities can effectively identify and counter new Deepfake attacks that have not appeared in the training data, thereby ensuring the authenticity and security of information [26].
Improving the model’s ability to capture critical facial features is an effective means of enhancing its generalization capability. These key features include, but are not limited to, subtle dynamics of facial expressions, natural gradients of skin tone, and natural eye blinking. By accurately capturing these difficult-to-simulate details, the model can more effectively distinguish between real content and Deepfake-generated content [21]. In recent research, adopting multitask learning [27–32] and/or heuristic fake data generation strategies [28,33] is the mainstream approach to enhancing the generalization capability of Deepfake detection methods. These approaches aim to improve the model’s adaptability and discrimination ability against novel forgery techniques by learning multiple related tasks simultaneously. Meanwhile, heuristic data generation methods create new and unseen fake samples to test and improve the robustness of detection algorithms. However, commonly used architectures for these methods, such as XceptionNet [34] and EfficientNet [35], primarily tend to learn global features while neglecting more local features [36–38]. Consequently, most of these methods fail to effectively model local artifacts, which is crucial for detecting high-quality Deepfake content.
We first focus on the differences between real and forged samples in the frequency domain, and our empirical analysis reveals significant disparities in their frequency distributions, as shown in Fig 1(b). Specifically, real samples exhibit a relatively uniform energy distribution in the spectrogram, indicating balanced texture and edge information across various frequencies. In contrast, forged samples display abnormally concentrated energy peaks in the high-frequency region, highlighting the shortcomings of forgery techniques in handling high-frequency details, which result in unnatural textures and edges in the high-frequency area. To further illustrate this phenomenon, we use a pre-trained ResNet50 model to examine how it processes real and fake face images, with the results shown in Fig 1(c). Notably, ResNet50 exhibits a weaker response in the high-frequency region, indicating its insufficiency in capturing the high-frequency details of forged faces. Additionally, when processing forged face images, ResNet50’s shallow feature maps exhibit higher low-frequency responses, whereas these low-frequency responses are weaker and more uniformly distributed when processing real faces. This indicates that ResNet50 has a stronger reaction to simple features in forged images but lacks sensitivity to high-frequency details. This layered difference in frequency response reveals the underlying mechanisms by which deep networks distinguish between real and fake faces. It provides important insights and motivation for designing models that can more accurately differentiate between genuine and forged faces.
(a): Source image. (b): Data frequency domain analysis. (c): Relative log amplitudes of Fourier-transformed feature maps of ResNet50. (d): Relative log amplitudes of Fourier-transformed feature maps of MkfaNet. (b) reveals the uniformity of the frequency distribution in real faces and the concentration of high-frequency anomalies in forged faces. (c) shows that ResNet50 has a relatively low logarithmic amplitude in the high-frequency region, indicating its insufficiency in capturing high-frequency details. (d) demonstrates that MkfaNet has a higher amplitude in the high-frequency region with broader coverage, highlighting its advantages in handling high-frequency details and identifying forgery features. To protect privacy, the original facial images were anonymized by replacing them with icon representations. Corresponding image identifiers from the CelebDF-v1 dataset are shown.
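As a rough illustration of how such frequency profiles can be obtained, the following sketch computes the relative log amplitude of the 2D Fourier spectrum of a shallow ResNet50 feature map. It assumes PyTorch and torchvision; the exact layers, preprocessing, and plotting used for Fig 1 are not specified in the text, so the function names and the choice of `conv1` are illustrative.

```python
# Minimal sketch (assumptions: PyTorch + torchvision; illustrative names only).
import torch
import torch.fft
from torchvision.models import resnet50

def relative_log_amplitude(feat: torch.Tensor) -> torch.Tensor:
    """feat: (C, H, W) feature map -> 1D profile of log amplitude vs. frequency radius."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    log_amp = torch.log(spec.abs() + 1e-6).mean(dim=0)                # average over channels
    h, w = log_amp.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt().round().long()
    n_bins = int(radius.max()) + 1
    profile = torch.stack([log_amp[radius == r].mean() for r in range(n_bins)])
    return profile - profile[0]                                        # relative to the DC bin

model = resnet50(weights="IMAGENET1K_V1").eval()                       # pre-trained ResNet50
x = torch.randn(1, 3, 224, 224)                                        # stand-in for a face crop
with torch.no_grad():
    shallow_feat = model.conv1(x)                                      # shallow feature map
print(relative_log_amplitude(shallow_feat[0])[:5])
```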
Additionally, we have observed that recent advances rely on introducing heuristic features from either the spatial or frequency domain, rather than establishing a general forgery feature detection model within the backbone network. While this approach improves detection performance to some extent, it still has limitations, especially in addressing the continuously evolving forgery techniques. Therefore, we propose MkfaNet, which integrates more powerful feature capture and analysis capabilities into the backbone network by combining the Multi-Kernel Aggregator (MKA) and Multi-Frequency Aggregator (MFA), significantly enhancing the accuracy and robustness of forgery detection. Specifically, the Multi-Kernel Aggregator (MKA) module combines depth-wise separable convolutions with different dilation rates to effectively expand the model’s receptive field, enhancing its ability to capture features at various scales from the input data. It then adaptively selects features extracted through multiple convolutions based on the spatial context to model the subtle facial differences between real and fake faces; the Multi-Frequency Aggregator (MFA) module optimizes the model’s response to different frequency information by separately processing and fusing the DC (direct current) and HC (high-frequency) components of images. MkfaNet, as a stack of MKA and MFA modules, shows an enhanced ability to discern image details and structural information. In the context of real and fake face recognition, it can accurately distinguish the subtle texture and frequency distortions introduced by forgery techniques, thereby improving the accuracy of fake image detection.
Deepfake techniques such as face swapping and facial expression modification pose unique detection challenges. Face swapping often introduces blending artifacts and identity inconsistencies, making it difficult to detect without spatial-aware feature extraction. Similarly, facial expression modification creates subtle yet unnatural deformations, which are better captured through frequency-based analysis. These challenges highlight the need for a detection framework that can effectively model both spatial and frequency discrepancies, motivating the design of MkfaNet.
Comprehensive experiments on seven popular deepfake detection benchmarks [39] demonstrate that our proposed MkfaNet variants achieve superior performances in both within-domain and across-domain evaluations with impressive efficiency of parameter usage.
This work mainly makes the following contributions:
- In deepfake detection, accurately capturing multi-scale details is crucial because forgery techniques often intervene and modify image features at various scales. The MKA module enhances the model’s receptive field with convolutional kernels of different dilation rates, allowing the model to more effectively capture features from fine textures to larger structures, thus improving the ability to recognize subtle signs of forgery.
- Deepfake technology often manipulates high-frequency details to achieve face swapping or modification, which may result in unnatural features across different frequencies. The MFA module optimizes the model’s response to this critical frequency information by independently processing and integrating the direct current (DC) and high-frequency (HC) components, significantly enhancing the sensitivity and recognition capability for high-frequency detail anomalies.
- Extensive testing of MkfaNet on various mainstream deepfake detection benchmarks demonstrates its effectiveness and efficiency in handling different deepfake techniques. This also shows MkfaNet’s advantage in parameter efficiency, which is extremely important for practical applications.
To provide a structured discussion, the rest of this paper is organized as follows. Sect 2 reviews related work, summarizing existing deepfake detection approaches and their limitations. Sect 3 introduces our proposed MkfaNet, detailing its architecture and the design of the Multi-Kernel Aggregator (MKA) and Multi-Frequency Aggregator (MFA) modules. Sect 4 presents the experimental setup and results, including dataset descriptions, evaluation metrics, and comparative analysis with state-of-the-art methods. Finally, Sect 5 concludes the paper, summarizing key findings and outlining potential future research directions.
2 Related work
Deepfake generation. Deepfake technology primarily involves the artificial modification of facial images and has significantly evolved since its inception. Since 2017, machine learning-based facial manipulation techniques have made substantial advancements, particularly in the areas of facial replacement and facial expression reenactment, which have garnered widespread attention [39]. Ian Goodfellow et al. introduced Generative Adversarial Networks (GANs) [40], a technology that has significantly advanced the development of realistic image synthesis, including facial images [41,42]. GANs consist of two parts: the generator and the discriminator. The generator is responsible for creating images, while the discriminator’s task is to distinguish between these generated images and real data. Variational Autoencoders (VAEs) [43] compress data into a compact form and are used in Deepfake technology to alter facial features, such as expressions and styles. Diffusion models (DMs) [44,45] create images by gradually adding noise and then progressively removing this noise during the generation process. In facial image generation, diffusion models can produce high-quality, high-resolution facial images by finely controlling the noise reduction process. Facial Deepfakes can be broadly categorized into two types: face-swapping and face-reenactment. Face-swapping refers to replacing the facial features in one image with the facial features from another image [46–48]. Face-reenactment technology modifies the original face using image processing techniques to mimic the expressions of another face. Face2Face [49] generates different expressions by tracking facial key points, while NeuralTextures [50] achieves expression transfer using rendered images generated from 3D facial models. These technologies enable more diverse and precise simulation of facial expressions.
Deepfake detection. In Deepfake detection research, methods can be broadly categorized into image-level detectors and video-level detectors. Image-level detectors analyze individual frames to identify fake images by recognizing spatial artifacts. One widely used approach is the Xception model [15], a convolutional neural network (CNN) architecture often combined with attention mechanisms like MAT [37] to enhance feature extraction. Other methods, such as Face X-ray [28], leverage the boundaries between forged faces and backgrounds to detect spatial inconsistencies. More recently, algorithms have focused on detecting blending artifacts [9,51] or learning to separate relevant and irrelevant features during training, aiming to improve generalization. Additionally, feature selection and semi-supervised learning techniques [31,32] have been explored to enhance robustness by improving feature representation and leveraging unlabeled data. Video-level detectors, in contrast, utilize temporal information from multiple frames to enhance deepfake video detection [52]. For example, FTCN [53] directly extracts temporal information using 3D CNNs with a spatial kernel size of 1, while AltFreeze [54] improves generalization by independently training spatial and temporal features. Despite their promising performance, existing deepfake detection methods face several critical challenges:
Over-reliance on heuristic features: Many models rely heavily on manually designed spatial or frequency features, which may not generalize well across different datasets and deepfake techniques.
Limited backbone architectures: The majority of approaches use traditional DNN backbones such as XceptionNet [34] and EfficientNet [35], which inherently extract global features through deep convolutional layers. This architecture design can lead to the loss of critical localized forgery artifacts, thereby reducing detection robustness.
Challenges in detecting high-quality deepfakes: As deepfake technology advances, newer forgery techniques produce more realistic facial textures and seamless blending, making them increasingly difficult to detect with existing methods. Many models struggle with capturing subtle manipulation traces, especially in high-resolution deepfakes.
These challenges highlight the necessity of designing a more effective detection backbone that can adaptively extract discriminative spatial and frequency features while maintaining strong generalization. Our proposed MkfaNet addresses these limitations by incorporating Multi-Kernel Aggregation (MKA) and Multi-Frequency Aggregation (MFA) modules, which improve the ability to capture both spatial and frequency-based artifacts, enhancing the model’s robustness against sophisticated forgery techniques.
3 Method
3.1 Overview of MkfaNet
Built upon modern ConvNets, we design a four-stage MkfaNet architecture, illustrated in Fig 2, which outlines the flow from the input image through hierarchical feature extraction modules to the final classification output. For stage $i$, the input image or feature is first fed into an embedding stem to regulate the resolution and embed it into $C_i$ dimensions. Assuming the input image is of resolution $H \times W$, the features of the four stages are of $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$, and $\frac{H}{32} \times \frac{W}{32}$ resolutions, respectively. Then, the embedded feature flows into $N_i$ Mkfa Blocks, consisting of spatial and channel aggregation blocks, for multi-kernel feature and high-low frequency aggregation.
The model consists of a four-stage hierarchical backbone, where each stage includes an embedding stem, Multi-Kernel Aggregator (MKA), and Multi-Frequency Aggregator (MFA). The pipeline takes an input face image and processes it through the stacked modules to produce a final classification score via a fully connected (FC) layer.
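As a structural reference, a minimal PyTorch skeleton of the four-stage layout described above is sketched below; the stem design, the per-stage block counts, and the MkfaBlock internals are illustrative placeholders (the MKA and MFA modules are sketched in Sects 3.2 and 3.3), not the exact released implementation.

```python
# Minimal skeleton of the four-stage hierarchical backbone (illustrative only).
import torch
import torch.nn as nn

class MkfaBlock(nn.Module):
    """Placeholder for one Mkfa block: spatial (MKA) then channel (MFA) aggregation."""
    def __init__(self, dim: int):
        super().__init__()
        self.mka = nn.Identity()   # stands in for the Multi-Kernel Aggregator
        self.mfa = nn.Identity()   # stands in for the Multi-Frequency Aggregator
    def forward(self, x):
        return self.mfa(self.mka(x))

class MkfaNet(nn.Module):
    def __init__(self, dims=(32, 64, 128, 256), depths=(2, 2, 6, 2), num_classes=2):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            # embedding stem: 4x downsampling at stage 1, 2x afterwards -> H/4, H/8, H/16, H/32
            k = 4 if i == 0 else 2
            stem = nn.Conv2d(in_ch, dim, kernel_size=k, stride=k)
            blocks = nn.Sequential(*[MkfaBlock(dim) for _ in range(depth)])
            self.stages.append(nn.Sequential(stem, blocks))
            in_ch = dim
        self.head = nn.Linear(dims[-1], num_classes)                   # FC classifier
    def forward(self, x):
        for stage in self.stages:
            x = stage(x)
        return self.head(x.mean(dim=(-2, -1)))                         # global average pooling

logits = MkfaNet()(torch.randn(1, 3, 256, 256))                        # -> shape (1, 2)
```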
3.2 Multi-kernel aggregator
Modern generative models are able to create extremely realistic fake human faces that are visually almost indistinguishable from real ones by learning from a vast amount of real facial data and thus simulating features such as lighting, texture, and shape of faces. In this context, traditional single-scale feature extraction methods struggle to detect these fake faces. This is mainly because such methods typically focus on features at a fixed scale, such as coarse patterns of edges or textures, and overlook the subtle changes and complex interactions across multiple scales, which are precisely what generation techniques excel at simulating.
Therefore, to effectively distinguish these high-quality fake images from real faces, a method capable of analyzing and identifying details at multiple levels is required. To this end, we propose the MKA module, which adaptively selects organ features extracted by multiple convolutions for modeling subtle facial differences between real and fake faces. To elucidate the implementation details of the Multi-Kernel Aggregator (MKA) module, illustrated in Fig 3(a), we delve into its architectural design, focusing on how it adaptively aggregates multi-level features to enhance the detection of key facial regions. We represent this process as follows:
(a) Multi-Kernel Aggregation (MKA) block, serving as a spatial mixer that adaptively gates features extracted by multiple dilated depth-wise convolutions. (b) Multi-Frequency Aggregation (MFA) block, serving as a channel mixer, applies frequency-aware feature decomposition through depthwise convolutions and gating mechanisms to enhance forgery-specific artifact extraction. (c) EfficientNet block, which integrates a Squeeze-and-Excitation (SE) module to adaptively recalibrate feature responses. (d) ConvNeXt block, incorporating a Channel Mixer and depthwise convolutions, serves as a baseline for comparing hierarchical feature mixing strategies.
$$Z = \mathrm{MKA}(X) = \mathcal{G}(X) \odot \mathcal{F}_{MK}(X),$$
where $\mathrm{MKA}(\cdot)$ denotes a multi-kernel gated aggregation module comprising the gating branch $\mathcal{G}(\cdot)$ and the multi-kernel feature branch $\mathcal{F}_{MK}(\cdot)$.
Multi-kernel feature extraction. To enable the model to perceive the multi-level features of the face images, we employ three DWConv layers with different dilation ratios ($d = 1, 2, 3$) in parallel to capture low-, middle-, and high-order features: given the input feature $X \in \mathbb{R}^{C \times H \times W}$, the input is factorized into $X_l \in \mathbb{R}^{C_l \times H \times W}$, $X_m \in \mathbb{R}^{C_m \times H \times W}$, and $X_h \in \mathbb{R}^{C_h \times H \times W}$ along the channel dimension, where $C_l + C_m + C_h = C$; afterward, $X_l$, $X_m$, and $X_h$ are assigned to the depth-wise convolution branches with dilation ratios $d = 1$, $d = 2$, and $d = 3$, respectively. Finally, the outputs of the three branches are concatenated to form the multi-kernel feature, so that
$$X_{MK} = \mathrm{Concat}\big(\mathrm{DW}_{d=1}(X_l),\ \mathrm{DW}_{d=2}(X_m),\ \mathrm{DW}_{d=3}(X_h)\big).$$
Gated aggregation. To adaptively aggregate the features extracted by the multi-kernel feature branch, we employ the SiLU activation in the gating branch, i.e., $\mathrm{SiLU}(x) = x \cdot \mathrm{Sigmoid}(x)$, which has been well acknowledged as an advanced version of the Sigmoid activation. SiLU has both the gating effect of Sigmoid and stable training characteristics, leading to the final aggregated features
$$Z = \mathrm{SiLU}\big(\mathrm{Conv}_{1\times1}(X)\big) \odot \mathrm{Conv}_{1\times1}(X_{MK}).$$
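A minimal PyTorch sketch of this spatial aggregation is given below; the dilation rates (1, 2, 3), the channel split, and the SiLU-gated aggregation follow the text, while the kernel sizes, the split ratio, and the 1×1 projections are assumptions rather than the authors’ exact configuration.

```python
# Sketch of the Multi-Kernel Aggregator (MKA); kernel sizes and split ratio are assumed.
import torch
import torch.nn as nn

class MultiKernelAggregator(nn.Module):
    def __init__(self, dim: int, split=(0.5, 0.25, 0.25)):
        super().__init__()
        self.cl = int(dim * split[0])
        self.cm = int(dim * split[1])
        self.ch = dim - self.cl - self.cm                        # C_l + C_m + C_h = C
        # depth-wise convolutions with dilation ratios 1, 2, 3 (padding keeps spatial size)
        self.dw_l = nn.Conv2d(self.cl, self.cl, 5, padding=2, dilation=1, groups=self.cl)
        self.dw_m = nn.Conv2d(self.cm, self.cm, 5, padding=4, dilation=2, groups=self.cm)
        self.dw_h = nn.Conv2d(self.ch, self.ch, 5, padding=6, dilation=3, groups=self.ch)
        self.gate = nn.Conv2d(dim, dim, 1)                       # gating branch projection
        self.value = nn.Conv2d(dim, dim, 1)                      # multi-kernel feature projection
        self.act = nn.SiLU()
    def forward(self, x):
        xl, xm, xh = torch.split(x, [self.cl, self.cm, self.ch], dim=1)
        x_mk = torch.cat([self.dw_l(xl), self.dw_m(xm), self.dw_h(xh)], dim=1)
        return self.act(self.gate(x)) * self.value(x_mk)         # SiLU-gated aggregation

out = MultiKernelAggregator(64)(torch.randn(2, 64, 56, 56))      # -> (2, 64, 56, 56)
```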
3.3 Multi-frequency aggregator
Fig 1(b) shows the frequency domain analysis of the data, revealing significant differences in the distribution of high-frequency information between fake and real faces. Fake faces often appear unnatural in details such as skin texture and edge sharpness, resulting in a noticeably different distribution of features in the high-frequency region. Fig 1(c) illustrates the relative logarithmic amplitude of the Fourier-transformed data, with the color gradient from purple to yellow representing the transition from shallow to deep layers of the model. This gradient reveals how layers of different depths handle frequency information, providing visual evidence of the differences in frequency responses between real and fake faces.
It is evident that the shallow layers (purple) tend to capture high-frequency details related to texture and edges, while the deeper layers (yellow) strongly respond to low-frequency features, which are typically associated with the overall structure and shape of the image. At these levels, real and fake faces exhibit different frequency characteristics. Specifically, in the high-frequency details, fake faces often fail to perfectly replicate the high-frequency features of real faces due to technical limitations, resulting in anomalies or inconsistencies in the high-frequency region. This underscores the importance of addressing both low-frequency and high-frequency features in facial recognition.
We propose an MFA module that processes and reorganizes the direct current (DC) and high-frequency (HC) components of images independently, allowing the model to perform more refined and in-depth analysis at different frequency levels. Specifically, the MFA enhances the analysis of high-frequency details to identify unnatural textures and edges produced by generative models while integrating low-frequency information to maintain an understanding of the overall structure of the image. This approach not only strengthens the model’s ability to detect flaws unique to forgery techniques but also improves its capacity to capture authentic features. As a result, the accuracy and robustness of facial authenticity recognition are significantly enhanced. By comprehensively analyzing features at different frequencies, the MFA helps the model better distinguish and recognize complex real and fake faces, effectively addressing the challenges posed by high-quality forgery techniques. The structure of the MFA module, shown in Fig 3(b), is built around a frequency decomposition operation $\mathrm{FD}(\cdot)$, a scaling technique that operates on feature maps by distinctively mixing information from different frequency bands. Specifically, the input signal is first decomposed into its DC component and high-frequency components; then, two sets of parameters are introduced to re-weight these components for each channel. The two-step processing reads as
$$z_{DC} = \mathrm{GAP}(z), \qquad z_{HC} = z - z_{DC},$$
$$\mathrm{FD}(z) = z + \gamma_{DC} \odot z_{DC} + \gamma_{HC} \odot z_{HC},$$
where the channel-wise scaling factors $\gamma_{DC}$ and $\gamma_{HC}$ are initialized as zeros. To ensure the efficient computation of high- and low-frequency components, we referred to the method of Wang et al. [55], rather than using an explicit Fourier transform. The DC component is calculated by averaging each feature map, while the HC component is obtained by subtracting the DC component from the original features. Specifically, $z_{DC}$ represents the spatial average, and $z_{L}$ represents the channel average.
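The frequency re-weighting can be sketched as follows (PyTorch); following the text, the DC component is the spatial average of each feature map, the HC component is the residual, and the channel-wise scaling factors are initialized to zero, while the residual formulation and the exact placement inside the block are assumptions.

```python
# Sketch of the MFA re-weighting: DC = spatial average, HC = residual, no explicit FFT.
import torch
import torch.nn as nn

class MultiFrequencyAggregator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # channel-wise re-weighting factors, initialized to zero as described in the text
        self.gamma_dc = nn.Parameter(torch.zeros(1, dim, 1, 1))
        self.gamma_hc = nn.Parameter(torch.zeros(1, dim, 1, 1))
    def forward(self, x):
        z_dc = x.mean(dim=(-2, -1), keepdim=True)   # DC: spatial average per channel (GAP)
        z_hc = x - z_dc                             # HC: original features minus DC
        return x + self.gamma_dc * z_dc + self.gamma_hc * z_hc

out = MultiFrequencyAggregator(64)(torch.randn(2, 64, 56, 56))   # -> (2, 64, 56, 56)
```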
3.4 Discussion
To further highlight the unique design of our proposed MkfaNet, we compare it with several representative backbones in terms of their feature extraction strategies and the presence of spatial and frequency aggregation mechanisms, as shown in Table 1. Unlike existing models such as XceptionNet, EfficientNet, and ConvNeXt, which either lack frequency modeling or rely solely on global features, MkfaNet explicitly incorporates both multi-scale spatial feature aggregation and frequency-aware processing. This dual-branch architecture enables MkfaNet to more effectively capture subtle local artifacts and abnormal frequency patterns, thereby enhancing its robustness against high-quality and cross-domain Deepfakes.
3.4.1 Advantages over classical CNN.
Currently, the most commonly used architectures for fake face detection are XceptionNet [34] and EfficientNet [35]. XceptionNet builds its structure using depthwise separable convolution layers (DWConv), optimizing the learning of global features [36,37], which makes the model excel in recognizing overall image structures and patterns. However, for tasks that require detailed analysis of local features, such as fake face detection, this approach may limit the model’s sensitivity to subtle facial expression differences and skin texture patterns. EfficientNet (Fig 3(c)), on the other hand, enhances model efficiency and feature representation capability by balancing network depth, width, and resolution adjustments, combined with depthwise separable convolutions and Squeeze-and-Excitation (SE) blocks. Although SE blocks improve the model’s attention to features, this global information-based recalibration fails to capture the local detail anomalies unique to forged faces.
Both XceptionNet and EfficientNet employ depthwise separable convolutions to process spatial and channel features simultaneously, which limits their ability to perceive specific spatial contexts with tiny differences and channel features at different frequencies. Our proposed MkfaNet’s superiority over them for fake face detection lies in its combination of two core modules that focus respectively on learning spatial contexts and channel features: the former expands the receptive field by using depth-wise separable convolutions with different dilation rates, while the latter excels at detecting subtle anomalies by strengthening the extraction of high-frequency details. In this way, compared with XceptionNet and EfficientNet, MkfaNet can learn richer features, providing stronger identification capabilities and higher reliability when dealing with complex facial data and advanced forgery techniques.
3.4.2 Advantages over modern DNN.
By employing block-based designs combined with hierarchical and isotropic stages, modern DNNs can effectively handle large-scale and complex datasets, capture long-range dependencies, and perform multi-scale feature extraction. Additionally, these networks can adaptively adjust the functionality and dimensions of each layer, providing greater flexibility to meet different task requirements, thereby significantly improving model performance while maintaining parameter efficiency. ConvNeXt [56] (Fig 3(d)), as a modern convolutional neural network architecture, separates the processing of spatial features and channel features and uses an additional channel mixer to enhance the interaction between different channels, thereby enriching feature representation. However, ConvNeXt uses only one depthwise separable convolution for spatial features, and the channel mixer simply performs an expand-and-reduce dimensional operation on the channels. While these designs improve inter-channel interaction, their sensitivity to local detail features may still be insufficient. Therefore, although this model performs well in general image tasks, it may require further adjustments or integration with other mechanisms in specialized fake face detection tasks to better capture and analyze the inherent local and high-frequency detail features of forgery techniques.
On this basis, considering the special requirements of deepfake detection, our MkfaNet model offers significant advantages: built upon a modern DNN architecture, MkfaNet enhances the ability to capture local and high-frequency details in deepfake images by integrating MKA and MFA. The MKA module adaptively selects features of specific organs through multiple convolution processes, accurately modeling the subtle differences between real and fake faces, while the MFA module focuses on frequency components, adaptively rebalancing high- and low-frequency features to increase the model’s sensitivity to abnormal high-frequency details in fake images. This gives MkfaNet higher accuracy and efficiency compared to both traditional CNNs and existing modern DNN architectures, especially demonstrating superior performance in handling advanced forgery techniques.
3.5 Network configurations
We provide detailed architecture configurations of MkfaNet variants in Table 2, where we scale the embedding dimensions and the number of blocks for each stage. (1) MkfaNet-Tiny, with embedding dimensions of (32, 64, 128, 256), is designed for lightweight deepfake detection scenarios and contains only 5.2M parameters and 1.5 GFLOPs. It achieves a fast inference time of 22 ms on a batch of 480 images tested on an RTX 4090 GPU. (2) MkfaNet-Small uses larger embedding dimensions of (64, 128, 320, 512), reaching 19.8M parameters and 6.7 GFLOPs, while still maintaining a practical inference time of 98 ms under the same setting. Compared with other modern architectures such as ConvNeXt [56], which typically involve around 25M parameters, MkfaNet-Small provides a favorable trade-off between performance and computational efficiency.
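For reference, a configuration sketch of the two variants is shown below, reusing the MkfaNet skeleton from Sect 3.1; the embedding dimensions follow the text, whereas the per-stage block counts are placeholders standing in for Table 2.

```python
# Illustrative configs; dims follow the text above, depths are placeholders.
MKFANET_CONFIGS = {
    "tiny":  {"dims": (32, 64, 128, 256),  "depths": (2, 2, 6, 2)},  # ~5.2M params, 1.5 GFLOPs
    "small": {"dims": (64, 128, 320, 512), "depths": (2, 2, 6, 2)},  # ~19.8M params, 6.7 GFLOPs
}

def build_mkfanet(variant: str = "tiny"):
    cfg = MKFANET_CONFIGS[variant]
    return MkfaNet(dims=cfg["dims"], depths=cfg["depths"])            # skeleton from Sect 3.1
```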
4 Experiments
4.1 Settings
Datasets. To evaluate the performances and generalization abilities of our proposed backbone, we follow DeepfakeBench [39] to conduct comparison and analysis experiments on seven commonly used deepfake detection datasets, as shown in Table 3: FaceForensics++ (FF++) [15], CelebDF-v1 (CDFv1) [65], CelebDF-v2 (CDFv2) [65], DeepFakeDetection (DFD) [66], DeepFake Detection Challenge Preview (DFDC-P) [67], DeepFake Detection Challenge (DFDC) [68], and DeeperForensics-1.0 (DF-1.0) [20]. Specifically, FF++ is a large-scale database with 1.8 million forged images that contains 4 types of manipulation methods, including Deepfakes (FF-DF) [69], Face2Face (FF-F2F) [49], FaceSwap (FF-FS) [70], and NeuralTextures (FF-NT) [50]. Note that we use the lightly compressed (c23) version of FF++ as the default training data (the two other versions of FF++ are the raw and heavily compressed (c40) ones), while the remaining datasets are used as testing datasets.
Beyond FF++, CelebDF-v1 and CelebDF-v2 introduce more challenging Deepfake synthesis techniques that mitigate the typical visual artifacts found in FF++, making them closer to real-world Deepfake videos. CelebDF-v2 further refines CelebDF-v1 by reducing boundary artifacts and unnatural lip movements, increasing the difficulty of detection. DFD is a high-quality deepfake dataset released by Google, containing videos forged using multiple face-swapping techniques, captured in a controlled environment with consistent lighting. DFDC-P and DFDC were developed by Facebook and include a diverse set of Deepfake videos created with multiple face manipulation techniques, representing real-world variations in lighting conditions, compression levels, and face occlusions. DFDC is especially challenging due to its mix of synthetic and real data, as well as multiple unknown manipulation techniques. DF-1.0 is designed to evaluate model robustness under various real-world perturbations, such as Gaussian noise, compression artifacts, color distortions, and adversarial attacks. These perturbations simulate realistic degradation scenarios, making DF-1.0 highly suitable for assessing cross-domain generalization capabilities.
All detectors are trained on FF-c23 and evaluated on other datasets. Avg. denotes the average AUC for within-domain and cross-domain evaluations, and the best result for each group is highlighted in bold. † represents our reproduced results, while DeepfakeBench provides others. The values reported are accompanied by ±, which represents the margin of error for the values in the previous row, at a 95% confidence level.
To ensure fair and consistent evaluations, we adopt the full data pre-processing workflow proposed in DeepfakeBench and use fixed training and testing resolutions for the cropped face images.
Implementation details. For a fair comparison, we consider three types of detectors in DeepfakeBench [39], as detailed in Table 4: (1) Naive detectors that combine a backbone and a binary classifier without introducing manually designed features. Both classical CNNs (e.g., ResNet [71] and EfficientNet [35]) and modern architectures (e.g., Swin Transformer [59] and ConvNeXt [56]) are compared. (2) Spatial detectors that build upon the backbone and further utilize spatial features with manually designed algorithms. (3) Frequency detectors that focus on exploring frequency components and artifacts to detect forgeries. As for training settings, we follow the official data splits provided by DeepfakeBench [39] to ensure fair and consistent evaluation. All detectors with classical CNN backbones are trained using the Adam optimizer with a batch size of 32. For detectors based on modern backbones, such as ConvNeXt and our MkfaNet, we adopt the AdamW optimizer with a learning rate of 5 × 10−4 and a batch size of 256. Pre-trained weights on ImageNet-1K are used for backbone initialization when available, and MkfaNet adopts the same pre-training setting as ConvNeXt-T, as shown in Table 5. We apply several data augmentations, including image compression, horizontal flipping, rotation, Gaussian blur, and random brightness/contrast adjustment. All models are trained for 50 epochs, and the best-performing checkpoint is selected based on validation performance. For evaluation, we report the mean frame-level Area Under the Curve (AUC) over three trials.
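A condensed sketch of the modern-backbone training setup and the frame-level AUC evaluation described above is shown below (PyTorch + scikit-learn); the dataloader, loss weighting, scheduler, and augmentation pipeline are omitted placeholders, and `build_mkfanet` refers to the configuration sketch in Sect 3.5.

```python
# Sketch only: optimizer setting from the text; everything else is a placeholder.
import torch
from sklearn.metrics import roc_auc_score

model = build_mkfanet("small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)            # AdamW, lr = 5e-4
criterion = torch.nn.CrossEntropyLoss()

def evaluate_frame_auc(model, loader, device="cpu"):
    """Frame-level AUC: score every frame, then compare against binary labels (1 = fake)."""
    model.eval()
    scores, labels = [], []
    with torch.no_grad():
        for frames, y in loader:                                      # (B, 3, H, W), (B,)
            logits = model(frames.to(device))
            scores += torch.softmax(logits, dim=1)[:, 1].cpu().tolist()   # P(fake)
            labels += y.tolist()
    return roc_auc_score(labels, scores)
```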
This table presents the classification results of our model on the CelebDF-v1 test set, which contains 1,203 real samples and 1,933 fake samples. The confusion matrix includes four key values: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN).
4.2 Comparison results
As shown in Table 4, we conduct within-domain and cross-domain evaluations for three versions of our MkfaNet, i.e., as lightweight detectors, as naive detectors compared against various backbones, and as advanced detectors with prior knowledge from spatial or frequency domains. Cross-domain evaluation involves testing the model on datasets different from the training data.
Within-domain evaluations. We first conduct within-domain evaluations to verify the performances of detectors within the same dataset following DeepfakeBench [39]. Table 4 (middle columns) shows MkfaNet variants achieve the best average results on six within-domain datasets. As for the lightweight detectors, MkfaNet-T significantly outperforms MesoNet [14] and CapsuleNet [57] with similar parameters by 9.33% and 2.87% AUC while outperforming CNN-Aug [58] using only a quarter of the parameters of ResNet-34 [71]. When compared to naive detectors with around 20M parameters, the modern networks (Swin-T [59] and ConvNeXt-T [56]) consistently improve over the classical CNNs (ResNet-50 [71], Xception [34], and EfficientNet-B4 [35]), e.g., ConvNeXt-T outperforms ResNet-50 and EfficientNet-B4 by 10.29% and 0.77% AUC on FF-c23, which might be attributed to the MetaFormer macro design [74] and more parameters. Meanwhile, our proposed MkfaNet-S significantly improves over both classical CNNs and modern networks with efficient usage of parameters, e.g., MkfaNet-S yields 94.85% AUC and around 0.25∼8.0% performance gains on average compared to previous backbones. When employing larger backbone encoders with manually-designed features, FFD [62] with MkfaNet-S significantly outperforms frequency detectors (F3Net [19], SPSL [60], and SRM [18]) and spatial detectors (FWA [61], X-ray [28], and CORE [63]) with Xception and HRNet [75] backbones at a similar parameter scale, while even yielding better results than UCF [64] with a larger Xception backbone.
Cross-domain evaluations. Then, we evaluate detectors on different datasets without further fine-tuning, which reflects the generalization and robustness of the compared detectors. As shown in Table 4 (right columns), all models suffer performance decreases because of the challenging domain gap. Surprisingly, our proposed MkfaNet variants achieve the best average results and show greater performance gains over existing methods, e.g., the naive detector with MkfaNet-S outperforms Xception and ConvNeXt-T by 1.43% and 1.13% average AUC, indicating that MkfaNet might learn more common and robust features. We verify this hypothesis in Sect 4.3 with visualizations.
Further analysis. The confusion matrix in Table 6 shows that the model misclassified 459 fake images as real (FN) and 181 real images as fake (FP). As observed in Fig 4, false positives are often caused by compression artifacts, occlusions, and complex backgrounds, while false negatives arise from high-quality Deepfakes with minimal detectable artifacts or low-frequency manipulations that retain high-frequency details. These cases highlight the inherent limitations of relying solely on a novel backbone for Deepfake detection.
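For completeness, the Table 6 entries can be derived from thresholded frame-level predictions as sketched below (scikit-learn), treating fake as the positive class; the label and prediction arrays are placeholders, not the actual evaluation outputs.

```python
# Sketch of how TP / FN / FP / TN are obtained from binary predictions (fake = 1).
from sklearn.metrics import confusion_matrix

labels = [1, 1, 0, 0, 1]            # placeholder ground truth (1 = fake, 0 = real)
preds  = [1, 0, 0, 1, 1]            # placeholder thresholded model outputs
tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```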
This figure illustrates four types of prediction results made by MkfaNet on the CelebDF-v1 dataset. The first row presents true positives (TP), where fake images are correctly classified with high confidence. The second row shows true negatives (TN), where real images are accurately recognized as real. The third row presents false positives (FP), where real images are mistakenly classified as fake due to factors such as compression artifacts, facial expressions, or lighting conditions. The fourth row shows false negatives (FN), where fake images are misclassified as real. The softmax output probabilities indicate the model’s prediction confidence. To protect privacy, the original facial images were anonymized by replacing them with icon representations. Corresponding image identifiers from the CelebDF-v1 and UADFV datasets are shown.
The module without “+” denotes the baseline modules, while those with “+” are added to the baseline (using gray backgrounds). c1, c2, and c3 represent the number of channels assigned to the MKA module’s branches with dilation rates of 1, 2, and 3, respectively.
Meanwhile, the true positive (TP) and true negative (TN) cases demonstrate the model’s strong capability in correctly identifying both forged and authentic faces, even under challenging conditions such as lighting variation and facial expression changes. These correctly classified examples highlight the robustness of the proposed MkfaNet in detecting subtle forgery cues and preserving real content recognition.
To further improve performance, multi-task learning could be explored, incorporating auxiliary tasks such as forgery localization, frequency domain analysis, or uncertainty estimation to provide additional supervision. Additionally, integrating temporal consistency analysis for video-based detection or using adaptive feature fusion strategies could enhance robustness against challenging samples. These directions offer promising pathways for advancing Deepfake detection beyond backbone design alone.
4.3 Ablation and analysis
Ablation studies of network modules. We first ablate the designed modules in MkfaNet with a simplified experimental setting, i.e., training and evaluation on FF-c23 without using ImageNet-1K pre-trained weights. We take ConvNeXt-T [56] as the baseline for MkfaNet, which outperforms the classical bottleneck in ResNet-50 [71] in Table 7. As for the proposed Multi-Kernel Aggregator (MKA) block, using the gating branch described in Sect 3.2 can yield similar performances to ConvNeXt-T with around 10M fewer parameters, and using Multi-DWConv with dilation ratios of (1, 2, 3) aggregates contextualized patterns and improves the performances. As for the Multi-Frequency Aggregator (MFA) block, adding a Squeeze-and-Excitation (SE) module [76] to DWConv 3×3 + FFN is equivalent to the EfficientNet block [35], which requires numerous parameters for performance gains. Our proposed MF module described in Sect 3.3 brings better AUC than the SE module while using fewer parameters.
Visualization analysis. We then evaluate the learned features of MkfaNet-S with two visualizations. As shown in Fig 5, the representations of various detectors are visualized by t-SNE [72] on the FF++ (c23) dataset with 5000 randomly selected samples following DeepfakeBench, where four forgery types (Deepfakes, Face2Face, FaceSwap, and NeuralTextures) in FF++ are considered. The representations of real and fake samples are more separable in MkfaNet-S than in previous works, while the four different forgeries are also discriminated by MkfaNet-S. It indicates that MkfaNet can capture common features rather than over-fitting the training dataset. Meanwhile, we further investigate the spatial features learned by MkfaNet with Grad-CAM [73] visualization in Fig 6. We consider cross-domain evaluation samples from FFDI-2024, the latest Global Multimedia Deepfake Detection competition on Kaggle, and compare with various backbone architectures. Fig 6 shows that MkfaNet-S precisely and consistently locates organs to determine fake or real faces, while other backbones sometimes extract irrelevant regions, which might deteriorate the generalization and robustness of forgery detection.
Based on the naive detector, our MkfaNet-S distinguishes different types of forgery into several clusters, whereas other backbones could not learn the discriminative patterns without additional supervision.
As for fake images, classical CNNs like ResNet-50 show robust but coarse localization of human faces, while modern architectures like Swin-T can activate some semantic features. Our MkfaNet-S not only exhibits precise localization of discriminative organs but also tells the difference between fake and real faces.
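A minimal sketch of the t-SNE visualization step is given below (scikit-learn and matplotlib); extracting the penultimate-layer embeddings from each detector is assumed to happen upstream and is not shown.

```python
# Sketch: project (N, D) penultimate features to 2D and color points by forgery type.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png") -> None:
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")    # one color per forgery type
    plt.savefig(out_path, dpi=200)
```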
5 Conclusion
In this paper, we introduce MkfaNet, a novel backbone network specifically designed for face forgery detection. It combines two core modules, the Multi-Kernel Aggregator (MKA) and the Multi-Frequency Aggregator (MFA), which effectively enhance the ability to distinguish between real and forged facial features. The MKA module targets spatial context by adaptively selecting organ-specific features extracted through multiple convolutions to simulate the subtle facial differences between real and fake faces. The MFA module focuses on frequency components, processing different frequency bands by adaptively rebalancing high-frequency and low-frequency features. Comprehensive experiments on seven popular Deepfake detection benchmarks demonstrate that MkfaNet achieves an AUC of 0.9591 in within-domain evaluations and 0.7963 in cross-domain evaluations, outperforming several state-of-the-art methods while maintaining high computational efficiency. This innovative approach not only significantly improves the accuracy of forgery detection but also enhances the model’s capability to handle complex facial data, making it a powerful tool for combating advanced forgery techniques in the future.
While MkfaNet demonstrates strong performance across multiple deepfake detection benchmarks, some limitations remain. First, its scalability to extremely large and diverse datasets needs further evaluation, as real-world deepfake videos often exhibit greater variability in quality, compression, and manipulation techniques. Second, its generalization to novel deepfake generation methods is a challenge, as emerging techniques may introduce more sophisticated forgery patterns that require adaptive detection mechanisms.
Despite these limitations, our proposed approach significantly improves forgery detection accuracy and enhances the model’s ability to handle complex facial data. Future research will focus on adapting MkfaNet to larger datasets and developing more dynamic feature extraction strategies to enhance robustness against evolving forgery techniques. This work provides a solid foundation for future advancements in deepfake detection and contributes to the ongoing fight against digital media manipulation.
References
- 1. Kingma DP, Welling M. An introduction to variational autoencoders. FNT in Machine Learning. 2019;12(4):307–92.
- 2. Creswell A, White T, Dumoulin V, Arulkumaran K, Sengupta B, Bharath AA. Generative adversarial networks: an overview. IEEE Signal Process Mag. 2018;35(1):53–65.
- 3. Croitoru F-A, Hondru V, Ionescu RT, Shah M. Diffusion models in vision: a survey. IEEE Trans Pattern Anal Mach Intell. 2023;45(9):10850–69. pmid:37030794
- 4. Cahlan S. How misinformation helped spark an attempted coup in Gabon. The Washington Post. 2020:13.
- 5. Wakefield J. Deepfake presidents used in Russia-Ukraine war. https://www.bbc.com/news/technology-60780142
- 6. Liu X. Deepfake technology misused, “AI face swapping” sexual crimes trigger panic in South Korean society. 2024. https://m.yicai.com/news/102254049.html
- 7. Le BM, Woo SS. Quality-agnostic deepfake detection with intra-model collaborative learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 22378–89.
- 8. Feng C, Chen Z, Owens A. Self-supervised video forensics by audio-visual anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023. p. 10491–503.
- 9. Bai W, Liu Y, Zhang Z, Li B, Hu W. Aunet: learning relations between action units for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 24709–19.
- 10. Ge S, Zhao S, Li C, Li J. Low-resolution face recognition in the wild via selective knowledge distillation. IEEE Trans Image Process. 2018:10.1109/TIP.2018.2883743. pmid:30507531
- 11. Ge S, Li J, Ye Q, Luo Z. Detecting masked faces in the wild with lle-cnns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 2682–90.
- 12. El-kenawy E-SM, Khodadadi N, Mirjalili S, Abdelhamid AA, Eid MM, Ibrahim A. Greylag goose optimization: nature-inspired optimization algorithm. Expert Systems with Applications. 2024;238:122147.
- 13. El-Kenawy E-SM, Ibrahim A. Football Optimization Algorithm (FbOA): a novel metaheuristic inspired by team strategy dynamics. JAIM. 2024;8(1):21–38.
- 14. Afchar D, Nozick V, Yamagishi J, Echizen I. MesoNet: a compact facial video forgery detection network. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS). 2018. p. 1–7. https://doi.org/10.1109/wifs.2018.8630761
- 15. Rossler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M. FaceForensics++: learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. p. 1–11.
- 16. Cao J, Ma C, Yao T, Chen S, Ding S, Yang X. End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 4113–22.
- 17. Wang C, Deng W. Representative forgery mining for fake face detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 14923–32.
- 18. Luo Y, Zhang Y, Yan J, Liu W. Generalizing face forgery detection with high-frequency features. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 16317–26.
- 19. Qian Y, Yin G, Sheng L, Chen Z, Shao J. Thinking in frequency: face forgery detection by mining frequency-aware clues. In: European conference on computer vision. Springer; 2020. p. 86–103.
- 20. Jiang L, Li R, Wu W, Qian C, Loy CC. Deeperforensics-1.0: a large-scale dataset for real-world face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. p. 2889–98.
- 21. Haliassos A, Vougioukas K, Petridis S, Pantic M. Lips don’t lie: a generalisable and robust approach to face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021. p. 5039–49.
- 22. Li S, Xia X, Ge S, Liu T. Selective-supervised contrastive learning with noisy labels. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022. p. 316–25. https://doi.org/10.1109/cvpr52688.2022.00041
- 23. Binh LM, Woo S. ADD: frequency attention and multi-view based knowledge distillation to detect low-quality compressed deepfake images. AAAI. 2022;36(1):122–30.
- 24. Pu J, Mangaokar N, Kelly L, Bhattacharya P, Sundaram K, Javed M, et al. Deepfake videos in the wild: analysis and detection. In: Proceedings of the Web Conference 2021. 2021. p. 981–92. https://doi.org/10.1145/3442381.3449978
- 25. Shiohara K, Yamasaki T. Detecting deepfakes with self-blended images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 18720–9.
- 26. Le BM, Kim J, Tariq S, Moore K, Abuadbba A, Woo SS. Sok: facial deepfake detectors. arXiv preprint 2024.
- 27. Chen L, Zhang Y, Song Y, Liu L, Wang J. Self-supervised learning of adversarial example: towards good generalizations for deepfake detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. p. 18710–9.
- 28. Li L, Bao J, Zhang T, Yang H, Chen D, Wen F, et al. Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 5001–10.
- 29. Chen G, Qiao L, Shi Y, Peng P, Li J, Huang T, et al. Learning open set network with discriminative reciprocal points. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer; 2020. p. 507–22.
- 30. Zhao T, Xu X, Xu M, Ding H, Xiong Y, Xia W. Learning self-consistency for deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 15023–33.
- 31. Sheikhpour R, Mohammadi M, Berahmand K, Saberi-Movahed F, Khosravi H. Robust semi-supervised multi-label feature selection based on shared subspace and manifold learning. Information Sciences. 2025;699:121800.
- 32. Sheikhpour R, Berahmand K, Mohammadi M, Khosravi H. Sparse feature selection using hypergraph Laplacian-based semi-supervised discriminant analysis. Pattern Recognition. 2025;157:110882.
- 33. Sheng Z, Yu Z, Liu X, Cao SY, Liu Y, Shen HL, et al. Structure aggregation for cross-spectral stereo image guided denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023. p. 13997–4006.
- 34. Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. p. 1251–8.
- 35. Tan M, Le Q. Efficientnet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning. PMLR; 2019. p. 6105–14.
- 36. Wang Y, Yu K, Chen C, Hu X, Peng S. Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 7278–87.
- 37. Zhao H, Wei T, Zhou W, Zhang W, Chen D, Yu N. Multi-attentional deepfake detection. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021. p. 2185–94. https://doi.org/10.1109/cvpr46437.2021.00222
- 38. Zhao Y, Yan K, Huang F, Li J. Graph-based high-order relation discovery for fine-grained recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 15079–88.
- 39. Yan Z, Zhang Y, Yuan X, Lyu S, Wu B. Deepfakebench: a comprehensive benchmark of deepfake detection. arXiv preprint 2023. https://arxiv.org/abs/2307.01426
- 40. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Commun ACM. 2020;63(11):139–44.
- 41. Choi Y, Choi M, Kim M, Ha JW, Kim S, Choo J. Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. p. 8789–97.
- 42. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint 2017.
- 43. Kingma DP, Welling M. Auto-encoding variational bayes. arXiv preprint 2013. https://arxiv.org/abs/1312.6114
- 44. Zhao W, Rao Y, Shi W, Liu Z, Zhou J, Lu J. Diffswap: high-fidelity and controllable face swapping via 3d-aware masked diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 8568–77.
- 45. Gerogiannis D, Papantoniou FP, Potamias RA, Lattas A, Moschoglou S, Ploumpis S. AnimateMe: 4D facial expressions via diffusion models. arXiv preprint 2024.
- 46. Li L, Bao J, Yang H, Chen D, Wen F. Faceshifter: towards high fidelity and occlusion aware face swapping. arXiv preprint 2019. https://arxiv.org/abs/1912.13457
- 47. Nirkin Y, Keller Y, Hassner T. Fsgan: subject agnostic face swapping and reenactment. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019. p. 7184–93.
- 48. Perov I, Gao D, Chervoniy N, Liu K, Marangonda S, Umé C. DeepFaceLab: integrated, flexible and extensible face-swapping framework. arXiv preprint. 2020.
- 49. Thies J, Zollhofer M, Stamminger M, Theobalt C, Nießner M. Face2face: real-time face capture and reenactment of rgb videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2387–95.
- 50. Thies J, Zollhöfer M, Nießner M. Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics. 2019;38(4):1–12.
- 51. Dong S, Wang J, Ji R, Liang J, Fan H, Ge Z. Implicit identity leakage: the stumbling block to improving deepfake detection generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023. p. 3994–4004.
- 52. Choi J, Kim T, Jeong Y, Baek S, Choi J. Exploiting style latent flows for generalizing deepfake video detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024. p. 1133–43.
- 53. Zheng Y, Bao J, Chen D, Zeng M, Wen F. Exploring temporal coherence for more general video face forgery detection. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 15044–54.
- 54. Wang Z, Bao J, Zhou W, Wang W, Li H. AltFreezing for more general video face forgery detection. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. p. 4129–38. https://doi.org/10.1109/cvpr52729.2023.00402
- 55. Wang P, Zheng W, Chen T, Wang Z. Anti-oversmoothing in deep vision transformers via the Fourier domain analysis: from theory to practice. In: International Conference on Learning Representations (ICLR). 2022.
- 56. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. In: Conference on Computer Vision and Pattern Recognition (CVPR); 2022. p. 11976–86.
- 57. Nguyen HH, Yamagishi J, Echizen I. Capsule-forensics: using capsule networks to detect forged images and videos. In: IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2019. p. 2307–11.
- 58. Wang SY, Wang O, Zhang R, Owens A, Efros AA. CNN-generated images are surprisingly easy to spot... for now. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 8695–704.
- 59. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: International Conference on Computer Vision (ICCV); 2021.
- 60. Liu H, Li X, Zhou W, Chen Y, He Y, Xue H. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
- 61. Li Y, Lyu S. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint 2018.
- 62. Dang H, Liu F, Stehouwer J, Liu X, Jain AK. On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
- 63. Ni Y, Meng D, Yu C, Quan C, Ren D, Zhao Y. CORE: consistent representation learning for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop. 2022. p. 12–21.
- 64. Yan Z, Zhang Y, Fan Y, Wu B. UCF: uncovering common features for generalizable deepfake detection. arXiv preprint 2023. https://arxiv.org/abs/2304.13949
- 65. Li Y, Yang X, Sun P, Qi H, Lyu S. Celeb-df: a new dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
- 66. DFD.
- 67. Dolhansky B, Howes R, Pflaum B, Baram N, Ferrer CC. The deepfake detection challenge (dfdc) preview dataset. arXiv preprint 2019. https://arxiv.org/abs/1910.08854
- 68. Dolhansky B, Bitton J, Pflaum B, Lu J, Howes R, Wang M. The deepfake detection challenge dataset. arXiv preprint 2020. https://arxiv.org/abs/2006.07397
- 69. DeepFakes.
- 70. FaceSwap.
- 71. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 770–8.
- 72. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008.
- 73. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: visual explanations from deep networks via gradient-based localization. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2017. p. 618–26.
- 74. Yu W, Si C, Zhou P, Luo M, Zhou Y, Feng J, et al. MetaFormer baselines for vision. IEEE Trans Pattern Anal Mach Intell. 2023;10.1109/TPAMI.2023.3329173. pmid:37910405
- 75. Sun K, Xiao B, Liu D, Wang J. Deep high-resolution representation learning for human pose estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR); 2019. p. 5693–703.
- 76. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. 2018. p. 7132–41. https://doi.org/10.1109/cvpr.2018.00745