Fig 1.

Illustration of frequency priors in deepfake detection.

(a): Source image. (b): Data frequency domain analysis. (c): Relative log amplitudes of Fourier transformed feature maps of ResNet50. (d): Relative log amplitudes of Fourier transformed feature maps of MkfaNet. (b) reveals the uniformity of the frequency distribution in real faces and the concentration of high-frequency anomalies in forged faces. (c) shows that ResNet50 has a relatively low logarithmic amplitude in the high-frequency region, indicating its insufficiency in capturing high-frequency details. (d) demonstrates that MkfaNet has a higher amplitude in the high-frequency region with broader coverage, highlighting its advantages in handling high-frequency details and identifying forgery features. To protect privacy, the original facial images were anonymized by replacing them with icon representations. Corresponding image identifiers from the CelebDF-v1 dataset are shown.
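The relative log amplitude plotted in panels (c) and (d) can be illustrated with a small sketch. The caption does not state the exact procedure, so this follows one common convention as an assumption: take the 2D Fourier transform of a feature map, take the log of the amplitude, and subtract the value at the zero-frequency bin. The function names, the `cutoff` parameter, and the two toy maps are hypothetical and are not taken from the paper.

```python
import numpy as np

def relative_log_amplitude(feature_map, eps=1e-8):
    """Log amplitude of the 2D spectrum, relative to the zero-frequency bin."""
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    log_amp = np.log(np.abs(spectrum) + eps)
    h, w = feature_map.shape
    # After fftshift, the zero-frequency bin sits at the centre of the array.
    return log_amp - log_amp[h // 2, w // 2]

def high_frequency_mean(rel_log_amp, cutoff=0.5):
    """Mean relative log amplitude over bins farther than `cutoff` of the half-width."""
    h, w = rel_log_amp.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    mask = radius > cutoff * min(h, w) / 2
    return float(rel_log_amp[mask].mean())

# Toy stand-ins for feature maps (assumed data, not real network activations).
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:32, 0:32]
smooth_map = 1.0 + np.cos(2 * np.pi * xx / 32)                   # low-frequency content only
detailed_map = smooth_map + 0.5 * rng.standard_normal((32, 32))  # added fine detail

smooth_score = high_frequency_mean(relative_log_amplitude(smooth_map))
detailed_score = high_frequency_mean(relative_log_amplitude(detailed_map))
print(smooth_score, detailed_score)
```

Under this convention, a map that retains fine detail scores higher in the high-frequency band, which is the kind of difference the caption describes between panels (c) and (d).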

Fig 2.

Overall detection pipeline of MkfaNet.

The model consists of a four-stage hierarchical backbone, where each stage includes an embedding stem, Multi-Kernel Aggregator (MKA), and Multi-Frequency Aggregator (MFA). The pipeline takes an input face image and processes it through the stacked modules to produce a final classification score via a fully connected (FC) layer.

Fig 3.

(a) Multi-Kernel Aggregation (MKA) block, designed as a token mixer, utilizes depthwise convolution layers with different dilation rates to capture multi-scale spatial features, improving sensitivity to subtle local manipulations.

(b) Multi-Frequency Aggregation (MFA) block, serving as a channel mixer, applies frequency-aware feature decomposition through depthwise convolutions and gating mechanisms to enhance forgery-specific artifact extraction. (c) EfficientNet block, which integrates a Squeeze-and-Excitation (SE) module to adaptively recalibrate feature responses. (d) ConvNeXt block, incorporating a Channel Mixer and depthwise convolutions, serves as a baseline for comparing hierarchical feature mixing strategies.
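To see why different dilation rates in the MKA block give multi-scale coverage, the spatial span of a dilated kernel can be worked out directly. A kernel of size k with dilation d covers k + (k - 1)(d - 1) positions along each axis. The sketch below applies this to dilation rates 1, 2, and 3. The kernel size of 3 is an assumption chosen for illustration, since the caption does not state the kernel sizes used.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial span covered by a dilated convolution kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

KERNEL_SIZE = 3  # hypothetical value, not taken from the paper

# Span of each branch for dilation rates 1, 2, and 3.
spans = {d: effective_kernel_size(KERNEL_SIZE, d) for d in (1, 2, 3)}
print(spans)
```

Because dilation widens the span without adding weights, the branches see progressively larger neighbourhoods at the same parameter cost per branch.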

Table 1.

Comparison of models in feature extraction and aggregation strategies.

Table 2.

Architecture configurations of MkfaNet variants.

Table 3.

Within-domain and cross-domain evaluations of various deepfake detectors and backbones using the AUC metric.

All detectors are trained on FF-c23 and evaluated on other datasets. Avg. denotes the average AUC for within-domain and cross-domain evaluations, and the best result for each group is highlighted in bold. † marks our reproduced results, while the others are taken from DeepfakeBench. Values marked with ± give the margin of error, at a 95% confidence level, for the values in the preceding row.
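For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real sample, with ties counted as half. The sketch below computes it that way on toy scores. It also shows one common way to obtain a 95% margin of error from repeated runs, using a normal approximation. The caption does not say how the authors computed their margins, so that part is an assumption, and all names and numbers here are hypothetical.

```python
import math
import statistics

def auc(fake_scores, real_scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))

def margin_of_error_95(values):
    """Normal-approximation 95% margin of error of the mean (an assumed method)."""
    return 1.96 * statistics.stdev(values) / math.sqrt(len(values))

# Toy detector scores (higher means "more likely fake"); not results from the paper.
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

# Toy AUC values from three hypothetical repeated runs.
toy_margin = margin_of_error_95([0.90, 0.92, 0.94])
print(toy_auc, toy_margin)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration. A rank-based implementation would be preferable on full test sets.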

Table 4.

Summary of the deepfake detection datasets used.

Table 5.

ImageNet-1K hyper-parameters and training recipes for Swin-T, ConvNeXt-T, and our proposed MkfaNet-T/S.

Table 6.

Confusion matrix for the CelebDF-v1 dataset.

This table presents the classification results of our model on the CelebDF-v1 test set, which contains 1,203 real samples and 1,933 fake samples. The confusion matrix includes four key values: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN).
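The four confusion-matrix values determine the usual summary metrics. The sketch below derives accuracy, precision, recall, and F1 from them, assuming that fake is the positive class. The counts are toy numbers chosen for illustration only and are not the values reported in the table.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard metrics from confusion-matrix counts (positive class assumed to be fake)."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Toy counts (hypothetical): 10 fake samples and 10 real samples.
metrics = confusion_metrics(tp=8, fn=2, fp=1, tn=9)
print(metrics)
```

Note that TP + FN equals the number of fake samples and FP + TN equals the number of real samples, so the row totals of the table can be checked against the 1,933 fake and 1,203 real samples stated above.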

Fig 4.

Visualization of true positives, true negatives, false positives, and false negatives in deepfake detection.

This figure illustrates four types of prediction results made by MkfaNet on the CelebDF-v1 dataset. The first row presents true positives (TP), where fake images are correctly classified with high confidence. The second row shows true negatives (TN), where real images are accurately recognized as real. The third row presents false positives (FP), where fake images are mistakenly classified as real. The fourth row shows false negatives (FN), where real images are misclassified as fake due to factors such as compression artifacts, facial expressions, or lighting conditions. The softmax output probabilities indicate the model’s prediction confidence. To protect privacy, the original facial images were anonymized by replacing them with icon representations. Corresponding image identifiers from the CelebDF-v1 and UADFV datasets are shown.

Table 7.

Ablation of designed modules on FF-c23.

Modules without “+” are the baseline modules, while those with “+” are added to the baseline (shown with gray backgrounds). c1, c2, and c3 represent the number of channels assigned to the MKA module’s branches with dilation rates of 1, 2, and 3, respectively.

Fig 5.

Visualization of latent embedding of detectors with t-SNE [72] on FF++ (c23) according to settings in DeepfakeBench [39].

With the naive detector, our MkfaNet-S separates the different forgery types into distinct clusters, whereas the other backbones fail to learn such discriminative patterns without additional supervision.

Fig 6.

Grad-CAM activation maps [73] of fake images in the validation set of FFDI-2024 (collected online), used for cross-domain evaluation. The naive detector with different backbones is compared with ours.

On fake images, classical CNNs like ResNet-50 show robust but coarse localization of human faces, while modern architectures like Swin-T can activate some semantic features. Our MkfaNet-S not only localizes discriminative facial organs precisely but also distinguishes fake faces from real ones.
