Multiple contexts and frequencies aggregation network for deepfake detection

doi:10.1371/journal.pone.0337409

Fig 1.

Illustration of frequency priors in deepfake detection.

(a): Source image. (b): Data frequency domain analysis. (c): Relative log amplitudes of Fourier transformed feature maps of ResNet50. (d): Relative log amplitudes of Fourier transformed feature maps of MkfaNet. (b) reveals the uniformity of the frequency distribution in real faces and the concentration of high-frequency anomalies in forged faces. (c) shows that ResNet50 has a relatively low logarithmic amplitude in the high-frequency region, indicating its insufficiency in capturing high-frequency details. (d) demonstrates that MkfaNet has a higher amplitude in the high-frequency region with broader coverage, highlighting its advantages in handling high-frequency details and identifying forgery features. To protect privacy, the original facial images were anonymized by replacing them with icon representations. Corresponding image identifiers from the CelebDF-v1 dataset are shown.
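Panels (c) and (d) compare log amplitudes of Fourier-transformed feature maps. The caption does not give the exact procedure, so the following is only a minimal sketch under assumptions: it uses a 1D signal instead of a 2D feature map, and it measures each frequency's log amplitude relative to the lowest (DC) frequency. The function names and the toy signals are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def relative_log_amplitude(signal, eps=1e-12):
    """Log amplitude per frequency, relative to the DC component.

    Only frequencies from 0 up to Nyquist are returned; eps avoids log(0).
    Assumption: "relative" means "minus the log amplitude at frequency 0".
    """
    spectrum = dft(signal)
    half = len(signal) // 2
    log_amp = [math.log(abs(c) + eps) for c in spectrum[: half + 1]]
    return [a - log_amp[0] for a in log_amp]


# Toy "feature rows": a smooth profile and the same profile with an added
# alternating (highest-frequency) pattern. Values are illustrative only.
smooth = [4.0, 5.0, 6.0, 7.0, 7.0, 6.0, 5.0, 4.0]
sharp = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(smooth)]

smooth_curve = relative_log_amplitude(smooth)
sharp_curve = relative_log_amplitude(sharp)
```

In this toy setting, the last entry of `sharp_curve` (the highest frequency) is much larger than the last entry of `smooth_curve`, which is the kind of gap the caption describes between the two backbones in the high-frequency region.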

Fig 2.

Overall detection pipeline of MkfaNet.

The model consists of a four-stage hierarchical backbone, where each stage includes an embedding stem, Multi-Kernel Aggregator (MKA), and Multi-Frequency Aggregator (MFA). The pipeline takes an input face image and processes it through the stacked modules to produce a final classification score via a fully connected (FC) layer.

Fig 3.

(a) Multi-Kernel Aggregation (MKA) block, designed as a token mixer, utilizes depthwise convolution layers with different dilation rates to capture multi-scale spatial features, improving sensitivity to subtle local manipulations.

(b) Multi-Frequency Aggregation (MFA) block, serving as a channel mixer, applies frequency-aware feature decomposition through depthwise convolutions and gating mechanisms to enhance forgery-specific artifact extraction. (c) EfficientNet block, which integrates a Squeeze-and-Excitation (SE) module to adaptively recalibrate feature responses. (d) ConvNeXt block, incorporating a Channel Mixer and depthwise convolutions, serves as a baseline for comparing hierarchical feature mixing strategies.
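The MKA description in (a) relies on depthwise convolutions with different dilation rates. The caption does not state kernel sizes or dilation rates, so the sketch below uses hypothetical values (kernel size 3, dilations 1 to 3) and a 1D signal to show how dilation widens the span a kernel covers without adding weights, and how a depthwise layer filters each channel independently. It is an illustration of the general technique, not the paper's implementation.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D cross-correlation with a dilated kernel."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def depthwise_dilated(channels, kernels, dilation):
    """Depthwise: each channel gets its own kernel, with no cross-channel mixing."""
    return [dilated_conv1d(ch, k, dilation) for ch, k in zip(channels, kernels)]


# Hypothetical configuration: one 3-tap kernel at three dilation rates.
spans = [effective_kernel_size(3, d) for d in (1, 2, 3)]

# A unit impulse shows where the dilated taps land.
impulse = [0, 0, 0, 1, 0, 0, 0]
response = dilated_conv1d(impulse, [1, 1, 1], dilation=2)
```

Here `spans` evaluates to `[3, 5, 7]`, so the same three weights cover progressively wider neighborhoods, which is how a set of dilation rates yields multi-scale spatial context.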

Table 1.

Comparison of models in feature extraction and aggregation strategies.

Table 2.

Architecture configurations of MkfaNet variants.

Table 3.

Within-domain and cross-domain evaluations of various deepfake detectors and backbones using the AUC metric.

All detectors are trained on FF-c23 and evaluated on other datasets. Avg. denotes the average AUC for within-domain and cross-domain evaluations, and the best result for each group is highlighted in bold. † marks our reproduced results; the others are provided by DeepfakeBench. Values prefixed with ± give the margin of error for the values in the previous row, at a 95% confidence level.
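As a reading aid for the AUC values and their margins of error, here is a minimal sketch. The caption does not say how the intervals were computed, so the margin below assumes a normal approximation over repeated runs (1.96 times the standard error); the function names, labels, scores, and per-run values are all hypothetical.

```python
import math
import statistics


def auc(labels, scores):
    """AUC as the probability that a fake (label 1) outscores a real (label 0).

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def margin_of_error_95(values):
    """Half-width of a 95% interval for the mean (normal approximation)."""
    return 1.96 * statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical detector scores for four images (1 = fake, 0 = real).
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Hypothetical AUCs from three repeated runs of one detector.
runs = [0.80, 0.82, 0.84]
mean_auc = statistics.mean(runs)
margin = margin_of_error_95(runs)
```

With only a few runs, a Student's t critical value would give a wider interval than 1.96; the normal approximation is used here purely for simplicity.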

Table 4.

Summary of the deepfake detection datasets used.

Grad-CAM activation maps [73] of fake images in the validation set of FFDI-2024 (collected online), used as a cross-domain evaluation. The naive detector with different backbones is compared with ours.

For fake images, classical CNNs such as ResNet-50 show robust but coarse localization of human faces, while modern architectures such as Swin-T activate some semantic features. Our MkfaNet-S not only localizes discriminative facial organs precisely but also distinguishes fake faces from real ones.
