FedEmoNet: Privacy-preserving federated learning with TCN-Transformer fusion for cross-corpus speech emotion recognition

doi:10.1371/journal.pone.0342953

Table 1.

Summary of related work comparing strengths, limitations, and how FedEmoNet addresses identified gaps.

Fig 1.

Complete methodology pipeline of the proposed FedEmoNet framework.

Phase 1: Data processing and feature engineering including spectrogram generation, MFCC extraction, chroma features, and handcrafted features, followed by ensemble PSO optimization. Phase 2: Multi-scale TCN-Transformer fusion architecture. Phase 3: Training and evaluation. Phase 4: Explainability analysis through LIME and SHAP.

Table 2.

Summary of datasets used in experiments. EmoDB and RAVDESS serve as federated training sources; CREMA-D is used exclusively for cross-corpus evaluation.

Fig 2.

FedProx-based federated learning protocol.

The global server maintains the global model and performs weighted aggregation. Each client receives the broadcast global model, performs local training, and sends back its updated parameters. The privacy boundary ensures raw data never leaves the client. The global test set is held out before client distribution.
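
The protocol in this caption can be sketched in a few lines. The sketch assumes the standard FedProx formulation (local loss plus a proximal term (mu/2)·||w - w_global||²) and aggregation weighted by client sample counts; the caption does not state the weighting scheme. The toy least-squares loss, the function names, and the values of `mu`, `lr`, and `steps` are all hypothetical.

```python
import random

def local_train(w_global, data, mu=0.1, lr=0.1, steps=20):
    """One client's FedProx update on a toy least-squares loss (hypothetical task)."""
    w = list(w_global)
    for _ in range(steps):
        # Gradient of the mean squared error of a linear model y ~ w . x
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(data)
        # Proximal term gradient mu * (w - w_global) limits client drift
        w = [wi - lr * (g + mu * (wi - wg))
             for wi, g, wg in zip(w, grad, w_global)]
    return w  # only parameters leave the client, never `data`

def aggregate(client_weights, client_sizes):
    """Server-side average weighted by client sample counts (assumed weighting)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]

def make_client(n, seed):
    """Synthetic client data drawn from y = 2*x0 - x1 (illustration only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 2.0 * x[0] - 1.0 * x[1]))
    return data

clients = [make_client(n, seed) for seed, n in enumerate([20, 40, 60])]
w_global = [0.0, 0.0]
for _ in range(30):  # communication rounds
    updates = [local_train(w_global, d) for d in clients]   # broadcast + local training
    w_global = aggregate(updates, [len(d) for d in clients])  # weighted aggregation
```

Setting `mu=0` in this sketch reduces the local step to plain FedAvg-style training, which is the comparison the later FedProx vs FedAvg panels refer to.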

Fig 3.

FedEmoNet local model architecture.

Three parallel branches: (1) CNN Branch processing spectral features through Conv2D layers with ReLU, batch normalization, MaxPool, and AdaptiveAvgPool; (2) TCN Branch processing PSR features at three scales through dilated causal convolutions with dilation rates d = 1, 2, 4; (3) Dense Branch processing handcrafted features. All branches produce embeddings fused via Multi-Head Attention and processed through N = 6 Transformer encoder blocks. The Classification Head combines Max Pool, Mean Pool, and Last State representations.
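
To make the TCN branch and the pooling head concrete, here is a minimal pure-Python sketch of a dilated causal convolution evaluated at the three dilation rates, followed by the max / mean / last-state summary. The 2-tap kernel, the toy signal, and the function names are hypothetical. In the architecture described above the pooling is applied to the Transformer output; applying it directly to the convolution outputs here is a simplification for illustration.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k*dilation], treating x[<0] as zero.

    Causal: y[t] depends only on x[0..t], never on future samples.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += wk * x[idx]
        out.append(acc)
    return out

def pooled_representation(seq):
    """Concatenate max pool, mean pool, and last state of a sequence."""
    return [max(seq), sum(seq) / len(seq), seq[-1]]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # toy input sequence
kernel = [1.0, -1.0]                               # hypothetical difference filter

# One output sequence per dilation rate d = 1, 2, 4
multi_scale = {d: causal_dilated_conv1d(signal, kernel, d) for d in (1, 2, 4)}

# Summarize each scale with the three pooled statistics
features = [v for d in (1, 2, 4) for v in pooled_representation(multi_scale[d])]
```

With a 2-tap kernel, the receptive field at dilation d spans d + 1 time steps, so the three scales look back 2, 3, and 5 samples respectively.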

Fig 4.

PSO optimization convergence.

(a) Fitness convergence showing global best and swarm mean stabilizing by iteration 35; (b) Feature count reduction from 150 to 103 selected features; (c) Computational cost breakdown.

Fig 5.

PSO-optimized feature selection pipeline.

Starting with 150 features, ensemble methods generate rankings aggregated via Borda count. PSO with 20 particles optimizes the feature subset iteratively using a sigmoid transfer function for binary selection.
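
The two stages named in this caption, Borda-count rank aggregation and binary PSO with a sigmoid transfer function, can be sketched as follows. This is a toy illustration under stated assumptions: four features instead of 150, three made-up rankings, a hypothetical fitness that rewards high Borda scores and charges a fixed cost per selected feature, and conventional PSO coefficients (inertia 0.7, acceleration 1.5) that are not taken from the paper. Only the swarm size of 20 comes from the caption.

```python
import math
import random

def borda_aggregate(rankings):
    """Each ranking lists feature indices best-first; position p earns n-1-p points."""
    n = len(rankings[0])
    scores = [0] * n
    for ranking in rankings:
        for pos, feat in enumerate(ranking):
            scores[feat] += n - 1 - pos
    return scores

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def binary_pso(fitness, n_features, n_particles=20, n_iter=40, seed=0):
    """Maximize `fitness` over 0/1 masks; velocities map to bit probabilities."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(n_features):
                v = (0.7 * vel[i][j]
                     + 1.5 * rng.random() * (pbest[i][j] - pos[i][j])
                     + 1.5 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-6.0, min(6.0, v))  # clamp to keep sigmoid unsaturated
                # Sigmoid transfer: velocity becomes the probability of selecting bit j
                pos[i][j] = 1 if rng.random() < sigmoid(vel[i][j]) else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Three hypothetical rankers over four features (best first)
rankings = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 2, 1, 3]]
scores = borda_aggregate(rankings)  # aggregated Borda scores per feature

# Hypothetical fitness: Borda score of each selected feature minus a cost of 5
def fitness(mask):
    return sum(scores[j] - 5 for j, bit in enumerate(mask) if bit)

best_mask, best_fit = binary_pso(fitness, n_features=len(scores))
```

In a real pipeline the fitness would typically be a validation metric of a classifier trained on the masked features; the Borda scores could seed the initial swarm rather than define the objective.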

Table 3.

PSO-optimized feature selection results.

Table 4.

Dataset partitioning for experimental evaluation.

Table 5.

Comparison of XAI methods used in FedEmoNet.

Fig 6.

Comparison of XAI methods.

(a) SHAP feature attribution for an anger sample; (b) LIME feature contribution for the same sample; (c) Strong agreement between SHAP and LIME importance values (r = 0.997).
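
The agreement statistic in panel (c) is a Pearson correlation between two importance vectors over the same features. A minimal sketch of that computation is shown below; the two vectors are made-up placeholders, not values from the paper, and the function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder per-feature importance magnitudes (illustration only)
shap_importance = [0.30, 0.22, 0.15, 0.08, 0.05]
lime_importance = [0.28, 0.24, 0.14, 0.09, 0.04]

r = pearson_r(shap_importance, lime_importance)
```

A value of r close to 1 means the two methods rank and scale the features almost identically for that sample; it does not by itself show that either attribution is faithful to the model.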

Fig 7.

Comprehensive explainability analysis.

(a) Emotion-specific feature importance via LIME; (b) Multi-head attention weights; (c) SHAP feature impact distribution; (d) Cross-corpus feature consistency (r = 0.94); (e) Learned emotion embedding space via t-SNE; (f) Temporal attention pattern analysis.

Fig 8.

LIME explanation examples for individual samples.

Red bars indicate negative contributions and green bars indicate positive contributions. Feature indices correspond to PSO-selected features.

Table 6.

Classification performance on EmoDB (107 test samples).

Fig 9.

Numerical confusion matrices.

(a) EmoDB (99.07%, 107 samples): single misclassification Sadness→Neutral; (b) RAVDESS (98.96%, 288 samples): three errors between acoustically similar pairs; (c) CREMA-D cross-corpus (68.15%, 1,488 samples): high-arousal emotions show stronger transfer.
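
The reported accuracies follow directly from the sample and error counts in this caption, which the short check below reproduces. The CREMA-D count of correct predictions is not given in the caption; the value used here is inferred by rounding 68.15% of 1,488 and should be read as an assumption. The function name is hypothetical.

```python
def accuracy_pct(n_total, n_errors):
    """Accuracy in percent, rounded to two decimals."""
    return round(100.0 * (n_total - n_errors) / n_total, 2)

emodb = accuracy_pct(107, 1)    # one misclassification (Sadness -> Neutral)
ravdess = accuracy_pct(288, 3)  # three errors

# Inferred, not stated in the caption: correct count closest to 68.15% of 1,488
cremad_correct = round(0.6815 * 1488)
cremad = round(100.0 * cremad_correct / 1488, 2)
```

The check gives 99.07 for EmoDB and 98.96 for RAVDESS, matching the caption, and 1,014 correct CREMA-D predictions reproduce the stated 68.15.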

Table 7.

Classification performance on RAVDESS (288 test samples).

Fig 10.

Per-emotion cross-corpus performance on CREMA-D.

(a) Detailed metrics per emotion; (b) Arousal-based analysis: high-arousal emotions (71.9%) transfer significantly better than low-arousal (62.1%).

Table 8.

Per-emotion performance on CREMA-D (cross-corpus, 1,488 samples).

Fig 11.

t-SNE visualization of feature distributions across datasets.

(a) Dataset-colored view showing domain shift between EmoDB, RAVDESS, and CREMA-D; (b) Emotion-colored view revealing cross-dataset clustering for high-arousal emotions; (c) Domain shift visualization highlighting CREMA-D relative to training data.

Fig 12.

Reduced training data ablation.

(a) Performance vs. training data fraction showing monotonic improvement, ruling out memorization; (b) Performance degradation quantification.

Table 9.

Statistical validation (10-fold CV).

Fig 13.

Statistical validation.

(a) 10-fold CV comparison; (b) 95% confidence intervals; (c) Paired t-test significance; (d) Distribution across folds; (e) Effect size analysis; (f) ANOVA results (F = 78.45, p < 0.001).
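
Panels (b), (c), and (e) correspond to standard computations on paired per-fold scores. The sketch below shows one way to obtain them for a 10-fold comparison: a paired t statistic, a 95% confidence interval for the mean difference, and a paired-samples Cohen's d. The fold accuracies are synthetic placeholders, not results from the paper, and the function name is hypothetical. The critical value 2.262 is the two-sided 95% Student's t quantile for 9 degrees of freedom, so the helper only accepts exactly 10 pairs.

```python
import math
import statistics

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of Student's t, 9 degrees of freedom

def paired_comparison(a, b):
    """Paired t statistic, 95% CI of the mean difference, and Cohen's d (10 folds)."""
    if len(a) != 10 or len(b) != 10:
        raise ValueError("critical value is hard-coded for 10 folds")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample standard deviation (n - 1)
    se = sd_d / math.sqrt(len(diffs))   # standard error of the mean difference
    t_stat = mean_d / se
    ci = (mean_d - T_CRIT_DF9 * se, mean_d + T_CRIT_DF9 * se)
    cohen_d = mean_d / sd_d             # effect size for paired samples
    return t_stat, ci, cohen_d

# Synthetic per-fold accuracies for two hypothetical models (illustration only)
model_a = [0.90, 0.92, 0.91, 0.93, 0.89, 0.92, 0.90, 0.91, 0.93, 0.92]
model_b = [0.85, 0.88, 0.86, 0.87, 0.84, 0.88, 0.86, 0.85, 0.89, 0.87]

t_stat, ci, cohen_d = paired_comparison(model_a, model_b)
significant = abs(t_stat) > T_CRIT_DF9  # reject equal means at the 5% level
```

Pairing by fold matters here: both models are scored on the same splits, so the test is run on the per-fold differences rather than on two independent samples.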

Fig 14.

Federated learning training dynamics.

(a) Global accuracy convergence; (b) Loss on logarithmic scale; (c) Per-client heterogeneous convergence; (d) FedProx vs FedAvg comparison.

Fig 15.

Detailed FedProx protocol analysis.

(a) FedProx vs FedAvg convergence; (b) Proximal coefficient sensitivity ( optimal); (c) Non-IID distribution across 5 clients; (d) Client model drift; (e) DP accuracy-privacy trade-off; (f) Algorithm specification.

Table 10.

Ablation study results.

Fig 16.

Ablation study visualization for (a) EmoDB and (b) RAVDESS.

PSO feature selection, Transformer blocks, and FedProx provide the largest contributions.

Table 11.

Comparison with state-of-the-art methods.

Fig 17.

Privacy analysis.

(a) DP accuracy-privacy trade-off; (b) Gradient clipping impact; (c) Noise distribution; (d) Membership inference resistance (AUC → 0.52); (e) Communication efficiency (67% reduction).
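
Panels (a) to (c) refer to gradient clipping and noise addition. A minimal sketch of that step is given below, assuming the Gaussian mechanism as used in DP-SGD: each per-example gradient is clipped to an L2 norm bound, the clipped gradients are summed, Gaussian noise scaled to the bound is added, and the result is averaged. The caption does not state which mechanism or parameters the paper uses, so the function names, `max_norm`, and `noise_multiplier` are hypothetical.

```python
import math
import random

def clip_gradient(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def privatize(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, then average (DP-SGD style)."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    sigma = noise_multiplier * max_norm  # noise std is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients

no_noise = privatize(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = privatize(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping bounds how much any single example can move the update, which is what lets the added noise translate into a privacy guarantee; a larger `noise_multiplier` strengthens privacy at the cost of accuracy, the trade-off shown in panel (a).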

Table 12.

Computational efficiency metrics.
