FedEmoNet: Privacy-preserving federated learning with TCN-Transformer fusion for cross-corpus speech emotion recognition
Fig 3
FedEmoNet local model architecture.
The model has three parallel branches: (1) a CNN Branch that processes spectral features through Conv2D layers with ReLU, batch normalization, MaxPool, and AdaptiveAvgPool; (2) a TCN Branch that processes PSR features at three scales through dilated causal convolutions with dilation rates d = 1, 2, 4; and (3) a Dense Branch that processes handcrafted features. The embeddings from all three branches are fused via Multi-Head Attention and passed through N = 6 Transformer encoder blocks. The Classification Head combines Max Pool, Mean Pool, and Last State representations of the encoder output.
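To make two of the operations named in the caption concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It illustrates (a) a single dilated causal convolution, as used in the TCN Branch, and (b) one plausible reading of how the Classification Head "combines" its three representations. The function names `causal_dilated_conv1d` and `pooled_representation`, the kernel size of 3, the zero padding on the left, and the use of concatenation are all assumptions made for illustration; the caption does not specify them.

```python
from typing import List, Sequence


def causal_dilated_conv1d(
    x: Sequence[float], weights: Sequence[float], dilation: int
) -> List[float]:
    """Hypothetical helper: y[t] = sum_k weights[k] * x[t - k * dilation].

    Taps that would fall before the start of the sequence are treated as
    zero (assumed left zero padding), so y[t] never depends on x[t + 1:].
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: only the current and past samples
                acc += w * x[idx]
        out.append(acc)
    return out


def pooled_representation(states: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical helper: concatenate max pool, mean pool, and last state.

    `states` is a time-ordered list of equal-length feature vectors.
    Concatenation is an assumption; the caption only says "combines".
    """
    dims = len(states[0])
    max_pool = [max(s[j] for s in states) for j in range(dims)]
    mean_pool = [sum(s[j] for s in states) / len(states) for j in range(dims)]
    last_state = list(states[-1])
    return max_pool + mean_pool + last_state


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    kernel = [1.0, 1.0, 1.0]  # assumed kernel size of 3
    for d in (1, 2, 4):  # dilation rates from the caption
        print(d, causal_dilated_conv1d(impulse, kernel, d))
    print(pooled_representation([[1.0, 4.0], [3.0, 2.0]]))
```

With a unit impulse as input, each output shows where the kernel taps land: a larger dilation spreads the same three taps over a longer span of past samples, which is how the branch can cover several temporal scales without enlarging the kernel.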