Fig 1.
The proposed framework includes feature embedding and fusion layers, an adaptive time-frequency fusion Transformer encoder and decoder with a deformable attention mechanism, and a medical prior-guided attention module.
The model achieves time-frequency complementarity and semantic alignment through gated connections and residual fusion, thereby improving acoustic interpretability while maintaining discriminative performance.
Fig 2.
Schematic diagram of the Adaptive Time-Frequency Fusion Transformer architecture.
The ATF-Transformer encoder on the left applies multimodal feature fusion and a time-frequency attention mechanism to jointly model time- and frequency-domain features. The decoder on the right uses learnable weights and gating mechanisms to perform feature reconstruction and semantic alignment, yielding an interpretable time-frequency fusion representation.
Fig 3.
Schematic diagram of the Medical Guided Interpretable Attention Map (MGIAM) structure.
The module inserts medical feature channels between Transformer blocks and dynamically couples medical semantics with attention through layer normalization, feature allocation, and interpretable feature generation, so that both the forward and backward passes retain medical interpretability and structural consistency.
Fig 4.
Mel spectrograms of representative examples from the three datasets.
Table 1.
Experimental configuration.
Table 2.
Performance comparison of different models on Dataset 1 for pertussis sound recognition (Mean ± Std).
Table 3.
Performance comparison of different models on Dataset 2 for pertussis sound recognition (Mean ± Std).
Table 4.
Performance comparison of different models on Dataset 3 for pertussis sound recognition (Mean ± Std).
Table 5.
Ablation study of ATF and MGIAM modules across three datasets for pertussis sound recognition (Mean ± Std).
Fig 5.
Comparison of SHAP feature importance maps with and without the MGIAM module.
Fig 6.
Confusion matrix comparison between the proposed algorithm and the baseline Transformer.
Fig 7.
t-SNE visualization of the proposed algorithm's results on the test set.
Fig 8.
Comparison of experimental results between the proposed algorithm and the Transformer baseline.
Fig 9.
Dependency graph comparison between the Transformer baseline and the proposed algorithm.
Table 6.
Model stability analysis under different noise intensities across three datasets (Mean ± Std).
Table 7.
Computational cost comparison between the Transformer baseline and the proposed model.