Fig 1.
Contribution of the text, video, and audio modalities to sentiment analysis.
Each modality's performance (Acc-2) is evaluated on the CMU-MOSI dataset using MulT, Self-MM, ICCN, and DNMCN.
Fig 2.
Overall architecture of the TDGN.
Different colors represent different modalities: yellow for text modality, purple for audio modality, and green for visual modality.
Fig 3.
The architecture of the proposed Text-Anchored Alignment module.
Text, audio, and video features are first filtered through modality-specific gating networks to highlight semantically relevant information and suppress noise. Subsequently, text features act as queries (Q) in cross-modal attention, while gated audio and video features serve as keys (K) and values (V). This design enforces semantic alignment across modalities under text guidance, yielding noise-robust aligned representations.
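The gating-then-attention flow described in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the element-wise sigmoid gate, the omission of learned Q/K/V projections, the single attention head, the feature dimension, and all names (`gate`, `text_anchored_attention`, `params`) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gate(features, w, b):
    """Assumed gate form: scale each feature by a sigmoid value in (0, 1)."""
    return features * sigmoid(features @ w + b)

def text_anchored_attention(text, other):
    """Text supplies the queries; the other modality supplies keys and values.

    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    d_k = text.shape[-1]
    scores = text @ other.T / np.sqrt(d_k)   # (T_text, T_other)
    weights = softmax(scores, axis=-1)       # each text step attends over the other modality
    return weights @ other, weights          # aligned features: (T_text, d)

rng = np.random.default_rng(0)
d = 8                                        # hypothetical shared feature size
text = rng.normal(size=(5, d))               # 5 text tokens
audio = rng.normal(size=(12, d))             # 12 audio frames
video = rng.normal(size=(9, d))              # 9 video frames

# Hypothetical gate parameters, one (weight, bias) pair per modality.
params = {m: (rng.normal(scale=0.1, size=(d, d)), np.zeros(d))
          for m in ("text", "audio", "video")}

text_g = gate(text, *params["text"])
audio_g = gate(audio, *params["audio"])
video_g = gate(video, *params["video"])

audio_aligned, w_ta = text_anchored_attention(text_g, audio_g)
video_aligned, w_tv = text_anchored_attention(text_g, video_g)
```

Because the queries come from text, both aligned outputs have the text sequence length regardless of how many audio or video frames are present.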
Fig 4.
Architecture of the Dual-layer Gated Fusion Module.
The module integrates intra-modal gating for contrastive feature separation and inter-modal gating for weighted feature integration. The contrastive loss objective enhances the discriminative power of the individual modalities, while the inter-modal module facilitates global information fusion for final classification. A dashed red line indicates the constraint feedback used to regularize the entire fusion process.
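As a rough illustration of the two gating levels named in this caption, the sketch below pairs a per-modality sigmoid gate with a softmax-weighted fusion across modalities, plus a temperature-scaled InfoNCE loss. The caption does not specify the exact form of the gates or of the contrastive objective, so InfoNCE, the scoring vector `score_w`, and every function name here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def intra_modal_gate(h, w, b):
    """Assumed intra-modal gate: element-wise sigmoid filtering within one modality."""
    return h * sigmoid(h @ w + b)

def inter_modal_fusion(feats, score_w):
    """Assumed inter-modal gate: softmax weights over modalities, then a weighted sum.

    feats: (M, B, d) stack of M modality embeddings; score_w: (d,) scoring vector.
    """
    scores = feats @ score_w                                # (M, B)
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # sums to 1 over modalities
    fused = (weights[..., None] * feats).sum(axis=0)        # (B, d)
    return fused, weights

def info_nce(a, b, temperature=0.1):
    """InfoNCE loss (a stand-in objective); row i of `a` is paired with row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
B, d = 4, 8                                                 # hypothetical batch and feature sizes
w, b = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
t, a, v = (intra_modal_gate(rng.normal(size=(B, d)), w, b) for _ in range(3))

fused, weights = inter_modal_fusion(np.stack([t, a, v]), rng.normal(size=d))
loss = info_nce(t, a, temperature=0.1)
```

A lower `temperature` sharpens the softmax inside `info_nce`, so mismatched pairs are penalized more heavily.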
Table 1.
Dataset split.
Table 2.
Sensitivity analysis of the temperature parameter on the CMU-MOSEI dataset.
Table 3.
Performance comparison on the CMU-MOSI dataset. For Acc-2 and F1 score, the value to the left of the “/” is computed under the negative/non-negative split, and the value to the right under the negative/positive split. The best result is highlighted in bold.
Table 4.
Performance comparison on the CMU-MOSEI dataset. For Acc-2 and F1 score, the value to the left of the “/” is computed under the negative/non-negative split, and the value to the right under the negative/positive split. The best result is highlighted in bold.
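The two Acc-2 conventions used in Tables 3 and 4 can be made concrete with a short sketch. It assumes continuous sentiment labels where zero is neutral, and it assumes the negative/positive variant drops samples whose ground-truth label is exactly zero, which is the usual practice on these benchmarks but is not stated in the captions. The function name `acc2_both` and the sample values are illustrative only.

```python
import numpy as np

def acc2_both(y_true, y_pred):
    """Return (negative/non-negative accuracy, negative/positive accuracy)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Negative vs. non-negative: every sample is kept, zero counts as non-negative.
    acc_non_neg = np.mean((y_pred >= 0) == (y_true >= 0))

    # Negative vs. positive: samples with a neutral (zero) ground truth are dropped.
    keep = y_true != 0
    acc_pos = np.mean((y_pred[keep] > 0) == (y_true[keep] > 0))

    return float(acc_non_neg), float(acc_pos)

# Made-up labels and predictions, purely to exercise the function.
y_true = [-2.0, -0.4, 0.0, 1.2, 2.6]
y_pred = [-1.5, 0.3, -0.2, 0.8, 1.9]
left, right = acc2_both(y_true, y_pred)   # left = 0.6, right = 0.75
```

The two numbers differ whenever neutral samples are present, which is why both are reported side by side.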
Table 5.
Ablation results for the TGA module on the CMU-MOSI dataset.
Table 6.
Ablation results for the TGA module on the CMU-MOSEI dataset.
Table 7.
Ablation results for different fusion strategies on the CMU-MOSI dataset.
Table 8.
Ablation results for different fusion strategies on the CMU-MOSEI dataset.
Table 9.
Ablation results for different components on the CMU-MOSI dataset.
Table 10.
Ablation results for different components on the CMU-MOSEI dataset.
Fig 5.
t-SNE visualization of multimodal representations before and after applying the TGA module.
Left: before TGA, the features are misaligned and irregularly distributed across modalities. Right: after TGA, the features show improved alignment and compact clustering across the text, video, and audio modalities.
Fig 6.
Cross-modal attention heatmaps reveal distinct alignment patterns between modalities.
The Text–Video map (left) exhibits a prominent near-diagonal activation pattern, indicating strong fine-grained temporal correspondence between textual tokens and video frames. The Text–Audio map (right) displays sparse, block-structured activations concentrated at a few emotionally salient speech intervals, reflecting selective attention to affectively relevant acoustic segments while suppressing redundant acoustic information. Brighter regions (higher attention weight) indicate stronger cross-modal correspondence. These contrasting patterns demonstrate the efficacy of the text-anchored cross-modal attention mechanism in achieving modality-specific alignment under a unified semantic guidance framework.
Fig 7.
Distribution of gated activation values for three modalities during fusion.
Histograms show sigmoid gate activations for text, video, and audio, with a reference threshold of 0.5. Text exhibits a highly concentrated high-activation distribution, indicating near-complete feature preservation. Video shows a moderately dispersed, slightly left-skewed distribution, reflecting selective filtering. Audio presents the broadest and most left-skewed distribution, indicating substantial suppression. These patterns demonstrate modality-aware gating aligned with heterogeneous noise levels.
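The distributional properties named in this caption (concentration above the 0.5 threshold, spread, and skew) could be quantified along the following lines. This is only a sketch: the helper `gate_summary` is hypothetical, and the tiny input array is a placeholder rather than data from the paper.

```python
import numpy as np

def gate_summary(activations, threshold=0.5):
    """Summarize sigmoid gate activations for one modality."""
    a = np.asarray(activations, dtype=float).ravel()
    mean = a.mean()
    std = a.std()
    # Population skewness: a negative value means a longer tail toward low activations.
    skew = float(np.mean((a - mean) ** 3) / std ** 3) if std > 0 else 0.0
    return {
        "frac_above": float(np.mean(a > threshold)),  # share of gates that mostly pass features
        "mean": float(mean),
        "std": float(std),
        "skew": skew,
    }

# Placeholder activations, not taken from the paper.
summary = gate_summary([0.1, 0.9, 0.9, 0.9])
```

Applied per modality, a high `frac_above` with a small `std` would correspond to the pattern described for text, and a larger `std` with more negative `skew` to the pattern described for audio.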