TDGN: A text-guided dual-gated network for multimodal sentiment analysis

doi:10.1371/journal.pone.0349024

Fig 1.

Visualization of the contributions of the text, video, and audio modalities to sentiment analysis.

Per-modality performance (Acc-2) is evaluated on the CMU-MOSI dataset using the approaches MulT, Self-MM, ICCN, and DNMCN.

Fig 2.

Overall architecture of the TDGN.

Different colors represent different modalities: yellow for text modality, purple for audio modality, and green for visual modality.

Fig 3.

The architecture of the proposed Text-Anchored Alignment module.

Text, audio, and video features are first filtered through modality-specific gating networks to highlight semantically relevant information and suppress noise. Subsequently, text features act as queries (Q) in cross-modal attention, while gated audio and video features serve as keys (K) and values (V). This design enforces semantic alignment across modalities under text guidance, yielding noise-robust aligned representations.
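The flow described in this caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the gate form (an element-wise sigmoid mask), the single attention head, the random projection matrices, and all names (`gate`, `text_anchored_attention`, `rand_proj`) are hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def gate(features, w, b):
    # Assumed gate form: a sigmoid mask in (0, 1) rescales every feature.
    return features * sigmoid(features @ w + b)

def text_anchored_attention(text, other, wq, wk, wv):
    # Text supplies the queries; the other modality supplies keys and values.
    q = text @ wq
    k = other @ wk
    v = other @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # one distribution per text token
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8  # hypothetical shared feature size
text = rng.normal(size=(5, d))    # 5 text tokens
audio = rng.normal(size=(12, d))  # 12 audio frames
video = rng.normal(size=(9, d))   # 9 video frames

def rand_proj():
    # Random stand-in for a learned projection matrix.
    return rng.normal(size=(d, d)) * 0.1

gated_text = gate(text, rand_proj(), np.zeros(d))
gated_audio = gate(audio, rand_proj(), np.zeros(d))
gated_video = gate(video, rand_proj(), np.zeros(d))

text_audio, attn_ta = text_anchored_attention(
    gated_text, gated_audio, rand_proj(), rand_proj(), rand_proj())
text_video, attn_tv = text_anchored_attention(
    gated_text, gated_video, rand_proj(), rand_proj(), rand_proj())
```

Because the queries always come from text, both aligned outputs have one row per text token regardless of how many audio or video frames there are, which is one way to read "alignment under text guidance".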

Fig 4.

Architecture of the Dual-layer Gated Fusion Module.

The module integrates intra-modal gating for contrastive feature separation and inter-modal gating for weighted feature integration. The contrastive loss objective enhances the discriminative power of the individual modalities, while the inter-modal module facilitates global information fusion for final classification. A dashed red line indicates the constraint feedback used to regularize the entire fusion process.
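A rough sketch of the two gating levels may help make the caption concrete. It assumes an element-wise sigmoid gate inside each modality and a softmax over three modality scores for the weighted integration; the contrastive loss and the classifier are omitted. The function names (`intra_gate`, `inter_gate_fusion`), the weight shapes, and the random parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def intra_gate(h, w, b):
    # Intra-modal gate (assumed form): filter features within one modality.
    return h * sigmoid(h @ w + b)

def inter_gate_fusion(feats, w_fuse, b_fuse):
    # Inter-modal gate (assumed form): score each modality from the
    # concatenated features, then take a weighted sum of the modalities.
    concat = np.concatenate(feats)               # shape (3 * d,)
    weights = softmax(concat @ w_fuse + b_fuse)  # shape (3,), sums to 1
    fused = np.tensordot(weights, np.stack(feats), axes=1)  # shape (d,)
    return fused, weights

rng = np.random.default_rng(0)
d = 4  # hypothetical feature size
raw = [rng.normal(size=d) for _ in ("text", "audio", "video")]
gated = [intra_gate(h, rng.normal(size=(d, d)) * 0.1, np.zeros(d)) for h in raw]
fused, weights = inter_gate_fusion(
    gated, rng.normal(size=(3 * d, 3)) * 0.1, np.zeros(3))
```

In this sketch the inter-modal weights form a convex combination, so a noisy modality can be down-weighted without being removed entirely.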

Table 1.

Dataset split.

Table 2.

Sensitivity analysis of temperature parameter on the CMU-MOSEI dataset.
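As background for why such a sensitivity analysis matters, the snippet below shows how a temperature typically acts on similarity scores. It assumes the temperature divides the similarities before a softmax, as is common in contrastive objectives; the caption does not state the exact loss, and the similarity values and the name `softmax_with_temperature` are made up for illustration.

```python
import math

def softmax_with_temperature(similarities, tau):
    # Divide by tau before the softmax: a small tau sharpens the distribution.
    scaled = [s / tau for s in similarities]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarities: one positive pair (0.9) and two negatives.
sims = [0.9, 0.5, 0.1]
sharp = softmax_with_temperature(sims, tau=0.1)
smooth = softmax_with_temperature(sims, tau=1.0)
```

With the smaller temperature, nearly all probability mass lands on the most similar pair, so the loss focuses on the hardest negatives; with the larger one, the distribution stays closer to uniform.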

Table 3.

Performance comparison on the CMU-MOSI dataset.

For Acc-2 and F1 score, the value to the left of the “/” is computed under the negative/non-negative split, and the value to the right under the negative/positive split. The best result is highlighted in bold.
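The two Acc-2 conventions can be sketched as follows. The sketch assumes real-valued sentiment scores with 0 as the neutral point, and assumes that the negative/positive variant discards samples whose ground-truth label is exactly 0, as is common practice on this benchmark. The function names and the toy scores are hypothetical.

```python
def acc2_neg_nonneg(preds, labels):
    # Negative vs. non-negative: a score of 0 counts as non-negative.
    correct = sum((p >= 0) == (y >= 0) for p, y in zip(preds, labels))
    return correct / len(labels)

def acc2_neg_pos(preds, labels):
    # Negative vs. positive: samples with a neutral (0) label are dropped.
    pairs = [(p, y) for p, y in zip(preds, labels) if y != 0]
    correct = sum((p > 0) == (y > 0) for p, y in pairs)
    return correct / len(pairs)

# Toy scores, invented for illustration only.
labels = [-2.0, 0.0, 1.5, -0.5, 3.0]
preds = [-1.0, -0.2, 0.8, 0.3, 2.5]

left = acc2_neg_nonneg(preds, labels)   # 0.6
right = acc2_neg_pos(preds, labels)     # 0.75
```

The two numbers differ because the neutral sample is misclassified in the first convention and excluded in the second.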

Table 4.

Performance comparison on the CMU-MOSEI dataset.

For Acc-2 and F1 score, the value to the left of the “/” is computed under the negative/non-negative split, and the value to the right under the negative/positive split. The best result is highlighted in bold.

Table 5.

Ablation results for the TGA module on the CMU-MOSI dataset.

Table 6.

Ablation results for the TGA module on the CMU-MOSEI dataset.

Table 7.

Ablation results for different fusion strategies on the CMU-MOSI dataset.

Table 8.

Ablation results for different fusion strategies on the CMU-MOSEI dataset.

Table 9.

Ablation results for different components on the CMU-MOSI dataset.

Table 10.

Ablation results for different components on the CMU-MOSEI dataset.

Fig 5.

t-SNE visualization of multimodal representations before and after applying the TGA module.

Left: the features are misaligned and distributed irregularly across modalities. Right: after TGA, the features exhibit improved alignment and compact clustering across text, video, and audio modalities.

Fig 6.

Cross-modal attention heatmaps reveal distinct alignment patterns between modalities.

The Text–Video map (left) exhibits a prominent near-diagonal activation pattern, indicating strong fine-grained temporal correspondence between textual tokens and video frames. The Text–Audio map (right) displays sparse, block-structured activations concentrated at a few emotionally salient speech intervals, reflecting selective attention to affectively relevant acoustic segments while suppressing redundant acoustic information. Brighter regions (higher attention weight) indicate stronger cross-modal correspondence. These contrasting patterns demonstrate the efficacy of the text-anchored cross-modal attention mechanism in achieving modality-specific alignment under a unified semantic guidance framework.

Fig 7.

Distribution of gated activation values for three modalities during fusion.

Histograms show Sigmoid gate activations for text, video, and audio, with a threshold of 0.5. Text exhibits a highly concentrated high-activation distribution, indicating near-complete feature preservation. Video shows a moderately dispersed, slightly left-skewed distribution, reflecting selective filtering. Audio presents the broadest and most left-skewed distribution, indicating substantial suppression. These patterns demonstrate modality-aware gating aligned with heterogeneous noise levels.
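A histogram of this kind could be summarized along the following lines. The pre-activation values below are synthetic placeholders rather than data from the paper, and `gate_summary` is a hypothetical helper; only the sigmoid gate and the 0.5 threshold come from the caption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_summary(activations, threshold=0.5, bins=10):
    # Summarize gate activations: mean, share above threshold, histogram counts.
    activations = np.asarray(activations, dtype=float)
    counts, edges = np.histogram(activations, bins=bins, range=(0.0, 1.0))
    return {
        "mean": float(activations.mean()),
        "frac_above": float((activations > threshold).mean()),
        "counts": counts,
        "edges": edges,
    }

# Synthetic pre-activations, for illustration only.
pre_activations = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
summary = gate_summary(sigmoid(pre_activations))
```

A modality whose gates mostly exceed the threshold (a high `frac_above`) is largely preserved, while a low value indicates that much of its signal is suppressed.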
