Abstract
Sentiment analysis methods aim to evaluate users’ mental health conditions by analyzing their posted content (text, images, and audio) on social networks. However, given the diversity and complexity of social media information, traditional single-modal sentiment analysis techniques exhibit limitations in accurately interpreting users’ emotional states and may even lead to contradictory conclusions. To address this challenge, this paper proposes a Feature Fusion Based Transformer (FFBT) solution. The framework consists of three key steps: Firstly, RoBERTa and ResNet50 models are employed to extract features from textual and image data in social media posts, respectively. Then, a multimodal Transformer architecture facilitates feature alignment and fusion across different modalities. Finally, the fused features are fed into a fully connected network (FCN) for sentiment classification, ultimately determining the user’s emotional state. Experiments conducted on a custom dataset constructed from social media platform data demonstrate that FFBT outperforms existing sentiment analysis algorithms by 4.1% in accuracy and 5% in F1-score, respectively.
Citation: Li S, Li H, Du J, Yan S, Dong C (2025) Feature fusion based transformer for sentiment analysis in social networks. PLoS One 20(11): e0333416. https://doi.org/10.1371/journal.pone.0333416
Editor: Issa Atoum, Philadelphia University, JORDAN
Received: April 27, 2025; Accepted: September 12, 2025; Published: November 7, 2025
Copyright: © 2025 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: This work was partly supported by the Key Scientific and Technological Research Projects in Henan Province [Grant No. 252102321074] to Shiyong Li, the National Natural Science Foundation of China [grant No. 62002180] to He Li, the Training Plan for Young Backbone Teachers in Higher Education Institutions in Henan Province [grant No. 2023GGJS120] to He Li, the Key Scientific Research Projects of Colleges and Universities in Henan Province [grant No. 24A520030] to He Li. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
In the digital era, the accelerated pace of modern life has emerged as a pivotal factor driving the dramatic escalation of psychological stress, consequently elevating the incidence of mental health disorders such as depression. According to authoritative data from The 2023 Blue Book of China’s Mental Health (Fig 1), adolescents’ mental health status in China exhibits significant variations across educational stages: The basic education phase demonstrates a progressive increase in depression detection rates (10% in primary, 30% in junior high, and 40% in senior high schools), while university students display 16.54% prevalence of mild depression and 4.94% for severe cases. Notably, anxiety disorders present even more prominent risks, with 38.26% mild, 4.65% moderate, and 2.37% severe manifestations. Meanwhile, driven by traffic incentives, social media platforms systematically amplify anxiety-provoking content through algorithmic mechanisms. This “emotional amplification effect” has transformed these platforms into primary channels for emotional catharsis, where users increasingly express their feelings through various modalities, including textual posts, visual imagery, and short videos. Under this context, utilizing emotional expression data from social platforms to assess students’ mental health status or affective states, thereby enabling personalized psychological counseling and education, has emerged as a critical research topic. In recent years, researchers have employed diverse intelligent algorithms for personalized psychological education [1], online depression screening [2], and opinion mining on social platforms [3], assisting professionals in delivering more effective mental-health prevention and counseling.
Yu et al. [4] designed a label generation module based on a self-supervised learning strategy to automatically obtain independent unimodal supervision signals. Subsequently, by jointly training multimodal and unimodal tasks, they learned effective representations for multimodal scenarios. Yao et al. [5] argued that human language is inherently multimodal, integrating natural language, facial movements, and acoustic behaviors. They proposed a Multimodal Transformer (MulT) for sentiment analysis, which captures interactions between multimodal sequences at different time steps and implicitly aligns one modality’s sequence to another. Hazarika et al. [6] introduced the MISA framework for sentiment analysis, mapping each modality into two subspaces: a modality-invariant subspace and a modality-specific subspace. These representations together provide a comprehensive view of multimodal data and are fused for sentiment prediction. Zadeh et al. [7] modeled multimodal sentiment analysis as a joint task of intra-modal dynamics and inter-modal dynamics, proposing a novel Tensor Fusion Network (TFN) for this purpose.
These algorithms have achieved notable results and advanced sentiment analysis, but they also exhibit certain limitations. On one hand, traditional unimodal feature extraction techniques (e.g., Fine-tuning CNNs [8], LCA-BERT [9]) struggle to comprehensively capture deep semantic features in text or effectively extract fine-grained visual features in images. On the other hand, traditional multimodal fusion techniques (e.g., TFN [7], MulT [5]) directly fuse full-modal features, leading to suboptimal performance on cross-modal datasets. Multimodal sentiment analysis faces two major challenges:
- Unimodal encoders fail to extract sufficiently fine-grained features, limiting their discriminative power.
- Inadequate cross-modal feature alignment, which can lead to polarity reversal (e.g., conflicting sentiment signals between text and images shown in Table 1).
To address the aforementioned challenges, this paper proposes a Feature Fusion-Based Transformer (FFBT) for multimodal sentiment analysis. The approach first encodes textual inputs with RoBERTa [10] and extracts image features via ResNet50 [11]. Next, cross-modal multi-head attention aligns the two modalities, after which the aligned features are fed into a multimodal Transformer encoder to intensify inter-modal interactions and achieve deep feature fusion. Finally, the fused representations are passed through fully connected layers to accurately predict the sentiment category of the multimodal content. The key contributions of the proposed FFBT method are summarized below:
- High-quality unimodal encoders: RoBERTa is employed as the text encoder, leveraging its bidirectional Transformer and dynamic masking to fully capture contextual semantics; ResNet50 serves as the image encoder, extracting 2048-dimensional fine-grained image features via residual blocks and global average pooling.
- Cross-modal feature alignment: After unimodal extraction, a cross-modal Transformer encoder [5] explicitly aligns text and image features, enabling the model to capture intricate interactions such as “sarcastic emoji vs. positive text.” A cosine-embedding contrastive loss is further employed to pull together matched image–text pairs and push apart mismatched ones in the shared space, refining alignment and sharpening the model’s discrimination of fine-grained emotional nuances.
Leveraging data from social-network platforms, this paper constructs a multimodal dataset that contains both text and images, named Social Data in Text and Image (SDTI). Experimental results demonstrate that the proposed method achieves higher accuracy (+4.1%) and F1-score (+5%) compared to baseline algorithms.
The remainder of this paper is organized as follows: Sect 2 reviews related work on multimodal sentiment analysis; Sect 3 outlines the overall pipeline of the FFBT approach; Sect 4 details the design specifics and enhancements of FFBT; Sect 5 presents the experiments conducted on SDTI together with their results; and Sect 6 concludes the paper and discusses future directions.
2 Related works
This section briefly surveys the work of the predecessors in sentiment analysis, covering both single-modal and multimodal approaches.
2.1 Single-modal sentiment analysis
Chen et al. [12] proposed a simple framework for contrastive learning of visual representations, advancing computer vision by enabling models to learn robust features without labeled data through positive and negative sample training.
Tran et al. [13] examined spatiotemporal convolutions in action recognition, showing they effectively capture video motion information and improve recognition accuracy.
Carreira and Zisserman [14] introduced a two-stream model for action recognition and the Kinetics dataset, achieving top performance by integrating spatial and temporal information.
Kugate et al. [15] developed an efficient method for video key frame extraction using CNNs and clustering, identifying representative frames for applications like video summarization.
Fan et al. [16] presented a hybrid CNN-RNN model for video emotion recognition, capturing spatial and temporal features to achieve state-of-the-art performance.
Pang et al. [17] used dynamic graph convolutional networks for aspect-based sentiment analysis, updating graph structures to capture aspect relationships better and improve classification accuracy.
Li et al. [18] employed dual graph convolutional networks to model aspect interconnections in sentiment analysis, achieving top results through enhanced sentiment categorization precision.
Although the aforementioned studies have advanced emotion recognition, single-modal approaches remain confined within their respective modalities and thus cannot fully convey the richness of human affect. In reality, users on social media rarely rely on a single data type when posting content. Consequently, this paper adopts a multimodal framework for emotion recognition.
2.2 Basic multimodal fusion for sentiment analysis
As shown in Table 1, different images paired with the same text data may exhibit entirely different emotional tendencies. Sometimes, a purely narrative text cannot be accurately judged for sentiment, but when paired with a “joyful” emoji, the sentiment tendency of the text + image data can be easily determined. Therefore, single-modal data is not sufficient to make an accurate sentiment judgment, and sentiment analysis must be conducted on multimodal data to achieve an ideal state. Driven by the demand for more accurate emotion understanding, researchers have developed many general-purpose multimodal datasets and baseline models.
Busso et al. [19] introduced the IEMOCAP dataset, which includes multimodal data like audio, video, and text from emotionally charged actor interactions, providing detailed emotion annotations for model training and evaluation.
Poria et al. [20] presented the MELD dataset, designed for conversational emotion recognition, containing multimodal data from multi-party conversations with detailed emotional state annotations, effectively training advanced emotion recognition models.
Abdullah et al. [21] explored deep learning for multimodal emotion recognition, developing a model integrating text, audio, and video features for diverse contexts, matching existing approaches’ performance.
Building upon these baseline models, researchers have further explored a variety of general-purpose multimodal fusion strategies.
Gupta et al. [22] proposed Visatronic, a multimodal speech synthesis model combining text and audio features with a decoder-only architecture, significantly improving speech synthesis quality across datasets.
Mai et al. [23] used hybrid contrastive learning for multimodal sentiment analysis, combining text, audio, and video features to enhance sentiment classification accuracy through modality alignment.
Zadeh et al. [7] developed the Tensor Fusion Network (TFN), using a tensor fusion layer to merge text, audio, and video features, outperforming existing methods in multimodal and unimodal sentiment analysis.
Liu et al. [24] introduced an efficient multimodal fusion approach using modality-specific factors, reducing parameters while maintaining high accuracy in sentiment analysis.
Kakuba et al. [25] extend conventional speech-based emotion recognition by proposing a deep-learning multi-learning framework termed DBMER. Integrating CNNs, RNNs, and multi-head attention, DBMER significantly outperforms traditional multimodal methods in both accuracy and robustness.
2.3 Advanced multimodal fusion for sentiment analysis
In recent years, to more precisely determine the emotional categories expressed by diverse modalities, researchers have shifted toward context- and causality-enhanced fusion strategies and have achieved notable progress.
Poria et al. [26] focused on context-dependent sentiment analysis in user-generated videos, proposing a model considering video content context to improve sentiment classification accuracy.
Xing et al. [27] proposed an adapted dynamic memory network for conversational emotion recognition, dynamically updating memory based on conversation context to better capture emotional states.
Hazarika et al. [28] introduced ICON, an interactive conversational memory network for multimodal emotion detection, using a memory network to grasp conversation context and enhance emotion recognition precision.
Xiao et al. [29] proposed Atlantis, an aesthetic-oriented multi-granularity fusion network for joint multimodal aspect-based sentiment analysis, combining text, image, and aesthetic features for state-of-the-art performance.
Yue et al. [30] developed Knowlenet, a knowledge fusion network for multimodal sarcasm detection, combining text, image, and knowledge graph features to significantly improve sarcasm detection accuracy.
Other researchers have conducted more detailed studies from perspectives such as graphs, hypergraphs, counterfactual interventions, and polarity, proposing more fine-grained multimodal fusion methods.
Since cross-modal attention tends to overlook modality-specific cues when distinguishing similar samples, Huang et al. [31] devise a cross-sample fusion strategy: features from different samples are combined, and adversarial training coupled with pairwise-prediction tasks is employed to preserve fine-grained modality-specific information.
Traditional multimodal models often neglect causal relations between modalities, giving rise to spurious correlations and underperforming cross-modal attention. To overcome these limitations, Huang et al. [32] propose the Attention-based Causality-Aware Fusion (AtCAF) network. First, the Causality-Aware Text Debiasing Module (CATDM) is introduced with a front-door adjustment to capture debiased textual causal representations. Second, a Counterfactual Cross-modal Attention (CCoAt) module is employed to inject causal information during fusion, yielding aggregated representations that are causally consistent.
Chen et al. [33] tackle the optimization imbalance inherent in multimodal learning by introducing the Adaptive Gradient-Scaling & Sparse Mixture-of-Experts model (AGS-SMoE). The framework devises a gradient-scaling policy that equalizes the training dynamics across heterogeneous encoders. In addition, it integrates a Sparse Mixture-of-Experts (SMoE) mechanism that sparsely perceives and processes multimodal tokens according to their current contention states, thereby mitigating redundant computation and enhancing efficiency.
Li et al. [34] argue that conventional multimodal retrieval systems fall short of educational demands for instructional slide retrieval. They propose EduCross, a framework that introduces Dual-Channel Adversarial Bipartite Hypergraph Learning, which combines a generative adversarial network with text–image dual channels to achieve precise bidirectional graph–text mapping.
Ramkrishnan et al. [35] extracted seven key acoustic features from audio datasets and evaluated each feature with both traditional classifiers and deep-learning models (LSTM and CNN). They found that the importance of individual features varies across models, underscoring the necessity of feature optimization in speech-based emotion recognition.
Huang et al. [36] propose a novel approach named Heterogeneous Hypergraph Attention Network with Counterfactual Learning (H2CAN). They construct a heterogeneous hypergraph based on emotional-expression features and mitigate bias through a Counterfactual Intervention Task. Additionally, single-modal labels are leveraged to let the model adaptively identify which modality carries bias, further enhancing its ability to handle biased information.
To counteract the text-bias that distorts affective representations in multimodal conversational emotion recognition, Li et al. [37] introduce FrameERC, a new framework that captures crucial emotional cues overlooked by conventional spatial message-passing GNNs. FrameERC amplifies the effective contribution of non-textual modalities in Emotion Recognition in Conversations (ERC), ensuring a more holistic semantic understanding.
Shi et al. [38] propose a novel dialogue-based multimodal emotion-recognition framework that learns a unified semantic embedding from multimodal data, mitigating text-overreliance bias without a marked increase in parameters. A modality-aware extraction module further captures both generic and sensitive information in parallel within the shared multimodal semantic space.
Huang et al. [39] observe that, given a fixed emotional polarity, the perceived emotional intensity increases monotonically with the expressive amplitude. Building on this insight, they propose a Polarity-Aware Mixture-of-Experts network for Multimodal Sentiment Analysis (PAMoE-MSA) that jointly learns polarity-specific and polarity-common representations to explicitly capture the monotonic relationship underlying multimodal emotional expressions.
Compared with the aforementioned previous works, the FFBT method primarily contributes in the following aspects:
- Enhanced Context and Cross-modal Interaction: By employing a multimodal Transformer encoder, it better captures the correlations and contextual relationships between text and images, thereby strengthening the interaction across different modalities.
- Improved Robustness to Noisy Data: Leveraging the inherent robustness of RoBERTa and ResNet50, it effectively handles diverse data types and noise.
- High Scenario Adaptability: Featuring a flexible design paradigm, the framework serves as a versatile foundation for model architecture, demonstrating broad applicability across multiple scenarios.
3 Overview
As illustrated in Fig 2, the proposed architecture comprises three sequential stages.
During the feature extraction stage, the text is first converted into word embedding vectors using subword tokenizers, and then these word embeddings are fed into the RoBERTa model to extract the text’s feature vectors. The image is preprocessed to generate a tensor, which is then passed through the ResNet50 model to obtain the image’s feature vectors.
In the feature alignment stage, the text and image feature vectors first generate their respective projected vectors. Then, through a cross-modal attention mechanism, the text and image features are aligned, producing cross-attention vectors for both text and image. The projected vectors and cross-attention vectors are then concatenated to form the aligned feature vectors for both text and image.
During the feature fusion stage, the aligned feature vectors of the text and image are each passed through a fully connected neural network classifier to obtain their respective probability vectors. The probability vectors of the text and image are combined and then processed through a Softmax function to compute the final probability vector. Finally, based on the maximum value in the probability vector, the model determines the sentiment category to which the given text and image belong.
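To make the three-stage pipeline concrete, the PyTorch-style sketch below outlines one possible forward pass under the description above; module names such as text_encoder, image_encoder, and cross_modal_encoder are illustrative placeholders rather than the authors’ released implementation, and the dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFBTSketch(nn.Module):
    """Illustrative three-stage pipeline: extraction -> alignment -> fusion."""
    def __init__(self, text_encoder, image_encoder, cross_modal_encoder,
                 d_model=768, num_classes=3, lam=0.5):
        super().__init__()
        self.text_encoder = text_encoder        # e.g., a RoBERTa wrapper returning (B, 768)
        self.image_encoder = image_encoder      # e.g., a ResNet50 wrapper returning (B, 2048)
        self.cross_modal = cross_modal_encoder  # multimodal Transformer encoder
        self.text_proj = nn.Linear(768, d_model)   # projected text vector
        self.img_proj = nn.Linear(2048, d_model)   # projected image vector
        self.text_head = nn.Linear(2 * d_model, num_classes)
        self.img_head = nn.Linear(2 * d_model, num_classes)
        self.lam = lam                           # weighting factor lambda

    def forward(self, text_inputs, image_tensor):
        # Stage 1: unimodal feature extraction
        f_t = self.text_encoder(text_inputs)     # textual feature
        f_i = self.image_encoder(image_tensor)   # image feature
        # Stage 2: projection and cross-modal alignment
        p_t, p_i = self.text_proj(f_t), self.img_proj(f_i)
        c_t, c_i = self.cross_modal(p_t, p_i)    # cross-attention vectors
        h_t = torch.cat([p_t, c_t], dim=-1)      # aligned text feature
        h_i = torch.cat([p_i, c_i], dim=-1)      # aligned image feature
        # Stage 3: per-modality probabilities, weighted combination, final Softmax
        prob_t = F.softmax(self.text_head(h_t), dim=-1)
        prob_i = F.softmax(self.img_head(h_i), dim=-1)
        return F.softmax(self.lam * prob_t + (1 - self.lam) * prob_i, dim=-1)
```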
4 Design details
4.1 Feature extraction
1) Compared with image data, text data plays a more crucial role in sentiment classification tasks, as it carries the majority of emotional tendencies when people express feelings. This paper employs the RoBERTa model for text feature extraction, and its overall architectural diagram is illustrated in Fig 3.
When text data is fed into RoBERTa, the model encodes it into a textual feature vector $F_T$.

For a given input text, meaningless symbols are first removed. The cleaned text is then tokenized into a token sequence using a subword tokenizer such as Byte-Pair Encoding. Special tokens are inserted: “CLS” (or “<s>”) is prepended to the sequence, and “SEP” (or “</s>”) is appended at the end of each sentence:

$$S = (s_1, s_2, \ldots, s_L)$$

where $S$ denotes the token sequence after adding special tokens, each $s_j \in \{1, \ldots, V\}$, $V$ is the vocabulary size, and $L$ is the length of the token sequence.

After the token sequence is fed into the embedding layer, every token is converted to a Token Embedding. A Position Embedding is then added (without the Segment Embedding used in BERT) to yield the final text input embedding $X^{(0)}$:

$$X^{(0)} = E_{tok}[S] + E_{pos}[P]$$

where $E_{tok} \in \mathbb{R}^{V \times d}$ is the token-embedding matrix with vocabulary size $V$ and embedding dimension $d$ (e.g., 768); $E_{pos}$ is the position-embedding matrix; $P = (1, 2, \ldots, L)$ is the position-index sequence; and $E_{tok}[S]$ and $E_{pos}[P]$ denote the token and positional encodings, respectively.
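As a concrete illustration of this tokenization-and-embedding step, the sketch below uses the Hugging Face transformers library; the example sentence and the roberta-base checkpoint are assumptions of this sketch rather than details specified in the paper.

```python
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Byte-Pair Encoding tokenization; <s> and </s> play the roles of "CLS" and "SEP".
text = "Feeling great after today's exam!"          # hypothetical social-media post
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))   # ['<s>', ..., '</s>']

# The embedding layer sums token and position embeddings (no segment embedding).
with torch.no_grad():
    x0 = model.embeddings(input_ids=enc["input_ids"])          # X^(0): shape (1, L, 768)
print(x0.shape)
```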
Finally, $X^{(0)}$ is fed into the Transformer encoder of RoBERTa. The encoder consists of $N$ identical layers (usually $N = 12$) stacked on top of each other. Each layer comprises two sub-layers: the first is a multi-head self-attention layer, and the second is a feed-forward neural network. Residual connections are applied around every sub-layer. The core attention mechanism of the Transformer involves three key input matrices, $Q$, $K$, and $V$, which are obtained by multiplying the input $X^{(l-1)}$ with three distinct linear projection matrices $W_i^Q$, $W_i^K$, and $W_i^V$, respectively:

$$Q_i = X^{(l-1)} W_i^Q, \quad K_i = X^{(l-1)} W_i^K, \quad V_i = X^{(l-1)} W_i^V, \quad i = 1, \ldots, m$$

By measuring the relevance between $Q$ and $K$ with dot products, we obtain an attention-score matrix. A scaling factor $\sqrt{d_t}$ is then applied to stabilize the variance of these scores, after which Softmax normalizes them into final attention weights. These weights are multiplied by $V$ to produce the updated semantic representation:

$$\mathrm{head}_i = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_t}}\right) V_i$$

where $m$ denotes the number of attention heads in the multi-head attention and $d_t$ is the per-head dimension.

The semantic representations produced by all $m$ attention heads are concatenated, multiplied by an output projection matrix, and then passed through a residual connection followed by layer normalization to yield the self-attention output at layer $l$, denoted $A^{(l)}$:

$$A^{(l)} = \mathrm{LayerNorm}\!\left(X^{(l-1)} + \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_m\right) W^O\right)$$

where $W^O$ is the output projection matrix, and $X^{(l-1)}$ denotes the input to the current layer, i.e., the output of the preceding layer; specifically, $X^{(0)}$ is the text input embedding defined above.

Next, $A^{(l)}$ is fed into the feed-forward network; after a linear transformation, it yields the feed-forward network’s output $F^{(l)}$. A residual connection followed by layer normalization is then applied, producing the final output of the $l$-th Transformer encoder layer, $X^{(l)}$:

$$F^{(l)} = \mathrm{GELU}\!\left(A^{(l)} W_1 + b_1\right) W_2 + b_2$$

$$X^{(l)} = \mathrm{LayerNorm}\!\left(A^{(l)} + F^{(l)}\right)$$

where $W_1$ and $W_2$ are the weight matrices of the feed-forward network’s linear transformations; $b_1$ and $b_2$ are the corresponding bias terms; GELU denotes the activation function; and LayerNorm is the layer-normalization function.

After processing through all $N$ Transformer encoder layers, the final output $X^{(N)}$ is obtained. Averaging the vectors across all positions yields the ultimate textual feature vector $F_T$:

$$F_T = \frac{1}{L} \sum_{j=1}^{L} X^{(N)}_{j}$$
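To make the per-layer computation traceable, the following minimal PyTorch sketch implements one encoder layer with multi-head self-attention, the GELU feed-forward network, and the residual-plus-LayerNorm steps described above; the dimensions (d = 768, m = 12 heads) are illustrative defaults, not values prescribed by the paper.

```python
import math
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """One Transformer encoder layer: multi-head self-attention + feed-forward network."""
    def __init__(self, d=768, m=12, d_ff=3072):
        super().__init__()
        assert d % m == 0
        self.m, self.d_head = m, d // m
        self.w_q, self.w_k, self.w_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.w_o = nn.Linear(d, d)                      # output projection W^O
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):                               # x: (B, L, d)
        B, L, d = x.shape
        def split(t):                                   # (B, L, d) -> (B, m, L, d_head)
            return t.view(B, L, self.m, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # scaled dot products
        heads = torch.softmax(scores, dim=-1) @ v       # per-head semantic representations
        heads = heads.transpose(1, 2).reshape(B, L, d)  # concatenate the m heads
        a = self.ln1(x + self.w_o(heads))               # residual connection + LayerNorm
        return self.ln2(a + self.ffn(a))                # residual connection + LayerNorm

x = torch.randn(2, 16, 768)
print(EncoderLayerSketch()(x).shape)  # torch.Size([2, 16, 768])
```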
2) In sentiment classification tasks, image data provides complementary information to textual data, enabling more precise sentiment analysis. The cross-modal interaction between image and textual data establishes a foundation for subsequent feature fusion models to capture inter-modal semantic relationships. For image feature extraction, we employ the ResNet50 model, whose overall architecture is depicted in Fig 4.
When an image is fed into ResNet50, the model processes it and outputs the image feature vector $F_I$.

For each image, we first perform preprocessing: resize it to 224 × 224, center-crop, and normalize it, yielding the image input tensor $X$.

The image then undergoes an initial convolution and pooling, producing the ResNet50 input tensor $Z$:

$$Z = \mathrm{MaxPool}\!\left(\mathrm{ReLU}\!\left(W_0 * X + b_0\right)\right)$$

where $W_0$ is the convolution kernel matrix with stride 2 and padding 3, and $b_0$ is the corresponding bias vector.

Next, $Z$ is processed through ResNet50’s four residual-block groups, yielding the final feature map $Z_4$:

$$Z_i = \mathcal{B}_i(Z_{i-1}), \quad i = 1, 2, 3, 4, \quad Z_0 = Z$$

where $\mathcal{B}_i$ denotes the $i$-th residual-block group in ResNet50. Each group consists of three consecutive convolution layers with kernel sizes 1 × 1, 3 × 3, and 1 × 1. For any such group, the input $Z_{in}$ and output $Z_{out}$ satisfy Eqs 17–20 as follows:

$$A = \mathrm{ReLU}\!\left(W_{1\times1}\, Z_{in}\right)$$

$$B = \mathrm{ReLU}\!\left(W_{3\times3}\, A\right)$$

$$C = W'_{1\times1}\, B$$

$$Z_{out} = \mathrm{ReLU}\!\left(Z_{in} + C\right)$$

where $W_{1\times1}$, $W'_{1\times1}$, and $W_{3\times3}$ denote the weight matrices of the 1 × 1 and 3 × 3 convolution layers, respectively; $A$, $B$, and $C$ represent transitional variables.

Finally, global average pooling is applied to $Z_4$ to obtain the image feature vector $F_I$:

$$F_I = \mathrm{GAP}(Z_4)$$

where $\mathrm{GAP}$ denotes global average pooling.
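In practice this pipeline can be realized with torchvision’s pretrained ResNet50, as sketched below; dropping the final classification layer exposes the 2048-dimensional globally pooled feature, and the ImageNet normalization constants and the dummy input image are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Standard preprocessing: resize, center-crop to 224 x 224, normalize (ImageNet statistics).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = nn.Identity()           # drop the classifier; keep the 2048-d pooled feature
resnet.eval()

img = Image.new("RGB", (640, 480))  # dummy image standing in for a social-media photo
x = preprocess(img).unsqueeze(0)    # image input tensor X: (1, 3, 224, 224)
with torch.no_grad():
    f_i = resnet(x)                 # image feature vector: (1, 2048)
print(f_i.shape)
```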
4.2 Feature alignment and fusion
To narrow the gap between modalities and enhance their interaction, we introduce a cross-modal feature-alignment layer, in which the cross-modal component [40] employs a multimodal Transformer encoder to mediate inter-modal communication.
The extracted textual feature vector $F_T$ and the extracted image feature vector $F_I$ are fed into a multimodal Transformer encoder, yielding the cross-modal-enhanced representations for text and image. The encoder is composed of $N_L$ identical layers stacked one after another, where each layer contains two sub-layers: a multi-head cross-modal attention layer and a position-wise feed-forward network. Residual connections surround every sub-layer. Unlike the self-attention described earlier, in cross-modal attention the query matrix $Q$ is derived from the current modality $M_1$, whereas the key matrix $K$ and value matrix $V$ are derived from the other modality $M_2$. Except for this cross-modal attention module, the remaining computations of the multimodal Transformer encoder are the same as Eqs 9–11. The cross-modal attention layer is formulated as:

$$\mathrm{CM}_{M_2 \to M_1} = \mathrm{Softmax}\!\left(\frac{Q_{M_1} K_{M_2}^{\top}}{\sqrt{d_t}}\right) V_{M_2}$$

The cross-modal enhanced representation obtained via the cross-modal attention mechanism cannot be used directly for classification; it must first be converted into a fixed-size feature embedding through a pooling operation [41], as shown in the following equation:

$$C_T = \mathrm{Pool}\!\left(\mathrm{CM}_{I \to T}\right), \qquad C_I = \mathrm{Pool}\!\left(\mathrm{CM}_{T \to I}\right)$$

The projected representations $P_T$ and $P_I$ and the cross-modal enhanced embeddings $C_T$ and $C_I$ are concatenated to yield the final fused features for text and image, respectively, as shown in the following equation:

$$H_T = \left[P_T; C_T\right], \qquad H_I = \left[P_I; C_I\right]$$

Finally, each fused feature $H$ is passed through a fully-connected layer to produce separate probability vectors $p_T$ and $p_I$; these two vectors are then summed with a weighting factor to obtain the final probability vector $P$, as shown in the following equations:

$$p_T = \mathrm{Softmax}\!\left(W_T H_T + b_T\right), \qquad p_I = \mathrm{Softmax}\!\left(W_I H_I + b_I\right)$$

$$P = \mathrm{Softmax}\!\left(\lambda\, p_T + (1 - \lambda)\, p_I\right)$$

where $W$ and $b$ are the weight matrix and bias matrix of the corresponding linear transformation, respectively; $C$ denotes the number of emotion classes; and $\lambda$ controls the weighting factor when summing the text and image probability vectors.
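The alignment-and-fusion stage can be sketched with PyTorch’s nn.MultiheadAttention, letting queries come from one modality and keys/values from the other, then pooling, concatenating, and combining the two probability vectors with the weighting factor λ; the dimensions, head count, and λ value are illustrative assumptions of this sketch, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusionSketch(nn.Module):
    """Cross-modal attention alignment followed by lambda-weighted fusion."""
    def __init__(self, d=768, heads=8, num_classes=3, lam=0.5):
        super().__init__()
        self.attn_t = nn.MultiheadAttention(d, heads, batch_first=True)  # text queries, image keys/values
        self.attn_i = nn.MultiheadAttention(d, heads, batch_first=True)  # image queries, text keys/values
        self.text_head = nn.Linear(2 * d, num_classes)
        self.img_head = nn.Linear(2 * d, num_classes)
        self.lam = lam

    def forward(self, p_t, p_i):                          # projected features: (B, d)
        t, i = p_t.unsqueeze(1), p_i.unsqueeze(1)         # treat each as a length-1 sequence
        c_t, _ = self.attn_t(query=t, key=i, value=i)     # cross-attention vector for text
        c_i, _ = self.attn_i(query=i, key=t, value=t)     # cross-attention vector for image
        c_t, c_i = c_t.mean(dim=1), c_i.mean(dim=1)       # pool to fixed-size embeddings
        h_t = torch.cat([p_t, c_t], dim=-1)               # fused text feature
        h_i = torch.cat([p_i, c_i], dim=-1)               # fused image feature
        prob_t = F.softmax(self.text_head(h_t), dim=-1)   # text probability vector
        prob_i = F.softmax(self.img_head(h_i), dim=-1)    # image probability vector
        return F.softmax(self.lam * prob_t + (1 - self.lam) * prob_i, dim=-1)  # final P

P = CrossModalFusionSketch()(torch.randn(4, 768), torch.randn(4, 768))
print(P.shape)  # torch.Size([4, 3])
```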
4.3 Multimodal sentiment classification
For each text-image pair (T, I) fed into the multimodal sentiment classifier, we obtain a predicted probability distribution $P$. The index of the maximum value in this probability vector is taken as the predicted emotion label $\hat{y}$. The computation proceeds as follows:

$$\hat{y} = \arg\max_{c \in \{1, \ldots, C\}} P_c$$

where $P$ is the model’s predicted probability vector over all emotion classes, and $\hat{y}$ is the final predicted emotion label.

During training, the model is optimized with the AdamW optimizer to minimize the sum of the cross-entropy loss and the contrastive loss:

$$\mathcal{L} = -\sum_{i} y_i \log \hat{p}_i + \mathcal{L}_{con}$$

where $\hat{p}_i$ is the Softmax output and $y_i$ is the ground-truth emotion label; the contrastive loss $\mathcal{L}_{con}$ employs a cosine-embedding loss.
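A minimal training-step sketch of this objective is given below, combining nn.CrossEntropyLoss with nn.CosineEmbeddingLoss under AdamW; the placeholder classifier, the margin, and the way mismatched pairs are formed by shuffling the batch are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the snippet runs on its own; in practice these come from the FFBT model.
B, d, C = 8, 768, 3
model = nn.Linear(2 * d, C)                       # placeholder classifier head
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=1e-2)
ce_loss = nn.CrossEntropyLoss()
cos_loss = nn.CosineEmbeddingLoss(margin=0.2)     # cosine-embedding contrastive loss

f_t, f_i = torch.randn(B, d), torch.randn(B, d)   # placeholder text / image features
labels = torch.randint(0, C, (B,))

logits = model(torch.cat([f_t, f_i], dim=-1))
# Matched text-image pairs are pulled together (+1); shuffling the batch yields mismatched pairs (-1).
pos = cos_loss(f_t, f_i, torch.ones(B))
neg = cos_loss(f_t, f_i[torch.randperm(B)], -torch.ones(B))
loss = ce_loss(logits, labels) + pos + neg        # cross-entropy + contrastive loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```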
5 Evaluation
This section introduces the dataset required for the experiments, the baseline algorithms, and the corresponding results, in that order.
5.1 Datasets
This paper implements experiments with various models on the SDTI dataset to evaluate the performance of the proposed FFBT method. The evaluation metrics include class-specific Precision, Recall, and F1-Score for individual category recognition, along with overall classification metrics including Accuracy, Macro-F1, and Weighted-F1.
The SDTI dataset was constructed by relabeling data from the public MVSA-Single benchmark [42], with all data collection and analysis procedures complying with the source platform’s terms and conditions, and obtaining necessary ethical approvals or user authorizations. The SDTI dataset comprises 5,129 text-image paired samples, with a standardized split of 4,001 samples for training, 616 for validation, and 512 for testing, resulting in a ratio of approximately 8:1:1 between training, validation, and test sets. As shown in Table 2, the emotional label distribution across different categories in the SDTI dataset follows an approximate 6:1:3 ratio.
5.2 Baseline
To validate the performance of our proposed FFBT method, we conduct comprehensive comparisons with the following baseline algorithms:
ResNet-SVM algorithm [43]: uses CNNs or ResNet series networks to extract text-image features and employs SVM to classify by finding the hyperplane that maximizes the interval between classes in the feature space. For nonlinearly separable data, it maps to a higher-dimensional space through kernel tricks to find the optimal decision boundary.
CNN-DT algorithm [44]: uses one CNN to extract features from text and another pre-trained CNN to extract features from images. After simply concatenating the two feature sets, it classifies them with a Decision Tree, which builds a tree-like structure by recursively selecting the optimal features and split points: starting from the root node, the data is divided into branches according to feature values, each branch representing a decision rule, and the dataset is recursively split into smaller, purer subsets until a leaf node is reached, with each leaf node corresponding to a classification result.
BERT-RF algorithm [45]: proposes using BERT for text feature extraction, while image features are extracted with general CNNs or ResNet. It then uses Random Forest to classify sentiments by constructing multiple decision trees, each of which randomly selects a subset of features and samples during training, and aggregates the classification results of all trees through a majority voting mechanism.
DNN-LR algorithm [46]: leverages two distinct pre-trained CNNs to extract features from text and images, respectively, and subsequently utilizes Logistic Regression for sentiment tendency classification.
RoBERTa–ClipVisionModel [47]: concatenates the textual and image features extracted by RoBERTa and CLIP’s Vision Transformer, respectively, and then feeds the fused vector into a multilayer perceptron for sentiment classification.
MultiSentiNet [48]: extracts objects and scenes from images as visual-semantic cues and introduces an image-guided LSTM network with attention to extract key textual terms.
ViLT [49]: a Transformer-based framework that introduces image-language cross-modal embeddings to jointly relate visual and textual information.
VAuLT [50]: an extension of ViLT that improves performance on vision-and-language tasks involving more complex text inputs than image captions, while having minimal impact on training and inference efficiency.
FFBT (BERT): identical to our proposed method except that it uses BERT instead of RoBERTa for text feature extraction.
5.3 Implementation details
In the experimental deployment, as mentioned in Sect 5.1, the SDTI dataset is split into training, validation, and test sets in a ratio of approximately 8:1:1. The proposed FFBT method is optimized with AdamW; the learning rate is initialized at 5e-6, and a weight decay of 1e-2 is applied. Given the sample sizes of the training, validation, and test sets, we set the batch size to 16 for training and validation and to 8 for testing. Our model is implemented in PyTorch. This paper uses Accuracy, Macro-F1, and Weighted-F1 as performance metrics to comprehensively assess the model, defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP stands for True Positive, TN for True Negative, FP for False Positive, and FN for False Negative.

The macro average is the simple (unweighted) average of a metric over all categories, regardless of each category’s support (i.e., its number of samples), whereas the weighted average weights the per-category metrics by their support:

$$\mathrm{Macro}\text{-}M = \frac{1}{N}\sum_{i=1}^{N} M_i, \qquad \mathrm{Weighted}\text{-}M = \frac{\sum_{i=1}^{N} S_i\, M_i}{\sum_{i=1}^{N} S_i}$$

where N is the number of categories, $M_i$ is the metric value for category i (such as precision, recall, or F1-score), and $S_i$ is its sample size or support.
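These definitions map directly onto standard scikit-learn calls, as the short sketch below shows on hypothetical labels.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical ground-truth and predicted labels (0 = negative, 1 = neutral, 2 = positive).
y_true = [0, 2, 2, 1, 0, 2, 1, 0]
y_pred = [0, 2, 1, 1, 0, 2, 2, 0]

print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Macro-F1:   ", f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print("Weighted-F1:", f1_score(y_true, y_pred, average="weighted"))  # support-weighted mean
print(classification_report(y_true, y_pred))                         # per-class precision, recall, F1
```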
5.4 Main results
On the SDTI dataset, the proposed FFBT algorithm is compared with the baseline methods; the results are reported in Fig 5 and Table 3.
From the results in the above Fig 5 and Table 3, we can observe that: (1) CNN-DT performs the worst because decision trees refine rules via locally optimal splits that can grow overly complex and fit training noise; each split relies on only one feature, making it impossible to capture word-order or negation dependencies such as “not + happy,” leading to low accuracy for subtle emotions like sarcasm. (2) DNN-LR outperforms Decision Tree by about 7%, since Logistic Regression handles high-dimensional, sparse textual features more efficiently. (3) When BERT replaces the DNN for text processing, BERT-RF reaches an even higher score, thanks to BERT’s powerful attention mechanism and Random Forest’s ability to reduce overfitting on textual features. (4) ResNet-SVM uses ResNet to extract fine-grained image features and leverages the kernel trick of SVM to model non-linear relations, achieving comparatively strong benchmark performance. (5) By incorporating visual cues into text, MultiSentiNet improves further. (6) RoBERTa-ClipVisionModel, which concatenates text and image features after Transformer encoding, slightly surpasses MultiSentiNet. (7) ViLT and VAuLT, both built on Transformers with explicit cross-modal learning, achieve the best scores among all baselines. (8) FFBT (BERT) trails FFBT marginally, as RoBERTa is essentially an enhanced version of BERT.
On the SDTI dataset, compared with the strongest baseline (excluding the closely related variant FFBT (BERT)), our FFBT model improves accuracy, Macro-F1, and Weighted-F1 by 4.1%, 7%, and 5%, respectively. Across all metrics, the results consistently demonstrate the superiority of FFBT. This performance gain stems from three key enhancements: first, RoBERTa’s byte-level Byte-Pair Encoding eliminates out-of-vocabulary issues, greatly expanding the effective vocabulary and enabling richer textual feature extraction; second, ResNet50’s residual design mitigates vanishing gradients, producing robust image representations; and finally, feeding the textual and image features into a multimodal Transformer encoder with cross-modal attention strengthens their semantic interaction, allowing the modalities to complement each other and yielding more accurate predictions.
5.5 Ablation study
To gain a clearer understanding of the roles of different modules in the proposed model, we organized a series of ablation experiments on the SDTI dataset.
In this section, we first validate the superiority of multimodal analysis over unimodal analysis by examining text-only (“FFBT w/o img”) and image-only (“FFBT w/o text”) configurations. Subsequently, we investigate the contribution of key components by: (1) removing the cross-modal alignment module (“FFBT w/o align”), (2) replacing the fusion module with simple concatenation (“FFBT only concat”) or element-wise addition (“FFBT only combine”), and (3) eliminating the contrastive loss (“FFBT w/o Contrastive Loss”). The experimental results are presented in Fig 6 and Table 4.
Where the contribution is defined as:
From the ablation studies above, we can observe that:
(1) Due to the higher semantic explicitness of textual data compared to the abstract nature of images—and considering that sentiment classification may rely more heavily on textual content (e.g., sentiment words directly appearing in the text)—the performance metrics of “FFBT w/o img” are consistently higher than those of “FFBT w/o text” by 4.1%. (2) When only textual or image data is used for unimodal sentiment classification (multi-label task), the performance metrics lag behind those of the text-image bimodal approach by 4.7% and 8.8%, respectively. This disparity arises because the combined text-image information enhances the classifier’s recognition capability. (3) The performance degradation of “FFBT w/o align” compared to the full FFBT model underscores the critical role of cross-modal alignment in modeling latent relationships between textual tokens and visual regions for effective multimodal sentiment classification. (4) Both simplified fusion variants exhibited significant limitations: “FFBT only concat” showed –3.4% accuracy degradation, and “FFBT only combine” suffered –5.1% accuracy loss. This empirically confirms that our proposed fusion mechanism enables fine-grained feature interaction, outperforming naive fusion strategies by dynamically learning modality-specific attention weights. (5) After incorporating the contrastive loss from contrastive learning, FFBT achieves better alignment between image and text feature vectors, enabling finer-grained feature interaction. Consequently, its performance metrics slightly outperform those of “FFBT w/o Contrastive Loss”.
5.6 Hyper-parameter analysis
In the series of experiments we designed, the number of attention heads m in the cross-modal fusion mechanism and the weighting factor λ in Eq 29 are two key hyper-parameters that directly affect inference performance. Therefore, this section analyzes the performance of the proposed method under various settings of m and λ.
We first examine how the number of attention heads m affects performance. Fig 7 reports results for different m values. Setting m = 8 yields the best outcome because it aligns the per-head dimension with the fine-grained details required for text–image alignment. Values of 2 or 4 under-fit the data, while 16 heads are too fine-grained, leading to over-fitting and memory bottlenecks. Consequently, we fix m = 8 in all subsequent experiments.
We then conduct experiments to investigate the effect of varying λ in Eq 29. As shown in Fig 8, an intermediate value of λ yields the best accuracy and F1-score, whereas performance drops when only image features (λ = 0) or only text features (λ = 1) are used. Hence, leveraging both image and textual information is essential for overall emotion classification. Based on these results, we adopt the best-performing λ from Fig 8 in our final method.
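A sweep over λ of this kind can be scripted as below, re-weighting fixed validation-set text and image probability vectors according to Eq 29; the candidate values and random placeholder data are assumptions of this sketch.

```python
import torch

def sweep_lambda(prob_t, prob_i, labels, candidates=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Pick the weighting factor lambda in Eq 29 that maximizes validation accuracy."""
    best_lam, best_acc = None, -1.0
    for lam in candidates:
        fused = torch.softmax(lam * prob_t + (1 - lam) * prob_i, dim=-1)
        acc = (fused.argmax(dim=-1) == labels).float().mean().item()
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc

# Random placeholder validation outputs (616 samples, 3 classes) for demonstration.
prob_t = torch.softmax(torch.randn(616, 3), dim=-1)
prob_i = torch.softmax(torch.randn(616, 3), dim=-1)
labels = torch.randint(0, 3, (616,))
print(sweep_lambda(prob_t, prob_i, labels))
```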
6 Conclusions
This paper proposes a Feature-Fusion-Based Transformer (FFBT) designed to identify the emotional tendencies expressed by students on social networking platforms, laying the groundwork for future intelligent and personalized psychological support education at all school levels. The method leverages RoBERTa for textual content and ResNet50—a CNN-based architecture with residual connections—for image content extracted from student Weibo posts. A multimodal Transformer encoder then performs cross-modal interaction and fusion, after which fully-connected layers classify the fused features into emotional categories. Experimental results demonstrate that FFBT not only overcomes the limitations of traditional handcrafted features and modality isolation but also exploits the combined power of textual and image information to improve the accuracy of student emotion classification. Considering the exponential growth of social-media data and the increasing complexity of human emotions, future sentiment-analysis systems should move beyond simple positive, negative, and neutral labels to more nuanced states such as guilt, pride, or anger. Consequently, our future work will investigate even more sophisticated feature alignment and fusion strategies to capture these subtle emotional tendencies, continually providing technological support for intelligent and personalized psychological counseling.
Supporting information
S1 Data.
The relabeled SDTI dataset derived from the public MVSA-Single benchmark [42].
https://doi.org/10.1371/journal.pone.0333416.s001
(ZIP)
Acknowledgments
In this article, we employed a large AI model to scrutinize the grammar of the text. We extend our gratitude to Moonshot AI for providing the “Kimi” artificial intelligence technology, version 1.0, 2024. For more information about this technology, please visit the website at https://kimi.moonshot.cn.
References
- 1. Gao Y, Zhen Y, Li H, Chua T-S. Filtering of brand-related microblogs using social-smooth multiview embedding. IEEE Trans Multimedia. 2016;18(10):2115–26.
- 2. Nanomi Arachchige IA, Sandanapitchai P, Weerasinghe R. Investigating machine learning & natural language processing techniques applied for predicting depression disorder from online support forums: a systematic literature review. Information. 2021;12(11):444.
- 3. Cambria E, Schuller B, Xia Y, Havasi C. New avenues in opinion mining and sentiment analysis. IEEE Intell Syst. 2013;28(2):15–21.
- 4. Yu W, Xu H, Yuan Z, Wu J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. AAAI. 2021;35(12):10790–7.
- 5. Tsai Y-HH, Bai S, Pu Liang P, Kolter JZ, Morency L-P, Salakhutdinov R. Multimodal transformer for unaligned multimodal language sequences. Proc Conf Assoc Comput Linguist Meet. 2019;2019:6558–69. pmid:32362720
- 6. Hazarika D, Zimmermann R, Poria S. Misa: modality-invariant and -specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM international conference on multimedia; 2020. p. 1122–31.
- 7. Zadeh A, Chen M, Poria S, Cambria E, Morency LP. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250; 2017.
- 8. Campos V, Jou B, Giro-i Nieto X. From pixels to sentiment: fine-tuning CNNs for visual sentiment prediction. Image and Vision Computing. 2017;65:15–22.
- 9. Zhang X, Yan Z. LCA-BERT: A local and context fusion sentiment analysis model based on BERT. In: International Artificial Intelligence Conference, 2023. p. 300–11.
- 10. Liu Y. Roberta: a robustly optimized bert pretraining approach. arXiv preprint 2019. https://doi.org/10.48550/arXiv.1907.11692
- 11. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 770–8. https://doi.org/10.1109/cvpr.2016.90
- 12. Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International conference on machine learning; 2020. p. 1597–607.
- 13. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition; 2018. p. 6450–9.
- 14. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 6299–308.
- 15. H Kugate A, Y Balannanavar B, Goudar RH, Rathod VN, G M D, Kulkarni A, et al. Efficient key frame extraction from videos using convolutional neural networks and clustering techniques. EAI Endorsed Trans Context Aware Syst App. 2024;10.
- 16. Fan Y, Lu X, Li D, Liu Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction. 2016. p. 445–50. https://doi.org/10.1145/2993148.2997632
- 17. Pang S, Xue Y, Yan Z, Huang W, Feng J. Dynamic and multi-channel graph convolutional networks for aspect-based sentiment analysis. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. https://doi.org/10.18653/v1/2021.findings-acl.232
- 18. Li R, Chen H, Feng F, Ma Z, Wang X, Hovy E. Dual graph convolutional networks for aspect-based sentiment analysis. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021. https://doi.org/10.18653/v1/2021.acl-long.494
- 19. Busso C, Bulut M, Lee C-C, Kazemzadeh A, Mower E, Kim S, et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resources & Evaluation. 2008;42(4):335–59.
- 20. Poria S, Hazarika D, Majumder N, Naik G, Cambria E, Mihalcea R. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint 2018. https://doi.org/10.48550/arXiv.1810.02508
- 21. Abdullah SMSA, Ameen SYA, M. Sadeeq MA, Zeebaree S. Multimodal emotion recognition using deep learning. JASTT. 2021;2(01):73–9.
- 22. Gupta A, Likhomanenko T, Yang KD, Bai RH, Aldeneh Z, Jaitly N. Visatronic: a multimodal decoder-only model for speech synthesis. arXiv preprint 2024. https://arxiv.org/abs/2411.17690
- 23. Mai S, Zeng Y, Zheng S, Hu H. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans Affective Comput. 2023;14(3):2276–89.
- 24. Liu Z, Shen Y, Lakshminarasimhan VB, Liang PP, Zadeh A, Morency LP. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint 2018. https://arxiv.org/abs/1806.00064
- 25. Kakuba S, Poulose A, Han DS. Deep learning approaches for bimodal speech emotion recognition: advancements, challenges, and a multi-learning model. IEEE Access. 2023;11:113769–89.
- 26. Poria S, Cambria E, Hazarika D, Majumder N, Zadeh A, Morency LP. Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers); 2017. p. 873–83.
- 27. Xing S, Mai S, Hu H. Adapted dynamic memory network for emotion recognition in conversation. IEEE Trans Affective Comput. 2022;13(3):1426–39.
- 28. Hazarika D, Poria S, Mihalcea R, Cambria E, Zimmermann R. ICON: interactive conversational memory network for multimodal emotion detection. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018. https://doi.org/10.18653/v1/d18-1280
- 29. Xiao L, Wu X, Xu J, Li W, Jin C, He L. Atlantis: aesthetic-oriented multiple granularities fusion network for joint multimodal aspect-based sentiment analysis. Information Fusion. 2024;106:102304.
- 30. Yue T, Mao R, Wang H, Hu Z, Cambria E. KnowleNet: knowledge fusion network for multimodal sarcasm detection. Information Fusion. 2023;100:101921.
- 31. Huang Q, Chen J, Huang C, Huang X, Wang Y. Text-centered cross-sample fusion network for multimodal sentiment analysis. Multimedia Systems. 2024;30(4).
- 32. Huang C, Chen J, Huang Q, Wang S, Tu Y, Huang X. AtCAF: attention-based causality-aware fusion network for multimodal sentiment analysis. Information Fusion. 2025;114:102725.
- 33. Chen J, Huang Q, Huang C, Huang X. Actual cause-guided adaptive gradient scaling for balanced multimodal sentiment analysis. ACM Trans Multimedia Comput Commun Appl. 2025;21(6):1–24.
- 34. Li M, Zhou S, Chen Y, Huang C, Jiang Y. EduCross: dual adversarial bipartite hypergraph learning for cross-modal retrieval in multimodal educational slides. Information Fusion. 2024;109:102428.
- 35. P R, Dhaman RK, Poulose A. Feature importance and model performance in deep learning for speech emotion recognition. In: 2024 11th International Conference on Advances in Computing and Communications (ICACC). 2024. p. 1–6. https://doi.org/10.1109/icacc63692.2024.10845528
- 36. Huang C, Lin Z, Huang Q, Huang X, Jiang F, Chen J. H2CAN: Heterogeneous hypergraph attention network with counterfactual learning for multimodal sentiment analysis. Complex Intell Syst. 2025;11(4):196.
- 37. Li M, Shi J, Bai L, Huang C, Jiang Y, Lu K, et al. FrameERC: framelet transform based multimodal graph neural networks for emotion recognition in conversation. Pattern Recognition. 2025;161:111340.
- 38. Shi J, Li M, Chen Y, Cui L, Bai L. Multimodal graph learning with framelet-based stochastic configuration networks for emotion recognition in conversation. Information Sciences. 2025;686:121393.
- 39. Huang C, Lin Z, Han Z, Huang Q, Jiang F, Huang X. PAMoE-MSA: polarity-aware mixture of experts network for multimodal sentiment analysis. Int J Multimed Info Retr. 2025;14(1).
- 40. You R, Guo Z, Cui L, Long X, Bao Y, Wen S. Cross-modality attention with semantic graph embedding for multi-label classification. AAAI. 2020;34(07):12709–16.
- 41. Reimers N, Gurevych I. Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint 2019.
- 42. Niu T, Zhu S, Pang L, El Saddik A. Sentiment analysis on multi-view social data. In: International conference on multimedia modeling. Springer; 2016. p. 15–27.
- 43. Wang Q. Support vector machine algorithm in machine learning. In: 2022 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). 2022. p. 750–6. https://doi.org/10.1109/icaica54878.2022.9844516
- 44. Lu Y, Ye T, Zheng J. Decision tree algorithm in machine learning. In: 2022 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA). 2022. https://doi.org/10.1109/aeeca55500.2022.9918857
- 45. Palimkar P, Shaw RN, Ghosh A. Machine learning technique to prognosis diabetes disease: random forest classifier approach. In: Advanced computing and intelligent technologies: proceedings of ICACIT 2021; 2022. p. 219–44.
- 46. Yu Y, Lin H, Meng J, Zhao Z. Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms. 2016;9(2):41.
- 47. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: International conference on machine learning. PmLR; 2021. p. 8748–63.
- 48. Xu N, Mao W. Multisentinet: a deep semantic network for multimodal sentiment analysis. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management; 2017. p. 2399–402.
- 49. Kim W, Son B, Kim I. Vilt: vision-and-language transformer without convolution or region supervision. In: International conference on machine learning. 2021. p. 5583–94.
- 50. Chochlakis G, Srinivasan T, Thomason J, Narayanan S. Vault: augmenting the vision-and-language transformer with the propagation of deep language representations. arXiv preprint 2022.