
Enhancing sarcasm detection on social media: A comprehensive study using LLMs and BERT with multi-headed attention on SARC

  • Lihong Zhang,

    Roles Funding acquisition, Methodology, Resources, Validation

    Affiliation School of Foreign Studies, Hunan First Normal University, Changsha, Hunan, China

  • Muhammad Faseeh ,

    Roles Conceptualization, Data curation, Investigation, Methodology, Software, Visualization, Writing – original draft

    faseeh.cs@cuiatk.edu.pk (MF); anwar.ghani@iiu.edu.pk (AG)

    Affiliation Department of Computer Science, COMSATS University Islamabad, Attock Campus, Punjab, Republic of Pakistan

  • Syed Shehryar Ali Naqvi,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Visualization

    Affiliation Department of Electrical Engineering, COMSATS University Islamabad, Attock Campus, Punjab, Republic of Pakistan

  • Liang Hu,

    Roles Formal analysis, Funding acquisition, Project administration, Resources, Validation, Writing – review & editing

    Affiliation School of Foreign Studies, Hunan First Normal University, Changsha, Hunan, China

  • Anwar Ghani

    Roles Conceptualization, Investigation, Methodology, Supervision, Writing – review & editing

    faseeh.cs@cuiatk.edu.pk (MF); anwar.ghani@iiu.edu.pk (AG)

    Affiliations Department of Computer Science, International Islamic University, Islamabad, Pakistan, Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan

Abstract

Sarcasm detection in natural language processing (NLP) remains a complex challenge, especially in social media, where contextual clues are often subtle. This study addresses this challenge by leveraging transformer-based models, including BERT, GPT-3, Claude-2, and Llama-2, for sarcasm detection on a large dataset from the Self-Annotated Reddit Corpus (SARC). The proposed method utilizes multi-head attention mechanisms to enhance model performance by capturing nuanced contextual relationships in the text. Fine-tuning of BERT, GPT-3, and Llama-2 was conducted to ensure a fair comparison and to provide a more detailed understanding of sarcasm in context. Our BERT-based model achieved state-of-the-art performance, with precision, recall, F1 score, and accuracy of 0.918, 0.917, 0.917, and 0.917, respectively, outperforming the other models. The effectiveness of our approach is demonstrated through rigorous statistical validation, ablation studies, and error analysis, providing robust evidence of its superiority. This study also highlights the significance of fine-tuning, machine translation, and multi-head attention in improving sarcasm detection.

1 Introduction

In today’s digital communication landscape, social media platforms have become thriving hubs where individuals freely express their thoughts, share information, and engage in dynamic discussions. A distinctive and often confounding form of communication in these spaces is sarcasm, which poses significant challenges for both human and machine comprehension due to its subtlety and nuance.

Sarcasm is characterized by a complex interplay of language and context, wherein literal expressions often convey meanings opposite to what is intended. For example, the statement “What a fantastic idea to schedule a meeting during lunchtime!” appears positive on the surface but critiques the impracticality of such a suggestion. Such remarks can mislead sentiment analysis systems, as seen in phrases like “Oh sure, because spending hours in traffic just to get to work is everyone’s idea of a perfect morning”, where apparent cheerfulness masks frustration.

The increasing use of NLP tools in analyzing social media content underscores the importance of accurately detecting sarcasm to enhance downstream tasks, including sentiment analysis, user profiling, and content moderation. However, sarcasm’s reliance on contextual, pragmatic, and cultural cues renders it exceptionally difficult for standard NLP models to capture, especially in noisy, multilingual social media environments.

Historically, sarcasm detection systems relied on handcrafted features and rule-based methods [13]. The advent of Deep Learning (DL) techniques enabled automatic feature extraction, significantly advancing sarcasm detection with models such as CNNs [4,5], RNNs [6], LSTMs [7,8], and GRUs [9,10]. Attention mechanisms further improved model interpretability by selectively focusing on salient inputs.

Even with these advances, existing methods often fail to capture context-dependent sarcasm effectively, especially across multiple languages [11]. Recent transformer-based architectures, such as BERT, RoBERTa, and XLNet, have significantly improved contextual understanding. However, they still exhibit limitations in sarcasm detection due to a lack of task-specific fine-tuning and suboptimal handling of multilingual inputs.

Furthermore, newer models like DeBERTa, while promising, have not been sufficiently explored or adapted for sarcasm-specific nuances [12]. This study addresses these challenges by advancing BERT with attention-specific fine-tuning and cross-lingual normalization, thereby offering both methodological and theoretical improvements over existing architectures.

The primary objective of this study is to enhance sarcasm detection on social media by addressing the shortcomings of existing transformer-based models. Specifically, this research aims to improve the detection of subtle and context-dependent sarcasm in multilingual and noisy data environments. To achieve this, we propose a customized BERT model fine-tuned with multi-head attention mechanisms and optimized hyperparameters. We also integrate machine translation (MT) into the preprocessing pipeline to ensure consistency in semantic understanding across non-English inputs. Furthermore, this study aims to benchmark the proposed model’s performance against other leading large language models (LLMs), such as GPT-3, Claude-2, and Llama-2, thereby establishing a comprehensive evaluation framework. The key contributions of this study are summarized as follows:

  • Employed machine translation to enhance the contextual understanding of sarcastic text in multilingual environments.
  • Leveraged BERT embeddings for improved semantic and sentence-level understanding.
  • Optimized multi-head attention to refine sarcasm detection capabilities.
  • Conducted a comprehensive comparative analysis with state-of-the-art models, evaluating performance on accuracy, precision, recall, and F1 score to establish a robust benchmark.

The remainder of this paper is organized as follows: Sect 2 reviews the related literature. Sect 3 describes the proposed methodology. Sect 4 presents the findings, while Sect 5 discusses the proposed work. Conclusions and future work are presented in Sect 6.

The different notations used in this article are presented in Table 1.

2 Literature review

Automatic sarcasm detection has recently received a lot of interest from researchers working in ML and NLP [13]. NLP techniques examine the complexities of language and utilize linguistic corpora to understand detailed information qualitatively. In contrast, ML techniques employ supervised and unsupervised learning methods to detect sarcastic words, drawing insights from both labeled and unlabeled data.

2.1 Machine learning approaches

Eke et al. [14] thoroughly analyzed prior studies on sarcasm detection, highlighting popular feature extraction methods, including n-grams and part-of-speech tagging. Their investigation found that binary representation and term frequency were commonly used for feature representation, and information gain and the Chi-squared test for feature selection. Building on Eke et al.’s findings, Sarsam et al. [15] investigated modified and customized ML algorithms (AMLA and CMLA) in a sarcasm detection study. Their findings were consistent with previous studies, highlighting the importance of lexical, pragmatic, frequency, and part-of-speech tagging variables in improving the accuracy of SVM classifiers. Furthermore, they suggested that combining lexical and personal variables could improve the effectiveness of models such as CNN-SVM.

Khodak et al. [16] made significant contributions to the subject by developing a large corpus for detecting sarcasm. Their methodology included manual annotation, which was then compared to techniques like bag-of-words, phrase embeddings, and bag-of-bigrams. Interestingly, their findings demonstrated that manual identification of sarcasm outperformed computerized strategies. Similarly, Kumar et al. [17] used mutual information (MI), information gain (IG), and the chi-square test for feature selection, which was then applied to clustering methods. They used support vector machines (SVMs) for final categorization. Similarly, Pawar et al. [18] employed ML classification models to capture data related to sentiment, punctuation, semantics, syntax (such as interjections, odd phrases, laughter expressions), and patterns. These characteristics were trained for classification using SVM and random forest.

Du et al. [19] emphasized the importance of contextual elements in detecting sarcasm, including the emotional tone of communication and user habits. They suggested a two-stream CNN technique that considers both the semantics and emotional context of the text, augmented with SenticNet and LSTM.

2.2 Deep learning approaches

Deep learning techniques have further revolutionized sarcasm detection by capturing contextual nuances that traditional ML methods miss. Anusha et al. [20] explored the effectiveness of bi-LSTM models in conjunction with GloVe and word2vec embeddings for identifying sarcastic expressions on social media platforms. T. K. Balaji et al. [21] addressed the unique linguistic hurdle of identifying sarcastic expressions within the context of COVID-19 conversations. Ganesh et al. [22] used various deep-learning-based attention models for sarcasm detection in social media texts. Ghosh et al. [23] employed a neural network framework to identify sarcasm in tweets, utilizing a CNN and a bidirectional LSTM. Similarly, Ghosh et al. [24] used several LSTM models and contextual data to detect sarcastic comments; they used information from past comments to better grasp the context of the present statement, which helped predict sarcasm. Xiong et al. [25] proposed a novel strategy that combines self-matching words with a bidirectional long short-term memory (LSTM) framework. They extracted shared information by matching words within phrases and used low-rank bilinear pooling to address potential redundancy while maintaining classification accuracy.

Another option is to analyze sentence context using techniques such as LSTM, bidirectional LSTM, or attention modules. Liu et al. [26] employed content criteria, including part of speech, punctuation, numerical data, and emoticons, to detect sarcasm on Twitter. Misra et al. [27] used bidirectional LSTM with an attention module to extract contextual information from adjacent sentences and apply relevant word weights.

Akula et al. [28] developed a multi-headed self-attention framework for categorizing sarcastic comments across social media sites. Their approach combines a GRU with the self-attention module to capture distant word connections. Kamal et al. [9] proposed a sarcasm detection system that combines attention and GRU, with the transformer-based approach providing a novel mechanism for contextual learning.

Goel et al. [29] used a DL ensemble model in their work. Parameswaran et al. [30] proposed using a combination of ML classifiers and DL models to identify sarcasm. Initially, they employed ML to categorize sarcastic phrases and evaluate whether they contained a target, which was then extracted using DL models for aspect-based sentiment analysis. Baruah et al. [31] compared BERT, BiLSTM, and SVM classifiers for sarcasm detection. Their experiments revealed that incorporating the last utterance in a dialogue and the response improved Twitter’s performance. However, on Reddit, optimal results were achieved solely by considering the response without contextual information.

2.3 Transfer learning approaches

AI systems have rapidly evolved, enhancing human creativity and innovation. Models like ChatGPT, DALL-E 2, Bard, Claude, and BERT have showcased significant advancements [32].

Transformer-based models, particularly BERT [33] and GPT-3 [32], have demonstrated superior performance in capturing contextual dependencies due to their self-attention mechanisms and bidirectional encoding, which enhances sarcasm detection accuracy. Multi-head attention mechanisms enable models to attend to multiple segments of input sequences simultaneously, capturing complex semantic relationships and improving tasks such as sarcasm detection [34].

Transfer learning has also emerged as a powerful technique in sarcasm detection. Babanejad et al. [35] developed a contextual features-based BERT model to detect sarcastic comments. Potamias et al. [36] introduced another significant transformer-based model, the RCNN-Roberta. This model utilized the RoBERTa transformer, a simplified version of BERT-base, and a bidirectional LSTM. They combined the embeddings from RoBerta and bidirectional LSTM before sending them to the pooling layer. Farha et al. [37] evaluated transformer-based language models for Arabic sentiment and sarcasm detection, including BERT, GPT, and ELECTRA. They found that models trained on Arabic data, such as MARBERT, outperformed others, highlighting the importance of language-specific training. AraELECTRA, despite its lower computational cost, was identified as one of the top-performing models, showcasing its efficacy in Arabic NLP tasks.

Gregory et al. [38] note that sarcasm comprehension is crucial for effective online communication. Previous studies have utilized LSTM and transformer architecture models; their study extends these approaches by incorporating LSTM, GRU, and transformer models, with an ensemble of BERT and RoBERTa demonstrating the highest success rate. The work in [39] introduces Adversarial and Auxiliary Features-Aware BERT (AAFAB), which combines BERT’s contextual word embeddings with manually extracted auxiliary features for sarcasm detection on the SARC dataset. It also proposes a multi-head attention bidirectional long short-term memory (MHABiLSTM) model for sarcasm detection using the SARC Reddit dataset, which demonstrates improved performance over traditional models.

Recent studies by Zhang et al. [40] and Liu et al. [41] have explored advanced fine-tuning techniques for transformer-based architectures such as GPT-3 and Llama-2 in the context of sarcasm detection. These works highlight the significance of multi-head attention and context-aware embeddings in enhancing detection accuracy. Additionally, the study by Hassan et al. [42] highlights the integration of multimodal data (e.g., text and images) to enhance sarcasm understanding, demonstrating the potential for broader multimodal applications.

2.4 Multimodal and non-English sarcasm detection

Researchers have explored various methods for detecting sarcasm that extend beyond text. Garcia et al. [43] utilized emoticons and emojis, noting disparities in their use in sarcastic comments versus other types of comments. Yao et al. [44] proposed a novel method that combines text and Twitter images, assessing tweets, images, text over images, and image captions using a multi-channel interactions model with gated and guided attention modules. Ding et al. [45] investigated sarcasm detection using multimodal techniques. They utilized residual connections to create three model versions tailored to different experimental settings within a multi-level late-fusion learning framework. Farha et al. [37], and Al-Hassan et al. [42] studied sarcasm detection in Arabic.

Similarly, Swami et al. [46] created a model that detects sarcasm in Hindi-English tweets. Sarcasm has been studied in languages other than English. Techentin et al. [47] investigated how native and non-native English speakers perceive sarcasm, concluding that specific experiences influence non-native speakers’ capacity to understand and use sarcastic cues.

Table 2 contains the most relevant work in sarcasm detection.

Table 2. Comparative analysis of various works, methodologies, features, and datasets.

https://doi.org/10.1371/journal.pone.0334120.t002

The datasets referenced in Table 2 vary in size, language coverage, and annotation methods. For example, the SARC dataset used in this study contains approximately 1.3 million Reddit comments, with equal numbers of sarcastic and non-sarcastic labels. By contrast, datasets like Twitter and SemEval are smaller and primarily monolingual. Additionally, SARC offers detailed metadata such as subreddit information, enabling context-aware sarcasm detection, whereas other datasets rely on less context-rich annotations. These differences underscore the need for models that can scale across diverse and complex datasets, such as SARC.

2.5 Comparative analysis with prior approaches

The proposed methodology distinguishes itself from prior approaches through several key innovations. Unlike conventional transformer-based models such as RoBERTa and Llama-2, which use default configurations, our model fine-tunes BERT with 16 multi-head attention mechanisms, extended sequence lengths (128 tokens), and a learning rate optimized for sarcasm detection (3e-5). Additionally, we integrate MT into the preprocessing pipeline to ensure consistency in multilingual data, a feature often neglected in previous studies. The model captures nuanced linguistic features with greater precision by leveraging BERT embeddings for semantic and contextual understanding. These advancements enable our approach to outperform state-of-the-art methods, such as GPT-3, Claude-2, and Llama-2.
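The fine-tuning choices listed above (16 attention heads, 128-token sequences, learning rate 3e-5) can be collected into a single configuration sketch. This is illustrative only; the field names follow common Hugging Face conventions and are assumptions, not the exact implementation (the full hyperparameter set appears in Table 4).

```python
# Illustrative fine-tuning configuration; field names are assumptions.
FINE_TUNE = {
    "pretrained_model": "bert-base-uncased",  # assumed base checkpoint
    "num_attention_heads": 16,                # customized head count
    "max_seq_length": 128,                    # extended sequence length
    "learning_rate": 3e-5,                    # sarcasm-tuned learning rate
    "num_labels": 2,                          # sarcastic (1) vs non-sarcastic (0)
}

def describe(cfg):
    """One-line summary of the key fine-tuning choices."""
    return (f"heads={cfg['num_attention_heads']}, "
            f"seq={cfg['max_seq_length']}, lr={cfg['learning_rate']:.0e}")

print(describe(FINE_TUNE))  # heads=16, seq=128, lr=3e-05
```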

3 Methodology

This section presents the proposed methodology for the sarcasm detection challenge, which requires a robust, real-world framework with high prediction accuracy. To detect sarcastic comments effectively, we must first understand the data’s behavior through analytics. Consequently, the proposed method is tailored to identify sarcastic phrases across various platforms. The suggested model uses sophisticated DL techniques to capture the nuanced characteristics of sarcastic remarks.

The SARC dataset [16] used in this study consists of 1.3 million comments, equally divided into 650,000 sarcastic (labeled 1) and 650,000 non-sarcastic (labeled 0) instances. The average comment length is 40 words, highlighting the dataset’s textual richness. As shown in Table 3, the most frequently occurring subreddits include AskReddit, politics, and worldnews, each contributing significantly to the dataset’s diversity. The politics subreddit primarily contains discussions centered on American politics, with inflammatory topics from the 2016 election resurfacing. Similarly, in worldnews, most entities belong to the geopolitical category, with nations such as Israel, China, and Saudi Arabia being common subjects of sarcasm.

Additionally, Russian interference in the 2016 election was a widely debated topic, frequently appearing in sarcastic discourse. Another prominent trend is the prevalence of Islam, Zionist, and Muslim as contentious keywords, reflecting their persistent role in polarizing discussions. These insights highlight the diverse and opinionated nature of the SARC dataset, making it a valuable resource for sarcasm detection research.
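Before modeling, the claimed 50/50 class balance can be verified with a simple frequency check. The snippet below uses an invented toy sample of (comment, label) pairs; applied to the full corpus it would confirm the 650,000/650,000 split.

```python
from collections import Counter

# Toy stand-in for SARC rows: (comment, label) with label 1 = sarcastic.
sample = [
    ("Great, another Monday.", 1),
    ("The meeting is at 3 pm.", 0),
    ("Oh sure, traffic is my favorite hobby.", 1),
    ("The election results were announced today.", 0),
]

def class_balance(rows):
    """Fraction of rows per label; a balanced corpus gives ~0.5 per class."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

print(class_balance(sample))  # {1: 0.5, 0: 0.5}
```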

A correlation heat map helps illustrate the connections among various variables in a dataset. A heat map like the one shown in Fig 1 can simplify complex statistical data into a more understandable visual style, increasing accessibility.

Fig 1. Correlation heatmap showing relationships among SARC metadata features.

Upvotes (‘ups’) strongly correlate with the overall score, while downvotes (‘downs’) exhibit a weaker or negative association, illustrating how community reactions shape the visibility of sarcastic content.

https://doi.org/10.1371/journal.pone.0334120.g001

It depicts the correlation among key metadata characteristics of the SARC dataset: ‘downs,’ ‘ups,’ and ‘score.’ ‘Downs’ represent the number of downvotes received by a comment, signaling disapproval, while ‘ups’ signify upvotes, indicating approval. The ‘score’ is the net result of upvotes minus downvotes, reflecting overall user sentiment towards a comment. The correlation analysis shows that ‘ups’ strongly correlate with ‘score,’ highlighting their primary influence on comment visibility and user engagement. In contrast, ‘downs’ have a weaker or negative correlation with ‘score,’ emphasizing their lesser impact. These insights suggest that metadata characteristics can provide valuable context in detecting sarcasm, as sarcastic comments often elicit mixed reactions, as reflected in these correlations.
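The correlation pattern discussed above can be reproduced on synthetic metadata with `numpy.corrcoef`: because `score = ups - downs` by construction, a strong ups-score correlation and a weak or negative downs-score correlation fall out directly. The numbers here are random stand-ins, not the actual SARC metadata.

```python
import numpy as np

# Synthetic metadata mimicking the SARC fields: score = ups - downs.
rng = np.random.default_rng(0)
ups = rng.integers(0, 200, size=500).astype(float)
downs = rng.integers(0, 50, size=500).astype(float)
score = ups - downs

# Pearson correlation matrix over (downs, ups, score), as in Fig 1.
corr = np.corrcoef(np.vstack([downs, ups, score]))
print(corr.round(2))
```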

A word cloud is a visual representation of text data that emphasizes the most frequently occurring terms. Word clouds are helpful for text analysis because they offer a quick and straightforward way to see and comprehend the words used most often. Visualizing comments without sarcasm (Fig 2) and with sarcasm (Fig 3) makes it easier to spot trends, keywords, and essential themes in the text.

Fig 2. Word cloud of non-sarcastic comments (Class 0) from the SARC dataset.

Frequent terms reflect everyday conversational tone and factual discourse.

https://doi.org/10.1371/journal.pone.0334120.g002

Fig 3. Word cloud of sarcastic comments (Class 1).

Distinct lexical patterns, exaggerations, and ironic expressions serve as cues commonly used for sarcasm detection.

https://doi.org/10.1371/journal.pone.0334120.g003
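Word clouds such as Figs 2 and 3 are driven by per-class term frequencies. Below is a minimal sketch of the underlying counting step (a library like `wordcloud` would then render the frequencies visually); the example comments are invented.

```python
from collections import Counter
import re

# Invented example comments, keyed by label (1 = sarcastic, 0 = not).
comments = {
    1: ["oh great another glorious monday", "sure best idea ever"],
    0: ["the report is due friday", "the meeting starts at nine"],
}

def top_terms(texts, n=3):
    """Most frequent lowercase word tokens across a list of comments."""
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    return Counter(words).most_common(n)

for label, texts in comments.items():
    print(label, top_terms(texts))
```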

3.1 Overview of the model

The proposed model (Fig 4) is structured to effectively capture sarcasm detection in textual data, utilizing a multi-stage processing pipeline. The preprocessing stage eliminates noise, removes special characters, and applies machine translation (MT) to ensure consistency in multilingual data. To enhance semantic and contextual representation, the feature extraction stage integrates multiple embedding techniques, including Word2Vec, fastText, Paragram, and BERT embeddings. The model-building stage involves fine-tuning transformer-based architectures (BERT, GPT-3, Llama-2, Claude-2 (prompt-based classification approach)) to improve sarcasm classification. Multi-head attention in BERT enhances contextual awareness, allowing it to focus on critical words that indicate sarcasm. The evaluation phase measures accuracy, precision, recall, and F1-score to compare the model’s performance.

Fig 4. Architecture of the proposed sarcasm detection framework.

The pipeline integrates preprocessing (translation, normalization, cleaning), feature extraction (Word2Vec, fastText, Paragram, BERT), and model building (BERT with multi-head attention, GPT-3, Claude-2, Llama-2), followed by evaluation.

https://doi.org/10.1371/journal.pone.0334120.g004

3.2 Model architecture

The architecture of the sarcasm detection model is meticulously designed to identify and interpret subtle sarcastic expressions within comments from the SARC dataset. The evaluation explores several advanced models, including BERT, GPT-3, Claude-2, and Llama2, each contributing unique strengths to the detection task. The approach to embeddings integrates BERT embeddings to capture the nuanced linguistic features inherent in sarcastic language. For a detailed visual representation of the model’s key components, refer to Fig 4, and for a comprehensive understanding of the step-by-step process, see Algorithm 1.

3.2.1 Preprocessing.

The preprocessing stage first eliminates empty text entries to ensure the dataset only includes meaningful text data. Next, URLs are removed to reduce extraneous data and focus on each comment’s content. Duplicate text is filtered out to eliminate bias and redundancy in the dataset. Normalization is then applied to standardize the data, transforming text into a consistent format by correcting misspellings and converting to lowercase. We then perform MT to convert words from other languages into English so the model can capture contextual meaning more efficiently. Lemmatization follows, treating related terms consistently by reducing words to their base or root form. Lastly, noise such as hashtags, mentions, and special characters is removed to clean the text for improved analysis. Algorithm 2 shows the complete preprocessing process.

The SARC dataset contained comments in multiple languages, including French and Spanish, identified using language detection algorithms. Machine translation was performed using the Google Translate API, chosen for its broad language support and robust performance. To ensure the robustness of the multilingual sarcasm detection model, we evaluated the quality of the Google Translate system using BLEU (Bilingual Evaluation Understudy) and TER (Translation Edit Rate) metrics.

A BLEU score of 0.72 and a TER score of 0.18 were achieved, indicating satisfactory translation quality for our domain. These results suggest that the MT system introduces minimal noise, thus ensuring that sarcasm detection is not significantly affected by poor translation quality. Alternatively, the translated text was validated through random sampling to ensure semantic consistency. While machine translation improved model performance by enabling uniform data processing, minor inaccuracies in translation occasionally impacted context-sensitive sarcasm detection, as reflected in slight variations in precision and recall scores.
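For illustration, a minimal sentence-level BLEU can be computed from modified n-gram precisions and a brevity penalty. This toy implementation is only a sketch of the metric, not the evaluation toolkit used in the study (production evaluations typically rely on sacrebleu or NLTK).

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=2):
    """Minimal BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped counts
        precisions.append(overlap / max(sum(c_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```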

Algorithm 1 Sarcasm detection algorithm.

Input: Sarcasm Comments (c)

Output: Model output

1: Procedure(SARCASMDETECTIONMODEL, c)

2: Step 1: Preprocessing

3:   Remove empty text, URLs, duplicate text, and special characters; apply machine translation, lemmatization, and tokenization

4: Step 2: Splitting Data

5:   Split data into train and test sets

6: Step 3: Feature Extraction

7:   Map tokenized words to continuous vector representations using BERT embeddings

8: Step 4: Model Layer

9:   Implement advanced models such as BERT, GPT-3, Claude-2, or Llama-2 for sarcasm detection

10: Step 5: Dense Layers with Activation

11:   Multiple fully connected layers with appropriate activation functions

12: Step 6: Evaluation Metrics

13:   Evaluate the model using metrics such as Precision, Recall, and F1 Score

14: return Model output and evaluation metrics

Algorithm 2 Preprocess sarcastic comments.

Input: A set of sarcastic comments

Output: Preprocessed comments ready for BERT tokenization

1: Initialize an empty list ProcessedComments

2: for each Comment in Comments do

3:   Initialize an empty list TranslatedWords

4:   for each Word in SPLIT(Comment) do

5:     DetectedLanguage ← DETECTLANGUAGE(Word)

6:     if DetectedLanguage ≠ ‘en’ then

7:       TranslatedWord ← TRANSLATETOENGLISH(Word, DetectedLanguage)

8:     else

9:       TranslatedWord ← Word

10:    end if

11:    APPEND(TranslatedWords, TranslatedWord)

12:   end for

13:   VERIFYTRANSLATIONS(SPLIT(Comment), TranslatedWords)

14:   TranslatedComment ← JOIN(TranslatedWords)

15:   Tokens ← TOKENIZEWITHBERT(TranslatedComment)

16:   TokenIndices ← CONVERTTOKENSTOINDICES(Tokens)

17:   APPEND(ProcessedComments, TokenIndices)

18: end for

19: return ProcessedComments
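A Python sketch of Algorithm 2 follows. Language detection, the Google Translate API call, and BERT tokenization are replaced here by stubs and a toy vocabulary, so the snippet illustrates only the control flow, not the actual services used in the study.

```python
import re

def detect_language(word):
    """Stub language detector: pretend 'bonjour' is French, all else English."""
    return "fr" if word == "bonjour" else "en"

def translate_to_english(word, lang):
    """Stub for the machine-translation API call."""
    return {"bonjour": "hello"}.get(word, word)

# Toy vocabulary standing in for the BERT tokenizer's vocab.
VOCAB = {"hello": 1, "world": 2, "[UNK]": 0}

def preprocess(comments):
    processed = []
    for comment in comments:
        words = []
        for word in re.findall(r"\w+", comment.lower()):
            lang = detect_language(word)
            words.append(translate_to_english(word, lang) if lang != "en" else word)
        tokens = " ".join(words).split()  # stand-in for BERT tokenization
        processed.append([VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens])
    return processed

print(preprocess(["Bonjour world!"]))  # [[1, 2]]
```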

3.2.2 Feature extraction.

Our approach used several embedding techniques to enhance sarcasm detection by capturing semantic and contextual information from the text. Word2Vec [54] captures semantic relationships between words by considering their context in large corpora. FastText [55] improves the understanding of word morphology and handles out-of-vocabulary terms by incorporating subword information. Paragram [56] ensures that words with similar meanings have close representations, preserving semantic similarity in the embeddings. However, the most significant performance improvement was achieved through BERT embeddings, which capture contextual and semantic meanings by considering the surrounding words in a sentence.

BERT embeddings play a pivotal role in transforming preprocessed text into dense vectors, leveraging a pre-trained transformer model. These embeddings provide a nuanced understanding of the intricate relationships between words, thereby enhancing the model’s contextual awareness. While experiments with Word2Vec, FastText, and Paragram offered valuable insights, these embeddings focus on either semantic or contextual meaning but not both simultaneously. Our experiments confirmed that BERT embeddings outperformed the others in capturing context, yielding the most significant results in sarcasm detection.

Each embedding technique captures different linguistic features: Word2Vec and Paragram represent semantic relationships based on contextual usage in large corpora; fastText handles word morphology and out-of-vocabulary words through subword information; and BERT embeddings capture both contextual and sentence-level meaning, making them particularly useful for detecting the subtle nuances in sarcastic comments. We compared all embeddings in terms of precision, recall, F1 score, and accuracy. The results demonstrated that BERT embeddings consistently outperformed Word2Vec, fastText, and Paragram across all evaluation metrics, particularly in distinguishing nuanced sarcasm from non-sarcastic comments, while Word2Vec and Paragram still contributed useful semantic signals.
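Embedding comparisons of this kind typically reduce to cosine similarity between word vectors. The two-dimensional vectors below are invented for illustration; real Word2Vec or BERT vectors have hundreds of dimensions, and a contextual model would additionally assign the same word different vectors in different sentences.

```python
import math

# Invented static embeddings: one fixed vector per word.
STATIC = {
    "cold":   [0.9, 0.1],
    "chilly": [0.8, 0.2],
    "warm":   [-0.7, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Semantically close words should score higher than distant ones.
print(cosine(STATIC["cold"], STATIC["chilly"]) > cosine(STATIC["cold"], STATIC["warm"]))
```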

3.2.3 Model development.

The model development stage of the sarcasm detection pipeline is crucial because it explores various architectures to enhance the model’s sensitivity to subtle details. A thorough evaluation of the following models was conducted: BERT, GPT-3, Claude-2, and Llama 2. Every model utilizes feature representations obtained during preprocessing and feature extraction, with a focus on BERT embeddings. The details of each model will be covered in more detail in the following sections, emphasizing how unique their architectures are and how they help enhance sarcasm detection abilities.

BERT with Multi-Head Attention:

BERT [33] is well known for efficiently capturing contextual information across complete input sequences. This property allows BERT to function with and without preprocessing, enabling an examination of how including or excluding preprocessing steps affects the model’s predictions. Such an investigation requires understanding how BERT retains and uses semantic and contextual subtleties that could otherwise be lost during standard preprocessing. The structure of BERT comprises multiple layers of transformer blocks, each consisting of feedforward neural networks and self-attention processes. Here, L denotes the number of transformer layers and H the number of attention heads within each layer.

BERT’s multi-head attention enables parallel focus on multiple semantic components, enhancing its comprehension of complex textual cues. For more complex pattern capturing, we set the Number of Attention Heads to 16. Eq (1) illustrates how important this method is for capturing complex patterns and connections among tokens.

OUT_k = MHA_k(X) = Concat(head_1, …, head_H) · W_O    (1)

where MHA represents the multi-head attention mechanism, OUT_k is the output of the k-th multi-head attention layer applied to the input sequence, and each multi-head attention layer (HAT) consists of H parallel attention heads. For each head a, the attention scores between token m and token n are computed through Eq (2):

α_a(m, n) = softmax( (q_m · k_n) / √d_k )    (2)

where d_k is the dimension of the query and key vectors.

The output for each token m is calculated by computing a weighted sum of the value vectors, using the attention scores as weights, as illustrated through Eq (3):

z_m = Σ_n α_a(m, n) · v_n    (3)

The final output of the BERT model is derived by applying a linear transformation to the output of the final multi-head attention layer, as shown through Eq (4):

Y = Linear(OUT_L) = OUT_L · W_f + b_f    (4)

where Linear(·) denotes a linear transformation. Our main focus was to capture more complex semantic and contextual information from the text. We fine-tuned several hyperparameters to evaluate the effect of each value; after careful analysis, the hyperparameters in Table 4 produced the best results.

Table 4. BERT training Hyperparameters for fine-tuning.

https://doi.org/10.1371/journal.pone.0334120.t004

It is essential to clarify that multi-head attention is an inherent component of BERT’s transformer architecture and is not introduced as a novel feature in this study. Instead, our contribution lies in empirically evaluating whether customizing the number of attention heads during fine-tuning improves model sensitivity to sarcasm. By varying this parameter (e.g., using 4, 8, 12, or 16 heads), we investigate whether enhanced attention diversity improves the model’s ability to capture pragmatic incongruity and contextual dependencies, both of which are crucial for understanding sarcasm.
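As a concrete illustration of the mechanism in Eqs (1)-(4), the following is a minimal NumPy sketch of multi-head attention with a configurable number of heads (16 in our BERT configuration). The random weight matrices and dimensions are toy values for illustration only, not the fine-tuned model’s parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, H, rng):
    """Split the model dimension into H heads, attend, concatenate, project."""
    T, D = X.shape
    d_k = D // H  # per-head dimension of the query/key vectors
    # Toy random projections; in BERT these are learned weight matrices.
    Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for a in range(H):
        s = slice(a * d_k, (a + 1) * d_k)
        scores = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))  # attention scores
        heads.append(scores @ V[:, s])  # weighted sum of value vectors
    return np.concatenate(heads, axis=-1) @ Wo  # final linear projection

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64))  # 8 tokens, 64-dim toy embeddings
out = multi_head_attention(X, H=16, rng=rng)
print(out.shape)  # (8, 64): one contextualized vector per token
```

Each head attends over a distinct slice of the representation, which is what lets attention diversity grow with the head count.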

GPT-3:

GPT-3 [57] is a state-of-the-art model well known for its human-like language understanding in various settings, including text generation. It is based on the transformer architecture, which excels at capturing complex dependencies and contextual interactions in textual input. This design is particularly useful for applications where comprehending contextual clues and subtle linguistic nuances is essential, such as sarcasm detection. GPT-3 produces contextualized representations of the input text as a whole; owing to this contextual awareness, it can infer the underlying meaning and intent of sarcastic remarks, which often rely on interpretations that go beyond the literal reading.

Detecting sarcasm with GPT-3 is a viable choice due to its extensive training on textual data and its advanced transformer architecture. Here, we outline a comprehensive approach to sarcasm detection using GPT-3, detailing each step with the relevant derivation. GPT-3 receives input tokens, which are converted into high-dimensional vectors using the BERT embedding matrix W_e, as illustrated through Eq (5).

E = X · W_e    (5)

where E is the resulting embedding matrix, W_e is the learnable embedding matrix, and X is the one-hot encoded token matrix.
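Eq (5) amounts to an embedding lookup: multiplying a one-hot token matrix by W_e selects the corresponding rows of the embedding matrix. A toy NumPy sketch (vocabulary size and dimensions are illustrative):

```python
import numpy as np

# Eq (5): E = X · W_e, with a toy vocabulary of 5 tokens and 4-dim embeddings
vocab_size, d_model = 5, 4
rng = np.random.default_rng(1)
W_e = rng.standard_normal((vocab_size, d_model))  # learnable embedding matrix
token_ids = [2, 0, 3]                             # toy token sequence
X = np.eye(vocab_size)[token_ids]                 # one-hot encoded token matrix
E = X @ W_e                                       # embedded sequence
print(E.shape)  # (3, 4)
```

In practice the matrix product is never materialized; frameworks index W_e directly, which is mathematically identical.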

The attention mechanism enables the model to weigh the importance of each token relative to the others, which is crucial for capturing the nuances of sarcasm. For each token, three vectors are computed: Query Q, Key K, and Value V.

Then, the attention scores are calculated as shown in Eq (6) below.

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V    (6)

Multiple attention heads enable the model to focus on different aspects of the input sequence simultaneously, as shown through Eq (7).

MultiHead(Q, K, V) = Concat(head_1, …, head_H) · W_O    (7)

After that, each token’s attended representation is processed through a feed-forward neural network as illustrated through Eq (8).

FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2    (8)

The attention and feed-forward steps are repeated across multiple layers, allowing the model to capture complex patterns in the text. The final step involves producing an output that determines whether the input text is sarcastic. This is achieved by passing the processed vectors through a linear layer followed by a softmax function, as shown in Eq (9).

ŷ = softmax(h_L · W_y + b_y)    (9)

where h_L is the hidden state from the last layer, W_y is the output weight matrix, and b_y is the bias.

To specialize GPT-3 for sarcasm detection, we fine-tune the model on a labeled dataset of sarcastic and non-sarcastic texts. The process involves tokenizing the dataset and converting tokens to embeddings. Using supervised learning, the model adjusts its weights. The loss function minimized during training is given in Eq (10).

L = −(1/N) Σ_i [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]    (10)

where y_i is the true label and ŷ_i is the predicted probability for the i-th example.
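Eq (10) is the standard binary cross-entropy loss; a minimal NumPy sketch with toy labels and predicted probabilities (not values from our training runs):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Eq (10): L = -(1/N) * sum_i [y_i*log(p_i) + (1-y_i)*log(1-p_i)]."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # guard against log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

# Toy labels (1 = sarcastic) and predicted probabilities
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.7])
print(round(binary_cross_entropy(y, p), 4))  # 0.1976
```

The loss falls toward zero as predicted probabilities approach the true labels, which is what gradient descent exploits during fine-tuning.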

Sarcasm often relies on context, tone, and sometimes external knowledge. GPT-3’s large-scale training on diverse datasets helps it capture these nuances. During inference, the model generates a probability distribution over possible outputs, including sarcastic and non-sarcastic ones.

P(sarcastic | x_1, …, x_T) = ŷ_sarcastic    (11)

where ŷ_sarcastic is the component of the softmax output corresponding to the sarcastic class, i.e., the probability of the text being sarcastic given the input sequence x_1, …, x_T.

This approach leverages the capabilities of GPT-3 to provide a robust framework for detecting sarcasm in text, capturing the nuanced characteristics that distinguish sarcastic expressions. Table 5 shows the ranges of fine-tuned hyperparameter values used to achieve the best results.

Table 5. GPT-3 Training hyperparameter for fine-tuning.

https://doi.org/10.1371/journal.pone.0334120.t005

Claude-2:

Claude-2 [58] was created to comprehend and produce human language accurately. Its sophisticated architecture is used to identify sarcasm by understanding the finer points and contextual details in textual data, such as comments on SARC. Claude-2 features a multi-layered transformer design that captures intricate patterns and long-range dependencies in text, unlike standard models. This feature is essential for sarcasm recognition, as it allows the model to analyze the larger context and identify the underlying sarcasm in comments.

Claude-2 processes the input text to produce a contextualized representation, denoted as u. This representation undergoes a final classification step to predict the probability that a given comment is sarcastic. The prediction is obtained through a sigmoid activation function applied to a linear transformation of u, parameterized by a learnable weight matrix W and a bias term b.

Refer to Eq (12) for the sarcasm detection mechanism using Claude-2:

P(sarcastic | c) = σ(W · u + b)    (12)

where P(sarcastic | c) represents the probability of a comment c being sarcastic given the comment itself, σ is the sigmoid activation function that maps the linear output to a probability score between 0 and 1, W is a learnable weight matrix that adjusts the contribution of the contextualized representation u, u is the contextualized representation of the input text obtained from Claude-2, capturing the nuanced semantics and structure of the comment, and b is a bias term that adjusts the prediction.

Claude-2 utilizes advanced language modeling techniques to encode the input text into a representation u that encapsulates the contextual understanding crucial for sarcasm detection. The model leverages its ability to process and interpret complex language structures, enhancing its performance across various NLP tasks, including sarcasm detection. The exact parameters and architectural details of Claude-2 have not been publicly released; therefore, we utilize the default pre-trained model for our task.
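Eq (12) reduces to a sigmoid over a linear transformation of the contextualized representation. The sketch below uses toy values for u, W, and b purely for illustration; they are not parameters of any actual model.

```python
import numpy as np

def sarcasm_probability(u, W, b):
    """Eq (12): P(sarcastic | comment) = sigmoid(W.u + b)."""
    z = float(W @ u + b)             # linear transformation of the representation
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid maps z to (0, 1)

u = np.array([0.4, -1.2, 0.7])  # toy contextualized representation
W = np.array([1.0, -0.5, 2.0])  # toy learnable weights
b = -0.3                        # toy bias term
p = sarcasm_probability(u, W, b)
print(round(p, 3))  # 0.891
```

A threshold (typically 0.5) on this probability yields the final sarcastic/non-sarcastic label.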

Llama 2:

Llama 2 [34] represents a cutting-edge advancement in NLP, specifically designed to detect sarcasm in textual data, such as social media comments. It is based on a transformer-based architecture that excels at capturing fine-grained contextual nuances and semantic relationships inherent in language. At its core, Llama 2 comprises multiple transformer blocks organized into L layers, each equipped with H attention heads. This architecture empowers Llama 2 to efficiently process and integrate information across different parts of the input sequence, facilitating a comprehensive understanding of linguistic nuances crucial for detecting sarcasm.

The multi-head attention mechanism in Llama 2 allows simultaneous focus on diverse aspects of the input text, enhancing its ability to grasp intricate patterns and dependencies among tokens. Eq (13) illustrates the pivotal role of multi-head attention in Llama 2:

OUT_k = MHA_k(X) = Concat(head_1, …, head_H) · W_O    (13)

Here:

  • MHA: Multi-Head Attention mechanism
  • OUTk: Output of the k-th multi-head attention layer applied to input sequence
  • HAT: Each multi-head attention layer consists of H parallel attention heads

In each attention head a, attention scores between token m and token n are computed using a softmax function to effectively capture token interactions, as depicted in Eq (14):

α_a(m, n) = softmax( (q_m · k_n) / √d_k )    (14)

where d_k denotes the dimensionality of the query and key vectors.

The output for each token m is derived through a weighted sum of the value vectors v_n, utilizing the computed attention scores, as shown in Eq (15):

z_m = Σ_n α_a(m, n) · v_n    (15)

The final output of Llama 2 is obtained by applying a specialized linear transformation to the output of its last multi-head attention layer, formulated as:

OUT_final = Linear(OUT_L) = OUT_L · W_f + b_f    (16)

where Linear(·) signifies a tailored linear transformation, consolidating Llama 2’s ability to provide nuanced contextualized representations crucial for effective sarcasm detection and diverse NLP tasks. Table 6 shows the ranges of fine-tuned hyperparameter values utilized for the best results.

Table 6. Llama-2 training hyperparameters for fine-tuning.

https://doi.org/10.1371/journal.pone.0334120.t006

3.3 Training and evaluation

Optimizing the performance of the proposed DL model depends mainly on the training and evaluation stages. First, the dataset is divided into separate subsets for training and validation, with specific portions set aside for training the model and evaluating its performance. During the iterative training phase, the model repeatedly adjusts its parameters in response to the subtle semantic differences in the training set.

Key metrics, such as loss and precision, are closely monitored throughout each training epoch, providing valuable insights into the model’s performance on both the training and validation datasets. This iterative refinement ensures that the model’s capacity to recognize sarcastic comments continuously improves. After training, the model is rigorously assessed on an independent test set using performance metrics, including F1 score, precision, and recall, to determine how well it detects sarcasm. Together, these in-depth analyses provide a comprehensive assessment of the model’s robustness and effectiveness.

While earlier models, such as GPT-3, allowed fine-tuning via the OpenAI API, more advanced versions, including GPT-3.5-turbo and GPT-4, do not currently support training or fine-tuning through the same interface. As a result, an alternative model variant that supports fine-tuning was utilized, enabling us to train on the sarcasm dataset along with labeled data and initiate fine-tuning. In contrast, Claude-2 does not support traditional fine-tuning but can effectively perform sarcasm detection using a prompt-based classification approach. By providing the model with structured examples of sarcastic and non-sarcastic tweets within the prompt, Claude-2 can generalize from the context and accurately classify new inputs. This method leverages Claude-2’s strong in-context reasoning capabilities to handle sarcasm detection without requiring direct model training.

4 Results and analysis

This section describes the experimental setup and research results. This research focuses on four distinct models that were fine-tuned and evaluated. The training hyperparameters are defined in Section 3 within the relevant subsection for each proposed approach. The dataset was divided in an 80/20 ratio for training and validation to guarantee thorough model evaluation. With careful selection of training parameters and a comprehensive assessment at each phase, this setup supports the development of reliable, well-optimized models that can accurately distinguish sarcastic from non-sarcastic comments in the dataset. The experimental configuration of the implementation environment is shown in Table 7.
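The 80/20 split described above can be sketched as follows; the helper, seed, and toy data are illustrative, not the exact pipeline code used in our experiments.

```python
import numpy as np

def train_val_split(texts, labels, val_ratio=0.2, seed=42):
    """Shuffle indices and carve off the last val_ratio share for validation."""
    idx = np.random.default_rng(seed).permutation(len(texts))
    cut = int(len(texts) * (1 - val_ratio))
    train, val = idx[:cut], idx[cut:]
    return ([texts[i] for i in train], [labels[i] for i in train],
            [texts[i] for i in val], [labels[i] for i in val])

texts = [f"comment {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]  # toy sarcastic / non-sarcastic labels
X_tr, y_tr, X_va, y_va = train_val_split(texts, labels)
print(len(X_tr), len(X_va))  # 80 20
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when comparing models on identical validation data.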

Table 7. Hardware and software specifications.

https://doi.org/10.1371/journal.pone.0334120.t007

4.1 Evaluation metrics

Evaluation metrics provide a quantitative basis for assessing the performance of the proposed models. These numerical measures capture how effectively a model distinguishes sarcastic statements from non-sarcastic ones. This study uses accuracy, precision, recall, and F1 score to measure the models’ effectiveness.

As the ratio of accurately predicted cases (including true positives and true negatives) to the total number of examples examined, accuracy quantifies the overall correctness of the model’s predictions as shown through Eq (17). Alongside, Precision measures how well the model predicts positive outcomes, as observed in Eq (18). Recall assesses the model’s ability to accurately distinguish all positive cases (sarcastic comments) from the actual positives in the dataset, as illustrated through Eq (19). The F1 Score provides a fair assessment of the model’s sarcasm detection capabilities, as it is the harmonic mean of precision and recall, as shown in Eq (20).

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (17)

Precision = TP / (TP + FP)    (18)

Recall = TP / (TP + FN)    (19)

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)    (20)
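These four metrics follow directly from confusion-matrix counts; a minimal sketch with toy counts (not our experimental results):

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs (17)-(20): accuracy, precision, recall, and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy confusion-matrix counts, for illustration only
acc, prec, rec, f1 = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

Because F1 is the harmonic mean, it penalizes a large gap between precision and recall more than the arithmetic mean would.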

4.2 Experimental results

A fair comparison is essential for evaluating model performance. Experimental results show that BERT, with its bidirectional and multi-head attention mechanisms, outperforms GPT-3, Claude-2, and Llama-2 in sarcasm detection on the SARC dataset. Unlike general-purpose models, fine-tuned BERT captures sarcasm-specific cues and complex context, offering superior classification. Detailed metrics and comparisons follow.

BERT’s bidirectional attention enables it to learn context from both directions, thereby enhancing its ability to detect sarcasm. In contrast, GPT-3, being autoregressive, focuses on generation and struggles with context that depends on future words. Claude-2, not fine-tuned for sarcasm, often misclassifies neutral statements due to its generalized conversational focus and lack of sentiment sensitivity. Llama-2 performs better but lacks BERT’s sarcasm-specific fine-tuning. Its embeddings focus on general understanding rather than detecting polarity shifts.

Beyond attention mechanisms, BERT’s fine-tuning improves its sensitivity to sarcasm cues. Pretrained sarcasm embeddings further enhance context awareness. An ablation study (Table 10) confirms that removing either fine-tuning or attention mechanisms reduces performance, underscoring BERT’s strength in this task.

Table 8 and Fig 5 show that BERT with multi-head attention achieves the best overall performance in sarcasm detection, with precision, recall, and F1 score all around 0.917. Its bidirectional attention mechanism enables it to capture deep contextual dependencies, resulting in better performance than GPT-3, Claude-2, and Llama-2. GPT-3, despite its strength in generation tasks, is less effective for classification due to its autoregressive structure, which limits its ability to interpret context-dependent cues essential for sarcasm detection. Claude-2, which is not fine-tuned for this task, performs less effectively, as it is designed primarily for general conversational understanding. Llama-2 shows competitive performance but lacks the task-specific fine-tuning applied to BERT. These findings align with prior research demonstrating that fine-tuned transformer models, particularly BERT, are well-suited for handling sentiment-rich and context-sensitive language [33,39].

Table 8. Proposed models performance comparison.

https://doi.org/10.1371/journal.pone.0334120.t008

Fig 5 highlights BERT’s strength in learning contextual nuances and recognizing sarcasm. The loss curve for BERT (Fig 6) reveals a stable and consistent training trajectory. A comparison of Figs 6 and 8 confirms that both BERT and Llama-2 maintain smooth training in the later epochs. In contrast, Fig 7 shows that GPT-3 experienced irregularities during the final training stages. Nevertheless, by the 12th epoch, all models demonstrated smoother convergence, indicating a stable learning phase.

Fig 5. Comparison of sarcasm detection performance across BERT, GPT-3, Claude-2, and Llama-2 on the SARC dataset.

BERT with multi-head attention achieves state-of-the-art accuracy and F1 scores.

https://doi.org/10.1371/journal.pone.0334120.g005

Fig 6. Training and validation loss curves of BERT showing stable convergence during sarcasm detection.

https://doi.org/10.1371/journal.pone.0334120.g006

Fig 7. Training and validation loss curves of GPT-3 indicating fluctuations before reaching smoother convergence.

https://doi.org/10.1371/journal.pone.0334120.g007

Fig 8. Training and validation loss curves of Llama-2 demonstrating consistent and balanced convergence behavior.

https://doi.org/10.1371/journal.pone.0334120.g008

In terms of numerical performance, GPT-3 achieved an accuracy of 0.894, precision of 0.887, recall of 0.901, and F1 score of 0.895. Its training dynamics, shown in Fig 7, reflect a less stable learning path compared to BERT. Claude-2 and Llama-2 delivered solid results, with Claude-2 reaching 0.877 accuracy, 0.875 precision, 0.881 recall, and 0.878 F1 score. Llama-2 achieved 0.908 accuracy, 0.912 precision, 0.905 recall, and 0.908 F1 score. Fig 8 illustrates Llama-2’s smoother learning behavior, although BERT remains more effective overall.

Various advanced neural networks were employed, producing noteworthy precision, recall, and F1 scores across multiple techniques. Among these, the BERT + Multi-Head Attention approach significantly outperformed the other algorithms in precision, recall, and F1 score. This underscores the effectiveness and robustness of the proposed methodology in identifying complex semantic connections across sentences, thereby enhancing the ability to detect sarcasm.

Confusion matrices for sarcasm detection using the four models, BERT (Fig 9), GPT-3 (Fig 10), Claude-2 (Fig 11), and Llama-2 (Fig 12), are provided below. BERT demonstrates exceptional accuracy, particularly in minimizing false negatives (10,790) and false positives (10,648), indicating its high precision and recall for both classes. GPT-3 exhibits a relatively balanced performance but suffers from higher false negatives (12,870) and false positives (14,625) compared to BERT, suggesting a slightly reduced sensitivity to sarcasm. Claude-2 and Llama-2 also exhibit robust performance, but Claude-2 records significantly higher false positives (16,361) and false negatives (15,470), which may indicate a challenge in generalizing sarcasm detection. Conversely, Llama-2 exhibits a balanced performance, with false positives (11,352) and false negatives (12,350) closer to those of GPT-3. However, it slightly outperforms GPT-3 in accurately detecting sarcasm. Overall, while all models perform well, BERT excels in precision and recall, highlighting its superior capability for sarcasm detection in the Reddit corpus.

Fig 9. Confusion matrix of BERT model showing high accuracy with minimal false positives and false negatives.

https://doi.org/10.1371/journal.pone.0334120.g009

Fig 10. Confusion matrix of GPT-3 model indicating balanced performance but with higher misclassifications than BERT.

https://doi.org/10.1371/journal.pone.0334120.g010

Fig 11. Confusion matrix of Claude-2 model illustrating greater false positives and false negatives, reducing detection reliability.

https://doi.org/10.1371/journal.pone.0334120.g011

Fig 12. Confusion matrix of Llama-2 model reflecting competitive performance with moderate misclassification rates.

https://doi.org/10.1371/journal.pone.0334120.g012

With an emphasis on essential assessment criteria such as precision, recall, and F1 score, Table 9 provides a thorough comparison of various sarcasm detection methods. Gregory et al. [38], one of the renowned research projects cited, achieved a precision of 0.758, recall of 0.767, and F1 score of 0.756; this approach lacks multi-head attention and advanced fine-tuning, resulting in lower precision and recall. Kumar et al. [39] achieved a recall of 0.591, F1 score of 0.651, and precision of 0.724; this work employs simpler embeddings for feature capture, which results in a poorer contextual understanding of sarcasm. With precision, recall, and F1 scores of 0.820, 0.816, and 0.818, respectively, Rathod et al. [59] showed robust performance, but the approach is limited by the inability of its CNN and LSTM architectures to model long-range dependencies. Parkar et al. [60] attained an F1 score of 0.650, a precision of 0.670, and a recall of 0.760; this approach is over-reliant on Word2Vec embeddings, which lack contextual richness. According to Sonare et al. [61], the F1 score was 0.715, the precision was 0.700, and the recall was 0.731; this work struggles with longer sequence lengths and noise in social media data. In contrast, the BERT + Multi-Head Attention mechanism outperformed the rest with a remarkable F1 score (0.917), precision (0.918), and recall (0.917). These outcomes demonstrate the significant improvement of the proposed model over current approaches in detecting sarcasm. The comparative performance of BERT + Multi-Head Attention against state-of-the-art methods is shown in Fig 13.

Fig 13. Precision, recall, and F1 score comparison between the proposed method and prior approaches.

The fine-tuned BERT with multi-head attention significantly outperforms state-of-the-art baselines.

https://doi.org/10.1371/journal.pone.0334120.g013

Table 9. Comparison of approaches (SARC dataset).

https://doi.org/10.1371/journal.pone.0334120.t009

The superior performance of our proposed BERT-based model with Multi-Head Attention over the compared approaches can be attributed to several critical advancements. First, while previous studies relied on traditional embeddings, such as Word2Vec and GloVe, or utilized default transformer configurations, our approach fine-tunes BERT with sixteen attention heads, extended sequence lengths (128 tokens), and optimized hyperparameters. This ensures the capture of intricate contextual and semantic dependencies specific to sarcasm detection. Second, the inclusion of a machine translation (MT) step addresses the multilingual nature of social media data, ensuring consistency and reducing noise, a factor often neglected in previous works. Finally, our preprocessing pipeline removes redundancies and normalizes text, which, combined with BERT’s bidirectional context-aware embeddings, enables more robust sarcasm detection. These enhancements collectively overcome the limitations of the referenced methods, which either lacked advanced contextual learning capabilities or did not adequately handle noisy and multilingual datasets, as evidenced by their lower F1 scores, precision, and recall metrics.

4.3 Ablation study

In this study, we conduct an ablation analysis comparing configurations with Machine Translation (MT) and Without Machine Translation (WMT) to assess the performance impact of translating sarcastic comments on the detection models. We evaluate each model under both configurations, and the results of the ablation study are summarized in Table 10.

Table 10. Ablation study for sarcasm detection models.

https://doi.org/10.1371/journal.pone.0334120.t010

The ablation results demonstrate the value of applying MT to non-English sarcastic comments before classification. Across all models, configurations using MT consistently outperformed their WMT counterparts. The performance improvements were most notable in the BERT model, which achieved gains of 0.012 in precision and 0.011 in F1 score when MT was applied. These results suggest that translating multilingual inputs into English enhances semantic consistency and improves the model’s ability to understand and detect sarcasm accurately. This is especially relevant when working with datasets like SARC, which include diverse linguistic patterns and idiomatic expressions that may not be equally understood across languages. In summary, Machine Translation proves to be a valuable preprocessing step in sarcasm detection pipelines. Its consistent impact across multiple architectures highlights its potential as a generalizable strategy for improving multilingual sarcasm classification.

5 Discussion

5.1 Quantitative evaluation of models

This study systematically evaluated various LLMs for sarcasm detection, employing multiple techniques rather than relying on a single approach. Initial experiments involved directly inputting sarcastic text into models such as BERT, GPT-3, Claude-2, and Llama-2, which yielded only modest performance, with precision, recall, and F1 scores ranging from 0.50 to 0.60. Machine Translation (MT) was then applied to standardize multilingual inputs into English, yielding slight improvements across all models. Notably, BERT and GPT-3 demonstrated enhanced performance after MT, achieving scores of up to 0.67 on evaluation metrics, whereas Claude-2 and Llama-2 ranged between 0.52 and 0.57. To further improve results, we explored various embedding techniques, including Word2Vec, fastText, and Paragram, to capture semantic and subword relationships, as well as BERT embeddings for contextual understanding. These embeddings notably improved model performance, with BERT, GPT-3, and Llama-2 reaching between 0.73 and 0.85 on all metrics (see Table 8).

Subsequent fine-tuning of the models, excluding Claude-2 due to its closed configuration, yielded further gains. For BERT, we introduced architectural enhancements, including increasing the number of attention heads from 12 to 16 and transformer blocks from 12 to 16, as well as the sequence length from 128 to 512, to accommodate longer sarcastic expressions. The learning rate was also adjusted to 3e−5, resulting in a significant boost in predictive accuracy. GPT-3 and Llama-2 were also fine-tuned, achieving their best results using optimized hyperparameters detailed in Tables 5 and 6. Among all models, the fine-tuned BERT emerged as the top performer, effectively leveraging its multi-head attention mechanism to detect nuanced sarcastic cues in context. In contrast, GPT-3 occasionally struggled with subtle irony, while Claude-2 showed the least improvement due to tuning limitations. Llama-2 exhibited strong performance and balanced precision-recall tradeoffs, confirming its robustness in handling complex sarcastic expressions.

5.2 Broader theoretical and practical implications

This study contributes to the growing body of research examining how NLP models interact with figurative and pragmatic language phenomena, including sarcasm. For example, Ghosh et al. [24] and Gregory et al. [38] emphasized the importance of pragmatic incongruity and contextual embeddings in understanding sarcasm. Our study builds on these foundations by demonstrating that a fine-tuned BERT model with optimized attention heads significantly improves sensitivity to these nuanced cues. Unlike traditional methods that rely on surface-level sentiment reversal or lexical surprise, our model empirically confirms that deeper contextual modeling, achieved through multi-head attention and bidirectional encoding, aligns with these theories and offers practical advancements in sarcasm detection.

The effectiveness of the proposed model provides both theoretical and practical insights into sarcasm as a linguistic and psychological phenomenon. Sarcasm often involves a contradiction between literal and intended meaning, aligning with pragmatic theories such as Grice’s Maxim of Quality [62], which is intentionally violated to convey irony. The model’s ability to capture these nuances suggests that sarcasm is fundamentally context-dependent and benefits from advanced representations of long-range semantic dependencies, as supported by research on contextual embeddings [23,33]. From a cognitive psychology perspective, sarcasm requires Theory of Mind, or mental state inference [63], where the listener must infer the speaker’s intent through indirect cues. This inferential process is paralleled by the fine-tuned attention mechanisms in BERT, which attend to multiple semantic layers in the text [40,42].

Beyond theoretical contributions, improved sarcasm detection has meaningful applications in real-world scenarios. In marketing and brand management, it helps prevent the misinterpretation of sarcastic reviews [64], preserves analytical integrity, and protects brand reputation. Sarcasm-aware chatbots in customer service can enhance user interactions by responding more empathetically [65]. At the same time, UX designers can leverage sarcasm detection to mitigate the spread of hostile or ironic content on platforms [66]. Additionally, behavioral researchers and psychologists can utilize sarcasm-labeled datasets to analyze humor, digital expression, and social cognition in online communication [67]. These interdisciplinary implications reinforce the broader societal and academic relevance of our proposed framework.

5.3 Evaluating models for real-time scenarios

To evaluate the practicality of the proposed model in real-time scenarios, we analyzed its inference speed (IS), memory usage (MU), Feasibility on Edge Devices (FED), and Production Suitability (PS). The proposed model processes an average of 40 comments per second on a GPU-enabled environment (NVIDIA RTX 2060 Super). In comparison, GPT-3 processes 25 comments per second, Claude-2 20 comments per second, and Llama-2 40 comments per second, highlighting BERT’s superior balance of speed and accuracy. While BERT’s memory usage during inference is approximately 10 GB, lighter versions could be explored for resource-constrained devices. Table 11 summarizes the comparison of inference speed, memory usage, and feasibility across the models. These results affirm BERT’s suitability for production environments, though future work should focus on reducing memory usage and further training on diverse sarcastic expressions to improve structural generalization.
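Throughput figures of the kind reported above can be obtained with a simple timing helper; `measure_throughput` and the dummy predictor below are an illustrative sketch, not our actual benchmarking harness.

```python
import time

def measure_throughput(predict, comments):
    """Average comments classified per second by predict()."""
    start = time.perf_counter()
    for c in comments:
        predict(c)
    elapsed = time.perf_counter() - start
    return len(comments) / elapsed

# Dummy predictor standing in for a real model's inference call
comments = ["oh great, another meeting that could have been an email"] * 1000
rate = measure_throughput(lambda c: "great" in c, comments)
print(rate > 0)  # True
```

For GPU models, batching the comments rather than looping one by one typically raises measured throughput substantially.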

Table 11. Comparative analysis of inference speed and resource consumption across models.

https://doi.org/10.1371/journal.pone.0334120.t011

5.4 Error analysis and model performance evaluation

We conducted a detailed error analysis to understand the model’s performance better, focusing on the false positives and negatives of each model. BERT outperformed other models, minimizing false positives (299) and false negatives (450), thanks to its bidirectional attention mechanism that effectively captures subtle contextual cues. However, models like GPT-3 and Claude-2 exhibited significantly higher false positives (GPT-3: 16,874; Claude-2: 18,878) and false negatives (GPT-3: 14,850; Claude-2: 17,850), indicating difficulties in detecting sarcasm, particularly in more complex or ambiguous contexts. Llama-2 showed a more balanced performance but still lagged behind BERT in accuracy, with false positives (13,098) and false negatives (14,250) remaining higher than desired.

The errors primarily stemmed from each model’s inability to capture the context-dependent and nuanced nature of sarcasm fully. While BERT excelled due to its fine-tuned architecture, including extended sequence lengths and multi-head attention, GPT-3 and Claude-2 struggled with sarcastic expressions that relied on conflicting sentiments or indirect cues. These challenges underscore the need for task-specific fine-tuning and additional enhancements in contextual embeddings to minimize misclassification. To assess the model’s attention to each word, we evaluate it through an attention heatmap (Fig 14), which clearly shows how much attention is being paid to each word.

Fig 14. Attention heatmap from the fine-tuned BERT model, showing how multi-head attention highlights key tokens in a sarcastic sentence (e.g., ‘oh great,’ ‘another meeting,’ ‘endless hours’).

The visualization demonstrates the model’s ability to focus on lexical cues and contextual dependencies that signal sarcasm, providing interpretability into how the model distinguishes sarcastic from non-sarcastic text.

https://doi.org/10.1371/journal.pone.0334120.g014

5.5 Statistical validation of results

We performed several statistical analyses to ensure the robustness and reliability of the observed differences in model performance. First, paired t-tests were conducted to assess the statistical significance of differences in precision, recall, F1 scores, and accuracy between the fine-tuned models (BERT, GPT-3, Claude-2, and Llama-2). The p-values for all key comparisons were less than 0.05, indicating statistically significant differences in model performance.
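A paired t-test of this kind can be computed with the standard library alone; the per-fold F1 scores below are hypothetical stand-ins, not the paper’s raw data, and the critical value 2.776 is the two-tailed 0.05 threshold for df = 4:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for a paired t-test."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-fold F1 scores for two models:
bert_f1 = [0.915, 0.918, 0.917, 0.916, 0.919]
gpt3_f1 = [0.701, 0.695, 0.710, 0.698, 0.705]
t, df = paired_t_statistic(bert_f1, gpt3_f1)
# With df = 4, |t| > 2.776 rejects H0 at the 0.05 level (two-tailed).
significant = abs(t) > 2.776
```

In practice a library routine such as `scipy.stats.ttest_rel` would also return the exact p-value rather than a threshold comparison.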

Additionally, 95% confidence intervals for each performance metric were computed to quantify the precision of our results. The narrow confidence intervals for each metric confirmed the reliability of the performance estimates. For example, the 95% CI for BERT’s F1 score ranged from 0.916 to 0.918, indicating high confidence in its superior performance.
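A 95% confidence interval of this form can be sketched with the normal approximation (z = 1.96); the run-level F1 scores below are hypothetical and chosen only to illustrate an interval of the reported width:

```python
import math
from statistics import mean, stdev

def ci95(scores):
    """95% CI for the mean, normal approximation (z = 1.96)."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return m - half, m + half

f1_runs = [0.915, 0.918, 0.917, 0.916, 0.919]  # hypothetical run-level F1
lo, hi = ci95(f1_runs)
```

For small samples, substituting the Student t critical value for 1.96 gives a slightly wider, more conservative interval.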

Finally, Cohen’s d effect size was calculated for each model comparison to quantify the magnitude of performance differences. The effect size for the BERT versus GPT-3 comparison was 1.4, indicating a large and meaningful difference. Together, these analyses confirm that the observed differences in model performance are both statistically significant and practically substantial.
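Cohen’s d with a pooled standard deviation can be computed as follows; the per-run accuracies are hypothetical placeholders, and by the usual convention d > 0.8 is interpreted as a large effect:

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                       / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

# Hypothetical per-run accuracies for two models:
model_a = [0.916, 0.918, 0.917, 0.915, 0.919]
model_b = [0.700, 0.705, 0.698, 0.702, 0.701]
d = cohens_d(model_a, model_b)
```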

6 Conclusion and future work

This study presents a robust framework for sarcasm detection on social media, introducing a customized BERT model enhanced with multi-head attention, machine translation preprocessing, and task-specific fine-tuning. Experimental evaluations using the SARC dataset and benchmark comparisons against GPT-3, Claude-2, and Llama-2 confirm that our model consistently outperforms these state-of-the-art LLMs across accuracy, precision, recall, and F1 score. The findings highlight the significance of attention optimization and contextual embeddings in transformer-based models, demonstrating that tailored adaptations are more effective than general-purpose architectures for capturing the nuanced semantics of sarcasm. This work not only contributes to methodological advancements in sarcasm detection but also offers practical value for real-world applications such as social media moderation, sentiment-aware customer service bots, and behavioral analysis systems. Furthermore, the integration of machine translation shows the model’s potential for scalable, cross-lingual deployment.

Despite its strengths, the study has a few limitations. Automatic translation can introduce semantic noise in low-resource languages, which can hinder the accuracy of sarcasm recognition. Moreover, sarcasm is often culturally rooted and pragmatically nuanced, relying on idioms, tonal shifts, and context-specific cues that are not always preserved through automatic translation. Prior linguistic research [64,68] suggests that irony and sarcasm are highly context-dependent and may lose their intended effect when translated across languages. While our ablation studies showed modest improvements using machine translation, we recognize that future research should empirically assess the fidelity of sarcasm preservation in translated content. Additionally, the model is trained solely on textual input and does not consider multimodal cues, such as emojis, images, or speech tone, which are often crucial for detecting sarcasm in online communication. Future research could explore the integration of such multimodal features and investigate the use of few-shot learning with newer models, such as DeBERTa-v3 or XLM-RoBERTa, to enhance performance in low-data scenarios. Exploring sarcasm-aware conversational agents and human-in-the-loop systems may also open up exciting directions for adaptive, empathetic AI interactions. These pathways offer rich opportunities to further advance the theoretical and practical landscape of sarcasm detection.

References

  1. Bharti SK, Naidu R, Babu KS. Hyperbolic feature-based sarcasm detection in tweets: a machine learning approach. In: 2017 14th IEEE India Council International Conference (INDICON). 2017. p. 1–6. https://doi.org/10.1109/indicon.2017.8487712
  2. Vinoth D, Prabhavathy P. An intelligent machine learning-based sarcasm detection and classification model on social networks. J Supercomput. 2022;78(8):10575–94.
  3. Abulaish M, Kamal A. Self-deprecating sarcasm detection: an amalgamation of rule-based and machine learning approach. In: 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). 2018. p. 574–9.
  4. Ashok DM, Nidhi Ghanshyam A, Salim SS, Burhanuddin Mazahir D, Thakare BS. Sarcasm detection using genetic optimization on LSTM with CNN. In: 2020 International Conference for Emerging Technology (INCET). 2020. p. 1–4. https://doi.org/10.1109/incet49848.2020.9154090
  5. Mandal PK, Mahto R. Deep CNN-LSTM with word embeddings for news headline sarcasm detection. In: 16th International Conference on Information Technology-New Generations (ITNG 2019). Springer; 2019. p. 495–8.
  6. Dutta S, Mehta A. Unfolding sarcasm in twitter using c-rnn approach. Bulletin of Computer Science and Electrical Engineering. 2021;2(1):1–8.
  7. Salim SS, Ghanshyam AN, Ashok DM, Mazahir DB, Thakare BS. Deep LSTM-RNN with word embedding for sarcasm detection on Twitter. In: 2020 International Conference for Emerging Technology (INCET). 2020. p. 1–4.
  8. Jain D, Kumar A, Garg G. Sarcasm detection in mash-up language using soft-attention based bi-directional LSTM and feature-rich CNN. Applied Soft Computing. 2020;91:106198.
  9. Kamal A, Abulaish M. CAT-BiGRU: convolution and attention with bi-directional gated recurrent unit for self-deprecating sarcasm detection. Cogn Comput. 2021;14(1):91–109.
  10. Chy MSR, Chy MSR, Mahin MRH, Rahman MM, Hossain MS, Rasel AA. Sarcasm detection in news headlines using evidential deep learning-based LSTM and GRU. In: Asian Conference on Pattern Recognition. 2023. p. 194–202.
  11. Porwal S, Ostwal G, Phadtare A, Pandey M, Marathe MV. Sarcasm detection using recurrent neural network. In: 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS). 2018. p. 746–8.
  12. He P, Liu X, Gao J, Chen W. DeBERTa: decoding-enhanced BERT with disentangled attention. arXiv preprint 2020. https://arxiv.org/abs/2006.03654
  13. Wallace BC. Computational irony: a survey and new perspectives. Artif Intell Rev. 2013;43(4):467–83.
  14. Eke CI, Norman AA, Liyana S, Nweke HF. Sarcasm identification in textual data: systematic review, research challenges and open directions. Artif Intell Rev. 2019;53(6):4215–58.
  15. Sarsam SM, Al-Samarraie H, Alzahrani AI, Wright B. Sarcasm detection using machine learning algorithms in Twitter: a systematic review. International Journal of Market Research. 2020;62(5):578–98.
  16. Khodak M, Saunshi N, Vodrahalli K. A large self-annotated corpus for sarcasm. arXiv preprint 2017. https://arxiv.org/abs/1704.05579
  17. Kumar HMK, Harish BS. Sarcasm classification: a novel approach by using content based feature selection method. Procedia Computer Science. 2018;143:378–86.
  18. Pawar N, Bhingarkar S. Machine learning based sarcasm detection on twitter data. In: 2020 5th International Conference on Communication and Electronics Systems (ICCES). 2020. p. 957–61. https://doi.org/10.1109/icces48766.2020.9137924
  19. Du Y, Li T, Pathan MS, Teklehaimanot HK, Yang Z. An effective sarcasm detection approach based on sentimental context and individual expression habits. Cogn Comput. 2021;14(1):78–90.
  20. Muthukrishnan A, R L. Sarcasm detection using enhanced glove and BI-LSTM model based on deep learning techniques. IJIEI. 2025;13(1):26–54.
  21. Balaji T, Bablani A, Sreeja S, Misra H. SARCOVID: a framework for sarcasm detection in tweets using hybrid transfer learning techniques. In: International Conference on Pattern Recognition. 2025. p. 1–12.
  22. Chandrasekaran G, Chowdary MK, Babu JC, Kiran A, Kumar KA, Kadry S. Deep learning-based attention models for sarcasm detection in text. IJECE. 2024;14(6):6786.
  23. Ghosh A, Veale T. Magnets for sarcasm: making sarcasm detection timely, contextual and very personal. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. p. 482–91.
  24. Ghosh D, Fabbri AR, Muresan S. Sarcasm analysis using conversation context. Computational Linguistics. 2018;44(4):755–92.
  25. Xiong T, Zhang P, Zhu H, Yang Y. Sarcasm detection with self-matching networks and low-rank bilinear pooling. In: The World Wide Web Conference. 2019. p. 2115–24. https://doi.org/10.1145/3308558.3313735
  26. Liu L, Priestley JL, Zhou Y, Ray HE, Han M. A2Text-Net: a novel deep neural network for sarcasm detection. In: 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI). 2019. p. 118–26. https://doi.org/10.1109/cogmi48466.2019.00025
  27. Misra R, Arora P. Sarcasm detection using hybrid neural network. arXiv preprint 2019. https://arxiv.org/abs/1908.07414
  28. Akula R, Garibay I. Interpretable multi-head self-attention architecture for sarcasm detection in social media. Entropy (Basel). 2021;23(4):394. pmid:33810363
  29. Goel P, Jain R, Nayyar A, Singhal S, Srivastava M. Sarcasm detection using deep learning and ensemble learning. Multimed Tools Appl. 2022;81(30):43229–52.
  30. Parameswaran P, Trotman A, Liesaputra V, Eyers D. Detecting the target of sarcasm is hard: really? Information Processing & Management. 2021;58(4):102599.
  31. Baruah A, Das K, Barbhuiya F, Dey K. Context-aware sarcasm detection using BERT. In: Proceedings of the Second Workshop on Figurative Language Processing. 2020. https://doi.org/10.18653/v1/2020.figlang-1.12
  32. Ali AH, Alajanbi M, Yaseen MG, Abed SA. ChatGPT4, DALL-E, Bard, Claude, BERT: open possibilities. Babylonian Journal of Machine Learning. 2023;2023:17–8.
  33. Devlin J. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint 2018. https://arxiv.org/abs/1810.04805
  34. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T. Llama: open and efficient foundation language models. arXiv preprint 2023.
  35. Babanejad N, Davoudi H, An A, Papagelis M. Affective and contextual embedding for sarcasm detection. In: Proceedings of the 28th International Conference on Computational Linguistics. 2020. p. 225–43.
  36. Potamias RA, Siolas G, Stafylopatis A-G. A transformer-based approach to irony and sarcasm detection. Neural Comput & Applic. 2020;32(23):17309–20.
  37. Farha IA, Magdy W. From arabic sentiment analysis to sarcasm detection: the arsarcasm dataset. In: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection. 2020. p. 32–9.
  38. Gregory H, Li S, Mohammadi P, Tarn N, Draelos R, Rudin C. A transformer approach to contextual sarcasm detection in Twitter. In: Proceedings of the Second Workshop on Figurative Language Processing. 2020. https://doi.org/10.18653/v1/2020.figlang-1.37
  39. Kumar A, Narapareddy VT, Gupta P, Srikanth VA, Neti LBM, Malapati A. Adversarial and auxiliary features-aware BERT for sarcasm detection. In: Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD). 2021. p. 163–70. https://doi.org/10.1145/3430984.3431024
  40. Zhang Y, Zou C, Lian Z, Tiwari P, Qin J. Towards evaluating large language models on sarcasm understanding. arXiv preprint 2024. https://arxiv.org/abs/2408.11319
  41. Yao B, Zhang Y, Li Q, Qin J. Is sarcasm detection a step-by-step reasoning process in large language models? arXiv preprint 2024. https://arxiv.org/abs/2407.12725
  42. Al-Hassan A, Al-Dossari H. Detection of hate speech in Arabic tweets using deep learning. Multimedia Systems. 2021;28(6):1963–74.
  43. Garcia C, Țurcan A, Howman H, Filik R. Emoji as a tool to aid the comprehension of written sarcasm: evidence from younger and older adults. Computers in Human Behavior. 2022;126:106971.
  44. Yao F, Sun X, Yu H, Zhang W, Liang W, Fu K. Mimicking the brain’s cognition of sarcasm from multidisciplines for Twitter sarcasm detection. IEEE Trans Neural Netw Learn Syst. 2023;34(1):228–42. pmid:34255636
  45. Ding N, Tian S, Yu L. A multimodal fusion method for sarcasm detection based on late fusion. Multimed Tools Appl. 2022;81(6):8597–616.
  46. Swami S, Khandelwal A, Singh V, Akhtar SS, Shrivastava M. A corpus of english-hindi code-mixed tweets for sarcasm detection. arXiv preprint 2018.
  47. Techentin C, Cann DR, Lupton M, Phung D. Sarcasm detection in native English and English as a second language speakers. Can J Exp Psychol. 2021;75(2):133–8. pmid:33600203
  48. Gupta R, Kumar J, Agrawal HK. A statistical approach for sarcasm detection using Twitter data. In: 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS). 2020. p. 633–8. https://doi.org/10.1109/iciccs48265.2020.9120917
  49. Nayel H, Amer E, Allam A, Abdallah H. Machine learning-based model for sentiment and sarcasm detection. In: Proceedings of the Sixth Arabic Natural Language Processing Workshop. 2021. p. 386–9.
  50. Mehndiratta P, Soni D. Identification of sarcasm using word embeddings and hyperparameters tuning. Journal of Discrete Mathematical Sciences and Cryptography. 2019;22(4):465–89.
  51. Naseem U, Razzak I, Eklund P, Musial K. Towards improved deep contextual embedding for the identification of irony and sarcasm. In: 2020 International Joint Conference on Neural Networks (IJCNN). 2020. p. 1–7. https://doi.org/10.1109/ijcnn48605.2020.9207237
  52. Huang YH, Huang HH, Chen HH. Irony detection with attentive recurrent neural networks. In: Advances in Information Retrieval: 39th European Conference on IR Research, ECIR 2017, Aberdeen, UK, April 8-13, 2017, Proceedings 39. 2017. p. 534–40.
  53. Cai Y, Cai H, Wan X. Multi-modal sarcasm detection in Twitter with hierarchical fusion model. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. https://doi.org/10.18653/v1/p19-1239
  54. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. 2013. https://arxiv.org/abs/1301.3781
  55. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput Linguistics. 2017;5:135–46.
  56. Hokamp C, Arora P. DCU-SEManiacs at SemEval-2016 Task 1: synthetic paragram embeddings for semantic textual similarity. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016. p. 656–62. https://doi.org/10.18653/v1/s16-1100
  57. Brown TB. Language models are few-shot learners. arXiv preprint 2020. https://arxiv.org/abs/2005.14165
  58. Wu S, Koo M, Blum L, Black A, Kao L, Scalzo F, et al. A comparative study of open-source large language models, gpt-4 and claude 2: multiple-choice test taking in nephrology. arXiv preprint 2023. https://arxiv.org/abs/2308.04709
  59. Rathod S, Kataria A. Sarcasm detection using natural language processing. 2023. https://ssrn.com/abstract=4451909
  60. Parkar A, Bhalla R. Analytical comparison on detection of sarcasm using machine learning and deep learning techniques. IJCDS. 2024;15(1):1615–25.
  61. Sonare B, Dewan JH, Thepade SD, Dadape V, Gadge T, Gavali A. Detecting sarcasm in reddit comments: a comparative analysis. In: 2023 4th International Conference for Emerging Technology (INCET). 2023. p. 1–6. https://doi.org/10.1109/incet57972.2023.10170613
  62. Grice HP. Logic and conversation. Syntax and semantics. Academic Press; 1975. p. 41–58.
  63. Winner E, Leekam S. Distinguishing irony from deception: understanding the speaker’s second-order intention. British J of Dev Psycho. 1991;9(2):257–70.
  64. Reyes A, Rosso P, Veale T. A multidimensional approach for detecting irony in Twitter. Lang Resources & Evaluation. 2012;47(1):239–68.
  65. Poria S, Cambria E, Bajpai R, Hussain A. A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion. 2017;37:98–125.
  66. Chen Y, Zhou Y, Zhu S, Xu H. Detecting offensive language in social media to protect adolescent online safety. In: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing. 2012. p. 71–80. https://doi.org/10.1109/socialcom-passat.2012.55
  67. Filatova E. Irony and sarcasm: corpus generation and analysis using crowdsourcing. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC). 2012. p. 392–8.
  68. Attardo S. Irony as relevant inappropriateness. Journal of Pragmatics. 2000;32(6):793–826.