Abstract
With the widespread adoption of internet technologies and email communication systems, the exponential growth in email usage has precipitated a corresponding surge in spam proliferation. These unsolicited messages not only consume users’ valuable time through information overload but also pose significant cybersecurity threats through malware distribution and phishing schemes, thereby jeopardizing both digital security and user experience. This emerging challenge underscores the critical importance of developing effective spam detection mechanisms as a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we established that algorithmic ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that strategically combines predictions from four heterogeneous base models (NBC, k-NN, LR, XGBoost) through meta-learner integration. Our methodology incorporates grid search cross-validation with hyperparameter space optimization, enabling systematic identification of parameter configurations that maximize detection performance. The enhanced model was rigorously evaluated using comprehensive metrics including accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
Citation: Tian Y, Dai X, Li Z, Guo H, Mao X (2025) Improving the accuracy of cybersecurity spam email detection using ensemble techniques: A stacking approach. PLoS One 20(9): e0331574. https://doi.org/10.1371/journal.pone.0331574
Editor: Elochukwu Ukwandu, Cardiff Metropolitan University - Llandaff Campus: Cardiff Metropolitan University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: December 17, 2024; Accepted: August 18, 2025; Published: September 3, 2025
Copyright: © 2025 Tian et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: This study was funded by the grants from Ministry of Finance of the People’s Republic of China [Grant no: GY2023G-5]. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
In the digital era, email persists as a mission-critical communication channel, retaining its role as an efficient, cost-effective, and ubiquitous tool for both personal and organizational exchanges, despite competition from instant messaging platforms and social media [1]. While modern email systems exhibit unparalleled versatility in professional collaboration and information dissemination, their widespread adoption has inadvertently expanded the attack surface for cyber threats. Malicious actors exploit these platforms through disguised payloads, such as drive-by downloads embedded in promotional emails. Beyond direct security breaches, such unsolicited communications create systemic inefficiencies by congesting network bandwidth, depleting computational resources, and causing productivity losses through repetitive filtering tasks.
For example, emails containing advertisements can hijack computers by installing malicious software when users click on embedded advertising links. These emails may also disrupt communication by consuming bandwidth via the installed malware. While email remains indispensable, its vulnerabilities, such as time wastage, resource depletion, financial losses, and security risks to individuals and organizations, must not be overlooked. Indeed, studies indicate that up to 90% of cyberattacks originate from email-based threats [2]. This dual nature of email technology underscores the need for robust countermeasures that safeguard its utility while addressing associated risks, including financial liabilities, data breaches, and reputational harm.
The persistent challenge of email spam continues to plague digital communication systems, with global networks transmitting billions of unsolicited messages daily. Emails are typically classified as either spam or legitimate ("ham") [4]. Spam is increasingly weaponized for malicious purposes, such as distributing advertisements, phishing links, and malware; social engineering techniques are particularly deceptive, as they aim to trick users, propagate malware, and facilitate unauthorized access to sensitive data [3]. Industry analyses from Kaspersky Lab and Cisco Talos reveal staggering prevalence rates: unsolicited messages constitute 50–85% of the estimated 200 billion daily emails processed globally [5]. Spammers exploit spam for objectives like phishing and hacking, and financial media platforms further amplify its reach by providing spammers access to user data for targeted attacks. Once representative of self-propelled advertising material [6], modern spam has evolved into a multifaceted cybercrime tool.
Traditional spam detection methods typically rely on predefined rules to identify unwanted emails. Despite the proliferation of spam detection algorithms, users continue to report attacks and fraud stemming from spam emails [7]. The evolution of ML architectures has catalyzed a paradigm shift in spam detection, offering statistically superior alternatives to conventional rule-based systems. Research demonstrates the successful implementation of established supervised learning models, including k-Nearest Neighbors (k-NN), Support Vector Machines (SVMs), and Naïve Bayes (NB) classifiers, within email filtering workflows [8]. These frameworks achieve unprecedented accuracy rates in spam detection [9,10].
These approaches exhibit critical limitations in addressing the polymorphic nature of modern spam campaigns, as evidenced by persistent user reports of successful phishing attempts and financial fraud incidents. Despite advances in detection technology, spam filtering systems still face significant challenges due to evolving spam tactics and the complexity of email content. Key challenges include:
Diversity of spam content: Spam emails range from simple text to complex HTML with embedded images, complicating pattern identification.
High volume and data imbalance: Email systems process millions of daily messages, where spam constitutes a small but impactful minority. This imbalance biases detection models, increasing false positives or false negatives.
Evasion techniques: Spammers employ obfuscated text, image-based content, and dynamic generation to bypass traditional rule-based and heuristic filters.
Resource constraints: Machine learning models require substantial computational resources for training and deployment, limiting scalability.
Nevertheless, traditional ML classification algorithms often prove insufficient in addressing rapidly evolving spam threats. These algorithms depend on standalone machine learning models, prone to overfitting and generalization errors [11,12], and struggle to adapt to the dynamic complexity of modern spam. Recent advances in ML research highlight the growing prominence of hybrid models that integrate traditional algorithms with advanced techniques. For instance, combining SVMs with Random Forest (RF) classifiers enhances generalization capabilities and boosts classification accuracy [13]. By leveraging the complementary strengths of diverse algorithms, such hybrid approaches outperform individual models [14]. Furthermore, ensemble methods improve robustness and prediction accuracy, making them viable for real-time email classification in production environments. Models like RF and XGBoost are particularly favored for their resilience to adversarial attacks and computational efficiency in large-scale deployments. This paper demonstrates how ensemble techniques can enhance spam detection system performance.
The subsequent sections of this manuscript are organized to provide a comprehensive overview of this work. The research done in this domain has been discussed in the related work section. The materials and methods section describes the research methodology in detail and explains the research setup. The evaluation of the proposed model section presents the relevant evaluation metrics. The results and discussion section presents the results and discusses the relevant findings. The final section provides a well-organized and concise summary of the study.
Related work
Spam filtering is a multidisciplinary field encompassing AI-driven techniques, feature engineering, comparative analysis of ML algorithms, and evaluation of filtering methodologies. The core objective of ML in this domain is to develop generalizable predictive models that sustain high classification accuracy on out-of-sample data via rigorous inductive learning paradigms. Generally, training data is used to construct methods that effectively predict the outcome of each conceivable problem situation by extracting information from that data [15].
Recent benchmarking studies have significantly advanced the technical understanding of classifier performance in spam detection. Nandhini et al. [16] identified Logistic Regression (LR), Decision Trees (DTs), NB, k-NN, and SVMs as the five most prevalent classical ML algorithms. Their comparative analysis revealed that DT and k-NN achieved the highest accuracy; however, k-NN exhibited significantly longer computational convergence times compared to other algorithms.
Jain et al. [17] established baseline metrics using a curated dataset of 5,572 emails, achieving state-of-the-art 98.79% accuracy with an SVM model. Subsequent experiments employing NB classifiers on the same benchmark attained comparable efficacy (98.56% accuracy) [18]. In another email classification study, a dataset of 5,574 English messages achieved 95.48% accuracy with NB and 97.83% accuracy with SVM, demonstrating algorithm-specific performance trade-offs [19].
Sahin et al. [20] developed a spam detection method using a k-NN classifier, achieving 98.08% accuracy in their experiments. In a related study, [21] applied the k-NN algorithm with Chi-Square feature selection for text classification and demonstrated its effectiveness in filtering spam emails.
Thakur et al. [22] conducted a comparative analysis of multiple ML algorithms on a unified dataset, using accuracy and precision as evaluation metrics. The accuracy of SVM was 98.09%. Cota et al. [23] evaluated two publicly available corpora with distinct data splits: 80% training and 20% testing in the first experiment, and 70% training and 30% testing in the second. When applying the Random Forest (RF) algorithm, the model attained accuracies of 85.25% and 86.25% on these configurations, respectively.
Alsuwit et al. [24] investigated the classification of spam emails through ML and deep learning (DL) techniques. Their study compared LR, NB, RF, and Artificial Neural Networks (ANNs) to enhance detection accuracy and operational efficiency. Experimental results demonstrated 97% accuracy for LR, NB, and RF models, while the ANN marginally outperformed them at 98%. Despite these high accuracy rates, the authors emphasized ongoing challenges in optimizing robustness for real-world spam filtering systems.
Gordana et al. [25] integrated Latent Dirichlet Allocation (LDA) with ML algorithms for spam detection. Their results demonstrated that LR classifier achieved the highest test accuracy (98.56%), outperforming SVM at 98.11% and NB at 95.15%. The study concluded that LR exhibits superior performance to NB and SVM in text categorization tasks, particularly for spam detection.
Gallo et al. [26] employed a wrapper-based approach with supervised ML to analyze phishing attempts in suspicious emails. Their study evaluated multiple ML algorithms—including NB, k-NN, Linear SVM, Radial Basis Function (RBF) SVM, DT, RF, AdaBoost, and Multilayer Perceptron (MLP) Neural Networks—using precision, recall, and F1-score metrics. Results demonstrated that RF achieved the highest precision (95.2%) with 36 features. However, the method necessitates manual intervention for all incoming emails, fails to mitigate targeted attacks on specific individuals, and inherits limitations common to supervised learning frameworks.
Hnini et al. [27] proposed three neural network (NN)-based methods for spam detection. The emails were pre-processed using natural language processing (NLP) techniques, with features extracted via bag-of-words (BoW), n-grams, and Term Frequency-Inverse Document Frequency (TF-IDF). Among the tested models, the k-NN algorithm demonstrated superior performance across four evaluation metrics on the test dataset. A novel spam classification method [28] integrating the Harris Hawks Optimization (HHO) algorithm with k-NN achieved 94.3% detection accuracy, as reported by the authors.
Naveen et al. [29] proposed a hybrid ML classifier based on TF-IDF to develop a phishing email detection system. Their results demonstrated that the hybrid model achieved 87.5% accuracy, outperforming traditional methods. Furthermore, the study highlighted TF-IDF’s superiority over the Count Vectorizer technique in feature extraction. The authors emphasized the importance of integrating multiple models for robust phishing detection, providing critical insights into ML-driven cybersecurity solutions.
Saini et al. [30] proposed a novel spam detection approach that utilizes RF for feature extraction and inputs the extracted features into a LR classifier to predict email legitimacy (spam vs. ham).
Spam filtering techniques have been extended to diverse domains. Sadia et al. [31] conducted a Twitter spam detection study focused on iPhone-related tweets, employing content-based features and ML algorithms such as NB, LR, k-NN, DT, and SVM. Among these, the NB classifier achieved the highest accuracy (89%), demonstrating its efficacy in spam identification.
Aufar et al. [32] explored the application of DT and RF classifiers for sentiment analysis of YouTube comments, aiming to facilitate the categorization of positive and negative feedback. Experimental results indicated that the DT classifier marginally outperformed the RF classifier, achieving 89.4% accuracy versus 88.2% for RF.
In addition, technologies and methods developed for spam filtering, such as feature extraction [31,33,34] and anomaly detection [35,36], are also equally applicable to fake news detection. By leveraging these approaches, researchers can analyze textual content, identify deceptive patterns [37], and differentiate between trustworthy and misleading information.
However, many current spam email detection techniques rely on standalone models, which are susceptible to overfitting and classification errors [38]. To address these limitations, ensemble methods that combine multiple classification algorithms have been proposed. By aggregating predictions from diverse models, these approaches reduce both false positive and false negative rates while enhancing the overall accuracy of spam detection systems.
Raza et al. [39] surveyed diverse ML-based technologies for spam classification. Their analysis revealed that supervised ML methods dominate current research, with primary focus on Bag-of-Words (BoW) and email body features. Key research priorities include the development of multi-algorithmic systems, novel feature engineering, real-time classification frameworks, and minimization of false positive rates. Their findings suggest that ensemble methods consistently outperform single-algorithm approaches, with the NB and SVM combination being the most prevalent hybrid model in this domain. However, the study lacks detailed discussion on feature selection techniques or specific extraction methodologies.
In [40], four ML algorithms (NB, SVM, DT, and k-NN) were employed as base learners to construct a meta-learning model. The resulting stacking ensemble demonstrated superior performance with 95.8% classification accuracy.
Ghosh et al. [41] demonstrated that combining DT, NB, and SVMs via bagging and boosting significantly enhances detection accuracy, reduces false positive rates, and achieves higher F1 scores compared to standalone models. The stacking approach, as proposed in [42], involves two stages: (1) training multiple base classifiers to generate initial predictions, and (2) feeding these predictions into a meta-classifier for final decision-making. This method employs a meta-learner to optimally aggregate outputs from diverse base models trained on the same dataset, thereby refining the final prediction through learned combination rules [43].
Ghourabi and Alohaly [44] implemented a stacking framework with base learners including RF and Gradient Boosting. Their ensemble model surpassed standalone classifiers across precision, recall, and accuracy metrics, demonstrating stacking’s efficacy for spam detection [45]. Additionally, the study incorporated cost-sensitive learning to enhance minority-class prediction accuracy in imbalanced datasets.
To address the class imbalance issue prevalent in spam detection datasets, [46] proposed a novel ensemble method called Fisher–Markov-based Phishing Ensemble Detection (FMPED). This approach integrates low-sampling techniques and demonstrates significant improvements in detection rates. Furthermore, FMPED achieves superior performance metrics, with notable enhancements in both F1-score and accuracy compared to baseline methods.
In this study [47], the hyperparameters of four distinct classifiers were optimized via grid search, with soft voting employed as the aggregation mechanism for final predictions. Experimental results demonstrated that the proposed ensemble model achieved 99.32% accuracy, significantly outperforming individual classifiers. These findings highlight the model’s robust capability to distinguish spam from ham.
Our review of the literature reveals that current spam detection research prioritizes enhancing the accuracy and efficiency of ML algorithms. Commonly employed classifiers include SVM, NB, RF, and k-NN. While existing methods achieve 90% to 99% accuracy, their real-world deployment is constrained by poor generalization capabilities and prohibitive computational costs. Table 1 summarizes the classification of mainstream methods. To address these limitations, we propose a hybrid stacking ensemble model that integrates the NB Classifier, k-NN, Logistic Regression (LR), and XGBoost, fine-tuned via grid search optimization. This approach demonstrates superior performance in accuracy, generalization, and robustness compared to prior studies, exhibiting high effectiveness and efficiency in spam detection tasks.
Materials and methods
Dataset description
To enhance the diversity and robustness of our spam classification model, we integrated two publicly accessible datasets: Kaggle’s email classification dataset (Dataset 1) and the Enron Corpus (Dataset 2). Dataset 1 contains 5,572 samples and two attributes: a category label indicating whether an email is spam or valid (commonly referred to as ham), and the email text. The dataset comprises 87% ham emails and 13% spam emails. Fig 1 shows a partial view of the dataset; complete emails are omitted due to length. Dataset 2 was converted to CSV format by Marcel Wiechmann [48]. To ensure consistency, the ‘category’ and ‘message’ features in Dataset 2 were aligned with those in Dataset 1, and the resulting data frames were then combined into a single CSV file.
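The alignment-and-merge step can be sketched as follows. The tiny in-memory frames below stand in for the two CSV files, and the column names on the Enron side are illustrative assumptions rather than the actual file schema:

```python
import pandas as pd

# Toy stand-ins for the two corpora; in practice these would come from
# pd.read_csv(...) on the Kaggle and Enron CSV files. Column names on the
# second frame are hypothetical.
df1 = pd.DataFrame({
    "Category": ["ham", "spam"],
    "Message": ["See you at the meeting", "WIN a FREE prize now!!!"],
})
df2 = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["Quarterly report attached", "Cheap meds, click here"],
})

# Align Dataset 2's features with Dataset 1's schema, then combine.
df2 = df2.rename(columns={"label": "Category", "text": "Message"})
combined = pd.concat([df1, df2], ignore_index=True)
combined = combined.drop_duplicates(subset="Message").reset_index(drop=True)
# combined.to_csv("combined_emails.csv", index=False)
```

Dropping duplicate messages after concatenation guards against the same email appearing in both corpora.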
Data preprocessing
Preprocessing is an essential prerequisite for data analysis. Raw datasets frequently exhibit inconsistencies, including missing values, duplicate entries, and formatting irregularities. Such issues can compromise the reliability of analytical methods by introducing noise and bias. Through preprocessing, raw unstructured data is transformed into a structured and standardized format, enabling robust analysis. This critical step ensures the validity and reproducibility of analytical outcomes.
The combined dataset had an imbalanced class distribution, so we oversampled the spam emails to balance the dataset and prevent bias toward the majority class. Specifically, we replicated the spam emails in the dataset to increase their count to match that of the ham emails. This oversampling approach ensured that our model had equal representation of both classes, which is essential for accurate classification performance. Fig 2 shows the class balance achieved after oversampling.
In this work, we opted for basic oversampling predicated on computational efficiency and implementation simplicity. This approach provided a controlled basis for subsequent ensemble architecture comparisons, isolating the effects of sampling strategies. While we acknowledge potential overfitting risks inherent in sample duplication, future iterations will explore advanced techniques such as hybrid sampling, Synthetic Minority Oversampling Technique (SMOTE), or cost-sensitive learning. These methods aim to enhance generalization while preserving low email processing latency—a critical requirement for real-time filtering systems.
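This duplication-based oversampling can be sketched with pandas sampling with replacement; the toy frame below mimics the majority-ham imbalance:

```python
import pandas as pd

# Toy imbalanced frame (the real data is roughly 87% ham / 13% spam).
df = pd.DataFrame({
    "Category": ["ham"] * 7 + ["spam"],
    "Message": [f"ham message {i}" for i in range(7)] + ["free prize!!!"],
})

ham = df[df["Category"] == "ham"]
spam = df[df["Category"] == "spam"]

# Replicate spam rows (sampling with replacement) until both classes match.
spam_upsampled = spam.sample(n=len(ham), replace=True, random_state=42)
balanced = (
    pd.concat([ham, spam_upsampled])
    .sample(frac=1, random_state=42)  # shuffle rows
    .reset_index(drop=True)
)
```

Shuffling after concatenation avoids handing the classifier a block of duplicated minority rows in sequence.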
In this case, the data preprocessing pipeline outlined above ensures that email datasets are cleaned and standardized, including removing noise, normalizing text, and eliminating extraneous information to guarantee their readiness for downstream analysis.
The preprocessing steps include:
Text Cleaning: involves removing non-linguistic elements such as special characters, punctuation, and HTML tags from raw text to eliminate semantically irrelevant symbols and mitigate noise. This process also filters embedded HTML/CSS scripts to neutralize phishing attack vectors and enhance system security. Finally, text standardization ensures structural uniformity (e.g., consistent whitespace, encoding formats) and normalization, preparing the data for downstream NLP tasks such as tokenization and feature extraction.
Tokenization: segments raw text into individual words, phrases, or symbols (tokens), transforming unstructured input into analyzable components (e.g., converting “Check this link” to [“Check”, “this”, “link”]). This process enables systematic feature extraction by standardizing text representations and providing structured input for downstream NLP tasks, including TF-IDF vectorization, word embeddings, and transformer-based models. Additionally, tokenization resolves ambiguities in compound terms through context-aware splitting—for instance, decomposing “state-of-the-art” into sequential tokens ([“state”, “of”, “the”, “art”])—thereby preserving semantic integrity while optimizing compatibility with algorithmic processing pipelines.
Lowercasing: converts all text tokens to lowercase to standardize linguistic representations and mitigate feature redundancy. This process ensures case insensitivity by eliminating duplicate lexical variants (e.g., merging “FREE” and “free” into a unified feature), while simultaneously countering adversarial obfuscation tactics that exploit alternating case patterns (e.g., neutralizing “PaYPal” to “paypal”). By collapsing case-sensitive variations into a single canonical form, lowercasing reduces the dimensionality of the feature space, thereby optimizing computational efficiency in downstream machine learning workflows without compromising semantic fidelity.
Stop Word Removal: filters high-frequency, low-information words (e.g., “the”, “is”, “and”) from textual data, reducing corpus volume by 40–50% while retaining semantically meaningful context and mitigating linguistic noise. By eliminating generic terms, this process amplifies domain-specific keyword signals, such as security-critical lexemes like “password” and “invoice”, thereby enhancing detection models’ focus on discriminative features. To counter adversarial tactics in spam campaigns, dynamically curated stop word lists are deployed, targeting terms like “click” that are systematically overused in phishing emails to manipulate TF-IDF distributions. This adaptive mechanism not only preserves statistical integrity in feature engineering pipelines but also disrupts malicious attempts to exploit lexical redundancy in automated text classification systems.
Stemming: reduces words to their base or root forms through heuristic rules, such as converting “running” to “run”, to standardize morphological variants like “phishing” and “phished” into a common root (“phish”). This process enhances feature consistency in text analysis while operating with minimal computational overhead, making it suitable for real-time applications. However, stemming may generate non-dictionary roots (e.g., truncating “troubling” to “troubl”), potentially introducing semantic ambiguity that could affect downstream tasks reliant on precise lexical semantics.
Lemmatization: maps words to their canonical form (lemma) through morphological analysis (e.g., reducing “better” to “good” and “meeting” to “meet”), preserving semantic integrity by retaining linguistically meaningful roots (e.g., normalizing “accounts” to “account”). Unlike rule-based stemming, lemmatization incorporates part-of-speech (POS) tagging to ensure context sensitivity, for instance, maintaining “saw” as a noun while mapping its verbal usage to “see”. This granular disambiguation enhances the detection of nuanced phishing terminology by resolving inflectional variants (e.g., distinguishing “payment” from “payments”), a critical capability for minimizing false negatives in security-focused natural language processing pipelines.
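The steps above can be sketched end to end. This is a deliberately simplified stand-in: the stop word list is abbreviated and a crude suffix-stripping rule replaces the Porter stemmer or WordNet lemmatizer that an NLTK-based pipeline would normally provide:

```python
import re

STOP_WORDS = {"the", "is", "and", "a", "this", "to"}  # abbreviated list for illustration

def stem(token: str) -> str:
    """Crude suffix stripping standing in for a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)              # text cleaning: strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)          # text cleaning: drop punctuation/special chars
    tokens = text.lower().split()                     # tokenization + lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [stem(t) for t in tokens]                  # stemming
```

For example, `preprocess("<b>FREE</b> prizes!! Click this link")` yields lowercase, stop-word-free, stemmed tokens ready for TF-IDF vectorization.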
For comprehensive data mining, we introduce the following features: Num_Characters, the number of characters; Num_Words, the number of words; and Num_Sentences, the number of sentences. These features are detailed in Fig 3.
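These three features can be derived with a few pandas operations. The sentence count below uses a terminal-punctuation heuristic rather than a full sentence tokenizer (such as NLTK's `sent_tokenize`), so treat it as an approximation:

```python
import pandas as pd

df = pd.DataFrame({"Message": [
    "Win now! Claim your prize today.",
    "Lunch at noon?",
]})

df["Num_Characters"] = df["Message"].str.len()
df["Num_Words"] = df["Message"].str.split().str.len()
# Heuristic: count terminal punctuation marks as sentence boundaries.
df["Num_Sentences"] = df["Message"].str.count(r"[.!?]")
```

Spam messages often skew toward longer texts with more exclamation-terminated sentences, which is what makes these counts useful discriminative features.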
Analysis of the pair plots identified several outliers within the ‘ham’ category. To minimize their impact, we set an upper limit, as shown in Fig 4.
Feature extraction
The Term Frequency-Inverse Document Frequency (TF-IDF) technique converts preprocessed text into numerical feature vectors compatible with ML algorithms. This vectorized representation facilitates subsequent analysis and modeling. The extracted features aim to effectively represent the email content, which is crucial for accurate classification.
TF-IDF vectorizer
TF-IDF is a cornerstone technique in NLP and ML for quantifying word significance within documents. It considers both a word’s frequency within a document (Term Frequency, TF) and its rarity across the corpus (Inverse Document Frequency, IDF). The product of these two metrics is the TF-IDF score, which reflects the relevance of a word to a specific document.
Term Frequency (TF): represents how often a word appears in a specific document. In our context, this involves calculating the frequency of each word within an email.
Inverse Document Frequency (IDF): measures how rare a word is across all documents in the dataset. It is calculated as the inverse frequency of documents containing the word.
TF-IDF Score: is calculated by multiplying the TF and IDF values. This score assigns a weight to each word, indicating its importance within a particular document.
TF-IDF is widely used to analyze word distribution within a set of documents, enabling tasks such as similarity measurement, document categorization, and establishing links between documents. The TF-IDF calculation is shown in Equation (1).
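A minimal computation matching these definitions is sketched below on pre-tokenized toy documents. This is one common variant of Equation (1); production pipelines would typically use scikit-learn's `TfidfVectorizer`, whose smoothing and normalization differ slightly:

```python
import math

# Toy pre-tokenized corpus: each inner list is one "email".
docs = [
    ["free", "prize", "click"],
    ["meeting", "agenda", "attached"],
    ["free", "meeting", "room"],
]

def tf(term: str, doc: list[str]) -> float:
    """Term Frequency: how often the term appears in this document."""
    return doc.count(term) / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    """Inverse Document Frequency: rarity across the corpus.
    Assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF score: the product of TF and IDF."""
    return tf(term, doc) * idf(term, corpus)
```

Here "prize" (appearing in one document) receives a higher weight in the first email than "free" (appearing in two), illustrating how IDF damps common terms.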
Proposed methodology
ML algorithms are well-suited for addressing complex spam classification issues. By leveraging statistical models and algorithmic frameworks, these techniques enable robust analysis of large datasets and accurate predictive capabilities. As a result, ML provides a dynamic and adaptive approach to spam detection, overcoming limitations inherent in traditional rule-based methods while enhancing email management efficiency and mitigating risks from malicious or unsolicited emails. The primary ML techniques applied in spam detection include:
- Supervised Learning: Training models on labeled datasets (spam/ham) to learn discriminative patterns and classify new emails accordingly. Common algorithms include SVM, DT, RF, and NN.
- Natural Language Processing (NLP): Applying NLP pipelines to preprocess email content and extract semantically meaningful features from it, such as word embeddings, TF-IDF vectors, and sentiment analysis.
- Ensemble Methods: Combining multiple base models to improve classification accuracy and robustness. Commonly used techniques are bagging, boosting and stacking.
The advantages of using ML in spam detection are:
- Adaptability: ML models can be retrained on newer datasets to counter emerging spam tactics, ensuring sustained detection efficacy.
- Improved Accuracy: Advanced algorithms and feature extraction techniques enhance the precision and recall metrics in spam detection.
- Scalability: ML-based systems efficiently process high-volume email traffic while maintaining performance as data scales.
In this paper, we propose a stacking ensemble approach for spam classification, integrating the Naïve Bayes Classifier (NBC), k-Nearest Neighbors (k-NN), Logistic Regression (LR), and eXtreme Gradient Boosting (XGBoost). This approach synergizes the complementary strengths of each base classifier while mitigating their individual limitations. The selected classifiers were chosen for their heterogeneous capabilities in modeling distinct data patterns and classification boundaries.
NBC models the probabilistic relationship between features and target variables and handles high-dimensional data well. K-NN identifies similar samples and leverages local patterns in the data. LR is a linear model that offers interpretability and efficient computation. XGBoost improves classification accuracy and reduces the risk of overfitting. This heterogeneous combination synergizes robustness, interpretability, and efficiency, collectively improving spam detection performance.
The framework illustrated in Fig 5 shows that we merge two datasets, perform preprocessing and balancing operations on them before feeding them to the base classifiers. The outputs of these classifiers are then aggregated and used as input for the stacking-based meta-classifier. This section describes the implementation of the NBC, k-NN, LR, and XGBoost algorithms. These models are discussed in detail in the following section.
The stacking approach enhances classifier performance by aggregating diverse feature representations from base models. For instance, whereas a base classifier may underperform on generic spam patterns, another might lack sensitivity to keyword-specific phishing tactics. By fusing their predictions, the stacking ensemble synthesizes a broader feature spectrum pertinent to spam classification, thereby achieving higher accuracy and robustness compared to individual models.
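The proposed architecture can be sketched with scikit-learn's `StackingClassifier`. Everything below is illustrative rather than the paper's exact configuration: the corpus is a toy stand-in, the logistic-regression meta-learner is an assumption, and `GradientBoostingClassifier` substitutes for XGBoost (swap in `xgboost.XGBClassifier` if that package is installed):

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the combined, balanced dataset (1 = spam, 0 = ham).
spam = [
    "win a free prize now", "free money click this link",
    "claim your free reward today", "urgent offer click now to win",
    "cheap meds free shipping offer", "you won a lottery prize claim now",
    "free vacation click here now", "limited offer win cash prize",
]
ham = [
    "see you at the meeting tomorrow", "the quarterly report is attached",
    "can we reschedule our call", "lunch at noon works for me",
    "notes from the project meeting", "please review the attached draft",
    "thanks for the update on the report", "agenda for tomorrow is attached",
]
X, y = spam + ham, [1] * len(spam) + [0] * len(ham)

base_learners = [
    ("nbc", MultinomialNB()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("lr", LogisticRegression(max_iter=1000)),
    # Stand-in for XGBoost's XGBClassifier.
    ("gb", GradientBoostingClassifier(random_state=42)),
]

# Out-of-fold predictions from the base learners feed the meta-learner.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(), cv=3)
model = make_pipeline(TfidfVectorizer(), stack)
model.fit(X, y)
prediction = model.predict(["claim your free prize today"])[0]
```

In practice each estimator's hyperparameters (and the meta-learner's) would be tuned with `GridSearchCV` before assembling the stack, mirroring the grid-search optimization described in this paper.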
Since spam detection is fundamentally a text classification task, we focus on standard ML models with engineered features rather than deep learning (DL) approaches. DL-based methods, while powerful, demand substantial computational resources for both training and deployment, including high processing power and memory capacity, which are often impractical for real-time spam filtering systems.
Naïve Bayes classifier
NBC calculates the probability of each feature within each category and uses these probabilities to estimate the likelihood of a given feature set belonging to each category. This approach simplifies likelihood computation under the assumption of feature independence within categories, a foundational principle derived from Bayesian theory. As a supervised learning algorithm, NBC is widely employed in text classification tasks such as spam detection and sentiment analysis. Its efficiency stems from rapid probabilistic calculations, enabling real-time predictions even with noisy input data. Additionally, NBC robustly estimates class probabilities while maintaining computational simplicity, making it a preferred choice in open-source spam filtering systems. The mathematical formulation of NBC is presented in Equation (2).
where w is a feature vector comprising multiple email attributes; c ∈ {spam, ham} denotes the category to which the email belongs; P(w|ci) quantifies the probability of the complete feature vector appearing in a spam (or ham) email; and P(ci|w) estimates the probability of the email being spam (or ham) given the complete feature vector. The algorithm flow is as follows:
# Definition:
x = (a[1], a[2], ..., a[m]) is a sample vector containing multiple feature attributes.
d[1], d[2], ..., d[k] are the divisions (value ranges) of each feature attribute.
y[1], y[2], ..., y[n] are the categories to which a sample may belong.
# Calculate the prior probability p(y[i]) for each category based on the distribution of categories in the sample set
# Calculate the frequency of each feature attribute division under each category p(a[j] in d[k] | y[i])
# Calculate p(x|y[i]) for each sample
p(x|y[i]) = p(a[1] in d | y[i]) * p(a[2] in d | y[i]) * ... * p(a[m] in d | y[i])
# Prediction:
p(y[i]|x) = (p(x|y[i]) * p(y[i])) / p(x)
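For illustration, the pseudocode above can be realized as a short, self-contained Python sketch with Laplace smoothing; the toy corpus and function names are illustrative, not the implementation used in this study:

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Estimate priors p(c) and per-category word counts for Laplace-smoothed p(w|c)."""
    priors = Counter(labels)
    word_counts = {c: Counter() for c in priors}
    totals = Counter()
    for doc, c in zip(docs, labels):
        for w in doc.split():
            word_counts[c][w] += 1
            totals[c] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, totals, vocab, len(labels)

def predict_nb(model, doc):
    """Pick the category maximizing log p(c) + sum of log p(w|c)."""
    priors, word_counts, totals, vocab, n = model
    best, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c] / n)
        for w in doc.split():
            # Laplace smoothing avoids zero probabilities for unseen words
            score += math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = ["win money now", "free prize win", "meeting at noon", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, "win free money"))  # spam
```

Working in log space keeps the product of many small probabilities numerically stable, which matters for real email vocabularies.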
K-Nearest Neighbors
k-NN classifies a new email based on the categories of its K nearest neighbors in the training data, using distance as the determining factor. The user-defined K-value determines the number of neighbors to consider: the K training emails closest to the new email under a chosen distance metric are selected, and the majority label (spam or ham) among them is assigned to the new email. k-NN is simple to implement and makes no assumptions about data distribution. However, with larger datasets, prediction time increases due to the need to find the K nearest neighbors for each new sample, and performance is sensitive to the choice of K-value and distance metric. Equation (3) shows the calculation for Euclidean distance.
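A minimal NumPy sketch of this procedure, assuming a toy two-dimensional feature space (the data points and function name are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours."""
    # Euclidean distance to every training sample, as in Equation (3)
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D features: three "ham" points near the origin, three "spam" points far away.
X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_train = np.array(["ham", "ham", "ham", "spam", "spam", "spam"])
print(knn_predict(X_train, y_train, np.array([5., 5.5]), k=3))  # spam
```

Note that every prediction scans the full training set, which is the scalability cost discussed above.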
Logistic regression
Logistic Regression (LR) is a classification method well-suited for predicting discrete probabilities. It utilizes a logistic function to model the probability of an event, resulting in a binary output (0 or 1). In our case, these values represent ‘spam’ or ‘ham’. LR is valuable not only for its predictive power but also for providing insights into the contribution of each feature to the probability of a positive outcome. The binary dependent variable in LR facilitates analysis of the relationship between independent and dependent variables. Equation (4) shows the formula for LR.
LR optimizes a set of coefficients through an iterative process, typically using algorithms such as gradient descent. These coefficients weight the individual features, influencing the prediction and minimizing the difference between predicted probabilities and actual labels in the training data. LR classifies an email as spam or ham by comparing the predicted probability to a threshold. If the probability is above the threshold, the email is classified as spam; otherwise, it is classified as ham. The threshold is typically 0.5 but can be adjusted as needed. The LR algorithm proceeds as follows:
# Definition:
w: weight, b: bias, learning_rate, num_iterations
# Iteration training start
  # Repeat for each iteration
  # Calculate the linear combination and probability
  z = w * feature + b
  prob = sigmoid(z)
  # Calculate loss function
  loss = -1/(total_data) * ∑ (category * log(prob) + (1 - category) * log(1 - prob))
  # Calculate gradients
  d_w = 1/(total_data) * ∑ feature * (prob - category)
  d_b = 1/(total_data) * ∑ (prob - category)
  # Update weight
  w -= learning_rate * d_w
  # Update bias
  b -= learning_rate * d_b
# Prediction:
if prob > 0.5 then 1 else 0
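The pseudocode can be made runnable in a few lines of NumPy; the single-feature toy data below is illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_lr(feature, category, learning_rate=0.1, num_iterations=2000):
    """Gradient descent on the logistic loss for a single feature."""
    n = len(category)
    w, b = 0.0, 0.0
    for _ in range(num_iterations):
        prob = sigmoid(w * feature + b)                 # forward pass
        d_w = (feature * (prob - category)).sum() / n   # gradient w.r.t. w
        d_b = (prob - category).sum() / n               # gradient w.r.t. b
        w -= learning_rate * d_w
        b -= learning_rate * d_b
    return w, b

def predict_lr(w, b, feature, threshold=0.5):
    return (sigmoid(w * feature + b) > threshold).astype(int)

feature = np.array([0., 1., 2., 8., 9., 10.])
category = np.array([0., 0., 0., 1., 1., 1.])
w, b = train_lr(feature, category)
print(predict_lr(w, b, feature))
```

In practice the threshold of 0.5 can be raised to trade recall for precision when false positives (legitimate mail marked as spam) are costlier.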
XGBoost Classifier
XGBoost is an optimized gradient-boosting framework that iteratively enhances model accuracy by constructing a series of ordered decision trees, each correcting the errors of its predecessor. XGBoost is an efficient and scalable gradient boosting algorithm known for its training efficiency, strong predictive performance, controllable parameters, and ease of use. The XGBoost prediction formula is given in Equation (5).
where prob denotes the final prediction, obtained by accumulating the outputs of all previously generated tree models; ft(xi) denotes the newly generated tree model; and t denotes the total number of base tree models.
This approach builds classification trees sequentially, using the residuals of each tree to train the next. During training, it integrates the predictions from previous trees to improve performance. Pruning is used to prevent overfitting and simplify the decision trees by removing less influential nodes.
In this case, the XGBoost algorithm proceeds as follows:
Initialization: A set of decision tree models is trained on a small subset of emails to classify them as spam or ham.
Boosting: New models are added and trained to correct the errors of previous models, with a focus on misclassified emails.
Gradient Descent: The parameters of the new model are optimized using gradient descent to minimize the ensemble loss function.
Regularization: Overfitting is prevented by penalizing models with excessive parameters or complex decision boundaries.
Pruning: Leaves with low weights are removed to prevent overfitting and improve generalization.
Repeat: Steps 2–5 are repeated until a stopping criterion is met (e.g., a target accuracy or maximum number of rounds).
Return: The final ensemble classifies new emails as spam or ham by aggregating the weighted predictions of all trees.
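Because the full XGBoost library adds regularization and pruning on top of this loop, a stripped-down NumPy sketch using depth-one stumps and the logistic loss can illustrate the boosting and gradient-descent steps above (the data and helper names are illustrative, not the study's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(x, r):
    """Fit a one-split regression stump to the residuals r."""
    best = None
    for thr in np.unique(x)[:-1]:  # the largest value would leave the right side empty
        left, right = r[x <= thr].mean(), r[x > thr].mean()
        pred = np.where(x <= thr, left, right)
        err = ((r - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, thr, left, right)
    return best[1], best[2], best[3]

def boost(x, y, rounds=20, lr=0.3):
    """Sequentially add stumps, each fitted to the current negative gradient."""
    F = np.zeros_like(x, dtype=float)
    stumps = []
    for _ in range(rounds):
        r = y - sigmoid(F)           # negative gradient of the logistic loss
        thr, lv, rv = fit_stump(x, r)
        F += lr * np.where(x <= thr, lv, rv)
        stumps.append((thr, lv, rv))
    return stumps

def predict_boost(stumps, x, lr=0.3):
    F = np.zeros_like(x, dtype=float)
    for thr, lv, rv in stumps:       # sum the scaled outputs of all trees
        F += lr * np.where(x <= thr, lv, rv)
    return (sigmoid(F) > 0.5).astype(int)

x = np.array([1., 2., 3., 10., 11., 12.])
y = np.array([0., 0., 0., 1., 1., 1.])
stumps = boost(x, y)
print(predict_boost(stumps, x))
```

Each round fits the residual errors of the ensemble so far, which is the "correct the errors of previous models" behavior described in the boosting step.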
Stacking model
Traditional standalone ML models often exhibit suboptimal performance due to inherent algorithmic biases and data-specific limitations. To address this, stacking ensemble learning (Fig 6) integrates heterogeneous base classifiers through a hierarchical two-tier architecture. This synthesis of collective strengths enhances overall accuracy, robustness, and generalization capabilities beyond the performance of any individual base model.
We propose a stacking ensemble approach for spam email classification. The stacking ensemble framework operates through a meticulously designed three-stage pipeline to optimize spam detection accuracy.
- Base Layer: The base classifiers (NBC, k-NN, LR, XGBoost) generate cross-validated probabilistic predictions on the training data. These outputs are concatenated to form a meta-training dataset Dmeta, where each sample corresponds to a vector of base model probabilities.
- Meta-Learning Phase: The meta-classifier (XGBoost) is trained on Dmeta to learn optimal weightings across base model predictions, minimizing classification error through gradient-boosted tree ensembles.
- Inference: For new emails, the base classifiers first produce probability predictions, which are aggregated into meta-test features. The meta-classifier then synthesizes these features to emit final spam/ham labels.
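The three-stage pipeline above can be sketched with scikit-learn's StackingClassifier. The synthetic data stands in for the email feature matrix, and GradientBoostingClassifier stands in for the XGBoost meta-learner so the sketch has no dependency on the xgboost package; neither stand-in is the study's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed email feature matrix.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [
    ("nbc", GaussianNB()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("lr", LogisticRegression(max_iter=1000)),
]
stack = StackingClassifier(
    estimators=base,
    final_estimator=GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
    stack_method="predict_proba",  # base layer emits cross-validated probabilities
    cv=5,
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(round(acc, 3))
```

The `cv=5` argument ensures the meta-training features are out-of-fold predictions, which prevents the meta-learner from overfitting to base-model training errors.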
Hyperparameter tuning technique
Grid search is a widely used hyperparameter tuning technique in ML that systematically explores all possible combinations of hyperparameters to optimize model performance. While alternative methods like random search and Bayesian optimization exist, we selected grid search for its deterministic search behavior and compatibility with parallel processing architectures. In this study, a grid search is used to optimize hyperparameters such as the number of trees, number of iterations, and learning rate.
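As an illustrative sketch, such a grid search can be expressed with scikit-learn's GridSearchCV; the grid values and the GradientBoostingClassifier stand-in (whose n_estimators and learning_rate are analogous to XGBoost's nrounds and η) are assumptions, not the study's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the email feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {
    "max_depth": [2, 3],           # maximum tree depth
    "n_estimators": [50, 100],     # analogous to XGBoost's nrounds
    "learning_rate": [0.1, 0.25],  # analogous to η
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,  # grid search parallelizes trivially across parameter combinations
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Every combination is evaluated by cross-validation, so the cost grows multiplicatively with the grid size; the parallelism noted above is what keeps exhaustive search tractable.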
Evaluations of proposed model
This section presents a comprehensive evaluation of the proposed model using standard performance metrics: accuracy, precision, recall, F1-score, and the confusion matrix. The evaluation results provide valuable insights for informed decision-making in ML.
Confusion matrix
The confusion matrix is a common tabular metric for evaluating the performance of machine learning classifiers; Karl Pearson referred to it as the error matrix. It is composed of four counts: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). As shown in Table 2, this decomposition provides granular insight into model behavior, enabling rigorous assessment of both effectiveness and robustness and making the confusion matrix indispensable for diagnostic optimization in classification systems.
Accuracy
Accuracy is a commonly used evaluation metric. It measures the overall correctness of the model’s predictions by calculating the ratio of correctly predicted instances to the total number of instances. While straightforward, accuracy can be misleading when categories are unbalanced, as it can be skewed by the dominant category. Therefore, accuracy should be used in conjunction with other metrics, particularly when different types of errors have varying costs. The formula for calculating accuracy is shown in Equation (6).
Precision
Precision is a key metric for evaluating classifier accuracy in machine learning. It measures the proportion of correctly predicted positive cases out of all instances predicted as positive. Also known as Positive Predictive Value, precision reflects the classifier’s ability to identify true positives while minimizing false positives. Precision is expressed as a ratio, quantifying the accuracy of positive predictions. The formula for calculating precision is shown in Equation (7).
Recall
In evaluating classification performance, recall indicates the completeness of a classifier. Recall measures the classifier’s ability to identify all relevant instances in the dataset. Specifically, it quantifies the proportion of correctly identified relevant instances out of all relevant instances. Equation (8) shows the formula for recall, a quantitative measure of the classifier’s ability to identify relevant instances.
F1-score
The F-measure is a weighted harmonic mean that balances precision and recall to evaluate a test’s accuracy. The F1-score is the most frequently used F-measure, balancing precision and recall. The formula is given in Equation (9).
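Equations (6)–(9) reduce to a few lines of arithmetic over the confusion-matrix counts; the example counts below are illustrative:

```python
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)          # Equation (6)
    precision = tp / (tp + fp)                          # Equation (7)
    recall = tp / (tp + fn)                             # Equation (8)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (9)
    return accuracy, precision, recall, f1

# Example: 90 spam caught, 5 ham misflagged, 100 ham passed, 5 spam missed.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=5, tn=100, fn=5)
print(acc, prec, rec, f1)
```

Note how accuracy alone would hide an imbalance between the two error types, which is why precision, recall, and F1 are reported alongside it.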
Results and discussion
This study investigates the efficacy of a stacking ensemble approach for enhancing spam email classification accuracy. As evidenced in Fig 7, the proposed stacking model achieves superior performance compared to all individual base classifiers (NBC, k-NN, LR, and XGBoost) when evaluated through 5-fold stratified cross-validation. These results validate that strategic integration of heterogeneous classifiers through stacking significantly improves discriminative power in spam detection tasks.
Performance comparison
To demonstrate the effectiveness of our proposed model, we compare it with four state-of-the-art base classifiers: NBC, k-NN, LR, and XGBoost. As shown in Table 3 and Fig 8, our proposed model achieves the highest accuracy, at 97.78%. This indicates that the stacking method, with different base classifier combinations, consistently performs better than any individual classifier.
Our proposed model achieves the highest precision (96.58%), highest recall (97.81%), and also the highest F1-score (96.89%).
XGBoost also outperforms the other three base models across all metrics, achieving a precision of 95.21%, accuracy of 96.42%, recall of 96.73%, and F1-score of 95.74%. In contrast, k-NN performs poorly across all metrics, with a precision of 89.68%, accuracy of 90.92%, recall of 65.95%, and F1-score of 72.86%. These results suggest that both XGBoost and our proposed model are well-suited for this dataset.
Further experiments, using different combinations of base classifiers in the stacking approach, were conducted to evaluate the generalization and robustness of the results. The results consistently showed that our proposed method outperformed the individual base classifiers, achieving high classification performance.
Discussion
The performance comparison demonstrates that our proposed approach significantly enhances spam classification accuracy using ensemble ML. Our experiments demonstrate that combining the outputs of multiple base classifiers yields improvements in accuracy, precision, recall, and F1-score.
To validate our findings and ensure consistent results, we conducted additional experiments. We performed a 5-fold cross-validation grid search, optimizing for accuracy, to determine the best values for the proposed model’s hyperparameters: max_depth (maximum tree depth), nrounds (number of boosting iterations), and η (learning rate).
In addition, experimental performance metrics (Accuracy, Precision, Recall, F1-score) were collected through multiple independent trials for both the proposed model and tuned hybrid model. Each trial maintained identical experimental conditions to ensure comparability. The 95% confidence intervals (CIs) computed by nonparametric bootstrap resampling are detailed in Fig 9.
As shown in Table 4 and Fig 9, our hybrid model achieves 97.78% accuracy before tuning, indicating the proportion of correctly classified samples. Precision is 96.58%, recall is 97.81%, and the F1-score is 96.89%. Hyperparameter tuning significantly improves the performance of our proposed classifier. After tuning, accuracy increases to 99.79%, precision to 98.82%, recall to 98.76%, and the F1-score to 98.87%.
From Fig 10, it can be observed that the facet grid shows the model’s accuracy at max_depth values of 5, 6, and 7 with varying nrounds and learning rates (η). The colsample_bytree parameter was held constant at 0.6.
The results show that accuracy generally improves with increasing nrounds for each max_depth. However, the improvement in accuracy becomes negligible as nrounds approaches 700. With 560 boosting iterations, the model achieves peak accuracy (99.79%) with a max_depth of 7 and an η of 0.25.
Compared to models with max_depth = 5 (99.7%) and max_depth = 6 (99.6%), the model with max_depth = 7 constructs more complex tree structures, capturing higher-order feature interactions and nonlinear patterns, thereby achieving superior expressiveness. When using η = 0.25, an optimal balance between training speed and stability is attained: this η avoids premature convergence (e.g., η = 0.5 results in accuracy ≤96.6%) while mitigating inefficient optimization observed at lower rates (e.g., η = 0.1 requires 700 rounds to reach only 98.7%). The choice of nrounds = 560 provides sufficient training time for the deeper model (max_depth = 7) to optimize fully, without overfitting (as evidenced by stable accuracy up to 700 rounds). In conclusion, the superiority of the optimal parameter combination (max_depth = 7, η = 0.25, nrounds = 560) stems from the synergistic interplay of model capacity, learning efficiency, and regularization, which collectively adapt to the data’s intrinsic complexity while dynamically balancing overfitting and underfitting risks.
We also explored different base classifier combinations within the stacking framework, including NBC and DT, LR and NBC, LR and DT, and a combination of all three. Our experiments used the full dataset. The combination of all three classifiers yielded the best performance, achieving 98% accuracy. The remaining combinations achieved slightly lower accuracies, ranging from 95% to 98%. These results demonstrate that stacking multiple base classifiers can improve peak performance.
Stacking multiple classifiers generally results in higher accuracy than using individual models. Hyperparameter tuning further enhances the hybrid model’s performance, increasing accuracy, recall, and F1-score. This is because the stacking model considers predictions from multiple models, leading to more robust and diverse predictions than a single model can provide. The high accuracy and effective combination of multiple classifiers make our tuned hybrid model promising for spam detection: it successfully improves both prediction accuracy and the identification of positive samples. Thus, the findings favor the tuned hybrid model proposed in this paper.
The comparison of the findings obtained in the classification process performed in this study with other similar studies in the literature is given in Table 5. In this table, the most successful algorithm and accuracy rates are given. Fig 11 presents a comparison of our results with those of similar studies.
As can be seen, this approach surpasses previous studies in terms of accuracy, generalization and robustness by combining grid search and stacking methods. The constructed model is very effective and efficient in detecting spam emails. Overall, our results show the effectiveness of stacking for enhancing spam classification accuracy in real-world applications.
The increased computational cost of the hybrid model stems from training and predicting with multiple classifiers. However, the training and testing time of our proposed hybrid model does not increase significantly and remains acceptable. The proposed stacking model (99.79% accuracy) can be integrated into enterprise email gateways to reduce false positives in spam classification while blocking malicious content (phishing, malware links). By minimizing manual review efforts, organizations can save operational costs associated with email security management.
As evidenced by Table 5 and Fig 11, this approach advances email security paradigms by balancing accuracy, generalizability, and operational feasibility, offering a deployable solution for next-generation spam mitigation.
Conclusions
This paper reviews and synthesizes the state-of-the-art in spam email detection. We aim to provide a methodological analysis of current research, examining various ML methods for spam detection and identifying areas for improvement in efficiency. The field is moving from traditional spam detection methods toward more complex approaches, with the goal of increasing accuracy and efficiency. This study advances spam email detection by proposing a hybrid stacking ensemble framework that integrates NB, k-NN, LR, and XGBoost classifiers, achieving state-of-the-art accuracy and robustness through hyperparameter-optimized meta-learning.
Our experimental results clearly demonstrate that combining the outputs of multiple base classifiers and fine-tuning them using hyperparameters leads to improvements in accuracy, precision, recall, and F1-score. These technological advances have the potential to improve email system functionality, strengthen spam defenses, and minimize resource usage. Our results validate stacking as a cornerstone technique for next-generation spam detection, bridging the gap between academic innovation and industrial practicality.
However, the current training data sourced from Kaggle and Enron datasets exhibit critical representational gaps that hinder real-world generalization. First, these datasets predominantly contain English-language spam, overlooking prevalent non-English threats such as Chinese phishing campaigns leveraging localized social engineering tactics. Second, emerging attack vectors like AI-generated emails, which mimic writing styles of specific individuals using Deepseek or similar models, are absent from the training corpus. Such omissions create domain adaptation challenges. To address this, future work will focus on integrating transformer-based classifiers for contextual analysis, applying cross-lingual transfer learning (e.g., multilingual BERT fine-tuning), and implementing adversarial training with synthetically generated attack samples, ensuring robustness against evolving threat landscapes. Leveraging these advanced models could mitigate challenges in spam detection, including adapting to changing spam strategies and reducing false positives, ultimately contributing to more resilient and effective solutions.