Abstract
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a property that conventional noise reduction methods do not guarantee. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed on the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
Citation: Hemmatian J, Hajizadeh R, Nazari F (2025) Addressing imbalanced data classification with Cluster-Based Reduced Noise SMOTE. PLoS ONE 20(2): e0317396. https://doi.org/10.1371/journal.pone.0317396
Editor: Agbotiname Lucky Imoize, University of Lagos Faculty of Engineering, NIGERIA
Received: September 7, 2024; Accepted: December 29, 2024; Published: February 10, 2025
Copyright: © 2025 Hemmatian et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Machine Learning (ML) is a subfield of artificial intelligence (AI) that has seen remarkable growth and advancement in recent years [1, 2]. The primary goal of machine learning is to enable machines to learn independently, reducing the need for human intervention [3]. ML algorithms have become integral to various industries, contributing significantly to areas such as medical applications, optical character recognition [4], medical image processing [5], wireless communications [6], software defect prediction [7, 8], self-driving cars [9], and image recognition [10]. For instance, the StackedEnC-AOP [11] model excels in forecasting antioxidant proteins, while the iAFPs-Mv-BiTCN [12] model leverages self-attention transformer embeddings for precise antifungal peptide predictions. Additionally, the AIPs-DeepEnC-GA [13] model combines evolutionary features with a genetic algorithm-based deep ensemble approach to effectively predict anti-inflammatory peptides. These cutting-edge models demonstrate the significant role of machine learning in advancing research. Deepstacked-AVPs and SnTCN-AIPs [14, 15] are also among the state-of-the-art models in this field. The effectiveness of these algorithms is a key driver of progress in these fields [16]. Classification in machine learning is crucial, as it allows machines to make decisions and predictions based on data.
In general, classification is an unavoidable part of machine learning applications. One of the problems is the poor performance of classifiers when dealing with imbalanced data, in which the number of samples in each class differs significantly from the other classes [17]. There are inherent challenges in learning from class-imbalanced data. The skewed distribution of training examples causes standard classifiers to be biased, favoring the majority class and struggling to detect rare instances [18]. Accuracy metrics for classifiers often prove unreliable due to their failure to account for minority classes. For example, in a dataset where 90% of samples represent healthy individuals and only 10% have cancer, this imbalance can severely hinder the model’s ability to accurately identify cancer cases. Consequently, the model may achieve high overall accuracy by correctly classifying the majority class (healthy individuals) while performing poorly in identifying the minority class (cancer cases). Imbalanced datasets can lead to several issues, including:
- a) Bias toward majority class: Models may become biased toward the majority class, resulting in poor performance on the minority class.
- b) Misleading Accuracy: High overall accuracy can be misleading, as it may reflect the model’s ability to predict the majority class rather than its effectiveness in identifying the minority class.
- c) Poor Generalization: The model may struggle to generalize to new, unseen data, particularly for the minority class.
- d) Increased False Negatives: There is a higher likelihood of misclassifying minority class instances, leading to increased false negatives.
In the field of machine learning, data skewness has led many researchers to concentrate on class-imbalanced learning [19]. Addressing the classification problem of imbalanced data is a vital area of research in machine learning. The literature offers a range of methods, as illustrated in Fig 1, including data-level, algorithm-level, and hybrid approaches, to tackle this issue [20]. Below is an overview of key methods and strategies discussed in the literature for addressing class imbalance in machine learning.
Data-level techniques primarily focus on modifying the dataset to create a more balanced representation of classes. This can be achieved by either reducing the number of samples in the majority classes or increasing the number of samples in the minority classes [21]. Currently, data-level methods primarily focus on data preprocessing, utilizing resampling to redistribute the training data across different classes [22]. One advantage of data-level techniques is that resampling and classifier training are independent of each other [23]. Resampling methods can be classified into three types: (i) undersampling, (ii) oversampling, and (iii) combined methods [24]. Fig 2 illustrates an example of an imbalanced dataset that is balanced using two methods: undersampling and oversampling.
Oversampling involves increasing the number of instances in the minority class to match the majority class. A common technique is random oversampling, where instances from the minority class are duplicated. Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic samples for the minority class by interpolating between existing instances [25]. This method has been widely used due to its ability to reduce overfitting while improving model performance on the minority class. Borderline-SMOTE [26] and ADASYN [27] (Adaptive Synthetic Sampling) are variants of SMOTE. They focus on generating synthetic samples near the decision boundary or difficult samples, aiming to create better class separability and improve model performance near the decision boundary.
Random undersampling reduces the size of the majority class by randomly removing samples, making the classes more balanced [28]. However, it risks losing valuable information. Cluster-based undersampling methods, such as k-means [29], are applied to the majority class to select representative samples, preserving diversity in the data while reducing the sample size.
Combining oversampling and undersampling (e.g., SMOTE with undersampling) has been shown to improve model robustness and reduce overfitting and information loss [30].
Algorithm-level methods modify the learning process or the objective function to account for imbalanced classes [31]. These techniques can be categorized into several types, including ensemble-based methods, threshold methods, one-class learning, cost-sensitive learning, and active learning methods [32]. Generally, cost-sensitive methods are more appealing to researchers compared to other available approaches [33]. Cost-sensitive approaches assign higher misclassification costs to the minority class. This can be achieved by modifying the loss function, such as using Weighted Cross-Entropy Loss in neural networks or Gini Impurity for decision trees, where higher penalties are applied to misclassified minority samples [34]. Instead of altering the data or model, thresholds can be adjusted after model training to favor the minority class by shifting the classification decision boundary. Ensemble methods are also common at the algorithm level. Boosting techniques (e.g., AdaBoost, Balanced Random Forest) adjust sample weights or focus on misclassified instances, effectively giving more attention to the minority class [35]. Balanced Random Forests introduce class weights to ensure that both classes are equally represented during the model-building process. Variants like EasyEnsemble and BalanceCascade create balanced subsets of data by combining undersampling with bagging techniques, training separate classifiers for each balanced subset and aggregating their results [36]. Also, ensembles combining boosting and bagging techniques have shown promise in improving classification performance for minority classes.
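As a simple illustration of the cost-sensitive idea, scikit-learn exposes a `class_weight` parameter that scales the loss inversely to class frequencies; the dataset and model choice below are illustrative assumptions, not from the paper:

```python
# Sketch of cost-sensitive learning via class weighting, assuming
# scikit-learn; class_weight="balanced" penalizes minority-class
# errors more heavily without touching the data itself.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

# Minority-class recall typically rises once errors on it cost more.
r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(r_plain, r_weighted)
```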
Hybrid approaches to overcoming imbalanced data combine data-level and algorithm-level techniques, leveraging the strengths of different methods to achieve better overall performance [37]. Some studies propose the hybridization of sampling and cost-sensitive learning [38]. SMOTE with cost-sensitive algorithms applies SMOTE to balance the dataset, followed by training with a cost-sensitive algorithm to further optimize performance for the minority class. Hybrid sampling with ensemble learning combines oversampling the minority class and undersampling the majority class, followed by training an ensemble of classifiers. SMOTEBoost [39] and Random Undersampling Boosting (RUSBoost) [40] are two notable examples that integrate data balancing within ensemble learning frameworks.
In addition, deep learning models have been adapted to address class imbalance by modifying architectures, loss functions, or training techniques. In neural networks, loss functions are adjusted to penalize minority class misclassification more heavily (Class-weighted Loss Functions). Weighted cross-entropy loss is commonly used, but more advanced options like Focal Loss focus on harder-to-classify samples by reducing the importance of easy examples [41]. In computer vision, augmentation techniques such as rotations, flips, and color variations can generate synthetic data for minority classes. Generative Adversarial Networks (GANs) have been used to generate new samples for minority classes, particularly in image and text data [42].
Furthermore, imbalanced data scenarios can also be treated as anomaly detection problems, where the minority class is considered an anomaly. Models like autoencoders or One-Class SVMs are trained on the majority class, and instances of the minority class are detected as outliers during testing [43].
Since traditional metrics like accuracy are misleading in imbalanced scenarios, researchers use metrics that better capture the performance on minority classes. Precision, Recall, and F1-score metrics focus on the performance specific to each class, providing insights into the true positive and false positive rates. Area Under the ROC Curve (AUC-ROC) considers the true positive rate against the false positive rate, which is helpful for evaluating models on imbalanced data. When classes are highly imbalanced, the precision-recall curve is often a more reliable metric than AUC-ROC, as it emphasizes minority class performance [44].
Addressing class imbalance in machine learning remains a multi-faceted challenge that requires careful consideration of data characteristics, domain-specific requirements, and computational resources. The choice of method(s) often depends on the dataset, application context, and model complexity. For instance, resampling and ensemble techniques are commonly applied in traditional ML algorithms, whereas cost-sensitive learning and custom loss functions are popular in deep learning. Hybrid approaches, combining multiple methods, tend to yield better results in highly imbalanced settings, as they leverage the advantages of data-level and algorithm-level adjustments.
1.1 Motivation and contributions
This article primarily focuses on data-level approaches. Typically, proposed data-level methods do not consider classification applications and merely aim to balance the data samples. In contrast, our proposed method addresses imbalanced data with a focus on classification applications. We introduce a three-stage oversampling method similar to the Reduced Noise Synthetic Minority Oversampling Technique (RN-SMOTE) [45], which is a state-of-the-art approach in this area. Our method involves oversampling, removing noisy samples, and then oversampling again. In the removal step, the classification concept is applied to ensure that noise reduction does not split samples of a class into more than two clusters.
The proposed Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) method addresses the critical challenge of imbalanced data classification by integrating SMOTE and DBSCAN techniques. Traditional methods often neglect the clustering characteristics of minority class samples, leading to the creation of multiple clusters within a single category. CRN-SMOTE innovatively focuses on maintaining one or two concentrated clusters, which is essential for effective classification. By systematically balancing the data, identifying and removing noisy samples, and controlling the clustering process, this method enhances the quality of synthetic samples. The contribution lies in its ability to preserve the integrity of class distributions while significantly reducing noise, ultimately improving classification performance. This approach not only outperforms existing methods like RN-SMOTE but also provides a robust framework for handling imbalanced datasets across various applications. The significant contributions are outlined below:
- Integrate SMOTE with a cluster-based approach to create high-quality synthetic samples for minority classes, addressing the limitations of traditional oversampling techniques.
- Enforce a constraint on the number of clusters formed per category in the noise reduction step, ensuring that samples remain concentrated in one or two clusters, which is crucial for effective classification.
- Demonstrate significant improvements in key performance metrics (Kappa, MCC, F1-score, Precision, and Recall) compared to state-of-the-art methods.
- Offer empirical evidence of its effectiveness through rigorous evaluation on multiple imbalanced datasets, showcasing its superiority in maintaining class integrity and enhancing model performance.
The rest of the manuscript is organized as follows: In Section 2, we review and evaluate related works and existing methods. Section 3 introduces the proposed method and discusses how to adjust the parameters. In Section 4, titled "Simulation Results," we present the dataset, the metrics for evaluating imbalanced data, and the simulation results, followed by a discussion of these results. Future work is outlined in Section 5. Finally, Section 6 draws conclusions based on the findings.
2. Related works
In this study, we propose a novel cluster-based Reduced Noise SMOTE approach that can be integrated into state-of-the-art data-level methods to address the issue of imbalanced data, specifically the RN-SMOTE method [45]. This section reviews both the SMOTE and RN-SMOTE methods, followed by an overview of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a density-based technique for noise reduction [46].
2.1. Synthetic Minority Oversampling Technique (SMOTE)
An effective data-level method for addressing imbalanced data is SMOTE [25]. SMOTE is an oversampling technique used to balance the original training dataset. Rather than simply repeating minority class samples, the key idea behind SMOTE is to generate synthetic samples. These synthetic samples are specifically created to mimic the characteristics of the original data.
Suppose X is a data matrix with m different classes. XCi represents the matrix of data for class Ci (where i = 1, …, m), and nCi is the number of samples in class Ci. In cases of imbalanced data, there is typically at least one class with significantly more samples (nCmax) compared to the others.
In SMOTE, after identifying the majority class, the number of synthetic samples to be generated for each minority class is calculated. For each minority class, synthetic samples are created using the original samples and their neighbors. The K (i.e., KSMOTE) nearest neighbors (NK(x)) of each sample (x) are determined using the Euclidean distance. The synthetic sample (xsyn) is then generated using the following equation:

xsyn = x + β(x̂ − x), (1)

where x̂ is a randomly selected sample from the neighbors of x, i.e., x̂ ∈ NK(x), and β is a random value between 0 and 1. The synthetic sample is incorporated into the original samples, and this process iterates until the number of samples in the minority class equals that of the majority class. In the end, the number of samples in each class is equal to nCmax. Algorithm 1 provides a detailed overview of the upsampling process in imbalanced datasets using SMOTE.
Algorithm 1. SMOTE(KSMOTE, X).
Input: Imbalanced data (X), KSMOTE (number of neighbors)
Output: Balanced data
• Determine the number of samples of the class with the maximum samples (nCmax)
• For each minority class with sample number nCi:
set the number of synthetic data to zero: Csyn_data = 0,
while Csyn_data < nCmax − nCi:
Randomly select a sample from XCi, denoted as x,
Determine the K nearest neighbors of x, referred to as NK(x),
Randomly select a sample from NK(x), denoted as x̂,
Generate the synthetic sample: xsyn = x + β(x̂ − x),
Insert the synthetic sample (xsyn) into the imbalanced data X,
Increment the count of synthetic data: Csyn_data = Csyn_data + 1
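The interpolation step of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative simplification (fixed seed, helper name `smote_oversample`, and K value are all assumptions), not the authors' implementation:

```python
# Minimal NumPy sketch of the SMOTE interpolation in Algorithm 1:
# x_syn = x + beta * (x_hat - x), where x_hat is drawn from the K
# nearest neighbours of a randomly chosen minority sample x.
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    synth = []
    for _ in range(n_new):
        x = X_min[rng.integers(len(X_min))]
        # K nearest neighbours by Euclidean distance (excluding x itself).
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        x_hat = X_min[rng.choice(nn)]
        beta = rng.random()                  # beta in [0, 1)
        synth.append(x + beta * (x_hat - x))
    return np.asarray(synth)

X_min = np.random.default_rng(1).normal(size=(20, 3))
new = smote_oversample(X_min, n_new=10)
print(new.shape)          # (10, 3)
```

Because each synthetic point is a convex combination of two existing minority samples, it always lies within the bounding box of the original minority class.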
This type of data oversampling is widely used in fields such as data science and machine learning. It provides a secure platform for algorithms to improve their performance without compromising privacy or the safety of real data. Additionally, it can enhance existing datasets, particularly in cases where the original data is limited or biased. SMOTE is a valuable tool for generating synthetic data and creating balance between classes in imbalanced datasets.
These new data points are created by interpolating between several minority class samples defined within a neighborhood. For this reason, the approach is said to focus on the "feature space" rather than the "data space." In other words, the algorithm considers the values and relationships of features rather than individual data points [47].
2.2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
The DBSCAN algorithm, introduced by Martin Ester et al. in 1996, is a density-based clustering method capable of discovering arbitrary clusters within a dataset [48]. One of its main advantages over other clustering methods is its ability to identify clusters of various shapes without requiring prior knowledge of the number of clusters. DBSCAN is capable of identifying areas dense enough to be classified as clusters [49]. This clustering method relies on specific user-defined parameters, including the neighborhood radius (ε) and the minimum number of points [50] within the neighborhood. In DBSCAN, each data point can be categorized into one of the following three types: core points, noise points, and border points.
A data point is labeled as a core point when it is surrounded by a minimum number of neighboring points (including itself) within a proximity of ε. These core points serve as the focal points of a cluster. A data point is classified as a noise point, or outlier, when it lacks sufficient neighboring points within the specified distance (ε) and does not meet the criteria to be a core point. Noise points are not part of any cluster. A data point that is not a core point but lies within the ε distance of a core point is referred to as a border point. Border points are on the edge of a cluster and may be considered part of the cluster or noise, depending on the clustering criteria [51]. Fig 3 illustrates the placement of all three types of data points based on a minimum of 5 samples and an ε radius.
The steps of the DBSCAN algorithm are given in Algorithm 2, in which border points are considered part of the cluster. In DBSCAN, the parameters ε and N are usually set empirically. An algorithm called K-distance has also been introduced that operates on the distances among samples. The K-distance algorithm calculates the distances between a point and its K nearest neighbors in the dataset. These distances are used to determine the density reachability and core distance, which are essential for identifying clusters and outliers in DBSCAN.
Algorithm 2. DBSCAN(ε, N).
Input: input data (D), neighborhood radius (ε), minimum number of samples (N)
Output: Remove data points detected as noise and provide clustered, clean data.
For each sample in D, denoted as di:
• Calculate the number of samples within the neighborhood radius ε of di, referred to as Ni,
• If Ni ≥ N, categorize di as a core point,
• Else:
• If there is a core point in the neighborhood of di, categorize di as a border point.
• Otherwise, categorize di as a noise point.
Remove noise points and cluster the data points that are connected to one another.
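A minimal sketch of Algorithm 2 using scikit-learn's `DBSCAN` (an assumed implementation): points labelled −1 are noise and are dropped, and the remaining labels index the clusters. The ε and minimum-sample values are illustrative, not tuned:

```python
# Sketch of DBSCAN-based noise removal with scikit-learn.
# eps and min_samples are illustrative placeholder values.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[10.0, 10.0]]])         # one obvious outlier

labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
X_clean = X[labels != -1]                   # remove noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, len(X) - len(X_clean))
```

Here the two dense blobs each become one cluster, while the isolated point has no core point within ε and is discarded as noise.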
2.3. Reduced Noise-SMOTE (RN-SMOTE)
The RN-SMOTE algorithm is a data-level oversampling method designed to address the challenges of imbalanced data. As illustrated in Fig 4, RN-SMOTE consists of three primary stages that enhance class distribution and minimize the negative impact of noise on synthetic samples.
In the initial stage, RN-SMOTE applies the SMOTE technique to balance the class distribution by generating synthetic samples for the minority class. In the second stage, the RN-SMOTE algorithm employs the DBSCAN algorithm to mitigate the noise effect on the synthetic samples generated by SMOTE. DBSCAN identifies clusters of high-density data points and separates them from low-density regions, which are often considered noise. By applying DBSCAN, RN-SMOTE aims to refine the synthetic samples and reduce the potential negative impact of noise on the model’s performance. In the final stage, the RN-SMOTE algorithm ensures that the class distribution is well-balanced, without being biased towards the majority class. In this stage, the SMOTE algorithm is again utilized to restore balance to the data, making it ready for classification.
This balanced class distribution contributes to improved model performance and a more accurate understanding of the underlying patterns in the data. In summary, the RN-SMOTE algorithm combines SMOTE and DBSCAN to balance class distribution and reduce noise. This approach leads to more robust and accurate models of classification when dealing with imbalanced datasets [45].
3. Proposed method: Cluster-based RN-SMOTE (CRN-SMOTE)
In this study, a cluster-based approach using the SMOTE and DBSCAN techniques is presented to address imbalanced data classification. Common oversampling methods pay no attention to the distribution of the synthetic samples within each class, which can scatter the samples of a category across several clusters, whereas the samples of a category usually form one or two concentrated clusters. The proposed method, similar to RN-SMOTE, consists of three stages. First, the data are balanced using the SMOTE technique. Then, the noisy samples are identified and removed to reduce the noise effect, with a limit on the number of clusters created per category. The number of clusters is set to one or two, because the samples belong to the same category. Finally, after the noise reduction, the data are balanced again using SMOTE. The proposed method is called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE). The algorithm of the proposed CRN-SMOTE is given in the following:
Algorithm 3. Proposed cluster-based RN-SMOTE.
Input: imbalanced data (X), Ncluster (desired number of clusters), N (minimum number of samples), KSMOTE
Output: balanced data.
• Balance the samples of different categories using SMOTE(KSMOTE, X)
• For each SMOTEed category XCi (Ci ≠ Cmax):
• Set ε to a small positive value (e.g., ε = 0.1)
• Repeat:
Apply DBSCAN(ε, N) to remove noise samples from the category,
Calculate the number of sample clusters in the residual samples, denoted as ncluster(Ci),
If 0 < ncluster(Ci) ≤ Ncluster: end repeat,
Else: increase ε by a small positive value: ε + δ → ε (where δ is a small positive value)
• Concatenate the category with the maximum samples with the cluster-based reduced-noise categories: XDBSCAN
• Balance the samples of different categories in XDBSCAN using SMOTE(KSMOTE, XDBSCAN), resulting in XCRN-SMOTE
In Algorithm 3, after balancing the categories of the dataset using SMOTE, noisy samples are removed from the SMOTEed categories (XCi), with a limit on the number of clusters formed by the residual samples. A modified DBSCAN method is employed to eliminate noisy samples while enforcing the desired number of clusters for each category. In the proposed cluster-based method, unlike the standard DBSCAN approach, there is no need to predefine the neighborhood radius (ε). Instead, ε is initialized to a small positive value along with a specified minimum number of samples. This initial parameter setting for DBSCAN typically leads to the formation of a large number of clusters among the residual samples (usually more than two). Subsequently, the ε value is incrementally increased by a small positive value (δ) until the limitation on the number of clusters is satisfied:
0 < ncluster(Ci) ≤ Ncluster, (2)

where ncluster(Ci) is the number of clusters in the residual samples for category Ci, and Ncluster is the desired number of clusters.

In the proposed method, the desired number of clusters is set to one or two, as the samples belong to the same category, which is crucial for classification applications. Therefore, removing noisy samples should not split the samples of a category into more than two clusters. X̃Ci expresses the residual samples of category Ci (where i = 1, …, m and Ci ≠ Cmax). By concatenating the residual data from the reduced-noise categories with the category that has the maximum samples, the data matrix XDBSCAN is constructed as follows:
XDBSCAN = [XCmax; X̃Ci], i = 1, …, m, Ci ≠ Cmax. (3)

Finally, the balanced data (XCRN-SMOTE) are obtained by applying the SMOTE technique to XDBSCAN as follows:

XCRN-SMOTE = SMOTE(KSMOTE, XDBSCAN). (4)
In the proposed CRN-SMOTE, the cluster-based approach effectively reduces noise in the data by ensuring that samples from each category form one or two clusters. This is crucial as it helps prevent the misclassification of minority class samples that could occur due to noise in the dataset. Unlike many traditional data-level methods that merely aim to balance datasets without considering classification implications, CRN-SMOTE is specifically designed with classification performance in mind. This results in a more targeted approach to addressing imbalanced data issues. However, while the focus on clustering can enhance performance, it may also introduce implementation complexity. Additionally, the effectiveness of CRN-SMOTE depends on the quality of the clustering process; if the clustering parameters are not optimally set or if the underlying data structure is complex, it may lead to poor performance or misclassification.
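The cluster-limited noise removal at the heart of Algorithm 3 can be sketched as follows. The function name, starting ε, step δ, and iteration cap are illustrative assumptions; the code only demonstrates the grow-ε-until-the-cluster-limit-holds idea, not the authors' exact implementation:

```python
# Sketch of Algorithm 3's inner loop: start from a small eps and grow it
# by delta until DBSCAN leaves at most n_cluster clusters for a category.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def cluster_limited_denoise(Xc, n_cluster=1, min_samples=5,
                            eps=0.1, delta=0.05, max_iter=500):
    """Remove DBSCAN noise while capping the cluster count at n_cluster."""
    for _ in range(max_iter):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Xc)
        n_found = len(set(labels)) - (1 if -1 in labels else 0)
        if 0 < n_found <= n_cluster:
            return Xc[labels != -1], eps
        eps += delta                      # grow the neighbourhood radius
    raise RuntimeError("cluster limit not reached")

Xc, _ = make_blobs(n_samples=150, centers=[[0, 0], [3, 3]],
                   cluster_std=0.4, random_state=0)
clean, eps_used = cluster_limited_denoise(Xc, n_cluster=2)
print(len(clean), round(eps_used, 2))
```

Because ε keeps growing, small fragments eventually merge, so the loop terminates once the category's residual samples form at most the desired one or two clusters.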
4. Experiments and results
The proposed CRN-SMOTE method is evaluated across a range of imbalanced datasets from diverse domains. Performance comparisons between CRN-SMOTE and RN-SMOTE (the state-of-the-art method) are conducted using standard metrics for evaluating imbalanced data classification, including Kappa, Precision, Recall, F1-score, and MCC [52, 53]. Fig 5 illustrates the methodology used for evaluating the various datasets. For datasets with sufficient samples, the data is split into training and test sets, with the training dataset balanced using conventional methods to enhance classifier performance. Additionally, K-fold cross-validation is employed for datasets with fewer samples, where the training folds are balanced and fed into the classifier. The datasets and evaluation metrics are then detailed. The efficacy of the proposed method is assessed using three prominent classifiers—Support Vector Machine (SVM), Random Forest (RF), and AdaBoost (ADA)—utilizing their default parameters.
Furthermore, the results of the proposed CRN-SMOTE method are compared with two other state-of-the-art SMOTE-based methods: SMOTE-Tomek Link [54] and SMOTE-ENN [55] for further investigation.
4.1. Datasets
In this study, four conventional imbalanced datasets from the UCI Machine Learning Repository are utilized, and their specific details are provided in Table 1 [56]. These datasets are normalized to ensure that features with different scales do not bias the classifiers.
4.2. Evaluation metrics
One of the significant challenges in working with imbalanced datasets is selecting appropriate evaluation criteria. Due to the inherent imbalance in class distribution, the accuracy metric may not be sufficient for assessing the performance of classification models. In this study, we utilized a set of criteria that have proven effective in evaluating imbalanced datasets, including Kappa, MCC, F1 score, Precision, and Recall. These metrics are derived from the confusion matrix, a visual representation (such as Table 2) that summarizes the model’s performance by comparing predicted labels to actual labels.
A confusion matrix is a valuable tool used to assess the performance of a machine learning model or classification algorithm by comparing predictions to actual true labels. Table 3 presents the formulas for common metrics derived from a confusion matrix.
In the kappa metric, po represents the relative observed agreement among raters, while pe denotes the hypothetical probability of chance agreement. For a binary confusion matrix, these parameters are defined as follows:

po = (TP + TN) / (TP + TN + FP + FN), (5)

pe = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / (TP + TN + FP + FN)², (6)
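A short worked example of Cohen's kappa computed from these definitions, checked against scikit-learn's implementation (the toy labels are illustrative):

```python
# Worked example: Cohen's kappa from a binary confusion matrix using the
# standard po and pe definitions, verified against scikit-learn.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
n = tn + fp + fn + tp
p_o = (tp + tn) / n                                            # observed agreement
p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 4), round(cohen_kappa_score(y_true, y_pred), 4))
```

Here po = 0.8 and pe = 0.58, giving κ ≈ 0.524: good agreement once the chance level implied by the skewed class distribution is discounted.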
4.3. Simulation results
In the following, the results (the values of the evaluation metrics) are presented for different settings on the four mentioned datasets using Support Vector Machine (SVM), Random Forest (RF), and AdaBoost (ADA) classifiers. The results are calculated on the original imbalanced data, on data balanced using the state-of-the-art RN-SMOTE method, and on data balanced using the proposed CRN-SMOTE method with cluster sizes of 1 (1CRN-SMOTE) and 2 (2CRN-SMOTE). In all cases, the data are balanced using the SMOTE(KSMOTE, X) method, where KSMOTE is set to 4, 5, and 6. A 10-fold cross-validation methodology is applied so that the results do not depend on a particular split of the dataset, and the average of the results is reported. Tables 4–6 show the results using the SVM, ADA, and RF classifiers, respectively.
The results presented in Tables 4–6 demonstrate the superior performance of the proposed cluster-based CRN-SMOTE method in comparison to the state-of-the-art RN-SMOTE. The results show that both 1CRN-SMOTE and 2CRN-SMOTE improve the classifiers’ performance based on conventional evaluation metrics for imbalanced datasets in most settings.
For a more comprehensive investigation, the results of RN-SMOTE, 1CRN-SMOTE, and 2CRN-SMOTE on each dataset using all three classifiers (SVM, Random Forest, and AdaBoost) are given in Figs 6–9. The average of the metric values is reported for KSMOTE = 5 using a 10-fold cross-validation evaluation.
The results again confirm the superior performance of the proposed CRN-SMOTE method compared to the RN-SMOTE approach. The cluster-based CRN-SMOTE techniques demonstrate improved classification performance across the various datasets and classifiers evaluated.
Finally, the results of the proposed CRN-SMOTE (i.e., 1CRN-SMOTE) are compared with SMOTE-Tomek Link and SMOTE-ENN, two state-of-the-art SMOTE-based methods, as well as RN-SMOTE. The results are provided for KSMOTE = 5 and the RF classifier across all four mentioned datasets (see Tables 7 and 8). The findings demonstrate that the proposed CRN-SMOTE method outperforms the other state-of-the-art methods in most cases. On the Blood dataset, SMOTE-ENN outperforms the proposed CRN-SMOTE method, while on the three other datasets, the proposed 1CRN-SMOTE demonstrates the best performance.
4.4. Discussion
In this subsection, we compare the standard DBSCAN with the proposed method for removing noisy samples. The ε value significantly affects the performance of DBSCAN and must be adapted to the dataset’s density. Determining the appropriate fraction value in the K-distance algorithm is empirical and is typically based on the distances between data points. In classification applications, it is expected that samples belonging to the same category form a single cluster. However, using DBSCAN to remove noisy samples often results in multiple clusters for each category, which is unsuitable for classification tasks. This can negatively impact performance when addressing imbalanced data classification.
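The k-distance heuristic for choosing DBSCAN's ε, and its empirical "fraction" knob, can be sketched as follows using scikit-learn; the 95th-percentile cut-off below is an illustrative choice, not the value used by RN-SMOTE.

```python
# Sketch: choose DBSCAN's eps from the sorted k-distance curve, then
# cluster. The percentile used as the cut-off is the empirical knob
# discussed above; it must be adapted to the dataset's density.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

k = 4  # number of neighbors considered per point (min_samples - 1)
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])        # distance to each point's k-th neighbor

eps = np.percentile(k_dist, 95)       # empirical cut-off ("fraction")
labels = DBSCAN(eps=eps, min_samples=k + 1).fit_predict(X)

# Points labelled -1 are treated as noise; the rest form clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, int((labels == -1).sum()))
```

Because the cut-off is data dependent, the same percentile can yield one cluster per class on one dataset and a fragmented clustering on another, which is the behavior the discussion above highlights.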
Fig 10 illustrates a case that exemplifies the challenges observed in the maternal health risk dataset [33], which consists of three distinct categories. The results indicate that DBSCAN divides the samples from minority classes into six clusters (Fig 10a and 10b), which may not be ideal for classification purposes. In contrast, our proposed method groups samples from each category into one or two clusters. The results presented in Table 9 demonstrate the effectiveness of our approach, as indicated by the classification metrics, which were obtained using the Random Forest classifier. In RN-SMOTE, the standard DBSCAN algorithm is applied.
5. Future work
A systematic study can be conducted to analyze how the oversampling process, facilitated by SMOTE, interacts with the denoising phase using DBSCAN. Understanding this relationship is crucial for identifying whether specific oversampling strategies yield better noise reduction outcomes or if particular noise characteristics influence the effectiveness of synthetic sample generation. For future work, we propose the following two suggestions:
- 1) Evaluating the Correlation between Oversampling and Noise Reduction: Investigating the correlation between the oversampling step and the noise reduction phase presents an attractive research direction that could significantly enhance the CRN-SMOTE method. This analysis could reveal insights into how these two processes influence each other and lead to more effective strategies for handling imbalanced datasets.
- 2) Implementing a Mutual Neighborhood Check in SMOTE: We recommend incorporating a mutual neighborhood check within the SMOTE algorithm by utilizing a fixed number of neighbors. This approach aims to improve the quality of synthetic sample generation by ensuring that newly created samples are based not only on the closest minority class neighbors but also on their mutual relationships. By doing so, we can enhance the representativeness of the synthetic samples and mitigate the impact of noise.
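The mutual neighborhood check suggested above can be sketched as follows; this is an illustrative interpretation of the proposal, not the authors' implementation: a synthetic sample is interpolated only when two minority-class points are each among the other's k nearest neighbors.

```python
# Sketch of SMOTE-style interpolation restricted to mutual k-nearest
# neighbors among minority-class samples (illustrative, k = 5).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(30, 2))      # stand-in minority-class samples
k = 5

# Column 0 of idx is each point itself, so neighbors are idx[:, 1:].
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)
neigh = [set(row[1:]) for row in idx]

synthetic = []
for i, row in enumerate(idx):
    for j in row[1:]:
        if i in neigh[j]:             # mutual neighborhood check
            gap = rng.random()        # random point on the segment i -> j
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))

synthetic = np.array(synthetic)
print(synthetic.shape)
```

Restricting interpolation to mutual neighbor pairs skips one-sided neighborhoods, where a point's nearest neighbor may be an outlier, which is the noise-mitigation rationale of the suggestion.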
6. Conclusion
The study presents a novel approach to addressing the prevalent issue of imbalanced data classification in machine learning through the introduction of Cluster-Based Reduced Noise SMOTE (CRN-SMOTE). This method effectively integrates the strengths of Synthetic Minority Over-sampling Technique (SMOTE) with a cluster-based noise reduction strategy, thereby enhancing the quality of synthetic samples generated for minority classes.
The results of the experiments conducted on four distinct imbalanced datasets demonstrate that CRN-SMOTE consistently outperforms the existing state-of-the-art methods, Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN, across all evaluation metrics, including Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. The significant improvements observed—averaging 6.6% in Kappa and 4.01% in MCC—underscore the effectiveness of the proposed method in improving classification performance, particularly in challenging datasets such as QSAR and Maternal Health Risk.
In conclusion, the CRN-SMOTE method not only addresses the critical challenge of imbalanced data but also preserves the integrity of class distributions while significantly reducing noise. Its systematic approach to oversampling, noise reduction, and clustering control establishes a robust framework that can be applied across various domains and classification tasks. The empirical evidence presented validates its superiority over traditional methods, making it a valuable contribution to the field of machine learning and data classification.
References
- 1. Sarkar C, Das B, Rawat VS, Wahlang JB, Nongpiur A, Tiewsoh I, et al. Artificial intelligence and machine learning technology driven modern drug discovery and development. International Journal of Molecular Sciences. 2023 Jan 19;24(3):2026. pmid:36768346
- 2. Badillo S, Banfai B, Birzele F, Davydov II, Hutchinson L, Kam‐Thong T, et al. An introduction to machine learning. Clinical pharmacology & therapeutics. 2020 Apr;107(4):871–85. pmid:32128792
- 3. Mallick PK, Borah S, editors. Emerging Trends and Applications in Cognitive Computing.
- 4. Hajizadeh R, Aghagolzadeh A, Ezoji M. Fusion of LLE and stochastic LEM for Persian handwritten digits recognition. International Journal on Document Analysis and Recognition (IJDAR). 2018 Jun;21:109–22.
- 5. Wang Z, Li X, Duan H, Su Y, Zhang X, Guan X. Medical image fusion based on convolutional neural networks and non-subsampled contourlet transform. Expert Systems with Applications. 2021 Jun 1;171:114574.
- 6. Haq I., Soomro J. A., Mazhar T., Ullah I., Shloul T. A., Ghadi Y. Y., et al. (2023). Impact of 3G and 4G technology performance on customer satisfaction in the telecommunication industry. Electronics, 12(7), 1697.
- 7. Ali M., Mazhar T., Al-Rasheed A., Shahzad T., Ghadi Y. Y., & Khan M. A. (2024). Enhancing software defect prediction: a framework with improved feature selection and ensemble machine learning. PeerJ Computer Science, 10, e1860. pmid:39669467
- 8. Ali, M., Mazhar, T., Arif, Y., Al-Otaibi, S., Ghadi, Y. Y., Shahzad, T., et al. (2024). Software defect prediction using an intelligent ensemble-based model. IEEE Access.
- 9. Muthalagu R, Bolimera A, Kalaichelvi V. Lane detection technique based on perspective transformation and histogram analysis for self-driving cars. Computers & Electrical Engineering. 2020 Jul 1;85:106653.
- 10. Zhong F., Chen Z., Ning Z., Min G. and Hu Y., 2018. Heterogeneous visual features integration for image recognition optimization in internet of things. Journal of computational science, 28, pp.466–475.
- 11. Rukh G., Akbar S., Rehman G., Alarfaj F. K., & Zou Q. (2024). StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning. BMC bioinformatics, 25(1), 256. pmid:39098908
- 12. Akbar S., Zou Q., Raza A., & Alarfaj F. K. (2024). iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artificial Intelligence in Medicine, 151, 102860. pmid:38552379
- 13. Raza A., Uddin J., Zou Q., Akbar S., Alghamdi W., & Liu R. (2024). AIPs-DeepEnC-GA: Predicting anti-inflammatory peptides using embedded evolutionary and sequential feature integration with genetic algorithm based deep ensemble model. Chemometrics and Intelligent Laboratory Systems, 254, 105239.
- 14. Akbar S., Raza A., & Zou Q. (2024). Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC bioinformatics, 25(1), 102. pmid:38454333
- 15. Raza A., Uddin J., Almuhaimeed A., Akbar S., Zou Q., & Ahmad A. (2023). AIPs-SnTCN: Predicting anti-inflammatory peptides using fastText and transformer encoder-based hybrid word embedding with self-normalized temporal convolutional networks. Journal of chemical information and modeling, 63(21), 6537–6554. pmid:37905969
- 16. Chen Z, Liu B. Lifelong machine learning. Springer Nature; 2022 Jun 1.
- 17. Al-Sawwa J, Ludwig SA. Performance evaluation of a cost-sensitive differential evolution classifier using spark–Imbalanced binary classification. Journal of Computational Science. 2020 Feb 1;40:101065.
- 18. Kim M, Hwang KB. An empirical evaluation of sampling methods for the classification of imbalanced data. PLoS One. 2022 Jul 28;17(7):e0271260. pmid:35901023
- 19. Zuo T, Li F, Zhang X, Hu F, Huang L, Jia W. Stroke classification based on deep reinforcement learning over stroke screening imbalanced data. Computers and Electrical Engineering. 2024 Mar 1;114:109069.
- 20. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. Journal of big data. 2019 Dec;6(1):1–54.
- 21. García V, Sánchez JS, Mollineda RA. Exploring the performance of resampling strategies for the class imbalance problem. In Trends in Applied Intelligent Systems: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2010, Cordoba, Spain, June 1–4, 2010, Proceedings, Part I 23 2010 (pp. 541–549). Springer Berlin Heidelberg.
- 22. García V, Sánchez JS, Mollineda RA. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowledge-Based Systems. 2012 Feb 1;25(1):13–21.
- 23. Nguyen HM, Cooper EW, Kamei K. A comparative study on sampling techniques for handling class imbalance in streaming data. In The 6th International Conference on Soft Computing and Intelligent Systems, and The 13th International Symposium on Advanced Intelligence Systems 2012 Nov 20 (pp. 1762–1767). IEEE.
- 24. Burnaev E, Erofeev P, Papanov A. Influence of resampling on accuracy of imbalanced classification. In Eighth international conference on machine vision (ICMV 2015) 2015 Dec 8 (Vol. 9875, pp. 423–427). SPIE.
- 25. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research. 2002 Jun 1; 16:321–57.
- 26. Han, H., Wang, W. Y., & Mao, B. H. (2005, August). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing (pp. 878–887). Berlin, Heidelberg: Springer Berlin Heidelberg.
- 27. He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence) (pp. 1322–1328). IEEE.
- 28. Prusa, J., Khoshgoftaar, T. M., Dittman, D. J., & Napolitano, A. (2015, August). Using random undersampling to alleviate class imbalance on tweet sentiment data. In 2015 IEEE international conference on information reuse and integration (pp. 197–202). IEEE.
- 29. Hamerly G., & Elkan C. (2003). Learning the k in k-means. Advances in neural information processing systems, 16.
- 30. Zeng, M., Zou, B., Wei, F., Liu, X., & Wang, L. (2016, May). Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS) (pp. 225–228). IEEE.
- 31. Alhakbani H. Handling class imbalance using swarm intelligence techniques, hybrid data and algorithmic level solutions (Doctoral dissertation, Goldsmiths, University of London).
- 32. Spelmen VS, Porkodi R. A review on handling imbalanced data. In 2018 international conference on current trends towards converging technologies (ICCTCT) 2018 Mar 1 (pp. 1–11).
- 33. Zhou ZH, Liu XY. On multi‐class cost‐sensitive learning. Computational Intelligence. 2010 Aug;26(3):232–57.
- 34. Rezaei-Dastjerdehei, M. R., Mijani, A., & Fatemizadeh, E. (2020, November). Addressing imbalance in multi-label classification using weighted cross entropy loss function. In 2020 27th national and 5th international iranian conference on biomedical engineering (ICBME) (pp. 333–338). IEEE.
- 35. Rezaei-Dastjerdehei, M. R., Mijani, A., & Fatemizadeh, E. (2020, November). Addressing imbalance in multi-label classification using weighted cross entropy loss function. In 2020 27th national and 5th international iranian conference on biomedical engineering (ICBME) (pp. 333–338). IEEE.
- 36. Liu X. Y., Wu J., & Zhou Z. H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539–550. pmid:19095540
- 37. Khushi M, Shaukat K, Alam TM, Hameed IA, Uddin S, Luo S, et al. A comparative performance analysis of data resampling methods on imbalance medical data. IEEE Access. 2021 Aug 3; 9:109960–75.
- 38. Wang S, Li Z, Chao W, Cao Q. Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning. In The 2012 international joint conference on neural networks (IJCNN) 2012 Jun 10 (pp. 1–8).
- 39. Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In Knowledge Discovery in Databases: PKDD 2003: 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, September 22–26, 2003. Proceedings 7 (pp. 107–119). Springer Berlin Heidelberg.
- 40. Seiffert C., Khoshgoftaar T. M., Van Hulse J., & Napolitano A. (2009). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE transactions on systems, man, and cybernetics-part A: systems and humans, 40(1), 185–197.
- 41. Leng, Z., Tan, M., Liu, C., Cubuk, E. D., Shi, X., Cheng, S., et al. (2022). Polyloss: A polynomial expansion perspective of classification loss functions. arXiv 2022. arXiv preprint arXiv:2204.12511.
- 42. Mullick, S. S., Datta, S., & Das, S. (2019). Generative adversarial minority oversampling. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1695–1704).
- 43. Leevy J. L., Hancock J., Khoshgoftaar T. M., & Zadeh A. A. (2023, November). One-Class Classifier Performance: Comparing Majority versus Minority Class Training. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 86–91). IEEE.
- 44. Gaudreault, J. G., Branco, P., & Gama, J. (2021, October). An analysis of performance metrics for imbalanced classification. In International Conference on Discovery Science (pp. 67–77). Cham: Springer International Publishing.
- 45. Arafa A, El-Fishawy N, Badawy M, Radad M. RN-SMOTE: Reduced Noise SMOTE based on DBSCAN for enhancing imbalanced data classification. Journal of King Saud University-Computer and Information Sciences. 2022 Sep 1;34(8):5059–74.
- 46. Sawant K. Adaptive methods for determining DBSCAN parameters. International Journal of Innovative Science, Engineering & Technology. 2014 Jun;1(4):329–34.
- 47. Fernández A, Garcia S, Herrera F, Chawla NV. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of artificial intelligence research. 2018 Apr 20; 61:863–905.
- 48. Ester M., Kriegel H. P., Sander J., & Xu X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd (Vol. 96, No. 34, pp. 226–231).
- 49. Ahmed KN, Razak TA. A comparative study of different density based spatial clustering algorithms. International Journal of Computer Applications. 2014 Aug; 975:8887.
- 50. Gunawan A, de Berg M. A faster algorithm for DBSCAN. Master's thesis. 2013 Mar.
- 51. Starczewski A, Goetzen P, Er MJ. A new method for automatic determining of the DBSCAN parameters. Journal of Artificial Intelligence and Soft Computing Research. 2020 Jul 1;10(3):209–21.
- 52. Vieira SM, Kaymak U, Sousa JM. Cohen’s kappa coefficient as a performance measure for feature selection. In International conference on fuzzy systems 2010 Jul 18 (pp. 1–8).
- 53. He H, Garcia EA. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering. 2009 Jun 26;21(9):1263–84.
- 54. Hairani H., Anggrawan A., & Priyanto D. (2023). Improvement performance of the random forest method on unbalanced diabetes data classification using Smote-Tomek Link. JOIV: international journal on informatics visualization, 7(1), 258–264.
- 55. Hairani H., & Priyanto D. (2023). A new approach of hybrid sampling SMOTE and ENN to the accuracy of machine learning methods on unbalanced diabetes disease data. International Journal of Advanced Computer Science and Applications, 14(8).
- 56. Markelle Kelly, Rachel Longjohn, Kolby Nottingham, The UCI Machine Learning Repository.