Abstract
Electronic payment methods are increasingly prevalent worldwide, facilitating both in-person and online transactions. As credit card usage for online payments grows, fraud and payment defaults have also risen, resulting in significant financial losses. Detecting fraudulent transactions is challenging due to the highly imbalanced nature of transaction datasets, where fraudulent activities constitute only a small fraction of the data. To address this, we propose a novel hybrid feature selection framework designed to enhance the performance of machine learning models in credit card fraud detection. Our framework integrates three complementary feature selection techniques: Pearson correlation, information gain (IG), and random forest importance (RFI), each optimized for the dataset’s characteristics. Pearson correlation eliminates redundancy by removing highly correlated features, while IG and RFI evaluate the relevance of the remaining features. A union operation combines the most informative features from these methods, ensuring comprehensive and efficient feature selection. To validate the proposed approach, we test it on five diverse datasets with varying characteristics and imbalance levels, employing five state-of-the-art machine learning algorithms: Random Forest (RF), Extra Trees (ET), XGBoost (XGBC), AdaBoost, and CatBoost. We primarily propose this work for PCA-transformed datasets, but for the validation of our research, we also apply it to a real-world dataset. The results demonstrate that our methodology outperforms existing baseline approaches, achieving superior fraud detection performance across all datasets. Our findings highlight the robustness and adaptability of the proposed framework, offering a practical solution for real-world fraud detection systems.
Additionally, we believe that our proposed framework can serve as a decision support system for the detection of fraudulent transactions in real-time credit cards, with the potential to make a substantial contribution to the business industry.
Citation: Siam AM, Bhowmik P, Uddin MP (2025) Hybrid feature selection framework for enhanced credit card fraud detection using machine learning models. PLoS One 20(7): e0326975. https://doi.org/10.1371/journal.pone.0326975
Editor: Sunil Kumar Sharma, Majmaah University, SAUDI ARABIA
Received: January 6, 2025; Accepted: June 8, 2025; Published: July 16, 2025
Copyright: © 2025 Siam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data relevant to this study are available from https://www.kaggle.com/datasets/almahmudsiam/5-datasetss.
Funding: The author(s) received no specific funding for this work.
Competing interests: We have declared that no competing interests exist.
1 Introduction
In the modern era, technology impacts nearly every aspect of our lives, including education, healthcare, economics, finance, trade, industry, politics, and entertainment. Consumers’ methods to perform transactions have undergone significant changes and growth in recent years. The evolution of modern lifestyles, technological advancements, and the widespread adoption of online applications have all contributed to the rise of electronic commerce (e-commerce) and online credit card transactions for purchases and payments [1, 2]. The trend towards a cashless society is undeniable, with future transactions increasingly shifting away from traditional cash-based methods. Consequently, customers will no longer need to carry cash for in-store purchases, compelling businesses to enhance their infrastructure to accommodate diverse payment processing methods. This shift is expected to intensify in the coming years [3].
However, in addition to these advancements, there has been a notable increase in security issues associated with credit card transactions. Credit card fraud—a form of identity theft where unauthorized individuals exploit a credit card or its account information to complete transactions—has become a major concern for both financial institutions and their customers [3]. Fraudulent transactions occur worldwide and have resulted in substantial financial losses. Alarmingly, in 2020, global losses from such transactions amounted to 28.58 billion USD, with the United States alone reporting losses of 11 billion USD. From 2011 to 2021, global losses surged from 9.84 billion USD to 32.39 billion USD, and they are projected to reach a cumulative total of 408.50 billion USD over the next decade. In Malaysia, fraudulent credit card transactions cost the banking industry 51.3 million RM in 2016, and 12.8% of credit cardholders reportedly struggled to meet minimum balance payments [4].
To mitigate these challenges, financial institutions and credit card users require robust fraud detection systems capable of preventing fraudulent transactions. Automated anomaly detection systems, leveraging machine learning (ML) algorithms, play a critical role in addressing this issue [5–7]. ML, a subfield of artificial intelligence, uses computational methods to identify patterns in historical data and make predictions [8]. ML algorithms are generally categorized as supervised, unsupervised, or reinforcement learning, with supervised algorithms being the most widely applied in credit card fraud detection [9,10]. Detecting fraudulent transactions is particularly challenging due to their evolving nature and the inherent imbalance in datasets, where fraudulent transactions represent a small fraction of total transactions [3].
A variety of ML models have been employed for fraud detection, including Decision Trees (DT), Random Forest (RF), ANN, Naive Bayes (NB), CatBoost (CB), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBC), Logistic Regression (LR) and Support Vector Machines (SVM). In addition, recent advances in deep learning (DL), a subset of ML, have introduced neural network-based models that mimic human cognitive processes. Popular DL algorithms include Convolutional Neural Networks (CNN), Multilayer Perceptrons (MLP), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) [11]. In this paper, we utilize five supervised ML algorithms—RF, Extra Trees (ET), XGBC, AdaBoost, and CatBoost—along with an ensemble technique that combines these models to detect credit card fraud. We develop a hybrid feature selection technique that not only enhances model performance but also effectively addresses data imbalance issues. To validate our methodology, we experiment with five diverse datasets, varying in size (large, medium, and small) and imbalance ratios (0.172%, 14.56%, 30%, 50%, and 55.51%). This ensures the robustness of our proposed feature selection technique. Our contributions are as follows:
- We review the limitations of existing research works on credit card fraud detection, with a focus on ML and DL techniques.
- We propose a manually designed hybrid feature selection technique to retain only the most relevant features.
- By reducing the number of features, we achieve faster training times and minimize computational complexity.
- We develop an empirical framework that could benefit banks, merchants, cardholders, insurance companies, and other stakeholders.
- Our proposed method is tested on five datasets, demonstrating superior performance compared to existing approaches.
The remainder of this paper is organized as follows: Sect 2 discusses related studies and their limitations. Sect 3 presents the datasets used in this study. Sect 4 describes our proposed methodology, and Sect 5 outlines the performance metrics. Sect 6 discusses the simulation results and comparisons with existing methods. Finally, Sect 7 concludes the study and outlines directions for future research.
2 Related work
Recent years have witnessed a surge in research exploring the application of ML and DL techniques to detect fraudulent transactions. Various approaches have been proposed, particularly in credit scoring, to establish effective fraud detection methodologies. This section provides an overview of key studies, focusing on their ML models and approaches to credit card fraud detection.
Ileberi et al. [7] introduced a genetic algorithm (GA) for feature selection, followed by applying five ML models—DT, RF, LR, ANN, and NB—to detect credit card fraud. In [23], the authors described the development and deployment of a fraud detection system integrating manual and automatic classification, comparing multiple ML methods. Makki et al. [24] conducted an extensive experimental study addressing the imbalance classification problem. Xu et al. [25] proposed an ensemble strategy to balance training data for extreme learning machine (ELM) classifiers, employing a weighting method based on generalized fuzzy soft sets (GFSS) theory. In [13], the authors utilized Isolation Forest (IF) and Local Outlier Factor (LOF) algorithms to identify fraudulent transactions. In this study [26], the author introduced a method that combines ensemble learning with a generative adversarial network (GAN), enhanced by Ensemble Synthesized Minority Oversampling Techniques (ESMOTE-GAN).
In [14], researchers demonstrated that combining the Classification and Regression Trees (CART) algorithm with Particle Swarm Optimization (PSO) achieved higher accuracy than CART combined with NB. Baabdullah et al. [1] developed ML and DL algorithms, including RF, CNN, and LSTM, optimized with ADAM, SGD, and MSGD. Their CNN-SGD combination achieved 97% accuracy. Sohony et al. [65] designed an ensemble majority voting technique using three feedforward neural networks and two RF classifiers. Randhawa et al. [69] introduced hybrid methods combining ensemble majority voting with AdaBoost. Table 1 summarizes key studies, highlighting the ML models used, datasets, and their identified limitations.
Most of the papers mentioned above have limitations, as identified through our review and analysis. Some do not address the issue of imbalanced datasets, while others rely solely on a single performance metric for evaluation. In addition, certain studies do not employ feature selection methods, and some proposed models suffer from high computational time and complexity. To overcome these challenges, we propose a novel hybrid feature selection methodology designed to address all the aforementioned issues. In this study, we utilize five datasets, each with distinct characteristics: one is highly imbalanced, one is fully balanced, one is nearly balanced, and the remaining two have moderate imbalance levels. Our manually designed hybrid feature selection methodology effectively reduces computation time and complexity while maintaining robust performance. Unlike previous studies, we do not rely on a single performance metric to evaluate machine learning models. Instead, we use six different performance metrics to provide a comprehensive evaluation, ensuring a more accurate and reliable model selection process.
3 Datasets
We have validated our proposed methodology using multiple datasets. In this paper, we utilized five datasets that are widely referenced in existing studies. Fig 1 visually represents all these datasets. Table 2 lists all the features used across these datasets, and Table 3 summarizes their key characteristics. The details of each dataset are as follows:
European Cardholders 2013. This dataset [27] contains a total of 284,807 samples, of which 492 were identified as fraudulent by European cardholders in September 2013. It exhibits a significant imbalance, with fraudulent transactions accounting for only 0.172% of all transactions. Table 2 lists all features of this dataset. Except for ‘Time’ and ‘Amount’, all features undergo PCA transformation to maintain the privacy of customer information and transactional details. The Class feature indicates the type of transaction, where a value of 0 denotes a legal transaction and 1 represents a fraudulent transaction. In this paper, we refer to this dataset as the European Cardholders 2013 dataset.
European Cardholders 2023. This dataset [28] comprises 568,630 rows and 31 columns. It is fully balanced, eliminating the need for the SMOTE technique. Except for ‘Time’, ‘Amount’, and ‘id’, all features undergo PCA transformation to ensure the privacy of sensitive data. The Class feature is the dependent variable, with 1 indicating fraudulent transactions and 0 denoting legal transactions. The ‘id’ column is removed as it has no impact on the analysis. In this study, we refer to this dataset as the European Cardholders 2023 dataset.
German Dataset. The third dataset used in this study is the German credit card dataset, a real-world dataset [29], which consists of 21 features, 13 of which are categorical and 8 are numerical [13]. We applied one-hot encoding to convert categorical features into numerical ones, resulting in 62 features, including the classification feature. This dataset contains 1,000 credit card transactions, with 700 classified as normal and 300 as fraudulent, representing 30% fraudulent transactions. In our research, we refer to this dataset as the German Dataset. This dataset offers a valuable benchmark as it reflects real-world transaction data without any dimensionality reduction, pre-processing transformations, or modifications, unlike many synthetic datasets or PCA-transformed datasets commonly used in the literature.
Australian Dataset. The fourth dataset [30] consists of 15 features and 690 instances, including 307 legal and 383 fraudulent transactions. To protect confidentiality, all attribute names and values have been replaced with non-meaningful symbols [31]. The dataset is not significantly imbalanced, with fraudulent transactions comprising 55.51% of the data, thus eliminating the need for the SMOTE technique. In this study, we refer to this dataset as the Australian Dataset.
Abstract Dataset. The final dataset [32] contains 3,075 samples and 12 features. Of these, 448 samples are fraud cases, accounting for 14.6% of the total. The dataset includes the Merchant_id feature, which is irrelevant for fraud detection, and Transaction_date, which contains NaN values. These two features were removed, leaving 10 features, including four categorical variables with values ‘Y’ and ‘N’. We replaced ‘Y’ with 1 and ‘N’ with 0 during preprocessing. This dataset, presented in [33], is both abstract and relatively small. In this paper, we refer to it as the Abstract Dataset.
Among the five datasets, the European Cardholders 2013 and 2023 datasets apply PCA to all features (except ‘Time’ and ‘Amount’) to anonymize sensitive transaction details. PCA transforms original variables into uncorrelated components, making it infeasible to reverse-engineer or identify individual user behavior. This obfuscation ensures compliance with privacy standards by structurally removing direct personal identifiers.
4 Proposed methodology
This section outlines the proposed framework of our methodology, which consists of two primary components: data pre-processing and feature engineering. Subsequently, a machine learning system is implemented to classify transactions. Fig 2 illustrates the workings of this approach, while Fig 3 provides an overview of the entire fraud detection system’s flow. Note that no human or animal subjects were involved in this research.
4.1 Data preprocessing
Data preprocessing encompasses various techniques applied to raw data to prepare it for further analysis [34]. Raw data often contains imperfections, incompleteness, and inconsistencies that can hinder accurate analysis. By employing data preprocessing strategies, the quality of the data is enhanced, improving the accuracy and efficiency of subsequent mining processes [35]. As a critical step in data mining, data preprocessing involves preparing and transforming data into a suitable format for analysis. This process aims to reduce data volume, identify relationships within the data, normalize values, remove outliers, and extract relevant features. It includes strategies such as data cleaning, integration, transformation, and reduction [36]. In our work, we applied data preprocessing techniques as necessary. Specifically, we utilized oversampling, normalization, and encoder methods to prepare the datasets for analysis in this study.
Algorithm 1. Data preprocessing with encoding and SMOTE.
Input: Dataset D(O, F), where O := observations, F := features
Output: X_train, X_test, y_train, y_test
1: Load dataset D
2: y ← target variable (Class)
3: X ← features (independent variables)
4: if X has categorical features then ▷ Encode categorical variables
5: Encoder ← OneHotEncoder(handle_unknown='ignore')
6: X ← Encoder.fit_transform(X) ▷ Apply one-hot encoding to features
7: end if
8: (X_train, X_test, y_train, y_test) ← train_test_split(X, y) ▷ Split dataset into training and testing sets
9: if is_imbalanced(D) then ▷ Check if the dataset is imbalanced
10: (X_train, y_train) ← SMOTE(X_train, y_train) ▷ Apply SMOTE to the training set to handle imbalance
11: else
12: keep (X_train, y_train) unchanged ▷ No need for SMOTE if the dataset is balanced
13: end if
14: Return X_train, X_test, y_train, y_test
4.1.1 Oversampling.
In classification problems, the issue of imbalanced datasets arises when the number of instances representing different classes varies significantly [37]. Most learning algorithms are designed to achieve high predictive accuracy and strong generalization capabilities. However, this inductive bias presents a considerable challenge in the context of imbalanced data, as models tend to favor the majority class, leading to poor classification performance for the minority class [38]. Below are some significant limitations associated with imbalanced datasets:
- Biased model performance: Models perform poorly on the minority class while frequently predicting the majority class.
- Poor recall and precision: The minority class often suffers from low recall and precision.
- Overfitting on the majority class: The model may overfit the majority class, further degrading overall performance.
- Difficulty distinguishing noise from minority examples: Classifiers may struggle to differentiate between noise and valid minority class instances, often ignoring the minority class entirely [39].
In the case of credit card fraud detection, datasets are typically imbalanced because fraudulent transactions occur rarely compared to the large number of legitimate transactions. To address this issue, various oversampling techniques can be employed, such as Random Oversampling, SMOTE, Borderline SMOTE, ADASYN, SMOTE-ENN, and SMOTE-NC. In this study, we apply the SMOTE (Synthetic Minority Over-sampling Technique) approach when the proportion of genuine transactions lies between roughly 70% and 85%. For the highly imbalanced dataset (ECC-2013), we apply both SMOTE and the advanced oversampling techniques ADASYN and SMOTE-ENN.
The primary concept of SMOTE is to generate synthetic instances for the minority class by interpolating between multiple adjacent minority class examples. This helps mitigate the issue of overfitting and allows the decision boundaries for the minority class to extend further into the majority class domain [39]. SMOTE achieves this by oversampling the minority class, creating synthetic examples along the line segments that connect each minority class sample with its k-nearest neighbors (KNN) within the same class. Neighbors are randomly selected based on the required level of oversampling. The synthetic samples are generated as follows: (i) Calculate the difference between the feature vector of the current sample and its nearest neighbor. (ii) Multiply this difference by a random number in the range of 0 to 1. (iii) Add the result to the original feature vector. This process generates a new data point at a random location along the line segment connecting the two feature vectors. By doing so, SMOTE effectively expands the decision region for the minority class, reducing bias and improving classification performance [39].
Formally, a synthetic sample is generated as

x_new = x_i + λ · (x̂_i − x_i),

where x̂_i is one of the k-nearest neighbors of x_i, and λ ∈ [0, 1] is a real random number.
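The three interpolation steps can be illustrated with a minimal NumPy sketch (the function name and toy data are hypothetical; a library implementation such as imbalanced-learn's SMOTE would be used in a full pipeline):

```python
import numpy as np

def smote_synthetic_sample(x_i, neighbors, rng=None):
    """Generate one synthetic minority-class sample from x_i and the
    feature vectors of its k-nearest minority neighbors (rows of `neighbors`)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x_hat = neighbors[rng.integers(len(neighbors))]  # pick a random neighbor
    diff = x_hat - x_i                 # (i) difference of the feature vectors
    lam = rng.random()                 # (ii) random number in [0, 1)
    return x_i + lam * diff            # (iii) point on the segment x_i -> x_hat

x_i = np.array([1.0, 1.0])
neighbors = np.array([[2.0, 2.0], [0.0, 1.0]])
x_new = smote_synthetic_sample(x_i, neighbors)
# x_new lies somewhere on the segment between x_i and the chosen neighbor
```

Because λ is drawn from [0, 1), every synthetic point is a convex combination of the original sample and one of its minority-class neighbors, which is exactly why the decision region of the minority class expands.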
SMOTE-ENN is a hybrid resampling technique that performs both oversampling and undersampling of the data. SMOTE first oversamples the minority class, and the Edited Nearest Neighbor (ENN) rule then undersamples the result by removing overlapping instances, yielding a balanced dataset [40]. Building upon SMOTE, ADASYN enhances synthetic sample generation for the minority class by adaptively focusing on regions where the classifier struggles due to class imbalance. By generating more synthetic samples in areas with higher classification complexity, ADASYN dynamically adjusts the data distribution, improving the overall learning process [41].
4.1.2 Encoding.
Machine learning algorithms require numerical inputs; thus, it is essential to convert categorical variables into numerical values using encoding techniques. Various encoding methods are available to perform this conversion, including one-hot encoding, ordinal encoding, sum coding, Helmert coding, polynomial coding, backward difference coding, and binary coding [42]. In this paper, we employed the one-hot encoding technique to encode categorical variables. One-hot encoding is one of the most widely used encoding schemes. It compares the levels of a categorical variable to a predefined reference level. This method transforms a single variable with n observations and d unique values into d binary variables, each containing n observations. Each binary variable indicates the presence (1) or absence (0) of a specific category [42].
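This transformation can be sketched with scikit-learn's OneHotEncoder (the card-network column below is a made-up example; handle_unknown='ignore', as in Algorithm 1, maps unseen categories to all-zero rows):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A toy categorical column with d = 3 unique values over n = 4 observations
X = np.array([["visa"], ["mastercard"], ["amex"], ["visa"]])

encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X).toarray()  # shape (4, 3): one binary column per category

# An unseen category at transform time becomes an all-zero row instead of raising
unseen = encoder.transform(np.array([["discover"]])).toarray()
```

Each row of X_encoded contains exactly one 1, marking the presence of that row's category, as described above.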
4.1.3 Normalization.
Normalization involves transforming data values to a specific range, such as 0 to 1 or −1 to 1. This technique is particularly beneficial for mining tasks like classification, artificial neural networks (ANN), and clustering algorithms. It is especially useful in backpropagation neural networks, where scaling the data properties can significantly accelerate the learning process. Common normalization methods include min-max normalization, z-score normalization, and decimal scaling [36]. In this study, we applied min-max normalization to both training and testing samples for effective feature scaling [43]. This method converts numerical feature values into a new range (0 to 1) based on their minimum and maximum values [44]. Min-max normalization rescales the values of a feature using Equation (1):
X_norm = (X − X_min) / (X_max − X_min)    (1)

where X is the original feature value, X_min and X_max are the minimum and maximum values of that feature in the dataset, and X_norm is the normalized value of the feature.
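Equation (1) amounts to a few lines of NumPy (a sketch; the column values are illustrative):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) of X to [0, 1] via Equation (1):
    X_norm = (X - X_min) / (X_max - X_min)."""
    X = np.asarray(X, dtype=float)
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    return (X - X_min) / (X_max - X_min)

amounts = np.array([[10.0], [55.0], [100.0]])   # e.g. a raw 'Amount' column
normalized = min_max_normalize(amounts)          # -> [[0.0], [0.5], [1.0]]
```

The minimum maps to 0, the maximum to 1, and every other value falls proportionally in between.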
4.2 Feature selection
Feature selection (FS) is an essential phase in the implementation of machine learning techniques, as it enables the extraction of the most relevant attributes for accurate classification [45]. This is mainly due to the dataset employed during the training and testing phases potentially possessing a vast feature space, which may negatively affect the models’ overall performance [7]. The following papers use different FS techniques to develop their ML models. Kasongo [46], Ileberi et al. [7], and Hassanat et al. [47] employ genetic algorithm-based FS techniques to enhance the performance of machine learning models. Mienye et al. [48] implemented a particle swarm optimization (PSO) technique for their heart disease prediction. Will Koehrsen developed a feature selector tool [49], which Varmedja et al. [50] used to detect credit card fraud. Bhowmik et al. [51] implemented a hybrid feature selection technique for phishing website prediction. In [52], the authors propose a GA-KNN feature selection method that identifies optimal feature combinations and enhances overall model performance. For imbalanced datasets, [53] proposes a two-stage approach integrating an enhanced Sparse Autoencoder (SAE) for feature learning and Softmax regression for classification, aiming to improve minority class prediction performance. In [54], the authors proposed an unsupervised feature learning method using a stacked Sparse Autoencoder (SSAE). The SSAE learns robust feature representations that are then used to train classifiers and enhance performance. The authors of [40] proposed a hybrid resampling technique that combines undersampling using the Edited Nearest Neighbor (ENN) rule with oversampling using the SMOTE technique (SMOTE-ENN) to balance the dataset.
In this paper [55], the authors proposed a feature selection technique that combines Information Gain with a cost-sensitive Adaptive Boosting (AdaBoost) classifier for chronic kidney disease prediction.
In this paper, we implement a hybrid feature selection technique. This approach integrates two distinct methodologies: the filter and the embedded methods. The initial phase of this hybrid strategy involves the filter method, which incorporates two techniques utilized in this paper: Pearson correlation and Information Gain. We first employ Pearson correlation on the balanced dataset to determine highly correlated features, as the SMOTE technique generates synthetic instances that enhance the correlations among features. Synthetic instances are likely to reside within the same region of the feature space when generated along the line segment connecting a minority class instance to its nearest neighbor. As a result, the characteristics of these synthetic cases may exhibit strong correlations with each other. The Pearson correlation coefficient ranges from −1 to 1: a value of −1 indicates a negative correlation, a value of 1 indicates a positive correlation, and a value of 0 indicates that no correlation exists between the variables [69]. This paper employs a Pearson correlation threshold of 0.95 to identify highly correlated features. Features exceeding this threshold are removed, leaving only the uncorrelated features (F1).
Secondly, we apply two filtering methods, Information Gain (IG) and the embedded Random Forest Importance (RFI) technique, on the features (F1). Each method calculates an importance score for every feature and has its own threshold value. In this paper, we manually selected different optimized threshold values for each dataset because different datasets have varying characteristics, such as data imbalance ratios, feature distributions and relationships, feature dimensionality, and model complexity. Features from both methods that exceed their respective threshold values are stored in separate sets: F2 for IG-selected features and F3 for RFI-selected features. Finally, the union of F2 and F3 (F) contains all features selected by either method.
This ensures that features selected by at least one of the methods are included in the final feature set (F).
Algorithm 2. Hybrid feature selection.
Input: Balanced dataset (X, y)
Output: Final selected features (F)
Step 1: Pearson Correlation
1: Compute Pearson correlation for all feature pairs in X
2: F1 ← features remaining after removing highly correlated features (threshold 0.95)
Step 2: Information Gain (IG)
3: Compute Information Gain scores for all features in F1
4: Select threshold t_IG
5: F2 ← features in F1 whose IG score exceeds t_IG
Step 3: Random Forest Importance (RFI)
6: Compute Random Forest Importance scores for all features in F1
7: Select threshold t_RFI
8: F3 ← features in F1 whose RFI score exceeds t_RFI
Step 4: Combine Selected Features
9: F ← F2 ∪ F3 ▷ Union of features selected by IG and RFI
10: Return F
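The hybrid selection above can be sketched with pandas and scikit-learn, using mutual_info_classif as the IG score and a random forest's feature_importances_ as the RFI score (the thresholds here are illustrative defaults, not the per-dataset optimized values described earlier):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

def hybrid_feature_selection(X, y, corr_thr=0.95, ig_thr=0.01, rfi_thr=0.01):
    # Step 1: Pearson correlation -- drop one feature from every pair whose
    # absolute correlation exceeds corr_thr
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_thr).any()]
    F1 = X.drop(columns=to_drop)

    # Step 2: Information Gain (mutual information) scores on F1
    ig = pd.Series(mutual_info_classif(F1, y, random_state=0), index=F1.columns)
    F2 = set(ig[ig > ig_thr].index)

    # Step 3: Random Forest Importance scores on F1
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(F1, y)
    rfi = pd.Series(rf.feature_importances_, index=F1.columns)
    F3 = set(rfi[rfi > rfi_thr].index)

    # Step 4: union of the two selections
    return F2 | F3
```

On a frame where column "b" duplicates column "a", step 1 removes "b", and the union in step 4 keeps any feature that either scorer finds informative.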
Tables 4, 5, 6, 7 and 8 report the number of features retained after applying Pearson correlation, IG, and RFI to each dataset: European Cardholders 2013, European Cardholders 2023, the Australian Dataset, the Abstract Dataset, and the German Dataset, respectively. After applying the SMOTE technique, correlated features are removed from the European Cardholders 2013 and Abstract datasets, as illustrated in Tables 4 and 7, respectively. Table 4 shows that V16 was removed from the dataset by the Pearson correlation method because its correlation with V17 exceeded the chosen threshold of 0.95. Table 7, using the same technique, reveals a high correlation between the 6_month_avg_chbk_amt and Daily_chargeback_avg_amt features; in this case, only the Daily_chargeback_avg_amt feature is removed. Table 5 shows that no features were removed from the European Cardholders 2023 dataset using the Pearson correlation method, as none of the feature pairs exceeded the correlation threshold. Similarly, Table 6 indicates that the Australian Dataset retained all features during Pearson correlation analysis. Finally, Table 8 confirms that the German Dataset also did not require the removal of any features, as all correlations remained below the threshold.
For the IG feature selection technique, the numbers of selected features across the five datasets are 17, 16, 8, 8, and 20, respectively. Similarly, for the RFI feature selection, the numbers are 14, 14, 14, 7, and 47, respectively. Finally, the set union operation combines all unique features from both methods, yielding final feature counts of 17, 16, 14, 8, and 49, respectively, for these five datasets.
4.3 Ensemble learning
Ensemble learning is a method that combines multiple ML classifier models to achieve improved performance compared to the individual classifier model. The predictions from individual models are aggregated through a combination rule to produce a more accurate single prediction, rather than depending on a single model [56]. Ensemble models can outperform individual base learners even if some base learners are weak. The performance of the ensemble learning depends mainly on the accuracy and diversity of the base learners [57]. A crucial step in constructing ensemble classifiers is to combine the individual base learners. The ensemble learning method typically determines the combination mechanism employed. The most popular mechanism is the majority vote to combine ensemble base models, as shown in Fig 4.
Algorithm 3. Ensemble classification with majority voting (hard voting).
Input: Training data (X_train, y_train), test data X_test
Output: Predicted class labels y_pred
1: Models ← {RF, ET, XGBC, AdaBoost, CatBoost}
2: for each model M in Models do ▷ Train each model on the training data
3: Train M on (X_train, y_train)
4: end for
5: Predictions ← [ ] ▷ Initialize list to store predictions
6: for each model M in Models do
7: Predictions[M] ← M.predict(X_test) ▷ Generate predictions on the test data
8: end for
9: for each instance i in X_test do
10: votes_i ← {Predictions[M][i] : M ∈ Models}
11: y_pred[i] ← most common class in votes_i ▷ Majority vote
12: end for
13: Return y_pred
In a classification problem, the predictions for each class are aggregated, and the class with the majority vote is identified as the ensemble prediction. In regression tasks, the ensemble output is obtained by averaging the predictions of multiple base learners [58]. Let the decision of the t-th classifier for class c be d_{t,c} ∈ {0, 1}, where t = 1, …, T and c = 1, …, C. Then, using majority voting, the class c* is selected as the ensemble prediction such that

Σ_{t=1}^{T} d_{t,c*} = max_{c = 1, …, C} Σ_{t=1}^{T} d_{t,c},

where T is the number of classifier models and C represents the number of classes.
Diverse classifiers tend to make uncorrelated errors, and by aggregating their predictions, the ensemble can mitigate individual weaknesses and reduce variance. In our case, combining tree-based models such as RF, ET, XGBC, AdaBoost, and CatBoost—each with different internal learning mechanisms—introduced sufficient diversity to enhance generalization. This aligns with ensemble theory, which states that a set of accurate and diverse models will yield a more robust and stable prediction than any single model alone. Therefore, ensemble voting not only combines predictions but also leverages the complementary strengths of heterogeneous classifiers.
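The hard-voting rule reduces to counting votes per instance; a self-contained sketch follows (the prediction arrays are made-up stand-ins for the outputs of the five trained classifiers):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of predicted labels per classifier,
    each of length n_instances; returns the per-instance majority class."""
    n_instances = len(predictions[0])
    ensemble = []
    for i in range(n_instances):
        votes = [model_preds[i] for model_preds in predictions]
        ensemble.append(Counter(votes).most_common(1)[0][0])  # top-voted class
    return ensemble

# Hypothetical predictions from five classifiers on four test instances
preds = [
    [0, 1, 1, 0],  # RF
    [0, 1, 0, 0],  # ET
    [1, 1, 1, 0],  # XGBC
    [0, 0, 1, 0],  # AdaBoost
    [0, 1, 1, 1],  # CatBoost
]
result = majority_vote(preds)  # -> [0, 1, 1, 0]
```

With an odd number of binary classifiers, as here, ties cannot occur, so the ensemble always produces a definite label.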
5 Evaluation metrics
Model evaluation holds significant importance in predictive modeling tasks. In ensemble predictive modeling, the evaluation of relative performance and model diversity is essential. Evaluation metrics are derived from four classifications: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) [59].
The research in this paper is structured as a binary classification task in machine learning. As a result, the accuracy (AC) derived from the test data serves as the primary performance metric. Furthermore, for each model, the recall (RC), precision (PR), and F1-Score (F-Measure) are calculated [60]. To evaluate the classification quality of each model, we additionally plot the ROC curve with its Area Under the Curve (AUC) and the Precision-Recall (PR) curve.
5.1 Accuracy
Accuracy is commonly employed to evaluate the performance of a model utilizing the confusion matrix [59]:

$$AC = \frac{TP + TN}{TP + TN + FP + FN}$$
5.2 Precision
Precision is defined as the ratio of true positives to the sum of true positives and false positives. In this context, precision quantifies the proportion of instances identified as positive by the classifier that are truly positive [59]:

$$PR = \frac{TP}{TP + FP}$$
5.3 Recall
Recall is defined as the ratio of true positives to the sum of true positives and false negatives [59]:

$$RC = \frac{TP}{TP + FN}$$
5.4 F1 Score
The F1-Score is a metric that incorporates both Precision and Recall to validate accuracy. It is the harmonic mean of Precision and Recall [60]. The relationship between precision and recall involves a trade-off: higher precision is typically associated with lower recall [59]. The harmonic mean of Recall and Precision is defined as follows:

$$F1 = \frac{2 \times PR \times RC}{PR + RC}$$
5.5 Areas under the receiver operating characteristic (ROC) curve
The ROC curve plots the False Positive Rate (FPR) on the X-axis and the True Positive Rate (TPR) on the Y-axis. FPR measures the fraction of negative examples that are misclassified as positive, while TPR measures the fraction of positive examples that are correctly labeled [61]. The Area Under the Curve (AUC) summarizes the ROC curve in a single value ranging from 0 to 1. An AUC of 0 indicates that the model’s predictions are entirely incorrect, whereas an AUC of 1 signifies that all predictions made by the model are accurate [62]. AUC measures the overall performance of a test; the higher the AUC score, the better the overall performance [63].
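A useful property of the AUC is that it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (the Mann-Whitney formulation). A minimal sketch with hypothetical scores, for illustration only:

```python
def auc_score(scores_pos, scores_neg):
    """AUC via the Mann-Whitney formulation: the fraction of
    positive/negative score pairs ranked correctly, ties counting half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for 3 fraudulent and 4 legitimate transactions
print(auc_score([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]))  # 11/12 ≈ 0.9167
```

A random scorer yields an AUC near 0.5; a perfect scorer that ranks every positive above every negative yields 1.0.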
5.6 Precision recall curve
The Precision-Recall (PR) curve plots Recall on the X-axis and Precision on the Y-axis. Recall is equivalent to the True Positive Rate (TPR), while Precision quantifies the proportion of instances identified as positive that are truly positive [61].
5.7 Matthews Correlation Coefficient (MCC)
The MCC provides, in a single indicator, a comprehensive description of true and false positives and negatives. It evaluates the quality of a two-class problem by considering all four confusion-matrix outcomes and remains a balanced measure even when the class sizes differ [64]. The equation of MCC is as follows:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
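All of the above metrics follow directly from the four confusion-matrix counts. The following sketch, using made-up counts rather than results from this study, illustrates the calculations on a hypothetical imbalanced test set:

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # MCC uses all four counts, so it stays balanced when one class dominates
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return accuracy, precision, recall, f1, mcc

# Hypothetical counts: 90 frauds caught, 10 missed, 5 false alarms
# among 895 legitimate transactions (1000 instances total)
acc, pr, rc, f1, mcc = confusion_metrics(tp=90, tn=895, fp=5, fn=10)
print(f"AC={acc:.4f} PR={pr:.4f} RC={rc:.4f} F1={f1:.4f} MCC={mcc:.4f}")
```

On imbalanced data a high accuracy can mask missed frauds, which is why F1 and MCC are reported alongside it.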
6 Result and discussion
We apply five ML classifiers, RF, XGBC, ET, AdaBoost, and CatBoost, together with the ensemble learning method described in Sect 4.2, to five different datasets to validate our proposed methodology. The results are summarized in Tables 9, 10, 11, 12 and 13 and Figs 5–18. Model performance is evaluated using accuracy, precision, recall, F1 score, AUROC, PR score, and MCC; the best score for each model is in bold. As described in Sect 4.1.1, we applied SMOTE, SMOTE-ENN, and ADASYN techniques to address class imbalance in the dataset. Table 9 and Figs 5–10 show the performance of the various models using these oversampling techniques. Among the five classifier models tested across these techniques, the Extra Trees (ET) classifier consistently outperformed the others. It achieved an accuracy of 99.97% with SMOTE and ADASYN, and 99.96% with SMOTE-ENN. The F1 scores for SMOTE and ADASYN with the ET model were also identical. Notably, the ADASYN-ET combination achieved the highest AUC (98.72%) and MCC (89.99%) scores. Therefore, for this dataset, our proposed model is ADASYN-ET.
For the European Cardholders 2023 dataset, ET also delivers superior performance, with accuracy, F1 score, AUC, PR score, and MCC of 99.99%, 99.99%, 100%, 100%, and 99.98%, respectively, as shown in Table 10 and Figs 11 and 12. On this dataset, all models perform well except AdaBoost, whose accuracy of 97.43% is lower than that of the other models. For the Australian dataset, AdaBoost provides superior performance with an accuracy of 89.86%, an F1 score of 86.27%, and an MCC of 78.22%. On this dataset, ET provides the highest AUC and PR scores of 91.55% and 86.52%, respectively. For the German dataset, the performance scores presented in Table 12 show that ensemble majority voting achieved the highest accuracy, MCC, and F1 score, reaching 84%, 50.59%, and 89.04%, respectively. For the Abstract dataset, the performance scores are presented in Table 13 and Figs 17 and 18. These results indicate that both AdaBoost and ensemble majority voting achieved the highest accuracy of 98.05%, while the F1 score was 0.05% higher for ensemble learning. Figs 17 and 18 also show that the Extra Trees (ET) model achieved the highest AUC and PR scores, at 99.74% and 98.95%, respectively. This discussion shows that our proposed hybrid feature selection technique performs effectively across all classifier models.
6.1 Statistical analysis
To assess the statistical significance of the performance metrics, we applied a one-way Analysis of Variance (ANOVA). This test determines whether there are statistically significant differences between the means of multiple groups. Null hypothesis significance testing was used to support the interpretation of results and to ensure that claims of improved performance are backed by statistical evidence. A p-value less than the predefined significance level (0.05) indicates a statistically significant result. In such cases, the null hypothesis is rejected in favor of the alternative hypothesis, suggesting that at least one group mean differs from the others [59]. Conversely, when the p-value is greater than 0.05, the null hypothesis cannot be rejected, implying that the observed differences between group means may not be statistically significant and could be due to random variation. In this study, one-way ANOVA at a 95% confidence level was applied to evaluate the performance differences in F1-score and AUC. The results for both metrics are presented in Table 14, which includes the degrees of freedom (Df), sum of squares (Sum Sq), mean sum of squares (Mean Sq), F-statistic (F value), and the p-value (Pr(>F)).
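The one-way ANOVA test statistic can be reproduced from group means and sums of squares. The sketch below (with made-up fold-level F1 scores, not the measurements from this study) computes the F value and degrees of freedom; the p-value would then be obtained from the F distribution, e.g. via `scipy.stats.f.sf(F, df_between, df_within)`:

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA.

    groups: list of lists of metric values (e.g. per-classifier F1 scores
    across cross-validation folds).
    """
    k = len(groups)                  # number of groups (classifiers)
    n = sum(len(g) for g in groups)  # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread of values around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical F1 scores of three classifiers over five folds
f_val, df_b, df_w = one_way_anova_f([
    [0.91, 0.92, 0.90, 0.93, 0.92],
    [0.88, 0.87, 0.89, 0.88, 0.86],
    [0.92, 0.93, 0.91, 0.94, 0.92],
])
print(df_b, df_w)  # 2 12
```

A large F value relative to the F(df_between, df_within) distribution corresponds to a small p-value, i.e. at least one classifier's mean differs from the others.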
In the one-way ANOVA results (Table 14), the F1-score and AUC metrics were evaluated for the various datasets. For most datasets, such as ECC-2013 (SMOTE), ECC-2013 (ADASYN), and ECC-2023 (SMOTE), the p-values are far below 0.05 (ranging from 1.09e-06 to 9.86e-44), so the null hypothesis is rejected. This indicates that the classifier performance metrics, specifically the F1-score and AUC, differ significantly across the classification methods. For instance, on the ECC-2013 (SMOTE) dataset, the F1-score and AUC yielded p-values of 1.09e-06 and 7.65e-06, respectively, demonstrating strong statistical significance.
Conversely, on the Australian dataset, the p-values for F1-score and AUC are 0.5324 and 0.7125, respectively, both greater than 0.05, indicating no statistically significant difference between the classifiers. This suggests that the observed differences in classifier performance on the Australian dataset may be due to random variation rather than any inherent superiority of one classifier over another. Similarly, for the German dataset, the p-values for F1-score and AUC (ranging from 0.0754 to 0.9893) are higher than those observed for the other datasets, which further supports the claim that performance differences are dataset-dependent. Overall, the results suggest that the SMOTE-based methods and ADASYN consistently lead to statistically significant performance differences in F1-score and AUC across multiple datasets, while the differences on some datasets, such as the Australian dataset, were not significant.
In Table 15, we compare existing models with our proposed method on all five datasets in terms of accuracy, F1 score, and AUC. The F1 score gives equal weight to precision and recall by taking their harmonic mean. Since accuracy alone is not an optimal indicator of a model’s correctness, we also report the F1 score and AUC.
For the European Cardholders 2013 dataset, our proposed method using (ADASYN) ET achieved an outstanding accuracy of 99.97%, an F1 score of 89.95%, and an AUC of 98.72%, marking the highest performance among the compared methods. This result highlights our model’s effectiveness in managing unbalanced class distributions. In contrast, previous techniques, including various ensemble and machine learning methods, failed to reach this level of accuracy and precision, indicating that our approach is more suited for older, complex data distributions.
Moving to the European Cardholders 2023 Dataset, we maintained the use of Extra Trees (ET) and achieved an impressive accuracy of 99.99%, further enhancing the F1 Score and AUC. This performance demonstrates the scalability and adaptability of our approach to evolving fraud detection patterns, as other models tested on this dataset did not match our accuracy, F1 scores, and AUC.
In the Australian Dataset, our implementation of AdaBoost resulted in 89.86% accuracy, an F1 Score of 86.27% and 90.17% AUC, outperforming alternative models such as ELM, XGBoost-TPE, and APSO-XGBoost. This performance underscores the efficacy of our method across diverse environments and dataset structures, while competing methods struggled to achieve high F1 scores, indicating a lack of precision necessary for effective fraud detection in this context.
For the German dataset, we employed an ensemble majority voting method, achieving 84% accuracy, 89.04% F1 score, and 78.30% AUC. This significantly outperformed models like APSO-XGBoost, LOF, and Gradient Boosted Decision Trees, underscoring the power of ensemble methods in yielding reliable results on complex, unbalanced datasets. Competing methods faced challenges in handling the intricacies and overlapping classes of the German dataset, further reinforcing the strength of our ensemble approach.
Lastly, in the Abstract Dataset, our ensemble voting method attained an accuracy of 98.05%, a 98.44% AUC, and an F1 Score of 94.44%, showcasing our method’s effectiveness even in generalized datasets with unique fraud detection challenges. Other competing methods, including CNN, Multilayer Perceptron (MLP), and RUSBoost did not achieve comparable results, suggesting that our ensemble approach provides a more reliable framework for fraud detection. This reinforces the effectiveness of our framework as a more reliable and balanced solution for fraud detection.
As such, across all five datasets, our proposed method consistently achieved the highest or near-highest accuracy, F1 scores, and AUC, proving its effectiveness in managing unbalanced class distributions and overlapping samples. The results indicate that our model not only adapts to various types of credit card fraud datasets but also provides a robust solution for diverse fraud detection scenarios.
6.2 Interpreting fraud detection models with SHAP
In the domain of credit card fraud detection, the interpretability of machine learning models is indispensable, fostering trust, transparency, and accountability among stakeholders. To elucidate the decision-making process of our predictive models, we employed SHAP (SHapley Additive exPlanations), a cutting-edge framework grounded in game theory [9, 43, 85]. SHAP calculates Shapley values to provide a consistent and theoretically sound measure of feature importance, quantifying the contribution of each feature to model predictions. This approach enables a deep understanding of how individual features influence the classification of transactions as fraudulent or legitimate. Through SHAP’s visualizations, such as summary plots, the distribution and magnitude of feature importance are revealed, highlighting the features that most significantly impact fraud detection. This interpretability equips financial institutions with actionable information, allowing them to understand the rationale behind the predictions of the model, improve decision-making processes, and reinforce the reliability of fraud detection systems.
Fig 19 showcases SHAP-based feature importance analysis across five diverse datasets: European Cardholders 2013, European Cardholders 2023, the German dataset, the Australian dataset, and the Abstract dataset. Prominent features, including transaction type, age, transaction amount, and time, emerge as the most influential variables, consistently demonstrating high SHAP values. These findings affirm the critical role of these features in guiding the model’s predictions, underscoring SHAP’s efficacy in bridging the gap between model complexity and interpretability. By leveraging SHAP, we provide a transparent and comprehensive understanding of model behavior, facilitating the refinement of feature selection processes to prioritize the most impactful variables. This interpretative framework not only advances the reliability of fraud detection systems but also sets a robust foundation for developing more transparent and trustworthy AI-driven solutions.
(a) Abstract(XGBC), (b) Australia(AdaBoost), (c) German(RF), (d) ECC-2013(RF), and (e) ECC-2023(RF).
7 Conclusion and future work
Detecting fraudulent credit card transactions continues to pose a significant challenge due to the severe class imbalance inherent in real-world datasets. This study introduced a novel hybrid feature selection framework combining Pearson Correlation, Information Gain (IG), and Random Forest Importance (RFI) to achieve optimal feature selection. The framework integrated state-of-the-art machine learning models, including Random Forest (RF), XGBoost (XGBC), Extra Trees (ET), CatBoost, AdaBoost, and an ensemble voting mechanism, resulting in enhanced detection accuracy. The proposed method outperformed existing baseline classifiers and approaches in recent literature, demonstrating superior performance across five diverse datasets. This robustness and adaptability underscore its potential for practical application in real-world credit card fraud detection systems.
Future research could expand this work by exploring advanced deep learning models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based architectures, to further improve detection capabilities. Techniques such as autoencoders for anomaly detection and Generative Adversarial Networks (GANs) for addressing data imbalance could also enhance the framework. Additionally, incorporating Federated Learning could enable privacy-preserving collaborative training across institutions, fostering globally robust fraud detection systems. Finally, leveraging real-time data streams and large-scale, contemporary datasets will allow the models to remain scalable and adaptable to the ever-evolving nature of financial fraud.
References
- 1. Baabdullah T, Alzahrani A, Rawat DB, Liu C. Efficiency of federated learning and blockchain in preserving privacy and enhancing the performance of Credit Card Fraud Detection (CCFD) Systems. Future Internet. 2024;16(6):196.
- 2. Lebichot B, Paldino GM, Siblini W, He-Guelton L, Oblé F, Bontempi G. Incremental learning strategies for credit cards fraud detection. Int J Data Sci Anal. 2021;12(2):165–74.
- 3. Alarfaj FK, Malik I, Khan HU, Almusallam N, Ramzan M, Ahmed M. Credit card fraud detection using state-of-the-art machine learning and deep learning algorithms. IEEE Access. 2022;10:39700–15.
- 4. Islam MA, Uddin MA, Aryal S, Stea G. An ensemble learning approach for anomaly detection in credit card data with imbalanced and overlapped classes. J Inf Secur Appl. 2023;78:103618.
- 5. Dang TK, Tran TC, Tuan LM, Tiep MV. Machine learning based on resampling approaches and deep reinforcement learning for credit card fraud detection systems. Appl Sci. 2021;11(21):10004.
- 6. Marazqah Btoush EAL, Zhou X, Gururajan R, Chan KC, Genrich R, Sankaran P. A systematic review of literature on credit card cyber fraud detection using machine and deep learning. PeerJ Comput Sci. 2023;9:e1278. pmid:37346569
- 7. Ileberi E, Sun Y, Wang Z. A machine learning based credit card fraud detection using the GA algorithm for feature selection. J Big Data. 2022;9(1).
- 8. Abakarim Y, Lahby M, Attioui A. An efficient real time model for credit card fraud detection based on deep learning. In: Proceedings of the 12th International Conference on Intelligent Systems: Theories and Applications. 2018. p. 1–7. https://doi.org/10.1145/3289402.3289530
- 9. Mienye ID, Jere N. Deep learning for credit card fraud detection: a review of algorithms, challenges, and solutions. IEEE Access. 2024.
- 10. Thennakoon A, Bhagyani C, Premadasa S, Mihiranga S, Kuruwitaarachchi N. Real-time credit card fraud detection using machine learning. In: 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence). 2019. p. 488–93.
- 11. Wang P, Fan E, Wang P. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognit Lett. 2021;141:61–7.
- 12. Malik EF, Khaw KW, Belaton B, Wong WP, Chew X. Credit card fraud detection using a new hybrid machine learning architecture. Mathematics. 2022;10(9):1480.
- 13. Singh P, Singla K, Piyush P, Chugh B. Anomaly detection classifiers for detecting credit card fraudulent transactions. In: 2024 Fourth International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT). 2024. p. 1–6. https://doi.org/10.1109/icaect60202.2024.10469194
- 14. Afridah R, Ula M, Rosnita L. Performance analysis algorithm classification and regression trees and naive bayes based particle swarm optimization for credit card transaction fraud detection. Int J Eng Sci Inform Technol. 2024;4(3):47–54.
- 15. RB A, KR SK. Credit card fraud detection using artificial neural network. Glob Trans Proc. 2021;2(1):35–41.
- 16. Itoo F, Meenakshi, Singh S. Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection. Int J Inf Tecnol. 2020;13(4):1503–11.
- 17. Olowookere TA, Adewale OS. A framework for detecting credit card fraud with cost-sensitive meta-learning ensemble approach. Sci Afr. 2020;8:e00464.
- 18. Xia Y, Liu C, Li Y, Liu N. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Exp Syst Appl. 2017;78:225–41.
- 19. Khatri S, Arora A, Agrawal AP. Supervised machine learning algorithms for credit card fraud detection: a comparison. In: 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence). 2020. p. 680–3. https://doi.org/10.1109/confluence47617.2020.9057851
- 20. Li C, Ding N, Dong H, Zhai Y. Application of credit card fraud detection based on CS-SVM. IJMLC. 2021;11(1):34–9.
- 21. Ogwueleka FN. Data mining application in credit card fraud detection system. J Eng Sci Technol. 2011;6(3):311–22.
- 22. Charleonnan A. Credit card fraud detection using RUS and MRN algorithms. In: 2016 Management and Innovation Technology International Conference (MITicon). 2016. MIT-73.
- 23. Carneiro N, Figueira G, Costa M. A data mining based system for credit-card fraud detection in e-tail. Decis Supp Syst. 2017;95:91–101.
- 24. Makki S, Assaghir Z, Taher Y, Haque R, Hacid M-S, Zeineddine H. An experimental study with imbalanced classification approaches for credit card fraud detection. IEEE Access. 2019;7:93010–22.
- 25. Xu D, Zhang X, Hu J, Chen J. A novel ensemble credit scoring model based on extreme learning machine and generalized fuzzy soft sets. Math Prob Eng. 2020;2020(1):7504764.
- 26. Ghaleb FA, Saeed F, Al-Sarem M, Qasem SN, Al-Hadhrami T. Ensemble synthesized minority oversampling-based generative adversarial networks and random forest algorithm for credit card fraud detection. IEEE Access. 2023;11:89694–710.
- 27. Kaggle. Credit card fraud detection dataset. 2023. https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- 28. Elgiriyewithana N. Credit Card Fraud Detection Dataset 2023. 2023. https://www.kaggle.com/dsv/6492730
- 29. Hofmann H. Statlog (German Credit Data). UCI Machine Learning Repository. 1994.
- 30. Quinlan R. Statlog (Australian Credit Approval). UCI Machine Learning Repository. 1987. https://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval)
- 31. Carta S, Ferreira A, Reforgiato Recupero D, Saia R. Credit scoring by leveraging an ensemble stochastic criterion in a transformed feature space. Prog Artif Intell. 2021;10(4):417–32.
- 32. Joshi S. Abstract data set for credit card fraud detection. https://www.kaggle.com/datasets/shubhamjoshi2130of/abstract-data-set-for-credit-card-fraud-detection
- 33. Arora V, Leekha RS, Lee K, Kataria A. Facilitating user authorization from imbalanced data logs of credit cards using artificial intelligence. Mobile Inf Syst. 2020;2020:1–13.
- 34. Iliou T, Anagnostopoulos C-N, Nerantzaki M, Anastassopoulos G. A novel machine learning data preprocessing method for enhancing classification algorithms performance. In: Proceedings of the 16th International Conference on Engineering Applications of Neural Networks (INNS). 2015. p. 1–5. https://doi.org/10.1145/2797143.2797155
- 35. Witten IH, Frank E, Hall MA, Pal CJ. Data mining: practical machine learning tools and techniques. Amsterdam, The Netherlands: Elsevier; 2005. p. 403–13.
- 36. Alasadi SA, Bhaya WS. Review of data preprocessing techniques in data mining. J Eng Appl Sci. 2017;12(16):4102–7.
- 37. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. JAIR. 2002;16:321–57.
- 38. Sun Y, Wong AKC, Kamel MS. Classification of imbalanced data: a review. Int J Patt Recogn Artif Intell. 2009;23(04):687–719.
- 39. Luengo J, Fernández A, García S, Herrera F. Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput. 2010;15(10):1909–36.
- 40. Esenogho E, Mienye ID, Swart TG, Aruleba K, Obaido G. A neural network ensemble with feature engineering for improved credit card fraud detection. IEEE Access. 2022;10:16400–7.
- 41. Imani M, Beikmohammadi A, Arabnia HR. Comprehensive analysis of random forest and XGBoost performance with SMOTE, ADASYN, and GNUS under varying imbalance levels. Technologies. 2025;13(3):88.
- 42. Potdar K, S. T, D. C. A comparative study of categorical variable encoding techniques for neural network classifiers. IJCA. 2017;175(4):7–9.
- 43. Obaido G, Ogbuokiri B, Swart TG, Ayawei N, Kasongo SM, Aruleba K, et al. An interpretable machine learning approach for hepatitis B diagnosis. Appl Sci. 2022;12(21):11127.
- 44. Sami O, Elsheikh Y, Almasalha F. The role of data pre-processing techniques in improving machine learning accuracy for predicting coronary heart disease. IJACSA. 2021;12(6).
- 45. Sifat IK, Kibria MK. Optimizing hypertension prediction using ensemble learning approaches. PLoS One. 2024;19(12):e0315865. pmid:39715219
- 46. Kasongo SM. An advanced intrusion detection system for IIoT based on ga and tree based algorithms. IEEE Access. 2021;9:113199–212.
- 47. Hassanat A, Almohammadi K, Alkafaween E, Abunawas E, Hammouri A, Prasath VBS. Choosing mutation and crossover ratios for genetic algorithms—a review with a new dynamic approach. Information. 2019;10(12):390.
- 48. Mienye ID, Sun Y. Improved heart disease prediction using particle swarm optimization based stacked sparse autoencoder. Electronics. 2021;10(19):2347.
- 49. Koehrsen W. Feature selector. 2019. https://github.com/WillKoehrsen/feature-selector
- 50. Varmedja D, Karanovic M, Sladojevic S, Arsenovic M, Anderla A. Credit card fraud detection-machine learning methods. In: 2019 18th International Symposium INFOTEH-JAHORINA (INFOTEH). 2019. p. 1–5.
- 51. Bhowmik P, Sohrawordi M, Ali UAME, Bhowmik PC. An empirical feature selection approach for phishing websites prediction with machine learning. In: Islam AKMM, Uddin J, Mansoor N, Rahman S, Al Masud SMR, editors. Bangabandhu and Digital Bangladesh. Cham: Springer; 2022. p. 173–88.
- 52. Feng G. Feature selection algorithm based on optimized genetic algorithm and the application in high-dimensional data processing. PLoS One. 2024;19(5):e0303088. pmid:38723061
- 53. Ebiaredoh-Mienye SA, Esenogho E, Swart TG. Integrating enhanced sparse autoencoder-based artificial neural network technique and softmax regression for medical diagnosis. Electronics. 2020;9(11):1963.
- 54. Ebiaredoh-Mienye SA, Esenogho E, Swart TG. Artificial neural network technique for improving prediction of credit card default: a stacked sparse autoencoder approach. IJECE. 2021;11(5):4392.
- 55. Ebiaredoh-Mienye SA, Swart TG, Esenogho E, Mienye ID. A machine learning method with filter-based feature selection for improved prediction of chronic kidney disease. Bioengineering (Basel). 2022;9(8):350. pmid:36004875
- 56. Mienye ID, Sun Y. A survey of ensemble learning: concepts, algorithms, applications, and prospects. IEEE Access. 2022;10:99129–49.
- 57. Li N, Yu Y, Zhou ZH. Diversity regularized ensemble pruning. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, 2012. Proceedings, Part I. Springer; 2012. p. 330–45.
- 58. Polikar R. Ensemble learning. In: Zhang C, Ma Y, editors. Ensemble machine learning. New York, NY: Springer; 2012. p. 1–34. https://doi.org/10.1007/978-1-4419-9326-7_1
- 59. Alam TM, Shaukat K, Hameed IA, Luo S, Sarwar MU, Shabbir S, et al. An investigation of credit card default prediction in the imbalanced datasets. IEEE Access. 2020;8:201173–98.
- 60. Kasongo SM, Sun Y. A deep long short-term memory based classifier for wireless intrusion detection system. ICT Express. 2020;6(2):98–103.
- 61. Davis J, Goadrich M. The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning - ICML ’06. 2006. p. 233–40. https://doi.org/10.1145/1143844.1143874
- 62. Mienye ID, Sun Y. A machine learning method with hybrid feature selection for improved credit card fraud detection. Appl Sci. 2023;13(12):7254.
- 63. Park SH, Goo JM, Jo C-H. Receiver operating characteristic (ROC) curve: practical review for radiologists. Korean J Radiol. 2004;5(1):11–8. pmid:15064554
- 64. Randhawa K, Loo CK, Seera M, Lim CP, Nandi AK. Credit card fraud detection using adaboost and majority voting. IEEE Access. 2018;6:14277–84.
- 65. Sohony I, Pratap R, Nambiar U. Ensemble learning for credit card fraud detection. In: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data. 2018. p. 289–94. https://doi.org/10.1145/3152494.3156815
- 66. Yılmaz AA. A machine learning-based framework using the particle swarm optimization algorithm for credit card fraud detection. Commun Faculty Sci Univ Ankara Ser A2-A3 Phys Sci Eng. 2024;66(1):82–94.
- 67. Sailusha R, Gnaneswar V, Ramesh R, Rao GR. Credit card fraud detection using machine learning. In: 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS). 2020. p. 1264–70.
- 68. Kalid SN, Ng K-H, Tong G-K, Khor K-C. A multiple classifiers system for anomaly detection in credit card data with unbalanced and overlapped classes. IEEE Access. 2020;8:28210–21.
- 69. Randhawa K, Loo CK, Seera M, Lim CP, Nandi AK. Credit card fraud detection using AdaBoost and majority voting. IEEE Access. 2018;6:14277–84.
- 70. Xenopoulos P. Introducing DeepBalance: random deep belief network ensembles to address class imbalance. In: 2017 IEEE International Conference on Big Data (Big Data). 2017. p. 3684–9.
- 71. Lee J, Jung D, Moon J, Rho S. Advanced R-GAN: generating anomaly data for improved detection in imbalanced datasets using regularized generative adversarial networks. Alexandria Eng J. 2025;111:491–510.
- 72. Wijaya S, Wesly W, Ginting K, Dharma A. Analysis of credit card fraud detection performance using random forest classifier & neural networks model. ETJ. 2024;9(02).
- 73. Theodorakopoulos L, Theodoropoulou A, Zakka F, Halkiopoulos C. Credit card fraud detection with machine learning and big data analytics: a PySpark framework implementation. 2024.
- 74. Torvekar N, Game PS. Predictive analysis of credit score for credit card defaulters. Int J Recent Technol Eng. 2019;7(1):4.
- 75. Wang X, Hu M, Zhao Y, Djehiche B. Credit scoring based on the set-valued identification method. J Syst Sci Complex. 2020;33(5):1297–309.
- 76. Jiao W, Hao X, Qin C. The image classification method with CNN-XGBoost model based on adaptive particle swarm optimization. Information. 2021;12(4):156.
- 77. Dong L, Ye X, Yang G. Two-stage rule extraction method based on tree ensemble model for interpretable loan evaluation. Inf Sci. 2021;573:46–64.
- 78. Hsu F-J, Chen M-Y, Chen Y-C. The human-like intelligence with bio-inspired computing approach for credit ratings prediction. Neurocomputing. 2018;279:11–8.
- 79. Arora N, Kaur PD. A Bolasso based consistent feature selection enabled random forest classification algorithm: an application to credit risk assessment. Appl Soft Comput. 2020;86:105936.
- 80. Zhang W, Yang D, Zhang S, Ablanedo-Rosas JH, Wu X, Lou Y. A novel multi-stage ensemble model with enhanced outlier adaptation for credit scoring. Exp Syst Appl. 2021;165:113872.
- 81. Li G, Ma H-D, Liu R-Y, Shen M-D, Zhang K-X. A two-stage hybrid default discriminant model based on deep forest. Entropy (Basel). 2021;23(5):582. pmid:34066807
- 82. Jadhav S, He H, Jenkins K. Information gain directed genetic algorithm wrapper feature selection for credit rating. Appl Soft Comput. 2018;69:541–53.
- 83. Geetha N, Dheepa G. Transaction fraud detection using artificial bee colony (ABC) based feature selection and enhanced neural network (ENN) classifier. Int J Mech Eng. 2022;7(3).
- 84. Karthika J, Senthilselvi A. Smart credit card fraud detection system based on dilated convolutional neural network with sampling technique. Multimed Tools Appl. 2023;82(20):31691–708.
- 85. Shafiabady N, Hadjinicolaou N, Hettikankanamage N, MohammadiSavadkoohi E, Wu RMX, Vakilian J. eXplainable Artificial Intelligence (XAI) for improving organisational regility. PLoS One. 2024;19(4):e0301429. pmid:38656983