
A hybrid unsupervised machine learning model with spectral clustering and semi-supervised support vector machine for credit risk assessment

  • Tao Yu,

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft

    Affiliations School of Mathematics, Harbin Institute of Technology, Harbin, China, Department of Mathematics, Southern University of Science and Technology, Shenzhen, China

  • Wei Huang,

    Roles Supervision

    Affiliations College of Business, Southern University of Science and Technology, Shenzhen, China, National Center for Applied Mathematics Shenzhen, Southern University of Science and Technology, Shenzhen, China

  • Xin Tang ,

    Roles Software, Writing – review & editing

    tangx6@sustech.edu.cn

    Affiliations College of Business, Southern University of Science and Technology, Shenzhen, China, National Center for Applied Mathematics Shenzhen, Southern University of Science and Technology, Shenzhen, China

  • Duosi Zheng

    Roles Data curation, Resources

    Affiliations College of Business, Southern University of Science and Technology, Shenzhen, China, National Center for Applied Mathematics Shenzhen, Southern University of Science and Technology, Shenzhen, China

Abstract

In credit risk assessment, unsupervised classification techniques can be introduced to reduce human resource expenses and expedite decision-making. Despite the efficacy of unsupervised learning methods in handling unlabeled datasets, their performance remains limited owing to challenges such as imbalanced data, local optima, and parameter adjustment complexities. Thus, this paper introduces a novel hybrid unsupervised classification method, named the two-stage hybrid system with spectral clustering and semi-supervised support vector machine (TSC-SVM), which effectively addresses the unsupervised imbalance problem in credit risk assessment by targeting global optimal solutions. Furthermore, a multi-view combined unsupervised method is designed to thoroughly mine data and enhance the robustness of label predictions. This method mitigates discrepancies in prediction outcomes from three distinct perspectives. The effectiveness, efficiency, and robustness of the proposed TSC-SVM model are demonstrated through various real-world applications. The proposed algorithm is anticipated to expand the customer base for financial institutions while reducing economic losses.

1 Introduction

In the era of big data, developments in artificial intelligence such as machine learning offer several benefits to many different sectors [1]. Machine learning research has increasingly focused on credit classification as a primary method for assessing credit risk. To avoid potential default risk and pursue considerable profit, banks and lending institutions are encouraged to accurately screen out borrowers with strong repayment capabilities when approving loans. Financial institutions use credit classification methods to discern between “good” and “bad” loan applicants [2]. This differentiation is vital, as good customers are more likely to repay their loans promptly, benefiting financial institutions [3]. Conversely, bad customers pose a risk of financial loss. Therefore, the development of efficient and accurate credit classification models is essential, as even marginal enhancements can significantly impact financial outcomes [4]. Recent advancements in machine learning and data mining have streamlined financial decision-making, reducing credit analysis costs and expediting loan approvals [5]. Various machine learning methodologies, such as support vector machines (SVM), decision trees (DT), and K-nearest neighbors (KNN), have proven effective in credit classification [6–9].

The traditional classification problem pertains to supervised binary classification. However, the process of labeling data points is time-consuming and labor-intensive, especially given the large amount of unlabeled data in real-life settings [10, 11]. Financial institutions and banks can significantly enhance their business scope and improve prediction accuracy by effectively leveraging these unlabeled credit data. An unsupervised algorithm can analyze unlabeled samples when a new institution receives a credit application from a prospective borrower, thus reducing the financial institution’s time and data acquisition costs. Unsupervised classification algorithms, also known as clustering methods, are designed to solve classification tasks in unlabeled datasets [10, 12]. Unsupervised learning has been investigated in credit card fraud detection and money laundering detection [13]. On the other hand, unsupervised classification algorithms can help improve supervised learning accuracy [13–15]. The primary objective of unsupervised classification is to uncover patterns and relationships within the data. Various advanced unsupervised classification algorithms have been established, including K-means, self-organizing maps, spectral clustering, and SVM-based unsupervised methods [16–19]. Despite their efficiency, the accuracy of these unsupervised classification algorithms is limited when applied to imbalanced datasets, where the number of samples in different classes differs considerably in the context of binary classification tasks [20].

Notably, dataset imbalance may cause unsupervised classification algorithms, such as K-means and KNN, to converge to local optima and undermine their robustness [21]. In credit risk assessment, the number of non-defaulting borrowers typically exceeds that of defaulting borrowers [5]. Consequently, credit risk datasets are typically imbalanced, with a clear dominance of the majority class over the minority class. For example, when financial institutions develop new businesses, there may be a lack of relevant labeled data to train classification models. In such cases, credit assessments on loan applications are performed using unsupervised classification algorithms. However, there may typically be only 20 defaulters among 100 loan applicants. Traditional unsupervised classification has been noted to be inefficient in identifying minority classes in such imbalanced datasets [22], potentially resulting in significant financial losses for institutions relying on these models, because the imbalance ratio of a dataset negatively affects the prediction accuracy of traditional unsupervised algorithms for minority-class samples. The existing research has not comprehensively addressed the imbalance problem in unsupervised classification, given the following two challenges. First, analyzing distribution differences resulting from imbalances is difficult without labeled data [23]. As a result, clustering algorithms tend to create clusters of similar size, even though large and small groups commonly coexist in real datasets. Second, compared with supervised algorithms, unsupervised algorithms are less prone to overfitting when applied to imbalanced datasets but still cannot efficiently recognize minority-class samples. In other words, some unsupervised algorithms can fall into local optima with imbalanced datasets.

To mitigate the problem of imbalanced credit datasets in unsupervised classification, we introduce a two-stage hybrid system with spectral clustering and semi-supervised SVM (TSC-SVM). This algorithm is designed to address the difficulty of identifying minority-class samples in imbalanced datasets using unsupervised classification algorithms. The original idea comes from the attention mechanism: the first-stage model is expected to help the second-stage model focus on certain credit applicants. In the first stage, the dataset is partially labeled using three spectral clustering classifiers based on multiple views, an approach we call the multi-view combined unsupervised method. This method exploits complementary information among multiple views based on a probability transfer matrix. Spectral clustering is chosen owing to its ability to yield a computationally optimal solution, thereby enhancing the overall robustness of the model. This stage helps the algorithm estimate the sizes of different clusters, counteracting the tendency of clustering algorithms to create clusters of similar size. The second stage involves a semi-supervised SVM that addresses imbalances in an unsupervised manner by applying varied constraints. Semi-supervised SVMs are a family of techniques for inferring from both labeled and unlabeled data. Unlike existing methods for imbalanced unsupervised classification, the proposed strategy segments a sample dataset into multiple perspectives using the probability transition matrix. This segmentation significantly enhances the prediction accuracy and robustness of the model. Moreover, by targeting a global optimal solution, the proposed model avoids the pitfall of settling into local optima. The key contributions of this work can be summarized as follows:

  • We introduce the TSC-SVM approach capable of circumventing the challenges typically encountered in unsupervised classification algorithms, such as SVM-based unsupervised methods and spectral clustering, which often classify data into two balanced classes even when the original dataset is imbalanced [21]. The robustness and applicability of the proposed method are validated across various publicly available imbalanced credit datasets.
  • We establish a perspective decomposition strategy named the multi-view combined unsupervised method, based on a probability transfer matrix. Using perturbation theory, three different adjacency matrices are constructed from the probability transfer matrix: the forward, backward, and combined adjacency matrices. This integration enhances the accuracy and robustness of the proposed model through unsupervised classification based on three distinct perspectives.
  • We develop a semi-supervised SVM algorithm characterized by strong and weak constraints for labeled and unlabeled samples, respectively. This distinctive approach enables the creation of two hyperplanes that effectively constrain unlabeled samples while mitigating the influence of noise and outlier samples on the separation plane. Consequently, this algorithm can effectively detect and handle imbalanced samples.

The remaining paper is organized as follows: Section 2 provides an overview of unsupervised learning methods and the problem of data imbalance. Section 3 outlines the proposed strategies and their simple reformulation to establish the TSC-SVM approach. Section 4 presents the results of numerical experiments conducted on various imbalanced credit datasets to assess the performance of the proposed method. Section 5 presents the concluding remarks and highlights directions for future research.

2 Related work

This section provides a brief review of imbalanced classification methods, hybrid models, and unsupervised learning methods.

2.1 Imbalanced credit classification

An imbalanced dataset is one in which one or more classes significantly outnumber others [24]. This imbalance is a prevalent challenge in credit risk assessment, where lenders frequently encounter datasets with a large number of repaid loans but few defaults [25]. Machine learning methods face challenges in learning from such imbalanced datasets, as classifiers are inclined to predominantly predict the majority class, reflecting real-world conditions [26]. Three approaches can be used to address the issue of imbalance, i.e., resampling, cost-sensitive learning, and ensemble learning methods [27].

Resampling methods adjust the number of instances in different classes within a dataset, incorporating undersampling and oversampling steps. Undersampling reduces the size of the majority class to balance the dataset. A representative undersampling method is one-sided selection, which identifies and eliminates duplicate or noisy samples from the majority class, thereby reducing its size [28]. In contrast, oversampling addresses class imbalances by augmenting the number of minority-class samples. A well-known example is the synthetic minority oversampling technique (SMOTE), which creates additional samples by interpolating between existing minority-class samples based on spatial similarity [29]. SMOTE has been noted to be more effective than other resampling algorithms in handling imbalanced credit classification problems [30]. Elbow fuzzy noise filtering SMOTE can effectively reduce the noise within clusters and thus improve prediction accuracy [31]. The semi-supervised hybrid resampling (SSHR) method runs semi-supervised clustering to capture the data distribution for both oversampling and undersampling [32]. However, a significant limitation of both undersampling and oversampling methods is their reliance on labeled samples, rendering them unsuitable for addressing imbalance in unsupervised learning contexts.
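SMOTE's interpolation step can be illustrated in a few lines. The following NumPy sketch (the function name and defaults are ours, not from the reference implementation in [29]) synthesizes one new point per minority sample by interpolating toward a random minority-class neighbor:

```python
import numpy as np

def smote_sketch(X_min, k=3, seed=0):
    """Minimal SMOTE-style oversampling sketch: for each minority
    sample, pick one of its k nearest minority neighbors and
    synthesize a point at a random position on the segment
    between them."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbors
    synthetic = []
    for i in range(len(X_min)):
        j = rng.choice(nn[i])              # random neighbor of sample i
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

Because each synthetic point lies on a segment between two existing minority samples, the generated data stay inside the convex hull of the minority class, which is the property that distinguishes SMOTE from random duplication.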

Cost-sensitive methods can be integrated into the learning framework to address problems related to imbalanced sample distributions and the costs associated with misclassification. For example, the cost-sensitive online gradient descent approach, based on a modified hinge loss function, can address the imbalance problem in credit online learning [33]. Additionally, a cost-sensitive SVM has been proposed for generating probabilistic outputs for credit scoring [34]. However, cost-sensitive functions rely on labeled samples and prior information, rendering them more suitable for solving supervised imbalanced credit classification problems, where such samples and knowledge are available.

Ensemble learning methods enhance the accuracy of individual classifiers by amalgamating the decisions of multiple distinct classifiers to produce a singular classification. For instance, multi-layer structured gradient boosted DTs with light gradient boosting machines have been effectively employed to resolve the imbalance problem in peer-to-peer (P2P) lending. However, these approaches require the adjustment of a cost-sensitive function based on labeled data. The cost-sensitive boosted tree method can identify potential default borrowers within labeled P2P datasets [35]. In particular, by leveraging the collective insights of multiple DTs, this method can enhance the predictive accuracy in scenarios characterized by imbalanced class distributions [36]. The Lasso-logistic regression ensemble (LLRE) learning algorithm, an ensemble model based on logistic regression, splits the training dataset into several small balanced subsets [37]. BIRCH clustering borderline SMOTE (BCBSMOTE) addresses the problem of highly skewed credit card transaction datasets [38]. However, LLRE and BCBSMOTE also rely on sample labels for resampling.

Although scholars have extensively explored the issue of imbalanced datasets in credit evaluation, the focus has been on supervised learning contexts. Research on addressing unsupervised imbalanced credit datasets remains limited, despite the considerable attention the imbalance problem has received in other domains, such as media data and image segmentation. This research gap highlights the need for additional research in the domain of unsupervised classification within credit evaluation, considering the unique challenges and methodologies associated with unsupervised learning compared with supervised approaches [21, 22].

2.2 Unsupervised learning

Unsupervised learning is distinguished by its ability to make decisions or identify patterns based on unlabeled data [39]. This aspect is particularly advantageous in the context of financial institutions, as it eliminates the need to wait for loan outcomes, thereby facilitating the efficient updating of models [40]. Additionally, the ease of obtaining unlabeled data presents an opportunity for financial institutions to expand their businesses. Because unsupervised learning does not rely on pre-labeled outcomes, it can effectively utilize the vast amounts of readily available, unlabeled financial data, offering a more agile and potentially insightful means of understanding and predicting financial behaviors and trends [10]. Although some studies have applied semi-supervised methods to mitigate the lack of sample labels [41], these methods still cannot solve the problem of credit risk assessment for fully unlabeled samples.

Traditional unsupervised learning methods, commonly referred to as clustering techniques, can be categorized into prototype-based (using efficient algorithms such as K-means) [16, 42], hierarchical (aimed at discovering nested clusters) [43], density-based (designed to identify regions with varying densities) [44], and neural network-based methods [45]. Despite their utility, these traditional unsupervised learning algorithms often struggle with sample distributions and are prone to converging on local optima [46]. To enhance the robustness and accuracy of unsupervised learning, SVM-based models, specifically one-class SVMs (OC-SVMs), have been introduced. OC-SVMs cluster data based on a single type of label, offering a novel approach to unsupervised learning [47]. Subsequent developments, such as the weighted one-class SVM and fuzzy one-class SVM, aim to enhance the accuracy and efficiency of OC-SVMs [18]. However, OC-SVM-based methods rely on initial labeled data points. To address this limitation, the unsupervised quadratic surface SVM (US-QSSVM) was developed, improving the performance of SVM-based unsupervised learning algorithms on credit risk datasets [10]. Nonetheless, these algorithms are not inherently suitable for handling imbalanced datasets in unsupervised learning scenarios, which represents a crucial concern for credit risk assessment.

In unsupervised imbalanced learning, the majority class often forms clusters, overshadowing the minority class, which may hold greater significance. Consequently, scholars have explored solutions for unsupervised learning in imbalanced datasets. For instance, the golden-rule QSSVM approach was developed to adjust the separation plane in such scenarios [10]. However, its inability to solve for a global optimal solution limits its accuracy. Another approach is the partition constrained minimum cut (PCut), aimed at finding minimum cut partitions while maintaining minimum cluster sizes [21]. Although PCut can address certain aspects of the imbalance problem, its computational efficiency is hampered by the complexity involved in searching for optimal parameters. Imbalanced clustering and optimal margin distribution clustering approaches have been designed for high-dimensional imbalanced data [22, 46]. However, the complexity and specific conditions required for the effective implementation of these algorithms restrict their generalizability in diverse scenarios. Thus, it is necessary to identify robust and versatile unsupervised learning solutions capable of effectively managing imbalanced datasets. Although the isolation forest and copula-based outlier detection methods can manage highly imbalanced fraud data [48], their effectiveness on general imbalanced datasets remains questionable.

3 Proposed method

In this section, we first introduce the multi-view combined unsupervised method, which leverages multiple perspectives to analyze data, enhancing the ability to uncover hidden patterns and relationships in unlabeled datasets. This approach is particularly effective in addressing the complexities and nuances of imbalanced datasets. Subsequently, we describe the semi-supervised SVM model, designed to bridge the gap between unsupervised and supervised learning by using both labeled and unlabeled data. This model is particularly adept at handling scenarios where labeled data is scarce or incomplete, rendering it valuable for managing imbalanced datasets in a semi-supervised context. These two techniques can be combined to address the challenges posed by imbalanced datasets in various domains.

3.1 Multi-view combined unsupervised method

Given the limited accuracy of unsupervised classification algorithms when applied to imbalanced datasets, it is necessary to address the challenge of local optima commonly encountered by traditional unsupervised algorithms, particularly in credit risk assessment. To this end, sufficient information must be mined from data samples. The main purpose of this stage is to identify samples with obvious credit risks. In this study, a bagging strategy is used to address the shortcomings of a single unsupervised classifier. Spectral clustering is adopted as the classifier because it can attain global optimal solutions. In general, spectral clustering algorithms determine the similarity between samples based on a distance-based adjacency matrix. Considering the difficulty of sampling unlabeled datasets, we train different classifiers by constructing different adjacency matrices.

In the first stage, we construct three adjacency matrices for training the ensemble model and labeling certain samples. Consider an unlabeled sample space including n points. The function d(xi, xj) represents the Euclidean distance between any two points xi and xj. Assuming that the transition probabilities between sample points are related to their Euclidean distances, we construct a sample-based transition matrix P to capture the relationships and transition likelihoods between different points in the sample space; that is, matrix P has n states. This provides a foundation for effectively identifying and labeling a portion of the samples based on their relative positions and distances. This approach is particularly valuable for dissecting and understanding the structure of imbalanced datasets, in which traditional methods might overlook subtler yet significant patterns owing to the dominance of the majority class. Thus, by breaking down the sample space into multiple perspectives and examining the relationships between points, a more nuanced and robust understanding of the dataset can be achieved, leading to improved unsupervised classification in credit risk assessment. In other words, borrowers with obvious credit risk characteristics, whether their default risk is extremely low or extremely high, are easy to identify with any distance measure. Therefore, the multi-view method can efficiently select some samples with obvious credit risks, which helps the model estimate the imbalance ratio of the samples.

In this model, each element pij of the transition matrix P is defined independently of the parameters of the radial basis function. This setting helps enhance model robustness by eliminating the need for parameter selection. pij represents the probability that sample point i transitions to sample point j. In other words, pij is an asymmetric measure of the correlation between sample points i and j, as pij is not equal to pji, unlike traditional distance-based symmetric measures. Spectral clustering is an unsupervised learning method based on the relationships between sample points. The objective is to ensure that samples within the same cluster are as similar as possible, while those in different clusters are distinct [49].
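The text does not reproduce the exact formula for pij here. As an illustration, one parameter-free choice consistent with the stated properties (row-stochastic, asymmetric, no radial-basis bandwidth to tune) is row-normalized inverse distance; this construction is our assumption, not necessarily the paper's definition:

```python
import numpy as np

def transition_matrix(X):
    """Row-stochastic transition matrix built from inverse Euclidean
    distances. Rows sum to 1, and p_ij != p_ji in general because
    each row is normalized separately; no bandwidth parameter is
    required."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # forbid self-transitions
    w = 1.0 / d                            # closer points get higher affinity
    return w / w.sum(axis=1, keepdims=True)
```

The asymmetry arises purely from the per-row normalization: a point in a dense region divides its probability mass among many close neighbors, whereas an outlier concentrates its mass on few points, which is exactly the directional information the forward and backward views exploit.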

To more effectively extract information from the samples and enhance the clustering efficiency, we decompose the probability transfer matrix P into two weighted adjacency matrices based on the directional differences in sample transitions. Specifically, the probability of sample xi transitioning to xj is not necessarily the same as that of xj transitioning to xi. Consequently, we split matrix P into a forward adjacency matrix F (representing the forward view) and backward adjacency matrix B (representing the backward view), each corresponding to the direction of sample transformation. By considering both forward and backward transition probabilities, we can better capture the asymmetric nature of the relationships within the dataset, leading to more accurate and effective clustering, particularly in imbalanced datasets commonly encountered in credit risk assessment.

Matrices F and B can be verified to be symmetric and positive semi-definite. However, these matrices are sensitive to the order in which the samples are input. To mitigate the influence of sample order on the results of unsupervised learning, we propose a strategy to sort the samples based on the distance d(xi, o) from their positions xi to a central point o, typically defined as the centroid or mean of the sample space. This sorting in descending order ensures that the sample order does not unduly influence the learning process. Subsequently, we define two normalized graph Laplacian matrices, which are crucial to the spectral clustering process as they encapsulate the essential structure of the data by reflecting the relationships and connections between different samples. The normalization of these matrices ensures that the clustering process is not disproportionately influenced by variations in sample size or density, yielding a more balanced and accurate representation of the underlying data structure. This aspect is particularly beneficial in the context of imbalanced datasets, where traditional methods may be skewed by the dominance of the majority class.
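As a concrete sketch, one construction satisfying the stated properties (symmetric, positive semi-definite, direction-dependent) is F = PPᵀ and B = PᵀP; this is our assumption for illustration, not necessarily the paper's exact definition. The symmetrically normalized graph Laplacian of each view then follows the standard formula L = I − D^(−1/2) W D^(−1/2):

```python
import numpy as np

def forward_backward_laplacians(P):
    """Forward and backward views with their normalized Laplacians.

    F = P P^T couples points that transition to similar targets
    (forward view); B = P^T P couples points that are reached from
    similar sources (backward view). Both are symmetric PSD by
    construction, as stated in the text."""
    F = P @ P.T
    B = P.T @ P

    def norm_laplacian(W):
        # L = I - D^{-1/2} W D^{-1/2}, with D the diagonal degree matrix
        d = np.maximum(W.sum(axis=1), 1e-12)
        s = 1.0 / np.sqrt(d)
        return np.eye(len(W)) - W * s[:, None] * s[None, :]

    return norm_laplacian(F), norm_laplacian(B)
```

Both Laplacians are symmetric with eigenvalues in [0, 2], which is what makes the eigenvector-based comparison via canonical angles in the next subsection well defined.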

The degree matrix D is a key component in defining the normalized graph Laplacian matrices. Specifically, D is a diagonal matrix where each element on the diagonal, d1, …, dn, represents the degree of a node in the graph. The node degree typically refers to the sum of the weights of the edges connected to that node. In spectral clustering, these weights are derived from the adjacency matrices F and B. To enhance the accuracy of the proposed algorithm, especially in handling imbalanced datasets, we introduce a third perspective through weighted adjacency matrices. In particular, we define a combination matrix C, which merges the information encapsulated in both F and B. C is designed to synthesize the distinct but complementary perspectives provided by the forward and backward adjacency matrices, offering a more comprehensive view of the relationships and structures within the dataset. By integrating this third perspective, a more complete understanding of the data can be achieved, which is crucial in unsupervised learning scenarios.

In constructing the combination matrix C, we use real non-negative weights μ1 and μ2, where μ1 + μ2 = 1. This ensures that the combined matrix is a weighted average of the forward and backward perspectives, balancing the influence of each. The normalized graph Laplacian matrix for this combined view can then be defined as LC = μ1LF + μ2LB. To ensure coherence and avoid excessive discrepancies between clustering results under the three perspectives (forward, backward, and combined), we seek an intermediate perspective that balances the forward and backward views. To this end, we introduce the canonical angle, which quantifies the difference in clustering ability between the eigenvectors of the forward and backward views [50]. This metric is particularly useful for understanding how different perspectives may influence the final clustering results and for adjusting the weights μ1 and μ2 accordingly to achieve an optimal balance. By carefully calibrating these weights based on the canonical angle, the clustering performance of the proposed algorithm can be enhanced, especially in scenarios involving imbalanced datasets.

Definition 1. Let Vk and V′k be subspaces spanned by the orthonormal eigenvectors v1, …, vk and v′1, …, v′k, respectively. Moreover, let γ1 ≤ … ≤ γk represent the singular values of Vk⊤V′k. Then, θi = arccos(γi), i = 1, …, k, (1) where θi represents the canonical angles between Vk and V′k.

If the largest canonical angle, defined in Eq (1), is small, the subspaces Vk and V′k are closely aligned. This observation is significant in spectral clustering, where clustering results depend on the first k eigenvectors of the Laplacian matrix. Consequently, a critical task in enhancing the clustering process is to find the optimal pair of weights (μ1, μ2). These weights should be chosen such that the first k eigenvectors of the Laplacian matrix LC closely approximate the first k eigenvectors of the Laplacian matrices LF and LB, respectively. To accomplish this, we introduce a theorem related to the canonical angle to quantitatively assess the closeness of these eigenvectors [50] and determine the μ1 and μ2 that result in the most effective clustering by leveraging the combined strengths of both forward and backward views. This selection ensures that the combined view represented by LC effectively merges the information encapsulated in both LF and LB, leading to a more accurate and robust clustering outcome, especially in the context of imbalanced datasets.
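Definition 1 translates directly into code: the canonical angles between two subspaces are the arccosines of the singular values of the product of their orthonormal bases. A minimal sketch:

```python
import numpy as np

def canonical_angles(Vk, Wk):
    """Canonical angles between subspaces with orthonormal bases Vk, Wk.

    Per Definition 1, the singular values of Vk^T Wk are the cosines of
    the canonical angles; identical subspaces give all-zero angles and
    mutually orthogonal subspaces give angles of pi/2."""
    gamma = np.linalg.svd(Vk.T @ Wk, compute_uv=False)
    gamma = np.clip(gamma, 0.0, 1.0)   # guard against rounding past 1
    return np.arccos(gamma)
```

In the TSC-SVM setting, Vk and Wk would be the matrices of the first k eigenvectors of LF and LB, and the resulting angles feed into the choice of the weights (μ1, μ2).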

Theorem 3.1 Let λi, vi and λ′i, v′i represent the ith eigenvalues and eigenvectors of L and L′, respectively. Additionally, let Θ = diag(θ1, …, θk) be the diagonal matrix of canonical angles between the column spaces of Vk = (v1, …, vk) and V′k = (v′1, …, v′k). If there exists a gap η > 0 such that λk+1 − λk ≥ η and λ′k+1 − λ′k ≥ η, then ||sin Θ||F ≤ ||L − L′||F/η, (2) where sin Θ is defined entry-wise and ||·||F denotes the Frobenius norm.

Following the literature, the gap η can be considered a constant determined by the ground-truth Laplacian matrix L [51]. Therefore, we can minimize the upper bound in Eq (2) to obtain the eigenvectors of LC closest to those of LF and LB. The optimization function is defined as (3)

Clearly, Eq (3) represents a quadratic programming problem, and thus, the optimal weights (μ1, μ2) can be determined from the optimal solution of Eq (3). We perform spectral clustering on the three Laplacian matrices LF, LB, and LC, obtaining the clustering results ŷF, ŷB, and ŷC, respectively. Details of the spectral clustering algorithm are not presented herein for simplicity and can be found elsewhere [52]. The samples are labeled yi using a one-vote veto strategy to obtain the most robust results: a sample receives a label only when all three clustering results agree; any disagreement leaves the sample unlabeled.
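The one-vote veto aggregation can be sketched as follows; the sentinel value -1 for vetoed (unlabeled) samples is our convention:

```python
import numpy as np

def one_vote_veto(y_f, y_b, y_c):
    """Aggregate three clustering results: a sample keeps its label only
    when the forward, backward, and combined views all agree; any
    dissenting view vetoes the label, leaving the sample unlabeled (-1)."""
    y_f, y_b, y_c = map(np.asarray, (y_f, y_b, y_c))
    agree = (y_f == y_b) & (y_b == y_c)
    labels = np.full(y_f.shape, -1)
    labels[agree] = y_f[agree]
    return labels
```

Only the unanimously labeled samples become the "labeled" set passed to the semi-supervised SVM in stage 2; everything vetoed here remains in the unlabeled pool.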

The multi-view combined unsupervised method presents two advantages for unsupervised imbalanced learning: First, all results are obtained using the global optimal solution. Second, the number of parameters to be selected is reduced. These aspects help enhance the robustness and generalizability of the proposed method. This stage helps the algorithm estimate the sizes of different clusters, counteracting the tendency of clustering algorithms to create clusters of similar size. In other words, this stage identifies applicants with evident credit risk characteristics, regardless of whether they are good or bad, which helps the subsequent algorithm determine the sizes of different clusters via the separating planes.

Algorithm 1: Multi-view combined unsupervised method

1: Input: data points xi, i = 1, …, n

2: Calculate the central point o as the mean of {xi}

3: Sort the data points by the distance d(xi, o) in descending order

4: Calculate the forward and backward normalized graph Laplacian matrices LF and LB

5: Calculate the weights μ1 and μ2 by solving the quadratic programming problem in Eq (3)

6: Calculate LC ← μ1LF + μ2LB

7: Obtain the labels ŷF, ŷB, and ŷC by spectral clustering with the three Laplacian matrices LF, LB, and LC

8: Define the labels yi using a one-vote veto strategy on ŷF, ŷB, and ŷC

9: Output: labels yi
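The spectral clustering call in line 7, whose details the text defers to [52], follows the standard recipe: embed each sample as a row of the eigenvectors belonging to the k smallest eigenvalues of the Laplacian, then run k-means on the rows. The deterministic farthest-point initialization below is our simplification:

```python
import numpy as np

def spectral_clustering(L, k=2, iters=50):
    """Standard spectral clustering on a normalized graph Laplacian L:
    spectral embedding from the k smallest eigenvalues, followed by a
    small Lloyd's k-means loop on the embedded rows."""
    _, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    V = vecs[:, :k]                         # spectral embedding of the samples
    centers = [V[0]]                        # farthest-point initialization
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(V - c, axis=1) for c in centers], axis=0)
        centers.append(V[dists.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(V), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(V[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)           # assign to nearest center
        for j in range(k):
            if (labels == j).any():
                centers[j] = V[labels == j].mean(axis=0)
    return labels
```

Running this separately on LF, LB, and LC produces the three label vectors ŷF, ŷB, and ŷC that feed the one-vote veto step.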

3.2 Semi-supervised SVM

In stage 1, when the three clustering results agree, indicating consensus among the classifiers, we can effectively partition the sample set into two groups. The first group consists of labeled samples, each with an associated label yi. The second group includes the unlabeled samples. To facilitate this division, we define a matrix A for the labeled samples, which consists of the vectors ai, where i = 1, …, n1. Similarly, the matrix Q for the unlabeled samples consists of the vectors qi, where i = 1, …, n2. This approach is inspired by the principles of SVM-based unsupervised methods. The core concept of the semi-supervised SVM model is to position all labeled points outside two hyperplanes defined by the equations ω⊤x + b = 1 and ω⊤x + b = −1. Simultaneously, the model aims to confine the unlabeled samples between these two hyperplanes as tightly as possible, ensuring clear and effective separation of classes. This configuration is illustrated in Fig 1, where red and blue dots represent unlabeled samples {qi} and labeled samples {ai}, respectively. The two hyperplanes ω⊤x + b = 1 and ω⊤x + b = −1 constrain the labeled sample points on the outside and unlabeled sample points on the inside.

thumbnail
Fig 1. Framework of the semi-supervised support vector machine.

https://doi.org/10.1371/journal.pone.0316557.g001

In the semi-supervised SVM framework, the objective is to maximize the margin 2/||ω|| while adhering to specific constraints: all unlabeled points must lie between the two hyperplanes ωTx + b = 1 and ωTx + b = −1, and all labeled samples must be situated outside these hyperplanes. To introduce flexibility and manage potential misclassification or overlap, slack variables ξ and σ are incorporated into the model. These variables relax the constraints, accommodating instances where it is not feasible to strictly enforce the original conditions. Consequently, the semi-supervised model with slack variables is formulated as follows: (4)
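Eq (4) is not rendered in this extraction. Based on the surrounding description (margin 2/||ω||, labeled samples pushed outside the unit hyperplanes with slack ξ, unlabeled samples confined inside with slack σ, penalties C1 and C2), a plausible reconstruction of the primal is the following; this is an inference from the text, not the authors' exact formula:

```latex
\begin{aligned}
\min_{\omega,\, b,\, \xi,\, \sigma}\quad
  & \frac{1}{2}\|\omega\|^{2}
    + C_{1}\sum_{i=1}^{n_{1}}\xi_{i}
    + C_{2}\sum_{i=1}^{n_{2}}\sigma_{i} \\
\text{s.t.}\quad
  & y_{i}\bigl(\omega^{T}a_{i}+b\bigr) \ge 1-\xi_{i},
      \qquad i=1,\dots,n_{1}, \\
  & \bigl|\omega^{T}q_{i}+b\bigr| \le 1+\sigma_{i},
      \qquad i=1,\dots,n_{2}, \\
  & \xi_{i}\ge 0,\quad \sigma_{i}\ge 0 .
\end{aligned}
```

Each absolute-value constraint splits into two linear inequalities, so the problem remains convex, consistent with the discussion that follows.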

In this model, C1 and C2 are penalty coefficients, which determine the strength of the constraints on the labeled and unlabeled samples. Because labeled samples clearly indicate the credit risk, strong and weak constraints are imposed on the labeled and unlabeled samples, respectively; in other words, C1 ≫ C2. The formulation in Eq (4) constitutes a convex optimization problem. The convexity of this problem is advantageous, ensuring that any local optimum found is also a global optimum. To efficiently solve this optimization problem, a common approach is to derive the dual problem. The dual formulation of an optimization problem often simplifies the solution process, particularly in the context of SVMs. The Lagrangian function, presented in Eq (5), is formulated using Lagrange multipliers β1, β2, β3, β4, β5 ≥ 0. These multipliers facilitate the relaxation of the primal constraints and enable the transition from the primal to the dual problem. To identify the saddle point of the Lagrangian function, which corresponds to the optimal solution of the problem, the partial derivatives of the Lagrangian function with respect to the primal variables (ω, b, ξ, σ) are set to zero and the resulting equations are solved, a standard procedure in convex optimization. (5)

Setting the partial derivatives of LSVM with respect to the primal variables (ω, b, ξ, σ) to zero yields the optimality conditions corresponding to the following dual problem: (6)

For the sake of brevity and clarity, detailed mathematical derivations and the Karush–Kuhn–Tucker (KKT) conditions relevant to the semi-supervised SVM are presented in S1 Appendix. Overall, the optimal solution for the semi-supervised SVM can be obtained using quadratic optimization methods, as outlined in Eq (6). Quadratic optimization is a well-established method in optimization theory, particularly suited for SVM frameworks where the objective function and constraints can be expressed in quadratic forms. Once the optimal solution has been determined through this process, it can be used to predict the final classification results within the TSC-SVM. This system integrates the strengths of spectral clustering and semi-supervised learning to effectively handle the challenges posed by imbalanced datasets in unsupervised learning.
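As a concrete, hypothetical illustration, the primal described above can be approximated by minimizing an equivalent hinge-penalty objective with subgradient descent. This sidesteps the dual QP of Eq (6) entirely, so it is a sketch of the idea (labeled points pushed outside the margin, unlabeled points pulled inside it), not the authors' solver; the dataset and all parameter values are invented.

```python
import numpy as np

def s3vm_fit(A, y, Q, C1=100.0, C2=1.0, lr=1e-3, iters=4000, seed=0):
    """Subgradient descent on a penalty form of the semi-supervised SVM:
       0.5||w||^2 + C1*sum max(0, 1 - y_i(w.a_i + b))   (labeled outside the margin)
                  + C2*sum max(0, |w.q_i + b| - 1).     (unlabeled inside the band)"""
    rng = np.random.default_rng(seed)
    w, b = rng.normal(scale=0.01, size=A.shape[1]), 0.0
    for _ in range(iters):
        g_w, g_b = w.copy(), 0.0                  # gradient of the 0.5||w||^2 term
        viol = y * (A @ w + b) < 1                # labeled margin violations
        g_w -= C1 * (y[viol, None] * A[viol]).sum(axis=0)
        g_b -= C1 * y[viol].sum()
        fq = Q @ w + b
        out = np.abs(fq) > 1                      # unlabeled points outside the band
        s = np.sign(fq[out])
        g_w += C2 * (s[:, None] * Q[out]).sum(axis=0)
        g_b += C2 * s.sum()
        w, b = w - lr * g_w, b - lr * g_b
    return w, b

rng = np.random.default_rng(1)
A = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (5, 2))])
y = np.array([-1] * 20 + [1] * 5)                 # imbalanced stage-1 labels
Q = np.vstack([rng.normal(-1, 0.4, (8, 2)), rng.normal(1, 0.4, (4, 2))])
w, b = s3vm_fit(A, y, Q)
pred = np.sign(Q @ w + b)                         # final separating plane w.x + b = 0
```

With C1 much larger than C2, the fitted plane honors the stage-1 labels strictly while the band position adapts to the ambiguous unlabeled points, mirroring the C1 ≫ C2 choice described below.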

Fig 2 shows the TSC-SVM framework. Generally, the TSC-SVM framework selects samples with obvious credit risks in the first stage and focuses on the remaining samples in the second stage. In the first stage, easy-to-identify samples are labeled using a bagging ensemble method, i.e., the multi-view combined unsupervised method. These labeled samples aid the semi-supervised SVM in determining the optimal separating plane. This stage helps the algorithm estimate the sizes of the different clusters, counteracting the tendency of clustering algorithms to create clusters of similar size. It is similar to an attention mechanism, enabling the second-stage model to focus on sample points that are closer to the separating plane. This ensemble learning method overcomes the limitations of a single classifier in identifying minority-class samples. The spectral clustering algorithm divides the samples into the two parts with the greatest difference between them, enabling samples with obvious credit risks, whether "too good" or "too bad", to be identified.

In the second stage, the results of the first stage are used to split the dataset {xi} into two subsets: {ai} (labeled dataset) and {qi} (unlabeled dataset). In the semi-supervised SVM, the hyperplanes ωTx + b = 1 and ωTx + b = −1 maximize the margin separating the labeled samples {ai}. In addition, the hyperplanes constrain the unlabeled samples {qi}, allowing the separating plane ωTx + b = 0 to efficiently identify minority-class samples. Notably, the second stage places more focus on the unlabeled samples; the semi-supervised support vector machine model helps achieve this goal and improves the prediction accuracy. Furthermore, spectral clustering and the semi-supervised SVM have low parameter requirements, which is a key reason for their selection as the main classification methods.

Algorithm 2: Semi-supervised SVM

1: Input: labeled group {ai} with labels yi, unlabeled group {qi}, penalty coefficients C1 and C2

2: Calculate ω and b by solving the convex programming problem Eq (4)

3: Obtain the label of each data point xi from the hyperplane ωTx + b = 0

4: Output: the label of each data point xi

Overall, the TSC-SVM, in which the first-stage model helps the second-stage model focus on certain credit applicants, aims to solve credit assessment under unsupervised imbalanced learning. The next section demonstrates the effectiveness of this method through numerical experiments.

4 Results

4.1 Experimental setup

To assess the effectiveness of the TSC-SVM model, various public benchmark datasets specific to credit risk were employed. These datasets provided a comprehensive platform for testing and validating the model performance in real-world scenarios. The performance of the TSC-SVM model was compared with those of state-of-the-art methods, including K-means, US-QSSVM, US-QSSVM-GOLD, PCut, spectral clustering algorithm and the unsupervised LDA (Un-LDA) [10, 16, 19, 21, 53]. Testing these methods on the same benchmark datasets enabled a comprehensive comparison across different dimensions, such as accuracy, robustness, and ability to handle imbalanced datasets, thereby clarifying the strengths and limitations of the TSC-SVM model in credit risk assessment compared with existing techniques.

To ensure a fair and thorough evaluation of the TSC-SVM model in real-world unsupervised imbalanced credit classification scenarios, it was tested on entire datasets using specific parameters. There is a lack of labeled samples for optimizing parameters in real unsupervised credit risk classification. Accordingly, the proposed method was tested on a variety of datasets under fixed parameter conditions. The k-NN weighted adjacency matrix has been proven to be effective on imbalanced datasets [21]. Hence, considering the task to be a binary classification problem, we set k = n/2 in the experiments, where n is the number of samples in the dataset. In the first stage of the TSC-SVM model, different penalty parameters were assigned to labeled (C1 = 210) and unlabeled samples (C2 = 1) to account for their distinct roles in the model. This differentiation ensured that labeled samples adhered strictly to the two planes, while allowing some flexibility for unlabeled samples to cross the planes as needed. This approach helped stabilize the separation plane and yield more robust prediction results. To ensure that the experimental results of the TSC-SVM model were not influenced by parameter selection, and recognizing that financial institutions may not always have the necessary data to fine-tune parameters, a systematic approach was adopted for parameter selection. For the US-QSSVM and US-QSSVM-GOLD models, the optimal hyper-parameter was determined using a grid search over the range {16, 17, …, 30}, as suggested in the literature [10]. We randomly selected 50% of the original dataset for tuning parameters, repeating this process ten times to ensure statistical significance. The final parameters were chosen based on the highest average accuracy achieved. For the PCut method, the parameters suggested in a prior study [21] were used: k0 = 30 and δ = 0.05.
Following the literature, the subspace dimension size m for Un-LDA is defined as min{d, c − 1}, where d is the number of dimensions and c is the number of classes. In all experiments, datasets were processed without labels for training and tested with actual labels for validation. Each reported result represents an average over 20 experiments, ensuring the reliability and validity of the findings. All experiments were performed on a desktop computer with an Intel i9 CPU, 64 GB RAM, and the Windows 10 operating system.
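The k-NN weighted adjacency construction with the default k = n/2 can be sketched as follows. The Gaussian-similarity bandwidth and the symmetrization rule here are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def knn_adjacency(X, k=None):
    """k-NN weighted adjacency matrix; defaults to the paper's k = n/2."""
    n = len(X)
    k = n // 2 if k is None else k                # binary-classification default
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * d2[d2 > 0].mean()))     # Gaussian similarity (assumed bandwidth)
    np.fill_diagonal(W, 0.0)
    far = np.argsort(d2, axis=1)[:, k + 1:]       # indices beyond the k nearest
    np.put_along_axis(W, far, 0.0, axis=1)
    return np.maximum(W, W.T)                     # keep an edge if either endpoint keeps it

X = np.random.default_rng(0).normal(size=(40, 3))
W = knn_adjacency(X)                              # here k = n/2 = 20
```

Compared with a fully connected graph, each row retains only k weights, which is also why the k-NN variant is cheaper downstream in the spectral step.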

For validation, four widely recognized public credit benchmark datasets were chosen from the UCI public database and Kaggle [54, 55]. The datasets were selected to ensure that they adequately represented the considered algorithms across a range of conditions and applications. These datasets have been frequently used in various studies focusing on credit risk assessment [35, 56]:

  • Australian Dataset (from UCI Database): This balanced credit dataset is selected to test the robustness of the models in a balanced scenario, providing a contrast to the imbalanced datasets typically encountered in credit risk assessment.
  • Lending Club Datasets (from Kaggle): The 2018 Q4 (2018Q4) and 2018 Q3 (2018Q3) versions were chosen owing to their relevance to current credit risk evaluation scenarios. These datasets include loans categorized as non-default and late (31–120 days), according to the latest sample available on Kaggle. These datasets offer a real-world context for testing the model performance in handling various loan statuses. In particular, these datasets are selected as they include real imbalanced credit reports provided by financial institutions. Notably, Lending Club is the largest P2P platform in the United States and these datasets have been cited in various credit classification studies.
  • Credit Risk Dataset (from Kaggle): This dataset provides additional insights into credit risk assessment, complementing the Lending Club and Australian datasets.

Table 1 provides an overview of these datasets, including the number of samples, distribution of classes, and other relevant characteristics. This information is vital for understanding the context in which the algorithms were tested and interpreting the results of the experiments. The diversity and representativeness of these datasets ensure that the findings of this study are robust and applicable to a broad range of scenarios in the field of credit risk assessment. Samples with missing values were removed from the datasets, and the experiments used all remaining samples without their labels.

In evaluating the performance of the proposed model, particularly in the context of imbalanced data distributions commonly encountered in credit datasets, relying solely on the overall accuracy can be misleading. This is because the accuracy metric does not differentiate between the classification performance on the minority class (often more critical in credit risk scenarios) and majority class. For instance, in a dataset with a 90% imbalance ratio, classifying all samples as the majority class would still yield a 90% accuracy rate, despite failing to identify any of the minority-class samples. To more comprehensively evaluate the classifier ability to accurately identify both positive (minority) and negative (majority) classes, four key metrics were used. Table 2 presents a confusion matrix, providing a better understanding of these metrics. The confusion matrix specifies the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which are the basic elements required to compute these metrics. These evaluation metrics were used to enable a balanced and nuanced assessment of the model performance, acknowledging the complexities inherent in imbalanced datasets. Such an approach is critical in demonstrating the efficacy of the proposed model in real-world credit risk assessment scenarios.

Table 2 defines TP, FP, TN, and FN in the confusion matrix used for evaluating these metrics, with the minority class designated as the positive class. In credit classification, an FN indicates that a loan defaulter is predicted to be a non-defaulter, leading to substantial economic losses for financial institutions. An FP indicates that the classification algorithm predicts a non-defaulter as a loan defaulter, resulting in the loss of a potential customer. The economic loss caused by FNs is typically greater than that caused by FPs, which is why we emphasize the identification efficiency of the positive (minority) class. Based on these definitions, the following evaluation metrics can be better interpreted:

These evaluation metrics allow a fair comparison of the different models; the experimental results are reported using these indicators.
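From the Table 2 quantities, the standard metrics can be computed as follows; the 90%-imbalance example from the text shows why accuracy alone is misleading:

```python
def credit_metrics(tp, fp, tn, fn):
    """Standard metrics from the confusion matrix in Table 2
    (minority/defaulter class treated as positive)."""
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0   # share of defaulters caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# 90%-imbalanced example from the text: predicting everyone as the majority
# class scores 90% accuracy yet recall = 0 on the defaulters.
acc, prec, rec, f1 = credit_metrics(tp=0, fp=0, tn=90, fn=10)
```

Here acc is 0.9 while recall and F1 are 0, which is exactly the failure mode that motivates reporting recall and F1 alongside accuracy.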

4.2 Experiments results

Table 3 presents the evaluation metrics of the proposed TSC-SVM method and the comparative algorithms (K-means, US-QSSVM, US-QSSVM-GOLD, PCut, spectral clustering, and Un-LDA). Compared with the other algorithms across the various credit datasets, TSC-SVM consistently performs better, highlighting its robustness and reliability for credit risk assessment in diverse scenarios, particularly those involving imbalanced datasets. To support these conclusions, Wilcoxon test results comparing the proposed algorithm with the other algorithms on the F1 indicator are also reported; ** and * indicate p-values less than 0.01 and 0.1, respectively.
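The paired Wilcoxon signed-rank comparison on per-run F1 scores can be sketched as below. The scores are synthetic stand-ins, not the paper's results, and the implementation uses a normal approximation without tie handling:

```python
import numpy as np
from math import erf, sqrt

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test with a normal approximation.
    Zero differences are dropped; ties in |d| are not specially handled."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0   # ranks of |d|: 1..n
    w_plus = ranks[d > 0].sum()                       # rank sum of positive diffs
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return w_plus, 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p

rng = np.random.default_rng(0)
f1_tscsvm   = rng.normal(0.62, 0.02, 20)   # synthetic per-run F1 scores over 20 runs
f1_baseline = rng.normal(0.55, 0.03, 20)   # (placeholders, not the paper's data)
stat, p = wilcoxon_signed_rank(f1_tscsvm, f1_baseline)
stars = "**" if p < 0.01 else "*" if p < 0.1 else ""
```

In practice `scipy.stats.wilcoxon` offers exact p-values and proper tie handling; the manual version above just makes the rank-sum mechanics visible.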

Specifically, the results on the 2018 Q4, 2018 Q3, and Credit Risk datasets show that the TSC-SVM algorithm demonstrates enhanced capability in recognizing the minority class, which is crucial in credit risk assessment, where identifying potential defaulters is more critical than identifying reliable borrowers. This improvement can potentially help financial institutions reduce losses due to credit defaults. Although the PCut algorithm exhibits higher accuracy on certain datasets (2018 Q4, 2018 Q3, and Credit Risk), it cannot effectively identify credit defaulters in imbalanced datasets. On the balanced Australian dataset, the TSC-SVM model also improves the identification of high-risk credit cases. Such a method is invaluable for financial institutions that require accurate and reliable models for risk assessment to manage and mitigate credit risk. For example, the increase in the recall indicator means that more loan defaulters are accurately identified, preventing financial institutions from suffering significant losses owing to loan defaults, as in the case of 2018 Q4 and 2018 Q3.

The experimental results of all algorithms on the balanced dataset are better than those on the imbalanced datasets, namely the 2018 Q4, 2018 Q3, and Credit Risk datasets. This means that the imbalance problem has a negative impact on unsupervised algorithms, for two main reasons. First, clustering algorithms create clusters of similar size, meaning that traditional unsupervised algorithms ignore the imbalanced proportions in the data. Certain algorithms, such as US-QSSVM and spectral clustering, inherently divide the dataset into two balanced categories based on their optimization functions. This approach fails to account for the inherent imbalance in the data, thereby neglecting the minority class, which is often of greater interest in credit risk assessment. Although the PCut algorithm is designed to address imbalanced problems in an unsupervised manner, it has its own set of challenges. The complex parameter search mechanism required for PCut leads to high computational complexity and reduced efficiency. Moreover, prior knowledge is required to set parameters such as k and δ, undermining the generalization capability and robustness of the algorithm. These issues limit the practical applicability of PCut in diverse real-world scenarios, particularly in financial contexts where data may not always conform to predefined assumptions. Second, in the absence of prior knowledge, traditional unsupervised learning algorithms are prone to converging on local optimal solutions. This issue becomes more pronounced in imbalanced datasets where the minority class is underrepresented, leading to biased or incomplete learning outcomes. These insights underscore the need for more sophisticated and tailored approaches, such as the TSC-SVM model, designed to address the challenges of unsupervised imbalanced learning in credit risk assessment. The three parameters that need to be determined for this model are k, C1, and C2.
Penalty parameters C1 and C2 are used to strengthen labeled sample constraints, and thus, C1 must be sufficiently large. The parameters k, C1 and C2 are discussed in more detail in a later section. Such methods are crucial for financial institutions seeking reliable and efficient tools for risk evaluation in the increasingly complex and data-driven financial landscape.

Notably, the proposed algorithm demonstrates robust and efficient performance across both balanced and imbalanced datasets, a significant achievement considering the distinct challenges posed by each type of dataset. The effectiveness of our algorithm does not rely on prior knowledge for parameter adjustment, which is a common limitation in many traditional models. Its ability to efficiently classify all credit datasets, even with fixed parameters, highlights its adaptability and ease of use. The robustness and accuracy of the proposed algorithm across a diverse range of datasets can be attributed to two key factors: The TSC-SVM approach is designed to derive the global optimal solution at all steps of the process, effectively circumventing the issue of local optimal solutions. This global perspective ensures that the algorithm comprehensively considers the entire dataset, leading to more accurate and reliable outcomes. This method provides stable and accurately labeled samples, which are crucial for the effective performance of the semi-supervised SVM. Second, by incorporating multiple views, the algorithm can capture a broader and more nuanced understanding of the data, thereby enhancing the quality of the labeled samples and optimizing the separation plane of the semi-supervised SVM, further improving its classification capabilities. Overall, the favorable performance of the proposed algorithm across both balanced and imbalanced datasets, without the need for extensive parameter tuning, renders it an appealing choice for financial institutions seeking reliable and efficient tools for credit risk assessment, financial analytics, and risk management.

The TSC-SVM model is a hybrid model; therefore, it is important to discuss its computational complexity. First, the multi-view combined unsupervised method comprises three spectral clustering problems and a convex optimization problem. The computational cost of spectral clustering increases with the number of sample points but remains controllable overall, and Eq (3), a quadratic programming problem, is computationally tractable. Second, the semi-supervised SVM is a convex quadratic programming problem with low computational cost. Although the computational burden grows with the sample size, as for the Credit Risk dataset, the overall complexity is manageable.

Table 4 shows the computation times on the 2018 Q3, 2018 Q4, Credit Risk, and Australian datasets. In summary, the computational cost of the TSC-SVM model is affordable for most credit risk assessment datasets.

4.3 Sensitivity analysis

To clarify the influence of the parameters on the model performance, additional experiments were conducted, focusing on the contributions of the k-NN weighted adjacency matrix, multi-view combined unsupervised method and penalty parameters. These experiments were designed to provide a fair evaluation of the model effectiveness across both balanced (Australian) and imbalanced (Lending Club 2018Q4) datasets. First, the influence of parameter k on the experimental outcomes was assessed, by varying k as a fraction of the total number of samples n in the dataset. The values of k were set as k = 0.25n, 0.5n, 0.75n, n. This range provided a comprehensive view of how the choice of k affects the model performance, especially in k-NN based methods, where the number of nearest neighbors (k) influences the model behavior. Fig 3 presents the results of these experiments, providing insights into the values of k that yield the optimal model performance in different types of datasets. These findings can guide users in selecting appropriate k values in practical applications. Such insights are particularly relevant for financial institutions and practitioners in credit risk assessment, aiding in fine-tuning the model to achieve the best possible balance between accuracy and robustness, tailored to the specific dataset characteristics.

thumbnail
Fig 3. k-NN weighted adjacency matrix.

Australian: experimental results on the Australian dataset with different values of parameter k. 2018 Q4: experimental results on the 2018 Q4 dataset with different values of parameter k.

https://doi.org/10.1371/journal.pone.0316557.g003

Additionally, the experimental results provide valuable insights into the effectiveness of the k-NN weighted adjacency matrix compared with the fully connected weighted adjacency matrix in both balanced and imbalanced datasets. The k-NN weighted adjacency matrix consistently outperforms the fully connected weighted adjacency matrix across both types of datasets, indicating that it produces more reliable and consistent outcomes and making it a preferable choice in practical applications. The k-NN matrix not only offers better performance but also requires less computational power than the fully connected weighted adjacency matrix, which aligns with reported results highlighting the efficiency of the k-NN approach in handling large datasets [21]. Choosing an extremely small k value may adversely affect the model accuracy, as a small k may not capture enough information from the neighborhood, resulting in less reliable predictions. Imbalanced datasets exhibit greater sensitivity to the choice of k than balanced datasets; thus, k must be carefully selected to ensure optimal performance. For binary classification problems, particularly in imbalanced contexts, k should ideally be more than half the total number of samples n. This setting can help ensure that the algorithm captures sufficient context from the data to make accurate predictions. These findings underscore the importance of carefully selecting k in k-NN based methods, especially in credit risk assessment, where the choice of parameters can significantly impact the model effectiveness. By adhering to these insights, practitioners can enhance the accuracy and reliability of their predictive models in various financial applications.

The influence of different perspectives in the multi-view combined unsupervised method must be examined to clarify their contributions to the model robustness and accuracy. The proposed method incorporates three distinct perspectives: forward view, backward view, and combined view, to thoroughly mine data for information, thereby enhancing the robustness and accuracy of label prediction. To explore the effectiveness of this multi-view approach, two additional experiments were conducted. First, labels were predicted using both the forward and backward views without combining them. This setup, referred to as the double view, enabled the assessment of how each perspective independently influences label prediction. Second, the forward and backward views were combined using the parameters (μ1, μ2) = (0.5, 0.5). This approach, termed average combining, was aimed at balancing the contributions of both views. Other parameters in the model were maintained at constant values to isolate the effect of the combination strategy. Comparing the results of these two experiments with the standard multi-view approach could provide insights into the advantages or potential limitations of integrating multiple perspectives. Specifically, this comparison could help understand whether combining the views leads to a significant improvement in performance compared with their independent use.

Table 5 presents the results of the experiments conducted using the double view and average combining strategies. Although the average combining algorithm achieves slightly better experimental results on the balanced dataset compared with the proposed method, the overall robustness of each approach must be considered. Similarly, the double view algorithm yields marginally better results on the imbalanced dataset. However, the robustness of the proposed method remains a significant advantage. The higher robustness of the proposed algorithm compared with both the double view and average combining algorithms is a critical factor, especially in real-world applications where datasets can vary significantly. This robustness is attributable to the ability of the proposed method in balancing the perspectives. The combining method based on perturbation theory ensures minimal discrepancy between the prediction results obtained from the three perspectives (forward, backward, and combined), enabling the achievement of consistent and balanced performance across different datasets and scenarios. These findings underscore the importance of a balanced approach that integrates multiple perspectives without allowing any one perspective to dominate the prediction process. Such a strategy is particularly beneficial in complex domains such as credit risk assessment, where the accuracy and reliability of predictions are paramount.

The purpose of the penalty parameters C1 and C2 is to strengthen the constraints on labeled samples. Therefore, the robustness of the TSC-SVM model to these penalty parameters can be verified with C2 fixed. Since C1 must be sufficiently large, values of C1 = 5, 6, …, 15 were considered. Fig 4 shows that TSC-SVM is robust to the penalty parameters, as they are designed to enforce the support-vector constraints on the labeled samples. This means that in the unsupervised credit classification problem, financial institutions avoid the burden of optimizing the penalty parameters; the only requirement is C1 ≫ C2. As a result, financial institutions can also reduce the cost of gathering labeled data and improve their profitability.

thumbnail
Fig 4. The sensitive test with penalty parameters.

Australian: experimental results on the Australian dataset with different values of penalty parameter C1. 2018 Q4: experimental results on the 2018 Q4 dataset with different values of penalty parameter C1.

https://doi.org/10.1371/journal.pone.0316557.g004

The TSC-SVM demonstrates strong robustness and accuracy when addressing unsupervised imbalanced credit risk assessment challenges. A key advantage of the TSC-SVM algorithm is its reduced sensitivity to parameters, which is valuable in practical applications where the optimal parameter settings may not be readily apparent or where data characteristics can vary significantly. The reduced need for extensive parameter tuning simplifies the model implementation, rendering it more user-friendly and accessible, especially for financial institutions with limited resources for such adjustments. By employing a robust and accurate model such as the TSC-SVM, financial institutions can expand their business scope. Specifically, the ability to more accurately assess credit risk in a variety of contexts and data environments can allow these institutions to explore new markets and customer segments with greater confidence. The enhanced performance of the TSC-SVM also translates into improved risk assessment capabilities. By providing more reliable and nuanced insights into potential credit risks, financial institutions can make better-informed decisions, ultimately leading to reduced losses owing to credit defaults and increased financial stability. Overall, the TSC-SVM offers a comprehensive solution for unsupervised imbalanced credit risk assessment, combining robustness, accuracy, and ease of use. These attributes make it an attractive option for financial institutions looking to enhance their analytical capabilities and risk management strategies.

5 Conclusions

This study introduces TSC-SVM, a novel approach designed for unsupervised imbalanced learning in credit risk assessment. The effectiveness and robustness of this algorithm are demonstrated through extensive testing on various credit datasets. The algorithm exhibits consistent performance across various credit datasets, indicating its robustness and adaptability to different data environments and scenarios. The key findings can be summarized as follows:

  • Competitive performance in imbalanced credit datasets: The TSC-SVM model shows strong competitiveness with other well-established unsupervised classification techniques, including K-means, US-QSSVM, US-QSSVM-GOLD, PCut, Un-LDA, and spectral clustering. The proposed method can effectively identify minority samples in imbalanced datasets, which is particularly valuable in credit risk assessment, where accurately identifying potential defaulters is crucial for risk management and mitigating economic losses associated with loan defaults;
  • Independence from prior knowledge for parameter adjustment: Unlike traditional models, the TSC-SVM algorithm does not rely on prior knowledge for parameter adjustment, making it easier for practitioners to implement and maintain without extensive tuning. Because labeling credit risk is time-consuming and labor-intensive, this can effectively help financial institutions classify the credit risk of unlabeled borrowers as quickly as possible;
  • Global optimal solution achievement: The TSC-SVM model consistently finds the global optimal solution at every step, effectively avoiding the pitfalls of local optima. This aspect is vital for ensuring the overall accuracy and reliability of the model, and it prevents financial institutions from suffering economic losses due to the model falling into local optima;
  • Enhanced robustness and accuracy through multi-view analysis: The multi-view combined unsupervised method, based on perturbation theory, significantly improves the model robustness and accuracy. By maintaining a balanced perspective among the forward, backward, and combined views, this method ensures that no single perspective disproportionately influences the overall prediction;
  • Effective handling of unlabeled imbalanced data: The semi-supervised SVM component of the model employs strong constraints for labeled samples and weaker constraints for unlabeled samples. This approach effectively addresses the classification challenges posed by unlabeled imbalanced data, a common issue in credit risk datasets.

In summary, the TSC-SVM model presents a robust and effective solution for unsupervised imbalanced learning in credit risk assessment, offering financial institutions a powerful tool for enhancing their risk evaluation processes and reducing potential economic losses. In the future, we plan to extend the TSC-SVM algorithm to other fields that suffer from the class-imbalance problem, such as disease diagnosis and fraud detection. In these fields, misclassifying minority samples costs more than misclassifying other samples; for example, improving the prediction accuracy for rare diseases can effectively improve treatment efficiency for patients.

Supporting information

S1 Appendix. Proof.

The detailed mathematical derivations and the Karush–Kuhn–Tucker (KKT) conditions relevant to the semi-supervised SVM.

https://doi.org/10.1371/journal.pone.0316557.s001

(PDF)
