Evaluating borrowers’ default risk with a spatial probit model reflecting the distance in their relational network

Jong Wook Lee; So Young Sohn

doi:10.1371/journal.pone.0261737

Abstract

Potential relationships among loan applicants can provide valuable information for evaluating default risk. However, most existing credit scoring models either ignore these relationships or consider only simple connection information. This study assesses the relation among applicants in terms of a distance estimated from their characteristics. This information is then used in a proposed spatial probit model to reflect the differing degrees of relation among borrowers in predicting loan applicants’ defaults. We apply this method to peer-to-peer Lending Club Loan data. Empirical results show that considering the spatial autocorrelation among loan applicants can provide high predictive power for defaults.

Citation: Lee JW, Sohn SY (2021) Evaluating borrowers’ default risk with a spatial probit model reflecting the distance in their relational network. PLoS ONE 16(12): e0261737. https://doi.org/10.1371/journal.pone.0261737

Editor: Elisa Ughetto, Politecnico di Torino, ITALY

Received: September 11, 2021; Accepted: December 8, 2021; Published: December 31, 2021

Copyright: © 2021 Lee, Sohn. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting information files.

Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2020R1A2C2005026). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors declare no competing interests.

1. Introduction

Credit risk management is very important for service firms in the lending business. To predict a loan applicant’s probability of default, which is essential for credit risk management, machine learning models use two types of borrower information: standard “hard” information and nonstandard “soft” information [1]. The former directly reflects the loan applicant’s financial status or creditworthiness, while the latter, such as age or residence, has no direct relationship to them. Existing studies have shown that not only hard information but also soft information, which is less relevant to the applicant’s financial condition, helps predict default risk [1–5]. While both hard and soft information have been used in most credit scoring models, what is missing is the potential relation among loan applicants. Relationships among loan applicants who are at high risk of default can also provide valuable information for evaluating default risk [6–8].

In this study, we use a borrower relationship network based on the borrowers’ information provided for loan applications. This network is utilized as a spatial weight matrix for a spatial probit model that reflects different degrees of borrowers’ relation for the prediction of a loan default. Our proposed approach is applied to peer-to-peer (P2P) lending.

Online P2P lending allows individuals to lend money to other individuals through online platforms without the intervention of a financial institution. These online P2P lending platforms are gaining popularity due to their low operating costs compared with traditional lending programs [9]. However, online P2P lending faces a significant problem: information asymmetry between borrowers and lenders, that is, the reliability of a borrower’s credit is unknown to the lender [10]. Therefore, the use of relationship information among borrowers beyond that provided on the P2P platform is necessary. As it is difficult to discover realistic relationship information between borrowers on a P2P lending platform, this study defines data-driven latent relationships between borrowers in terms of the similarity of their hard and soft information. We expect that this data-driven latent relationship information can improve default risk prediction.

This paper is organized as follows. Section 2 reviews prior studies on default prediction in online P2P lending. Section 3 explains the methodologies employed, and Section 4 explores the Lending Club Loan (LCL) dataset used for this study. Finally, Section 5 presents the results, and Section 6 discusses the results, limitations, and suggestions for improvement.

2. Literature review

Models for default risk prediction in P2P lending services are divided into three categories: probability of default (PD), exposure at default (EAD), and loss given default (LGD). Among them, PD models have been explored steadily [11]. A PD model predicts a borrower’s default using classification models based on statistical or machine learning approaches. Statistical methods have the advantage of being able to show quantitatively the effect of each factor on borrowers’ default [12]. Emekter et al. [13] used a logistic regression model to predict the default probability of borrowers and found that Fair, Isaac and Company (FICO) scores are a very important factor. However, statistical methods have the disadvantage of requiring strong assumptions about the observed data [14]. Meanwhile, machine learning methods achieve strong default prediction performance without requiring such statistical assumptions. These models include neural networks [15], support vector machines [16, 17], and random forests [18]. However, these models have a major drawback: they do not directly show the effect of individual factors on borrowers’ default.

It is also important to choose the optimal features used to predict default risk. Generally, hard information can reflect borrowers’ repayment ability [19], while soft information can reflect borrowers’ repayment willingness [20]. Hard information plays an important role in explaining default risk because it directly represents the borrowers’ financial status. However, online P2P lending platforms have difficulty collecting sufficient hard information. To overcome this limitation, the importance of soft information that is not related to the borrowers’ financial status is increasingly emphasized. Lin et al. [21] discovered that information on gender, age, educational level, and marital status plays a significant role in predicting default. Recently, unstructured data, such as text and image information, as well as structured data, have been used as soft information. Dorfleitner et al. [22] used soft information derived from the text describing the loan purpose, such as text length, spelling errors, and the presence of positive emotion-evoking keywords. Jiang et al. [23] used a topic model to extract representative features from descriptive text concerning loans.

However, few studies have used information on the relationship among individual borrowers in online P2P lending services. Calabrese et al. [24] defined bank networks by estimating interbank relationships as aggregate claims to predict bank contagion. Agosto et al. [6] defined business networks by estimating inter-company relationships as aggregate trade volumes to predict business default from P2P platforms that specialize in business lending. Unlike for banks and companies, obtaining quantitative indicators of relationships among individuals is difficult. In this study, we propose a network definition among individual borrowers and use this relationship information as independent information.

3. Methodology

3.1 Spatial probit model

Generally, the latent response model is the method used to fit the binary response variable Y as a regression model [25]. The model used in this study is a spatial probit model, which has a spatial autoregressive structure and can be used with a binary response variable. Taking the latent underlying quantity as being represented by a continuous variable y_i*, we consider the observation mechanism as

\[ y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{1} \]

with i = 1, 2, ⋯, n where n is the number of observations. We implement the spatial structure with an autoregressive model specification, such that

\[ Y^* = \rho W Y^* + X\beta + \varepsilon \tag{2} \]

where Y* is a continuous latent vector; X represents an n × k matrix of explanatory variables with related coefficient vector β; W is a spatial lag weights matrix with ρ as the associated coefficient; and ε is the error term.

This spatial probit model implies heteroskedastic errors e as follows:

\[ Y^* = (I - \rho W)^{-1} X\beta + e \tag{3} \]

where e = (I − ρW)⁻¹ε with variance

\[ \mathrm{Var}(e) = [(I - \rho W)'(I - \rho W)]^{-1}. \]
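The step from the autoregressive specification in Eq (2) to the form in Eq (3) can be written out explicitly. The short derivation below is a sketch that assumes (I − ρW) is invertible and that ε has unit variance, Var(ε) = I, as is conventional for a probit specification; neither assumption is stated explicitly in the text.

```latex
\begin{align*}
Y^* &= \rho W Y^* + X\beta + \varepsilon \\
(I - \rho W)\,Y^* &= X\beta + \varepsilon \\
Y^* &= (I - \rho W)^{-1} X\beta + \underbrace{(I - \rho W)^{-1}\varepsilon}_{e} \\
\operatorname{Var}(e) &= (I - \rho W)^{-1}\,\operatorname{Var}(\varepsilon)\,\big[(I - \rho W)^{-1}\big]'
  = \big[(I - \rho W)'(I - \rho W)\big]^{-1}
\end{align*}
```

Because the diagonal of this matrix generally differs across observations, the errors e are heteroskedastic even when ε is homoskedastic.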

Calabrese and Elkink [26] reviewed various methods for estimating the parameters ρ and β in Eq (3). Among them, we performed parameter estimation using the generalized method of moments (GMM) proposed by Pinkse and Slade [27], who derived the GMM equations from the likelihood function. This method was extended to the logit model by Klier and McMillen [28]. It is more robust than maximum likelihood estimation because it does not depend on the assumption that the error term follows a normal distribution [27].

A GMM estimator is defined as follows:

\[ \hat{\theta} = \arg\min_{\theta} \; u'ZMZ'u \tag{4} \]

where θ = [ρ, β], u_i = y_i − p_i, p_i = Φ([(I − ρW)⁻¹Xβ]_i / σ_i), with Φ the standard normal distribution function; σ_i² is the i-th diagonal element of the covariance matrix [(I − ρW)′(I − ρW)]⁻¹; Z is a matrix of instruments; and M is a positive definite matrix that is generally initialized to an identity matrix. We define the instrument matrix Z = {X, WX, W²X, W³X}, as proposed by Kelejian and Prucha [29].

To estimate the parameter θ, we use a two-step estimation procedure:

1. Fix ρ = ρ₀ and estimate β₀ with GMM.
2. Find the optimal value of θ through GMM, using θ₀ = [ρ₀, β₀] found in step 1 as the initial value.
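The objective minimized in both steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: it assumes a probit link (standard normal distribution function), takes σ_i as the square root of the i-th diagonal element of [(I − ρW)′(I − ρW)]⁻¹, and defaults M to the identity matrix. The function names `normal_cdf` and `gmm_objective` are hypothetical.

```python
import math
import numpy as np

def normal_cdf(x):
    """Standard normal distribution function, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gmm_objective(rho, beta, y, X, W, M=None):
    """Return u'Z M Z'u for generalized residuals u = y - p (assumed probit link)."""
    n = len(y)
    A_inv = np.linalg.inv(np.eye(n) - rho * W)     # (I - rho W)^{-1}
    # A_inv @ A_inv.T equals [(I - rho W)'(I - rho W)]^{-1}; sigma is the sqrt of its diagonal.
    sigma = np.sqrt(np.diag(A_inv @ A_inv.T))
    p = normal_cdf((A_inv @ X @ beta) / sigma)     # predicted default probabilities
    u = y - p
    WX = W @ X
    Z = np.hstack([X, WX, W @ WX, W @ W @ WX])     # instruments {X, WX, W^2 X, W^3 X}
    if M is None:
        M = np.eye(Z.shape[1])                     # identity as the initial weighting matrix
    g = Z.T @ u                                    # sample moment vector Z'u
    return float(g @ M @ g)
```

In step 1 such a function would be minimized over β with ρ held at ρ₀, and in step 2 over both ρ and β starting from θ₀, using any numerical optimizer. Note that the dense inverse makes each evaluation expensive for large n.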

The estimated spatial lag is used to test the statistical significance of ρ by the Lagrange Multiplier (LM) test proposed by Anselin [30]. The LM statistic for spatial lag is defined as: (5) where with , e₀ = y − X(X′X)⁻¹X′y, and e_L = y − X(X′X)⁻¹X′Wy.
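As an illustration only, the sketch below computes an LM statistic for an omitted spatial lag in the form commonly given in the spatial econometrics literature: (e′Wy/σ̂²)² divided by T + (WXb)′M(WXb)/σ̂², with T = tr(W′W + W²), b the OLS coefficients, e the OLS residuals, and M = I − X(X′X)⁻¹X′. Treating this as the form of Eq (5) is an assumption, since the notation there (e₀, e_L) may correspond to a different but related expression; the function name `lm_lag` is hypothetical. The numerators agree in any case, because e₀′e_L = e₀′Wy when M is symmetric and idempotent.

```python
import numpy as np

def lm_lag(y, X, W):
    """LM statistic for an omitted spatial lag, computed from OLS residuals."""
    n = len(y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]       # OLS coefficients
    e = y - X @ b                                  # OLS residuals (e0 in the text)
    sigma2 = float(e @ e) / n                      # residual variance estimate
    M = np.eye(n) - X @ np.linalg.pinv(X)          # I - X(X'X)^{-1}X' for full-rank X
    WXb = W @ X @ b
    T = np.trace(W.T @ W + W @ W)
    nJ = T + float(WXb @ M @ WXb) / sigma2
    return (float(e @ W @ y) / sigma2) ** 2 / nJ
```

Under the null hypothesis of no spatial lag, this statistic is usually compared with a chi-square distribution with one degree of freedom.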

The spatial lag weights matrix between borrowers on the P2P platform, W, is defined in Section 3.2.

3.2 Borrowers’ relation network

In this study, we construct a network with each borrower as a node and the distance between them as an edge to represent the relationship between the borrowers. The distance between them is defined as the degree of similarity in terms of their hard and soft information. Similarity between numeric information is easily defined by Euclidean distance, but defining similarity between categorical information is a challenge. We use a method proposed by Ahmad and Dey [31] to calculate the distance between borrowers with mixed numeric and categorical information.

Let us assume B_i and B_j are two borrowers with m hard and soft information attributes: X₁, …, X_m. The two borrowers may be represented as B_i = {X_i1, X_i2, …, X_im} and B_j = {X_j1, X_j2, …, X_jm}, where the first m_r attributes are numeric, the next m_c attributes are categorical, and m_r + m_c = m. The distance between B_i and B_j, denoted by Dist(B_i, B_j), is computed as follows: (6) where s_t is the significance of the t-th numeric attribute, and δ(X_it, X_jt) is a distance function between the t-th categorical attributes in B_i and B_j. The distance between two distinct values, c₁ and c₂, of any categorical attribute X_t is given by: (7) where δ^tt′(c₁, c₂) = P_t(c′|c₁) + P_t(~c′|c₂) − 1; c′ denotes the subset of values of another attribute X_t′ that maximizes the quantity P_t(c′|c₁) + P_t(~c′|c₂); ~c′ denotes the complementary set of values occurring for attribute X_t′; and P_t(c′|c₁) denotes the conditional probability that an element having value c₁ for X_t has a value belonging to c′ for X_t′. To compute the significance of normalized numeric attributes, we discretize them into L equal intervals: u[1], u[2], ⋯, u[L]. The significance of the t-th numeric attribute, s_t, is computed as: (8)
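The co-occurrence distance between two categorical values can be illustrated with a small sketch. The maximizing subset collects every value v of the other attribute for which P(v|c₁) exceeds P(v|c₂), so the maximum equals the sum over v of the larger of the two conditional probabilities. Averaging δ^tt′ over all other attributes t′ is an assumption here, taken from the general approach of Ahmad and Dey [31]; the function names and the tuple-based data layout are hypothetical.

```python
from collections import Counter

def cond_dist(rows, t, t_other, value):
    """Empirical P(X_t' = v | X_t = value) for every observed v, as a dict."""
    counts = Counter(r[t_other] for r in rows if r[t] == value)
    total = sum(counts.values())   # assumes `value` occurs at least once in column t
    return {v: c / total for v, c in counts.items()}

def delta_pair(rows, t, t_other, c1, c2):
    """delta^{tt'}(c1, c2) = P(c'|c1) + P(~c'|c2) - 1 for the maximizing subset c'."""
    p1 = cond_dist(rows, t, t_other, c1)
    p2 = cond_dist(rows, t, t_other, c2)
    values = set(p1) | set(p2)
    # Picking, for each v, the larger conditional probability maximizes the sum.
    return sum(max(p1.get(v, 0.0), p2.get(v, 0.0)) for v in values) - 1.0

def delta(rows, t, c1, c2):
    """Average of delta^{tt'} over all other attributes t' (assumed form)."""
    others = [k for k in range(len(rows[0])) if k != t]
    return sum(delta_pair(rows, t, k, c1, c2) for k in others) / len(others)
```

Two values that co-occur with the other attributes in exactly the same proportions get distance 0, and two values whose co-occurring values never overlap get distance 1.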

The relationship between two borrowers (B_i and B_j) is mapped so that the closer the distance is, the stronger the relationship. We use double-power distance weights, and the degree of relationship between B_i and B_j is evaluated as follows: (9) where d denotes the maximum radius of influence (bandwidth). To use W_ij as a spatial weight matrix, row normalization is performed.
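A minimal sketch of the weighting step follows. It assumes the double-power kernel takes the commonly used form (1 − (dist/d)^k)^k inside the bandwidth and 0 outside, with k = 2, and that a borrower has zero weight on itself; the exponent, the zero diagonal, and the function names are assumptions rather than details confirmed by Eq (9).

```python
def double_power_weight(dist, bandwidth, k=2):
    """Assumed double-power kernel: (1 - (dist/bandwidth)^k)^k inside the bandwidth, else 0."""
    if dist >= bandwidth:
        return 0.0
    return (1.0 - (dist / bandwidth) ** k) ** k

def weight_matrix(distances, bandwidth, k=2):
    """Row-normalized weights from a symmetric distance matrix given as a list of lists."""
    n = len(distances)
    W = [[0.0 if i == j else double_power_weight(distances[i][j], bandwidth, k)
          for j in range(n)] for i in range(n)]
    for row in W:
        total = sum(row)
        if total > 0:                  # borrowers with no neighbor keep an all-zero row
            row[:] = [w / total for w in row]
    return W
```

After row normalization each non-empty row sums to 1, so the spatial lag for a borrower is a weighted average over its neighbors within the bandwidth.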

3.3 Evaluation metric

To measure the performance of the proposed spatial probit model, we used the following evaluation metrics: accuracy, precision, recall, F1 score, and area under the receiver operating characteristic (ROC) curve. These five indicators are the most widely used for evaluating binary classification tasks such as default prediction. Accuracy is the most intuitive performance indicator of a classification model and is defined as the ratio of correct to total predictions. Precision is the percentage of borrowers who actually defaulted out of those who were predicted to default. Recall is the percentage of borrowers predicted to default out of those who actually defaulted. The F1 score is the harmonic mean of precision and recall. Precision, recall, and F1 score are important indicators in a credit scoring task, where borrowers with defaulted loans are far fewer than borrowers with fully paid loans [32]. The ROC curve for a binary classification problem represents the true positive proportion as a function of the false positive proportion.
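The threshold-based metrics defined above can be computed directly from the confusion counts. The sketch below treats default as the positive class (label 1) and returns 0 for a ratio whose denominator is empty; both conventions and the function name are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with default (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # defaults predicted as default
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # fully paid predicted as default
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # defaults that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A model that predicts "fully paid" for every borrower would score high accuracy on imbalanced data but zero recall, which is why recall and F1 are reported alongside accuracy.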

4. Data

We used LCL data from Lending Club, the largest online credit marketplace offering P2P lending worldwide. These data are open to the public and provide 2.26 million loan records from June 2007 to December 2018. The LCL data contain 36-month and 60-month loans. Therefore, quite a few borrowers who received loans after 2013 belong to the “Current” category, and their default record is unknown. Because of this data problem, we used only loans issued in 2012. The 2012 loan records contain the statuses Fully Paid, Default, and Charged Off; in this study, Fully Paid was defined as a good result and the other two were defined as bad results.

In sum, our dataset consists of 51,314 issued loans, including 8,241 defaults. The LCL dataset describes 145 attributes of borrowers, but, following previous studies, we selected only the important attributes [18, 33, 34]. Brief descriptions of the seven numeric and five categorical attributes used in this study are presented in Table 1. Employment length and home ownership are soft information not directly representing borrowers’ financial status. We removed the missing values for the 12 variables and obtained 37,012 borrowers with fully paid loans and 7,080 borrowers with defaulted loans.

Table 1. Description of attributes used in this study.

https://doi.org/10.1371/journal.pone.0261737.t001

We performed preprocessing, taking into account the dispersion of each attribute. “Annual income,” “Loan amount,” and “Revolving balance” are log-transformed to reduce variance. Since 77% of all borrowers are classified as A, B, or C in the “Grade” attribute, classifications D to G are combined as “D or less.” Since 78% of all borrowers are also concentrated under the categories debt consolidation and credit card in the “Loan purpose” attribute, we combined the remaining categories into the category “other.” The “Employment length” attribute is newly categorized as short, representing less than five years; middle, five to nine years; and long, 10 years or more. Thus, the categorical variables increased to nine, and their distribution is shown in Fig 1.
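The category merging described above can be sketched as three small mapping functions. The literal label strings (for example "debt consolidation" and "D or less") and the function names are assumptions for illustration; the raw LCL files may encode these values differently.

```python
def bin_employment_length(years):
    """Short: under 5 years; middle: 5 to 9 years; long: 10 years or more."""
    if years < 5:
        return "short"
    if years < 10:
        return "middle"
    return "long"

def merge_grade(grade):
    """Keep grades A, B, and C; combine D through G into a single category."""
    return grade if grade in ("A", "B", "C") else "D or less"

def merge_purpose(purpose):
    """Keep the two dominant loan purposes; combine everything else into 'other'."""
    return purpose if purpose in ("debt consolidation", "credit card") else "other"
```

For example, a borrower with grade F, seven years of employment, and a loan for home improvement would map to "D or less", "middle", and "other".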

Fig 1. The distribution of categories for each categorical attribute.

https://doi.org/10.1371/journal.pone.0261737.g001

We performed Welch’s t-test on the difference between borrowers with fully paid loans and borrowers with defaulted loans for numeric attributes, as shown in Table 2. There was no statistically significant difference in the “Revolving balance” attribute at the 0.05 significance level. However, for attributes related to income, borrowers with fully paid loans are observed to be more stable than borrowers with defaulted loans.
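For reference, Welch’s test statistic and its approximate degrees of freedom (Welch–Satterthwaite) can be computed with the standard library alone. This is a sketch of the test in general, not a reproduction of Table 2; the function name is hypothetical, and a p-value would additionally require the t distribution.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va = variance(a) / len(a)      # squared standard error of the mean of a
    vb = variance(b) / len(b)      # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Unlike Student’s t-test, this version does not assume equal variances in the two groups, which suits groups of very different size such as fully paid and defaulted borrowers.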

Table 2. Result of the Welch’s t-test for numeric attributes.

https://doi.org/10.1371/journal.pone.0261737.t002

We performed a chi-square test to check whether default is independent of the categories of each categorical attribute. Table 3 shows, for each category, the number of borrowers with fully paid loans and those with defaulted loans, the ratio of borrowers with defaulted loans to borrowers with fully paid loans, and the chi-square statistic with the corresponding p-value. Depending on the “Grade” and the “Loan length,” the default-to-fully-paid ratio was quite different. The “Employment length” did not show a statistically significant difference at the 0.05 significance level.
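The chi-square test of independence compares the observed counts in a category-by-outcome table with the counts expected if default were independent of the category. The sketch below computes the Pearson statistic and its degrees of freedom for a table given as a list of rows; the function name is hypothetical, and any counts passed to it in an example are illustrative rather than taken from Table 3.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # count expected under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_tot) - 1)
    return stat, df
```

Each row would hold one category of an attribute and the two columns the fully paid and defaulted counts; the statistic is then compared with a chi-square distribution with the returned degrees of freedom.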

Table 3. Result of the chi-squared test for categorical attributes.

https://doi.org/10.1371/journal.pone.0261737.t003

5. Experiment

In our dataset, borrowers with defaulted loans account for 16% of the total; thus, there is a class imbalance problem. This leads to a problem whereby the classification model is trained to be biased toward predicting the majority class, which significantly reduces the performance of the prediction of the minority class [35]. To alleviate this problem, we utilized the under-sampling method [36]. We sampled 5,000 borrowers with fully paid loans and 5,000 borrowers with defaulted loans. We limited the range of some numeric attributes to control the dispersion of their min-max normalization. Values greater than 3 for “Inquiries in the last 6 months” and 26 for “Open accounts” were excluded from the sampling process. The spatial weight matrix, W, was built from the sampled dataset, as described in Section 3.2. Numeric variables were divided into three sections of equal length (L = 3). The bandwidth (d) was set to 0.06059, which was the third quantile value of distances between borrowers.
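Random under-sampling to equal class sizes can be sketched as follows. The record layout (a list of dicts with a label key), the fixed seed, and the function name are assumptions for illustration; the exact sampling procedure used in the experiment is not specified beyond the class sizes.

```python
import random

def undersample(records, label_key, n_per_class, seed=0):
    """Draw the same number of records from each class without replacement."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, n_per_class))   # raises if a class is too small
    rng.shuffle(sample)
    return sample
```

With `n_per_class=5000` this would yield a balanced set of 10,000 borrowers, at the cost of discarding most of the fully paid loans.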

To keep the computation time for parameter estimation manageable, we sampled 2,000 borrowers from the sampled dataset and divided them into a training set of 1,500 and a test set of 500. Using the training set, the parameters θ = [ρ, β] were estimated by GMM. To find the initial ρ₀, we observed the change in the “area under the curve” (AUC) for the test dataset while increasing ρ₀ from 0 to 1 at intervals of 0.1. As shown in Fig 2, with an initial ρ₀ of 0.5, the test AUC was the highest, at 0.6855. This shows that borrowers are not independent in the borrowers’ relation network, and that there is sufficient spatial autocorrelation between borrowers with defaulted loans.
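The selection of the initial ρ₀ by test AUC can be sketched in two small functions. The AUC is computed here as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as one half, which is equivalent to the area under the ROC curve. The function names and the dict that maps each candidate ρ₀ to its test AUC are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def best_initial_rho(auc_by_rho):
    """Return the candidate initial rho whose fitted model gave the highest test AUC."""
    return max(auc_by_rho, key=auc_by_rho.get)
```

The pairwise loop costs time proportional to the product of the two class sizes, which is acceptable for a test set of 500 borrowers.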

Fig 2. Test AUC variation with initial ρ₀.

https://doi.org/10.1371/journal.pone.0261737.g002

Table 4 compares the baseline model, a logistic regression model without a spatial component, with the model presented in this study. In the baseline model, ten attributes were statistically significant at the significance level of 0.1. The default probability of the borrower has a strong negative correlation with the “log(Annual income)” and “log(Revolving balance)” attributes. However, it has a positive correlation with “Debt to income,” “Revolving utilization rate,” “Grade,” “Loan length,” and “Loan purpose.” In the spatial probit model proposed in this study, seven attributes were statistically significant at the significance level of 0.1. The “log(Annual income)” and “log(Revolving balance)” attributes were underestimated relative to the baseline model and were not statistically significant. Instead, “log(Loan amount)” and “Revolving utilization rate” have negative coefficients. In addition, the spatial autocorrelation component between borrowers with defaulted loans was 0.505, which was statistically significant at the 0.05 level. Compared to the baseline model, accuracy and AUC increased. In particular, the proposed model has remarkably higher recall and F1 score, which is consistent with significant spatial autocorrelation between borrowers with defaulted loans. The additional consideration of spatial autocorrelation in the borrower relation network significantly improved on the performance of logistic regression.

Table 4. Result of the estimation of the baseline and SAR models.

https://doi.org/10.1371/journal.pone.0261737.t004

We sampled the training and test datasets 500 times and observed the differences in test performance between the baseline and spatial probit models across the entire dataset. To observe the strength of autocorrelation between borrowers with defaulted loans, the initial ρ₀ was set to 0.2, 0.5, and 0.8. The results are shown in Table 5. The larger the initial ρ₀, the higher the recall, which means higher predictability of borrowers with defaulted loans. However, too large an initial value creates the risk of reduced accuracy and AUC. In our experiment, when the initial ρ₀ is 0.5, the AUC is slightly higher and the F1 score is significantly higher than in the baseline model. Therefore, a consideration of the appropriate level of spatial autocorrelation is expected to contribute significantly to the prediction of the default risk of a borrower.

Table 5. Result of the estimation of the SAR model with 500 repetitions.

https://doi.org/10.1371/journal.pone.0261737.t005

6. Conclusion

This study proposed a spatial probit model to improve default prediction by reflecting the relationship between borrowers, which is defined by the similarity of their characteristics.

We applied this method to 2012 LCL data. We found evidence of a high level of spatial autocorrelation between borrowers with defaulted loans. Reflecting the spatial autocorrelation among loan applicants did not result in an overall improvement in the accuracy of the default prediction but did yield a significant improvement in the F1 score. An increase in the F1 score is a very significant contribution, since finding borrowers with high default risk is a more important issue than finding normal borrowers. This study showed that the additional information of spatial autocorrelation between borrowers with high default risk can alleviate the class imbalance problem in the loan dataset and provide a high predictive power for high default risk borrowers.

However, this study has some limitations. Since the size of the spatial weighting matrix grows in proportion to the square of the number of observations, there are time and memory difficulties in using all the data. In addition, the calculation of the inverse of (I − ρW) in the parameter estimation process using GMM requires a large amount of computation. Because of these constraints on the spatial weighting matrix, we sampled a small number of observations instead of using the entire dataset. If greater computing power becomes available and the constraints on the spatial weighting matrix are relaxed, then more robust default predictive modeling can be expected.
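The quadratic growth can be made concrete with a short calculation. Assuming the weight matrix is stored densely with 8-byte floating-point entries (an assumption; the storage format actually used is not stated), a matrix for the 2,000 sampled borrowers needs 32 MB, whereas one for all 44,092 borrowers with complete records (37,012 fully paid plus 7,080 defaulted) would need about 15.6 GB before any inverse is computed.

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Memory needed for a dense n x n matrix with fixed-size entries."""
    return n * n * bytes_per_entry

sizes_gb = {n: dense_matrix_bytes(n) / 1e9 for n in (2_000, 44_092)}
```

A sparse representation that stores only the weights inside the bandwidth would reduce the storage, although the inverse of (I − ρW) is generally dense.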

Supporting information

S1 File.

https://doi.org/10.1371/journal.pone.0261737.s001

(ZIP)

References

1. Angilella S., & Mazzù S. (2015). The financing of innovative SMEs: A multicriteria credit rating model. European Journal of Operational Research, 244(2), 540–554.
2. Kim Y., & Sohn S. Y. (2007). Technology scoring model considering rejected applicants and effect of reject inference. Journal of the Operational Research Society, 58(10), 1341–1347.
3. Jeon H., & Sohn S. Y. (2008). The risk management for technology credit guarantee fund. Journal of the Operational Research Society, 59(12), 1624–1632.
4. Sohn S. Y., Doo M. K., & Ju Y. H. (2012). Pattern recognition for evaluator errors in a credit scoring model for technology-based SMEs. Journal of the Operational Research Society, 63(8), 1051–1064.
5. Ju Y. H., & Sohn S. Y. (2015). Stress test for a technology credit guarantee fund based on survival analysis. Journal of the Operational Research Society, 66(3), 463–475.
6. Agosto A., Giudici P., & Leach T. (2019). Spatial regression models to improve P2P credit risk management. Frontiers in artificial intelligence, 2, 6. pmid:33733095
7. Wei Y., Yildirim P., Van den Bulte C., & Dellarocas C. (2016). Credit scoring with social network data. Marketing Science, 35(2), 234–258.
8. Óskarsdóttir M., Bravo C., Sarraute C., Vanthienen J., & Baesens B. (2019). The value of big data for credit scoring: Enhancing financial inclusion using mobile phone data and social network analytics. Applied Soft Computing, 74, 26–39.
9. Zeng X., Liu L., Leung S., Du J., Wang X., & Li T. (2017). A decision support model for investment on P2P lending platform. PloS one, 12(9), e0184242. pmid:28877234
10. Serrano-Cinca C., Gutiérrez-Nieto B., & López-Palacios L. (2015). Determinants of default in P2P lending. PloS one, 10(10), e0139427. pmid:26425854
11. Lessmann S., Baesens B., Seow H. V., & Thomas L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1), 124–136.
12. Crook J. N., Edelman D. B., & Thomas L. C. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research, 183(3), 1447–1465.
13. Emekter R., Tu Y., Jirasakuldech B., & Lu M. (2015). Evaluating credit risk and loan performance in online Peer-to-Peer (P2P) lending. Applied Economics, 47(1), 54–70.
14. Kruppa J., Ziegler A., & König I. R. (2012). Risk estimation and risk prediction using machine-learning methods. Human Genetics, 131(10), 1639–1654. pmid:22752090
15. Ma Z., Hou W., & Zhang D. (2021). A credit risk assessment model of borrowers in P2P lending based on BP neural network. PloS one, 16(8), e0255216. pmid:34343180
16. Harris T. (2013). Quantitative credit risk assessment using support vector machines: Broad versus Narrow default definitions. Expert Systems with Applications, 40(11), 4404–4413.
17. Yao X., Crook J., & Andreeva G. (2015). Support vector regression for loss given default modelling. European Journal of Operational Research, 240(2), 528–538.
18. Malekipirbazari M., & Aksakalli V. (2015). Risk assessment in social lending via random forests. Expert Systems with Applications, 42(10), 4621–4631.
19. Paul S. (2014). Creditworthiness of a borrower and the selection process in micro-finance: A case study from the urban slums of India. Margin: The Journal of Applied Economic Research, 8(1), 59–75.
20. Abdou H. A., & Pointon J. (2011). Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intelligent Systems in Accounting, Finance and Management, 18(2–3), 59–88.
21. Lin X., Li X., & Zheng Z. (2017). Evaluating borrower’s default risk in peer-to-peer lending: evidence from a lending platform in China. Applied Economics, 49(35), 3538–3545.
22. Dorfleitner G., Priberny C., Schuster S., Stoiber J., Weber M., de Castro I., et al. (2016). Description-text related soft information in peer-to-peer lending–Evidence from two leading European platforms. Journal of Banking & Finance, 64, 169–187.
23. Jiang C., Wang Z., Wang R., & Ding Y. (2018). Loan default prediction by combining soft information extracted from descriptive text in online peer-to-peer lending. Annals of Operations Research, 266(1–2), 511–529.
24. Calabrese R., Elkink J. A., & Giudici P. S. (2017). Measuring bank contagion in Europe using binary spatial regression models. Journal of the Operational Research Society, 68(12), 1503–1511.
25. Verbeek M. (2008). A guide to modern econometrics. John Wiley & Sons.
26. Calabrese R., & Elkink J. A. (2014). Estimators of binary spatial autoregressive models: A Monte Carlo study. Journal of Regional Science, 54(4), 664–687.
27. Pinkse J., & Slade M. E. (1998). Contracting in space: An application of spatial statistics to discrete-choice models. Journal of Econometrics, 85(1), 125–154.
28. Klier T., & McMillen D. P. (2008). Clustering of auto supplier plants in the United States: generalized method of moments spatial logit for large samples. Journal of Business & Economic Statistics, 26(4), 460–471.
29. Kelejian H. H., & Prucha I. R. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. The Journal of Real Estate Finance and Economics, 17(1), 99–121.
30. Anselin L. (1988). Lagrange multiplier test diagnostics for spatial dependence and spatial heterogeneity. Geographical Analysis, 20(1), 1–17.
31. Ahmad A., & Dey L. (2007). A k-mean clustering algorithm for mixed numeric and categorical data. Data & Knowledge Engineering, 63(2), 503–527.
32. Li W., Ding S., Chen Y., & Yang S. (2018). Heterogeneous ensemble for default prediction of peer-to-peer lending in China. IEEE Access, 6, 54396–54406.
33. Li Z., Tian Y., Li K., Zhou F., & Yang W. (2017). Reject inference in credit scoring using semi-supervised support vector machines. Expert Systems with Applications, 74, 105–114.
34. Szwabe A., & Misiorek P. (2018, September). Decision Trees as Interpretable Bank Credit Scoring Models. In International Conference: Beyond Databases, Architectures and Structures (pp. 207–219). Springer, Cham.
35. Longadge, R., & Dongre, S. (2013). Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707.
36. Kotsiantis S. B., & Pintelas P. E. (2003). Mixture of expert agents for handling imbalanced data sets. Annals of Mathematics, Computing & Teleinformatics, 1(1), 46–55.
