A landslide susceptibility map based on spatial scale segmentation: A case study at Zigui-Badong in the Three Gorges Reservoir Area, China

China experiences frequent landslides, and therefore there is a need for landslide susceptibility maps (LSMs) to effectively analyze and predict regional landslides. However, the traditional methods of producing an LSM are unable to account for different spatial scales, resulting in spatial imbalances. In this study, Zigui-Badong in the Three Gorges Reservoir Area was used as a case study, and data were obtained from remote sensing images, a digital elevation model, geological and topographic maps, and landslide surveys. A geographic weighted regression (GWR) was applied to segment the study area into different spatial scales, with three basic principles followed when the GWR model was applied for this purpose. In total, 58 environmental factors were extracted, and 18 factors were selected as LSM factors. Three of the most important factors (channel network basic level, elevation, and distance to river) were used as segmentation factors to segment the study area into 18 prediction regions. The particle swarm optimization (PSO) algorithm was used to optimize the parameters of a support vector machine (SVM) model for each prediction region. All of the prediction regions were merged to construct a GWR-PSO-SVM coupled model and finally, an LSM of the study area was produced. To verify the effectiveness of the proposed method, the outcomes of the GWR-PSO-SVM coupled model and the PSO-SVM coupled model were compared using three evaluation methods: specific category accuracy analysis, overall prediction accuracy analysis, and area under the curve analysis. The results for the GWR-PSO-SVM coupled model for these three evaluation methods were 85.75%, 87.86%, and 0.965, respectively, while the results for the traditional PSO-SVM coupled model were 68.35%, 84.44%, and 0.944, respectively. The method proposed in this study based on spatial scale segmentation therefore achieved good results.


Introduction
Located on the eastern edge of the Asian continent, China, with active geological tectonic movements and a complex geological environment, is a country that experiences frequent landslides. A landslide susceptibility map (LSM), as a non-deterministic method of prediction, is currently the main method used for the prediction of regional landslides. Using the engineering geological analogy method, an LSM can be obtained through the use of a mathematical model to determine and assign the degree of importance of the LSM factors that cause landslides. Mondal and Mandal used a logistic regression (LR) model to evaluate landslide susceptibility in the Balason River Basin in the Indian Darjeeling region of the Himalayas. The result showed that the LR model can be used for landslide hazard research and decision making [1]. Wang et al. compared several methods for constructing an LSM, such as the frequency ratio (FR), LR, decision tree (DT), weights of evidence (WE), and artificial neural network (ANN) in a study in Mizunami, Japan, and found that LR had the best area under the curve (AUC) value [2]. Aditian, Kubota and Shinohara used three methods, FR, LR, and ANN, in a study of landslides triggered by heavy rains in the Ambon region of Indonesia: the study showed that the ANN had the best results among these three methods, and was the best method for interpreting the relationship between landslides and LSM factors [3]. Saro et al. used two methods, LR and ANN, for the construction of an LSM in Inje City, South Korea, with the results indicating that the accuracy of the ANN was higher than that of the LR model [4]. Hong et al. compared the effects of four support vector machine (SVM) models based on different kernel functions in the LSM by taking Luxi City, Jiangxi Province, as a study area. The results showed that the SVM models using these four different kernel functions achieved good results, with the model using a radial basis function (RBF) as the kernel function having the best effect, regardless of the success rate or prediction rate [5]. Pham et al. conducted an LSM study in Pauri Garhwal, India, and compared the SVM model with four Bayesian algorithms: the naive Bayes tree, Bayes network, naive Bayes, and decision table naive Bayes models. The analysis results showed that the SVM model had the best predictive performance [6]. Despite having achieved acceptable results in their application, such methods tend to ignore the spatial distribution of landslide hazards and extend a single model to the entire study area without considering the spatial applicability of the model. This affects the selection and assignment of important evaluation factors, and thus reduces the accuracy of the LSM.
To overcome the above problems, LSM methods that consider the spatial scale of landslides have emerged. About 20 years ago, Fell et al. published LSM guidelines. The authors believed that landslides of different scales should be evaluated at the corresponding spatial scale, and that the selection of the LSM factors should have a scale that is compatible with the spatial scale [7]. In the same year, Cascini affirmed the guidelines proposed by Fell et al. and focused on the applicability of the susceptibility and hazard zoning of landslides at different scales. In that study, according to the scales and applications of landslide zoning, landslides were divided into two categories: small & medium scales and large & detailed scales, and the results indicated that the guidelines were a "powerful tool for landslide and hazard zoning at different scales" [8]. Paudel, Oguchi and Hayakawa extracted the best scale of each LSM factor using the random forest model, and then constructed an LSM. Their experimental results for Niigata and Ehime prefectures in Japan showed that a multi-scale LSM model was superior to a traditional model [9]. Schlögel et al. extracted LSM factors using a digital elevation model (DEM) with different precisions (5, 10, and 25 m), and the experiments found that the LSM factors for the DEM with 10-m precision were the best data combination for acquiring the LSM [10]. These methods explore the relationship between the spatial scale of the LSM and the selected data accuracy, sampling accuracy, and applicable range, and promote research on the spatial scale of LSMs. However, these methods reduce the concept of spatial scale to the scale or resolution of an LSM, and do not analyze the differences between the LSM at different spatial scales or consider the essential importance of such differences. They also ignore the decisive role of spatial scale in the production of an LSM.
Some researchers have used the geographic weighted regression (GWR) model to overcome these problems. Zhang et al. employed the GWR model, and compared it with the traditional LR models when producing an LSM of the Three Gorges Reservoir Area. Among the six evaluation indicators considered in their study, the GWR model achieved the best outcome [11]. In the following year, Hong et al. plotted a zoning plan for an LSM in Xingguo County, Shanxi Province using the GWR model and compared it with the traditional LR and SVM models. The results indicated that the GWR model had the highest success rate and prediction accuracy [12]. In the same year, Matsche studied the western part of Oregon and determined that the precision of the GWR model was 6.2% higher than that of the LR model [13]. The use of only the GWR model as an ordinary LSM prediction model improved the LSM to some extent, and enabled spatial scale problems to be considered in an LSM study, but it failed to reveal the essence of the spatial imbalance of the LSM.
In this study, we quantitatively expressed the spatial scale concept of an LSM when studying the spatial scale problem, introduced the concept of spatial scale into the study of LSMs, and built a GWR-particle swarm optimization-SVM (GWR-PSO-SVM) coupled model, to determine the root cause of the impact of spatial scale on an LSM. The aim was to explain the spatial imbalance problem of an LSM, and improve its scientific applicability, accuracy, and reliability.
The remainder of this paper is organized as follows. Section 2 describes the study area and data used in this work. Section 3 reviews the algorithms and model used in this work. Section 4 presents the process used to establish the GWR-PSO-SVM coupled model. Section 5 reports the experimental results, including a comparison between the traditional PSO-SVM coupled models and our new model. Section 6 is a discussion of our model and the final section presents our concluding remarks.

Study area
In this work, the Zigui-Badong section of the Three Gorges Reservoir Area was used as the study area (Fig 1). In terms of topography and geomorphology, the study area is located in the eastern part of the two natural geography units of the Three Gorges Reservoir Area. The area is a basin, and the topography along the river is low in the middle and high on both banks. In terms of geology, the strata in the study area are fully developed, and only the Lower Devonian, the Upper Silurian and Carboniferous, most of the Cretaceous and a small amount of Tertiary strata are missing (Fig 2) [14]. Geological disasters occur frequently in the study area, with landslides being the most prominent type of geological disaster. There have been 202 proven landslides in the study area, covering a total area of 23.4 km², accounting for 6.03% of the entire study area [15].

Data Source
The following data were used in this study:
➢ 1:10,000-scale landslide hazard map [15].
The spatial resolution of the remote sensing (RS) data and the GDEM data was 30 m, and the 1:10,000-scale landslide hazard map and the 1:50,000-scale topographic and geological maps could match these data in terms of spatial resolution. The seismic activity and atmospheric rainfall were point data, which had a temporal resolution but no spatial resolution.

Methods
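The GWR model
Fotheringham et al. first proposed GWR as a method to study the quantitative relationship between two or more variables with spatial distribution characteristics using the regression principle [16]. Local features are used as weights to change the multicollinearity in the global regression model [17,18]. The related functions are defined as follows:

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{Q} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, L \tag{1}$$

where $(u_i, v_i)$ are the spatial coordinates of the i-th sample; $L$ and $Q$ are the number of samples and regression coefficients, respectively; $y_i$ is the dependent variable of the function at point i; $x_{ik}$ is the value of the k-th explanatory variable of point i; $\beta_k(u_i, v_i)$ is the local regression parameter of the k-th explanatory variable of point i; $\beta_0(u_i, v_i)$ is the intercept parameter of point i; and $\varepsilon_i$ is the random error at point i. The least squares estimate for $\beta(u_i, v_i)$ is as follows:

$$\hat{\beta}(u_i, v_i) = (X^{T} W_i X)^{-1} X^{T} W_i\, y \tag{2}$$

The variance, in its standard form for a locally weighted least squares estimate, is:

$$\mathrm{Var}\big[\hat{\beta}(u_i, v_i)\big] = \sigma^2\, C_i C_i^{T}, \quad C_i = (X^{T} W_i X)^{-1} X^{T} W_i \tag{3}$$

where $X$ is the matrix of explanatory variables, $y$ is the vector of observations, $\sigma^2$ is the error variance, and $W_i$ is an $L \times L$ diagonal matrix, which is called the spatial weight matrix and is the core of the GWR model. The value on the diagonal is the geographic weight:

$$W_i = \mathrm{diag}(w_{i1}, w_{i2}, \ldots, w_{iL}) \tag{4}$$

The $W_i$ is determined by the choice of kernel function, and the selection of the spatial weight function has a large influence on the parameter estimation of the GWR model.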
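The local estimate described above can be illustrated with a minimal sketch for a single explanatory variable. The Gaussian distance-decay kernel, the `bandwidth` parameter, and the synthetic data are assumptions made for illustration only; the source does not state which kernel or bandwidth was used.

```python
import math

def gaussian_weights(coords, i, bandwidth):
    """Geographic weights of every sample relative to sample i.
    Assumption: Gaussian kernel w_ij = exp(-(d_ij / bandwidth)^2)."""
    ui, vi = coords[i]
    return [math.exp(-(math.hypot(u - ui, v - vi) / bandwidth) ** 2)
            for u, v in coords]

def local_fit(coords, x, y, i, bandwidth):
    """Weighted least squares at sample i for y = b0 + b1 * x.
    Returns the local intercept and slope (b0, b1)."""
    w = gaussian_weights(coords, i, bandwidth)
    sw = sum(w)
    mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
    my = sum(wj * yj for wj, yj in zip(w, y)) / sw
    sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
    sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Synthetic transect: the x-y relationship has slope 1 in the west
# (u < 5) and slope 3 in the east (u >= 5).
coords = [(float(u), 0.0) for u in range(10)]
x = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
y = [xi if u < 5 else 3 * xi for u, xi in enumerate(x)]

_, slope_west = local_fit(coords, x, y, i=0, bandwidth=2.0)
_, slope_east = local_fit(coords, x, y, i=9, bandwidth=2.0)
print(round(slope_west, 2), round(slope_east, 2))
```

The two local slopes differ because each fit gives most of its weight to nearby samples, which is the property that the segmentation step later exploits.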

The SVM model
The SVM model was first proposed by Vapnik [19]. The model, established on the basis of the Vapnik-Chervonenkis dimension theory and structural risk minimization principle, has many unique advantages in solving small sample, nonlinear, and high-dimensional pattern recognition problems [20,21]. Its function is defined as follows:

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, R \tag{5}$$

where $x_i$ is the i-th sample point; $y_i$ is the classification marker, $i = 1, 2, \ldots, R$; $R$ is the number of samples; $w$ is a vector perpendicular to the hyperplane; $b$ is a constant that is applied to prevent the hyperplane from passing through the origin of the coordinate axis; and $\|w\|$ is the 2-norm of $w$. When formula (5) introduces a non-negative slack variable $\xi_i$, a penalty factor $C$ must be introduced to weight the slack, which represents the distance from a misclassified point to its correct position. Therefore, formula (5) can be expressed as:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{R} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{6}$$

The RBF can be selected as the kernel function of the SVM, and is used to map the vector of the low-dimensional space into the high-dimensional characteristic space for classification. The function is expressed as:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right) \tag{7}$$

where $\gamma$ is the kernel parameter of the radial basis function.
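As a minimal sketch of the two quantities controlled by C and γ, the snippet below evaluates the RBF kernel and the soft-margin objective for a linear decision function. The function names and the toy samples are hypothetical; the values C = 4 and γ = 1 are borrowed from the optimum reported later for the unsegmented PSO-SVM model, purely as example inputs.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum(xi_i), where each slack
    xi_i = max(0, 1 - y_i * (w . x_i + b)) is the margin violation."""
    norm_sq = sum(wk * wk for wk in w)
    slack = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
        slack += max(0.0, 1.0 - margin)
    return 0.5 * norm_sq + C * slack

# Toy samples: two are outside the margin, one sits inside it.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
print(soft_margin_objective([1.0, 0.0], 0.0, X, y, C=4))  # 0.5 + 4 * 0.5
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1))
```

A larger C makes margin violations more expensive, while a larger γ makes the kernel decay faster with distance, which is why the two parameters are tuned jointly.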

The PSO algorithm
The performance of the SVM model relies heavily on two parameters, the penalty factor $C$ and the kernel parameter $\gamma$. A common method for selecting these two parameters is to use the PSO algorithm to find the optimal solution of the model. Eberhart and Kennedy first proposed the PSO as an intelligent optimization algorithm that mimics bird foraging [22-24]. Its function form is:

$$V_i^{n+1} = t\, V_i^{n} + c_1 r_1 \left(p_i^{n} - x_i^{n}\right) + c_2 r_2 \left(p_g^{n} - x_i^{n}\right), \quad x_i^{n+1} = x_i^{n} + V_i^{n+1} \tag{8}$$

where $i = 1, 2, \ldots, K$; $K$ is the number of particles; $n$ is the current number of iterations; $t$ is the inertia weight; $p_i^{n}$ is the individual optimal position of the i-th particle; $p_g^{n}$ is the optimal position of all particles in the n-th iteration; $V_i^{n}$ and $x_i^{n}$ are the velocity and position of the i-th particle in the n-th iteration; $V_i^{n+1}$ and $x_i^{n+1}$ are the velocity and position to which the i-th particle is updated in the (n+1)-th iteration, respectively; $c_1$ and $c_2$ are learning factors; and $r_1$ and $r_2$ are two random numbers between 0 and 1.
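The update rule can be sketched as follows. This is an illustrative implementation under stated assumptions: the inertia weight, learning factors, swarm size, search bounds, and the quadratic `toy_error` function are all hypothetical. In practice the objective would be a validation error of the SVM for a candidate pair (C, γ), which is not reproduced here.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 t=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective(position) inside box bounds with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # individual best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm best position
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward own best + pull toward swarm best
                vel[i][d] = (t * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_error(p):
    """Hypothetical stand-in for SVM validation error, minimal at C=4, gamma=1."""
    C, gamma = p
    return (C - 4.0) ** 2 + (gamma - 1.0) ** 2

best, best_val = pso_minimize(toy_error, bounds=[(0.1, 10.0), (0.01, 10.0)])
print(best, best_val)
```

Fixing the random seed keeps the search reproducible, which matters when the optimum for each prediction region is to be reported in a table.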

Evaluation models
Specific category accuracy analysis. The specific category accuracy analysis method is an improved quantitative analysis method [25]. In this study, the specific category accuracy method considers the number of slope units in the prediction regions. It can be expressed as:

$$P_i = \frac{A_i}{B_i} \times 100\% \tag{9}$$

where $i = 1, 2, \ldots, n$; $n$ is the number of landslide susceptibility zoning classes; $A_i$ is the number of slope units occupied by landslides in the i-th landslide susceptibility zoning class; $B_i$ is the number of slope units in the i-th landslide susceptibility zoning class; and $P_i$ is the specific category accuracy in the i-th landslide susceptibility zoning class.
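A short sketch of this calculation is given below, assuming that the specific category accuracy is the share of slope units in a zoning class that are occupied by landslides (A_i divided by B_i). The function name and the eight example slope units are hypothetical.

```python
from collections import Counter

def specific_category_accuracy(categories, is_landslide):
    """Percentage of slope units in each zoning class that contain landslides.
    categories: zoning class of each slope unit
    is_landslide: 1 if the slope unit is occupied by a landslide, else 0"""
    units_per_class = Counter(categories)                      # B_i
    landslide_units = Counter(c for c, flag in zip(categories, is_landslide)
                              if flag)                         # A_i
    return {c: 100.0 * landslide_units[c] / units_per_class[c]
            for c in units_per_class}

categories = ["very high"] * 4 + ["low"] * 4
is_landslide = [1, 1, 1, 0, 0, 0, 0, 1]
print(specific_category_accuracy(categories, is_landslide))
```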
Overall prediction accuracy analysis. The overall prediction accuracy analysis is a commonly used evaluation method for the construction of an LSM. In this study, the original formula was rewritten because there were no landslides in some prediction regions. It was expressed as:

$$P = \frac{\sum_{i=1}^{n_{pr}} (a_i + b_i)}{\sum_{i=1}^{n_{pr}} S_i} \times 100\% \tag{10}$$

where $i = 1, 2, \ldots, n_{pr}$; $n_{pr}$ is the number of prediction regions; $a_i$ is the number of slope units correctly predicted as landslides in the i-th prediction region; $b_i$ is the number of slope units correctly predicted as non-landslide areas in the i-th prediction region; and $S_i$ is the number of total slope units in the i-th prediction region.
Receiver operating characteristic (ROC) curve analysis. Each point on the ROC curve reflects the sensitivity to the same signal stimulus, with the X-axis representing the false positive rate (1 − specificity) and the Y-axis representing the true positive rate (sensitivity) [26,27]. There are four possible cases for a binary classification problem, as shown in Table 1.
The AUC refers to the area under the ROC curve. It ranges from 0 to 1, with values closer to 1 indicating a better classifier.
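Both measures can be sketched in a few lines. The pooled form of the overall accuracy (summing correct predictions over all prediction regions before dividing) is an assumption consistent with the symbol definitions above, and the AUC is computed here through its rank interpretation rather than by integrating a plotted curve. All input values are hypothetical.

```python
def overall_accuracy(regions):
    """regions: list of (a_i, b_i, S_i) tuples, one per prediction region.
    Pools all regions so that a region without landslides (a_i = 0)
    still contributes its correctly predicted non-landslide units."""
    correct = sum(a + b for a, b, _ in regions)
    total = sum(s for _, _, s in regions)
    return 100.0 * correct / total

def auc(scores, labels):
    """Probability that a landslide unit receives a higher score than a
    non-landslide unit, counting ties as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(overall_accuracy([(0, 90, 100), (30, 50, 100)]))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise AUC is quadratic in the number of slope units, which is acceptable for a few thousand units but would need a sort-based version for much larger data sets.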

Coupled model for the LSM based on spatial scale segmentation
By taking the spatial autocorrelation of LSM factors as the breakthrough point, this study regarded the GWR coefficients of the LSM factors as the mathematical basis for the segmentation of the study area. Three basic principles were followed to ensure the rationality of segmentation. First, 58 environmental factors were extracted from the data sources, 18 factors were selected as LSM factors after factor screening, and three of the most important factors were used as segmentation factors to segment the study area into 18 prediction regions. Then, the SVM parameters were optimized by the PSO algorithm, and an LSM for each prediction region was obtained. Finally, all the prediction regions were integrated to establish the LSM model with spatial scale segmentation. A flowchart of the coupled model for the LSM based on a spatial scale analysis was established, as shown in Fig 3.

Selection of LSM calculation units
According to Guzzetti et al., all LSM calculation units can be summarized as either grid cells, geographic units, unique conditional units, slope units, or sub-basin units [28]. In this study, the slope unit was selected as the LSM calculation unit. After calculation and modification, a total of 2,790 slope units were obtained in the study area, of which the smallest area was 11,823.9 m² and the largest was 819,444 m².

Screening of LSM factors
In this work, through an analysis of historical landslide data and a summary of previous research in the study area, the LSM factors were divided into two categories: control factors and influencing factors. The control factors included geomorphological, geological, and hydrological factors. The influencing factors included surface cover index, geophysical, and meteorological factors. In this study, based on the geological and topographic maps, RS image data, field survey reports, and other data, a total of 58 LSM factors in two categories and eight sub-categories were extracted by RS and geographic information system. This is summarized in Table 2. Some of these 58 LSM factors were obtained by the DEM, and had a large correlation with each other. Therefore, not all of these 58 factors were involved in the modeling and calculation of LSM, but they needed to be further analyzed and screened. There were two main steps in the analysis.
Pearson product-moment correlation coefficient (PPMCC) analysis and principal component analysis (PCA). In this study, a PPMCC analysis was used to analyze the correlations among the LSM factors of five sub-categories (geomorphology, hydrology, vegetation index, wetness index, and building index), and the factors with a significant correlation were deleted [29].

In the three sub-categories other than geomorphology and hydrology, there were strong correlations among multiple factors. In the geomorphology sub-category, the profile curvature, topographic position index (TPI), TPI based landform classification, cross-sectional curvature, general curvature, longitudinal curvature, maximum curvature and minimum curvature factors had strong correlations, and formed a curvature factor combination. The same phenomenon occurred in the vegetation index, wetness index and the building index sub-categories, and formed the vegetation index factor combination, the wetness index factor combination, and the building index factor combination.
To retain the multi-factor effective information and remove the linear correlation of these factor combinations, the PCA method was used [30]. In this study, the first, second, and third principal component of the curvature factor combination (PCCFC-1, 2, 3), the first principal component of the vegetation index factor combination (PCVIFC-1), the first principal component of the wetness index factor combination (PCWIFC-1), and the first principal component of the building index factor combination (PCBIFC-1) were retained. After the PPMCC analysis and PCA, there were 32 factors remaining.
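The correlation screening step can be illustrated with a plain implementation of the Pearson coefficient that flags strongly correlated factor pairs as candidates for a factor combination. The 0.8 threshold, the factor names, and the sample values are hypothetical; the source does not report the cutoff that was used.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(factors, threshold=0.8):
    """Return factor pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(factors, 2):
        r = pearson(factors[a], factors[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

factors = {
    "profile_curvature": [1, 2, 3, 4, 5],
    "general_curvature": [2, 4, 6, 8, 11],
    "slope": [5, 1, 4, 2, 3],
}
for a, b, r in correlated_pairs(factors):
    print(a, b, round(r, 3))
```

Pairs flagged in this way would then be grouped and passed to PCA, so that the shared information is kept in a few principal components instead of being discarded.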
Factor importance screening based on the SVM model. In this study, the SVM was used as the prediction model for the LSM. The model can determine the importance of each factor according to the degree of contribution of the LSM. Based on this, this study removed the unimportant factors to improve the efficiency and accuracy of the LSM. After repeated experiments and comparisons, in combination with previous research results and based on the LSM factors that played a major role in most landslide studies, the importance threshold of the LSM factors was determined (0.005). Finally, 18 LSM factors were obtained. This is summarized in Table 3.

The GWR-based segmentation of the study area
Based on the calculation of the GWR coefficients of each LSM factor, the natural breakpoint method was used for classification in this study [32]. To segment the study area, theoretically, the classification results of all the LSM factors should be superimposed to reduce the spatial autocorrelation for each LSM factor. However, due to the excessive number of LSM factors, the superposition of all LSM factor classification results may generate too many small areas, and have a great impact on the subsequent steps. Moreover, too many segmentation areas may also make the spatial distance between the areas smaller, in turn increasing the spatial autocorrelation of the LSM factors. After repeated studies, three basic principles were identified that should be followed when the GWR model was used for spatial scale segmentation:
➢ Select the same appropriate number of classifications for all LSM factors;
➢ Select only the most important LSM factors as the segmentation factors to segment the study area;
➢ In light of the results of spatial scale segmentation, regions that are too small should be merged into adjacent regions, and the integrity of the landslide surface should be guaranteed.
Based on principles 1 and 2, we selected the three most important LSM factors in the SVM model (channel network basic level (CNBL), elevation, and distance to river) for use as the segmentation factors. Each segmentation factor was then divided into two categories by the natural breakpoint method. The classification results were superimposed and then processed according to principle 3, which yielded a total of 18 small areas, referred to as the 18 prediction regions. The segmentation process is shown in Fig 4.
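The classify-and-overlay part of this procedure can be sketched as below. For two classes, a natural break is taken here as the split of the sorted values that minimizes the within-class sum of squared deviations; this is a simplified stand-in for the natural breakpoint method, and the coefficient values are hypothetical. The sketch does not cover spatial contiguity or the merging of small regions required by principle 3.

```python
def two_class_break(values):
    """Lower bound of the upper class for a two-class natural break."""
    v = sorted(values)

    def ssd(segment):
        mean = sum(segment) / len(segment)
        return sum((s - mean) ** 2 for s in segment)

    best_k = min(range(1, len(v)), key=lambda k: ssd(v[:k]) + ssd(v[k:]))
    return v[best_k]

def overlay_classes(coeffs_by_factor):
    """Classify each factor's GWR coefficients into two classes and
    superimpose them: every slope unit gets one 0/1 code per factor."""
    breaks = {f: two_class_break(c) for f, c in coeffs_by_factor.items()}
    n_units = len(next(iter(coeffs_by_factor.values())))
    return [tuple(int(coeffs_by_factor[f][i] >= breaks[f])
                  for f in coeffs_by_factor)
            for i in range(n_units)]

# Hypothetical GWR coefficients of two segmentation factors at four slope units.
coeffs = {
    "CNBL": [0.1, 0.2, 0.9, 1.0],
    "elevation": [5.0, 6.0, 1.0, 2.0],
}
print(overlay_classes(coeffs))
```

With three factors and two classes each, the overlay yields at most eight distinct codes, so the number of final regions also depends on how spatially separate patches are delineated and merged.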

Establishment of an LSM model based on GWR
For each prediction region, all the landslide slope units and randomly selected non-landslide slope units constituted a training data set (at a 1:1 ratio) to conduct training of the PSO-SVM coupled model. All the slope units in the region, as the verification sample data set, were input into the trained coupled model, and an LSM was obtained for each prediction region. The optimal solution of the PSO-SVM coupled model for each prediction region is shown in Table 4.
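The 1:1 sampling of the training set can be sketched as follows. The function name, the fixed seed, and the ten example slope units are assumptions for illustration; the source does not describe how the random selection was implemented.

```python
import random

def balanced_training_set(unit_ids, is_landslide, seed=0):
    """Keep every landslide slope unit and draw an equal number of
    non-landslide slope units at random (without replacement)."""
    rng = random.Random(seed)
    landslide = [u for u, flag in zip(unit_ids, is_landslide) if flag]
    stable = [u for u, flag in zip(unit_ids, is_landslide) if not flag]
    sampled = rng.sample(stable, min(len(landslide), len(stable)))
    return landslide, sampled

unit_ids = list(range(10))
is_landslide = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
landslide, sampled = balanced_training_set(unit_ids, is_landslide)
print(landslide, sampled)
```

A region that contains no landslide units produces an empty training set under this scheme, so such regions need separate handling.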

Experimental results of the GWR-PSO-SVM coupled model
The LSMs of all prediction regions were combined to obtain an LSM based on spatial scale segmentation, i.e., the LSM of the GWR-PSO-SVM coupled model. The landslide susceptibility index (LSI) is a form of LSM, which is a continuous value from 0 to 1. This is shown in Fig 5.

Establishment of a comparative experiment to test the PSO-SVM coupled model
To assess the precision and accuracy of the GWR-PSO-SVM coupled model proposed in this study, and especially to verify the correctness of the study area spatial scale segmentation using the GWR method, a comparative experiment was conducted. To verify the influence of spatial imbalance on the LSM, the PSO-SVM coupled model was used in the comparative experiment. The operational process of the PSO-SVM coupled model was basically the same as that of the GWR-PSO-SVM coupled model, with just the spatial scale segmentation using GWR coefficients removed, and the selection of the LSM factors was consistent with that of the GWR-PSO-SVM coupled model. The PSO algorithm determined that the optimal solutions for C and γ in the SVM model were 4 and 1, respectively, and the LSM for the PSO-SVM coupled model was obtained. This is shown in Fig 6. To increase the readability of the LSM, the fixed threshold method was used in this study. Values of 0.1, 0.3, 0.7, and 0.9 were selected as the classification thresholds. The LSI was divided into five categories to obtain the landslide susceptibility zoning (LSZ): very low susceptibility areas, low susceptibility areas, medium susceptibility areas, high susceptibility areas, and very high susceptibility areas. The LSZs from the two experiments are shown in Fig 7 and Fig 8.
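The fixed threshold method amounts to a lookup of each LSI value against the four thresholds. In the sketch below, a value that falls exactly on a threshold is assigned to the higher category; this boundary rule is an assumption, since the source does not state how ties are handled.

```python
from bisect import bisect_right

THRESHOLDS = [0.1, 0.3, 0.7, 0.9]
LABELS = ["very low", "low", "medium", "high", "very high"]

def lsz_category(lsi):
    """Map a landslide susceptibility index in [0, 1] to its zoning category."""
    return LABELS[bisect_right(THRESHOLDS, lsi)]

for value in (0.05, 0.2, 0.5, 0.8, 0.95):
    print(value, lsz_category(value))
```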

Evaluation model results and analysis
Specific category accuracy analysis. The specific category accuracy results of the two experiments were calculated using formula (9) and are shown in Table 5.

The results in Table 5 show that the GWR-PSO-SVM coupled model identified more slope units in the "Very High" LSZ category (85.75%) than the PSO-SVM coupled model (68.35%). The GWR-PSO-SVM coupled model was significantly superior to the PSO-SVM coupled model.
Overall prediction accuracy analysis. The overall prediction accuracy analysis results of the two experiments are shown in Table 6.
It can be clearly seen from Table 6 that the overall prediction accuracy of the PSO-SVM coupled model was 84.44%. In the GWR-PSO-SVM coupled model, the prediction accuracy of most prediction regions was greater than that of the PSO-SVM coupled model. The overall prediction accuracy of the GWR-PSO-SVM coupled model was 87.86%, which was more accurate than the PSO-SVM coupled model.
The ROC curve analysis. In this study, the ROC curve was constructed using the real data of each slope unit as the state variable, and the LSMs at different spatial scales as the test variable, as shown in  For a quantitative analysis, the AUC calculations for the two experiments are shown in Table 7.
As shown in Table 7, the AUC value of the GWR-PSO-SVM coupled model was 0.965, i.e., greater than the value of 0.944 for the PSO-SVM coupled model, indicating that in the ROC curve analysis, the result for the GWR-PSO-SVM coupled model was better than that for the PSO-SVM coupled model.

Discussion
Based on previous analyses and LSM characteristics, there were four main reasons for the differences among LSMs: (1) the spatial scale of LSMs; (2) the factors used in the construction of LSMs; (3) the calculation unit used to construct the LSMs; and (4) the prediction model used to construct the LSMs. In most LSM studies, the factors, calculation units, and prediction models have been considered to be the main reasons for the differences among LSMs. Although these are not the same points considered in this work, this is not a contradictory position. At the same spatial scale, the main reasons leading to the differences among LSMs were derived from the factors, calculation units, and prediction models. However, as research has intensified and with the introduction of spatial scale problems, the spatial scale, factors, calculation units, and prediction models have been identified as the root causes of the differences among LSMs.
In the actual experiments, the LSMs obtained from large areas were often different from and even the opposite of those obtained from a smaller area inside the large area when the same factors, calculation unit, and prediction model were used.
However, many of the LSM prediction models used previously were not originally based on geology or geography, but evolved from economics, statistics, and other disciplines. Therefore, these prediction models have been subjected to repeated verifications over several years or even decades, and have been shown to have objectivity, applicability, and stability. Although there are five kinds of calculation unit, they are all fundamentally based on the grid unit. The grid unit is determined by the mathematical and physical properties of remote sensing satellite images, which also have objectivity. Considering this situation, this study focused on the LSM factors and spatial scale. A total of 13 experiments using the GWR-PSO-SVM coupled model were completed in this study, and in each experiment, each LSM factor had a different importance, as shown in Fig 10. For the convenience of comparison, the order of factors in the legend was arranged from high (0.241) to low (0.005) according to the importance score of LSM factors in the SVM model.
The following results can be observed from Fig 10: The important factors in the PSO-SVM coupled model did not have importance (i.e., the value was 0) in some prediction regions of the GWR-PSO-SVM coupled model. In prediction region 5, for instance, the most important LSM factor in the PSO-SVM coupled model (CNBL) had no importance. There were significant differences in the importance of LSM factors at different spatial scales.
After the study area was segmented, the figure shows that even the adjacent regions 2 and 3 had different importance rankings for the LSM factors, indicating the variable importance of the LSM factors in different prediction regions, and illustrating that the LSM factors had regional characteristics.

Conclusion
Using Zigui-Badong in the Three Gorges Reservoir Area as a case study, the GWR model was coupled with the PSO-SVM model to utilize the advantages of GWR in the processing of spatial heterogeneity. According to the GWR coefficients of LSM factors, the study area was divided into several prediction regions to solve the problem of spatial imbalances in an LSM.
To verify the effectiveness of the proposed method, the outcomes of the GWR-PSO-SVM coupled model and the PSO-SVM coupled model were compared using three evaluation methods: specific category accuracy analysis, overall prediction accuracy analysis, and AUC analysis.
The results for the GWR-PSO-SVM coupled model for these three evaluation methods were 85.75%, 87.86%, and 0.965, respectively, while the results for the traditional PSO-SVM coupled model were 68.35%, 84.44%, and 0.944, respectively. Across the three evaluation methods, the results for the GWR-PSO-SVM coupled model were 17.40 percentage points, 3.42 percentage points, and 0.021 higher than those of the PSO-SVM coupled model, respectively, so the new model had clear advantages over the former model. It was also found that the importance of LSM factors differed between areas. Calculating the LSM factors statistically and assigning them weights with a single prediction model for the complete study area is therefore questionable.
The spatial scale of the study area essentially affects the importance of LSM factors. The spatial scale segmentation method developed in this study, which is based on the LSM factors and the GWR model and consists of the selection of regional segmentation factors, the calculation and classification of GWR coefficients, the superposition of the classification results, and human-computer interaction modification, was therefore an effective method to solve this problem.