Fuzzy Nonlinear Proximal Support Vector Machine for Land Extraction Based on Remote Sensing Image

Remote sensing technologies are now widely employed in the dynamic monitoring of land. This paper presents an algorithm named fuzzy nonlinear proximal support vector machine (FNPSVM) for classifying ETM+ remote sensing images. The algorithm is applied to extract various land types in the city of Da’an in northern China. Two multi-category strategies for the algorithm, namely “one-against-one” and “one-against-rest”, are described in detail and compared. A fuzzy membership function is presented to reduce the effects of noise and outliers in the data samples. The approaches to feature extraction, feature selection, and several key parameter settings are also given. Numerous experiments were carried out to evaluate the algorithm's performance, including accuracy (overall accuracy and kappa coefficient), stability, training speed, and classification speed. The FNPSVM classifier was compared with three other classifiers, the maximum likelihood classifier (MLC), the back propagation neural network (BPN), and the proximal support vector machine (PSVM), under different training conditions. The impacts of the selection of training samples, testing samples, and features on the four classifiers were also evaluated in these experiments.


Introduction
Remote sensing (RS) plays a key role in the dynamic monitoring of land [1][2][3]. Approaches to land extraction based on remote sensing images fall into two groups: manual visual interpretation and computerized auto-classification. Because manual visual interpretation has many drawbacks, numerous classification algorithms for computerized auto-classification have been developed; among the most popular are the maximum likelihood classifier, neural network classifiers, and decision tree classifiers [4]. The maximum likelihood classifier (MLC) is a popular classifier based on the assumption that the classes in the input data follow a Gaussian distribution. However, its results will contain errors if the sample size is insufficient, if the input data do not follow a Gaussian distribution, and/or if the class distributions overlap heavily, resulting in poor separability. The back propagation neural network (BPN) model is widely applied because of its simplicity and its power to extract useful information from samples [5,6]. It is a hierarchical design consisting of fully interconnected layers or rows of processing units (with each unit comprising several individual processing elements, which will be explained below). Back propagation belongs to the class of mapping neural network architectures, and therefore the information processing function that it carries out is the approximation of a bounded mapping [7]. Furthermore, by simulating the processing patterns of the human brain, the approach can effectively avoid some of the problems associated with MLC, although it has disadvantages of its own, including slow convergence during learning and a tendency to converge to local minima [8]. Lastly, the basic idea of the decision tree classifier is to break down a complex decision-making process into a collection of simpler decisions, thus providing a solution that is often easier to interpret.
The support vector machine (SVM) is based on statistical learning theory and aims to determine the location of decision boundaries that produce the optimal separation of classes [9]. This approach, a newer classification technique in the field of remote sensing than the above three methods, has quickly gained ground in the past ten years. The SVM classifier can achieve higher accuracies than both the ML (maximum likelihood) and ANN (artificial neural network) classifiers [10], and it has therefore recently been applied to classify remote sensing images [11]. Although the SVM approach can achieve good performance and high classification accuracy, it still has some shortcomings. One is that the SVM is mainly suited to classification with a small number of training samples: the cost of calculation increases rapidly with data size, which is a particular concern for remote sensing data. To reduce this calculation cost, Fung and Mangasarian [12] proposed the proximal support vector machine (PSVM), which can also be interpreted as regularized least squares and considered in the much more general context of regularized networks. PSVM classifies points by assigning them to the closest of two parallel planes that are pushed apart as far as possible. The method is much more efficient than the traditional SVM in terms of running speed because it merely requires the solution of a single system of linear equations. Accuracy and speed are both significant in classification based on remote sensing images, and a variety of factors affect them: training data size, feature selection, and algorithm parameter settings, to name a few. Often, real data sets contain noise, and noisy samples might not be representative of a class, so there is uncertainty with regard to the class to which they belong.
The noises tend to corrupt the data samples, and the optimal hyperplane obtained by the PSVM may be sensitive to noises or outliers in the training sets. As a result, a classifier might not be able to correctly classify some of the data samples having noisy data, so the fuzzy support vector machines [13,14] and fuzzy linear proximal support machines [15,16] were proposed to address the problem.
In practice, however, real data sets are usually not linearly separable. In this paper, we propose the fuzzy nonlinear proximal support vector machine (FNPSVM) to extract different types of land; the technique is a fuzzy nonlinear extension of the existing PSVM methods. We define a fuzzy membership function that assigns a fuzzy membership to each data point, so that different data points can have different effects on the learning of the separating hyperplane. To improve the algorithm's performance, we also present approaches for setting its key parameters, as well as approaches for feature extraction and feature selection. Lastly, we compare our algorithm with three other classifiers (MLC, BPN, and PSVM).
The paper is organized as follows: Section 2 discusses in detail the architectures of PSVM and FNPSVM. The training algorithm of FNPSVM is shown in Section 3. Experimental results of the algorithm and discussion are presented in Section 4. Section 5 contains the concluding remarks.

Architecture of PSVM
To derive our FNPSVM algorithm, we first briefly introduce the binary category proximal support vector machine. Let the data set consisting of $m$ points in the $n$-dimensional real space $R^n$ be represented by the $m \times n$ matrix $A$, and let each point be represented by an $n$-dimensional row vector $A_i$ $(i = 1, 2, \cdots, m)$. In the case of binary classification, the membership of each data point $A_i$ in the class $A+$ or $A-$ is specified by a given $m \times m$ diagonal matrix $D$ with $+1$ or $-1$ elements along its diagonal. The target is to separate the $m$ data points into $A+$ and $A-$, as depicted in Figure 1. For this problem, the proximal support vector machine with a linear kernel [12] is given by the following quadratic program with parameter $c > 0$ (which controls the tradeoff between the margin and the error) and a linear equality constraint:

$$\min_{(\omega,\gamma,y)} \; \frac{c}{2}\|y\|^2 + \frac{1}{2}(\omega'\omega + \gamma^2) \quad \text{s.t.} \quad D(A\omega - e\gamma) + y = e \qquad (1)$$

where $e$ is an $m$-dimensional vector of ones, and $y$ is an error vector. When the two classes are strictly linearly separable, $y_i = 0$ in (1) (which is not the case shown in Figure 1). As depicted in Figure 1, the variables $(\omega, \gamma)$ determine the orientation and location of the proximal planes:

$$x'\omega - \gamma = +1, \qquad x'\omega - \gamma = -1 \qquad (2)$$

around which the points of each class are clustered and which are pushed apart as far as possible by the term $(\omega'\omega + \gamma^2)$ in the objective function. Consequently, the plane:

$$x'\omega - \gamma = 0 \qquad (3)$$

midway between and parallel to the proximal planes (2), is a separating plane that approximately separates $A+$ from $A-$, as depicted in Figure 1. The distance $2 / \left\| \begin{bmatrix} \omega \\ \gamma \end{bmatrix} \right\|_2$ is called the "margin" (see Figure 1), and maximizing the margin enhances the generalization capability of a support vector machine [9,17]. The approximate separating plane (3) shown in Figure 1 acts as the decision function as follows:

$$x'\omega - \gamma \; \begin{cases} > 0, & \text{then } x \in A+ \\ < 0, & \text{then } x \in A- \\ = 0, & \text{then } x \in A+ \text{ or } x \in A- \end{cases} \qquad (4)$$

Architecture of FNPSVM
In this paper, we employ the vector norms of $x \in R^n$ defined in [17].
The fuzzy nonlinear binary category proximal support vector machine. Generally, real data sets are corrupted with noise.
As a result, a classifier trained with noisy data will not always classify every data sample correctly. Since the optimal hyperplane depends on only a small part of the data points, it may become sensitive to noise or outliers in the training set [18,19]. We can associate each data point with a fuzzy membership that reflects its relative degree of meaningfulness and accounts for the uncertainty in the class to which it belongs. Noisy points and outliers are treated as less important and receive lower fuzzy memberships; that is, lower memberships are assigned to the data points that are more likely to be noise or outliers. A classifier that is able to use this fuzzy degree can reduce the effects of noise or outliers and improve its performance. Thus we propose the following optimization problem for determining the classifier:

$$\min_{(\omega,\gamma,y)} \; \frac{c}{2}\|Sy\|^2 + \frac{1}{2}(\omega'\omega + \gamma^2) \quad \text{s.t.} \quad D(A\omega - e\gamma) + y = e \qquad (8)$$

where $S = \mathrm{diag}(s_1, s_2, \cdots, s_m)$ is a diagonal matrix whose diagonal elements are the membership values of the data samples belonging to $A+$ or $A-$, with $0 < \sigma \le s_i \le 1$ $(i = 1, 2, \cdots, m)$, and $e$ is the vector of ones. Using the constraint of (8), $y$ can be expressed in terms of $\omega$ and $\gamma$, so we arrive at the following unconstrained minimization problem:

$$\min_{(\omega,\gamma)} \; \frac{c}{2}\|S(e - D(A\omega - e\gamma))\|^2 + \frac{1}{2}(\omega'\omega + \gamma^2) \qquad (9)$$

To obtain the fuzzy nonlinear proximal classifier, we modify formula (9) as in [12,20], first by substituting the variable $\omega$ with its dual equivalent $\omega = A'Du$, and then by modifying the last term of the objective function to be the norm of the new dual variable $u$ and $\gamma$.
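Now we obtain the following problem:

$$\min_{(u,\gamma)} \; \frac{c}{2}\|S(e - D(AA'Du - e\gamma))\|^2 + \frac{1}{2}(u'u + \gamma^2) \qquad (10)$$

If we now replace the linear kernel $AA'$ by a nonlinear kernel $K(A, A')$, we obtain:

$$\min_{(u,\gamma)} \; \frac{c}{2}\|S(e - D(K(A,A')Du - e\gamma))\|^2 + \frac{1}{2}(u'u + \gamma^2) \qquad (11)$$

Let $F(u,\gamma) = \frac{c}{2}\|S(e - D(KDu - e\gamma))\|^2 + \frac{1}{2}(u'u + \gamma^2)$ with $K = K(A,A')$. Setting the first-order derivatives of $F(u,\gamma)$ with respect to $u$ and $\gamma$ to zero, we arrive at the following formula:

$$\begin{aligned} \frac{\partial F}{\partial u} &= -c\,DK'DS'S\,(e - D(KDu - e\gamma)) + u = 0 \\ \frac{\partial F}{\partial \gamma} &= c\,e'DS'S\,(e - D(KDu - e\gamma)) + \gamma = 0 \end{aligned} \qquad (12)$$

where both $D$ and $S$ are diagonal matrices, so that $D = D'$, $S = S'$ and $D^2 = I$. Rearranging formula (12), we obtain the equations with respect to $u$ and $\gamma$:

$$\begin{aligned} \left(\frac{I}{c} + DK'S^2KD\right)u - DK'S^2e\,\gamma &= DK'S^2De \\ -e'S^2KD\,u + \left(\frac{1}{c} + e'S^2e\right)\gamma &= -e'S^2De \end{aligned} \qquad (13)$$

With $H = S\,[KD \;\; -e]$, formula (13) can be expressed by the following formula:

$$\left(\frac{I}{c} + H'H\right)\begin{bmatrix} u \\ \gamma \end{bmatrix} = H'SDe \qquad (14)$$

We can work out $u$ and $\gamma$ by solving formula (14), and hence the binary category nonlinear classifier can be written as follows:

$$K(x', A')Du - \gamma \; \begin{cases} > 0, & \text{then } x \in A+ \\ < 0, & \text{then } x \in A- \\ = 0, & \text{then } x \in A+ \text{ or } x \in A- \end{cases} \qquad (15)$$

The fuzzy nonlinear proximal support vector machine. There are roughly four types of support vector machines that handle multi-class problems [21]. Two strategies have been proposed to adapt the SVM to N-class problems [22], namely the "one-against-one" strategy and the "one-against-rest" strategy. The "one-against-one" strategy constructs a machine for each pair of classes, resulting in $N(N-1)/2$ machines. When applied to a test pixel, each machine gives one vote to the winning class, and the pixel is labeled with the class having the most votes. The "one-against-rest" strategy breaks the N-class case into N two-class cases, in each of which a machine is trained to classify one class against all others [4]. In this paper, we employed both strategies.

(1) "One-against-one" strategy. For each pair of classes $r$ and $j$ $(1 \le r < j \le k)$, let

$$A^{rj} = \begin{bmatrix} A_r \\ A_j \end{bmatrix}$$

Here, $k$ is the class number, while $A_r \in R^{m_r \times n}$ and $A_j \in R^{m_j \times n}$ represent the $m_r$ and $m_j$ points in class $r$ and class $j$, respectively. Let $m = m_r + m_j$, so that $D^{rj}$ is an $m \times m$ diagonal matrix as follows:

$$D^{rj}_{ii} = \begin{cases} +1, & \text{if } A^{rj}_i \in \text{class } r \\ -1, & \text{if } A^{rj}_i \in \text{class } j \end{cases}$$

From formula (14), the $k(k-1)/2$ unique pairs $(u^{rj}, \gamma^{rj})$ can be obtained, and thus $k(k-1)/2$ proximal surfaces are generated:

$$K(x', (A^{rj})')D^{rj}u^{rj} - \gamma^{rj} = 0, \qquad 1 \le r < j \le k$$

Let $T_i$ $(i = 1, \cdots, k)$ be the number of times that a new given point $x \in R^n$ is assigned to the $i$-th class by the $k(k-1)/2$ proximal surfaces; $x$ is finally assigned to a class according to the following formula:

$$x \in \text{class } t, \qquad t = \arg\max_{i = 1, \cdots, k} T_i$$

In this method, classifiers for all possible pairs of classes are created. Therefore, for M classes, there will be $M(M-1)/2$ binary classifiers. The output from each classifier in the form of a class label is obtained. The class label that occurs most is assigned to that point in the data vector. In case of a tie, a tie-breaking strategy may be adopted. A common tie-breaking strategy is to randomly select one of the class labels that are tied [23].

(2) "One-against-rest" strategy. Let

$$A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_k \end{bmatrix}$$

where $k$ is the class number and $A_r \in R^{m_r \times n}$ represents the $m_r$ points in class $r$. Let $m = m_1 + m_2 + \cdots + m_k$, so that $D^r$ is an $m \times m$ diagonal matrix as follows:

$$D^{r}_{ii} = \begin{cases} +1, & \text{if } A_i \in \text{class } r \\ -1, & \text{otherwise} \end{cases}$$

From formula (14), the $k$ unique pairs $(u^r, \gamma^r)$ can be obtained, and thus $k$ proximal surfaces are generated:

$$K(x', A')D^{r}u^{r} - \gamma^{r} = 0, \qquad r = 1, \cdots, k$$

A new given point $x \in R^n$ is assigned class $t$, depending on which of the $k$ nonlinear halfspaces generated by the $k$ surfaces it lies deepest in, namely:

$$x \in \text{class } t, \qquad t = \arg\max_{r = 1, \cdots, k} \left( K(x', A')D^{r}u^{r} - \gamma^{r} \right)$$

Supposing the dataset is to be classified into M classes, M binary SVM classifiers are created, where each classifier is trained to distinguish one class from the remaining M-1 classes. For example, the class one binary classifier is designed to discriminate between class one data vectors and the data vectors of the remaining classes. The other SVM classifiers are constructed in the same manner. During the testing or application phase, data vectors are classified by finding the margin from the linear separating hyperplane. The final output is the class that corresponds to the SVM with the largest margin [23].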
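The training and classification steps of this section can be illustrated with a short sketch. It is not the authors' Matlab implementation: it assumes that the membership matrix enters the objective as $\|Sy\|^2$, that the Gaussian kernel has the form $\exp(-\sigma\|a-b\|^2)$, and that the resulting linear system is small enough to solve by plain Gaussian elimination. All function names (`rbf`, `solve`, `train_binary`, `decision`, `train_one_against_rest`, `predict_one_against_rest`) are hypothetical.

```python
import math

def rbf(a, b, sigma):
    # Assumed Gaussian kernel form: exp(-sigma * ||a - b||^2).
    return math.exp(-sigma * sum((p - q) ** 2 for p, q in zip(a, b)))

def solve(M, rhs):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def train_binary(A, d, s, c, sigma):
    # A: training points, d: labels (+1 / -1), s: fuzzy memberships in (0, 1].
    m = len(A)
    # Rows of H = S [K D, -e].
    H = [[s[i] * rbf(A[i], A[j], sigma) * d[j] for j in range(m)] + [-s[i]]
         for i in range(m)]
    # Linear system (I/c + H'H) [u; gamma] = H' S D e.
    M = [[sum(H[i][p] * H[i][q] for i in range(m)) + (1.0 / c if p == q else 0.0)
          for q in range(m + 1)] for p in range(m + 1)]
    rhs = [sum(H[i][p] * s[i] * d[i] for i in range(m)) for p in range(m + 1)]
    z = solve(M, rhs)
    return z[:m], z[m]

def decision(x, A, d, u, gamma, sigma):
    # K(x', A') D u - gamma; a positive value places x in class A+.
    return sum(rbf(x, A[j], sigma) * d[j] * u[j] for j in range(len(A))) - gamma

def train_one_against_rest(A, labels, s, c, sigma):
    # One binary machine per class: that class is +1, all others are -1.
    models = {}
    for cls in sorted(set(labels)):
        d = [1 if y == cls else -1 for y in labels]
        u, gamma = train_binary(A, d, s, c, sigma)
        models[cls] = (d, u, gamma)
    return models

def predict_one_against_rest(x, A, models, sigma):
    # Pick the class whose surface x lies deepest in (largest decision value).
    return max(models, key=lambda cls: decision(x, A, *models[cls], sigma))
```

A "one-against-one" variant would call `train_binary` on each pair of classes and count votes instead of comparing decision values. For image-sized training sets a numerical library would normally replace the hand-written `solve`.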

Fuzzy membership model
In order to improve classification performance and to reduce the corruption of data samples by noise, we define a fuzzy membership function for a given class that assigns a membership to each data point. The membership is a function of $x$, the distance between the data sample and the center of the class that it belongs to. The two user-defined constants $t_1$ and $t_2$ tune the fuzzy membership of each data point in training; they determine the ranges in which a data sample absolutely does or does not belong to a given class, and they also control the shape of the curve (see Figure 2). A smaller value of $x$ indicates that the data sample point is closer to the center of the given class, and that the probability of the sample belonging to this class is higher. When $x$ is between 0 and $t_1$, the data sample point belongs to the given class with absolute certainty; when $x$ is between $t_2$ and 1, the data sample point does not belong to the given class. For a given value of $x$, the values of $t_1$ and $t_2$ influence the fuzzy membership, and thus also the ultimate classification result.
The distance $x$ is the key to each training sample's fuzzy membership, and it can be obtained as follows:

$$x_i = \frac{1}{p}\sum_{t=1}^{p}\frac{|VF_{ti} - M_t|}{DM_t}, \qquad M_t = \frac{1}{n}\sum_{i=1}^{n} VF_{ti}, \qquad DM_t = \max_{1 \le i \le n}|VF_{ti} - M_t|$$

where $n$ is the number of training samples of a given class, $p$ is the number of selected features, and $VF_{ti}$ is the $t$-th feature value of the $i$-th sample. $M_t$ is the mean value of the $t$-th feature over the $n$ samples of the class; $DM_t$ is the maximum distance between any sample point and the center ($M_t$) of the $t$-th feature of the class; and $x_i$ denotes the average distance between the $i$-th sample and the centers of all features.
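The following sketch shows how the normalized distance and a membership value could be computed for the samples of one class. The exact curve of the membership function between $t_1$ and $t_2$ is not reproduced here, so the linear ramp below is an assumption, as is the small positive `floor` used so that no sample receives a membership of exactly zero. The function names are hypothetical.

```python
def normalized_distances(samples):
    # samples: feature vectors of ONE class; returns one distance in [0, 1] per sample.
    n, p = len(samples), len(samples[0])
    means = [sum(row[t] for row in samples) / n for t in range(p)]
    max_dev = [max(abs(row[t] - means[t]) for row in samples) for t in range(p)]
    return [
        sum(abs(row[t] - means[t]) / max_dev[t] for t in range(p) if max_dev[t] > 0) / p
        for row in samples
    ]

def membership(x, t1=0.1, t2=0.8, floor=1e-3):
    # Assumption: linear ramp between t1 and t2; the paper's curve may differ.
    if x <= t1:
        return 1.0          # certainly belongs to the class
    if x >= t2:
        return floor        # treated as noise or an outlier
    return max(floor, (t2 - x) / (t2 - t1))

# One feature, four samples; the last one is far from the class center.
xs = normalized_distances([[0.0], [1.0], [2.0], [10.0]])
weights = [membership(x) for x in xs]
```

The resulting `weights` would form the diagonal of the membership matrix $S$ for that class.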

Sample Selection
The choices of sample size and sampling design affect the performance and reliability of a classifier. Sufficient samples are necessary. A previous study indicated that this factor alone could be more important than the selection of classification algorithms in obtaining accurate classifications [24].
Sample selection includes two parts, namely sample data size and selection method. Increasing the sample size generally improves performance, but at a higher calculation cost. The sample size must be sufficient to provide a representative basis for training a classifier and for accuracy assessment. Basic sampling designs, such as simple random sampling, can be appropriate if the sample size is large enough [25]. The adoption of a simple sampling design is also valuable in helping to meet the requirements of a broad range of users [26]. In this paper, we apply a simple random sampling design to collect training samples and testing samples.

Kernel Function Strategy
The concept of the kernel is introduced to extend the SVM's ability to deal with nonlinear classification. By mapping feature vectors into a high-dimensional space, it transforms nonlinear boundaries in the low-dimensional space into linear ones in the high-dimensional space, so the training data can be classified there without knowing the specific form of the mapping function. A kernel function is a generalization of the distance metric that measures the distance between two data points as they are mapped into a high-dimensional space in which the data are more clearly separable [27,28].
Three kernel functions are widely used for nonlinear SVM: the radial basis function (RBF), the polynomial, and the sigmoid. In this paper, we adopted the Gaussian RBF kernel as the default kernel function for the following reasons: (1) The RBF kernel can handle the case where the relation between class labels and attributes is nonlinear [29]; (2) The polynomial function takes longer in the training stage of SVM, and some previous studies [30][31][32] have reported that the RBF function provides better performance than the polynomial function. In addition, the polynomial kernel has more hyperparameters than the RBF kernel does, and its values may approach infinity or zero when the degree is large [29]; (3) The sigmoid kernel behaves like the RBF under certain parameters; however, it is not valid under some parameters [9]; (4) When the sample size is quite large, the RBF kernel converges better than the other kernels above.
The Gaussian kernel is defined entrywise for $A \in R^{m \times n}$ and $B \in R^{n \times k}$: the entry $(K(A,B))_{ij}$ is a Gaussian function of the distance between $A_i$, the $i$-th row of $A$, which is a row vector in $R^n$, and $B_{\cdot j}$, the $j$-th column of $B$. The kernel $K(A,B)$ thus maps $R^{m \times n} \times R^{n \times k}$ into $R^{m \times k}$. In particular, if $x$ and $y$ are column vectors in $R^n$, then $K(x', y)$ is a real number. The parameter $\sigma$ of the RBF kernel is a user-defined positive constant regulating the width of the Gaussian kernel, and it has an important impact on kernel performance. There is, however, little guidance in the literature on the criteria for selecting kernel-specific parameters [33]; hence we carried out many trials to find the optimal parameter $\sigma$.
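As an illustration of the shapes involved, the sketch below builds the kernel matrix for a set of rows of $A$ and columns of $B$. It assumes the form $\exp(-\sigma\|A_i' - B_{\cdot j}\|^2)$; the paper's exact parameterization of $\sigma$ may differ. The function name is hypothetical.

```python
import math

def gaussian_kernel_matrix(A_rows, B_cols, sigma):
    # A_rows: m points in R^n (rows of A); B_cols: k points in R^n (columns of B).
    # Assumed entry: exp(-sigma * ||A_i' - B_.j||^2), giving an m x k matrix.
    return [
        [math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(row, col))) for col in B_cols]
        for row in A_rows
    ]

# K(A, A') for three points in R^2 is a symmetric 3 x 3 matrix with ones on the diagonal.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gaussian_kernel_matrix(points, points, sigma=0.5)
```

Under this form a larger $\sigma$ makes the kernel narrower, so distant points contribute less to the decision function.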

Parameter Selection Method
Whether a simple or a more complex classifier is used, the learning parameters have to be chosen carefully in order to yield good classification performance. The FNPSVM algorithm proposed in this paper requires four given parameters, specifically $c$, $\sigma$, $t_1$ and $t_2$. Vapnik [9] found that varying the kernel function only slightly affects the classification results of SVM, while the parameters of the kernel function and the penalty constant $c$ have a strong effect on the performance of SVM.
The parameter $c > 0$ is an important quantity that determines the trade-off between the empirical error (the number of wrongly classified inputs) and the complexity of the solution found.
Normally, large values of $c$ lead to fewer training errors (and a narrower margin) at the cost of more training time, whereas small values generate a larger margin, with more errors and more training points situated inside the margin. Since the number of training errors cannot be interpreted as an estimate of the true risk, this knowledge does not really help in choosing a suitable value for the parameter. The parameter $\sigma$ of the Gaussian kernel affects the complexity of the decision boundary. Improper selection of these two parameters can cause over-fitting or under-fitting problems [29,34]. Nevertheless, there is little explicit guidance on choosing parameters for SVM. Recently, Hsu [35] suggested a method for determining parameters, namely grid search with cross-validation. For multi-category problems, however, the cross-validation method is not feasible. In this paper, we extended this method and propose an approach named multi-layer grid search and random-validation. The basic idea of random-validation is to randomly divide the sample set of each category into a training set and a test set of different sizes. The test set is then classified by the classifier trained on the training set, and the classification accuracy is derived. This procedure is executed $n$ times during each cycle, yielding $n$ accuracies. Finally, the random-validation accuracy is the mean of the $n$ accuracies.
We recommend the "multi-layer grid search" method on $c$ and $\sigma$ using $n$-time random-validation, in order to find the optimal parameters accurately while lowering computational cost. We first determine the bounds of the parameters $c$ and $\sigma$, and roughly construct a 2-dimensional grid of pairs $(c_i, \sigma_j)$. Here, $i = 1, 2, \cdots, m$ and $j = 1, 2, \cdots, n$, so an $m \times n$ grid-plane with $m \times n$ pairs $(c_i, \sigma_j)$ is obtained. The FNPSVM algorithm is trained with each pair $(c_i, \sigma_j)$ under $n$-time random-validation to obtain its classification accuracy. The pair $(c_i, \sigma_j)_{high}$ with the best accuracy is the optimal pair. If the best accuracy does not satisfy the classification requirement, a new 2-dimensional grid-plane centered on $(c_i, \sigma_j)_{high}$ is constructed, and learning with the new pairs $(c, \sigma)$ in this grid-plane is carried out to reach a higher accuracy. The above procedure is performed iteratively to find the optimal parameters $c$ and $\sigma$.
Although the multi-layer grid search with random-validation seems simple, it is practical for the following reasons: (1) For each parameter, a finite number of possible values is prescribed, and all possible combinations of $(c, \sigma)$ are then considered to find the one that yields the best result; (2) The computational time needed to find good parameters with this approach is not much greater than that of advanced methods, since there are only two parameters (in general, the complexity of grid search grows exponentially with the number of parameters); (3) The grid search can be easily parallelized because each pair $(c, \sigma)$ is independent, unlike some other advanced methods that require iterative processes.
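The procedure above can be sketched as follows. This is a simplified illustration rather than the authors' code: it runs a fixed number of layers instead of stopping when an accuracy requirement is met, it splits the whole sample list rather than each category separately, and it takes the classifier as a user-supplied callback `evaluate(train, test, (c, sigma))` that returns an accuracy. All names, including the stand-in `toy_accuracy`, are hypothetical.

```python
import math
import random

def random_validation(evaluate, samples, params, n_rounds, train_fraction, rng):
    # Mean accuracy over n_rounds random train/test splits.
    accuracies = []
    for _ in range(n_rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        accuracies.append(evaluate(shuffled[:cut], shuffled[cut:], params))
    return sum(accuracies) / n_rounds

def grid_layer(evaluate, samples, c_exps, s_exps, n_rounds, rng):
    # Try every pair (2^ce, 2^se) on one grid-plane and keep the best.
    best = None
    for ce in c_exps:
        for se in s_exps:
            acc = random_validation(evaluate, samples, (2.0 ** ce, 2.0 ** se),
                                    n_rounds, 1 / 3, rng)
            if best is None or acc > best[0]:
                best = (acc, ce, se)
    return best

def multilayer_grid_search(evaluate, samples, lo=-14, hi=14, steps=(4, 1),
                           n_rounds=5, seed=0):
    rng = random.Random(seed)
    c_exps = s_exps = list(range(lo, hi + 1, steps[0]))
    best = None
    for layer, step in enumerate(steps):
        best = grid_layer(evaluate, samples, c_exps, s_exps, n_rounds, rng)
        if layer + 1 < len(steps):
            finer = steps[layer + 1]
            half = step - finer  # stay inside the neighbouring coarse grid nodes
            c_exps = list(range(best[1] - half, best[1] + half + 1, finer))
            s_exps = list(range(best[2] - half, best[2] + half + 1, finer))
    return best  # (accuracy, log2 of c, log2 of sigma)

def toy_accuracy(train, test, params):
    # Stand-in for a real classifier: a smooth score peaking at c = 2^9, sigma = 2^-11.
    c, sigma = params
    return 1.0 / (1.0 + (math.log2(c) - 9) ** 2 + (math.log2(sigma) + 11) ** 2)

best = multilayer_grid_search(toy_accuracy, list(range(30)))
```

With the default settings, the first layer uses exponents from -14 to 14 in steps of 4, and the second layer covers three exponents on either side of the first-layer optimum in steps of 1.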

Experiments and Discussion
All experiments were run on an 1800 MHz AMD Sempron(tm) 3000+ processor under Windows XP using Matlab 7.0. We adopted the classification criterion of Chen [36], under which saline-alkalized lands are classified into heavy saline-alkalized land, moderate saline-alkalized land, and light saline-alkalized land.
Classification Experiments Using ETM+ Image
Experiment summary. We selected Da'an, a city in northern China with a total area of 4,879 km², as our test area. Multi-spectral (Landsat-7 ETM+) remote sensing data (30 m spatial resolution, UTM projection) acquired on August 30th, 2000 were used to classify the image into nine land cover types (heavy saline-alkalized land, moderate saline-alkalized land, light saline-alkalized land, water area, cropland, grassland, rural residential area, urban residential area, and sand land).
Using the topographic maps of Da'an city (1:100,000 scale), we carried out precise geometric correction and resampling of the image. Geometric correction was accomplished with a second-order polynomial, and resampling was achieved through cubic convolution, with a matching error of less than one pixel. We selected 270 samples (90 for training and 180 for testing) for each class from the image using a random sampling procedure, for a total of 810 training samples and 1,620 test samples across the nine classes. For each sample set, the test set was independent of the training set.
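A per-class random split of this kind can be sketched as below. The function name and the use of integer sample identifiers are illustrative assumptions; in practice each item would be a pixel's feature vector.

```python
import random

def split_per_class(samples_by_class, n_train, seed=0):
    # Shuffle each class independently, then cut it into disjoint train/test parts.
    rng = random.Random(seed)
    train, test = {}, {}
    for cls, items in samples_by_class.items():
        shuffled = items[:]
        rng.shuffle(shuffled)
        train[cls] = shuffled[:n_train]
        test[cls] = shuffled[n_train:]
    return train, test

# Nine classes with 270 samples each: 90 for training and 180 for testing per class.
data = {cls: list(range(cls * 1000, cls * 1000 + 270)) for cls in range(9)}
train, test = split_per_class(data, n_train=90)
```

Repeating the call with different seeds gives the repeated random selections used when assessing stability.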
To demonstrate the effectiveness of the proposed method, both the "one-against-one" and "one-against-rest" strategies, based on the Gaussian RBF kernel, were used to deal with the n-class case. The results (accuracies, training speed, and classification speed) obtained using the FNPSVM algorithm were compared with those derived from the four conventional classification methods, namely the maximum likelihood classifier (MLC), back propagation neural network (BPN), support vector machine (SVM), and proximal support vector machine (PSVM), under different training conditions (shown in Table 1).
Feature extraction and feature selection. (1) Feature extraction. Feature extraction has a strong impact on classification accuracy. In this paper, we extracted 14 features: six bands of the ETM+ image, the first principal components of the K-L transform and the K-T transform, the soil index, NDVI (normalized difference vegetation index), the composition index, and the H (hue), S (saturation), and I (intensity) color components of the HSI color space. Some of these features are computed from the first, third, fourth, and fifth bands of the ETM+ image.
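Of the indices listed above, NDVI has a standard definition: the difference between the near-infrared and red reflectances divided by their sum, which for ETM+ corresponds to bands 4 and 3. The sketch below computes it for a single pixel; the function name and the zero-denominator convention are assumptions, and the soil and composition indices are not reproduced because their formulas are not given here.

```python
def ndvi(band4_nir, band3_red):
    # NDVI = (NIR - Red) / (NIR + Red); returns 0.0 where both bands are zero.
    total = band4_nir + band3_red
    if total == 0:
        return 0.0
    return (band4_nir - band3_red) / total

# Vegetation reflects strongly in the near infrared, so its NDVI is high.
vegetated = ndvi(0.5, 0.1)
bare = ndvi(0.25, 0.2)
```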
In the field of digital image processing, a number of color models have been proposed, such as RGB, HSI, and CIE. However, selecting the optimal color space is still an open problem in color image segmentation [20].
The RGB color model is suitable for color display, but less so for color analysis because of the high correlation among its R, G, and B components [38]. In color image processing and analysis, it is known that: (1) the H and S components are closely correlated with the color sense of the eye; (2) hue information and intensity information are distinctly separated in the HSI model; and (3) with the HSI model, a computer program can easily process color information once the color sense of the eye has been transformed into specific values. We therefore extracted the H, S, and I color components of the HSI color space as three classification features. A false color composite of bands 5, 4, and 2 was produced and exported as an RGB image, and the RGB model was then transformed into the HSI model according to the formulas in [39].
(2) Feature selection. Normally, a real dataset is so large that learning might not work well, and the running time of a learning algorithm might increase drastically unless unwanted features are removed. Thus we must select features that are neither irrelevant nor redundant to the target concept. Feature selection for classification is a well-researched problem that strives to improve the classifier's generalization ability and to reduce the dimensionality and the computational complexity. It directly reduces the number of original features by selecting a subset of them that still retains sufficient information for classification [40]. Feature selection attempts to select the minimally sized subset of features according to the following criteria [41]: 1) the classification accuracy does not significantly decrease; and 2) the resulting class distribution, given only the values of the selected features, is as close as possible to the original class distribution given all features.
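Returning to the H, S, and I features described above, the sketch below converts one RGB pixel to HSI using the common geometric conversion found in image processing textbooks. This is an assumption: the formulas in [39] may differ in detail. The function name and the convention of returning a hue of 0 for gray pixels are also illustrative choices.

```python
import math

def rgb_to_hsi(r, g, b):
    # r, g, b in [0, 1]; returns (hue in degrees, saturation, intensity).
    intensity = (r + g + b) / 3.0
    saturation = 0.0 if intensity == 0 else 1.0 - min(r, g, b) / intensity
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0:
        hue = 0.0  # hue is undefined for gray pixels; 0 is used as a convention
    else:
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        hue = theta if b <= g else 360.0 - theta
    return hue, saturation, intensity

# A pure green pixel has a hue of about 120 degrees and full saturation.
h, s, i = rgb_to_hsi(0.0, 1.0, 0.0)
```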
In this paper, in view of the above criteria, the data types, and the characteristics of remote sensing images, we adopted the traditional DB index rules, which use between-class scatter and within-class scatter to select classification features [42]. The rules combine two quantities: 1) the within-class scatter of each class, computed from the $N_i$ samples of the $i$-th class and the class center $X_i$; and 2) the between-class distance $d_{ij}$ between the centers of two classes.
The smaller the value of $DB_k$, the better the classification performance. Based on the above rules and the 270 sample points of each category, we obtained the DB indices of the fourteen features and their ranks (see Table 2).
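To make the ranking criterion concrete, the sketch below computes a DB index in its standard Davies-Bouldin form: the within-class scatter is the mean distance of a class's samples to its center, and each class contributes its worst ratio of summed scatters to center distance. This standard form is an assumption about the rules in [42], and the function names are hypothetical.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def db_index(samples_by_class):
    # samples_by_class: {class label: list of feature tuples}; lower is better.
    centers, scatter = {}, {}
    for cls, pts in samples_by_class.items():
        dim = len(pts[0])
        centers[cls] = tuple(sum(p[t] for p in pts) / len(pts) for t in range(dim))
        scatter[cls] = sum(euclid(p, centers[cls]) for p in pts) / len(pts)
    classes = list(samples_by_class)
    worst = [
        max((scatter[i] + scatter[j]) / euclid(centers[i], centers[j])
            for j in classes if j != i)
        for i in classes
    ]
    return sum(worst) / len(classes)

# A single feature that separates two classes well scores lower than one that does not.
far = {"a": [(0.0,), (2.0,)], "b": [(10.0,), (12.0,)]}
near = {"a": [(0.0,), (2.0,)], "b": [(4.0,), (6.0,)]}
ranking = sorted([("far", db_index(far)), ("near", db_index(near))], key=lambda kv: kv[1])
```

Evaluating `db_index` once per candidate feature and sorting the results gives a feature ranking of the kind reported in Table 2.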
Parameter setting. Due to the differing nature of the impacts that algorithm parameters have on different algorithms, it is impossible to account for such differences in evaluating the comparative performances of the algorithms [4]. To avoid this problem, the corresponding parameters of the best performance of each algorithm were chosen for the purpose of comparison.
(1) Parameter setting of PSVM and FNPSVM. The performance of classification algorithms is affected by their parameter settings. As described in section 3.4, we searched for the optimal parameters $t_1$, $t_2$, $c$, and $\sigma$ for the FNPSVM classifier in two steps. In the first step, we set $t_1 = 0.1$ and $t_2 = 0.8$, and searched for the kernel parameter $\sigma$ and the penalty constant $c$ as described in section 3.5. In the second step, we fixed $\sigma$ and $c$ at the values found in the first step, and searched for the parameters $t_1$ and $t_2$ of the fuzzy membership function. In the first step, we constructed the two-dimensional grid for the first layer. The values of $c$ and $\sigma$ ranged from $2^{-14}$ to $2^{14}$, with successive values differing by a factor of $2^4$. The grid search using 5-time random-validation was executed, and we found that the optimal parameter pair $(c, \sigma)$ was $(2^{10}, 2^{-10})$, which had the highest overall classification accuracy (93.31%) and kappa value (0.9248). Table 3 summarizes the results of the first-layer grid search. Subsequently we constructed the second-layer grid centered on $(2^{10}, 2^{-10})$; the values of $c$ and $\sigma$ were chosen from $2^{7}$ to $2^{13}$ and from $2^{-7}$ to $2^{-13}$, respectively, with successive values differing by a factor of 2, and the grid search using 5-time random-validation was carried out. As shown in Table 4, $c = 2^{13}$ and $\sigma = 2^{-13}$ gave the best overall classification accuracy (93.56%) and kappa coefficient (0.9275). As these accuracies satisfied our classification demand, we began the next step, where we set $c = 2^{13}$ and $\sigma = 2^{-13}$ and searched for the parameters $t_1$ and $t_2$. However, we did not find that changes of $t_1$ within (0.05, 0.2) and $t_2$ within (0.7, 0.9) significantly improved the performance of the FNPSVM; hence we set $t_1 = 0.1$ and $t_2 = 0.8$.
(2) Parameter setting of BP neural network. There are many parameters associated with a BP neural network, including the number of neurons, the transfer function, the learning rate, the number of iterations, and so on. It is not easy to know beforehand which values of these parameters are best for a given problem. Consequently, in order to yield the optimal classification performance, the settings of some key parameters of the BP neural network in this paper were obtained through repeated trials and experience from previous studies.
A BP neural network with a hidden layer can approximate, with arbitrary precision, an arbitrary nonlinear function defined on a compact set of $R^n$ [43,44]. We employed a three-layer BP neural network comprising an input layer, a hidden layer, and an output layer. The number of neurons in the hidden layer is one of the primary parameters of the BPN algorithm; currently, however, there is no authoritative rule for determining it. Too many hidden units lead to poor generalization and longer training time, whereas too few cause the network to underfit the training set and prevent the correct mapping of inputs to outputs. In this paper, the number of neurons in the hidden layer was set to 20 by the empirical formula [44], so the network structure became n-20-9 (n denotes the number of features).
We chose the log-sigmoid function as the transfer function from the input layer, and limited the network's number of iterations to 1,000 for each desired output. The Levenberg-Marquardt optimization algorithm (the trainlm function in Matlab) was used as the training function because it greatly increases the training speed of the network at the cost of high memory use. The gradient descent with momentum weight and bias learning function was employed to calculate a given neuron's weight change from the neuron's input and error, the weight, the learning rate, and the momentum constant. The other parameters of the network were chosen as follows: learning rate $\eta = 0.5$, momentum factor $\alpha = 0.8$, minimum gradient $\delta = 10^{-20}$, and minimum mean square error $\varepsilon = 10^{-6}$. Figure 3 shows the classification maps produced by the MLC, BPN, PSVM, and FNPSVM, all based on the above parameter settings.
Performance assessments. The parameter settings of an algorithm normally affect its classification results, which makes it difficult to compare algorithms while their parameters vary. To address this problem, the following tables list the best performance of each algorithm in each training case. The criteria for evaluating the performance of classification algorithms include accuracy, speed, stability and comprehensibility, among others [4]. In this paper, we chose classification accuracy, speed and stability to assess the performance of the different algorithms. Table 5 gives the overall accuracies and kappa coefficients obtained with the various multi-class strategies and classifiers on the ETM+ data in different cases. Table 6 gives the training speed and classification speed on the entire data set for the different classifiers under different training conditions. Table 7 reports the means and standard deviations of the overall classification accuracies for different training samples, testing samples and features. Figure 4 shows boxplots of the overall classification accuracies, obtained by randomly selecting training samples and testing samples six times from the 270 samples of each class.
(1) Classification accuracy. In this paper, classification accuracy, one of the most important criteria for evaluating a classifier, was measured by the overall accuracy and the kappa coefficient computed from the confusion (error) matrix. The error matrix is the most widely used way to represent the classification accuracy of remote sensing data and is applicable to a variety of site-specific accuracy assessments. Numerous researchers have recommended it, and it has become a standard convention. Its effectiveness stems from the fact that it describes the accuracy of each category together with the errors of inclusion and the errors of exclusion present in the classification [25,45]. To account for chance agreement, some researchers suggest adopting the kappa coefficient as a standard measure of classification accuracy [46]. Foody [47] also pointed out that, since many remote sensing data sets are dominated by mixed pixels, standard accuracy measures such as the kappa coefficient are often not suitable for accuracy assessment in remote sensing. Although some researchers have questioned the kappa coefficient because of its sensitivity to the density or frequency of dynamic change in the real world, it retains many attractive features as an index of classification accuracy. More specifically, it offers some compensation for chance agreement, and a variance term can be calculated for it, which enables statistical testing of the significance of the difference between two coefficients [25,48].
Table 3. The overall accuracies (%) and kappa coefficients of the first layer grid-search using 5-time random-validation based on ETM+ image.
We also need to emphasize that different measures of accuracy evaluate different components of accuracy and make different assumptions about the data [49]. The measurement and meaning of classification accuracy depend substantially on individual perspectives and demands [49,50]. An accuracy assessment can be conducted for a variety of reasons, and many researchers have recommended that measures such as the kappa coefficient of agreement be adopted as a standard [25,46]. Using the parameters selected above for the different algorithms and the 270 samples of each category obtained through a simple random sampling design, we obtained the overall classification accuracies and kappa coefficients of the various multi-class strategies and classifiers in 12 training cases on the ETM+ dataset, which consists of 4,037,099 points (see Table 5).
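The two accuracy measures used here can be computed directly from an error matrix. The sketch below uses the standard definitions: overall accuracy is the fraction of samples on the diagonal, and kappa corrects it for the agreement expected by chance from the row and column totals. The function name and the 3-class matrix are hypothetical and are not data from this study.

```python
import numpy as np

def overall_accuracy_and_kappa(confusion):
    """Return (overall accuracy, kappa coefficient) for a square error matrix."""
    m = np.asarray(confusion, dtype=float)
    total = m.sum()
    observed = np.trace(m) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = (m.sum(axis=0) * m.sum(axis=1)).sum() / total ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Hypothetical error matrix: rows are reference classes, columns are predictions.
example = [[50, 5, 5],
           [4, 52, 4],
           [6, 3, 51]]
accuracy, kappa = overall_accuracy_and_kappa(example)
```

For this hypothetical matrix the overall accuracy is 0.85 and kappa is 0.775, which shows how kappa discounts the one-third agreement expected by chance for three equally sized classes.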
Unfortunately, the conventional SVM failed on such a large dataset because it requires the more costly solution of a linear or quadratic program. Several patterns can be observed from Table 5 and Table 7, as explained below:
1) Regarding the multi-class classification strategies of PSVM and FNPSVM, the accuracies of the "one-against-one" strategy were about 1-2% higher than those of the "one-against-rest" strategy in all training cases. Our experiments also showed that, for both PSVM and FNPSVM, the classification speed of the "one-against-one" strategy was at least twice that of the "one-against-rest" strategy (not listed in the following tables). In this paper, we therefore employed the "one-against-one" multi-class strategy of PSVM and FNPSVM for comparison with the other two classifiers.
2) The classification accuracies achieved by PSVM and FNPSVM were significantly higher than those produced by either the MLC or the BPN classifier in all 12 training cases (Table 5). The accuracy differences between the PSVM and FNPSVM were rather small, as were those between the MLC and BPN (Table 5). The mean overall accuracies of the PSVM and FNPSVM were remarkably higher than those of the MLC and BPN, whereas the differences between MLC and BPN and between PSVM and FNPSVM were only slight (Table 7). This is expected because the PSVM and FNPSVM are designed to locate an optimal separating hyperplane, while the other two algorithms may not be able to locate it. Statistically, the optimal separating hyperplanes located by the PSVM and FNPSVM should generalize to unseen samples with the smallest error among all separating hyperplanes. Generally, as the number of available features increased, the overall accuracies and kappa coefficients of PSVM and FNPSVM grew gradually. Unexpectedly, however, an increase in the number of available features did not always improve the accuracies of MLC and BPN. On the contrary, MLC and BPN performed better in training cases with ten features than in those with fourteen features, which might be explained by the presence of a large number of irrelevant features that hurt classification performance. This again demonstrates the importance of feature selection. Table 5 also shows that the accuracies and kappa coefficients of the four algorithms improved as the training data size increased, though not significantly.
3) The overall accuracy differences between MLC and BPN on the data set used in this study were generally small, and those between PSVM and FNPSVM were also not obvious. However, many of them were statistically significant.
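The two multi-class strategies compared in the observations above differ in how binary decisions are combined. The following sketch shows the usual combination rules: majority voting over all class pairs for "one-against-one", and the largest decision value for "one-against-rest". The function names and score values are hypothetical; the binary classifiers themselves (PSVM or FNPSVM in this paper) are assumed to be already trained and are represented only by their decision values.

```python
from itertools import combinations

import numpy as np

def one_against_one_predict(pairwise_scores, n_classes):
    """pairwise_scores[(i, j)] > 0 votes for class i, otherwise for class j."""
    votes = np.zeros(n_classes, dtype=int)
    for i, j in combinations(range(n_classes), 2):
        winner = i if pairwise_scores[(i, j)] > 0 else j
        votes[winner] += 1
    return int(np.argmax(votes))

def one_against_rest_predict(rest_scores):
    """rest_scores[k] is the decision value of the 'class k vs. the rest' classifier."""
    return int(np.argmax(rest_scores))

# Hypothetical decision values for one pixel and three classes.
pairwise = {(0, 1): 1.0, (0, 2): -0.5, (1, 2): -2.0}
label_ovo = one_against_one_predict(pairwise, n_classes=3)
label_ovr = one_against_rest_predict([0.1, -0.3, 0.7])

# Number of binary classifiers needed for k classes under each strategy.
k = 9
n_pairwise = len(list(combinations(range(k), 2)))  # k * (k - 1) / 2
n_rest = k
```

With nine classes, as implied by the n-20-9 network structure, "one-against-one" needs 36 binary classifiers and "one-against-rest" needs 9, although each pairwise classifier is trained on the samples of only two classes.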
(2) Training speed and classification speed. Training speed and classification speed are two important criteria for evaluating classification algorithms. As shown in Table 6, the training speeds and classification speeds of the four classifiers differed substantially. Generally, the training time and classification time rose as the number of available features increased. The training speed of BPN was significantly lower than those of the other three classifiers because of its complex network structure. In all training cases, the classification speeds of the PSVM and FNPSVM were remarkably lower than those of the MLC and BPN. Classification with the MLC and BPN took from a few minutes to less than an hour in all training cases, while the PSVM and FNPSVM took more than several hours and ten hours, respectively. This is because PSVM and FNPSVM involve large matrix calculations and matrix inversion during classification. In addition, it should be noted that we spent much time in the training process searching for the optimal key parameters, including the kernel parameter s and the constant c, in order to obtain better performance. The training speed and classification speed of FNPSVM were more than twice those of PSVM. A comparison of the FNPSVM algorithm in section 2.2.1 with the PSVM algorithm [12] shows the reason: the PSVM algorithm computes the product of an n-dimensional row vector (n being the number of features) and a matrix (m being the number of training samples), which requires a high calculation cost, whereas the FNPSVM algorithm avoids this problem.
To summarize, the training speeds and classification speeds of the four algorithms are affected by many factors, including the numbers of training samples and features, the size of the training data set and the algorithm parameter settings. The training speed and classification speed of BPN depend on the network structure, momentum rate, learning rate and convergence criteria, while those of the PSVM and FNPSVM are affected by the number of features, the kernel function, the key parameter settings and class separability.
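The matrix operations mentioned above come from the fact that proximal SVM training reduces to solving one system of linear equations. As a minimal sketch, the code below implements the linear two-class PSVM in its commonly cited form, where the weight vector w and offset gamma solve (I/nu + E'E)[w; gamma] = E'De with E = [A, -e]. This is not the FNPSVM of this paper: it has no kernel and no fuzzy membership. The names nu, train_linear_psvm and predict_linear_psvm and the toy data are assumptions made for illustration.

```python
import numpy as np

def train_linear_psvm(features, labels, nu=1.0):
    """Solve the single linear system of the linear proximal SVM.

    features: (m, n) array of training samples; labels: m values in {+1, -1}.
    Returns the weight vector w and the offset gamma.
    """
    a = np.asarray(features, dtype=float)
    d = np.asarray(labels, dtype=float)
    e_mat = np.hstack([a, -np.ones((a.shape[0], 1))])   # E = [A, -e]
    lhs = np.eye(e_mat.shape[1]) / nu + e_mat.T @ e_mat  # (n+1) x (n+1) system
    rhs = e_mat.T @ d                                    # E'De, since De equals the label vector
    solution = np.linalg.solve(lhs, rhs)
    return solution[:-1], solution[-1]

def predict_linear_psvm(features, w, gamma):
    """Classify by the sign of x'w - gamma."""
    return np.sign(np.asarray(features, dtype=float) @ w - gamma)

# Hypothetical one-dimensional toy problem.
x_train = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])
w, gamma = train_linear_psvm(x_train, y_train, nu=1.0)
predicted = predict_linear_psvm(x_train, w, gamma)
```

In the linear case the system has only n + 1 unknowns, so training cost is dominated by forming E'E. A nonlinear kernel variant works with matrices whose size grows with the number of training samples, which is consistent with the long classification times reported for the kernel-based classifiers.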
(3) Algorithm stability. The accuracies in Table 5 were obtained by randomly selecting training samples and testing samples only once at each sample size level. To evaluate the stabilities of the four classifiers and to make the results statistically valid, we randomly selected training samples and testing samples six times at three sample size levels from the 270 samples of each class: 60 training samples and 210 testing samples, 90 training samples and 180 testing samples, and 120 training samples and 150 testing samples. Each classification algorithm was thus trained six times at each training sample size, with four and with ten features. Afterwards we calculated the means and standard deviations of the overall classification accuracies of each classifier (see Table 7). The standard deviation of the overall accuracy of an algorithm estimated in cross validation is a quantitative measure of its relative stability [4]. Both Table 7 and Figure 4 revealed that the stabilities of the algorithms differed greatly and were affected by the training data size, the testing data size and the number of features. Generally, the overall classification accuracies of the algorithms were more robust when trained with larger sample sets than with smaller ones, especially when ten features were used (Figure 4 (b), (d), and (f)). Unexpectedly, however, MLC showed higher reliability and lower mean overall accuracy when trained with only four of the 14 features (Table 7). This is probably because the MLC algorithm is sensitive to a few relevant features, while features that are partially or completely irrelevant to the classification target only increase the uncertainty of the classification results.
Table 5. Overall accuracies (%) and kappa coefficients using various multi-class strategies and classifiers on different cases based on ETM+ image.
On the other hand, according to the Hughes effect [51], increasing dimensionality is thought to lower the reliability of the estimates of the statistical parameters required for computing probabilities [10]. The FNPSVM showed more stable overall accuracies than the other three classifiers when trained with ten features at the different training sample size levels. When trained with four of the 14 features, however, the stability of FNPSVM was significantly lower than that of MLC, although clearly higher than those of the other two algorithms (Figure 4 (a), (c), and (e)). The likely cause is that the applicability of the FNPSVM to non-linear decision boundaries depends on whether those boundaries can be transformed into linear ones by mapping the input data into a high-dimensional space. When the data contain very few features, the FNPSVM cannot successfully transform non-linear decision boundaries in the original feature space into linear ones in a high-dimensional feature space. The MLC algorithm, in contrast, is useful when the data are generated with a fair amount of randomness: its theoretical statistical distribution makes the MLC approach optimal in that setting, whereas too many irrelevant features probably harm its stability. Even when the data contain very few features, it therefore has better reliability than the FNPSVM.
Compared to PSVM, FNPSVM showed better reliability in all 6 training cases (Table 7). It automatically associates each data point with a fuzzy membership that reflects its relative degree of meaningfulness, which makes FNPSVM better at reducing the effects of noise or outliers during training. Of the four algorithms, the BPN gave overall accuracies over a wider range than the other three in all cases (Figure 4) and showed the worst reliability (Table 7), because of its complex network structure and its many optional parameters that affect classification performance.
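The stability measure used in this section, the standard deviation of overall accuracy over repeated random splits, can be sketched as follows. The split sizes (270 samples per class, 60 for training) come from the text; the function names, the fixed random seed, the six accuracy values and the use of the sample standard deviation (ddof=1) are assumptions, since the paper does not state which estimator was used.

```python
import numpy as np

def split_indices(n_samples, n_train, rng):
    """Randomly split sample indices into disjoint training and testing sets."""
    order = rng.permutation(n_samples)
    return order[:n_train], order[n_train:]

def stability_summary(accuracies):
    """Mean and sample standard deviation of overall accuracies over repeated runs."""
    values = np.asarray(accuracies, dtype=float)
    return values.mean(), values.std(ddof=1)

rng = np.random.default_rng(0)
train_idx, test_idx = split_indices(270, 60, rng)

# Hypothetical overall accuracies (%) from six random selections.
mean_acc, std_acc = stability_summary([90.0, 92.0, 91.0, 89.0, 93.0, 91.0])
```

A smaller standard deviation across the six runs indicates a more stable classifier at that sample size and feature count.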

Classification Experiments Using SPOT Image
Experiment summary. This study used the SPOT remote sensing image captured on September 12th, 2004 (scene number: 64002TH200409121049401018) as the data source. The image covers the western area of Da'an city in China and includes the near-infrared, red and green bands. We cut a 1,734 × 1,969 pixel sub-image from the SPOT image as the test data set. The experiment area mainly contained five land types, namely heavy saline-alkalized land, moderate saline-alkalized land, light saline-alkalized land, water area and farmland. From each land type, 120 samples (60 training samples and 60 test samples) were selected to train the classification algorithms and to evaluate the classification accuracy.
Table 6. Training time and classification time of whole data set (4,037,099 pixels) using various classifiers on different cases based on ETM+ image (unit: second).
To evaluate the performance of the FNPSVM algorithm in extracting saline-alkalized land from a high spatial resolution image (SPOT with 20 m resolution), we adopted the "one-against-one" strategy based on the Gaussian RBF kernel function and compared FNPSVM with the MLC, BPN and PSVM methods in terms of classification accuracy and classification speed.
Table 7. Means and standard deviations (s) of overall classification accuracies based on various samples and features using ETM+ image.
Feature extraction. We extracted 8 features from the SPOT image data: the near-infrared band, the red band, the green band, the first component of the K-L transform, NDVI, and the H, S and I components of the HSI color space. NDVI is expressed by the following formula:
NDVI = (B3 - B2) / (B3 + B2)
where B2 and B3 are the red band and near-infrared band of SPOT, respectively. The RGB image is a false-color composite of the third, second and first bands of the SPOT image. The RGB model is then transformed to the HSI model. Feature selection was unnecessary because the number of features is limited, so we extracted saline-alkalized land information from the above eight features using the various algorithms.
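As an illustration of the NDVI and HSI features, the sketch below computes NDVI from the red and near-infrared bands and converts RGB values to HSI. The HSI conversion uses one widely used arccos-based form from image processing textbooks; this is an assumption and may differ from the conversion used in this study. The function names are hypothetical, and the band values are assumed to be scaled to the range 0 to 1.

```python
import numpy as np

def ndvi(red, nir):
    """NDVI = (NIR - red) / (NIR + red); returns 0 where both bands are 0."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    total = nir + red
    safe_total = np.where(total == 0, 1.0, total)
    return np.where(total == 0, 0.0, (nir - red) / safe_total)

def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to hue (radians), saturation and intensity."""
    r, g, b = (np.asarray(c, dtype=float) for c in (r, g, b))
    total = r + g + b
    intensity = total / 3.0
    minimum = np.minimum(np.minimum(r, g), b)
    safe_total = np.where(total > 0, total, 1.0)
    saturation = np.where(total > 0, 1.0 - 3.0 * minimum / safe_total, 0.0)
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = np.sqrt(np.maximum((r - g) ** 2 + (r - b) * (g - b), 0.0))
    safe_den = np.where(denominator > 0, denominator, 1.0)
    theta = np.arccos(np.clip(numerator / safe_den, -1.0, 1.0))
    hue = np.where(b <= g, theta, 2.0 * np.pi - theta)
    hue = np.where(denominator > 0, hue, 0.0)  # hue is undefined for gray pixels
    return hue, saturation, intensity

# Hypothetical reflectance values for a single pixel.
pixel_ndvi = ndvi(red=0.2, nir=0.6)
hue, saturation, intensity = rgb_to_hsi(0.0, 1.0, 0.0)
```

Both functions accept whole band arrays as well as single values, so they can be applied to an entire image at once.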
Key parameter setting. We used the method in section 3.4 to obtain the parameters of the PSVM and FNPSVM classifiers. The accuracy of the FNPSVM classifier did not change significantly when t1 varied within the range 0.05-0.2 and t2 within the range 0.7-0.9, so we again set t1 = 0.1 and t2 = 0.8. Afterwards, the optimal values of the parameters c and s were searched, and the results are shown in Tables 8 and 9. The overall classification accuracy reached a maximum of 97.26% at c = 2^11 and s = 2^-13.
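A layered search for c and s over powers of two can be sketched as a coarse grid followed by a finer grid around the best coarse point. The exponent ranges, the step sizes and the toy scoring function below are hypothetical; in practice the score would be the mean validation accuracy of the classifier over several random splits, and the exact search procedure of section 3.4 may differ.

```python
import itertools
import math

def grid_search(score, c_exponents, s_exponents):
    """Return (best score, c exponent, s exponent) over a grid of powers of two."""
    best = None
    for c_exp, s_exp in itertools.product(c_exponents, s_exponents):
        value = score(2.0 ** c_exp, 2.0 ** s_exp)
        if best is None or value > best[0]:
            best = (value, c_exp, s_exp)
    return best

def refine(center, span=1.0, step=0.25):
    """Finer exponent grid centered on the best coarse exponent."""
    count = int(round(span / step))
    return [center + k * step for k in range(-count, count + 1)]

def two_layer_search(score, coarse_c, coarse_s):
    """First layer: coarse grid. Second layer: fine grid around the coarse optimum."""
    _, c_exp, s_exp = grid_search(score, coarse_c, coarse_s)
    return grid_search(score, refine(c_exp), refine(s_exp))

def toy_score(c, s):
    """Hypothetical stand-in for validation accuracy, peaking at c = 2^11, s = 2^-13."""
    return -((math.log2(c) - 11.0) ** 2 + (math.log2(s) + 13.0) ** 2)

coarse = range(-15, 16, 2)
best_value, best_c_exp, best_s_exp = two_layer_search(toy_score, coarse, coarse)
```

Searching exponents rather than raw values covers several orders of magnitude with few evaluations, and the second layer spends the remaining budget only near the most promising region.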
The parameter settings of BPN were the same as those in section 4.1.3.2 except for the network structure. We chose the log-sigmoid function as the transfer function from the input layer to the output layer, and limited the number of iterations to 1,000 for each desired output. The Levenberg-Marquardt optimization algorithm was used as the training function. The other parameters of the network were set as follows: learning rate g = 0.3, momentum factor a = 0.8, minimum gradient d = 10^-20, and minimum mean square error e = 10^-6. The number of neurons in the hidden layer was determined through repeated experiments to be 12, so the structure of the neural network was 8-12-5.
Performance assessments.
(1) Visual effect. The original SPOT image of the study area was obtained by compositing bands 3, 2 and 1 (see Figure 5). Figure 6 shows the classification results of the MLC, BPN, PSVM and FNPSVM algorithms, based on 8 features and 60 training samples per class. Compared with the original SPOT image, the classification maps of the algorithms do not differ clearly at the macro level. In detail, however, the spatial patterns of land cover produced by the PSVM and FNPSVM methods are significantly better than those of the other two methods, while the maps of PSVM and FNPSVM are quite similar to each other. As seen in Figure 6, with the MLC and BPN classifiers the patches were fragmented, saline-alkalized land and water area were confused, and a large amount of saline-alkalized land was wrongly classified as water area. The BPN, PSVM and FNPSVM classifiers nevertheless all overcame the drawbacks of the MLC classifier.
(2) Classification accuracy. We obtained the confusion matrix from the 60 test samples of each land type and then calculated the overall accuracy and kappa coefficient of the various classifiers (see Table 10).
As seen from Table 10, the overall accuracies and kappa coefficients of all classifiers were high, although the classification accuracy of the MLC classifier was less than 90%. The accuracies of the other classifiers were all higher than 95%. The overall accuracy of FNPSVM was the highest (97.33%), which indicates that FNPSVM performed better than the others and is consistent with its rigorous mathematical foundation.
(3) Classification speed. Classification speed is one of the important indicators of classifier performance. Table 10 gives the classification time based on the 8-dimensional feature vector. As can be seen from the table, the MLC classifier was the fastest, mainly because of the simplicity of the algorithm and its low amount of computation. It was followed by the BPN classifier, which was slower by only a few minutes. The PSVM and FNPSVM were substantially slower, lagging by several hours. The speed of FNPSVM was twice that of PSVM, for the same reason as in the ETM+ image classification, which is not repeated here.
(4) Algorithm stability. Based on the above experiment data, feature extraction and key parameter settings, we obtained the overall classification accuracies of the various classifiers (see Table 10) and calculated their means and standard deviations (see Table 11).
As seen from Table 11, the stabilities of the algorithms under the same training condition were quite different. The stability of the BPN classifier was the lowest because of its complex network structure and the many optional parameters affecting classification performance. The FNPSVM showed more stable overall accuracies than the other three classifiers because, by automatically associating each data point with a fuzzy membership during training, it could effectively reduce the effects of noise.

Conclusions
Considered as a kind of regularized least squares SVM, PSVM requires the solution of a single set of linear equations and can thus be considerably faster than the conventional SVM. Jayadeva [16] extended the PSVM and proposed the fuzzy linear proximal support vector machine. To increase the nonlinear separability of real data sets, we presented in this paper the fuzzy nonlinear proximal support vector machine (FNPSVM) and described the strategy for setting the fuzzy membership in FNPSVM, which makes FNPSVM better suited to reducing the effects of noise or outliers. Numerous experiments were performed to compare the performance of this algorithm with that of three other popular classifiers, the MLC, BPN and PSVM, in saline-alkalized land classification. In addition, the impact of the key parameters of the FNPSVM algorithm on its performance, as well as the impact of the selection of training data and features on all four classifiers, was evaluated.
The results of our experiments supported the use of the "one-against-one" strategy for multi-class classification problems. They indicated that, of the four algorithms evaluated, both the PSVM and FNPSVM achieved considerably higher overall accuracies and kappa coefficients than either the MLC or the BPN, especially in high-dimensional feature space, and that the accuracies and kappa coefficients of the MLC were the lowest in all 12 training cases. These results should be attributed to the ability of PSVM and FNPSVM to locate the optimal separating hyperplanes, as shown in Figure 1. Statistically, the optimal separating hyperplanes found by the PSVM and FNPSVM classifiers should generalize to unlabeled samples with smaller errors than any other separating hyperplanes that might be located by other classifiers. The absolute differences between the overall accuracies of the PSVM and FNPSVM classifiers were quite small. Many of the differences were, however, statistically significant.
The stabilities of PSVM and FNPSVM are closely related to the features used in the classification. The PSVM and FNPSVM algorithms were more stable than either the MLC or the BPN when trained with 10 features at different training sample sizes. When only 4 features were used, however, the MLC showed comparatively better reliability. Between the PSVM and FNPSVM classifiers, the stability of FNPSVM was significantly higher than that of the PSVM, because the fuzzy set approach reduced the effects of noise or outliers. With regard to classification speed on the larger dataset of 4,037,099 points, the MLC and BPN were much faster than the PSVM and FNPSVM, and the computational cost of the PSVM was more than twice that of the FNPSVM. This indicates that the algorithm in this paper has a clear advantage in running speed over PSVM. Both the PSVM and FNPSVM were affected by training data size, key parameter settings, class separability and other factors.
When we adopted the multi-layer grid search and random-validation method to search for the optimal parameters t1 and t2, we found that with t1 smaller than 0.2 and t2 larger than 0.7, the accuracy of the FNPSVM classifier varied little and remained relatively high; within this range, the sensitivity was low. With t1 between 0.2 and 0.5, or t2 between 0.5 and 0.7, the accuracy of the FNPSVM classifier varied considerably and dropped significantly, with obvious cases of misclassification; within this range, the sensitivity was relatively high. This is mainly because t1 and t2 determine the ranges in which a sample point is considered to belong, or not to belong, absolutely to a given class. When t1 is too high or t2 too low, the probability of misclassification increases, and the classification result becomes relatively sensitive to the values of t1 and t2. This is also consistent with our experience.
Both the selection and the number of training samples and testing samples affect the performance of all four classifiers. It is impractical, and even impossible, to determine from these experiments the minimum number of samples needed to train an algorithm sufficiently. Fuzzy classification can reduce the corruption that noise introduces into the data samples and thereby improves classification robustness. Feature extraction and feature selection also have a strong impact on accuracy, as irrelevant features can degrade the classification performance of some algorithms. As SVM is becoming a popular learning machine for object classification, the principal contribution of this paper is a fast, simple and efficient classification algorithm in the field of SVM research, which significantly reduces the running time of classification and improves stability.
Currently, because of China's increasing population and its far from perfect land management system, the conflict between a growing population and shrinking land resources is becoming increasingly apparent. Drastically reduced farmland and land degradation have drawn public attention. The FNPSVM method proposed in this paper can rapidly and accurately extract information on land use and its dynamic changes. As a foundation for establishing general land utilization plans, protecting basic farmland and ensuring sustainable development of the land, the method can also provide decision support to the authorities responsible for land utilization and management. The algorithm has been successfully used for land extraction from remote sensing images, and with the development of the digital city, the extraction of land use information in cities is becoming increasingly important. This algorithm can be used to classify urban patches and can thus contribute to the construction of a digital city.
Table 9. The overall accuracies (%) and kappa coefficients of the second layer grid-search using 6-time random-validation based on SPOT image.