Support vector machine with quantile hyper-spheres for pattern classification

This paper formulates a support vector machine with quantile hyper-spheres (QHSVM) for pattern classification. The idea of QHSVM is to build two quantile hyper-spheres with the same center for positive or negative training samples. Every quantile hyper-sphere is constructed by using pinball loss instead of hinge loss, which makes the new classification model insensitive to noise, especially the feature noise around the decision boundary. Moreover, the robustness and generalization of QHSVM are strengthened by maximizing the margin between the two quantile hyper-spheres, maximizing the inner-class clustering of samples and optimizing an independent quadratic programming problem for each target class. In addition, this paper proposes a novel local center-based density estimation method. Based on it, ρ-QHSVM with surrounding and clustering samples is given. The execution speed of ρ-QHSVM can be adjusted while high accuracy is maintained. The experimental results on artificial, benchmark and strip steel surface defects datasets show that the QHSVM model has distinct advantages in accuracy and that the ρ-QHSVM model is suitable for large-scale datasets.


Introduction
Support vector machine (SVM) [1], proposed by Vapnik and his co-workers, has become an excellent tool for machine learning. SVM integrates the margin maximization principle, the kernel trick and the dual method. It rests on a well-developed statistical theory, which has led to its wide application in many fields [2][3][4]. Nevertheless, considerable effort has gone into improving SVM, and SVMs with different attributes have been proposed, such as least squares SVM (LS-SVM) [5], proximal SVM (PSVM) [6], ν-SVM [7], fuzzy SVM (FSVM) [8] and pinball loss SVM (Pin-SVM) [9].
In 2007, Jayadeva et al. proposed a twin support vector machine (TWSVM) [10] for pattern classification. TWSVM is derived from generalized eigenvalue proximal SVM (GEPSVM) [11]. GEPSVM and the other multi-surface classifiers [12][13] are used to solve XOR problems and reduce the computing time of SVM. Similarly, the TWSVM classifier determines two nonparallel separating hyper-planes by solving two quadratic programming problems (QPPs) of smaller size. TWSVM has advantages in classification speed and generalization, which has made it a popular new tool for machine learning. Based on TWSVM, some extended TWSVMs have been proposed, such as least squares TWSVM (LS-TSVM) [14], twin bounded SVM (TBSVM) [15], twin parametric-margin SVM (TPMSVM) [16], Laplacian TWSVM (LTWSVM) [17] and weighted TWSVM with local information (WLTSVM) [18].
Support vector data description (SVDD) [19], inspired by the support vector classifier, is a one-class learning tool. SVDD implements the minimum volume description by building a hyper-sphere for target samples. For the case where negative samples are available, [19] provided a new SVDD with negative examples (SVDD_neg). SVDD_neg merges negative samples into the training dataset to improve the description of the hyper-sphere with the minimum volume. Different classifiers have been extended from SVDD because it gathers the samples of a class as tightly as possible. These classifiers include maximal-margin spherical-structured multi-class SVM (MSM-SVM) [20], twin support vector hyper-sphere (TSVH) [21], twin-hypersphere support vector machine (THSVM) [22], maximum margin and minimum volume hyper-spheres machine with pinball loss (Pin-M$^3$HM) [23] and least squares twin support vector hyper-sphere (LS-TSVH) [24].
A main challenge for all versions of SVM is to avoid the adverse impact of noise. As mentioned in [9], classification problems may have label noise and feature noise. So, anti-noise versions of SVM have been proposed. [13] proposed an L1-norm twin projection support vector machine, in which the L1-norm is shown to be robust to noise and outliers in data. [25] overcame the impact of noise on LS-SVM by varying the weights. [26] adopted a robust optimization method in SVM to deal with uncertain noise. [27] built a total margin SVM whose separating hyperplane is insensitive to noise. [8] built a fuzzy SVM by assigning a fuzzy membership to each input sample, which restrains the adverse effect brought by noise. These versions of SVM have achieved some success in avoiding the adverse impact of noise, but they are not good at dealing with the feature noise around the decision boundary. In 2014, Huang et al. [9] designed a novel Pin-SVM by introducing pinball loss. Pin-SVM uses pinball loss to replace hinge loss, which makes Pin-SVM not only maintain the good properties of SVM, but also be less sensitive to noise, especially the feature noise around the decision boundary. As such, the pinball loss has subsequently been introduced into different versions of SVM in [23], [28] and [29].
In this paper, a novel support vector machine with quantile hyper-spheres (QHSVM) for pattern classification is proposed. It inherits the strengths of SVDD_neg, TWSVM and Pin-SVM. QHSVM has the following attributes and advantages.
a. QHSVM adopts pinball losses instead of hinge losses. The hinge losses, which maximize the shortest distance between two classes of samples, are sensitive to noise. The pinball losses replace the shortest distance with a quantile distance. Because the quantile distance depends on many samples, it reduces the sensitivity to noise, especially the feature noise around the decision boundary. So, QHSVM improves the anti-noise ability of hyper-spheres by using the pinball losses.
b. QHSVM searches for two quantile hyper-spheres with the same center for positive or negative samples. On the premise of using pinball losses, the volume of one quantile hyper-sphere is required to be as small as possible, while that of the other one is required to be as big as possible. Moreover, QHSVM requires the target samples to be as close as possible to the common center of the two hyper-spheres. These attributes ensure that the margin maximization principle and the inner-class clustering maximization of samples are implemented.
c. QHSVM has a QPP for positive or negative samples. The QPP treats one class as the target class and the other class as the negative class. QHSVM explores the potential information of target samples to the greatest extent, and the negative samples are used to improve the description of the hyper-sphere. These attributes improve the generalization of QHSVM.
d. In order to meet the classification requirement of high efficiency, a new local center-based density estimation method is proposed, and QHSVM with surrounding and clustering samples (ρ-QHSVM) is given. The local center-based density estimation method can appropriately split training samples into surrounding samples and clustering samples. The hyper-spheres of ρ-QHSVM are described by sparse surrounding samples, while the center of the hyper-spheres is determined by clustering samples.
In [23], Pin-M$^3$HM also inherits from THSVM and Pin-SVM, so our QHSVM may seem similar to Pin-M$^3$HM. In fact, our QHSVM differs from Pin-M$^3$HM in the above attributes (b), (c) and (d). Furthermore, our QHSVM formulates two QPPs with the same structure, whereas Pin-M$^3$HM has two QPPs with different structures.
This paper is organized as follows. Section 2 reviews related work. Section 3 proposes the model of QHSVM and the local center-based density estimation method. Section 4 solves the new QHSVM and ρ-QHSVM. Section 5 deals with experimental results and Section 6 contains concluding remarks.

Support vector machines with hinge loss and pinball loss
For binary classification, the hinge loss is widely used. The hinge loss proposed in [1] leads to the popular standard SVM classifier. Suppose a training dataset $T_r = \{(X_1,y_1),(X_2,y_2),\cdots,(X_m,y_m)\}$, where $X_i \in \mathbb{R}^{d\times 1}$ and $y_i \in \{1,-1\}$. Standard SVM searches for an optimal separating hyperplane $w^T\varphi(x)+b = 0$ by convex optimization, where $w \in \mathbb{R}^{d\times 1}$, $b \in \mathbb{R}$ and $\varphi(\cdot)$ is a nonlinear feature mapping function. Its corresponding optimization problem can be described as follows:

$$\min_{w,b}\ \frac{1}{2}w^T w + c\sum_{i=1}^{m}(L_h)_i \tag{1}$$

where $c$ is a trade-off parameter. The hinge loss $(L_h)_i$ is given by

$$(L_h)_i = \max\{0,\ 1 - y_i(w^T\varphi(X_i)+b)\} \tag{2}$$

Substituting (2) into (1), the final QPP of SVM can be obtained:

$$\min_{w,b,\xi}\ \frac{1}{2}w^T w + c\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i(w^T\varphi(X_i)+b) \ge 1-\xi_i,\ \ \xi_i \ge 0,\ \ i = 1,2,\cdots,m \tag{3}$$

QPP (3) of SVM searches for two support hyper-planes $w^T\varphi(x)+b = \pm 1$ by maximizing the shortest distance between two classes of samples. The support hyper-planes are boundary hyper-planes, so SVM is sensitive to noise. In 2014, Huang et al. [9] proposed a Pin-SVM classifier by introducing the pinball loss into standard SVM. Pin-SVM has the good properties of standard SVM and is insensitive to noise, especially the feature noise around the decision boundary. The pinball loss in [9] is as follows:

$$(L_\tau)_i = \begin{cases} 1 - y_i(w^T\varphi(X_i)+b), & 1 - y_i(w^T\varphi(X_i)+b) \ge 0 \\ -\tau\,\bigl(1 - y_i(w^T\varphi(X_i)+b)\bigr), & 1 - y_i(w^T\varphi(X_i)+b) < 0 \end{cases} \tag{4}$$

where $\tau$ is an adjusting parameter. Replacing $(L_h)_i$ in (1) with $(L_\tau)_i$, the QPP of Pin-SVM can be obtained:

$$\min_{w,b,\xi}\ \frac{1}{2}w^T w + c\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i(w^T\varphi(X_i)+b) \ge 1-\xi_i,\ \ y_i(w^T\varphi(X_i)+b) \le 1+\frac{1}{\tau}\xi_i,\ \ i = 1,2,\cdots,m \tag{5}$$

Pin-SVM is insensitive to noise because the pinball loss is correlated with quantiles [30][31]. The pinball loss in (5) changes the idea of (3) into maximizing the quantile distance. In particular, when $\tau \to 0$, Pin-SVM reduces to SVM. The decision functions of QPPs (3) and (5) can be determined by using the Lagrangian function, the Karush-Kuhn-Tucker (KKT) conditions and the kernel function. Their formulas can be found in [1] and [9].
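The difference between the two losses can be made concrete with a minimal sketch. The function names `hinge_loss` and `pinball_loss` are illustrative and do not come from [1] or [9]; `f` stands for the classifier score $w^T\varphi(X_i)+b$.

```python
def hinge_loss(y, f):
    """Hinge loss max(0, 1 - y*f) for a label y in {+1, -1} and a score f."""
    return max(0.0, 1.0 - y * f)


def pinball_loss(y, f, tau):
    """Pinball loss: u if u >= 0, otherwise -tau*u, where u = 1 - y*f."""
    u = 1.0 - y * f
    return u if u >= 0.0 else -tau * u


# A sample well beyond its support hyper-plane (y*f = 2) costs nothing under
# the hinge loss but is still penalised, with weight tau, by the pinball loss.
print(hinge_loss(+1, 2.0), pinball_loss(+1, 2.0, tau=0.5))
```

With `tau = 0` the second branch vanishes and the pinball loss coincides with the hinge loss, which mirrors the statement that Pin-SVM reduces to SVM when τ tends to 0.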

Twin support vector machine
TWSVM determines two nonparallel hyperplanes by optimizing two QPPs, which is different from standard SVM. Each QPP of TWSVM has a form similar to that of standard SVM, but its size is smaller than the single QPP of SVM. So, TWSVM is comparable with SVM in classification accuracy and has higher efficiency. Moreover, TWSVM is excellent at dealing with datasets with cross planes.

Support vector data description with negative examples
SVDD is an efficient method for solving a one-class data description problem. It builds a hyper-sphere of minimum volume to cover one class of target samples. The hyper-sphere embodies the inner-class clustering maximization of samples. Based on SVDD, SVDD_neg adds negative samples. When negative samples are available, they can improve the hyper-sphere description of target samples. The QPP of SVDD_neg is given by (9), where R and C are the radius and center of the hyper-sphere respectively. QPP (9) requires the target samples to be inside the hyper-sphere and the negative samples to be outside it. On one hand, this requirement ensures that the hyper-sphere describes a closed boundary around the target samples well. On the other hand, it can be used to distinguish the target samples from the negative samples. Inspired by SVDD_neg, some classifiers with hyper-spheres have been proposed in [20][21][22][23][24].

Pinball losses for quantile hyper-spheres
The idea of QHSVM is similar to SVDD_neg in building a hyper-sphere. However, unlike SVDD_neg, it builds two hyper-spheres with the same center for the target samples. We first consider a support vector machine with boundary hyper-spheres (BHSVM). BHSVM has two boundary hyper-spheres with the same center for target samples, which are shown in Fig 1(A). For binary classification, $X_i^+$ is first considered as a target sample, so $X_j^-$ is considered as a negative sample. These two boundary hyper-spheres must satisfy the inequality constraints (10) and (11), where $R^+$ is the radius of the boundary hyper-sphere covering the target samples and $\hat{R}^+$ is the radius of the other boundary hyper-sphere. $C^+$ is the center of the two hyper-spheres. The negative samples are outside of the hyper-sphere with the radius $\hat{R}^+$. $\xi_i^+$ and $\hat{\xi}_j^+$ are the corresponding slack variables. Moreover, BHSVM requires $\min R^+$ and $\max \hat{R}^+$. So, BHSVM satisfying (10) and (11) maximizes the shortest distance between two classes of samples. Hinge losses are adopted in (10) and (11), which can be given as (12). The hinge losses (12) are shown in Fig 1(C). It is known that the hinge losses are sensitive to noise [9]. In order to reduce the adverse effect brought by noise, QHSVM is generated by introducing the pinball losses to BHSVM. In this respect, QHSVM inherits the ideas of Pin-SVM. The pinball losses for quantile hyper-spheres can be expressed as (14). The pinball losses (14) are shown in Fig 1(D). If the hinge losses in (10) and (11) are replaced by (14), then the two inequality constraints with pinball losses, (15) and (16), can be obtained. Under the constraints of (15) and (16), the hyper-spheres of QHSVM are insensitive to noise because they are quantile hyper-spheres. The quantile hyper-spheres are shown in Fig 1(B). In this way, the quantile distance is maximized instead of the shortest distance.
Compared with (10), (15) requires that some samples be distributed outside of the hyper-sphere, and the proportion of such samples is controlled by the parameter $\tau$. That is to say, maximizing the quantile distance of QHSVM depends on a number of samples. So, QHSVM is insensitive to noise, especially the feature noise around the decision boundary. When $\tau \to 0$, (15) becomes (10). For (16), a similar conclusion can be drawn.
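Why a pinball loss turns a boundary radius into a quantile radius can be illustrated with a deliberately simplified one-dimensional sketch. It keeps only the summed pinball term of a single hyper-sphere and ignores every other term of the QHSVM objective, so it is not the QHSVM solution; `pinball` and `best_radius` are hypothetical helper names, and `dists` stands for the squared distances of target samples to a fixed center.

```python
def pinball(u, tau):
    """Pinball loss on u = (squared distance) - (squared radius)."""
    return u if u >= 0.0 else -tau * u


def best_radius(dists, tau):
    """Squared radius, chosen among the observed distances, that minimises
    the summed pinball loss. Ties are resolved towards the smaller radius."""
    return min(sorted(dists),
               key=lambda r: sum(pinball(d - r, tau) for d in dists))


dists = [float(d) for d in range(1, 11)]
for tau in (0.0, 0.25, 1.0):
    r = best_radius(dists, tau)
    inside = sum(d <= r for d in dists) / len(dists)
    print(tau, r, inside)
```

In this toy setting the minimiser leaves a fraction of roughly $1/(1+\tau)$ of the samples inside the sphere: with $\tau = 0$ the sphere must enclose every sample, as a boundary hyper-sphere does, while a larger $\tau$ pulls the radius inward so that it is determined by many samples rather than by the few outermost ones.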
For binary classification, the other case is that $X_j^-$ is a target sample and $X_i^+$ is a negative sample. Similarly, the corresponding pinball losses can be obtained, and the inequality constraints with pinball losses can be expressed as (19) and (20), where $C^-$ is the center of the two quantile hyper-spheres. $R^-$ and $\hat{R}^-$ are the radii of the two quantile hyper-spheres respectively. $\xi_j^-$ and $\hat{\xi}_i^-$ are the corresponding slack variables.

Primal formulation and analysis
For binary classification, consider two datasets $X^+ = \{X_i^+ \mid i = 1,2,\cdots,m^+\}$ and $X^- = \{X_j^- \mid j = 1,2,\cdots,m^-\}$. Next, we formulate two QPPs, (21) and (22), with the inequality constraints (15), (16), (19) and (20), where $\tilde{X}^+$ and $\bar{X}^+$ represent two datasets in class +1, and $\tilde{X}^-$ and $\bar{X}^-$ represent two datasets in class -1. The numbers of samples in the four datasets are $\tilde{m}^+$, $\tilde{m}^-$, $\bar{m}^+$ and $\bar{m}^-$ respectively. For QHSVM, these datasets are specified as $X^+$ or $X^-$. So, for QHSVM, QPPs (21) and (22) need to satisfy the condition (23).
For QPP (21) with the condition (23), $X_i^+$ is the target sample, and $X_j^-$ is the negative sample. QPP (21) searches for two quantile hyper-spheres: $O^+$ and $\hat{O}^+$. Their radii are $R^+$ and $\hat{R}^+$, and the two hyper-spheres have the same center $C^+$. The first term of the objective function in QPP (21) minimizes $(R^+)^2$, which tends to keep the volume of $O^+$ as small as possible. The second term maximizes $(\hat{R}^+)^2$, which forces the volume of $\hat{O}^+$ to be as big as possible. Together, minimizing $(R^+)^2$ and maximizing $(\hat{R}^+)^2$ keep the margin between $O^+$ and $\hat{O}^+$ as big as possible, which embodies the margin maximization principle. The first and the second constraint conditions in QPP (21) make $O^+$ a quantile hyper-sphere controlled by $\tau$ instead of a boundary hyper-sphere, because some target samples fall outside of $O^+$. The third and the fourth constraint conditions in QPP (21) also make $\hat{O}^+$ a quantile hyper-sphere controlled by $\tau$, because some negative samples fall inside of $\hat{O}^+$. These constraints make the maximum margin depend on many samples instead of few samples, which ensures that QPP (21) is insensitive to noise, especially the feature noise around the decision boundary. The third and the fourth terms of the objective function in QPP (21) minimize the sum of slack variables caused by samples that do not satisfy the constraint conditions. The fifth term and its constraint condition require the target samples to be distributed as close to the center of $O^+$ as possible. In other words, the center of $O^+$ is close to the cluster of target samples. This means our QHSVM exploits the prior structural information of target samples, and it should therefore not be sensitive to the structure of the data distribution. So, the term ensures that the inner-class clustering of samples is maximized. The last constraint condition ensures the radius of $O^+$ is not smaller than that of $\hat{O}^+$. $c_2^+$, $c_3^+$ and $v^+$ are trade-off parameters. For QPP (22) with the condition (23), $X_j^-$ is the target sample, and $X_i^+$ is the negative sample. QPP (22) is similar to QPP (21) in attributes and conclusions, so the analysis is not repeated.
Similar to TWSVM, QHSVM builds two support hyper-spheres for binary classification. For QPP (21) with the condition (23), $O^+$ with parameters $C^+$ and $R^+$ is referred to as the support hyper-sphere of the target sample $X_i^+$. $O^+$ is described by using the margin maximization principle and the inner-class clustering maximization of samples, and it is insensitive to noise. The negative sample $X_j^-$ is only used to improve the description of $O^+$ by implementing the margin maximization principle. Similarly, for QPP (22) with the condition (23), $O^-$ with parameters $C^-$ and $R^-$ is regarded as the support hyper-sphere of $X_j^-$, and $X_i^+$ is only used to improve the description of $O^-$. All of the above helps to improve the generalization of QHSVM.
For binary QHSVM, the following decision function can be obtained.

QHSVM with surrounding and clustering samples
For QPPs (21) and (22) with the condition (23), all training samples are used for optimization with inequality constraints, which means QHSVM is suitable for classification tasks without a high efficiency requirement. For classification problems that demand high efficiency, we provide a QHSVM with surrounding and clustering samples, which is called ρ-QHSVM. The surrounding samples refer to samples that are distributed near the boundary of the quantile hyper-spheres. In the case of $X^+$, its surrounding samples are distributed near the boundary of $O^+$. The clustering samples refer to samples that are distributed near the center of the quantile hyper-spheres. The quantile hyper-spheres of ρ-QHSVM can be obtained by using sparse surrounding samples rather than all samples. So, the training samples should be divided into surrounding samples and clustering samples. In order to achieve this, a novel local center-based density estimation method is proposed. Local center-based density estimation originates from the kernel density estimation in [32]. Kernel density estimation yields a Gaussian weight by calculating the distance between a sample and its K-nearest neighbors. This kernel density weight can efficiently characterize the local geometry of the sample manifold, but it cannot capture surrounding samples in the training dataset. So, the local center-based density estimation method is designed.
Consider a training dataset $X = \{X_i \mid i = 1,2,\cdots,m\}$. Firstly, the kernel function $C(X_i,X_l) = \varphi(X_i)\cdot\varphi(X_l)$ is introduced. Then, the steps of the local center-based density estimation method are given in the nonlinear feature mapping space:
Step 1: Calculate the square distance between each sample $X_i$ and the others.
Step 2: Search for K-nearest neighbors in nonlinear feature mapping space for each sample $X_i$.
Step 3: Calculate the mean of square distances for the training dataset.
Step 4: Calculate the kernel density weight for each sample X i .
Step 5: Determine the center of K-nearest neighbors for sample X i .
Step 6: The local center-based density of X i is estimated as follows.
It can be seen from the above steps that the local center-based density of $X_i$ is estimated with the distance between the sample and its K-nearest neighbors, where K is given by the user. Moreover, the local center-based density is a Gaussian kernel density. When $q_i = 1$, $\rho_i = \rho_i^w$. A bigger $\rho_i^w$ indicates that $X_i$ is closer to its K-nearest neighbors. So, $\rho_i^w$ can be used to check whether $X_i$ is a clustering sample or an isolated sample. However, $\rho_i^w$ cannot be used to identify surrounding samples. The training dataset can be divided into clustering samples and surrounding samples from the center outward. The surrounding samples distributed near the boundary of the quantile hyper-spheres deviate from the center of their K-nearest neighbors. Fig 2 shows that the surrounding sample $x_s$ is far from the center of its K-nearest neighbors, while the clustering sample $x_c$ is close to that center. This is their distinctive characteristic. So, $q_i$ is used to represent the degree of deviation. When $q_i \neq 1$, each $\rho_i$ must be compensated with $q_i$. $\rho_i$ is called the local center-based density. The smaller $\rho_i$ is, the closer $X_i$ is to the boundary. The bigger $\rho_i$ is, the closer $X_i$ is to the clustering region. On the other hand, the Gaussian kernel parameter $\delta^2$ is set as $\bar{d}^2$, the mean of square distances for the training dataset, which makes (30) fit for different training datasets with different clustering degrees.
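The six steps can be sketched in code as follows. The exact formulas for the kernel density weight $\rho_i^w$, the deviation factor $q_i$ and the density $\rho_i$ are not reproduced in the text above, so the Gaussian forms used here are assumptions chosen only to match the stated properties ($\rho_i = \rho_i^w$ when $q_i = 1$, and $\delta^2 = \bar{d}^2$). The function names are hypothetical, and the mean in Step 3 is assumed to run over all pairs of distinct samples.

```python
import math


def linear_kernel(a, b):
    """Linear kernel, used here as a simple stand-in for C(X_i, X_l)."""
    return sum(p * q for p, q in zip(a, b))


def local_center_density(X, K, kernel=linear_kernel):
    """Assumed sketch of Steps 1-6; returns one density value per sample."""
    m = len(X)
    G = [[kernel(X[i], X[l]) for l in range(m)] for i in range(m)]
    # Step 1: squared feature-space distances computed from the kernel matrix.
    D2 = [[G[i][i] - 2.0 * G[i][l] + G[l][l] for l in range(m)] for i in range(m)]
    # Step 3: mean squared distance over all pairs i != l, used as delta^2.
    d2_mean = sum(D2[i][l] for i in range(m) for l in range(m) if l != i) / (m * (m - 1))
    d2_mean = d2_mean or 1.0  # guard for the degenerate case of identical samples
    rho = []
    for i in range(m):
        # Step 2: indices of the K nearest neighbours of X_i.
        nb = sorted((l for l in range(m) if l != i), key=lambda l: D2[i][l])[:K]
        k = len(nb)
        # Step 4 (assumed form): Gaussian kernel density weight rho_w.
        rho_w = sum(math.exp(-D2[i][j] / d2_mean) for j in nb) / k
        # Step 5: squared distance from X_i to the centre of its neighbours.
        cross = sum(G[i][j] for j in nb) / k
        within = sum(G[j][l] for j in nb for l in nb) / (k * k)
        dist_c2 = max(0.0, G[i][i] - 2.0 * cross + within)
        # Step 6 (assumed form): q_i = 1 at the centre, smaller when X_i deviates.
        q = math.exp(-dist_c2 / d2_mean)
        rho.append(q * rho_w)
    return rho


rho = local_center_density([[0.0], [1.0], [2.0], [3.0], [4.0]], K=2)
print(rho)
```

On five equally spaced points on a line, the two end points receive the lowest density because they lie far from the center of their neighbors, while the middle point sits exactly at that center ($q_i = 1$). This is the behavior the method relies on to separate surrounding samples from clustering samples.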
According to the ratio ε, the dataset $X^+$ is divided into the surrounding samples $\tilde{X}^+$ and the clustering samples $\bar{X}^+$, where $\tilde{\rho}_{\tilde{i}}^+$ and $\bar{\rho}_{\bar{i}}^+$ are the local center-based densities of $\tilde{X}_{\tilde{i}}^+$ and $\bar{X}_{\bar{i}}^+$ respectively. The principle of division is $\tilde{\rho}_{\tilde{i}}^+ < \bar{\rho}_{\bar{i}}^+$. So, $\tilde{X}_{\tilde{i}}^+$ is a surrounding sample with small local center-based density, and $\bar{X}_{\bar{i}}^+$ is a clustering sample with big local center-based density.
Similarly, the dataset $X^-$ is divided into $\tilde{X}^-$ and $\bar{X}^-$ according to the ratio ε, where $\tilde{\rho}_{\tilde{j}}^-$ and $\bar{\rho}_{\bar{j}}^-$ are the local center-based densities of $\tilde{X}_{\tilde{j}}^-$ and $\bar{X}_{\bar{j}}^-$ respectively. The principle of division is $\tilde{\rho}_{\tilde{j}}^- < \bar{\rho}_{\bar{j}}^-$. Based on $\tilde{X}^+$, $\bar{X}^+$, $\tilde{X}^-$ and $\bar{X}^-$, the two QPPs of ρ-QHSVM are expressed as (21) and (22). Compared with QHSVM, $X_i^+$ and $X_j^-$ are replaced by $\tilde{X}_i^+$ and $\tilde{X}_j^-$ respectively. $\tilde{X}^+$ and $\tilde{X}^-$ are sparse datasets because the number of samples is reduced greatly, and their sparseness is controlled by ε. This means that the number of samples with inequality constraints is greatly reduced. So, the optimization speed of ρ-QHSVM is improved. Moreover, it can be seen from QPP (21) that the optimization accuracy is controlled by boundary samples. So, for QPP (21), the surrounding sample sets $\{\tilde{X}_i^+ \mid i = 1,2,\cdots,\tilde{m}^+\}$ and $\{\tilde{X}_j^- \mid j = 1,2,\cdots,\tilde{m}^-\}$ ensure the optimization accuracy because they include boundary samples. On the other hand, compared with QHSVM, $X_i^+$ is replaced by $\bar{X}_l^+$ in the equality constraint, which shows that the center of the support hyper-sphere is closer to the samples with higher clustering degree. $\bar{X}^+$ is also a sparse dataset, because the number of samples is also reduced. So, the clustering samples improve optimization speed and accuracy with equality constraints. For QPP (22), similar attributes and conclusions can be obtained. In summary, ρ-QHSVM is suitable for classification with a high efficiency requirement.
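The division by density can be sketched as below. The precise splitting rule is not reproduced in the text, so this sketch assumes that a fraction ε of the samples with the lowest density forms the surrounding set and a fraction ε with the highest density forms the clustering set. Under this assumption ε = 1 returns the full class for both sets, in line with QHSVM being a special case of ρ-QHSVM. `split_by_density` is a hypothetical helper name.

```python
def split_by_density(rho, eps):
    """Return (surrounding, clustering) index lists for one class.

    Assumption: both sets contain round(eps * m) samples, taken from the
    low-density end and the high-density end of the ranking respectively.
    """
    m = len(rho)
    n = max(1, int(round(eps * m)))
    order = sorted(range(m), key=lambda i: rho[i])  # ascending density
    surrounding = order[:n]
    clustering = order[m - n:]
    return surrounding, clustering


surrounding, clustering = split_by_density([0.4, 0.9, 0.1, 0.7, 0.8], eps=0.4)
print(surrounding, clustering)
```

As long as ε does not exceed 0.5 the two sets are disjoint, and every surrounding sample has a smaller density than every clustering sample, which is the stated principle of division. For larger ε the two sets overlap under this assumed rule.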

Solution to ρ-QHSVM
Compared with ρ-QHSVM, QHSVM has the additional condition (23), so QHSVM can be considered as a special case of ρ-QHSVM. Therefore, only the solution of ρ-QHSVM is given in this section, and the solution of QHSVM can be obtained from it.

Experiments and results analysis
In order to test the performance of the proposed classification model, QHSVM, ρ-QHSVM, SVM, Pin-SVM, TWSVM and THSVM are compared by using artificial and benchmark datasets with noise. Moreover, ρ-QHSVM is used to classify strip steel surface defects datasets obtained from a steel plant in China. It must be noted that THSVM is an extended binary classifier based on SVDD_neg. In this experiment, the nonlinear classifiers adopt the kernel function $C(X_i,X_l) = \exp(-\|X_i - X_l\|^2 / 2\delta^2)$, and the linear classifiers adopt $C(X_i,X_l) = X_i \cdot X_l$. All classifiers are solved and executed with MATLAB 7.11 on Windows 7 running on a PC with an Intel Core CPU (3.2 GHz) and 4 GB RAM.
Moreover, for a fair comparison, all classifiers use the same quadprog solver in MATLAB. For QHSVM, some parameters need to be determined. In order to reduce the computational complexity, assume that $c_2^+ = c_2^-$, $c_3^+ = c_3^-$ and $v^+ = v^-$ for QHSVM and ρ-QHSVM. This simplification has also been used in [16], [22], [23], [29] and [32]. For TWSVM and THSVM, $c_1 = c_2$ and $v_1 = v_2$ are set. All parameters c, v and δ are chosen from the set $\{2^l \mid l = -9,-8,\cdots,10\}$. K is used to control the number of nearest neighbors. For the nearest neighbors algorithm, K is generally determined by grid search. The choice of K has been discussed in [18] and [32], and according to them K is set as 8. The parameter τ is chosen from {0.1, 0.2, 0.5, 1}. There are some common parameter selection methods: exhaustive search, 5-fold cross validation, grid search and optimization search. In the experiments, in order to completely separate the training and testing phases, the following selection procedure is adopted. Firstly, we randomly split $m_{all}$ samples into $m_{training}$ training samples and $m_{testing}$ testing samples, where $m_{all} = m_{training} + m_{testing}$. The split step is repeated $n_{training}$ times, so $n_{training}$ training/testing datasets are obtained. Then, the parameter values are determined by 5-fold cross validation and grid search on the i-th training dataset, where $i = 1,2,\cdots,n_{training}$. The final classifier is built with the determined parameter values and is used to evaluate the accuracy on the i-th testing dataset. This step is repeated $n_{training}$ times, which yields $n_{training}$ testing accuracies. Finally, the average accuracy and the standard deviation of all accuracies are calculated. The average accuracy and the standard deviation are used to evaluate the performance of the classifiers on UCI datasets and strip steel surface defects datasets. On artificial datasets, the average accuracy is used to represent the performance of the classifiers. To make the statistical analysis sound, $n_{training}$ is set as 50 and $m_{training} = 5m_{testing}$.
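The selection procedure can be outlined with a small, solver-independent sketch. It is not the MATLAB code used in the experiments: `fit` is a hypothetical callable that takes a training list and one parameter value and returns a prediction function, `threshold_fit` is a toy stand-in classifier, and the population standard deviation is an assumption because the text does not say which estimator was used.

```python
import random
import statistics

# Candidate values for the c's, v's and delta's: {2^l | l = -9, ..., 10}.
GRID = [2.0 ** l for l in range(-9, 11)]


def accuracy(predict, data):
    """Fraction of (x, y) pairs for which predict(x) == y."""
    return sum(predict(x) == y for x, y in data) / len(data)


def cross_val_score(fit, train, params, k=5):
    """Mean validation accuracy of fit(., params) over k folds of train."""
    folds = [train[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(fit(rest, params), folds[i]))
    return sum(scores) / k


def evaluate(fit, data, grid, n_training=50, seed=0):
    """Repeat: random split, tune on the training part only, score on the test part."""
    rng = random.Random(seed)
    m_testing = len(data) // 6  # so that m_training is about 5 * m_testing
    accs = []
    for _ in range(n_training):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:m_testing], shuffled[m_testing:]
        # Grid search with 5-fold cross validation; the test part is never used.
        best = max(grid, key=lambda p: cross_val_score(fit, train, p))
        accs.append(accuracy(fit(train, best), test))
    return statistics.mean(accs), statistics.pstdev(accs)


def threshold_fit(train, t):
    """Hypothetical stand-in classifier: ignores train, predicts +1 when x > t."""
    return lambda x: 1 if x > t else -1


data = [(x / 10, 1 if x > 0 else -1) for x in range(-30, 31) if x != 0]
print(evaluate(threshold_fit, data, [-1.0, 0.0, 1.0], n_training=3))
```

Because the parameter is chosen inside each training split, the testing accuracy of a split is never influenced by its own testing samples, which is the separation the procedure is designed to guarantee.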

Artificial datasets
To illustrate the ability of QHSVM graphically, 2-D artificial datasets with Gaussian distribution are adopted. Suppose the samples $X_i^+$ ($i = 1,2,\cdots,m^+$) satisfy the Gaussian distribution $N(\mu_1,\Sigma_1)$ with mean $\mu_1 = [-0.38,-0.38]^T$ and covariance matrix $\Sigma_1 = \mathrm{diag}(0.1,0.1)$. Suppose the samples $X_j^-$ ($j = 1,2,\cdots,m^-$) also satisfy a Gaussian distribution $N(\mu_2,\Sigma_2)$ with $\mu_2 = [0.38,0.38]^T$ and $\Sigma_2 = \mathrm{diag}(0.03,0.03)$. Moreover, noise around the decision boundary is introduced into the artificial datasets through samples called noise samples, whose number is set by an adjustable parameter θ. θ is the ratio of the number of noise samples to the number of training samples. These noise samples affect the labels around the boundary. The labels of these noise samples are selected from {+1,−1} with equal probability, and the positions of these samples satisfy a Gaussian distribution with parameters $\mu_n = [0,0]^T$ and $\Sigma_n = \mathrm{diag}(0.03,0.03)$.
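A dataset of this kind can be generated with the following sketch. Two points are assumptions not fixed by the description: the noise samples are appended to the clean samples (rather than replacing some of them), and θ is taken relative to the number of clean samples. `sample_gaussian` and `make_dataset` are hypothetical names.

```python
import math
import random


def sample_gaussian(rng, mean, var_diag, n):
    """Draw n points from a Gaussian with a diagonal covariance matrix."""
    return [[rng.gauss(mu, math.sqrt(v)) for mu, v in zip(mean, var_diag)]
            for _ in range(n)]


def make_dataset(m_pos, m_neg, theta, seed=0):
    """Return a list of (point, label) pairs with a ratio theta of noise samples."""
    rng = random.Random(seed)
    pos = sample_gaussian(rng, [-0.38, -0.38], [0.1, 0.1], m_pos)
    neg = sample_gaussian(rng, [0.38, 0.38], [0.03, 0.03], m_neg)
    data = [(x, +1) for x in pos] + [(x, -1) for x in neg]
    # Noise samples sit around the origin, i.e. near the decision boundary,
    # and receive a label from {+1, -1} with equal probability.
    n_noise = int(round(theta * len(data)))
    noise = sample_gaussian(rng, [0.0, 0.0], [0.03, 0.03], n_noise)
    data += [(x, rng.choice([+1, -1])) for x in noise]
    return data


d1_noisy = make_dataset(100, 100, theta=0.10)
print(len(d1_noisy))
```

With `m_pos = m_neg = 100` and `theta = 0.10` this adds 20 randomly labelled points near the origin to the 200 clean samples.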
Firstly, the dataset $D_1$ with $m^+ = 100$ and $m^- = 100$ is built according to the above Gaussian distributions. Then, the dataset $D_1^n$ is obtained by introducing noise samples with θ = 10% into $D_1$. It can be seen from Fig 4(A-1) that the decision boundary of SVM is obtained based on two parallel support hyper-planes. These two support hyper-planes are boundary hyper-planes. Compared with Fig 4(A-1), the boundary hyper-planes of SVM in Fig 4(A-2) change in position. The result shows that SVM is adversely affected by noise samples. The support hyper-planes of Pin-SVM are quantile hyper-planes. Many samples lie between the two quantile hyper-planes, which dilutes the adverse impact of noise samples. So, the decision boundary of Pin-SVM does not change much between $D_1$ and $D_1^n$. Different from SVM, TWSVM uses two nonparallel support hyper-planes to describe two classes of samples. This attribute helps TWSVM describe the training dataset, especially a dataset with cross planes. However, each support hyper-plane of TWSVM needs to be supported by a parallel boundary hyper-plane. So, noise samples also affect the nonparallel support hyper-planes of TWSVM. THSVM builds two support hyper-spheres. Each hyper-sphere covers one class of samples and keeps away from the other class of samples. THSVM maximizes the margin between the two classes and the inner-class clustering of samples. So, the decision boundary of THSVM becomes more reasonable. It can be seen from Fig 4(D-1) that the decision boundary of THSVM curves toward the clustering samples. However, its two support hyper-spheres are boundary hyper-spheres and are affected by noise samples near the boundary. If τ = 0, the quantile hyper-spheres of QHSVM reduce to boundary hyper-spheres. So, QHSVM (τ = 0) includes two boundary hyper-spheres with the same center for every class of samples, and it has attributes similar to those of THSVM. It is therefore clear that the decision boundary of QHSVM (τ = 0) will be changed by noise samples. QHSVM (τ = 0.5) builds two quantile hyper-spheres with the same center for every class of samples. Compared with the boundary hyper-spheres, some samples lie inside or outside of the quantile hyper-spheres. These samples reduce the adverse impact caused by noise samples around the decision boundary. So, the training results of QHSVM (τ = 0.5) on $D_1$ and $D_1^n$, namely the support hyper-spheres and the decision boundary, do not change noticeably. Moreover, the decision boundaries of QHSVM (τ = 0.5) on $D_1$ and $D_1^n$ are both reasonable and curve toward the clustering samples. All these results indicate that QHSVM has better performance because it integrates the excellent attributes of Pin-SVM, TWSVM and THSVM.
Then, the dataset $D_2$ with $m^+ = 200$ and $m^- = 200$ is built. According to the prescribed rules, it is divided into a training dataset and a testing dataset, and noise samples with θ = 0%, 5%, 10% and 20% are introduced into the training dataset respectively. The testing accuracies for different classifiers with linear kernel are shown in Table 1. For θ = 0%, compared with SVM and TWSVM, THSVM and QHSVM have better classification accuracies, which shows that the nonparallel hyper-planes (hyper-spheres) and inner-class clustering of samples strengthen the performance of classifiers. For θ = 0%, the testing accuracy of Pin-SVM is lower than that of SVM. One possible reason is that there are some isolated samples in $D_2$, which can be seen from Fig 4(B-1). The only error point in Fig 4(B-1) deviates from the dataset with "+" in black. A quantile hyper-plane is sensitive to isolated samples as well as noise samples. For θ ≠ 0%, QHSVM provides the best testing accuracy compared with the other classifiers. All these results show that QHSVM performs the best in accuracy for datasets with noise samples, which is due to pinball losses, two nonparallel support hyper-spheres and inner-class clustering of samples. For θ ≠ 0%, the testing accuracy of Pin-SVM is higher than that of SVM, TWSVM and THSVM, which shows that the pinball loss can improve a classifier's performance for datasets with noise samples. The testing accuracy of Pin-SVM is lower than that of QHSVM. The reason is that Pin-SVM does not have the attributes of inner-class clustering of samples and nonparallel support hyper-planes. Testing accuracies corresponding to different classifiers with nonlinear kernel are shown in Table 2. For all conditions, QHSVM has the best testing accuracy. Compared with Table 1, the testing accuracies of all classifiers in Table 2 are improved, which shows that the nonlinear kernel improves the classification results.
Finally, in order to test the performance of ρ-QHSVM, the datasets D 3 (m + = m − = 100), D 4 (m + = m − = 400), D 5 (m + = m − = 700) and D 6 (m + = m − = 1000) are built. And noise samples with θ = 0% are introduced into these datasets. Nonlinear classifiers of SVM, Pin-SVM, TWSVM, THSVM and QHSVM are tested on accuracy and speed. Testing results are shown in Table 3. The conclusions on Table 3 are nearly the same with that on Table 2, which shows that QHSVM has excellent and stable performance for different-scale datasets. THSVM and TWSVM are faster than SVM, Pin-SVM and QHSVM. The reason is that these two classifiers solve two smaller QPPs instead of one large QPP used for SVM and Pin-SVM. The efficiency of QHSVM is the lowest because it solves two large QPPs to obtain better classification accuracy. So, QHSVM is not fit for high efficiency requirement. In order to solve the above problem, ρ-QHSVM with adjustable execution speed is proposed. It uses parameter ε to adjust the execution speed. The accuracy and execution time of ρ-QHSVM with different ε for differentscale datasets are shown in Table 3. The classification accuracy of ρ-QHSVM reduces as ε becomes small. The sparseness of surrounding samples and clustering samples is controlled by ε. This is caused by the fact that reducing ε means reducing the number of surrounding samples. Fewer surrounding samples inevitably reduce the classification accuracy for datasets with noise samples. For small-scale datasets, if ε is big, ρ-QHSVM is close to QHSVM, and exceeds the other classifiers in accuracy. Take the dataset D 3 as an example, when ε = 0.7, ρ-QHSVM is   close to QHSVM in accuracy. For large-scale datasets, when ε is small, ρ-QHSVM is close to QHSVM in accuracy. And it also exceeds the other classifiers in accuracy. For the dataset D 6 , the classification accuracy of ρ-QHSVM exceeds that of Pin-SVM when ε = 0.3, and is close to that of QHSVM when ε = 0.4. 
It can be seen from Table 3 that the smaller ε is, the higher the efficiency of ρ-QHSVM is. When ε ≤ 0.4, ρ-QHSVM is the fastest classifier, which shows that the efficiency of ρ-QHSVM can be adjusted by ε. The results in Table 3 show that, under the premise of high accuracy, the improvement in execution time brought by ρ-QHSVM is limited for small-scale datasets. However, this limitation matters little, because the execution time of all classifiers is short on small-scale datasets. For large-scale datasets, the execution time of ρ-QHSVM is reduced greatly under the premise of high accuracy. For example, ρ-QHSVM has high efficiency and testing accuracy on the dataset D6 when ε = 0.3. So, ρ-QHSVM is suited to large-scale datasets with high efficiency requirements.

UCI datasets with noise samples
In order to further test the performance of QHSVM, all classifiers are run on fifteen public benchmark datasets downloaded from the UCI Machine Learning Repository [33]. Ten small-scale or middle-scale datasets are used for testing accuracy: Heart, Ionosphere, Breast, Thyroid, Australian, WPBC, Pima, German, Sonar and ILPD. Five large-scale datasets are used for testing accuracy and speed: Wifi, Splice, Wilt, Musk and Spambase. The details of the original benchmark datasets are listed in Table 4. In order to highlight the anti-noise ability of QHSVM, the benchmark datasets with noise samples are also tested. Each benchmark dataset is corrupted by zero-mean Gaussian noise. For each feature, the ratio of the variance of the noise to that of the feature, denoted as θ, is set to 0%, 5% and 10%. All original and corrupted benchmark datasets are normalized before training. Table 5 shows the testing accuracies of SVM, Pin-SVM, TWSVM, THSVM and QHSVM with nonlinear kernels on the ten benchmark datasets. It can be seen that QHSVM achieves the best testing accuracy on the majority of datasets. For the original benchmark datasets with θ = 0%, QHSVM and THSVM yield the best testing accuracy on 5 and 2 of 10 datasets respectively, while SVM, Pin-SVM and TWSVM each yield the best testing accuracy on 1 of 10 datasets. This result shows that nonparallel hyper-spheres and inner-class clustering of samples, as used in QHSVM and THSVM, strengthen the performance of classifiers. It should be pointed out that QHSVM has an obvious advantage on the corrupted benchmark datasets. For the corrupted benchmark datasets with θ = 5% and θ = 10%, QHSVM and Pin-SVM yield the best testing accuracies on 13 and 5 of 20 datasets respectively, while SVM, TWSVM and THSVM yield the best testing accuracy on 2, 1 and 1 of 20 datasets respectively. This result shows that QHSVM and Pin-SVM are better than classifiers with hinge loss on the corrupted datasets.
Moreover, for the original and corrupted benchmark datasets, QHSVM is superior to THSVM and Pin-SVM, because it combines pinball losses, nonparallel hyper-spheres and inner-class clustering of samples. This conclusion agrees with the experimental results on the artificial datasets. Table 6 shows the classification accuracies and execution time of SVM, Pin-SVM, TWSVM, THSVM and ρ-QHSVM with nonlinear kernels on the five large-scale datasets. In the above section, it was found that ρ-QHSVM improves the execution efficiency on the large-scale artificial datasets under the premise of high accuracy. This experiment supports the same conclusion. According to the experimental results on the artificial datasets, the parameter ε of ρ-QHSVM is set to 0.3. Compared with the other classifiers, the execution time of ρ-QHSVM is the shortest. The reason is that ρ-QHSVM solves smaller QPPs with inequality constraints, which are built on the sparse surrounding samples. On the other hand, ρ-QHSVM achieves the best testing accuracy on the majority of datasets. For the original benchmark datasets with θ = 0%, ρ-QHSVM yields the best testing accuracy on 3 of 5 datasets, while for the corrupted benchmark datasets with θ = 5% and θ = 10%, ρ-QHSVM yields the best accuracy on 6 of 10 datasets. The reason is that the local center-based density estimation method ensures a reasonable division of the training samples into surrounding samples and clustering samples. In general, ρ-QHSVM has higher efficiency and accuracy on large-scale datasets than the other classifiers.
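The corruption procedure used for the benchmark datasets in this section (zero-mean Gaussian noise whose variance is a fraction θ of each feature's variance, followed by normalization) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and min-max scaling is assumed only because the text does not say which normalization is used.

```python
import random
import statistics


def corrupt_features(X, theta, seed=0):
    """Add zero-mean Gaussian noise to every feature (column) of X.

    theta is the ratio of the noise variance to the feature variance,
    e.g. 0.05 for the 5% setting. X is a list of equal-length rows.
    """
    rng = random.Random(seed)
    noisy_columns = []
    for col in zip(*X):
        # Noise standard deviation derived from this feature's variance.
        sigma = (theta * statistics.pvariance(col)) ** 0.5
        noisy_columns.append([v + rng.gauss(0.0, sigma) for v in col])
    return [list(row) for row in zip(*noisy_columns)]


def min_max_normalize(X):
    """Scale each feature to [0, 1] (assumed normalization, not from the paper)."""
    scaled_columns = []
    for col in zip(*X):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against constant features
        scaled_columns.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled_columns)]


if __name__ == "__main__":
    X = [[float(i), float(2 * i)] for i in range(10)]
    X_noisy = min_max_normalize(corrupt_features(X, theta=0.05))
    print(X_noisy[:3])
```

Setting theta to 0 leaves the data unchanged, which corresponds to the original (uncorrupted) datasets.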

PASCAL VOC dataset
The PASCAL VOC dataset [34] is a public benchmark dataset and is often used in challenge competitions for supervised machine learning. The dataset is composed of color images of twenty visual object classes in realistic scenes. In the experiment, ten of these classes are chosen: person, cat, cow, dog, horse, sheep, bicycle, bus, car and motorbike. The color images are converted to intensity images and then resized to s times the size of the original color images so that they have the specified 4096 pixels, where s is a positive real number. So, each image is represented as a sample vector with 4096 elements. We choose 800 and 1600 vectors as training samples respectively, and the others are testing samples. In order to highlight the anti-noise ability, a noisy version of the PASCAL VOC dataset is built by corrupting the dataset with zero-mean Gaussian noise. For each feature, the ratio of the variance of the noise to that of the feature, denoted as θ, is set to 5%. For brevity, we build ten nonlinear binary classifiers with the one-against-rest method, and the mean of the ten accuracies is presented in Fig 5. It can be seen from Fig 5(A) that the performance of our QHSVM is superior to that of SVM, Pin-SVM, TWSVM and THSVM on the PASCAL VOC dataset with noise. The result highlights that the pinball losses improve the anti-noise ability of our QHSVM. Furthermore, the nonparallel hyper-spheres strengthen the generalization performance of the classifier. Fig 5(B) shows that the average accuracy of QHSVM is not lower than that of the other classifiers. This indicates that QHSVM also has reliable performance on the original PASCAL VOC dataset. The robustness of QHSVM is strengthened by maximizing the margin between two hyper-spheres with the same center and maximizing the inner-class clustering of samples. Moreover, two nonparallel quantile hyper-spheres improve the generalization of QHSVM.
In addition, the performance of all classifiers improves as the number of training samples increases. With more training samples, the performance of all classifiers on the corrupted dataset is close to that on the original dataset. Notably, the accuracies of our classifier on the two datasets are close. This also shows that our QHSVM has better robustness than the other classifiers.
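The resizing step described in this section fixes the pixel count rather than the image shape. A minimal sketch of how the scale factor s could be computed is given below, assuming s is applied to both height and width so that the aspect ratio is preserved. The function name is hypothetical, and how the resized dimensions are rounded to integers is not stated in the text, so that step is left out.

```python
import math


def scale_factor(height: int, width: int, target_pixels: int = 4096) -> float:
    """Return s such that (s * height) * (s * width) equals target_pixels.

    Assumption: the same factor s is applied to both image dimensions.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    return math.sqrt(target_pixels / (height * width))


if __name__ == "__main__":
    s = scale_factor(128, 128)
    # A 128 x 128 image shrinks to 64 x 64 = 4096 pixels.
    print(s, s * 128, s * 128)
```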

Strip steel surface defects datasets
Strip steel surface defects datasets are obtained from the Northeastern University (NEU) surface database [35]. In the experiment, four typical defects datasets in the NEU surface database are investigated: patches (S1 Dataset), inclusion (S2 Dataset), scratches (S3 Dataset) and scale (S4 Dataset). Their typical images are shown in Fig 6. These defect images are extracted as defect samples, and each defect sample includes sixteen attributes. This means that each defect sample is a 16-dimensional vector. The related attributes have been described in our previous work [36]. Strip steel surface defects classification is a multi-class classification problem. There are many multi-class classification methods based on binary classifiers, such as one-against-one, one-against-rest, decision directed acyclic graph and binary tree [37], of which the binary tree model is the most widely used. Multi-class classifiers for SVM, Pin-SVM, TWSVM, THSVM and ρ-QHSVM can be built on the binary tree. According to the binary tree model, three QPPs need to be solved for SVM and Pin-SVM, while six QPPs need to be solved for TWSVM, THSVM and ρ-QHSVM.
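The QPP counts quoted above follow from the shape of the tree: a binary tree that separates K classes has K − 1 internal nodes, each holding one binary classifier. The short sketch below reproduces the arithmetic, assuming one QPP per binary classifier for SVM and Pin-SVM and two for the twin-type classifiers (TWSVM, THSVM and ρ-QHSVM). The function name is illustrative.

```python
def binary_tree_qpp_count(num_classes: int, qpps_per_binary_classifier: int) -> int:
    """Number of QPPs solved when a binary tree handles num_classes classes.

    A binary tree with num_classes leaves has num_classes - 1 internal nodes,
    and each internal node trains one binary classifier.
    """
    if num_classes < 2:
        raise ValueError("at least two classes are required")
    return (num_classes - 1) * qpps_per_binary_classifier


if __name__ == "__main__":
    # Four defect types: patches, inclusion, scratches and scale.
    print(binary_tree_qpp_count(4, 1))  # SVM, Pin-SVM
    print(binary_tree_qpp_count(4, 2))  # TWSVM, THSVM, rho-QHSVM
```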
Moreover, to obtain more samples, the strip steel surface defects datasets are supplemented by rotation, distortion, translation, and scaling. In the end, the strip steel surface defects datasets include 8000 samples, with 2000 samples for each type of defect. All parameters of the classifiers are obtained with the same method mentioned above, and the parameter ε of ρ-QHSVM is set to 0.3. The accuracies and execution time of all classifiers for all types of defects are shown in Table 7 and Fig 7 respectively. It can be seen from Table 7 that the accuracy of ρ-QHSVM is the best for all types of defects. The accuracy of Pin-SVM is better than that of SVM, TWSVM and THSVM. The reason is that the strip steel surface defects datasets are usually corrupted by noise, since noise is known to be present on strip steel production lines. So, the pinball losses in ρ-QHSVM and Pin-SVM are effective on the strip steel surface defects datasets with noise samples. Moreover, the other attributes of ρ-QHSVM improve its performance further. Besides, the efficiency of ρ-QHSVM is high. TWSVM, THSVM and ρ-QHSVM are better than SVM and Pin-SVM in execution time, as shown in Fig 7. Though SVM and Pin-SVM only need to solve three QPPs for the four types of datasets, these QPPs are all large. TWSVM, THSVM and ρ-QHSVM solve smaller QPPs, which reduces the execution time. ρ-QHSVM is the fastest, which benefits from the local center-based density estimation method. The method improves the classification efficiency under the premise of high accuracy. In summary, ρ-QHSVM is well suited to strip steel surface defects classification.

Conclusions
A novel QHSVM classifier is proposed for pattern recognition in this paper. QHSVM has three notable attributes: pinball losses, two nonparallel quantile hyper-spheres and inner-class clustering of samples. The quantile hyper-spheres ensure that QHSVM is insensitive to noise, especially the feature noise around the decision boundary. The robustness of the QHSVM algorithm is strengthened by maximizing the margin between two hyper-spheres with the same center and maximizing the inner-class clustering of samples. Moreover, compared with the standard SVM model, two nonparallel quantile hyper-spheres improve the generalization of QHSVM.
On the other hand, in order to satisfy the requirement of high efficiency for large-scale dataset classification, a new version of QHSVM with adjustable execution speed is proposed, which is called ρ-QHSVM. Under the premise of high accuracy, ρ-QHSVM reduces the execution time. This benefits from the local center-based density estimation, which reasonably divides training samples into surrounding samples and clustering samples. The proposed QHSVM and ρ-QHSVM are compared with SVM, Pin-SVM, TWSVM and THSVM through numerical experiments on artificial, benchmark and strip steel surface defects datasets with noise. The results show that QHSVM performs the best in accuracy on datasets with noise samples, which is due to pinball losses, two nonparallel support hyper-spheres and inner-class clustering of samples. The execution time of ρ-QHSVM is reduced greatly under the premise of high accuracy for large-scale datasets, especially the strip steel surface defects datasets, where ρ-QHSVM is the fastest classifier. In the future, effective methods for finding the optimal parameters of QHSVM need to be developed, and the application of QHSVM to unbalanced datasets will be investigated.

Supporting information
S1 Dataset. Patches dataset. The first typical strip steel surface defects dataset.