A Multi-Label Learning Based Kernel Automatic Recommendation Method for Support Vector Machine

Choosing an appropriate kernel is critical when classifying a new problem with a Support Vector Machine (SVM). So far, more attention has been paid to constructing new kernels and choosing suitable parameter values for a specific kernel function than to kernel selection. Furthermore, most current kernel selection methods seek the single best kernel with the highest classification accuracy via cross-validation; they are time consuming and ignore the differences in the number of support vectors and the CPU time of SVM with different kernels. Considering the tradeoff between classification success ratio and CPU time, there may be multiple kernel functions performing equally well on the same classification problem. Aiming to automatically select those appropriate kernel functions for a given data set, we propose a multi-label learning based kernel recommendation method built on the data characteristics. For each data set, the meta-knowledge data base is first created by extracting the feature vector of data characteristics and identifying the corresponding applicable kernel set. Then the kernel recommendation model is constructed on the generated meta-knowledge data base with the multi-label classification method. Finally, the appropriate kernel functions are recommended to a new data set by the recommendation model according to the characteristics of the new data set. Extensive experiments over 132 UCI benchmark data sets, with five different types of data set characteristics, eleven typical kernels (Linear, Polynomial, Radial Basis Function, Sigmoidal function, Laplace, Multiquadric, Rational Quadratic, Spherical, Spline, Wave and Circular), and five multi-label classification methods demonstrate that, compared with the existing kernel selection methods and the most widely used RBF kernel function, SVM with the kernel function recommended by our proposed method achieved the highest classification performance.

SVM classifies data objects by identifying the optimal separating hyperplanes among classes. Determining a class boundary in the form of a separating hyperplane is adequate for simpler cases where the classes are nearly or completely linearly separable. In practice, however, data are usually complex, high dimensional or not linearly separable, so a kernel function is first employed to project the original data into a higher dimensional space, and then a linear separating hyperplane with the maximal margin between two classes is constructed [1].
According to the theory of Reproducing Kernel Hilbert Spaces (RKHS) [19,20], the kernel function, which is represented as a legitimate inner product $K(u, v) = \langle \phi(u), \phi(v) \rangle$, can be any positive definite function that satisfies the Mercer conditions [2]. While the most commonly used kernel functions are the linear, polynomial, radial basis function and sigmoid kernels, there are many other more complicated kernel functions derived by aggregating multiple base kernel functions. In essence, the generalization capacity of SVM depends on the choice of kernel function and the setting of the misclassification tolerance parameter C, where the appropriate value of C is directly related to the chosen kernel [21,22]. Hence, a careful choice of the kernel function is of primary importance for SVM to produce an appropriate classification boundary.
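The statement that a kernel is an inner product in a feature space can be checked numerically on a small case. The sketch below is illustrative only: it assumes the homogeneous degree-2 polynomial kernel on two-dimensional inputs, for which an explicit feature map is known, and the names `poly2_kernel` and `phi` are hypothetical.

```python
import math

def poly2_kernel(u, v):
    """Homogeneous degree-2 polynomial kernel K(u, v) = (u . v)^2 on 2-D inputs."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def phi(u):
    """Explicit feature map for poly2_kernel, from R^2 to R^3."""
    return (u[0] ** 2, math.sqrt(2.0) * u[0] * u[1], u[1] ** 2)

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

u, v = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(u, v)      # evaluated in the input space
feature_value = dot(phi(u), phi(v))    # evaluated in the feature space
```

Both values agree up to floating-point rounding, so the SVM can work in the feature space while only ever evaluating the kernel on the original inputs.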
To determine an optimal kernel function, most researchers have devoted themselves to tuning the parameters of a specified kernel function via trial and error, and fewer to selecting an appropriate kernel function. Generally, the existing methods can be divided into four categories: (1) cross-validation [23][24][25], which is most commonly used to find an optimal kernel for a new data set; (2) multiple kernel learning (MKL) [26,27], which attempts to construct a generalized kernel function that solves all classification problems by combining different types of standard kernel functions; (3) genetic programming [28,29], which uses Gene Expression Programming algorithms to evolve the kernel function of SVM; and (4) automatic kernel selection with C5.0 [30,31], which aims to recommend a specific kernel function for different classification problems based on the statistical data characteristics and distribution information.
Apart from the fact that cross-validation, MKL and genetic programming methods require numerous iterations to converge towards a reasonable solution [32], all existing methods try to find a single optimal kernel function in terms of classification accuracy. However, for some kernels, although the difference in classification accuracy is minor, the differences in the number of stored support vectors or in time complexity [33] can be significant. This means the selected kernel may not be the best one. For instance, suppose the classification accuracy of SVM with kernel function A is slightly higher than that with kernel function B, but the former costs much more time than the latter. Usually, kernel function A is selected as the best one, although kernel function B would be more appropriate for practical applications. Therefore, it is misleading to evaluate the performance of SVM with a kernel function only in terms of classification accuracy, and a further selection step is needed to pick out the applicable kernels in light of overall performance. Note that, owing to the balance between classification accuracy and CPU time, SVM might perform equally well with different kernels on the same classification problem. In this case, kernel selection can be viewed as a multi-label learning problem, so that a set of applicable kernels with satisfactory classification performance can be recommended for a new problem.
In multi-label learning, multi-label classification (MLC) [34][35][36] has been widely applied in semantic annotation [37,38], tag recommendation [39], rule mining [40], and information retrieval [41,42]. Recently, multi-label classification has been studied and adopted for recommending an applicable set of classification algorithms by Wang et al. [43]. Inspired by their work, this paper presents a new multi-label meta-learning based kernel recommendation method. In this method, data sets are described by their characteristics, their applicable kernel sets are identified in terms of the adjusted ratio of ratios (ARR) [44] via cross-validation, and the relationship between the two is learned by multi-label classification algorithms and then used to recommend applicable kernels for new problems. Extensive experiments over 132 UCI benchmark data sets, with five types of data set characteristics, eleven commonly used kernels (Linear, Polynomial, Radial Basis Function, Sigmoidal function, Laplace, Multiquadric, Rational Quadratic, Spherical, Spline, Wave and Circular), and five multi-label classification methods demonstrate that, compared with the existing kernel selection methods and the most widely used RBF kernel function, SVM with the kernel function recommended by our proposed method achieved a higher classification performance.
The remainder of the paper is set up as follows. The related work is briefly reviewed in Section Previous Work. The proposed method is concretely introduced in Section Multi-label Learning Based Kernel Recommendation Method. The experimental process and the result analysis are provided in Section Experimental Study. Finally, the conclusion of our work is drawn in Section Conclusion.

Previous Work
In the past decades, the issue of kernel selection for SVM has attracted much attention and many methods have been proposed. Most research work has concentrated on parameter optimization for a pre-specified kernel function [45][46][47][48] via cross-validation, exhaustive grid search or evolutionary algorithms, and less on kernel selection. Generally, the state-of-the-art kernel selection methods can be categorized into four classes: cross-validation, multiple kernel learning, evolutionary methods, and meta-learning based methods.
Cross-validation is the most frequently used method for model selection; its drawback is that the computational cost is often too high for practical use, since the learning problem must be solved n times. For SVM, the optimal kernel is usually obtained by minimizing the n-fold cross-validation error (in the extreme case, the leave-one-out classification error) [23][24][25].
Multiple kernel learning (MKL) methods [32,33,49] use a linear or nonlinear combination of different kernels instead of a single kernel function. Since different kernels correspond to different notions of similarity, and they may use inputs of different representations, possibly from different sources or modalities, combining kernels is one possible way to combine multiple information sources. This type of method aims to yield a general kernel function for solving any problem. In practice, however, it is difficult to determine which kernels should be combined, and MKL converges very slowly on large data sets.
Evolutionary kernel selection methods use the n-fold cross-validation accuracy as the fitness criterion. Howley et al. [50] and Sullivan et al. [29] attempted to find near-optimal kernels for SVM using a genetic programming system in which kernel functions are represented as trees: input variables or numerical constants form the leaves, and their values are passed to internal nodes, which perform numerical or program operations before passing the result further towards the root of the tree. The classification error, together with a "tiebreaker", is taken as the fitness function. Kanchan et al. [28] employed Gene Expression Programming to train an SVM with the most suitable kernel function, where the cross-validation accuracy is calculated to measure the fitness of a kernel. These methods show wide applicability, but the combined computational overhead of genetic programming and SVM remains a major unresolved issue.
The meta-learning based kernel selection method [30] applied a decision tree to generate rules associating the most appropriate kernel with data set characteristics for support vector machines. In this approach, three types of measures (classical, distance and distribution-based statistical information) are collected for characterizing each data set, and the classification accuracy is used to evaluate the performance of SVM with the selected kernel function. Similarly, Wang et al. [31] proposed to assign a suitable kernel function to a given data set after discerning its approximate distribution with PCA [51].
The first three types of methods require a large number of evaluations and unacceptable CPU runtime, which makes them impractical for real classification tasks. In addition, the existing automatic kernel selection methods pay more attention to creating or optimizing a promising kernel for a given classification problem than to selecting from the many available kernel functions. To deal with the issue that multiple kernels might perform equally well on a given data set, and in contrast to the multiple kernel learning methods, we view kernel selection as a multi-label classification problem and propose a multi-label learning based kernel recommendation method to identify all the applicable kernels for different classification problems.

Multi-label Learning Based Kernel Recommendation Method
In this section, we introduce the fundamentals of our multi-label learning based kernel recommendation method. We first give an overview of the proposed method in subsection General View of the Method, and then describe each component of our recommendation method in subsection Meta-knowledge Database Generation and subsection Model Construction and Recommendation, respectively.

General View of the Method
As we know, for different data sets, the performance of SVM with a specific kernel function can be different. This means that, just as there is a dual relation between data set characteristics and the performance of a classification algorithm [52,53], there also exists a relationship between data set characteristics and the performance of a kernel function when SVM is used as a classification algorithm. Thus, before recommending a suitable kernel function for a classification problem, this relationship must be modeled. Furthermore, in order to build this relationship, the characteristics of a data set and the corresponding performance of the appropriate kernel function(s) should be obtained. Therefore, our proposed method consists of three parts: meta-knowledge database generation, recommendation model construction, and kernel recommendation. Fig 1 shows the details.
1. Meta-knowledge database generation. This preparation stage creates a meta-knowledge database based on the historical data sets and all the candidate kernel functions. Specifically, for each historical data set, the characteristic measures are extracted as the meta-features, while the corresponding applicable kernel set is identified as the meta-target by constructing and evaluating SVM with each candidate kernel. After that, the meta-knowledge database is created by merging the meta-features and the applicable kernel set of each historical data set.

2. Recommendation model construction. At this stage, multi-label classification algorithms are applied to the meta-knowledge database, which consists of meta-features and meta-targets, and the recommendation model is built.

3. Kernel recommendation for the new data set. When recommending kernels for a new problem, its characteristics are extracted and passed to the recommendation model; the output of the model is the set of suitable kernel functions for the new problem.
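The three stages can be sketched end to end as follows. This is a minimal illustration, not the method's implementation: the function names are hypothetical, the meta-features and kernel sets are made-up toy values, and a 1-nearest-neighbour lookup stands in for the multi-label classifier of stage 2.

```python
import math

def build_meta_knowledge(datasets, extract_features, find_applicable_kernels):
    """Stage 1: pair each data set's meta-features with its applicable kernel set."""
    return [(extract_features(d), find_applicable_kernels(d)) for d in datasets]

def build_recommender(meta_knowledge):
    """Stage 2 (stand-in): a 1-nearest-neighbour lookup over the meta-features."""
    def recommend(features):
        nearest = min(meta_knowledge, key=lambda row: math.dist(row[0], features))
        return nearest[1]
    return recommend

# Toy, made-up historical data sets: name -> (meta-features, applicable kernels).
toy_datasets = {
    "d1": ([0.1, 0.9], {"linear"}),
    "d2": ([0.8, 0.2], {"rbf", "laplace"}),
}
meta = build_meta_knowledge(
    toy_datasets,
    extract_features=lambda name: toy_datasets[name][0],
    find_applicable_kernels=lambda name: toy_datasets[name][1],
)
recommend = build_recommender(meta)
recommended = recommend([0.7, 0.3])   # Stage 3: characteristics of a new data set
```

In the actual method, `extract_features` would compute the data characteristic measures and `find_applicable_kernels` would run the SVM evaluation and statistical comparison described in the following subsections.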

Meta-knowledge Database Generation
The meta-knowledge database captures the knowledge about which kernels perform well on which data sets when SVM is used as the classification algorithm. In this section, we first introduce the measures used for characterizing a data set, and then explain how to determine the applicable kernel set for each data set.
Meta-features. The meta-features consist of measures extracted from data sets for uniformly depicting data set characteristics. Pavel et al. [54] first proposed to generate a set of rules for characterizing the applicability of classification algorithms using meta-level learning (ML). Afterwards, Shawkat and Kate [53] presented a rule-based classification selection method, built on the data characteristics, to induce which types of algorithms are appropriate for solving which types of classification problems, and then explored a meta-learning approach to automatic kernel selection for support vector machines [30]. Overall, the above-mentioned work characterized data sets by simple, statistical and information theory based measures. These measures are not only conveniently and efficiently calculated, but also related to the performance of machine learning algorithms [55,56].
Data measures are no longer limited to statistical or information-theoretic descriptions. Recently, other well-established measures have been developed from different perspectives, such as problem complexity measures, landmarking measures, model-based measures and structural measures. Table 1 describes each type of data characteristic measure in detail.

• Problem complexity measures
Ho and Basu [57] explored a number of measures to characterize the difficulty of a classification problem, focusing on the geometrical complexity of the class boundary. These problem complexity measures can highlight the manner in which classes are separated or interleaved, a factor that is most critical for classification accuracy.
• Landmarking measures
Bensusan et al. [58][59][60] introduced meta-learning by landmarking various learning algorithms. The main idea of landmarking is to use simple and efficient learning algorithms themselves to determine the location of a specific learning problem, which is represented by the disagreement pattern between a set of standard classifiers. The disagreement patterns not only point towards different types of classification problems, but also indicate the novelty and the usefulness of a classifier with respect to a set of classification problems and classifiers. In our study, we build a small set of standard classification algorithms (Naive Bayes, IB1 and C4.5) on each data set and then record their classification performance as the landmarking measures.
• Model based measures
Peng et al. [61] presented new measures to capture the characteristics of the structural shape and size of the decision tree induced from a given data set. We employed the C5.0 tree algorithm to construct the standard decision tree on each data set, and then obtained a total of 15 measures describing the properties of each decision tree.
• Structural and statistical information based measures
Song et al. [62] presented structural and statistical information based data set characteristics and used them to construct a recommendation method for classification algorithms. Firstly, the given ordinary data set is transformed into the corresponding binary data set; then a one-item set $V_I$ and a two-item set $V_{II}$ are extracted, where $V_I$ captures the distribution of the values of a given attribute while $V_{II}$ reflects the correlation between two features; finally, to achieve a unified representation and comparison of different data sets, both $V_I$ and $V_{II}$ are sorted in ascending order and a unified feature set is generated by computing the statistical summary of both, including the minimum, the seven octiles and the maximum. Compared with the traditional data set characteristic measures, this type of measure has been shown to be superior.
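The unified summary used by the structural measures (minimum, seven octiles and maximum of a sorted vector) can be illustrated with a short sketch. The vector `v_one` is made up, the function name is hypothetical, and the interpolation rule for the octiles is an assumption, since the text does not specify one.

```python
import statistics

def octile_summary(values):
    """Return [min, seven octiles, max] of a vector, nine numbers in total."""
    ordered = sorted(values)   # ascending order, as described in the text
    # Assumption: linear interpolation with both end points included.
    octiles = statistics.quantiles(ordered, n=8, method="inclusive")
    return [ordered[0], *octiles, ordered[-1]]

# Hypothetical one-item vector (illustrative values only).
v_one = [0.50, 0.10, 0.30, 0.20, 0.40, 0.90, 0.70, 0.60, 0.80]
summary = octile_summary(v_one)
```

Because the summary always has nine entries, data sets with different numbers of attributes are mapped to feature vectors of the same length.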
Applicable Kernels Identification. Applicable kernels are kernels among which there is no significant difference in the classification performance of SVM. To model the relationship between data set characteristics and applicable kernel functions, it is essential to identify an applicable kernel set for each historical data set as the target concept.
The identification of applicable kernels can be briefly described as follows: each historical data set is classified by SVM with each candidate kernel, and the applicable kernel set of the data set is then identified by evaluating and comparing the classification performance of SVM with all candidate kernels.
When evaluating the performance of SVM with two different kernel functions on a given data set, there may be no large difference in classification accuracy, but there may be a significant difference in the number of stored support vectors or in training time complexity [33]. For example, if SVM with kernel A is slightly better than that with kernel B in success ratio but takes much more CPU time, then kernel B is more likely to be chosen as the kernel function. Thus, it is rational to take both the success ratio of classification and the CPU time into consideration when evaluating the performance of SVM with a particular kernel function. Moreover, class imbalance often exists in real-world applications; in this case, classification accuracy is not a reliable metric for reflecting the performance of SVM with different kernel functions. Instead, the area under the curve (AUC) [63] has been theoretically and empirically validated to be more suitable for evaluating the performance of a classification algorithm in this situation. Thus, the adjusted ratio of ratios (ARR) [44,64] is modified and adopted as a multi-criteria measure for evaluating the classification performance of SVM with different kernels; it aggregates information about the success ratio (AUC) and the CPU time of SVM with each candidate kernel and realizes a compromise between the two.
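The ARR of SVM with the kernel function $k_p$ on the data set $d$, $ARR^{d}_{k_p}$, is defined as:

$$ARR^{d}_{k_p} = \frac{1}{M-1} \sum_{q=1,\, q \neq p}^{M} ARR^{d}_{k_p,k_q}, \qquad ARR^{d}_{k_p,k_q} = \frac{AUC^{d}_{k_p} / AUC^{d}_{k_q}}{1 + \beta \cdot \log\left(T^{d}_{k_p} / T^{d}_{k_q}\right)} \qquad (1)$$

where $AUC^{d}_{k_p}$ and $T^{d}_{k_p}$ represent the AUC and CPU time of kernel $k_p$ on data set $d$, respectively, $k_q$ denotes each of the other kernels, $M$ is the number of candidate kernels, and $\beta$ represents the relative importance of AUC and CPU time; it lies in the range [0, 1] and is usually defined by the user.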
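The trade-off expressed by ARR can be made concrete with a small computation. The sketch below assumes the ratio-of-ratios form of [44] with AUC as the success measure, a base-10 logarithm, and an arithmetic mean over the other kernels; these choices, the function names and the measurements are assumptions for illustration.

```python
import math

def pairwise_arr(auc_p, time_p, auc_q, time_q, beta):
    """Ratio of AUCs, discounted by beta times the log ratio of CPU times."""
    # Assumption: base-10 logarithm.
    return (auc_p / auc_q) / (1.0 + beta * math.log10(time_p / time_q))

def arr(kernel, auc, cpu_time, beta=0.1):
    """Mean pairwise ARR of `kernel` against every other candidate kernel."""
    others = [k for k in auc if k != kernel]
    total = sum(
        pairwise_arr(auc[kernel], cpu_time[kernel], auc[q], cpu_time[q], beta)
        for q in others
    )
    return total / len(others)

# Made-up measurements: kernel A is slightly more accurate but 100 times slower.
auc = {"A": 0.91, "B": 0.90}
cpu_time = {"A": 100.0, "B": 1.0}   # seconds
scores = {k: arr(k, auc, cpu_time, beta=0.1) for k in auc}
```

With these numbers kernel B obtains the higher score, which matches the kernel A versus kernel B argument in the text. Setting `beta=0` reduces the score to a pure AUC ratio.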
For the purpose of obtaining a stable classification performance of SVM with different kernel functions and making full use of the historical data set, a 10×10-fold cross-validation is employed. This means the 10-fold cross-validation is repeated 10 times for the SVM with a given kernel function on each data set with different random seeds, and the effects from the order of inputs are reduced. The detailed procedure for evaluating the performance of kernels is shown in Table 2.
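The 10×10-fold scheme can be sketched as an index generator. This is a plain (non-stratified) split written for illustration; whether the original experiments stratify the folds is not stated, and the seeding scheme and function name are assumptions.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, base_seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        # A different seed per repeat gives a different instance order.
        random.Random(base_seed + repeat).shuffle(indices)
        for fold in range(n_folds):
            test_idx = indices[fold::n_folds]   # every n_folds-th shuffled item
            test_set = set(test_idx)
            train_idx = [i for i in indices if i not in test_set]
            yield repeat, fold, train_idx, test_idx

splits = list(repeated_kfold(n_samples=50))
```

Each of the 100 splits yields one ARR value per kernel, so the performance array for a data set has 100 rows and one column per candidate kernel.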
Once the performance array of SVM with all candidate kernel functions has been obtained for a given data set, in which each column holds the ARR performance of SVM with one kernel function and each row corresponds to one cross-validation fold over the M kernel functions, a multiple comparison procedure (MCP) [65] is used to identify the applicable kernel set, which consists of the top-scoring candidate kernels without significant performance differences among them. Since the distribution of the performance values is unknown, simple yet safe and robust non-parametric tests are used to statistically compare the classifiers. Specifically, the Friedman test [66,67] with Holm's procedure [68] is employed to compare multiple kernel functions on each given data set. The Friedman test is first employed to check whether there is a significant difference among all candidate kernels at the significance level α = 0.05. The null hypothesis is $H_0: k_1 = k_2 = \dots = k_m$, which states that all kernels are equivalent. If the test result is p < 0.05, the null hypothesis is rejected and Holm's post-hoc procedure is applied to find the kernels that outperform the others; otherwise, there is no significant difference among the kernels. Table 3 provides the details of the applicable kernels identification for a given data set.
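The post-hoc step can be sketched as follows. The code assumes that the Friedman test has already rejected the null hypothesis, compares every kernel with the best-ranked one using the normal approximation on average-rank differences that is commonly paired with the Friedman test, and applies Holm's step-down thresholds. The exact procedure of Table 3 may differ; the function names and data are illustrative.

```python
import math

def average_ranks(performance):
    """Average rank of each column (kernel); rank 1 is the highest ARR in a row."""
    n_rows, n_cols = len(performance), len(performance[0])
    totals = [0.0] * n_cols
    for row in performance:
        order = sorted(range(n_cols), key=lambda j: -row[j])
        pos = 0
        while pos < n_cols:
            end = pos
            while end + 1 < n_cols and row[order[end + 1]] == row[order[pos]]:
                end += 1
            tied_rank = (pos + end) / 2.0 + 1.0   # tied values share the mean rank
            for j in order[pos:end + 1]:
                totals[j] += tied_rank
            pos = end + 1
    return [t / n_rows for t in totals]

def holm_applicable(performance, names, alpha=0.05):
    """Kernels not significantly worse than the best-ranked one (Holm step-down)."""
    n_rows, k = len(performance), len(names)
    ranks = average_ranks(performance)
    best = min(range(k), key=lambda j: ranks[j])
    std_err = math.sqrt(k * (k + 1) / (6.0 * n_rows))
    # Two-sided normal p-value of each kernel against the best-ranked control.
    p_values = sorted(
        (math.erfc(abs(ranks[j] - ranks[best]) / std_err / math.sqrt(2.0)), j)
        for j in range(k) if j != best
    )
    rejected = set()
    for step, (p, j) in enumerate(p_values):
        if p > alpha / (k - 1 - step):
            break                                 # stop at the first non-rejection
        rejected.add(j)
    return {names[j] for j in range(k) if j not in rejected}

# Made-up ARR values: A and B alternate as the winner, C is always last.
performance = [[0.9, 0.8, 0.5] if i % 2 == 0 else [0.8, 0.9, 0.5] for i in range(20)]
applicable = holm_applicable(performance, ["A", "B", "C"])
```

Here kernels A and B have equal average ranks and both remain in the applicable set, while C is rejected as significantly worse.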

Model Construction and Recommendation
In this section, we elaborate the process of recommendation model construction. Firstly, we introduce the state-of-the-art multi-label classification methods used for modeling the relationship between data set characteristics and the performance of different kernels in subsection Multi-label Classification. Secondly, we present the multi-label feature selection methods that are used to exclude useless features affecting the construction of the recommendation model in subsection Multi-label Feature Selection. Finally, we provide the construction method of the multi-label learning based kernel recommendation model in subsection Multi-label Kernel Recommendation Model Construction, and give the measures used to evaluate the performance of our recommendation method in subsection Multi-label Evaluation Metrics.
Multi-label Classification. Traditional single-label classification is concerned with learning from a set of examples, each associated with a single class label from a set of disjoint labels. However, a classification problem usually has more than one applicable kernel. Multi-label classification [34,36,69], in contrast, learns from problems where each example is associated with more than one class label. Therefore, kernel recommendation is a multi-label classification problem. Two main categories of multi-label classification methods can be used to build the kernel recommendation model: problem transformation and algorithm adaptation.

• Problem transformation
The purpose of problem transformation is to convert a multi-label learning problem into traditional single-label classification problems by the methods listed below: (1) subjectively or randomly select one of the multiple labels for each multi-label instance and discard the rest; (2) simply discard every multi-label instance from the multi-label data set and retain only those instances with a single label; (3) label powerset (LP), which treats each distinct set of labels in the multi-label data set as a single label of a single-label classification task; (4) the include labels classifier (ILC); and (5) calibrated label ranking (CLR), which introduces a calibration label representing the boundary between relevant and irrelevant labels, and effectively produces an ensemble combining the models learned by the conventional binary relevance ranking approach and the pairwise comparison approach.
• Algorithm adaptation
Unlike problem transformation, the objective of algorithm adaptation is to modify existing single-label classification algorithms and adapt them to solve multi-label classification problems. The prevalent algorithms include C4.5 [70], BoosTexter [71], the multi-label k-nearest neighbor ML-KNN [72], the multi-label kernel method RANK-SVM [73], the multi-class multi-label neural network BP-MLL [74], and MMP [75]. The first three methods are the most commonly used in multi-label learning. Clare and King [70] adapted C4.5 to handle multi-label biological problems by modifying the definition of entropy; its output is a decision tree or, equivalently, a set of symbolic rules that can be interpreted and compared with existing biological knowledge. However, it only learns rules of biological interest rather than predicting all examples. Schapire and Singer [71] proposed a boosting-based system for text categorization (BoosTexter) on the basis of the popular ensemble learning method ADABOOST [76]. In the multi-label training phase, BoosTexter maintains a set of weights over training examples and labels. As boosting progresses, training examples and their corresponding labels that are hard to predict correctly get incrementally higher weights, while examples and labels that are easy to classify get lower weights. ML-KNN [72] is a multi-label lazy learning approach derived from the traditional k-nearest neighbor (kNN) algorithm. Concretely, for each unseen instance, its k nearest neighbors are first identified in the training set. After that, based on the number of neighboring instances belonging to each possible class, the label set for the unseen instance is determined by the maximum a posteriori (MAP) principle. Experimental results confirmed that ML-KNN slightly outperforms BoosTexter, and is far superior to ADABOOST.MH and RANK-SVM. Thus, ML-KNN has been applied to solve real-world multi-label learning problems.
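Two of the problem transformations mentioned above, binary relevance and label powerset, can be sketched on a toy meta-knowledge base. The examples, meta-feature values and function names are made up for illustration.

```python
def binary_relevance(examples, all_labels):
    """One binary data set per label: target 1 if the label applies, else 0."""
    return {
        label: [(x, int(label in labels)) for x, labels in examples]
        for label in all_labels
    }

def label_powerset(examples):
    """One multi-class data set whose classes are the distinct label sets."""
    return [(x, frozenset(labels)) for x, labels in examples]

# Toy examples: (meta-features, applicable kernel set).
examples = [
    ([0.2, 0.7], {"rbf", "laplace"}),
    ([0.9, 0.1], {"linear"}),
    ([0.3, 0.6], {"rbf", "laplace"}),
]
br = binary_relevance(examples, ["linear", "rbf", "laplace"])
lp = label_powerset(examples)
lp_classes = {cls for _, cls in lp}
```

Binary relevance yields one single-label problem per kernel, whereas label powerset yields a single problem with one class per observed kernel set (two classes in this toy case).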
To construct our multi-label learning based kernel recommendation method as well as possible, we select several representative and effective multi-label classification methods from both the problem transformation and the algorithm adaptation categories to build the recommendation model.
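The ML-KNN procedure described above (neighbor counting followed by a MAP decision per label) can be sketched compactly. This is a simplified illustration, assuming Euclidean distance and a Laplace smoothing constant `s = 1`; it is not the MULAN implementation, and the training data are made up.

```python
import math

def nearest(train_x, query, k, skip=None):
    """Indices of the k training points closest to `query` (optionally skipping one)."""
    candidates = [i for i in range(len(train_x)) if i != skip]
    return sorted(candidates, key=lambda i: math.dist(train_x[i], query))[:k]

def mlknn_fit(train_x, train_y, labels, k=5, s=1.0):
    """Estimate, per label, the prior and the neighbour-count likelihoods."""
    m = len(train_x)
    model = {}
    for label in labels:
        has = [label in y for y in train_y]
        prior1 = (s + sum(has)) / (2.0 * s + m)
        with_label = [0] * (k + 1)      # histogram for instances carrying the label
        without_label = [0] * (k + 1)   # histogram for instances lacking it
        for i in range(m):
            count = sum(has[j] for j in nearest(train_x, train_x[i], k, skip=i))
            (with_label if has[i] else without_label)[count] += 1
        like1 = [(s + c) / (s * (k + 1) + sum(with_label)) for c in with_label]
        like0 = [(s + c) / (s * (k + 1) + sum(without_label)) for c in without_label]
        model[label] = (prior1, like1, like0)
    return model

def mlknn_predict(model, train_x, train_y, query, k=5):
    """MAP decision for every label, given the query's k nearest neighbours."""
    neighbours = nearest(train_x, query, k)
    predicted = set()
    for label, (prior1, like1, like0) in model.items():
        count = sum(label in train_y[j] for j in neighbours)
        if prior1 * like1[count] > (1.0 - prior1) * like0[count]:
            predicted.add(label)
    return predicted

# Two made-up clusters of meta-feature vectors with different kernel sets.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
train_y = [{"linear"}] * 4 + [{"rbf", "laplace"}] * 4
model = mlknn_fit(train_x, train_y, ["linear", "rbf", "laplace"], k=3)
pred_low = mlknn_predict(model, train_x, train_y, (0.5, 0.5), k=3)
pred_high = mlknn_predict(model, train_x, train_y, (10.5, 10.5), k=3)
```

Each label is decided independently, so the recommended kernel set may contain several kernels or, in unfavourable cases, none.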
Multi-label Feature Selection. Feature selection can provide more suitable features for building classification models. Since single-label feature selection methods are dedicated to filtering out useless and redundant features for single-label learning, whereas our kernel recommendation method is based on multi-label learning, multi-label feature selection techniques are applied before constructing the recommendation model to select the critical meta-features. Multi-label feature selection techniques can be classified into external and internal strategies.
The internal strategy [77] utilizes multi-label statistical relationships, such as document-label information and label-label relationships, within the design of the feature selection algorithm. In contrast, the external strategy [78] transforms multi-label training data into single-label data before feature selection, so that traditional single-label feature selection algorithms can be applied. In this section, we mainly introduce the external strategy, since several existing single-label feature selection methods are applicable to multi-label problems with high efficiency and effectiveness.
Yang and Pedersen [79] made a comparative study of five popular feature selection methods and confirmed that Information Gain (IG) and CHI-SQUARE (CHI) are the most effective and are comparable to each other. Here, besides IG and CHI, we also apply Relief [80] to provide suitable features for building the kernel recommendation model. Relief is a practical feature selection method that evaluates the worth of a feature by repeatedly sampling an instance and considering the values of the given feature for the nearest instances belonging to the same and to different classes. IG [79] evaluates the worth of a feature by computing the information gain with respect to the class. If the information gain of a feature is less than a predetermined threshold, the feature is removed from the feature space. CHI [79] evaluates the worth of a feature by computing the value of the chi-squared statistic with respect to the class. The chi-squared statistic is a normalized value and it is comparable across features of the same class.
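For the external strategy, once the multi-label data have been transformed into single-label data, IG and CHI reduce to the familiar single-label scores. The sketch below computes both for discrete features; discretization of continuous meta-features is assumed to have been done beforehand, and all names and data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional

def chi_squared(feature, labels):
    """Pearson chi-squared statistic of the feature/class contingency table."""
    total = len(labels)
    f_counts, y_counts = Counter(feature), Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f, f_n in f_counts.items():
        for y, y_n in y_counts.items():
            expected = f_n * y_n / total
            stat += (observed[(f, y)] - expected) ** 2 / expected
    return stat

# Made-up single-label view of four data sets and two discretized meta-features.
classes = ["rbf", "rbf", "linear", "linear"]
informative = ["hi", "hi", "lo", "lo"]
uninformative = ["a", "b", "a", "b"]
ig_scores = (information_gain(informative, classes),
             information_gain(uninformative, classes))
chi_scores = (chi_squared(informative, classes),
              chi_squared(uninformative, classes))
```

The feature that separates the classes perfectly receives the maximal scores, while the feature that is independent of the class scores zero under both criteria and would be removed by any positive threshold.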
Multi-label Kernel Recommendation Model Construction. In order to thoroughly explore our proposed kernel recommendation method and make adequate use of the available data, five multi-label classification algorithms and three multi-label feature selection methods are employed to build the kernel recommendation model via the jackknife cross-validation technique. The details are shown in Table 4.
Multi-label Evaluation Metrics. To comprehensively evaluate the performance of our kernel recommendation method, three evaluation metrics, Hit Rate, Precision, and ARR, are selected.
Let $D$ be a multi-label meta-knowledge base consisting of $|D|$ multi-label examples $\langle x_i, Y_i \rangle$ ($i = 1, \dots, |D|$), where $Y_i \subseteq allKernels$ is the applicable kernel set identified by Table 3; let $R$ be our multi-label kernel recommendation method and $Y'_i = R(x_i)$ be the set of labels predicted by $R$ for example $x_i$. The three metrics are defined as follows.
• Hit Count (HC) and Hit Rate (HR)
Song et al. [62] defined the metrics HC and HR to evaluate the individual and overall performance of a recommendation model $R$ over an example $x_i$ and all examples $D$, respectively:

$$HC(x_i) = \begin{cases} 1, & Y_i \cap Y'_i \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad HR(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} HC(x_i)$$

For an example $x_i$, if the intersection of $Y_i$ and $Y'_i$ is not empty, then the hit count HC is 1, indicating that the recommendation hits the target successfully; otherwise, HC is 0, indicating that the recommendation misses.
• Precision
Precision [81] is used to evaluate the effectiveness of our proposed multi-label kernel recommendation method. This measure calculates the fraction of labels correctly recommended by the multi-label kernel recommendation method; it is defined as follows:

$$Precision(R, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Y'_i|}{|Y'_i|}$$

• Adjusted ratio of ratios
The ARR defined in Eq 1 is utilized to evaluate the classification performance of SVM with a kernel function recommended by the proposed method.
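Hit rate and precision can be computed directly from the actual and recommended kernel sets. The sketch below assumes the example-based form of precision (the average over examples of the fraction of recommended kernels that are applicable) and scores an empty recommendation as 0; both choices, and the data, are assumptions for illustration.

```python
def hit_count(actual, predicted):
    """1 if at least one recommended kernel is applicable, else 0."""
    return 1 if actual & predicted else 0

def hit_rate(actual_sets, predicted_sets):
    """Fraction of examples whose recommendation hits the applicable set."""
    hits = [hit_count(a, p) for a, p in zip(actual_sets, predicted_sets)]
    return sum(hits) / len(hits)

def precision(actual_sets, predicted_sets):
    """Mean fraction of recommended kernels that are actually applicable."""
    fractions = [
        len(a & p) / len(p) if p else 0.0   # assumption: empty recommendation scores 0
        for a, p in zip(actual_sets, predicted_sets)
    ]
    return sum(fractions) / len(fractions)

# Made-up applicable and recommended kernel sets for three data sets.
actual = [{"rbf", "laplace"}, {"linear"}, {"rbf"}]
predicted = [{"rbf"}, {"rbf", "linear"}, {"spline"}]
hr = hit_rate(actual, predicted)       # two of the three recommendations hit
prec = precision(actual, predicted)    # (1/1 + 1/2 + 0/1) / 3
```

Hit rate rewards any overlap with the applicable set, whereas precision additionally penalizes recommending kernels that are not applicable.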

Experimental Study
In this section, we experimentally evaluate the proposed multi-label kernel recommendation method over the benchmark data sets.

Benchmark Data Set
To evaluate our proposed multi-label kernel recommendation method, we collected 132 benchmark data sets from the publicly available repositories UCI, DASL, PROMISE, Agricultural, Agnostic-vs-Prior, and Examples. These data sets cover different fields, such as life sciences, biology, physical sciences, engineering, and software effect prediction. Brief statistical information on these data sets is given in Table 5.

Experimental Setup
1. To facilitate the classification of the data sets with Support Vector Machine, we import the LIBSVM tool package [82] into WEKA, in which C-Support Vector Classification is specially designed for dealing with classification problems. Furthermore, eleven different types of classical kernel functions [30,83] are chosen as the candidates, including linear, polynomial, radial basis function, sigmoidal function, Laplace, Multiquadric, Rational Quadratic, Spherical, Spline, Wave and Circular.
Linear kernel function. If the number of features is large, it is often unnecessary to map the data to a higher dimensional space; that is, using the linear kernel is good enough. It is formulated as $K(x_i, x_j) = x_i^{T} x_j$.
Polynomial kernel function. When the number of features is small, one often maps data to higher dimensional spaces, and using a nonlinear kernel is a better choice. The polynomial kernel is a nonlinear kernel expressed as $K(x_i, x_j) = (\gamma x_i^{T} x_j + coef)^{d}$, where the parameters γ, coef and d need to be initialized; it is time consuming when the degree d or the training set size is large.
Radial basis function kernel function (RBF). RBF is implemented by using convolutions of the type $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^{2})$.
Sigmoidal kernel function. The SVM with the Sigmoidal Kernel function is equivalent to the Multi-Layer Perceptron classifier in performance [84].
Rational Quadratic Kernel. The Rational Quadratic Kernel is less computationally intensive than the RBF kernel and can be used as an alternative when using the RBF becomes too expensive.
Multiquadric Kernel. The Multiquadric Kernel is an example of a non-positive definite kernel and can be used in the same situations as the Rational Quadratic kernel.
Laplace Kernel. The Laplace Kernel is less sensitive to changes in the sigma parameter.
Circular Kernel. The Circular Kernel comes from a statistics perspective. It is an example of an isotropic stationary kernel and is positive definite.
Spherical Kernel. The Spherical Kernel is positive definite.
Wave Kernel. The Wave Kernel is symmetric positive semi-definite.
Spline Kernel. The Spline Kernel is given as a piecewise cubic polynomial, as derived in [85].
2. Considering the importance of CPU runtime when evaluating the performance of SVMs with different kernels, we set the parameter β in Eq 1 to three values: 1%, 10% and 15%. This allows us to examine the usability of our proposed kernel recommendation method under different situations.
3. To produce the meta-knowledge data base, (1) all data characteristics listed in subsection Meta-features are collected as independent variables; (2) the corresponding applicable kernel set is identified for each data set as the target; (3) the Friedman test and the post-hoc Holm's procedure with significance level α = 0.05 are used to ensure a high confidence level.
4. The multi-label kernel recommendation model is built with the help of the Java library MULAN, which is specially designed for multi-label learning. The multi-label learning methods adopted in our experiments are BR, LP, CLR and ILC at the problem transformation level and ML-KNN (K = 5) at the algorithm adaptation level. Five standard classification algorithms are employed: IB1, Naive Bayes, J48, Ripper and Random Forest. Although the measures listed in Table 1 characterize data sets from different perspectives, not all of them are critical for building multi-label kernel recommendation models. Therefore, we preprocess each data set with the feature selection methods Relief, IG and CHI.

5. When evaluating the performance of SVM with different kernels, 10×10-fold cross-validation is applied to guarantee the stability of the results and to reduce the effect caused by the order of instances. Meanwhile, the jackknife strategy is employed to recommend kernels for each data set and to obtain an unbiased estimate for the proposed kernel recommendation method. That is, each data set in turn is treated as the new data set that receives a recommended applicable kernel set, while the others are viewed as historical data sets.
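Several of the candidate kernels listed in step 1 can be sketched in plain Python. This is a minimal illustration, not the LIBSVM or WEKA implementation used in the experiments. The linear, polynomial, RBF and sigmoid forms follow the common LIBSVM-style parameterization; the Laplace, Rational Quadratic and Multiquadric forms follow widely used textbook definitions and are an assumption here, since the paper's exact parameterizations may differ. All function and parameter names are illustrative.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def linear(x: Vector, y: Vector) -> float:
    return dot(x, y)


def polynomial(x: Vector, y: Vector, gamma: float = 1.0,
               coef: float = 0.0, degree: int = 3) -> float:
    return (gamma * dot(x, y) + coef) ** degree


def rbf(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    return math.exp(-gamma * sq_dist(x, y))


def sigmoid(x: Vector, y: Vector, gamma: float = 1.0, coef: float = 0.0) -> float:
    return math.tanh(gamma * dot(x, y) + coef)


def laplace(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    # Assumed form: exp(-||x - y|| / sigma)
    return math.exp(-math.sqrt(sq_dist(x, y)) / sigma)


def rational_quadratic(x: Vector, y: Vector, c: float = 1.0) -> float:
    # Assumed form: 1 - ||x - y||^2 / (||x - y||^2 + c)
    d2 = sq_dist(x, y)
    return 1.0 - d2 / (d2 + c)


def multiquadric(x: Vector, y: Vector, c: float = 1.0) -> float:
    # Assumed form: sqrt(||x - y||^2 + c^2)
    return math.sqrt(sq_dist(x, y) + c * c)


def gram_matrix(kernel: Callable[[Vector, Vector], float],
                data: List[Vector]) -> List[List[float]]:
    """Pairwise kernel values for a small data set."""
    return [[kernel(a, b) for b in data] for a in data]


points = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(gram_matrix(linear, points))
```

A precomputed Gram matrix of this kind is one common way to hand a non-built-in kernel to an SVM solver; whether the experiments did so is not stated in the text.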

Experimental Results and Analysis
In this section, we compare our proposed multi-label recommendation method with the single-label recommendation method [62] for kernel selection, the meta-learning based kernel selection method (AliKSM) [30], and simple multiple kernel learning (MKL) [32] with Polynomial and RBF as the basic kernel functions (MKL-Poly and MKL-RBF) on the 132 data sets in terms of hit rate (HR), Precision and ARR; the results are given in Section Recommendation Performance Comparison. We also analyze the impact of different multi-label classification methods and feature selection methods on our proposed recommendation method in Section Sensitivity Analysis.
Recommendation Performance Comparison. The performance of both single-label and multi-label based kernel recommendation models depends on the employed classification algorithm, and many classification algorithms are used for model construction in this paper. In this section, we present only the comparison between the proposed multi-label kernel recommendation model built with the best classification algorithm (Random Forest) and the existing kernel recommendation models, in terms of the recommendation hit rate HR, Precision and the classification performance ARR.
Moreover, we also compare the classification performance ARR of SVM with the kernel recommended by our proposed multi-label recommendation method to that with (1) the most widely used radial basis function (RBF) kernel [86], which is the default kernel in LIBSVM [82], and (2) the kernels created by the multiple kernel learning methods MKL-Poly and MKL-RBF.

1. Comparison on hit rate (HR)
Fig 2 shows the recommendation hit rates of the proposed multi-label kernel recommendation method, the single-label recommendation method, the meta-learning based kernel selection method AliKSM and the multiple kernel learning methods under β = 1%, 10% and 15%, respectively. From this figure we observe that:
a. Whatever the data characteristics and the value of β, the hit rate HR of our proposed multi-label kernel recommendation method is significantly greater than those of the other two recommendation methods. Specifically, for the three values of β, the HRs of our multi-label recommendations on the structure measures reach 91.6%, 88.55% and 90.84%, respectively, which are almost twice as high as those of the other two kernel selection methods. This indicates that our proposed multi-label kernel recommendation method can effectively predict the applicable kernels for the given data sets.
b. The hit rate HR of our proposed multi-label kernel recommendation method is much better than that of the single-label kernel recommendation method for the following reason. When the single-label kernel recommendation model is constructed from historical data sets, only the kernel with the highest classification performance identified by cross-validation is selected as the target concept. However, there may be more than one appropriate kernel for a given data set, with no significant differences in the classification performance of SVM. The single-label method therefore misses the other applicable kernels, and it hits only if the recommended kernel is the selected one. This results in a lower hit rate.

2. Comparison on precision
Fig 3 shows the recommendation precision of our proposed multi-label kernel recommendation method, the single-label kernel recommendation method and the meta-learning based kernel selection method AliKSM with β = 1%, 10% and 15%, respectively. From this figure we observe that:
a. The precision of the proposed multi-label kernel recommendation method approaches 70%, and the maximum Precisions of both the single-label kernel recommendation method and the automatic kernel selection method AliKSM are smaller than the minimum Precision of the multi-label kernel recommendation method. This means that our proposed method is more powerful for selecting the applicable kernels for the given data sets.
b. When β varies from 1% through 10% to 15%, the Precisions of our multi-label kernel recommendation method exceed those of the other two recommendation methods by at least 51.85%, 44.55% and 26.58%, respectively. This means that, whatever the value of β is, our proposed method is more effective for kernel selection.
3. Comparison on classification performance (ARR)
Fig 4 shows the ARR of SVM with the kernel functions recommended by the different kernel selection methods when β = 1%, 10% and 15%, respectively. From this figure we observe that:
a. Whatever the value of β is, the classification performance ARR of SVM with the kernel recommended by the proposed multi-label recommendation method on the structure measures exceeds those obtained with the other methods.
b. When β = 1%, the ARRs of SVM with the kernel recommended by our proposed method on each kind of meta-features significantly outperform those of the other kernel selection methods, except the multiple kernel learning method MKL-Poly. However, when the multi-label kernel recommendation model is built on the structure measures, the ARR of SVM with the kernel recommended by the proposed method is still greater than that with the kernel derived from MKL-Poly by 3.59%.
c. When β = 10%, the ARRs of SVM with the kernel recommended by the multi-label recommendation method on most kinds of meta-features are greater than that of the single-label recommendation method by 5.19%-30.46% and that of AliKSM by 10.54%-17.42%, respectively. Compared to the multiple kernel learning method MKL-Poly, the ARR of SVM with the kernel recommended by the multi-label recommendation method on the structure measures is improved by 1.05%. Compared to the multiple kernel learning method MKL-RBF and the default RBF kernel, the ARRs of SVM with the kernel recommended by the multi-label recommendation method are improved by 5.29% and 6.74% on the Landmarking measures and by 14.51% and 16.08% on the structure measures, respectively.
d. When β = 15%, the multi-label kernel recommendation models built on the model-based and structure measures are superior to the single-label recommendation method and the meta-learning based kernel selection method AliKSM. The improvements in ARR reach up to 10.43% over the single-label recommendation method and 23.40% over AliKSM, respectively. Compared to the multiple kernel learning methods MKL-Poly and MKL-RBF and the default RBF kernel function, the ARR of SVM with the kernel recommended by the multi-label recommendation method based on the structure measures is increased by 6.79%, 19.73% and 21.33%, respectively.
To summarize, with the kernel recommended by our proposed multi-label recommendation method on the structure measures, SVM obtains the best classification performance among the compared methods.
In Fig 5, a scatter plot is employed to give an intuitive picture of the performance of our proposed kernel recommendation method, the single-label kernel recommendation method, the meta-learning based kernel selection method AliKSM, and the simple multiple kernel learning methods MKL-Poly and MKL-RBF for β = 1%, 10% and 15%, respectively, where the X-axis and Y-axis stand for the classification performance ARRs of SVM with the real best kernel and with the recommended kernel. Points on the diagonal y = x mean that the recommendations are optimal. The more points deviate from the diagonal, and the farther they lie from it, the worse the recommendation performance. From Fig 5 we observe that:
a. When β = 1%, 10% and 15%, only 26.72%, 22.90% and 27.48% of the recommendations of our proposed multi-label recommendation method deviate from the diagonal in terms of ARR, compared with 59.54%, 57.25% and 54.20% for the single-label recommendation method, 68.70%, 69.47% and 60.31% for AliKSM, and more than 95% for the multiple kernel learning methods. This indicates that the classification performance of SVM with the kernel recommended by the proposed multi-label recommendation method outperforms those with the other recommendations, and that the multi-label recommendation is more likely to recommend the optimal kernels for the given data set.
b. Whatever the value of β is, the deviation degree of our proposed method is much smaller than those of the other methods. This means that the error of our proposed multi-label kernel recommendation method is much smaller than those of the other methods, and that the ARRs of SVM with the kernels recommended by our method are closest to the real best ones.
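The two quantities read off the scatter plot, namely the share of recommendations lying off the diagonal y = x and how far they lie from it, can be computed as in the following sketch. It assumes that the "deviation degree" is the perpendicular distance |x − y| / √2 of each point from the diagonal, which the text does not state explicitly; the function name, the tolerance `tol` and the sample values are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def diagonal_deviation(best_arr: Sequence[float],
                       recommended_arr: Sequence[float],
                       tol: float = 1e-9) -> Tuple[float, float]:
    """Return (fraction of points off y = x, mean distance of those points).

    best_arr[i] is the ARR with the real best kernel on data set i and
    recommended_arr[i] is the ARR with the recommended kernel.
    """
    if len(best_arr) != len(recommended_arr) or not best_arr:
        raise ValueError("need two non-empty sequences of equal length")
    # Perpendicular distance of the point (x, y) from the line y = x.
    distances = [abs(x - y) / math.sqrt(2)
                 for x, y in zip(best_arr, recommended_arr)]
    off = [d for d in distances if d > tol]
    fraction_off = len(off) / len(distances)
    mean_distance = sum(off) / len(off) if off else 0.0
    return fraction_off, mean_distance


# Made-up ARR values for four data sets: two recommendations are optimal.
best = [1.0, 0.9, 0.8, 0.7]
recommended = [1.0, 0.7, 0.8, 0.6]
fraction_off, mean_distance = diagonal_deviation(best, recommended)
print(fraction_off)  # 0.5
```

A method with both a small fraction and a small mean distance recommends kernels whose ARR is close to that of the real best kernel on most data sets.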

Significance test results
In order to explore whether our multi-label kernel recommendation method is significantly superior to the existing recommendation methods in terms of HR, Precision and ARR when β = 1%, 10% and 15%, Wilcoxon signed ranks tests [87,88] were conducted at the significance level of 0.05. The alternative hypothesis is that our proposed multi-label kernel recommendation method is better than the other methods. Table 6 shows the statistical test results of the proposed multi-label kernel recommendation method vs. the single-label kernel recommendation method, the meta-learning based kernel selection method AliKSM, and the multiple kernel learning methods MKL-Poly and MKL-RBF on each kind of meta-features. From Table 6 we observe that:
a. Whatever the value of β is, our proposed multi-label kernel recommendation method is significantly superior to the single-label recommendation method and the meta-learning based kernel selection method AliKSM on all meta-features in terms of HR and Precision.
b. When β = 1% and 10%, the ARR of SVM with the kernel recommended by our multi-label recommendation method clearly outperforms those of the single-label kernel recommendation method and AliKSM on most kinds of meta-features. When β = 15%, the ARRs of SVM with the kernel recommended by the proposed multi-label recommendation method on the model-based and structure measures are significantly greater than those of the single-label recommendation method and AliKSM.
c. For each value of β, the ARR of SVM with the kernel recommended by our multi-label recommendation method on the structure measures significantly outperforms those with the kernels created by the multiple kernel learning methods MKL-Poly and MKL-RBF.
In summary, our multi-label kernel recommendation method significantly outperforms the single-label kernel recommendation method and AliKSM on most kinds of meta-features in terms of hit rate HR and Precision. In particular, on the structure measures, the classification performance of SVM with the kernel recommended by our multi-label recommendation method is significantly superior to those of all the other methods.
In short, our proposed multi-label kernel recommendation method on the structure measures significantly outperforms the other available kernel selection methods. It is capable of recommending the applicable kernels for a new classification problem.
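For readers unfamiliar with the test used above, the following sketch computes a one-sided Wilcoxon signed-rank test on paired scores. It is an illustration under stated assumptions, not the procedure actually run for Table 6: zero differences are discarded, tied absolute differences receive average ranks, and the p-value uses the normal approximation without tie or continuity correction (exact tables are usually preferred for small samples). The function names and the sample scores are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of the values, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_greater(ours: Sequence[float],
                     theirs: Sequence[float]) -> Tuple[float, float, float]:
    """Return (W+, W-, p) for the alternative 'ours is greater than theirs'."""
    diffs = [a - b for a, b in zip(ours, theirs) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return w_plus, w_minus, p_value


# Made-up paired scores on six data sets (one tie is dropped).
ours = [10, 12, 15, 9, 20, 18]
theirs = [8, 13, 11, 9, 14, 15]
w_plus, w_minus, p_value = wilcoxon_greater(ours, theirs)
print(w_plus, w_minus)  # 14.0 1.0
```

The null hypothesis is rejected at the 0.05 level when `p_value` falls below 0.05, which supports the one-sided alternative that the first method is better.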
Sensitivity Analysis. When building a kernel recommendation model, many multi-label classification methods and feature selection methods can be used. However, different combinations of multi-label classification methods and feature selection methods lead to different recommendation models, and hence to different recommendations. Therefore, it is necessary to explore which combination is better when constructing the multi-label kernel recommendation model.
In this subsection, we analyze the effect of five multi-label classification methods (BR, LP, CLR, ILC and ML-KNN (k = 5)) and three representative feature selection methods (Relief, CHI and IG) on the recommendations for the 132 data sets in terms of the average HR, Precision and ARR for the three different β values. Figs 6 and 7 illustrate the sensitivity analysis results for the multi-label classification methods and the feature selection methods, respectively.

Effect of multi-label classification methods
From Fig 6, we observe that:
a. For all three β values, the average recommendation hit rate HR, Precision and the classification performance ARR vary with the multi-label classification method. This indicates that the choice of multi-label classification method does affect the performance of the multi-label kernel recommendation method.
b. When β = 1% and 10%, the multi-label recommendation method with CLR obtains the best performance. When β = 15%, the multi-label recommendation methods with BR and ILC achieve equally the best HR, Precision and ARR.

Effect of feature selection methods
From Fig 7, we observe that:
a. For each of the three β values, the performance of the multi-label kernel recommendation method varies with the feature selection method. This means that the choice of feature selection method indeed affects the performance of the multi-label kernel recommendation method.
b. When β = 1%, the recommendations obtained with Relief and IG for feature selection outperform those with CHI and without feature selection (NON). When β = 10%, there is no significant difference between the multi-label methods with feature selection and without feature selection (NON) in terms of HR and ARR. It is worth noting that the recommendation precision with Relief is greater than those with NON, CHI and IG by 2.92%, 10.33% and 10.33%, respectively. When β = 15%, the performance of the multi-label kernel recommendation method with Relief outperforms those with NON, CHI and IG by 7.21%, 56.58% and 56.58% in terms of HR, by 9.67%, 30.67% and 30.67% in terms of Precision, and by 7.27%, 52.62% and 52.62% in terms of ARR, respectively.
Overall, the combination of the multi-label classification method CLR and the feature selection method Relief is beneficial for optimizing the multi-label kernel recommendation method.

Conclusion
Aiming to identify the applicable kernels for SVM on a new classification problem, in this paper we have presented a multi-label learning based kernel recommendation method. In our method, all available data characteristics are first extracted from each data set as the meta-features, and the truly applicable kernels are identified via cross-validation in terms of a relative performance metric that integrates the classification success ratio with the CPU time. Then, the relationship between the two is learned, with a multi-label classification method, on the meta-knowledge data base consisting of the meta-features and the multi-label target. After that, the applicable kernels are recommended for a new classification problem according to its data characteristics.
To confirm the effectiveness of our proposed recommendation method, we conducted experiments on 132 public data sets and compared the proposed multi-label kernel recommendation method with other kernel selection methods in terms of hit rate and precision on all types of data characteristics. Moreover, we also evaluated the classification performance of SVM with the kernel functions recommended by the different selection methods and with the most widely used RBF kernel function. The experimental results demonstrate that our proposed multi-label kernel recommendation method outperforms the other kernel selection methods, and that it can comprehensively and precisely identify the applicable kernels for a new classification problem. We also carried out a sensitivity analysis of the multi-label kernel recommendation method to observe under which conditions it performs better, analyzing the effects of different multi-label classification methods and of different feature selection methods on our recommendation results. From this analysis we conclude that (1) the multi-label classification method CLR is a better choice for constructing the multi-label kernel recommendation method when β = 1% and 10%, while BR and ILC are better when β = 15%, and (2) the feature selection method Relief is more effective for improving the performance of our kernel recommendation method.
For future work, we plan to study parameter optimization and the combination of multiple applicable kernels for SVM based on multi-label learning.