Analyzing Kernel Matrices for the Identification of Differentially Expressed Genes

One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the employment of the state-of-the-art learning machines including the Support Vector Machines (SVM) in particular. The SVM is a typical sample-based classifier whose performance comes down to how discriminant samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method namely the Kernel Matrix Gene Selection (KMGS) is proposed. The rationale of the method, which roots in the fundamental ideas of the SVM algorithm, is described. The notion of ''the separability of a sample'' which is estimated by performing -like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS) which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved noticeably competitive performance in terms of the B.632+ error rate.


Introduction
Microarray data has been applied to the class prediction of different samples, from which the disease diagnosis and prognosis can benefit. A microarray dataset usually contains thousand of genes and a relatively much smaller number of samples (usually v100). For the purpose of predicting the type of biological samples, a majority of this genes are irrelevant and redundant. This fact has prompted the development of a variety of approaches which detect differentially expressed genes (DEGs) to accomplish an accurate classification of the samples.
The t-test has been one of the most widely-used parametric statistical methods for the identification of DEGs between populations of two classes. Variants of the t-test, which adopt different technologies to obtain a more stable estimate of the within-class variance for each gene, have been proposed [1][2][3]. The regularized t-test, for example, adjusted the gene-wise variance estimate by using a Bayesian probabilistic model [2]. For multiple testings, the p-value is calculated and adjusted to address the problem that the false positive rate is likely to accumulate over thousands of genes. Approaches in this categories range from those bounding the ''Family-Wise Error Rate'' (FWER) which is the overall chance of one or more false positives [4][5][6] and strategies controlling the ''False Discovery Rate'' (FDR) which is the expected percentage of false positives among the genes deemed as differentially expressed [1,7]. Because the null distribution is unknown, these methods often shuffle the class labels of the samples to estimate the p-value. The ANOVA F -test extends the t-test to multiple classes and a number of F -like statistics have been proposed which used different shrinkage estimators of the gene-wise variance [8,9].
Another family of statistical methods proposed to factor in the dependency information between genes. Representative examples include the gene pair selection method [10] and correlation-based methods the rationale behind which is that a good feature subset is highly correlated with the class and uncorrelated with each other [11,12]. Also included are the approaches derived from Markov blanket filtering [13][14][15]. Minimum redundancy maximum relevance [16] and uncorrelated shrunken centroid [17] are also well-established gene selection methods in this category.
When cast in the framework of pattern recognition, gene selection is a typical feature selection problem. Feature selection techniques in pattern recognition can be generalized into three types: filter, wrapper and embedded methods [18][19][20]. For filter methods, the feature selection is performed independently of a classification algorithm, which cover a majority of the aforementioned statistical tests. Wrapper methods, by contrast, use a classifier to evaluate a feature subset. The problem of choosing n out of d features involves altogether n d feature subsets. An exhaustive evaluation of these subsets is computationally infeasible, particularly for microarray data of a large d. A number of heuristic search techniques are thus proposed, and among them are the Sequential Forward Selection (SFS), the Sequential Backward Elimination (SBE), the Sequential Forward Floating Selection (SFFS) and the Sequential Backward Floating Elimination (SBFE). The SFS has been used to search for feature subsets which are evaluated by the leave-one-out cross validation accuracy of Least-Squares SVM [21,22]. Genetic Algorithms (GAs) are another family of search strategies that have attracted considerable research attention [23][24][25].
Embedded methods, on the other hand, use the intrinsic property of a specific classifier to evaluate feature subsets. For example, the SVM Recursive Feature Elimination (SVM-RFE) [26] regards that the normal vector of the linear SVM carries the significance information of the genes. Representative examples also include random forest induced approaches [27,28]. An extensive review of major feature selection techniques has been carried out [29]. No general consensus has yet been reached on which one is the best, despite the diversity and abundance of gene selection algorithms.
Empirically, wrappers and embedded methods have been observed to be more accurate than filters [30]. However, they require repetitive training of a specific classifier in order to guide the search in the space of feature subsets and are consequently very time consuming. Filters are, generally speaking, faster in the absence of interactions between feature subsets and a classifier. Thus filters, statistical tests in particular, have enjoyed considerable popularity in the field of gene selection for microarray data [4,9,31,32]. In fact, wrappers normally incorporate statistical tests as a preprocessing step to prune a majority of genes so that the number of feature subsets to be visited is reduced along the search pathway [21,22,26].
Meanwhile, although the choice of the classifier also presents a wide diversity, SVMs have been widely recognized for its generalization abilities [33] and remained as a predominant option [34][35][36].
In summary, a widely-accepted scheme for the analysis of microarray data has been ''identification of DEGs by statistical tests followed by sample classification using SVMs''. The justification is that the prediction accuracy of various classifiers including SVMs, depends on how discriminant the features are. However, SVMs belong to the family of sample-based classifiers whose generalization performance comes down to, more precisely, how discriminant the samples are. DEGs identified by statistical tests cannot guaranteed to establish a set of discriminant samples for SVMs. Consequently, it cannot be promised the highest degree of accuracy for sample classification. This problem necessitates the development of gene selection algorithms that are more consistent with the fundamental ideas of SVMs. It is naturally desired that, the proposed methods can bypass the computationally-expensive training procedure of SVMs, which is required by the SVM-RFE algorithm [26] and wrapper methods based on Least-Squares SVMs [21,22].

Support Vector Machines
Given a binary classification problem with the training data set of: where d is the number of features and each y i (i~1, . . . ,') is the class label for the training sample x i . As depicted in Fig. 1, the SVM algorithm seeks the separating hyperplane H 0 which possesses optimal generalization abilities.
The hyperplane H 0 takes the form of w T xzb~0 where w is the normal vector and the constant b the bias term. The classifier vw,bw is constructed so that samples from the positive class lie above the hyperplane H z : w T xzb~z1 while samples from the negative class lie beneath the hyperplane The condition of optimality requires that the vector w be a linear combination of the training samples: Each constant a i (i~1, . . . ,') is the Lagrangian multiplier introduced for sample x i . The feasible value range for the a i 's is ½0,C where C is the regularization parameter and tunes the tradeoff between generalization abilities and the empirical risk.
For nonlinear problems where the training data are not separable in the input space, a function, denoted as w( : ), is applied, mapping the data to a feature space of higher dimensions where they become separable. Consequently, the normal vector of the resultant classifier becomes: Equation (2) which represents the solution in the linear case, can also be viewed as a special case of Equation (3) On a test sample z, the SVM classifier outputs a decision value of: According to the sign of the f (z), the sample z obtains a class label of either z1 or {1.
Equation (4) suggests that the SVM algorithm requires the knowledge of the dot product between w(x i ), rather than that of w(x i ) itself. Thus the SVM employs the ''kernel trick'' which allows the the dot product between w(x i ) to be computed without the explicit knowledge of the function w( : ). Mining the Information Hidden in the SVM Solution As mentioned previously, each training sample is eventually assigned a Lagrangian multiplier a i (i~1, . . . ,'), subject to 0ƒa i ƒC. The establishment of the SVM classifier is, in actual fact, a process of optimizing the values of these ' Lagrangian multipliers. In the SVM solution which is formulated by Equation (4), a i 's can be divided into three groups which respectively satisfy a i~0 , 0va i vC and a i~C .
Using Figure 1, we now focus on the linear SVM classifier and review the connection between the value of a i and the geometric location of its associated training sample x i . It is worth attention that the connection arises, mathematically, from the optimality conditions of SVMs [37,38]. We then reveal the hidden information that can be mined out of this connection.
Depending on its class label y i , x i lies geometrically either in the space above H z : w T xzb~z1 for y i~z 1 or in the space beneath H { : w x T zb~{1 for y i~{ 1. Consider a sample x i whose y i~z 1. Since it locates in the subspace above H z , we expect x i bearing noticeable similarities to class ''+'' than to class ''{''. The similarity of x i and class ''+'' can be measured by evaluating the the similarity between x i and each representative sample from class ''+''. The training set of the SVM is, or has been supposed to be, composed of representative samples from each class.
We use the inner product to measure the similarity level between vectors. Denoting the number of the positive training samples as n z , the inner products between x i and each each positive training sample x j form a population of n z measurements, denoted as fx T i x j , j~1, . . . ,n z g: The mean of these measurements, denoted as u z i , is regarded to be indicative of the similarity of x i and class ''+'': Likewise, denoting the number of the negative training samples as n { , the similarity of x i and class ''{'' can be measured as: where the set fx j ,j~1, . . . ,n { g consists of all the negative training samples.
As a result, we can express, mathematically, the expectation that a positive sample bears more resemblance to class ''+'' than to class ''{'' as: And a negative training sample x i whose y i~{ 1 and a i~0 is expected to satisfy: Equation (7) and Equation (8) can be combined into: 2 i with 0va i vC x i with 0va i vC lies exactly on either H z for y i~z 1 or H { for y i~{ 1. This group of training samples are normally referred to as ''boundary samples''.
The class resemblance of a boundary sample to its supposed class is not as striking as those samples with a i v0. Nevertheless, they are still the samples whose class labels can be correctly restored by the SVM solution and thus expected to satisfy Equation (9).
3 i with a i~C x i whose y i~z 1 and a i~C can be located at one of the following three locations: A training sample from group (a), is a boundary sample but its class label can be correctly restored by the SVM solution. As with positive samples whose 0va i vC, Equation (9) is expected to hold for samples from case (a).
For a training sample from group (b), the SVM classifier would not have been able to correctly restore its class label if it weren't for the introduction of the slack variables. Our interpretation is that, the class resemblance of this sample to its supposed class is so ambiguous that the SVM has difficulties in acknowledging its actual class membership.
For a training sample from group (c), the SVM classifier is simply unable to correctly restore its class label. It is very likely that the class resemblance of this sample to its supposed class in fact contradicts its given class label.
In mathematical terms, we reckon that a positive training sample of either group (b) or group (c) satisfies: The hidden information for x i whose y i~z 1 and a i~C can be inferred in a similar manner. And the formulation that describes a sample x i with a i~C can be generalized as: In summary, the resultant value of a i for the training samples x i suggests how discriminant x i is between two opposing classes. But the values of a i 's can only be obtained after the completion of the training procedure which is of a formidable time complexity of O(' 3 ).
Luckily, our analysis above implies that that the function of is, promisingly, indicative of the discriminant level of x i . In other words, the vector of fx T i x j ,j~1, . . . ,'g is highly informative about the complexity of classifying x i by the linear SVM classifier. It is easy to infer that, for nonlinear problems, this information can be obtained from the vector of fw(x i ) T w(x j ),j~1, . . . ,'g.  x .
x . x

Estimating the Separability of a Problem
In the SVM algorithm, the vector of fw(x i ) T w(x j ),j~1, . . . ,'g constitutes the i-th column of the input kernel matrix. The ' measurements in the i-th column can be separated into two populations, according to the class label of sample i, and respectively denoted as K zi and K {i . Performing the following test to the two populations yields a score s i which measures the separability of x i : where u z i (u { i ) and s z i (s { i ) are the mean and the standard deviation of K zi (K {i ).
We justify the introduction of standard deviations in the denominator by considering two positive training samples in the feature space. The first sample is assumed to have come from a region of denser population than the second one. We reckon that, compared with the second sample, the first sample is more typical a representative of class ''+'' and is believed to be more similar to class ''+''. The positive sample from a denser population is expected to exhibit a lower deviation of the elements fw(x i ) T w(x j ),(j~1, . . . ,n z )g: Thus, the standard deviation is formulated into Equation (12), demonstrating our confidence in a higher separability of a sample from a denser population.
The values of s i 's can be split into three types, large positive ones, small positive ones and negative ones. A large positive value of s i implies that, the training sample x i is likely to be discriminant, statistically bearing more resemblance to the supposed class than to the other one. A small positive s i suggests that, x i might bear almost the same level of resemblance to both classes and thus, hard to classify. For a negative value on s i , the class that x i is computed as more similar to, is different from the actual one, which poses difficulties for the SVM classifier.
Meanwhile, the similarity between each sample and itself is supposed to be 1. However, It is not the case for all kernel functions to satisfy K ii~1 . Consequently a proper preprocessing procedure might be required prior to the application of Equation (12), depending on the kernel in use. For linear kernels, we divide each element of the i-th column of the kernel matrix by x T i x i . For Gaussian RBF kernels [29] which take the form of it already holds that K ii~1 and the preprocessing step is avoided. However, the value of the parameter l is required to be optimized. Since the separability of each sample has an impact on the the class separability of a problem, we propose to use the sum of each sample's separability score as an estimate of the separability of the problem.
A word about the formulation of Equation (12). In statistics, it is the norm of practice to add a small constant to the sum of variances, in order to guard against zero in the denominator. But for our algorithms, the designation of K ii~1 prevents the occurrence of zero in the denominator of Equation (12). We explain how it is achieved for linear kernels and Gaussian RBF kernels: (1) In the case of linear kernels, take a positive sample x i for example. Since K ii~1 , in order to have s z i~0 for Equation (12), it demands that K ij~x for any j whose x j is a positive sample. This requires that x T i (x i {x j )~0. This set of conditions can only satisfied either when x i~0 or x i~xj which suggests that the training set only include one positive sample. We reckon that either case is unlikely for well-posed classification problems.
(2) In the case of Gaussian RBF kernels, in order to have s z i~0 given a positive sample, it has to be met that K ij~e xp ({lEx i {x j E 2 )~1 for any j whose x j is a positive sample. This in fact implies that either x i~xj which is hardly true with real-life microarray datasets, or the parameter l has been assigned a value of zero, which can be easily avoided.

Kernel Matrix Induced Gene Selection Algorithms
Since each gene subset introduces a classification problem represented by the set of training samples, the gene subset thus corresponds to an estimate of the separability of the problem. Consequently, DEGs can be identified as those resulting in ''easier problems'' of high separability. This is the essential idea of our kernel matrix induced gene selection methods, which has been illustrated in Figure 2. This methodology is shared by the two gene selection algorithms we proposed below. The first algorithm, namely the Kernel Matrix Gene Selection (KMGS), ranks each gene individually, while the second one, namely the Kernel Matrix Sequential Forward Selection (KMSFS), identifies DEGs iteratively.
Kernel Matrix Gene Selection. Given a microarray dataset of ' samples with d genes, the n-th (1ƒnƒd) gene of the ' samples forms a vector. The vector, in fact, establishes a training set for the following classification problem: where x in is the value of n-the gene for the i-th sample and the y i is its given class label. Given the training set, the separability of each sample, denoted as s i (n), can be assessed using Equation (12). The class separability of the problem constructed from the n-th gene can thus be computed: while the reason behind using (15) is that the class separability of a problem is reflected by the sum of the separability of each sample.
Hence the function f (n) maps a gene to the separability level, in the contexts of sample-based classifiers including the SVM.
The d genes are ranked according to their respective f (n) value where n[½1,d. Genes achieving a large f (n) obtain higher rankings.
Kernel Matrix Sequential Forward Selection. An alternative to the KMGS which proceeds in a greedy manner is also developed, which is namely the Kernel Matrix Sequential Forward Selection (KMSFS) algorithm. The algorithm starts with an empty set of selected DEGs. At each iteration, the algorithm identifies a single DEG which is then appended to the set. We now describe how the KMSFS algorithm proceeds between two consecutive iterations.
Given a microarray dataset of ' samples with d genes, at the nth iteration, n genes has been collected into the set of DEGs. This in fact stands for a classification problem with the training set composed of ' samples, each of which is of n dimensions: Each gene from the remaining (d{n) genes is, in turn, appended to these n genes and forms a different classification problem with a training set of ' samples, each of which is of (nz1) dimensions. This results in, altogether, (d{n) data matrices of size '|(nz1) which are actually the training sets for (d{n) classification problems. The complexity of each problem can be estimated and interpreted as the significance of the associated (nz1)-th gene. The (nz1)-th DEG is eventually identified to be the one which produces the problem featuring the highest separability.
The pseudo codes for the KMGS and KMSFS algorithms are given respectively in Table 1 and Table 2.

Merits of Proposed Algorithms
The proposed methods have noticeable merits: 1. Filter methods identify discriminant features, making them suitable for feature-based classifiers whose normal vector is the linear combination of features. However, Equation (3) demonstrates that the SVM classifier is the linear combination of training samples in the feature space. Thus the performance of the SVM comes down, more to discriminant levels of samples than those of features. Since discriminant features selected by filter methods are not guaranteed to generate a training set composed of discriminant samples, the resultant classifier cannot be ensured to be optimally accurate either. In contrast, our algorithms aim at unveiling the information regarding discriminant levels of samples using the kernel function. Our algorithms are developed upon the fundamental ideas of SVMs and thus more likely to produce a classifier of a higher degree of accuracy.

2.
A majority of wrapper and embedded methods are based on the assumption that most microarray datasets pose linear problems. However, we reckon that, the problem presented by a set of DEGs can hardly be a linear one when the the set size is as small as only one or two.
But the generalization to nonlinear cases have been challenging for various wrappers and embedded methods. For example, the SVM-RFE [27] keeps unchanged the Lagrangian multipliers a i 's from the previous iteration and then selects the gene which makes the least change to the dual objective function. The strategy of fixing a i 's is likely to compromises the significance evaluation of each gene, as well as the generalization abilities of the resultant SVM classifier.
Advantageously, our algorithms can be directly applied to nonlinear cases by opting for Gaussian RBF kernels. The Gaussian RBF K ij , according to Mercer's conditions, is an inner product of x i and x j in the feature space: where W is the function mapping a sample from the input space to the feature space. Thus, for the nonlinear case, the Gaussian kernel matrix is still composed of similarity measurements between training samples.
3. The output of Equation (15) which measures the significance of genes is a real-valued number rather than an integer. This avoids the ties problem [40] which often occurs to count based wrapper methods including the one using the leave-one-out cross validation error as the selection criterion [21].

Datasets and Data Preprocessing
Prostate dataset. The dataset contains, in total, 136 samples of two types which respectively have 77 and 59 cases. Each sample includes expression values of 12600 genes.
m is a user-defined integer which indicates the number of selected genes. FOR n~1, . . . ,d -A training data set is formed by the n-th feature of the original training set: -Construct the kernel matrix K (n) . In the case of the linear kernel function, apply the aforementioned preprocessing procedure to ensure that: K (n) ii~1 for each column.
-At the j-th column entry, the measurement K (n) ij is assigned into either the population K (n) zj or K (n) {j depending on the class label y i . Perform the following statistical test on the two populations: m is a user-defined integer which indicates the number of selected genes.
c is the index set of candidate features. Initially c i~i where i[f1, . . . ,dg.
-D is a '|' matrix which is D 0~0 at the start.
ij is: -Add the c n -th feature to the set of previously (k{1) selected features whose indices are fc 1 , . . . ,c k{1 g.
A training data set is formed by the k features of the original training set: 1g -Construct the kernel matrix K (n) . In the case of the linear kernel function, apply the aforementioned preprocessing procedure to ensure that K (n) ii~1 for each column.
-Calculate the score f (n) w.r.t K (n) ij using (12) and (15). Leukaemia dataset. The dataset was collected from 72 patients. 47 of them were diagnosed with acute acute lymphoblastic leukemia (ALL) and 25 with acute myeloid leukemia (AML). Expression values of 7129 genes were measured.
Both the prostate dataset and the colon dataset were normalized using the following procedure. A microarray dataset with ' samples and d genes was arranged as a matrix of ' rows and d columns. Each row of the matrix was standardized so that the mean and the standard deviation for the row vector are respectively zero and unity. Next, each column of the resultant matrix was standardized to have zero mean and unity standard deviation. No further processing was conducted. All the simulations and comparisons have been performed on the standardized data.
For the leukemia dataset, we applied the pre-processing procedure proposed by Dudoit et al. [41] which consisted of (i) thresholding (floor of 100 and ceiling of 16000), (ii) filtering (exclusion of genes with max/min~5 and max-min~500 across the samples), (iii) base 10 logarithmic transformation, leaving us with 3571 genes. Next, we applied Fisher's ratio and selected the 1000 top DEGs. For each individual gene, Fisher's ratio assigns it a score using the function f~(m z {m { ) 2 =(s 2 z zs 2 { ) where m z (m z ) and s z (s { ) are respectively the mean and the standard deviation across samples from the positive(negative) class. The preprocessing strategy which was also employed by [21] makes possible a fairer comparison between our experiment results and those reported in [21]. All the simulations and comparisons regarding the leukemia dataset have been performed on the preprocessed and pre-selected data.

Error Rate Estimation Techniques
Various gene selection algorithms are evaluated and compared by the error rate of SVMs. The simplest technique for error estimation is the holdout method which splits the dataset into a training set and a test set. The gene selection algorithm is performed on the training set and sample classification on the test set. However, the holdout method has been highly discouraged for microarray datasets which usually contain a small number of samples. In contrast to researchers who applied gene selection to the entire training set and employed k-fold cross validation to assess the selected DEGs, Ambroise and McLachlan [42] emphasized to exclude samples used for validation from the gene selection procedure and labelled techniques that follow their recommendation as ''external'' ones. They suggested that external 10-fold cross validation and external B.632+ bootstrap could produce unbiased estimate [42,43]. Due to the problem of high variance with cross validation techniques when applied to microarray datasets [44], we used the external B.632+ estimator for the comparison of gene selection algorithms.
The B.632+ estimator involves resampling, with replacement, of the original dataset. From a dataset of ' samples denoted as X~fx 1 , . . . ,x ' g, a single sample is randomly drawn and then put back at each time. This process is repeated ' times, leading to a new set which is denoted as X Ã~( x Ã 1 , . . . ,x Ã ' ). The resampled set X Ã includes, with probability, duplicates of a sample from the original set X. The number of duplicates for a sample included in X Ã ranges from 0 to '. The set of X Ã is used for both gene selection and training a SVM. The SVM classifier is then tested on the set of (X{X Ã ). A good error estimator requires the generation of B resampled sets which are denoted as X Ã1 ,X Ã2 , . . . ,X ÃB where it was recommended that B §200. We set B~200 for all our experiments. Meanwhile, for each sample x i [X(i~1, . . . ,'), its overall number of occurrences in the B resampled sets is ensured to be B, which further reduces the variance.
The flow chart for evaluating a gene selection algorithm using the B.632+ technique has been given by Figure 3.

Gene Selection Algorithms
For the methods of KMSFS and KMGS, both the linear kernel and the Gaussian RBF kernel were tested. This resulted in altogether four algorithms which are referred to as Gaussian KMSFS, Gaussian KMGS, linear KMSFS and linear KMGS. They were compared with two wrapper methods: the leave-oneout calculation sequential forward selection (LOOSFS) which improved the least-squares bound measure [21] by easing the ties problem, and the gradient-based leave-one-out gene selection (GLGS) method [22] which was claimed to outperform the SVM-RFE algorithm [26]. Comparisons were also made to a number of filter methods, including the aforementioned Fisher's ratio [45], Cho's [46] and two other methods of Yang's [47]. We described the ideas of these gene selection algorithms below.

Leave-One-Out Calculation Sequential Forward Selection
(LOOSFS). The Leave-One-Out Cross-Validation(LOOCV) error has been generally used for measuring the generalization abilities of SVMs and Least-Squares SVMs (LS-SVMs). The LOOSFS method thus identifies as DEGs those genes which result in a LS-SVM classifier with the minimal Leave-One-Out Cross-Validation(LOOCV) error rate. The beauty of the algorithm consists in the efficient and exact computation of the LOOCV error. To address the ''ties problem'' in which multiple gene subsets achieve the same LOOCV error rate, a further selection criterion is imposed which favors the gene subset with minimal empirical risk.

Gradient-Based
Leave-One-Out Gene Selection (GLGS). The starting point of the GLGS method is also the employment of the exact formulation of the LOOCV error for LS-SVMs. The method then utilizes the gradient descent algorithm to seek a diagonal matrix which eventually minimizes the LOOCV error. Genes are ranked according to the absolute values of the diagonal elements of the diagonal matrix. With Cho's method and Yang's methods, genes are individually ranked. We use the following notations for their descriptions. Each micorarray dataset with ' samples and d genes is treated as a matrix, denoted as fa ij g, where a ij indicates the expression value of gene j for sample i. Given a n-class problem, the average expression value of each class, in terms of gene j, can be computed and denoted as the set of ( a a 1j , a a 1j , . . . , a a nj ). Denote the standard deviation for the set asâ a j . A matrix, denoted as fx ij g is also introduced, where x ij~ffi ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi (a ij { a a ij ) 2 q . Cho's Method. The score, denoted as s(j), that gene j obtains eventually is: where w i is the reciprocal of the number of samples that share the same class label as sample i and W~P ' i~1 w i . A small value of s(j) indicates that samples of the j-th gene are clustered the centroid of each class.
Yang's Methods. The between-class variation with respect to gene j, denoted as scatter(j), is formulated as: In order to estimate within-class variations in terms of gene j, a function d(j) is first introduced: where f (fx kj g) is a function of fx kj g which are composed of the elements from the j-th column of the matrix fx ij g and associated with the k-th class. Denote x x kj as the mean of of fx kj g. f (fx kj g) can be either the squared x x kj or the mean of fx 2 kj g, which results in two forms of d(j) which are referred to as d 1 (j) and d 2 (j) respectively.
Two metrics for measuring within-class variations, denoted as compact 1 (j) and compact 2 (j) which are derived respectively upon d 1 (j) and d 2 (j), are proposed: where m(j) is the mean of the set of f x x kj g where k~1, . . . ,n. Eventually, two score functions which decide the ranking of gene j, are given: s p (j) = compact p (j)=scatter(j), p = 1,2 We refer to these two score functions respectively as Yang's method 1 and Yang's method 2.
Cho's method and Yang's two methods identify as DEGs those genes whose associated s(j) are smaller. For all the gene selection algorithms, it has been emphasized to exclude the test subset each time from the gene selection procedure in order to obtain an unbiased evaluation [42,43]. The gene selection algorithms terminated when a specific number of DEGs have been identified and we set this number to be 100.

Parameter Tuning
We employed grid search and Friedman rank sum tests combined with Holm correction to tune the parameters for different algorithms.
Grid Search. Among the total 10 gene selection algorithm, both Gaussian KMSFS and Gaussian KMGS require the parameter l in Equation (13) to be optimized. For the LOOSFS algorithm, the regularization parameter, denoted here as l for consistency, has to be tuned. l was varied sequentially from 10 {5 to 10 5 in multiples of 10, which made up a total of 11 different values. With respect to sample classification, the linear SVM was used throughout. Its regularization parameter, denoted as C, ranged from 10 {3 to 10 3 in multiples of 10, which gives 7 different values. Thus, 77 value pairs for (l,C) were tested for Gaussian KMSFS, Gaussian KMGS and LOOSFS algorithms, while 7 Friedman Rank Sum Test with Holm Correction. The Friedman rank sum test is a non-parametric alternative to ANOVA with repeated measures. The test statistic for the Friedman test is a Chi-square with n{1 degrees of freedom, where n is the number of repeated measures.
We take the algorithm of Gaussian KMSFS as an example to explain how to apply Friedman test for the discovery of optimal values on (l,C). As mentioned previously, 100 DEGs were selected, from each a new classification problem arose. We thus obtained altogether 100 B.632+ error rates for each setting on (l,C). As we tried 77 settings for the parameter pair, 77 groups of classification accuracies were obtained.
Friedman rank sum test was used to detect statistical differences among these 77 groups. The test was based on 100 sets of ranks, with each set corresponding to an individual classification problem. The performances of different parameter settings analyzed are ranked separately for each problem. If we rejected the null-hypothesis stating that all the 77 settings led to equal performance in mean ranking, we employed the Holm post-hoc analysis to identify which setting was significantly better than the rest.
All the gene selection methods were coded in Matlab. The linear SVM was implemented using LIBSVM [48] and the Friedman test with Holm correction was coded in R. The specifications of the computer running the experiments were: Intel core i5-2320 quad-core processor 3.0 GHz, Memory 4 GBytes and the operating system of Windows 7.
GLGS and linear KMSFS shared the optimal setting of C~0:01. For linear KMGS and all the filter methods which are respectively Fisher's ratio, Cho's method and the two methods of Yang's, the optimal parameter settings were uniformly C~0:1.
Comparisons against Wrappers. With a minimal error rate of 0:1240 and a mean error rate of 0:1932, the performance of GLGS was much worse than that of the other 9 methods. Thus its simulation results were not graphically presented. Figure 4 illustrates the the B.632+ error rates as a function of the number of DEGs, for the algorithms of Gaussian KMGS, Gaussian KMSFS, LOOSFS, linear KMSFS and linear KMGS. It can been seen that, when the number of DEGs fell between 10 and around 60, linear KMSFS which is represented by the green solid line dotted with upper triangles, remained the best. As the number of DEGs further increased, Gaussian KMGS outperformed the rest and achieved the lowest B.632+ error rate. As shown by Figure 4, the classical LOOSFS was outperformed by our algorithms including the Gaussian KMGS, linear KMSFS and linear KMGS. When the number of DEGs ranged between 10 and 20, Gaussian KMSFS also performed better than LOOSFS.
Comparisons against Filters. Figure 5 compares linear KMSFS, Gaussian KMGS against the filter methods of Fisher's ratio, Cho's method as well as the two methods of Yang's.
The error rate of linear KMSFS remained noticeably lower than the filter methods when the number of DEGs fell between 10 and 60. When the number of DEGs grew larger, Gaussian KMGS showed better than performance than the four filter methods.
The performance of Gaussian KMGS and linear KMSFS remained competitive to those of the filter methods, respectively between the value ranges of ½0,60 and ½60,100 for the number of DEGs.
These comparisons lead to the conclusion that linear KMSFS and Gaussian KMGS are the two best methods for the prostate dataset.
For the other 3 wrapper methods which are respectively linear KMGS, linear KMSFS, GLGS and the 4 filter methods which are respectively Fisher's ratio, Cho's method and the two methods of Yang's, their optimal parameter settings were found to be C~0:01.
Comparisons against Wrappers. With a minimal error rate of 0:1577 and a mean error rate of 0:2100, GLGS performed much worse than the other nine methods. Thus again its simulation results were not graphically presented. Figure 6 illustrates the the B.632+ error rates of Gaussian KMGS, Gaussian KMSFS, linear KMSFS, linear KMGS and LOOSFS. It can been seen that Gaussian KMSFS demonstrated the best performance while the LOOSFS the worst performance. Gaussian KMGS, linear KMSFS and linear KMGS also performed slightly better than LOOSFS, particularly when the number of DEGs ranged between 15 and 45.
It is interesting to note that Gaussian KMGS, with only 10 DEGs, reached the lowest B.632+ error rate which was approximately 0.10. Also the lowest B.632+ error rate of LOOSFS, which was 0.11 was lower than that reported in [21] which was around 0.15 on the colon data. We reckon it could be due to the employment of different data preprocessing strategies.
Comparisons against Filters. Figure 7 compares linear KMSFS and Gaussian KMGS against the filter methods of Fisher's ratio, Cho's method as well as the two methods of Yang's.
Gaussian KMSFS remained better than the 4 filter methods whose performances were comparable between each other. Meanwhile, the error rates of Gaussian KMGS were also lower than those of the filter methods, particularly for a smaller number of selected DEGs.    In conclusion, Gaussian KMSFS and Gaussian KMGS have proved to be the best methods for the colon dataset.
For GLGS, the optimal setting was found to be C~0:01. For linear KMGS, linear KMSFS and all the filter methods which are respectively Fisher's ratio, Cho's method and the two methods of Yang's, their optimal parameter settings were uniformly C~0:1.
Comparisons against Filters.  consistent with that reported in [21]. The performance of Gaussian KMSFS remained competitive to that of LOOSFS. Meanwhile, the lowest B.632+ error rate was achieved by the Gaussian KMSFS with around 50 selected DEGs.
However, Gaussian KMGS, linear KMGS and linear KMSFS failed to perform as well as LOOSFS. We reckon it might be attributable to the preprocessing procedure which resulted in the removal of over 86% of the original 7029 genes, although this viewpoint has to be confirmed with more experiments.
Comparisons against Filters. Figure 9 further compares the performance of Gaussian KMSFS against the filter methods of Fisher's ratio, Cho's method as well as the two methods of Yang's. It was demonstrated that, the error rates of Gaussian KMSFS remained noticeably low than those of the 4 filter methods throughout.
We regarded the LOOSFS and the Gaussian KMSFS as the two best gene selection algorithms for the leukemia dataset.

Heatmaps of Differentially Expressed Genes
Due to the employment of B.632+ error estimation technique, each gene selection algorithm was applied to the 200 sets of bootstrap samples as well as the original training set. For each of these 201 sample sets, we selected a sequence of 100 DEGs. This resulted in altogether 201 sets each of which contained 100 DEGs.
We calculated the frequency with which each of the 2000 genes was selected into the 201 sets of DEGs and drew the heatmaps of 50 DEGs that were selected most frequently. For the algorithms of Gaussian KMGS, Gaussian KMSFS and LOOSFS, the outcome of gene selection procedures is influenced by the value setting on the parameter l and we used the optimal values reported in the previous section.
Heatmaps for the ten gene selection methods, were shown by Figure 10, Figure 11 and Figure 12. In each heatmap, each column corresponds to a sample and each row is the normalized expression values of a selected DEG across the 62 samples. A grid of each heatmap is colored according to the color key at the top of Figure 10 which maps a normalized expression value to a specific color between blue and red. The class of a sample at each column is indicated by the color bar at the top of each heatmap where blue indicates the cancerous case and red the normal case. Along the downward direction, the 50 DEGs are displayed in descending order of their frequency of occurrence in the 201 sets of selected genes.
It can be seen from Figure 11 that, the filter methods tend to favor ''discriminant features'' whose color forms an obvious contrast between the cancerous population and the normal population at each row. In Figure 10(b) which represents Gaussian KMSFS method, the color of genes at each row is in a pattern of ''occasional dotting of red versus a majority of blue''. The color contrast at each row of Figure 10(b) is less noticeable than Figure 11. Nevertheless, Gaussian KMSFS demonstrated the best prediction accuracies among all the methods, as reported in the previous section.
The second best gene selection algorithm for the colon data is Gaussian KMGS whose selected DEGs have been presented by Figure 10(a). Interestingly, between the two opposing classes, Figure 10(a) exhibited a sharper color contrast than the one exhibited by Figure 10(b).
The above facts suggest that, although filter methods selected genes whose values, in general, differ significantly between opposing classes, our kernel induced algorithms seemed not to hold it as the selection criterion. Instead, our methods endeavored to select genes that could establish a set of ''discriminant samples'' for SVMs. This possibly accounts for their superiority in terms of B.632+ error rates on the colon dataset. Using the prostate dataset and the colon dataset, we employed the Friedman rank sum test with Holm correction to study the sensitivity of sample classification to value settings on C and l respectively.
Sensitivity of Sample Classification to l. We kept C fixed at a specific value and ran Friedman rank sum tests with Holm correction for various choices of l. The results were given by Table 3 each row of which reports the score for different values on l with C fixed at a specific value. The best choice is the one which obtained the lowest score and has been highlighted in bold for each row. Since the optimal value setting for C for Gaussian KMGS was found to be 0.1, we have labelled the associated row with an asterisk. It can be seen that, fixing C at 0.1, l~1 is significantly better than all the rest, except for l~0:1, at confidence levels of both 0:95 and 0:99. This suggests sample classification is insensitive to choices of l between 0.1 and 1 for C~0:1. For C~1, l~1 is significantly better than all the rest at confidence levels of 0.95 and 0:99. Thus sample classification is insensitive to choices of l between 0.1 and 1 for C~0:1. For the other rows for which C was fixed at 10, 10 2 and 10 3 respectively, sample classification remained insensitive to choices of l between 0.1 and 1.
We can conclude that B.632+ error rates are insensitive to choices of l between 0.1 and 1 in terms of Gaussian KMGS.
At the first row for Gaussian KMSFS, we can see that 10 {3 is significantly better than 10 {1 and other larger settings on l. At the next row, l~0:1 is significantly better than the rest, exclusive of l~10 {5 , at confidence levels of 0.95. l~10 {4 was also excluded at the confidence level of 0.99. The row of C~0:1 has been labelled with an asterisk, indicating that 0.1 is the optimal choice for C for Gaussian KMSFS. And we can see that l~0:1 is significantly better than the other settings at confidence levels of both 0.95 and 0.99. This is also the case with the row corresponding to C~1.
For the remaining three rows whose C was fixed at respectively 10, 10 2 and 10 3 , uniformly, l~10 {1 is significantly better than the rest at the confidence level of 0.95.
Thus, for Gaussian KMSFS, B.632+ error rates are sensitive to choice of l, when C was fixed at 0.1 and 1.
Colon Dataset. Similar analysis can be performed for the colon dataset, whose scores for various l produced by the Friedman tests have been reported at the bottom half of Table 3.
Regarding Gaussian KMGS, we can see that B.632+ error rates are insensitive to choices of l between 10 and 100 when C goes from 1 to 10 3 in multiples of 10. For C~10 {2 which is its optimal setting, B.632+ rates are insensitive to choices of l between 0.1 and 1.
In terms of Gaussian KMSFS, we can see that sample classification is sensitive to l at the rows of C~10 {2 and C~10 {1 . For C[f1,10 2 ,100 3 g, it shows that B.632+ error rates are insensitive to l at l~10 {5 and l~10 {2 . Larger values for l caused severe performance degradation, as suggested by the scores at the bottom right in Table 3.
Sensitivity of Sample Classification to C. We then kept l fixed at a specific value and ran Friedman rank sum tests with Holm correction for varied C's. The results were given by Table 4 each row of which reports scores for various choices of C at a specific value setting on l. The best choice at each row is the one with the lowest score and has been highlighted in bold.
Prostate Dataset. For Gaussian KMGS, we can see that B.632+ error rates are sensitive to choices of C. For different l's, C~0:1 remained the setting that linear SVMs achieved the best performance.
For Gaussian KMSFS, when l grows from 10 {5 to 10 {1 , B.632+ error rates remaine sensitive to l and the best performance was always achieved at l~10 {2 . As C continues to grow larger, B.632+ error rates appear to be insensitive between 0.1 and 1, which is particularly true with the row corresponding to l~1.
C is the sole parameter for linear KMGS and linear KMSFS. We can also see from Table 4 that, for both algorithms, B.632+ error rates are sensitive to C.
Colon Dataset. Table 4 indicates that, for both Gaussian KMGS and Gaussian KMSFS, the classification performance is insensitive to choices of C between 0.01 and 0.1, for values of l greater than 1. Nevertheless, for smaller values on l with both algorithms, B.632+ error rates are sensitive to C.
In term of both linear KMGS and linear KMSFS, B.632+ error rates remain sensitive to C, as with the results on the prostate dataset.

Conclusions
Statistical tests select genes whose expression values differ significantly between the two opposing classes, i.e., the discriminant genes. Samples-based learning machines including SVMs favor genes which results in a set of discriminant samples. The discriminant genes can not be guaranteed to result in a set of discriminant samples. We have shown that the genes leading to discriminant training samples can be detected by applying statistical tests to the kernel matrix.
In addition to the competitive performance demonstrated on the three public microarray datasets, the proposed kernel matrix induced gene selection algorithms offer extra advantages: 1. Generality. Our methods are considered applicable to any kernel classifiers, not just SVMs. 2. Flexibility. For the implementation of our methods, users can opt for any mercer kernel which can be linear, Gaussian RBF, sigmoid, or polynomials. However, depending on the specific kernel, properly-designed preprocessing steps may be required. For examples, Gaussian RBF kernels require the tuning of the width parameter. For linear kernels, strategies are required to ensure the diagonal elements of the kernel matrix, each of which suggests the similarity between a sample and itself, to be uniformly one.
It is also worth attention that microarray datasets have usually been assumed to present linear problems. However, it is unlikely to be true in the case that the number of DEGs is as few as one.
Interestingly, linear problems can be solved by SVMs with nonlinear kernels, while nonlinear problems are hardly solvable with SVMs using linear kernels. A potential solution to the nonlinear problem posed by a small number of DEGs could be the application of nonlinear SVM classifiers. But our method suggested a successful alternative which is the use of the nonlinear Gaussian RBF kernel for the identification of DEGs. We reckon that this effort of instilling ''nonlinearity'' into the identification of DEGs has contributed to the better empirical performance of our methods.

Supporting Information
Additional File S1 The three microarray datasets (PROS-TATE, COLON, and LEU) in MATLAB format which were used in this work. (ZIP)