Analyzing Kernel Matrices for the Identification of Differentially Expressed Genes

Xiao-Lei Xia; Huanlai Xing; Xueqin Liu

doi:10.1371/journal.pone.0081683

Abstract

One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the employment of state-of-the-art learning machines, the Support Vector Machine (SVM) in particular. The SVM is a typical sample-based classifier whose performance comes down to how discriminant the samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method, namely the Kernel Matrix Gene Selection (KMGS), is proposed. The rationale of the method, which is rooted in the fundamental ideas of the SVM algorithm, is described. The notion of ''the separability of a sample'', which is estimated by performing $t$-like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS) which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved noticeably competitive performance in terms of the B.632+ error rate.

Citation: Xia X-L, Xing H, Liu X (2013) Analyzing Kernel Matrices for the Identification of Differentially Expressed Genes. PLoS ONE 8(12): e81683. https://doi.org/10.1371/journal.pone.0081683

Editor: Ken Mills, Queen's University Belfast, United Kingdom

Received: July 10, 2013; Accepted: October 15, 2013; Published: December 9, 2013

Copyright: © 2013 Xia et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by grants from Natural Science Foundation, Zhejiang Province, P.R. China (Project No. LQ13F030011) and National Science Foundation of P.R. China (Project No. 61133010). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Microarray data have been applied to the class prediction of different samples, from which disease diagnosis and prognosis can benefit. A microarray dataset usually contains thousands of genes and a much smaller number of samples. For the purpose of predicting the type of biological samples, a majority of these genes are irrelevant or redundant. This fact has prompted the development of a variety of approaches which detect differentially expressed genes (DEGs) to accomplish an accurate classification of the samples.

The $t$-test has been one of the most widely used parametric statistical methods for the identification of DEGs between populations of two classes. Variants of the $t$-test, which adopt different techniques to obtain a more stable estimate of the within-class variance for each gene, have been proposed [1]–[3]. The regularized $t$-test, for example, adjusted the gene-wise variance estimate by using a Bayesian probabilistic model [2]. For multiple testing, the $p$-value is calculated and adjusted to address the problem that the false positive rate is likely to accumulate over thousands of genes. Approaches in this category range from those bounding the ''Family-Wise Error Rate'' (FWER), which is the overall chance of one or more false positives [4]–[6], to strategies controlling the ''False Discovery Rate'' (FDR), which is the expected percentage of false positives among the genes deemed as differentially expressed [1], [7]. Because the null distribution is unknown, these methods often shuffle the class labels of the samples to estimate the $p$-value. The ANOVA $F$-test extends the $t$-test to multiple classes and a number of $F$-like statistics have been proposed which use different shrinkage estimators of the gene-wise variance [8], [9].

Another family of statistical methods factors in the dependency information between genes. Representative examples include the gene pair selection method [10] and correlation-based methods, whose rationale is that the features in a good subset are highly correlated with the class yet uncorrelated with each other [11], [12]. Also included are the approaches derived from Markov blanket filtering [13]–[15]. Minimum redundancy maximum relevance [16] and uncorrelated shrunken centroid [17] are also well-established gene selection methods in this category.

When cast in the framework of pattern recognition, gene selection is a typical feature selection problem. Feature selection techniques in pattern recognition can be generalized into three types: filter, wrapper and embedded methods [18]–[20]. Filter methods, which cover a majority of the aforementioned statistical tests, perform feature selection independently of a classification algorithm. Wrapper methods, by contrast, use a classifier to evaluate a feature subset. The problem of choosing $d$ out of $n$ features involves altogether $\binom{n}{d}$ feature subsets. An exhaustive evaluation of these subsets is computationally infeasible, particularly for microarray data with a large $n$. A number of heuristic search techniques have thus been proposed, among them the Sequential Forward Selection (SFS), the Sequential Backward Elimination (SBE), the Sequential Forward Floating Selection (SFFS) and the Sequential Backward Floating Elimination (SBFE). The SFS has been used to search for feature subsets which are evaluated by the leave-one-out cross validation accuracy of Least-Squares SVM [21], [22]. Genetic Algorithms (GAs) are another family of search strategies that have attracted considerable research attention [23]–[25].

Embedded methods, on the other hand, use the intrinsic property of a specific classifier to evaluate feature subsets. For example, the SVM Recursive Feature Elimination (SVM-RFE) [26] assumes that the normal vector of the linear SVM carries the significance information of the genes. Representative examples also include random forest induced approaches [27], [28]. An extensive review of major feature selection techniques has been carried out [29]. Despite the diversity and abundance of gene selection algorithms, no general consensus has yet been reached on which one is the best.

Empirically, wrappers and embedded methods have been observed to be more accurate than filters [30]. However, they require repetitive training of a specific classifier in order to guide the search in the space of feature subsets and are consequently very time consuming. Filters are, generally speaking, faster in the absence of interactions between feature subsets and a classifier. Thus filters, statistical tests in particular, have enjoyed considerable popularity in the field of gene selection for microarray data [4], [9], [31], [32]. In fact, wrappers normally incorporate statistical tests as a preprocessing step to prune a majority of genes so that the number of feature subsets to be visited is reduced along the search pathway [21], [22], [26].

Meanwhile, although the choice of the classifier also presents a wide diversity, SVMs have been widely recognized for their generalization abilities [33] and have remained a predominant option [34]–[36].

In summary, a widely accepted scheme for the analysis of microarray data has been ''identification of DEGs by statistical tests followed by sample classification using SVMs''. The justification is that the prediction accuracy of various classifiers, including SVMs, depends on how discriminant the features are. However, SVMs belong to the family of sample-based classifiers whose generalization performance comes down to, more precisely, how discriminant the samples are. DEGs identified by statistical tests are not guaranteed to establish a set of discriminant samples for SVMs. Consequently, the highest degree of accuracy for sample classification cannot be promised. This problem necessitates the development of gene selection algorithms that are more consistent with the fundamental ideas of SVMs. It is also desirable that the proposed methods bypass the computationally expensive training procedure of SVMs, which is required by the SVM-RFE algorithm [26] and by wrapper methods based on Least-Squares SVMs [21], [22].

Materials and Methods

Support Vector Machines

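Given a binary classification problem with the training data set of:

$$\{(\mathbf{x}_i, y_i)\}_{i=1}^{\ell}, \quad \mathbf{x}_i \in \mathbb{R}^{n}, \quad y_i \in \{+1, -1\} \qquad (1)$$

where $\ell$ is the number of training samples, $n$ is the number of features and each $y_i$ is the class label for the training sample $\mathbf{x}_i$.

As depicted in Fig. 1, the SVM algorithm seeks the separating hyperplane which possesses optimal generalization abilities. The hyperplane takes the form of $\mathbf{w}^T\mathbf{x} + b = 0$ where $\mathbf{w}$ is the normal vector and the constant $b$ the bias term. The classifier is constructed so that samples from the positive class lie above the hyperplane $H_+$: $\mathbf{w}^T\mathbf{x} + b = +1$ while samples from the negative class lie beneath the hyperplane $H_-$: $\mathbf{w}^T\mathbf{x} + b = -1$.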

Figure 1. The linear SVM trained on samples from two classes.

Samples located on the hyperplanes $H_+$ ($\mathbf{w}^T\mathbf{x} + b = +1$) and $H_-$ ($\mathbf{w}^T\mathbf{x} + b = -1$) are referred to as ''boundary samples''.

https://doi.org/10.1371/journal.pone.0081683.g001

The condition of optimality requires that the vector $\mathbf{w}$ be a linear combination of the $\ell$ training samples:

$$\mathbf{w} = \sum_{i=1}^{\ell} \alpha_i y_i \mathbf{x}_i \qquad (2)$$

Each constant $\alpha_i$ is the Lagrangian multiplier introduced for sample $\mathbf{x}_i$. The feasible value range for the $\alpha_i$'s is $0 \le \alpha_i \le C$ where $C$ is the regularization parameter and tunes the tradeoff between generalization abilities and the empirical risk.

For nonlinear problems where the training data are not separable in the input space, a function, denoted as $\phi$, is applied, mapping the data to a feature space of higher dimensions where they become separable. Consequently, the normal vector of the resultant classifier becomes:

$$\mathbf{w} = \sum_{i=1}^{\ell} \alpha_i y_i \phi(\mathbf{x}_i) \qquad (3)$$

Equation (2), which represents the solution in the linear case, can also be viewed as a special case of Equation (3) where $\phi(\mathbf{x}) = \mathbf{x}$.

On a test sample $\mathbf{x}$, the SVM classifier outputs a decision value of:

$$f(\mathbf{x}) = \sum_{i=1}^{\ell} \alpha_i y_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x}) + b \qquad (4)$$

According to the sign of $f(\mathbf{x})$, the sample obtains a class label of either $+1$ or $-1$.

Equation (4) suggests that the SVM algorithm requires the knowledge of the dot product between $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x})$, rather than that of $\phi$ itself. Thus the SVM employs the ''kernel trick'' which allows the dot product between $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x})$ to be computed without the explicit knowledge of the function $\phi$.
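As a minimal sketch of Equation (4), the decision value can be evaluated from kernel evaluations alone. The function names, the toy data and the multiplier values below are illustrative assumptions and not part of the original method; Python with NumPy is used purely for illustration.

```python
import numpy as np

def linear_kernel(a, b):
    """Inner product in the input space (the mapping is the identity)."""
    return float(np.dot(a, b))

def decision_value(x, X_train, y_train, alpha, b, kernel=linear_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    total = 0.0
    for a_i, y_i, x_i in zip(alpha, y_train, X_train):
        total += a_i * y_i * kernel(x_i, x)
    return total + b

# Toy problem (assumed): one sample per class with hand-picked multipliers.
X_train = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_train = np.array([1, -1])
alpha = np.array([0.25, 0.25])
b = 0.0

f_pos = decision_value(np.array([1.0, 1.0]), X_train, y_train, alpha, b)
f_neg = decision_value(np.array([-1.0, -1.0]), X_train, y_train, alpha, b)
label = 1 if f_pos > 0 else -1   # the predicted class is the sign of f(x)
```

Swapping `linear_kernel` for any other kernel function changes the classifier without ever computing the feature map explicitly, which is the point of the kernel trick.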

Mining the Information Hidden in the SVM Solution

As mentioned previously, each training sample $\mathbf{x}_i$ is eventually assigned a Lagrangian multiplier $\alpha_i$, subject to $0 \le \alpha_i \le C$. The establishment of the SVM classifier is, in actual fact, a process of optimizing the values of these Lagrangian multipliers. In the SVM solution which is formulated by Equation (4), the $\alpha_i$'s can be divided into three groups which respectively satisfy $\alpha_i = 0$, $0 < \alpha_i < C$ and $\alpha_i = C$.

Using Figure 1, we now focus on the linear SVM classifier and review the connection between the value of $\alpha_i$ and the geometric location of its associated training sample $\mathbf{x}_i$. It is worth noting that the connection arises, mathematically, from the optimality conditions of SVMs [37], [38]. We then reveal the hidden information that can be mined out of this connection.

1. $\mathbf{x}_i$ with $\alpha_i = 0$

Depending on its class label $y_i$, $\mathbf{x}_i$ lies geometrically either in the space above $H_+$ for $y_i = +1$ or in the space beneath $H_-$ for $y_i = -1$.

Consider a sample $\mathbf{x}_i$ whose $y_i = +1$. Since it is located in the subspace above $H_+$, we expect $\mathbf{x}_i$ to bear more noticeable similarity to class ''+'' than to class ''$-$''. The similarity of $\mathbf{x}_i$ and class ''+'' can be measured by evaluating the similarity between $\mathbf{x}_i$ and each representative sample from class ''+''. The training set of the SVM is, or has been supposed to be, composed of representative samples from each class.

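We use the inner product to measure the similarity level between vectors. Denoting the number of the positive training samples as $\ell_+$, the inner products between $\mathbf{x}_i$ and each positive training sample form a population of measurements, denoted as $\{\mathbf{x}_j^T\mathbf{x}_i \mid \mathbf{x}_j \in S_+\}$, where the set $S_+$ consists of all the positive training samples. The mean of these measurements, denoted as $\mu_i^+$, is regarded to be indicative of the similarity of $\mathbf{x}_i$ and class ''+'':

$$\mu_i^+ = \frac{1}{\ell_+} \sum_{\mathbf{x}_j \in S_+} \mathbf{x}_j^T \mathbf{x}_i \qquad (5)$$

Likewise, denoting the number of the negative training samples as $\ell_-$, the similarity of $\mathbf{x}_i$ and class ''$-$'' can be measured as:

$$\mu_i^- = \frac{1}{\ell_-} \sum_{\mathbf{x}_j \in S_-} \mathbf{x}_j^T \mathbf{x}_i \qquad (6)$$

where the set $S_-$ consists of all the negative training samples.

As a result, we can express, mathematically, the expectation that a positive sample bears more resemblance to class ''+'' than to class ''$-$'' as:

$$\mu_i^+ > \mu_i^- \qquad (7)$$

And a negative training sample $\mathbf{x}_i$ whose $\alpha_i = 0$ and $y_i = -1$ is expected to satisfy:

$$\mu_i^+ < \mu_i^- \qquad (8)$$

Equation (7) and Equation (8) can be combined into:

$$y_i \left( \mu_i^+ - \mu_i^- \right) > 0 \qquad (9)$$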
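The class-similarity comparison just described can be checked numerically. The sketch below computes, for every training sample, its mean inner product with the positive samples and with the negative samples, and then the signed difference that Equation (9) expects to be positive. The toy data and the names `class_similarities` and `margin` are assumptions made for illustration.

```python
import numpy as np

def class_similarities(X, y):
    """Return (mu_plus, mu_minus): the mean inner product of every sample
    with the positive and with the negative training samples."""
    G = X @ X.T                          # G[j, i] = x_j . x_i
    mu_plus = G[y == 1].mean(axis=0)     # average over positive rows j
    mu_minus = G[y == -1].mean(axis=0)   # average over negative rows j
    return mu_plus, mu_minus

# Toy data (assumed): two well-separated classes in two dimensions.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

mu_plus, mu_minus = class_similarities(X, y)
margin = y * (mu_plus - mu_minus)        # positive when a sample resembles its own class
```

On this toy set every entry of `margin` is positive, i.e. each sample is on average more similar to its own class than to the opposing one.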

2. $\mathbf{x}_i$ with $0 < \alpha_i < C$

$\mathbf{x}_i$ with $0 < \alpha_i < C$ lies exactly on either $H_+$ for $y_i = +1$ or $H_-$ for $y_i = -1$. This group of training samples is normally referred to as ''boundary samples''.

The class resemblance of a boundary sample to its supposed class is not as striking as that of samples with $\alpha_i = 0$. Nevertheless, boundary samples are still samples whose class labels can be correctly restored by the SVM solution and are thus expected to satisfy Equation (9).

3. $\mathbf{x}_i$ with $\alpha_i = C$

$\mathbf{x}_i$ whose $\alpha_i = C$ and $y_i = +1$ can be located at one of the following three locations:

(a) exactly on the hyperplane of $H_+$;
(b) in the region between the hyperplanes of $H_+$ and $H_-$ but closer to $H_+$;
(c) in the region between the hyperplanes of $H_+$ and $H_-$ but closer to $H_-$.

A training sample from group (a) is a boundary sample whose class label can still be correctly restored by the SVM solution. As with positive samples whose $0 < \alpha_i < C$, Equation (9) is expected to hold for samples from group (a).

For a training sample from group (b), the SVM classifier would not have been able to correctly restore its class label if it weren't for the introduction of the slack variables. Our interpretation is that, the class resemblance of this sample to its supposed class is so ambiguous that the SVM has difficulties in acknowledging its actual class membership.

For a training sample from group (c), the SVM classifier is simply unable to correctly restore its class label. It is very likely that the class resemblance of this sample to its supposed class in fact contradicts its given class label.

In mathematical terms, we reckon that a positive training sample of either group (b) or group (c) satisfies: (10)

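The hidden information for $\mathbf{x}_i$ whose $\alpha_i = C$ and $y_i = -1$ can be inferred in a similar manner. And the formulation that describes a sample with $\alpha_i = C$ can be generalized as: (11)

In summary, the resultant value of $\alpha_i$ for the training sample $\mathbf{x}_i$ suggests how discriminant $\mathbf{x}_i$ is between two opposing classes. But the values of the $\alpha_i$'s can only be obtained after the completion of the training procedure which is of a formidable time complexity of .

Luckily, our analysis above implies that the function of $y_i(\mu_i^+ - \mu_i^-)$, where $\mu_i^+$ and $\mu_i^-$ are the mean inner products of Equations (5) and (6), is, promisingly, indicative of the discriminant level of $\mathbf{x}_i$. In other words, the vector of $[\mathbf{x}_1^T\mathbf{x}_i, \dots, \mathbf{x}_\ell^T\mathbf{x}_i]^T$ is highly informative about the complexity of classifying $\mathbf{x}_i$ by the linear SVM classifier. It is easy to infer that, for nonlinear problems, this information can be obtained from the vector of $[\phi(\mathbf{x}_1)^T\phi(\mathbf{x}_i), \dots, \phi(\mathbf{x}_\ell)^T\phi(\mathbf{x}_i)]^T$.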

Estimating the Separability of a Problem

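In the SVM algorithm, the vector of $[K(\mathbf{x}_1,\mathbf{x}_i), \dots, K(\mathbf{x}_\ell,\mathbf{x}_i)]^T$, where $K(\mathbf{x}_j,\mathbf{x}_i) = \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)$, constitutes the $i$-th column of the input kernel matrix. The measurements in the $i$-th column can be separated into two populations, according to the class label of sample $\mathbf{x}_j$, and respectively denoted as $K_i^+$ and $K_i^-$. Performing the following test to the two populations yields a score which measures the separability of $\mathbf{x}_i$: (12) where $\mu_i^+$ ($\mu_i^-$) and $\sigma_i^+$ ($\sigma_i^-$) are the mean and the standard deviation of $K_i^+$ ($K_i^-$).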

We justify the introduction of standard deviations in the denominator by considering two positive training samples in the feature space. The first sample is assumed to have come from a region of denser population than the second one. We reckon that, compared with the second sample, the first sample is a more typical representative of class ''+'' and is believed to be more similar to class ''+''. The positive sample from a denser population is expected to exhibit a lower deviation among its similarity measurements to the positive training samples. Thus, the standard deviation is formulated into Equation (12), demonstrating our confidence in a higher separability of a sample from a denser population.

The separability scores can be split into three types: large positive ones, small positive ones and negative ones. A large positive score implies that the training sample $\mathbf{x}_i$ is likely to be discriminant, statistically bearing more resemblance to the supposed class than to the other one. A small positive score suggests that $\mathbf{x}_i$ might bear almost the same level of resemblance to both classes and is thus hard to classify. For a negative score, the class that $\mathbf{x}_i$ is computed as more similar to is different from the actual one, which poses difficulties for the SVM classifier.

Meanwhile, the similarity between each sample and itself is supposed to be $1$. However, not all kernel functions satisfy $K(\mathbf{x}_i,\mathbf{x}_i) = 1$. Consequently a proper preprocessing procedure might be required prior to the application of Equation (12), depending on the kernel in use. For linear kernels, we divide each element of the $i$-th column of the kernel matrix by $\mathbf{x}_i^T\mathbf{x}_i$. For Gaussian RBF kernels [29] which take the form of

$$K(\mathbf{x}_i,\mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right) \qquad (13)$$

it already holds that $K(\mathbf{x}_i,\mathbf{x}_i) = 1$ and the preprocessing step is avoided. However, the value of the parameter $\gamma$ is required to be optimized.

Since the separability of each sample has an impact on the class separability of a problem, we propose to use the sum of each sample's separability score as an estimate of the separability of the problem.

A word about the formulation of Equation (12) is in order. In statistics, it is common practice to add a small constant to the sum of variances, in order to guard against zero in the denominator. But for our algorithms, the designation of $K(\mathbf{x}_i,\mathbf{x}_i) = 1$ prevents the occurrence of zero in the denominator of Equation (12). We explain how it is achieved for linear kernels and Gaussian RBF kernels:

In the case of linear kernels, take a positive sample for example. Since , in order to have for Equation (12), it demands that for any whose is a positive sample. This requires that . This set of conditions can only be satisfied either when or which suggests that the training set includes only one positive sample. We reckon that either case is unlikely for well-posed classification problems.
In the case of Gaussian RBF kernels, in order to have given a positive sample, it has to be met that for any whose is a positive sample. This in fact implies that either which is hardly true with real-life microarray datasets, or the parameter has been assigned a value of zero, which can be easily avoided.
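The per-sample score and its sum over samples can be sketched as follows. The exact expression of Equation (12) is not reproduced here, so the score below uses one plausible $t$-like form, the signed difference of the two column means divided by the sum of the two column standard deviations; that form, the use of the sample standard deviation (`ddof=1`), the kernel parameterisation `exp(-gamma * d^2)` and all function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Gaussian RBF kernel matrix, assuming the form exp(-gamma * ||xi - xj||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def normalize_columns(K):
    """Divide the i-th column by K[i, i] so that every self-similarity equals 1."""
    return K / np.diag(K)[None, :]

def sample_separability(K, y):
    """One score per column of K (ASSUMED t-like form, see lead-in)."""
    pos, neg = (y == 1), (y == -1)
    mu_p, mu_n = K[pos].mean(axis=0), K[neg].mean(axis=0)
    sd_p = K[pos].std(axis=0, ddof=1)
    sd_n = K[neg].std(axis=0, ddof=1)
    return y * (mu_p - mu_n) / (sd_p + sd_n)

def problem_separability(K, y):
    """Sum of the per-sample scores, used as the separability of the problem."""
    return float(np.sum(sample_separability(K, y)))

# Toy one-dimensional data (assumed): two tight, well-separated clusters.
X = np.array([[0.0], [0.2], [0.1], [3.0], [3.1], [2.9]])
y_true = np.array([1, 1, 1, -1, -1, -1])
y_mixed = np.array([1, -1, 1, -1, 1, -1])   # labels that ignore the clusters

K = rbf_kernel_matrix(X, gamma=1.0)
scores = sample_separability(K, y_true)
easy = problem_separability(K, y_true)
hard = problem_separability(K, y_mixed)

# Linear kernel: the diagonal is not 1 until the columns are normalised.
X_lin = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
K_lin = normalize_columns(X_lin @ X_lin.T)
```

With labels that follow the clusters, every score is positive and the problem separability is far larger than with the mixed labels, which matches the reading of large positive scores as discriminant samples.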

Kernel Matrix Induced Gene Selection Algorithms

Since each gene subset introduces a classification problem represented by the set of training samples, the gene subset corresponds to an estimate of the separability of the problem. Consequently, DEGs can be identified as those resulting in ''easier problems'' of high separability. This is the essential idea of our kernel matrix induced gene selection methods, which is illustrated in Figure 2. This methodology is shared by the two gene selection algorithms proposed below. The first algorithm, namely the Kernel Matrix Gene Selection (KMGS), ranks each gene individually, while the second one, namely the Kernel Matrix Sequential Forward Selection (KMSFS), identifies DEGs iteratively.

Figure 2. The essential idea of kernel matrix induced gene selection algorithms.

https://doi.org/10.1371/journal.pone.0081683.g002

Kernel Matrix Gene Selection.

Given a microarray dataset of $\ell$ samples with $n$ genes, the $k$-th gene of the samples forms a vector. The vector, in fact, establishes a training set for the following classification problem:

$$\{(x_{ik}, y_i)\}_{i=1}^{\ell} \qquad (14)$$

where $x_{ik}$ is the value of the $k$-th gene for the $i$-th sample and $y_i$ is its given class label. Given the training set, the separability of each sample, denoted as $s_i^{(k)}$, can be assessed using Equation (12). The class separability of the problem constructed from the $k$-th gene can thus be computed:

$$J(k) = \sum_{i=1}^{\ell} s_i^{(k)} \qquad (15)$$

The reason behind using Equation (15) is that the class separability of a problem is reflected by the sum of the separability of each sample.

Hence the function $J$ maps a gene to the separability level, in the context of sample-based classifiers including the SVM.

The genes are ranked according to their respective value $J(k)$ where $k = 1, \dots, n$. Genes achieving a large $J(k)$ obtain higher rankings.

Kernel Matrix Sequential Forward Selection.

An alternative to the KMGS which proceeds in a greedy manner is also developed, namely the Kernel Matrix Sequential Forward Selection (KMSFS) algorithm. The algorithm starts with an empty set of selected DEGs. At each iteration, the algorithm identifies a single DEG which is then appended to the set. We now describe how the KMSFS algorithm proceeds between two consecutive iterations.

Given a microarray dataset of samples with genes, at the -th iteration, genes have been collected into the set of DEGs. This in fact stands for a classification problem with the training set composed of samples, each of which is of dimensions: (16)

Each gene from the remaining genes is, in turn, appended to these genes and forms a different classification problem with a training set of samples, each of which is of dimensions. This results in, altogether, data matrices of size which are actually the training sets for classification problems. The complexity of each problem can be estimated and interpreted as the significance of the associated -th gene. The -th DEG is eventually identified to be the one which produces the problem featuring the highest separability.

The pseudocode for the KMGS and KMSFS algorithms is given in Table 1 and Table 2, respectively.
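A compact sketch of both procedures, as described in the text, is given here. It is not the authors' pseudocode: the per-sample score uses an assumed $t$-like form (signed difference of column means over the sum of column standard deviations), the kernel is an assumed Gaussian RBF `exp(-gamma * d^2)`, and the names `separability`, `kmgs` and `kmsfs` as well as the synthetic data are hypothetical.

```python
import numpy as np

def separability(X_sub, y, gamma=1.0):
    """Separability of the problem defined by the gene columns in X_sub
    (RBF kernel, ASSUMED t-like per-sample score summed over samples)."""
    sq = np.sum(X_sub ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X_sub @ X_sub.T), 0.0)
    K = np.exp(-gamma * d2)
    pos, neg = (y == 1), (y == -1)
    diff = K[pos].mean(axis=0) - K[neg].mean(axis=0)
    spread = K[pos].std(axis=0, ddof=1) + K[neg].std(axis=0, ddof=1)
    return float(np.sum(y * diff / spread))

def kmgs(X, y, gamma=1.0):
    """Score every gene on its own and rank genes by decreasing separability."""
    scores = np.array([separability(X[:, [k]], y, gamma)
                       for k in range(X.shape[1])])
    return np.argsort(-scores), scores

def kmsfs(X, y, n_select, gamma=1.0):
    """Greedy forward selection: at each step, add the gene whose inclusion
    yields the most separable problem."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best = max(remaining,
                   key=lambda k: separability(X[:, selected + [k]], y, gamma))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumed): 20 samples, 5 genes, only gene 2 is informative.
rng = np.random.default_rng(0)
y = np.array([1] * 10 + [-1] * 10)
X = rng.normal(size=(20, 5))
X[:, 2] = 3.0 * y + 0.3 * rng.normal(size=20)

ranking, scores = kmgs(X, y)
chosen = kmsfs(X, y, n_select=2)
```

Neither routine trains a classifier; each candidate gene subset costs one kernel matrix and a handful of column statistics, which is where the saving over wrapper methods comes from.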

Table 1. The Algorithm of Kernel Matrix Gene Selection.

https://doi.org/10.1371/journal.pone.0081683.t001

Table 2. The Algorithm of Kernel Matrix Sequential Forward Selection.

https://doi.org/10.1371/journal.pone.0081683.t002

Merits of Proposed Algorithms

The proposed methods have noticeable merits:

1. Filter methods identify discriminant features, making them suitable for feature-based classifiers whose normal vector is the linear combination of features. However, Equation (3) demonstrates that the SVM classifier is the linear combination of training samples in the feature space. Thus the performance of the SVM comes down more to the discriminant levels of samples than to those of features. Since discriminant features selected by filter methods are not guaranteed to generate a training set composed of discriminant samples, the resultant classifier cannot be ensured to be optimally accurate either. In contrast, our algorithms aim at unveiling the information regarding discriminant levels of samples using the kernel function. Our algorithms are developed upon the fundamental ideas of SVMs and are thus more likely to produce a classifier of a higher degree of accuracy.
2. A majority of wrapper and embedded methods are based on the assumption that most microarray datasets pose linear problems. However, we reckon that the problem presented by a set of DEGs can hardly be a linear one when the set size is as small as only one or two.

But the generalization to nonlinear cases has been challenging for various wrappers and embedded methods. For example, the SVM-RFE [27] keeps unchanged the Lagrangian multipliers $\alpha_i$'s from the previous iteration and then selects the gene which makes the least change to the dual objective function. The strategy of fixing the $\alpha_i$'s is likely to compromise the significance evaluation of each gene, as well as the generalization abilities of the resultant SVM classifier.

Advantageously, our algorithms can be directly applied to nonlinear cases by opting for Gaussian RBF kernels. The Gaussian RBF kernel $K(\mathbf{x}_i,\mathbf{x}_j)$, according to Mercer's conditions, is an inner product of $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ in the feature space:

$$K(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \qquad (17)$$

where $\phi$ is the function mapping a sample from the input space to the feature space. Thus, for the nonlinear case, the Gaussian kernel matrix is still composed of similarity measurements between training samples.

3. The output of Equation (15), which measures the significance of genes, is a real-valued number rather than an integer. This avoids the ties problem [40] which often occurs with count-based wrapper methods, including the one using the leave-one-out cross validation error as the selection criterion [21].

Datasets and Data Preprocessing

Prostate dataset.

The dataset contains, in total, 136 samples of two types which respectively have 77 and 59 cases. Each sample includes expression values of 12600 genes.

Colon dataset.

The dataset contains the expression values of 2000 genes from 62 tissues, of which 22 are normal and 40 are cancerous.

Leukaemia dataset.

The dataset was collected from 72 patients, of whom 47 were diagnosed with acute lymphoblastic leukemia (ALL) and 25 with acute myeloid leukemia (AML). Expression values of 7129 genes were measured.

Both the prostate dataset and the colon dataset were normalized using the following procedure. A microarray dataset with samples and genes was arranged as a matrix of rows and columns. Each row of the matrix was standardized so that the mean and the standard deviation for the row vector are respectively zero and unity. Next, each column of the resultant matrix was standardized to have zero mean and unity standard deviation. No further processing was conducted. All the simulations and comparisons have been performed on the standardized data.
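The two-pass standardization can be sketched in a few lines. The orientation of the matrix (which axis holds samples) and the use of the population standard deviation (`ddof=0`) are not stated above and are assumptions here; the function name and the random input are illustrative only.

```python
import numpy as np

def standardize_rows_then_columns(M):
    """Standardize every row to zero mean and unit standard deviation,
    then standardize every column of the result in the same way."""
    M = np.asarray(M, dtype=float)
    rows = (M - M.mean(axis=1, keepdims=True)) / M.std(axis=1, keepdims=True)
    return (rows - rows.mean(axis=0, keepdims=True)) / rows.std(axis=0, keepdims=True)

# Illustrative input (assumed): a small random expression matrix.
rng = np.random.default_rng(1)
raw = rng.normal(loc=5.0, scale=2.0, size=(6, 4))
Z = standardize_rows_then_columns(raw)
```

Because the column pass comes last, only the columns of `Z` are exactly standardized; the rows are in general no longer exactly zero-mean and unit-variance after the second pass.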

For the leukemia dataset, we applied the pre-processing procedure proposed by Dudoit et al. [41] which consisted of (i) thresholding (floor of 100 and ceiling of 16000), (ii) filtering (exclusion of genes with max/min $\le 5$ and max$-$min $\le 500$ across the samples), (iii) base 10 logarithmic transformation, leaving us with 3571 genes. Next, we applied Fisher's ratio and selected the 1000 top DEGs. For each individual gene, Fisher's ratio assigns it a score using the function $(\mu^+ - \mu^-)^2 / \left((\sigma^+)^2 + (\sigma^-)^2\right)$ where $\mu^+$ ($\mu^-$) and $\sigma^+$ ($\sigma^-$) are respectively the mean and the standard deviation across samples from the positive (negative) class. This preprocessing strategy, which was also employed by [21], makes possible a fairer comparison between our experimental results and those reported in [21]. All the simulations and comparisons regarding the leukemia dataset have been performed on the preprocessed and pre-selected data.
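Fisher's ratio scoring can be illustrated as below, assuming the common form in which the squared difference of the class means is divided by the sum of the class variances. The sample variance (`ddof=1`), the function name and the toy matrix are assumptions for illustration; samples are taken to be rows and genes columns.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-gene Fisher's ratio: (mean_pos - mean_neg)^2 / (var_pos + var_neg)."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / den

# Toy data (assumed): gene 0 separates the classes, gene 1 does not.
X = np.array([[5.0, 1.0], [6.0, 2.0], [-5.0, 2.0], [-6.0, 1.0]])
y = np.array([1, 1, -1, -1])

ratios = fisher_ratio(X, y)
top = np.argsort(-ratios)   # genes ordered from most to least discriminant
```

Keeping the first 1000 entries of `top` would correspond to the pre-selection step described above.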

Error Rate Estimation Techniques

Various gene selection algorithms are evaluated and compared by the error rate of SVMs. The simplest technique for error estimation is the holdout method which splits the dataset into a training set and a test set. The gene selection algorithm is performed on the training set and sample classification on the test set. However, the holdout method has been highly discouraged for microarray datasets which usually contain a small number of samples. In contrast to researchers who applied gene selection to the entire training set and employed $k$-fold cross validation to assess the selected DEGs, Ambroise and McLachlan [42] emphasized that samples used for validation must be excluded from the gene selection procedure and labelled techniques that follow their recommendation as ''external'' ones. They suggested that external 10-fold cross validation and external B.632+ bootstrap could produce unbiased estimates [42], [43]. Due to the problem of high variance with cross validation techniques when applied to microarray datasets [44], we used the external B.632+ estimator for the comparison of gene selection algorithms.

The B.632+ estimator involves resampling, with replacement, of the original dataset. From a dataset of samples denoted as , a single sample is randomly drawn and then put back at each time. This process is repeated times, leading to a new set which is denoted as . The resampled set includes, with probability, duplicates of a sample from the original set . The number of duplicates for a sample included in ranges from 0 to . The set of is used for both gene selection and training a SVM. The SVM classifier is then tested on the set of . A good error estimator requires the generation of resampled sets which are denoted as where it was recommended that . We set for all our experiments. Meanwhile, for each sample , its overall number of occurrences in the resampled sets is ensured to be , which further reduces the variance.
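The resampling step can be sketched as follows. The text states that the total number of occurrences of each sample across the resampled sets is fixed; the sketch assumes this means a balanced bootstrap in which every sample appears exactly as many times as there are resampled sets. That reading, the function names and the small sizes used are assumptions; the B.632+ error formula itself is not shown.

```python
import numpy as np

def balanced_bootstrap(n_samples, n_sets, rng):
    """Return an (n_sets, n_samples) array of indices in which every sample
    index occurs exactly n_sets times overall (ASSUMED balancing scheme)."""
    pool = np.tile(np.arange(n_samples), n_sets)   # n_sets copies of each index
    rng.shuffle(pool)                              # in-place random permutation
    return pool.reshape(n_sets, n_samples)

def out_of_bag(indices, n_samples):
    """Samples never drawn into a resampled set; they form its test set."""
    return np.setdiff1d(np.arange(n_samples), indices)

rng = np.random.default_rng(42)
sets = balanced_bootstrap(n_samples=8, n_sets=5, rng=rng)
counts = np.bincount(sets.ravel(), minlength=8)
test_sets = [out_of_bag(row, 8) for row in sets]
```

Gene selection and SVM training would then use the samples indexed by each row of `sets`, and the matching entry of `test_sets` would be used for testing, so that test samples never take part in gene selection.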

The flow chart for evaluating a gene selection algorithm using the B.632+ technique is given in Figure 3.

Figure 3. The flow chart for evaluating a gene selection algorithm using the B.632+ technique.

https://doi.org/10.1371/journal.pone.0081683.g003

Gene Selection Algorithms

For the methods of KMSFS and KMGS, both the linear kernel and the Gaussian RBF kernel were tested. This resulted in altogether four algorithms which are referred to as Gaussian KMSFS, Gaussian KMGS, linear KMSFS and linear KMGS. They were compared with two wrapper methods: the leave-one-out calculation sequential forward selection (LOOSFS) which improved the least-squares bound measure [21] by easing the ties problem, and the gradient-based leave-one-out gene selection (GLGS) method [22] which was claimed to outperform the SVM-RFE algorithm [26]. Comparisons were also made to a number of filter methods, including the aforementioned Fisher's ratio [45], Cho's [46] and two other methods of Yang's [47]. We describe the ideas of these gene selection algorithms below.

Leave-One-Out Calculation Sequential Forward Selection (LOOSFS).

The Leave-One-Out Cross-Validation (LOOCV) error has been generally used for measuring the generalization abilities of SVMs and Least-Squares SVMs (LS-SVMs). The LOOSFS method thus identifies as DEGs those genes which result in an LS-SVM classifier with the minimal LOOCV error rate. The beauty of the algorithm lies in the efficient and exact computation of the LOOCV error. To address the ''ties problem'' in which multiple gene subsets achieve the same LOOCV error rate, a further selection criterion is imposed which favors the gene subset with minimal empirical risk.

Gradient-Based Leave-One-Out Gene Selection (GLGS).

The starting point of the GLGS method is also the employment of the exact formulation of the LOOCV error for LS-SVMs. The method then utilizes the gradient descent algorithm to seek a diagonal matrix which eventually minimizes the LOOCV error. Genes are ranked according to the absolute values of the diagonal elements of the diagonal matrix.

With Cho's method and Yang's methods, genes are individually ranked. We use the following notations for their descriptions. Each microarray dataset with samples and genes is treated as a matrix, denoted as , where indicates the expression value of gene for sample . Given a -class problem, the average expression value of each class, in terms of gene , can be computed and denoted as the set of . Denote the standard deviation for the set as . A matrix, denoted as is also introduced, where .

Cho's Method.

The score, denoted as , that gene obtains eventually is: (18)where

(19)(20)

is the reciprocal of the number of samples that share the same class label as sample and . A small value of indicates that samples of the -th gene are clustered around the centroid of each class.

Yang's Methods.

The between-class variation with respect to gene , denoted as scatter, is formulated as: (21)

In order to estimate within-class variations in terms of gene , a function is first introduced: (22)where is a function of which are composed of the elements from the -th column of the matrix and associated with the -th class. Denote as the mean of of . can be either the squared or the mean of , which results in two forms of which are referred to as and respectively.

Two metrics for measuring within-class variations, denoted as compact and compact which are derived respectively upon and , are proposed: (23)where is the mean of the set of where .

Eventually, two score functions which decide the ranking of gene , are given:

= , p = 1,2

We refer to these two score functions respectively as Yang's method 1 and Yang's method 2.

Cho's method and Yang's two methods identify as DEGs those genes whose associated scores are smaller. For all the gene selection algorithms, the test subset must be excluded from the gene selection procedure each time in order to obtain an unbiased evaluation [42], [43]. Each gene selection algorithm terminated once a specified number of DEGs had been identified, and we set this number to 100.
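The requirement to exclude the test subset from gene selection can be sketched as follows. In each bootstrap replicate, genes are ranked on the in-bag samples only and the classifier is then scored on the out-of-bag samples. Several simplifications are assumed for brevity and are not the procedure used in the experiments: the ranking uses the common two-class form of Fisher's ratio (where larger scores are better), a nearest-centroid classifier replaces the linear SVM, and the plain out-of-bag error is reported rather than the full B.632+ estimate. All function names are hypothetical.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-gene score: squared gap between class means over the sum of class variances."""
    pos, neg = X[y == 1], X[y == -1]
    gap = pos.mean(axis=0) - neg.mean(axis=0)
    return gap ** 2 / (pos.var(axis=0) + neg.var(axis=0) + 1e-12)

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (assumption): assign each test sample to the closer class mean."""
    mu_pos = X_train[y_train == 1].mean(axis=0)
    mu_neg = X_train[y_train == -1].mean(axis=0)
    d_pos = np.linalg.norm(X_test - mu_pos, axis=1)
    d_neg = np.linalg.norm(X_test - mu_neg, axis=1)
    return np.where(d_pos <= d_neg, 1, -1)

def oob_error_with_inner_selection(X, y, n_genes, n_boot=50, seed=0):
    """Average out-of-bag error, with genes re-selected inside every replicate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), in_bag)
        # Skip degenerate replicates with no test samples or a missing class.
        if oob.size == 0 or len(np.unique(y[in_bag])) < 2:
            continue
        # Gene ranking sees the in-bag samples only, never the out-of-bag ones.
        top = np.argsort(fisher_ratio(X[in_bag], y[in_bag]))[::-1][:n_genes]
        pred = nearest_centroid_predict(X[in_bag][:, top], y[in_bag], X[oob][:, top])
        errors.append(np.mean(pred != y[oob]))
    return float(np.mean(errors))
```

Ranking the genes once on the full dataset and only then resampling would let information from the test samples leak into the selected subset, which is the bias that [42], [43] warn against.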

Parameter Tuning

We employed grid search and Friedman rank sum tests combined with Holm correction to tune the parameters for different algorithms.

Grid Search.

Among the 10 gene selection algorithms, both Gaussian KMSFS and Gaussian KMGS require the parameter in Equation (13) to be optimized. For the LOOSFS algorithm, the regularization parameter, denoted here as for consistency, has to be tuned. was varied sequentially from to in multiples of 10, which made up a total of 11 different values. With respect to sample classification, the linear SVM was used throughout. Its regularization parameter, denoted as , ranged from to in multiples of 10, which gave 7 different values. Thus, 77 value pairs for were tested for the Gaussian KMSFS, Gaussian KMGS and LOOSFS algorithms, while 7 different values of were evaluated for the remaining gene selection algorithms.
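The construction of such a grid can be sketched as below. The exponent ranges in the example are placeholders chosen only so that the two grids contain 11 and 7 values; they are assumptions for illustration and should not be read as the ranges used in the experiments. The helper name `log10_grid` is likewise hypothetical.

```python
from itertools import product

def log10_grid(lo_exp, hi_exp):
    """Values 10**lo_exp, ..., 10**hi_exp, stepping in multiples of 10."""
    return [10.0 ** e for e in range(lo_exp, hi_exp + 1)]

# Placeholder endpoints (assumptions, not taken from the paper).
selection_params = log10_grid(-5, 5)   # 11 values for the gene selection parameter
svm_costs = log10_grid(-3, 3)          # 7 values for the linear SVM regularization parameter

# Every (selection parameter, SVM parameter) pair to be evaluated.
pairs = list(product(selection_params, svm_costs))
```

Algorithms without a tunable selection parameter would iterate over `svm_costs` alone.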

Friedman Rank Sum Test with Holm Correction.

The Friedman rank sum test is a non-parametric alternative to ANOVA with repeated measures. The test statistic for the Friedman test approximately follows a Chi-square distribution whose number of degrees of freedom equals the number of repeated measures minus one.
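A minimal sketch of the Friedman statistic is given below, assuming a score table with one row per block and one column per repeated measure, lower scores being better, and no ties within a row. The function name and the array layout are assumptions of the example, and the simple double-argsort ranking would need to be replaced by average ranks when ties occur.

```python
import numpy as np

def friedman_statistic(scores):
    """Return (statistic, degrees of freedom, mean rank per column).

    scores: array of shape (n_blocks, k); rank 1 goes to the lowest score in a row.
    Assumes no tied scores within a row (no tie correction is applied).
    """
    scores = np.asarray(scores, dtype=float)
    n_blocks, k = scores.shape
    # Rank the k columns separately within each row.
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    rank_sums = ranks.sum(axis=0)
    stat = 12.0 / (n_blocks * k * (k + 1)) * np.sum(rank_sums ** 2) - 3.0 * n_blocks * (k + 1)
    return float(stat), k - 1, ranks.mean(axis=0)
```

The statistic is then compared with the Chi-square distribution having the returned number of degrees of freedom. In practice a library routine such as `scipy.stats.friedmanchisquare` in Python or `friedman.test` in R also handles ties and returns the p-value directly.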

We take the Gaussian KMSFS algorithm as an example to explain how the Friedman test was applied to discover the optimal values of . As mentioned previously, 100 DEGs were selected, each of which gave rise to a new classification problem. We thus obtained altogether 100 B.632+ error rates for each setting of . As we tried 77 settings for the parameter pair, 77 groups of error rates were obtained.

The Friedman rank sum test was used to detect statistical differences among these 77 groups. The test was based on 100 sets of ranks, with each set corresponding to an individual classification problem. The performances of the different parameter settings were ranked separately for each problem. If the null hypothesis that all 77 settings led to equal performance in mean ranking was rejected, we employed the Holm post-hoc analysis to identify which setting was significantly better than the rest.
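The Holm step-down correction used in the post-hoc analysis can be sketched as follows. The sketch assumes that a list of unadjusted p-values is already available, for instance from comparing the best-ranked setting against each of the others; how those p-values were produced in the experiments is not specified here, and the function name is hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Visit the hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so that adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

A comparison is declared significant when its adjusted p-value falls below the chosen significance level. The same adjustment is available in R as `p.adjust(p, method = "holm")`.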

All the gene selection methods were coded in Matlab. The linear SVM was implemented using LIBSVM [48] and the Friedman test with Holm correction was coded in R. The experiments were run on a computer with an Intel Core i5-2320 quad-core processor at 3.0 GHz, 4 GB of memory and the Windows 7 operating system.

Results

Results on the Prostate Dataset

Optimal Parameter Settings.

Using Friedman tests with Holm correction, optimal settings on for Gaussian KMGS, Gaussian KMSFS and LOOSFS were found to be , and respectively.

GLGS and linear KMSFS shared the optimal setting of . For linear KMGS and all the filter methods, namely Fisher's ratio, Cho's method and Yang's two methods, the optimal parameter settings were uniformly .

Comparisons against Wrappers.

With a minimal error rate of and a mean error rate of , the performance of GLGS was much worse than that of the other 9 methods. Thus its simulation results were not graphically presented.

Figure 4 illustrates the B.632+ error rates as a function of the number of DEGs for the algorithms of Gaussian KMGS, Gaussian KMSFS, LOOSFS, linear KMSFS and linear KMGS. It can be seen that, when the number of DEGs fell between 10 and around 60, linear KMSFS, which is represented by the green solid line dotted with upper triangles, remained the best. As the number of DEGs further increased, Gaussian KMGS outperformed the rest and achieved the lowest B.632+ error rate.

Figure 4. The B.632+ error shown as a function of the number of DEGs for the prostate dataset.

The curves depict the performance of the following wrapper methods with their respective optimal parameter settings: Gaussian KMGS with and ; Gaussian KMSFS with and ; LOOSFS with and ; linear KMSFS with ; linear KMGS with . The performance of linear KMSFS was the best when the number of DEGs was between 10 and 60, while Gaussian KMGS outperformed the rest when the number of DEGs increased further to 100. The lowest B.632+ rate was achieved by Gaussian KMGS.

https://doi.org/10.1371/journal.pone.0081683.g004

As shown by Figure 4, the classical LOOSFS was outperformed by our algorithms including the Gaussian KMGS, linear KMSFS and linear KMGS. When the number of DEGs ranged between 10 and 20, Gaussian KMSFS also performed better than LOOSFS.

Comparisons against Filters.

Figure 5 compares linear KMSFS and Gaussian KMGS against the filter methods of Fisher's ratio, Cho's method and Yang's two methods.

Figure 5. The B.632+ error shown as a function of the number of DEGs for the prostate dataset.

The curves were obtained from the following algorithms with their respective optimal parameter settings: Fisher's ratio with ; both of Yang's methods with ; Cho's method with ; linear KMSFS with ; Gaussian KMGS with and . Linear KMSFS and Gaussian KMGS performed better than the 4 filter methods.

https://doi.org/10.1371/journal.pone.0081683.g005

The error rate of linear KMSFS remained noticeably lower than those of the filter methods when the number of DEGs fell between 10 and 60. When the number of DEGs grew larger, Gaussian KMGS showed better performance than the four filter methods.

The performance of Gaussian KMGS and linear KMSFS remained competitive with that of the filter methods when the number of DEGs fell within the ranges of and , respectively.

These comparisons lead to the conclusion that linear KMSFS and Gaussian KMGS are the two best methods for the prostate dataset.