Nearest shrunken centroids via alternative genewise shrinkages

Nearest shrunken centroids (NSC) is a popular classification method for microarray data. NSC calculates centroids for each class and “shrinks” the centroids toward 0 using soft thresholding. Future observations are then assigned to the class with the minimum distance between the observation and the (shrunken) centroid. Under certain conditions the soft shrinkage used by NSC is equivalent to a LASSO penalty. However, this penalty can produce biased estimates when the true coefficients are large. In addition, NSC ignores the fact that multiple measures of the same gene are likely to be related to one another. We consider several alternative genewise shrinkage methods to address the aforementioned shortcomings of NSC. Three alternative penalties were considered: the smoothly clipped absolute deviation (SCAD), the adaptive LASSO (ADA), and the minimax concave penalty (MCP). We also showed that NSC can be performed in a genewise manner. Classification methods were derived for each alternative shrinkage method or alternative genewise penalty, and the performance of each new classification method was compared with that of conventional NSC on several simulated and real microarray data sets. Moreover, we applied the geometric mean approach for the alternative penalty functions. In general the alternative (genewise) penalties required fewer genes than NSC. The geometric mean of the class-specific prediction accuracies was improved, as well as the overall predictive accuracy in some cases. These results indicate that these alternative penalties should be considered when using NSC.


Introduction
Nearest shrunken centroids (NSC) is one of the most frequently used classification methods for high-dimensional data such as microarray data [1,2]. NSC shrinks the average expression (i.e., centroid) of each gene within each class toward the overall centroid via soft thresholding. Genes whose expression levels do not significantly differ between the classes will have their centroids reduced to the overall centroids, effectively removing them from the classification procedure. The amount of shrinkage is determined by cross validation. Then class prediction a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 is performed using the shrunken centroids, which allows one to identify important genes and predict the class of unlabeled observations. Wang and Zhu [3] showed that NSC is the solution to the regression problem that estimates the class centroids subject to an L 1 penalty (i.e., LASSO) of Tibshirani [4]. They observed that the LASSO penalty applies the same penalties to all centroids, but the centroids for the same gene should be treated as one group. To overcome this problem, they proposed two NSC methods using different penalties: adaptive L 1 -norm penalized NSC (ALP-NSC) and adaptive hierarchically penalized NSC (AHP-NSC). They showed that the two NSC methods have better performance than the original NSC in terms of misclassification error rate and the number of variables with nonzero centroids. However, ALP-NSC requires an exhaustive search to find an index set satisfying certain condition. If no such indices exist, quadratic programming must be employed to estimate the parameters. AHP-NSC requires an iterative procedure to estimate the parameters, and this increases the computational burden as the number of genes increases.
While Wang and Zhu [3] sought to improve NSC by considering the correlation between the centroids for the same gene, Guo et al. [5] improved NSC by regularizing the covariance matrix of genes in addition to shrinking the class centroids. In fact, Guo et al. [5] modified the classical linear discriminant score, not the diagonal linear discriminant score, and thus the method of Guo et al. [5] is a generalized version of NSC. Pang et al. [6] proposed an improved diagonal linear discriminant analysis (LDA) through shrinkage and regularization of the variances, but their method dose not perform variable selection. Several authors proposed new types of sparse LDA and provided the related optimality conditions and asymptotic properties. Shao et al. [7] applied the thresholding methodology, which was developed for function estimation, to the estimation of the means and variances, and Mai et al. [8] used the least squares formulation of LDA.
Another way to improve NSC is to modify the way to select an optimal threshold as in Blagus and Lusa [9]. They improved NSC in class-imbalanced data by selecting the optimal threshold as the value that maximizes the geometric mean of the class-specific prediction accuracies. Their numerical studies showed that the modified NSC improved the prediction accuracy of the minority class and area under the curve (AUC), and even the average prediction accuracy of entire classes for some real data.
In this article, we proposed the methods that improve NSC through alternative shrinkage of the class centroids. Like Wang and Zhu [3], we used an additional parameter, which controls the amount of penalization given to the parameters for our methods. These alternative shrinkages were derived from three existing alternative penalized regression methods, namely the smoothly clipped absolute deviation (SCAD) [10], the adaptive LASSO (ADA) [11], and the minimax concave penalty (MCP) [12], which are known to outperform LASSO regression in some situations. They enjoy the oracle property, which means that the efficiency of these estimators is not reduced when the subset of variables with nonzero coefficients is unknown. As noted earlier, under an orthonormal design (such as the case of NSC), the LASSO solution can be obtained via soft thresholding. Similarly, these three regression methods also have simple solutions in the NSC setting, so the computation is easy and fast. While the LASSO solution yields biased estimates for large coefficients, these methods produce unbiased estimates. Several researchers have considered the use of the alternative shrinkage methods in place of soft shrinkage [2,13,14]. In this article, we will evaluate the performances of these alternative shrinkage methods by comparing them with conventional soft shrinkage systematically through simulation and real data studies.
Blockwise additive penalties, which were discussed in Antoniadis and Fan [15], were shown to give alternative genewise shrinkage estimators of the class centroids in the NSC setting, where the block is the gene. Similar to the methods of Wang and Zhu [3], these estimators use the fact that the centroids from the same gene should be treated as a group, but they are less computationally intensive than those of Wang and Zhu [3] because an iterative procedure is not involved. The approach of Blagus and Lusa [9] was also applied for our alternative (genewise) penalties to further improve NSC, especially for class-imbalanced data.
In the Methods section, we described the penalized least squares framework for general shrinkage methods using the model of Wang and Zhu [3], which includes the special case of NSC. We examined the performance of NSC with alternative penalty functions (ALT-NSC), which include the SCAD, the adaptive LASSO and the MCP. We also described how ALT-NSC can be used for genewise inference (GEN-NSC). In the Simulation section, we conducted simulation studies and showed that ALT-NSC and GEN-NSC have substantially better performance than NSC in terms of predictive accuracy and feature selection in data sets with multiple classes. In the Real Data Study section, we applied the proposed variants of NSC to several real microarray data sets. A discussion and concluding remarks are provided in the last two sections.

Penalized least squares for the nearest shrunken centroids
Adapting the framework of Wang and Zhu [3], let x ij be the gene expression for the jth gene of the ith sample (j = 1, . . ., p; i = 1, . . ., n). There are K classes and each sample i belongs to one of K classes, that is i 2 C k , where C k is the set of sample indexes belonging to class k 2 {1, . . ., K}, and n k is the number of samples for class k. The average expressions for the jth gene in the kth class and over the entire data set are " where s j is the pooled within-class standard deviation for the jth gene: and m k ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 1=n k À 1=n p . Alternatively, s j + s 0 can be used instead of s j to prevent the genes with low expression levels from having large m 0 kj values by chance due to very small s j values, where s 0 is a small constant. The statistic m 0 kj is equivalent to d kj , in Tibshirani et al. [1,2], which is a t-statistic for the jth gene comparing class k to the average of the other classes.
Let y ij ¼ ðx ij À " x j Þ=ðm k s j Þ and consider the following linear model: where z ik = 1 if sample i belongs to the class k, and 0 otherwise, μ kj is a parameter to be estimated and ε ij is an independent error term that has variance 1=m 2 k if sample i belongs to class k. For a fixed gene index j, μ kj is a deviation from the overall mean, so we have the constraint that P K k¼1 m kj ¼ 0. The class index to which sample i belongs is denoted by k(i) 2 {1, . . ., K}. By multiplying 1= ffiffiffiffiffiffiffi ffi n kðiÞ p to both sides in Eq (1), we have where y Ã ij ¼ y ij = ffiffiffiffiffiffiffi ffi n kðiÞ p and ε Ã ij ¼ ε ij = ffiffiffiffiffiffiffi ffi n kðiÞ p . In vector notation, Eq (2) can be written as  samples belonging to class k(i). This implies that W T l W l ¼ 1 ðl ¼ 1; :::; KpÞ. In addition, we can see that each row of W has only one non-zero value and the rest of the elements are zero. This is because each y ij takes only one μ kj , and this implies W T l W h ¼ 0 for (l 6 ¼ h). Thus, W is orthonormal. Note that μ 0 = W T y Ã is the least squares estimator for μ, and letŷ Ã ¼ Wμ 0 . Since W is orthonormal, a form of the penalized least squares is given by [10]: where p λ (Á) = λp(Á) is a penalty function and k A k 2 ¼ P n i¼ a 2 i when A = (a 1 , . . ., a n ) T . The problem of minimizing Eq (3) with respect to μ kj is equivalent to minimizing it componentwise. By ignoring 1 2 k y Ã Àŷ Ã k 2 , which is irrelevant to the parameters, this allows us to consider the following penalized least squares problem: Eq (4) shows that the minimization problem Eq (3) has been converted to a univariate minimization problem. Since, the univariate solutions for regression coefficients are presented in the papers describing these penalized regression methods, we can use these solutions to obtain m kj . NSC uses the LASSO penalty function p λ (|μ kj |) = λ|μ kj | [4], and the resulting estimator for μ kj is given bym where "sgn" is a sign function and z + is the positive part of z. The LASSO solution is equivalent to the soft shrinkage estimate [16]. The resulting estimators for μ kj under alternative penalties are presented in the next subsection.
To predict the class of a new sample x Ã ¼ ðx Ã 1 ; :::; x Ã p Þ T , we define the discriminant score for class k as x j þm kj m k s j is a shrunken mean and π k = n k /n is a prior probability estimate for class k. The shrunken mean and the discriminant score depend on the shrinkage method used, hence the choice of shrinkage method affects class prediction and gene selection. Finally, the classification rule is given by Alternative shrinkage methods (ALT-NSC) Here, we described several shrinkage methods that are possible alternatives to soft shrinkage. The first order derivative of the SCAD penalty function [10] is defined as for some a > 2. This penalty function gives smaller penalties on larger coefficients. The resulting estimator for μ kj iŝ If a is close to 2, then SCAD behaves like a hard shrinkage estimate when estimating μ kj . The adaptive LASSO penalty function [11], which is the LASSO penalty function with a data-dependent weight, is given by where a > 0. The resulting solution iŝ The adaptive LASSO solution is equivalent to soft shrinkage when a = 0 and is similar to the nonnegative garotte when a = 1 [17] (although the nonnegative garotte requires additional sign restrictions).
The MCP penalty function [12] is defined as p l;a ðjm kj jÞ ¼ where a > 1. The resulting solution is given bŷ The MCP solution is equivalent to firm shrinkage, which offers advantages over soft and hard shrinkage [18]. The MCP solution approaches hard shrinkage as a ! 1 and soft shrinkage as a ! 1.
As mentioned previously, these shrinkage methods are known to have oracle properties under some mild conditions (for details, see [10], [11] and [12]). The LASSO solution is inconsistent because it produces estimates biased toward zero. This bias in the LASSO can also cause its variable selection to be inconsistent [12]. The basic reason that the alternative shrinkages can produce better estimates is because they have different rules for estimating the coefficients μ kj , which depend on the size of |μ kj |. When the sizes of the coefficients are large, these procedures leave them almost unpenalized (or completely unpenalized). Thus, they overcome the tendency of soft shrinkage to produce biased estimates.
While soft shrinkage has one tuning parameter λ, the alternative shrinkage methods have two tuning parameters, namely a and λ. The tuning parameter a controls the size of the penalties for large coefficients. The tuning parameters are determined by cross validation (CV). In our subsequent analysis, six values of the tuning parameter a were examined for each ALT-NSC and genewise shrinkage method: (0.5, 1, 1.5, 2, 2.5, 3) for the adaptive LASSO penalty, (2.01, 2.2, 2.5, 2.8, 3.2, 3.7) for the SCAD penalty and (1.01, 1.3, 1.7, 2, 2.5, 3) for the MCP penalty. For each method, thirty values of λ were considered. For the case when there are ties among the CV prediction accuracies or g-means, we chose the parameters resulting in a smaller number of genes.

Genewise shrinkage methods (GEN-NSC)
Here we extend the shrinkage methods discussed in the previous subsection to genewise inference. Let μ j = (μ 1j , . . ., μ Kj ) T denote a K × 1 mean vector for the jth gene. Further let μ 0 j ¼ ðm 0 1j ; :::; m 0 Kj Þ T denote the corresponding mean estimator vector. The objective function to be minimized for the genewise penalized least squares estimator is Note that instead of penalizing μ kj , we penalize the vector μ j . Using the fact that and the orthonormality of W, Eq (5) can be written as The solution to Eq (6) is genewise separable, and thus one may solve it by minimizing Using the result of Antoniadis et al. [15], the solution to Eq (7) is given bŷ where rðk μ 0 j kÞ is the solution to min Since Eq (8) depends on the penalty function p λ (Á), we can derive genewise shrinkage methods under diverse penalty functions. Note that the problem of solving Eq (9) is equivalent to that of Eq (4), and thus, the computational complexity of the genewise shrinkages is the same as that of the alternative shrinkages.
When the LASSO penalty is employed, If the SCAD penalty is used, where a > 2. For the adaptive LASSO penalty, the resulting solution iŝ For the MCP penalty, where a > 1.
The thresholding rules of the genewise shrinkage methods are determined by k μ 0 j k instead of an individual m 0 kj . By pulling information from the neighboring mean estimators belonging to the same gene, the genewise shrinkage may allow the accuracy of the thresholding mean estimators to be improved. Furthermore, Eq (5) has a nice Bayesian interpretation [15]: the genewise penalized least squares method models the mean coefficients belonging to the same gene by using proper prior distributions.

Geometric mean methods (GM)
Adpating the idea of Blagus and Lusa [9], we considered the optimal values of tuning parameters (a, λ) of the (genewise) althernative shrinkage methods to be those that maximize the geometric mean of the class-specific prediction accuracies.
Throughout the remainder of this manuscript, we will refer to the genewise version of each method by adding "G" to the beginning of its abbreviated name. Moreover, when the tuning parameters are determined by the geometric mean, we will add "GM-" to the beginning of the name. For example, "GADA" refers to the genewise version of adaptive lasso, and "GM-GADA" referes to "GADA" whose tunnig parameters are determined by the geometric mean.

Simulations
In this section, we conducted simulation studies to compare ALT-NSC, GEN-NSC, and the GM versions of ALT-NSC and GEN-NSC to conventional NSC. We examined the overall prediction accuracy (PA), geometric mean (g-mean), area under the curve (AUC, only for a twoclass classification scenario), sensitivity (SEN) and positive predictive value (PPV). SEN is the number of detected important genes divided by total number of important genes. PPV is the number of detected important genes divided by total number of genes the method selects. As in Dudoit et al. [19], we presented the median and upper quartiles of the evaluation measures.
In a two class classification scenario, we generated two classes from multivariate normal distributions with sample sizes, n 1 = nπ 1 and n − n 1 : MVN(μ 1 , S) and MVN(μ 2 , S), each had a dimension of p = 2500. μ 1 was equal to 0 for all genes and μ 2 was 0.5 for 100 genes and 0 for the rest of genes. The differentially expressed (DE) 100 genes were randomly selected. As in Guo et al. [5] and Pang et al. [6], S was a block diagonal matrix with each diagonal block S ρ having an auto-regressive structure and alternating in sign. The block size was 50 × 50 and there were 50 blocks, which gave a total of 2500 genes: ρ took values of 0.5 and 0.9, indicating sparse and dense correlation blocks, and π 1 took values of 0.5 and 0.8, corresponding to class-balance and -imbalance. The three class classification scenario is very similar to the previous one. We generated three classes from multivariate normal distributions with the fixed proportions (π 1 , π 2 , π 3 ) = (0.4, 0.2, 0.4): MVN(μ 1 , S), MVN(μ 2 , S) and MVN(μ 3 , S), each of which had the same dimension as the previous scenario. Ninety differentially expressed genes were randomly selected and those DE genes had mean vectors of (γ, 0,−γ). We used the same S as in the first simulation. We let γ take the values of 0.5 and 0.1 to study how the effect size of DE genes is related to the performances of the classifiers.
Given a, the tuning parameter λ was chosen to minimize the m-fold CV misclassification error rate on training data set, and we let m = 5. We generated training data sets with sample size 100 and test data sets with 10 times the sample size of the training data. Then test error rates were computed using the tuning parameters selected by CV. Gene selection was performed in the same way as in Tibshirani et al. [1,2], where the genes with at least one nonzero difference were selected (the jth gene is selected if there exists at least one k such thatm kj 6 ¼ 0).
Simulation results for the two-class scenario have been presented in Tables 1, 2, 3 and 4. All of the proposed methods performed very similarly to NSC in terms of PA, g-mean and AUC except when the diagonal block matrix was dense and class was imbalanced; ALT-NSC improved the g-mean slightly, the GM versions of ALT-NSC and GEN-NSC also improved the Table 1. Two groups with sparse block diagonal structure (ρ = 0.5) and class-balance (π 1 = 0.5). "PA", "g-mean" and "AUC" are overall accuracy, geometric mean and AUC of class prediction, calculated from the test data set. "SEN" and "PPV" are sensitivity and positive predictive value of gene selection obtained from the training data set. "Median" and "Upper" are median and upper quartiles of 100

Method
repetitions. The scale of all the numbers is a percentage.
doi:10.1371/journal.pone.0171068.t001 Table 2. Two groups with dense block diagonal structure (ρ = 0.9) and class-balance (π 1 = 0.5). g-mean, and GM-GSCAD had the highest g-mean in this setting. The classifiers had poorer prediction perfromance based on PA, g-mean and AUC when the block diagonal matrix was dense and classes were imbalanced. Gene selection accuracy (SEN and PPV) also decreased when class was imbalanced.

Method
Simulation results for the three-class scenario have been presented in Tables 5, 6, 7 and 8. Unlike the two-class scenario, class sizes were always imbalanced. However, we considered different values of γ, which was the effect size of DE genes, and observed that our proposed methods performed significantly better than NSC. When the effect size of DE genes was moderate Table 3. Two groups with sparse block diagonal structure (ρ = 0.5) and class-imbalance (π 1 = 0.8). (γ = 0.5), only the ALT-NSC had better PA and g-mean, but GEN-NSC and GM methods showed no improvement. Under the very small effect size, all the classifiers performed very similarly in terms of PA, but their performance varied with respect to g-mean. MCP had the highest g-mean, and the other penalty functions gave zero as the median quartile of g-mean.

Method
Secondly, GM significantly improved g-mean for all the penalty functions, and the amount of the improvement was greater when the genewise penalties were used for the sparse block matrix. Finally, gene selection was also improved by GM: SEN increased and PPV stayed at  Table 6. Three groups with class-imbalance (π 1 = 0.4, π 2 = 0.2, π 3 = 0.4), dense block diagonal structure (ρ = 0.9) and moderate mean difference (γ = 0.5). almost the same value, compared to the corresponding methods based on the cross-validation prediction accuracy criterion.

Method
We randomly split each data set into a training set and a test set with 33% of the data allocated to the test set. This process was iterated 100 times. We chose optimal tuning parameters (a, λ) as the values that give the maximum of 5-fold CV prediction accuracy or g-mean under the training data set. We compared prediction accuracy, g-mean, AUC (only for the Gravier data set) and the number of selected genes. The Gravier data set is slightly imbalanced; the proportions of "no events" and "early metastasis" are 0.66 and 0.34. There was no improvement of PA, g-mean and AUC by the proposed methods, but using the alternative penalty functions reduced N-sig. MCP had the smallest Nsig (Table 10). The Pomeroy data set is balanced. ALT-NSC improved g-mean, but not PA, and reduced N-sig, with the exception of SCAD. GM did not improve either PA or g-mean. MCP performed the best with higher PA and g-mean and smaller N-sig compared to NSC. GMCP performed very similarly to MCP with slightly inferior prediction performance but much smaller N-sig (Table 11). The Yeoh data set is imbalanced. Like the Golub data set [24], the prediction was easy for this data set despite the large number of classes. PA and g-mean were not improved, but N-sig was reduced by the proposed methods. Both ALT-NSC and GEN-NSC reduced N-sig, with GMCP having the smallest N-sig. GMCP selected 418 genes, while NSC selected 1456 genes at the median quartile (Table 12). The Ramaswamy data set has a large number of classes, and, as a result, all the classifeirs had low g-mean values. All the MCP methods and GM-SCAD had positive g-mean values, but the other methods had zero as the 90% quantile of g-mean (Table 13).

Discussion
In this article, we proposed several variations of NSC that use alternative genewise shrinkages. We derived these methods using three penalized regression models that enjoy oracle properties and have closed-form solutions under an orthonormal design. We also further modified these variants of NSC by adapting genewise penalty functions that use the correlations between the parameters belonging to the same gene, and the geometric mean approach for class-imbalanced data. We showed that these methods have better performance than conventional NSC in terms of prediction accuracy, g-mean and gene selection through simuation and real data studies. We conducted simulation studies to evaluate the proposed methods. We used a block diagonal covariacne matrix with the block being an auto-regresive structure with a paramter ρ.
When ρ is small, the block matrix becomes sparse, and thus it behaves like an identity matrix. Ohterwise, when ρ is large, the block matrix becomes dense, and thus it behaves like a block exchageable matrix. We varied ρ, the degree of class imbalance and the effect size of DE genes. The proposed methods had better peformance in terms of prediction accuracy and gene selection compared to NSC when the block matrix was dense and class was imbalanced. When the effect size was moderate, ALT-NSC methods performed well and among those MCP performed the best. When the effect size is small, GM method performed well with the highest gmean.
We applied the proposed methods to four real microarray data sets. The proposed methods improved the g-mean, but not the overall prediction accuracy, in the data sets we considered. When the number of classes was two (Gravier data set) or prediction was easy (Yeoh data set), only gene selection was improved by the alternative penalty functions. In the data set with the moderate number of classes (Pomeroy data set), g-mean was improved by the alternative penalty functions. When the data set had very large number of classes (Ramaswamy data set), using the genewise penalty functions reduced the performance.
In many applications, it is desirable to develop classifiers that use the smallest possible number of genes. For example, one may wish to use an RT-PCR assay to discriminate between different types of tumors or to determine the prognosis of a patient with a given tumor type. Such an assay will be prohibitively expensive if the expression levels of more than a handful of genes are needed. Thus, a classification method that produces comparable accuracy to another method using fewer genes would be considered superior in these situations. Hence, the fact that our proposed methods consistently use fewer genes than conventional NSC represents a significant advantage of our methods even if prediction accuracy is not always improved. MCP would be very useful in real applications becacuse they have shown to select the most reliable parsimonious gene set with competitive predictive accuracy.
Both simulation and real data studies showed that our proposed methods produced greater improvement compared to conventional NSC in the data sets with three or four classes, but not in data sets with very large numbers of classes. When the number of classes is large, the sample size per class is usually small, and this affects the efficiency of shrunken mean estimators. By the virtue of the oracle property, ALT-NSC can produce more efficient estaimtes of the shrunken means, which yields better performance on both prediction and gene selection. Genewise shrinkages also improve the NSC classifier by combining the related genes in the same class, producing more accurate estimates when the size of the class is small (which commonly occurs when the number of classes is large). Clearly, the genewise penalty (GEN-NSC) shrinks a mean estimator toward zero faster than the non-genewise penalty (ALT-NSC), as shown in the simulations and the real data study. Appropriately fast shrinkage will be able to remove noisy genes effectively. However, one observes that when the number of classes is large, such as the Ramaswamy data set, the amount of shrinkage produced by the genewise penalty is so large that NSC loses some prediction accuracy. Thus, applying GEN-NSC to data sets with too many classes may not be recommended when the objective is to maximize predictive accuracy (rather than minimize the number of selected genes).
The performance of the methods can be affected by heterogeneity of gene expression, and this heterogeneity happens when the variances of genes differ by groups. This was pointed out by Tibshirani et al. [2] and was observed in the compariative study of Lee et al. [25]. Tibshirani et al. [2] suggested an ad-hoc method to account for heterogeneity. However, that method is only applicable in the case where class centroids are not separated. The method of Pang et al. [6] may overcome this problem because it combines the linear and quadratic discriminant scores, where the latter assumes unequal variances by classes. Since the method does not perform varaible selection, applying the mean shrinkage to their method would be a future research to handle this heterogeneit problem.
Supporting information S1 Rscript. R source code. This file contains the R functions that implement ALT-NSC and GEN-NSC. (ZIP)