Feature Selection and Cancer Classification via Sparse Logistic Regression with the Hybrid L1/2+2 Regularization

Cancer classification and feature (gene) selection play an important role in knowledge discovery from genomic data. Although logistic regression is one of the most popular classification methods, it does not induce feature selection. In this paper, we present a new hybrid L1/2+2 regularization (HLR) function, a linear combination of the L1/2 and L2 penalties, to select relevant genes in logistic regression. The HLR approach inherits attractive characteristics from both the L1/2 penalty (sparsity) and the L2 penalty (a grouping effect, in which highly correlated variables enter or leave the model together). We also propose a novel univariate HLR thresholding operator to update the estimated coefficients, and we develop a coordinate descent algorithm for the HLR-penalized logistic regression model. Empirical results and simulations indicate that the proposed method is highly competitive with several state-of-the-art methods.


Introduction
With advances in high-throughput molecular techniques, researchers can study the expression of tens of thousands of genes simultaneously. Cancer classification based on gene expression levels is one of the central problems in genome research. Logistic regression is a popular classification method with an explicit statistical interpretation: it yields probabilities of classification with respect to the cancer phenotype. However, in most gene expression studies, the number of genes far exceeds the sample size. This situation is known as the high-dimension, low-sample-size problem, and standard logistic regression cannot be applied directly to estimate the regression parameters.
To deal with high dimensionality, one popular technique is regularization. A well-known regularization method is the L1 penalty [1], the least absolute shrinkage and selection operator (Lasso), which performs continuous shrinkage and gene selection simultaneously. Other L1-type regularization methods include the smoothly clipped absolute deviation (SCAD) penalty [2], which is symmetric, nonconcave, and singular at the origin so as to produce sparse solutions, and the adaptive Lasso [3], which penalizes different coefficients with adaptive weights in the L1 penalty. However, L1-type regularization may yield inconsistent feature selection in some situations [3] and often introduces extra bias into the parameter estimates of the logistic regression [4]. Xu et al. [5] proposed the L1/2 penalty, which can be taken as a representative of the Lq (0 < q < 1) penalties in both sparsity and computational efficiency, and which has demonstrated many attractive properties, such as unbiasedness and oracle properties [5-7]. However, like most regularization methods, the L1/2 penalty ignores correlation between features and is consequently unable to analyze data with dependent structures. If there is a group of variables whose pairwise correlations are very high, the L1/2 method tends to select only one variable to represent the whole group. In gene expression studies, genes are often highly correlated when they share the same biological pathway [8]. Some efforts have been made to address highly correlated variables. Zou and Hastie proposed the Elastic net penalty [9], a linear combination of the L1 and L2 (ridge) penalties, which emphasizes a grouping effect: strongly correlated genes tend to be in or out of the model together. Becker et al. [10] proposed the Elastic SCAD (SCAD-L2), a combination of the SCAD and L2 penalties; by introducing the L2 penalty term, Elastic SCAD also works for groups of predictors.
In this article, we propose the HLR (hybrid L1/2+2 regularization) approach to fit logistic regression models for gene selection, where the regularization term is a linear combination of the L1/2 and L2 penalties. The L1/2 penalty achieves feature selection. In theory, a strictly convex penalty function provides a sufficient condition for the grouping effect of variables, and the L2 penalty guarantees strict convexity [11]. Therefore, the L2 penalty simultaneously induces the grouping effect in the HLR approach. Experimental results on artificial and real gene expression data demonstrate that the proposed method is very promising.
The rest of the article is organized as follows. In Section 2, we define the HLR approach and present an efficient algorithm for solving the logistic regression model with the HLR penalty. In Section 3, we evaluate the performance of the proposed approach on simulated data and five public gene expression datasets. Section 4 concludes the paper.

Regularization
Suppose that dataset D has n samples, D = {(X1, y1), (X2, y2), ..., (Xn, yn)}, where Xi = (xi1, xi2, ..., xip) is the i-th sample with p dimensions and yi is the corresponding dependent variable.
For any non-negative λ, the general regularization form is

$\hat{\beta} = \arg\min_{\beta} \{ L(\beta) + \lambda P(\beta) \}, \quad (1)$

where L(β) is a loss function and P(β) represents the regularization term. Many regularization methods have been proposed in recent years. One of the most popular is the L1 regularization (Lasso), where $P(\beta) = \sum_{j=1}^{p} |\beta_j|$. Other L1-type regularizations include SCAD, the adaptive Lasso, the Elastic net, stagewise Lasso [12], the Dantzig selector [13] and Elastic SCAD. However, in genomic research, the result of L1-type regularization may not be sparse enough for interpretation. A typical microarray or RNA-seq dataset has many thousands of predictors (genes), and researchers often wish to select fewer but more informative genes. Besides this, the L1 regularization is asymptotically biased [14,15]. Although the L0 regularization, where $P(\beta) = \sum_{j=1}^{p} |\beta_j|^0$, yields the sparsest solutions, it requires solving an NP-hard combinatorial optimization problem.
To obtain a more concise solution and improve the predictive accuracy of the classification model, we need to move beyond the L1 and L0 regularizations to the Lq (0 < q < 1) regularization. The L1/2 regularization can be taken as a representative of the Lq (0 < q < 1) penalties and admits an analytically expressive thresholding representation [5]. With this thresholding representation, solving the L1/2 regularization is much easier than solving the L0 regularization. Moreover, the L1/2 penalty is unbiased and has oracle properties [5-7]. These characteristics make the L1/2 penalty an efficient tool for high-dimensional problems [16,17]. However, because it is insensitive to correlation structure, the L1/2 penalty tends to select only one variable from a group of highly correlated variables, and this drawback can deteriorate its performance.
2.2 Hybrid L1/2+2 regularization (HLR)

For any fixed non-negative λ1 and λ2, we define the hybrid L1/2+2 regularization (HLR) criterion

$L(\lambda_1, \lambda_2, \beta) = \|y - X\beta\|^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j|^{1/2} + \lambda_2 \sum_{j=1}^{p} \beta_j^2, \quad (2)$

where β = (β1, ..., βp) are the coefficients to be estimated. The HLR estimator $\hat{\beta}$ is the minimizer of Eq (2):

$\hat{\beta} = \arg\min_{\beta} L(\lambda_1, \lambda_2, \beta). \quad (3)$

Let α = λ1/(λ1 + λ2); then solving Eq (3) is equivalent to the optimization problem

$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 \quad \text{subject to} \quad \alpha \sum_{j=1}^{p} |\beta_j|^{1/2} + (1-\alpha) \sum_{j=1}^{p} \beta_j^2 \le t \ \text{for some } t. \quad (4)$

We call the function α|β|^{1/2} + (1 − α)|β|^2 the HLR penalty, which is a combination of the L1/2 and L2 penalties. When α = 0, the HLR penalty reduces to the ridge regularization; when α = 1, it becomes the L1/2 regularization. The L2 penalty contributes the grouping effect, and the L1/2 penalty induces sparse solutions. This combination of the two penalties makes the HLR approach capable of dealing with correlated data while still producing a succinct model.
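As a quick illustration of the reparameterized penalty in Eq (4), the following sketch (ours, not the authors' code; the name `hlr_penalty` is a hypothetical helper) computes the HLR penalty value for a coefficient vector:

```python
import numpy as np

def hlr_penalty(beta, alpha):
    """HLR penalty alpha*sum(|b|^(1/2)) + (1-alpha)*sum(b^2).

    alpha = 1 recovers the L1/2 penalty, alpha = 0 the ridge penalty.
    """
    beta = np.asarray(beta, dtype=float)
    return alpha * np.sum(np.abs(beta) ** 0.5) + (1.0 - alpha) * np.sum(beta ** 2)

beta = [0.0, -0.25, 1.0]
print(hlr_penalty(beta, alpha=1.0))   # pure L1/2: 0.5 + 1.0 = 1.5
print(hlr_penalty(beta, alpha=0.0))   # pure ridge: 0.0625 + 1.0 = 1.0625
```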
Fig 1 shows four regularization methods, the Lasso, L1/2, Elastic net and HLR penalties, with an orthogonal design matrix in the regression model. The Lasso and Elastic net estimators are biased, whereas the L1/2 penalty is asymptotically unbiased. Like the L1/2 method, the HLR approach outperforms the Lasso and Elastic net with respect to unbiasedness.
Fig 2 shows the two-dimensional contour plots of the penalty functions of the Lasso, Elastic net, L1/2 and HLR approaches. The plots suggest that the L1/2 penalty is non-convex, whereas the HLR penalty is convex for the given α. The following theorem shows how the HLR strengthens the L1/2 regularization.

Theorem 1. Given dataset (y, X) and (λ1, λ2), the HLR estimates $\hat{\beta}$ are given by

$\hat{\beta} = \arg\min_{\beta} \left\{ \beta^{T} \left( \frac{X^{T}X + \lambda_2 I}{1 + \lambda_2} \right) \beta - 2 y^{T} X \beta + \lambda_1 \sum_{j=1}^{p} |\beta_j|^{1/2} \right\}. \quad (5)$

The L1/2 regularization can be rewritten as

$\hat{\beta} = \arg\min_{\beta} \left\{ \beta^{T} (X^{T}X) \beta - 2 y^{T} X \beta + \lambda_1 \sum_{j=1}^{p} |\beta_j|^{1/2} \right\}. \quad (6)$

The proof of Theorem 1 can be found in S1 File. Theorem 1 shows that the HLR approach is a stabilized version of the L1/2 regularization. Note that $\hat{\Sigma} = X^{T}X$ is a sample version of the correlation matrix Σ, and

$\frac{X^{T}X + \lambda_2 I}{1 + \lambda_2} = (1 - \delta)\hat{\Sigma} + \delta I, \quad \delta = \frac{\lambda_2}{1 + \lambda_2},$

shrinks $\hat{\Sigma}$ toward the identity matrix. Classification accuracy can often be enhanced by replacing $\hat{\Sigma}$ with a more shrunken estimate in linear discriminant analysis [18,19]. In other words, the HLR improves the L1/2 technique by regularizing $\hat{\Sigma}$ in Eq (6).

The sparse logistic regression with the HLR method
Suppose that dataset D has n samples, D = {(X1, y1), (X2, y2), ..., (Xn, yn)}, where Xi = (xi1, xi2, ..., xip) is the i-th sample with p genes and yi is the corresponding dependent variable taking a binary value of 0 or 1. Define the classifier f(x) = e^x / (1 + e^x); the logistic regression model is

$P(y_i = 1 \mid X_i) = f(X_i \beta) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)}, \quad (7)$

where β = (β1, ..., βp) are the coefficients to be estimated. With simple algebra, the regression model can be presented as

$\log \frac{P(y_i = 1 \mid X_i)}{1 - P(y_i = 1 \mid X_i)} = X_i \beta. \quad (8)$

In this paper, we apply the HLR approach to the logistic regression model. For any fixed non-negative λ and α, the sparse logistic regression model based on the HLR approach is defined as

$\hat{\beta} = \arg\min_{\beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log f(X_i \beta) + (1 - y_i) \log \left(1 - f(X_i \beta)\right) \right] + \lambda \sum_{j=1}^{p} \left( \alpha |\beta_j|^{1/2} + (1 - \alpha) \beta_j^2 \right) \right\}. \quad (9)$
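As a minimal sketch, the penalized objective of Eq (9) can be written out directly (our helper names, not the authors' code):

```python
import numpy as np

def hlr_objective(beta, X, y, lam, alpha):
    # Penalized negative log-likelihood of Eq (9).
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # f(X_i beta)
    eps = 1e-12                               # guard against log(0)
    nll = -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
    penalty = np.sum(alpha * np.abs(beta) ** 0.5 + (1.0 - alpha) * beta ** 2)
    return nll + lam * penalty
```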

Solving algorithm for the sparse logistic regression with the HLR approach
The coordinate descent algorithm [20] is an efficient method for solving regularization models because its computational cost increases linearly with the dimension of the problem. Its standard procedure is as follows: for each βj (j = 1, 2, ..., p), partially optimize the target function with respect to βj with the remaining elements of β fixed at their most recently updated values, and iteratively cycle through all coefficients until convergence. The specific form of the coefficient update is associated with the thresholding operator of the penalty. With the dataset D defined as before and the variables standardized so that $\sum_{i=1}^{n} x_{ij} = 0$ and $\sum_{i=1}^{n} x_{ij}^2 = 1$, and following Friedman et al. [20] and Liang et al. [16], we present the coordinate-wise update form for the HLR approach:

$\beta_j \leftarrow \mathrm{Half}\!\left( \frac{\omega_j}{1 + 2\lambda(1-\alpha)}, \ \frac{\lambda \alpha}{1 + 2\lambda(1-\alpha)} \right), \quad (10)$

where $\omega_j = \sum_{i=1}^{n} x_{ij} (y_i - \tilde{y}_i^{(j)})$ and $\tilde{y}_i^{(j)} = \sum_{k \ne j} x_{ik} \beta_k$ is the partial residual for fitting βj; the factor 1 + 2λ(1 − α) arises from the quadratic (L2) part of the penalty. Half(z, r) is the L1/2 thresholding operator:

$\mathrm{Half}(\omega_j, \lambda) = \begin{cases} \frac{2}{3} \omega_j \left( 1 + \cos\left( \frac{2\pi}{3} - \frac{2}{3} \varphi_{\lambda}(\omega_j) \right) \right), & |\omega_j| > \frac{\sqrt[3]{54}}{4} \lambda^{2/3} \\ 0, & \text{otherwise,} \end{cases} \quad (11)$

where $\varphi_{\lambda}(\omega_j) = \arccos\left( \frac{\lambda}{8} \left( \frac{|\omega_j|}{3} \right)^{-3/2} \right)$. Eq (9) can be linearized by a one-term Taylor series expansion:

$L(\lambda, \alpha, \beta) \approx \frac{1}{2n} \sum_{i=1}^{n} W_i (Z_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} \left( \alpha |\beta_j|^{1/2} + (1 - \alpha) \beta_j^2 \right),$

where $Z_i = X_i \beta + \frac{y_i - f(X_i \beta)}{f(X_i \beta)(1 - f(X_i \beta))}$ is the estimated (working) response, $W_i = f(X_i \beta)(1 - f(X_i \beta))$ is the weight for the estimated response, and $f(X_i \beta) = \exp(X_i \beta)/(1 + \exp(X_i \beta))$ is the value evaluated at the current parameters. We can thus redefine the partial residual for fitting the current βj as $\tilde{Z}_i^{(j)} = \sum_{k \ne j} x_{ik} \beta_k$, with ωj computed from the weighted working responses. The coordinate descent algorithm for the HLR-penalized logistic model cycles these weighted updates, refreshing Zi and Wi after each full sweep, until the coefficients converge.
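The sketch below (ours, not the authors' published code; it assumes the update scaling derived above, which may differ from the paper's exact formulation) implements this coordinate descent in Python, applying the half-thresholding operator of Eq (11) to the IRLS-weighted working responses:

```python
import numpy as np

def half_threshold(omega, lam):
    # L1/2 half-thresholding operator of Xu et al. [5]: closed-form
    # minimizer of 0.5*(b - omega)^2 + lam*|b|^(1/2).
    if abs(omega) <= (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0):
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))

def hlr_logistic_cd(X, y, lam, alpha, max_iter=200, tol=1e-6):
    # Coordinate descent for Eq (9); X standardized, y in {0, 1}.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        eta = X @ beta
        f = 1.0 / (1.0 + np.exp(-eta))          # current fitted probabilities
        W = np.clip(f * (1.0 - f), 1e-5, None)  # IRLS weights W_i
        Z = eta + (y - f) / W                   # working responses Z_i
        for j in range(p):
            old_bj = beta[j]
            r = Z - eta + X[:, j] * old_bj      # partial residual excluding j
            a = np.dot(W * X[:, j], X[:, j]) / n
            omega = np.dot(W * X[:, j], r) / n
            denom = a + 2.0 * lam * (1.0 - alpha)  # quadratic term + L2 penalty
            beta[j] = half_threshold(omega / denom, lam * alpha / denom)
            eta += X[:, j] * (beta[j] - old_bj)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

In practice one would compute a path of solutions over a decreasing grid of λ values with warm starts, as in Friedman et al. [20].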

Analyses of simulated data
The goal of this section is to evaluate the performance of logistic regression with the HLR approach in a simulation study. Four approaches are compared with the proposed method: logistic regression with the Lasso, L1/2, SCAD-L2 and Elastic net regularizations, respectively. We simulated data from a true model in which X ~ N(0, 1), ε is an independent random error term, and σ is a parameter that controls the signal-to-noise ratio. Four scenarios are presented. In each, the dimension of the predictors is 1000. The notation ·/· denotes the numbers of observations in the training and test sets, respectively (e.g., 100/100). The details of the four scenarios follow.

In the fourth scenario, we defined three groups of correlated variables together with a set of single independent variables, so that the true predictors comprised three correlated groups plus independent features. An ideal sparse regression method would select only the 200 true features and set the coefficients of the 800 noise features to zero.
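Because the exact scenario definitions appeared as displayed equations in the original, we include only a generic generator for this kind of design: grouped, equicorrelated Gaussian features with a Bernoulli response from a logistic true model. The function name, group sizes and effect sizes below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_grouped(n, p=1000, rho=0.6, group_size=30, n_groups=3):
    # Grouped features: each group shares a latent factor so that the
    # pairwise within-group correlation equals rho; remaining features
    # are independent noise.
    X = rng.standard_normal((n, p))
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        shared = rng.standard_normal((n, 1))
        X[:, cols] = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * X[:, cols]
    beta = np.zeros(p)
    beta[: n_groups * group_size] = 1.0                  # true non-zero effects
    logits = X @ beta / np.sqrt(n_groups * group_size)   # keep logits moderate
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    return X, y, beta

X_train, y_train, beta_true = simulate_grouped(n=100, rho=0.9)
```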
In our experiments, we set the correlation coefficient ρ of the features to 0.3, 0.6 and 0.9, respectively. The Lasso and Elastic net were run with Glmnet (a Matlab package, version 2014-04-28, available at http://web.stanford.edu/~hastie/glmnet_matlab/). The optimal regularization (tuning) parameters, which balance the trade-off between data fit and model complexity, of the Lasso, L1/2, SCAD-L2, Elastic net and HLR approaches were selected by 10-fold cross-validation (CV) on the training set. Note that the Elastic net and HLR methods were tuned by 10-fold CV over two-dimensional parameter surfaces, and SCAD-L2 over a three-dimensional parameter surface. The different classifiers were then built from these sparse logistic regressions with the estimated tuning parameters. Finally, the resulting classifiers were applied to the test set for classification and prediction.
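A sketch of the two-dimensional tuning described here, reusing the hypothetical `hlr_logistic_cd` solver from above (the grid values are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def tune_hlr(X, y, lambdas, alphas, n_splits=10):
    # Grid search over (lambda, alpha), maximizing mean 10-fold CV accuracy.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best = (None, None, -np.inf)
    for lam in lambdas:
        for alpha in alphas:
            accs = []
            for tr, va in cv.split(X, y):
                beta = hlr_logistic_cd(X[tr], y[tr], lam, alpha)
                pred = (X[va] @ beta > 0).astype(int)  # 0.5 probability cut-off
                accs.append(np.mean(pred == y[va]))
            if np.mean(accs) > best[2]:
                best = (lam, alpha, np.mean(accs))
    return best

lam, alpha, cv_acc = tune_hlr(X_train, y_train,
                              lambdas=np.logspace(-3, 0, 10),
                              alphas=np.linspace(0.1, 0.9, 5))
```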
We repeated the simulations 500 times for each penalty method and computed the mean classification accuracy on the test sets. To evaluate the quality of the selected features, the sensitivity and specificity of feature selection [21] were defined as follows. Let β* denote the true coefficient vector and $\hat{\beta}$ the estimated one; then

$\mathrm{TN} := |\neg\beta^{*} \ast \neg\hat{\beta}|_0, \quad \mathrm{FP} := |\neg\beta^{*} \ast \hat{\beta}|_0, \quad \mathrm{FN} := |\beta^{*} \ast \neg\hat{\beta}|_0, \quad \mathrm{TP} := |\beta^{*} \ast \hat{\beta}|_0,$

$\mathrm{Sensitivity} := \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{Specificity} := \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},$

where ∗ is the element-wise product, |·|0 counts the non-zero elements of a vector, and ¬β* and ¬$\hat{\beta}$ are the logical negations (zero/non-zero indicators) of β* and $\hat{\beta}$. As shown in Table 1, for all scenarios, the proposed HLR procedure generally gave higher or comparable classification accuracy relative to the Lasso, SCAD-L2, Elastic net and L1/2 methods. The HLR approach also achieved much higher sensitivity in identifying true features than the other four algorithms. For example, in scenario 1 with ρ = 0.9, the proposed method performed impressively (99.87% accuracy with perfect sensitivity and specificity). The specificity of the HLR approach is somewhat reduced, but not as markedly as the gain achieved in sensitivity.
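These definitions translate directly into code; a minimal sketch:

```python
import numpy as np

def selection_metrics(beta_true, beta_hat):
    # Sensitivity/specificity of feature selection, computed from the
    # zero/non-zero patterns of the true and estimated coefficients.
    s_true = np.asarray(beta_true) != 0
    s_hat = np.asarray(beta_hat) != 0
    tp = np.sum(s_true & s_hat)
    fn = np.sum(s_true & ~s_hat)
    tn = np.sum(~s_true & ~s_hat)
    fp = np.sum(~s_true & s_hat)
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = selection_metrics([1, 0, 2, 0], [0.9, 0, 0, 0.1])
print(sens, spec)  # 0.5 0.5
```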

Analyses of real data
To further evaluate the effectiveness of the proposed method, in this section we used several publicly available datasets: prostate, DLBCL and lung cancer. The prostate and DLBCL datasets were both downloaded from http://ico2s.org/datasets/microarray.html, and the lung cancer dataset is available from http://www.ncbi.nlm.nih.gov/geo under accession number [GSE40419].
More information on these datasets is given in Table 2.
Prostate. This dataset was originally described by Singh et al. [22]; it contains the expression profiles of 12,600 genes for 50 normal tissues and 52 prostate tumor tissues.
Lymphoma. This dataset (Shipp et al. [23]) contains 77 microarray gene expression profiles of the two most prevalent adult lymphoid malignancies: 58 samples of diffuse large B-cell lymphoma (DLBCL) and 19 follicular lymphomas (FL). The original data contain 7,129 gene expression values.
Lung cancer. As RNA sequencing (RNA-seq) is now widely used, it is important to test whether the proposed method can handle RNA-seq data. To verify this, one dataset produced by next-generation sequencing was included in our analysis. This dataset [24] contains 164 samples: 87 lung adenocarcinomas and 77 adjacent normal tissues.
We evaluated the performance of the HLR-penalized logistic regression models using random partitions: each dataset was split at random so that approximately 75% of the samples formed the training set and the remaining 25% the test set. The optimal tuning parameters were found by 10-fold cross-validation on the training set. The classification model was then built by the sparse logistic regression with the estimated tuning parameters. Finally, applying the classifier to the test set provides the prediction characteristics, such as classification accuracy and the AUC from receiver operating characteristic (ROC) analysis. The above procedure was repeated 500 times with different random dataset partitions. The mean number of selected genes and the training and test classification accuracies are summarized in Table 3, and the averaged AUC performances are shown in Fig 3.
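One repetition of this protocol might look as follows, again assuming the hypothetical `hlr_logistic_cd` and `tune_hlr` helpers sketched earlier:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def evaluate_once(X, y, lambdas, alphas, seed):
    # Random 75/25 split: tune on the training part, score on the test part.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    lam, alpha, _ = tune_hlr(X_tr, y_tr, lambdas, alphas)
    beta = hlr_logistic_cd(X_tr, y_tr, lam, alpha)
    prob = 1.0 / (1.0 + np.exp(-(X_te @ beta)))
    acc = np.mean((prob > 0.5).astype(int) == y_te)
    return acc, roc_auc_score(y_te, prob), np.count_nonzero(beta)
```

Averaging the returned accuracy, AUC and model size over 500 seeds gives the quantities summarized in Table 3 and Fig 3.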
As shown in Table 3, for the prostate dataset the classifier with the HLR approach gives an average 10-fold CV accuracy of 97.61% and an average test accuracy of 93.68% with about 12.6 genes selected. The classifiers with the Lasso, L1/2, SCAD-L2 and Elastic net methods give average 10-fold CV accuracies of 96.22%, 96.13%, 95.99% and 96.28%, and average test accuracies of 92.4%, 92.18%, 91.33% and 91.35%, with 13.7, 8.2, 22 and 15.2 genes selected, respectively. For the lymphoma dataset, the HLR method again achieves the best classification performance, with the highest accuracy rates on the training and test sets. For lung cancer, our method attained the best training accuracy; the test performance of the Elastic net was slightly better than ours. However, the HLR method achieved its result using only about 15.6 predictors (genes), compared with 28.9 genes for the Elastic net. Although the Lasso and L1/2 methods yielded the sparsest solutions, their classification performance was worse than that of the HLR method. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost. As shown in Fig 3, the proposed method achieved the best classification performance on the three real datasets among all competitors. For example, the AUCs from ROC analysis of the HLR method for the prostate, lymphoma and lung cancer datasets were estimated to be 0.9353, 0.9347 and 0.9932, respectively; the corresponding AUCs of the Lasso method were 0.9327, 0.9253 and 0.9813, all worse than the proposed HLR method. We summarize the top 10 (most frequently) selected genes found by the five regularization methods on the lung cancer gene expression dataset in Table 4; the corresponding information for the other datasets can be found in S2 File. Note that in Table 1 the proposed HLR method showed an impressive ability to select the true features in the simulated data. This suggests that the genes selected by the HLR method in these three cancer datasets are valuable to researchers seeking the key factors associated with cancer development. For example, in Table 4 the biomarkers selected by our HLR method include the advanced glycosylation end-product receptor (AGER), a member of the immunoglobulin superfamily predominantly expressed in the lung. AGER plays a role in epithelial organization, and decreased expression of AGER in lung tumors may contribute to the loss of epithelial tissue structure, potentially leading to malignant transformation [25]. The unique function of AGER in the lung means it could serve as an additional diagnostic tool for lung cancer [26], and even as a therapeutic target [27]. GATA2 (GATA-binding protein 2) is expressed principally in hematopoietic lineages and has essential roles in the development of multiple hematopoietic cell types, including erythrocytes and megakaryocytes; it is crucial for the proliferation and maintenance of hematopoietic stem cells and multipotent progenitors [28]. Kumar et al.
[29] showed a strong relationship between GATA2 and RAS-pathway-mutant lung tumor cells. To further validate the biomarkers selected by our method, we collected two independent cancer datasets. GSE19804 [30] contains 120 samples: 60 lung adenocarcinomas and 60 adjacent normal tissues. GSE32863 [31] contains 116 samples: 58 lung adenocarcinomas and 58 healthy controls. Both datasets are available from GEO under the series accession numbers [GSE19804] and [GSE32863]. We used the support vector machine (SVM) approach to build classifiers based on the first two, first five and first ten genes selected by the different regularization approaches from the lung cancer dataset (Table 4), training on the lung cancer dataset (Table 2). These classifiers were then applied to the two independent lung cancer datasets, GSE19804 and GSE32863, respectively.
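A sketch of this validation step, assuming expression matrices for the independent cohorts aligned to the same genes; the paper does not specify the SVM kernel, so the linear kernel here is an assumption, and variable names such as `X_gse19804` are hypothetical placeholders:

```python
from sklearn.svm import SVC

def validate_top_genes(X_train, y_train, X_external, y_external, gene_idx, k):
    # Train an SVM on the top-k selected genes of the lung cancer training
    # data, then score it on an independent cohort (GSE19804 or GSE32863).
    clf = SVC(kernel="linear")
    clf.fit(X_train[:, gene_idx[:k]], y_train)
    return clf.score(X_external[:, gene_idx[:k]], y_external)

# e.g., for the top two, five and ten HLR-selected genes:
# for k in (2, 5, 10):
#     acc = validate_top_genes(X_lung, y_lung, X_gse19804, y_gse19804,
#                              hlr_gene_idx, k)
```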
It is known that prediction models obtained in this way may be applicable only to samples from the same platform, cell type, environmental conditions and experimental procedure. Interestingly, however, as demonstrated in Table 5, all the classification accuracies achieved by the classifiers built on genes selected by the HLR approach exceed 90%; in particular, the classification accuracy on the GSE32863 dataset reaches 97.41% with the classifier based on the first ten genes. These performances are better than those obtained with genes selected by the other methods. For example, the accuracy of the classifier using the first two genes selected by the Elastic net on GSE19804 was estimated to be 86.67%, worse than the 90.83% of the classifier using the genes selected by our method. The classifier using the first five genes selected by SCAD-L2 on GSE32863 achieved 92.24%, worse than the 96.55% of the classifier using the genes selected by our HLR method. These results indicate that sparse logistic regression with the HLR approach can select powerfully discriminative genes.
In addition to the comparisons with the Lasso, L1/2, SCAD-L2 and Elastic net techniques, we also compared our results on the prostate and lymphoma datasets with those of other methods published in the literature. Note that we considered only methods evaluated with the CV approach, since approaches based on a single training/test partition are now widely regarded as unreliable [32]. Table 6 displays the best classification accuracy of each method. In Table 6, the classification accuracy achieved by the HLR approach is greater than that of the other methods, while the number of selected genes is smaller than for the other methods, except on the lymphoma dataset.


Table 1.
Mean results of the simulations, based on 500 repeats; the best performance among all methods is in bold. Sensitivity and specificity both measure the quality of the selected features, while accuracy evaluates the classification performance of the different regularization approaches on the test sets.

Table 3.
Mean results on the empirical datasets; the best performance is in bold.

Fig 3.
The AUC performance from the ROC analyses of each method on the prostate, lymphoma and lung cancer datasets.

Table 4.
The 10 genes most frequently selected by the five sparse logistic regression methods from the lung cancer dataset.

Table 5.
Validation results of the classifiers based on the top-ranked genes selected from the lung cancer dataset; the best performance is in bold.

Table 6.
Results reported in the literature; the best performance is in bold.