Abstract
High dimensional data are commonly encountered in various scientific fields and pose great challenges to modern statistical analysis. To address this issue, different penalized regression procedures have been introduced in the literature, but these methods cannot cope with outliers and leverage points in heavy-tailed high dimensional data. For this purpose, a new Robust Adaptive Lasso (RAL) method is proposed, based on a Pearson-residual weighting scheme. The weight function assesses the compatibility of each observation with the assumed model and downweights observations that are inconsistent with it. The RAL estimator can correctly select the covariates with non-zero coefficients and estimate the parameters simultaneously, not only in the presence of influential observations but also under high multicollinearity. We also discuss the model selection oracle property and the asymptotic normality of the RAL estimator. Simulation findings and real data examples demonstrate the better performance of the proposed penalized regression approach.
Citation: Wahid A, Khan DM, Hussain I (2017) Robust Adaptive Lasso method for parameter’s estimation and variable selection in high-dimensional sparse models. PLoS ONE 12(8): e0183518. https://doi.org/10.1371/journal.pone.0183518
Editor: Chenping Hou, National University of Defense Technology, CHINA
Received: April 26, 2017; Accepted: August 4, 2017; Published: August 28, 2017
Copyright: © 2017 Wahid et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Variable selection plays a vital role in modern statistical modeling and machine learning, especially for models with a large number of predictors and few observations, a setting known as high dimensionality. Including many predictors in a regression model reduces bias, but we wish to select a parsimonious set of important covariates for efficient prediction. Variable selection is therefore essential to identify the important variables and produce more interpretable models with better predictive power.
Penalization methods are very useful in this field, and a large body of literature has recently addressed this problem. The least absolute shrinkage and selection operator (LASSO), introduced by [1], is a key method for simultaneous coefficient estimation and predictor selection, and was further studied by [2] and [3]. [4] proposed bridge regression, which depends on an Lq penalty. [5] proposed a correlation-based penalty that encourages a grouping effect and performs well when there is high correlation among explanatory variables, although it does not do as well when the correlation is perfect [6]. The smoothly clipped absolute deviation (SCAD) penalty, proposed by [7], enjoys the oracle property, but the criterion is non-convex, which makes the computation much more difficult. [8] proposed the adaptive LASSO, which is convex and reduces the possible bias, so it selects the model consistently; he also proved the oracle property of the method. Several other penalized methods have also been proposed, such as [9, 10].
Recently, many sparse learning and classification algorithms have been proposed. [11] proposed the weighted sparse representation based classification (WSRC) method, a direct extension of SRC that integrates locality into sparse coding; WSRC provides good results in lower dimensional subspaces. [12] developed a new classification algorithm called Representative Vector Machines (RVMs); a comprehensive experimental evaluation demonstrates the effectiveness of the method over other classifiers. [13] presents a comprehensive survey of various aspects of structured sparsity-inducing feature selection (SSFC) methods. [14] and [15] proposed a new subspace learning algorithm called discriminant sparse neighborhood preserving embedding (DSNPE) and a robust feature extraction method for face recognition based on least squares regression, respectively.
Robust penalized techniques, such as least absolute deviation and quantile regression, have been used for predictor selection in the case of fixed dimensionality; see, for instance, [16, 17, 18, 19]. [17] studied L1-penalized LAD regression. The penalized composite likelihood method was proposed in [20] for robust estimation in high dimensions, with a focus on the efficiency of the method. [21, 22] studied L1-penalized quantile regression in high-dimensional sparse models.
In this paper, we develop a robust penalized method for estimating regression coefficients and selecting predictors. We consider a weighted likelihood estimating equation with the L1 penalty of [8], which employs a re-weighting of the components of the likelihood score function. The proposed approach is useful when the model is in doubt or when outliers are present in the data. This work is based on a recent proposal by [23], which attaches a weight to the score function contribution of each observation in such a way that the weight is close to one when the residual indicates a good match between the empirical and the assumed model distribution functions. For large residuals, on the other hand, there is a mismatch, and the corresponding likelihood contribution may require downweighting in order to obtain a robust solution.
However, [24] and [25] remain the pioneering articles in this particular area of research. According to [24], this approach downweights points that are large residual outliers, as well as points that are both residual outliers and high leverage points.
In the present work, we evaluate the performance of the proposed technique against some existing penalized methods, including LASSO, elastic net, adaptive lasso and correlation based adaptive lasso. To cover the effects of outliers in various situations, we take the percentage of outliers as 0%, 10%, 20% and 30%, drawn from different distributions. The performances are evaluated with respect to median prediction error, variable selection and bootstrap standard errors.
The rest of the paper is organized as follows. Section 2 provides the background by briefly reviewing the residual function and the weight function. We introduce the proposed robust penalized regression estimator in Section 3. In Section 4, we describe the selection of the tuning parameters. Section 5 presents the results of the simulation studies. In Section 6, we illustrate the performance of the proposed robust method on real data examples. Finally, Section 7 offers some concluding remarks.
2 Background
Let y1, y2, …, yn be an i.i.d. random sample from a distribution G with density g, modeled by the parametric family Ψθ = {Fθ: θ ∈ Θ ⊂ Rd; d ≥ 1} with corresponding densities fθ. The maximum likelihood estimator (MLE) of θ is obtained by maximizing the likelihood L(θ) = ∏_{i=1}^{n} fθ(yi). To obtain the MLE of θ, we solve the score equation
∑_{i=1}^{n} uθ(yi) = 0,   (1)
where uθ(y) = ∂ log fθ(y)/∂θ is the score function. In the present paper, we consider instead the solution of the weighted likelihood equation
∑_{i=1}^{n} wθ(yi) uθ(yi) = 0.   (2)
The weights wθ(yi) are constrained to lie between 0 and 1. This approach is motivated by the aim of generating estimators that are simultaneously robust and asymptotically fully efficient [23]. The technique is closely related to the idea of minimum disparity estimation proposed by [26] and [27].
[24] and [25] considered the same type of approach; both provide a quantification of the magnitude and sign of the Pearson residuals and are generally linked to a residual adjustment function employed in minimum disparity estimation. We consider the weighting scheme of [23], which downweights discrepant observations, with the strength of the downweighting increasing steadily with the degree of discrepancy, so that observations consistent with the model receive weights close to 1 while extreme outliers are strongly downweighted.
In the following subsections, we briefly describe the residual function and weight function.
2.1 The residual function
Let I(.) denote an indicator function. Define the empirical distribution function and the empirical survival function of the data as Fn(y) = (1/n) ∑_{i=1}^{n} I(yi ≤ y) and Sn(y) = (1/n) ∑_{i=1}^{n} I(yi ≥ y). Let Fθ(y) = P(Y ≤ y) and Sθ(y) = P(Y ≥ y) be the corresponding model quantities. The residual function τn(yi) proposed by [23] compares Fn with Fθ in the left tail and Sn with Sθ in the right tail, where the tails are determined by a suitably chosen fraction q ≤ 0.5. This tuning parameter determines the proportion of observations in either tail that will be subjected to possible downweighting. Considering the distribution function in the left tail and the survival function in the right tail highlights any mismatch between the data and the model in the respective tails, which is then treated as a case that requires downweighting.
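As a rough illustration of how such tail residuals can be computed, the sketch below compares the empirical distribution function with the model CDF in the left tail and the empirical survival function with the model survival function in the right tail. The exact residual formula of [23] is not reproduced in this section, so the ratio-minus-one form used here is an assumption for illustration only.

```python
import numpy as np
from scipy import stats

def tail_residuals(y, model_cdf, model_sf, q=0.05):
    """Illustrative Pearson-type tail residuals.

    Compares the empirical distribution function with the model CDF in the
    left tail and the empirical survival function with the model survival
    function in the right tail; observations outside both tails get residual 0.
    The ratio-minus-one form is an assumption, not necessarily the exact
    residual function of Biswas et al. (2015).
    """
    y = np.asarray(y)
    n = len(y)
    # Empirical distribution and survival functions evaluated at each y_i
    Fn = np.array([np.mean(y <= yi) for yi in y])
    Sn = np.array([np.mean(y >= yi) for yi in y])
    Ftheta = model_cdf(y)
    Stheta = model_sf(y)

    tau = np.zeros(n)
    left = Ftheta <= q           # observations in the left tail of the model
    right = Stheta <= q          # observations in the right tail of the model
    tau[left] = Fn[left] / Ftheta[left] - 1.0
    tau[right] = Sn[right] / Stheta[right] - 1.0
    return tau

# Example: standard normal working model with one gross outlier
rng = np.random.default_rng(0)
y = np.append(rng.normal(size=50), -10.0)
tau = tail_residuals(y, stats.norm.cdf, stats.norm.sf, q=0.1)
```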
2.2 The weight function
Now that we have the residual function, the next objective is to construct a suitable weight function. The weight function should have the following properties:
(a) 0 ≤ w(τn(y)) ≤ 1, with w(0) = 1; (b) w(−1) is small, preferably close to 0.
[23] define the residual adjustment function for the weights as follows: where α is a positive constant. Different forms are possible for the downweighting structure represented by the function H(.); the role of this function has been studied extensively in [24] and [25]. In this paper, we use the function defined above.
Now, the weight function according to [23] can be defined as: (3)
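To illustrate how residuals are mapped to observation weights, the following sketch uses a hypothetical exponential downweighting function. It satisfies the stated properties (w(0) = 1, values in [0, 1], stronger downweighting for more discrepant residuals), but the exact form of H(.) and of the weight function in Eq (3) from [23] is not reproduced here, so this is a placeholder under that assumption.

```python
import numpy as np

def weight_function(tau, alpha=0.05):
    """Map tail residuals to observation weights in [0, 1].

    The exponential form below is a hypothetical placeholder satisfying the
    properties listed above; it is not the exact weight function of Eq (3).
    Larger alpha means stronger downweighting of discrepant observations.
    """
    tau = np.asarray(tau, dtype=float)
    return np.exp(-alpha * tau ** 2)

# Observations consistent with the model (tau near 0) keep weight near 1,
# while grossly discrepant observations are pushed towards weight 0.
w = weight_function(tau, alpha=0.05)   # `tau` from the previous sketch
```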
3 Proposed Robust Adaptive Lasso: Penalizing the negative weighted log-likelihood
Consider the linear regression model
y = Xβ + ε,   (4)
where y is the n × 1 response vector, X is the n × p predictor matrix, β = (β1, …, βp)T is the coefficient vector, and ε = (ε1, …, εn)T is a vector of i.i.d. random variables.
We consider the following regularization problem
β̂ = argmin_β { −∑_{i=1}^{n} w(τi) li(β) + n ∑_{j=1}^{p} pλ(|βj|) },   (5)
where w(τi) is the weight function in Eq (3) evaluated at the residual of the i-th observation, li(.) is the conditional log-likelihood contribution of the i-th observation, and pλ(.) is a non-negative penalty function on [0, ∞) with regularization parameter λn ≥ 0.
The use of the weight function in Eq (3) overcomes the difficulty caused by heavy tails of the error distribution. It is non-negative, bounded above by 1 and twice differentiable with respect to τ. The loss function is convex for many models, in particular when the conditional distribution corresponding to li(.) belongs to the exponential family ([28]). The penalty function we use is the adaptive Lasso penalty pλn(|βj|) = λn |βj| / |β̃j|, where β̃ = (β̃1, …, β̃p)T is an initial estimator. In the present paper we obtain this initial estimator from ridge regression when p ≫ n, and in the usual case (n > p) we use the robust Tukey bisquare M-estimator.
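For the Gaussian linear model, the weighted adaptive Lasso objective can be solved by absorbing the observation weights and the adaptive penalty factors into a rescaled design, so that a standard Lasso solver can be reused. The sketch below is a minimal illustration of this idea under squared-error loss and the placeholder weights above; it is not the authors' exact algorithm, and for simplicity it always uses a ridge initial estimator (the paper uses a Tukey bisquare M-estimator when n > p).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def robust_adaptive_lasso(X, y, weights, lam, eps=1e-6):
    """Weighted adaptive Lasso for the Gaussian linear model (illustrative).

    Minimizes  sum_i w_i (y_i - x_i' beta)^2 / (2n) + lam * sum_j |beta_j| / |beta_init_j|
    by rescaling rows by sqrt(w_i) and columns by |beta_init_j|, then calling
    an ordinary Lasso solver. Assumes centered/standardized data (no intercept).
    """
    n, p = X.shape
    # Initial estimator: ridge regression (used by the paper when p >> n).
    beta_init = Ridge(alpha=1.0).fit(X, y).coef_
    scale = np.abs(beta_init) + eps          # adaptive factors |beta_init_j|

    sw = np.sqrt(weights)
    Xw = (X * sw[:, None]) * scale[None, :]  # absorb weights and adaptive factors
    yw = y * sw

    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(Xw, yw)
    return fit.coef_ * scale                 # transform back to the original scale
```

The back-transformation at the end recovers the coefficients on the original scale, since the Lasso is fitted to the column-rescaled design.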
3.1 Theoretical properties
For a fixed weight vector w = (w1, …, wn), the penalized weighted log-likelihood function based on n samples is
Qn(β) = Wn(β) − n ∑_{j=1}^{p} pλn(|βj|),   (6)
where Wn(β) = ∑_{i=1}^{n} wi li(β) denotes the weighted log-likelihood. We write the true coefficient vector as β0 = (β10T, β20T)T, where β10 consists of the s non-zero components and β20 consists of the remaining p − s zero components. We write the corresponding maximizer of Eq (6) as β̂ = (β̂1T, β̂2T)T.
The Fisher information matrix of the weighted likelihood is denoted by I(β), (7) where W = w wT is an (n × n) positive semi-definite weight matrix. Let I1(β10) = I11(β10, 0), where I11(β10, 0) is the leading (s × s) submatrix of I(β0) with β20 = 0. We also assume that if λn → 0, then the adaptive Lasso type estimator β̂ satisfies ‖β̂ − β0‖ = Op(n−1/2), i.e., β̂ is root-n consistent ([7]). Next we show that, when λn is chosen properly, the proposed robust estimator has the oracle property under the same regularity conditions as [7].
Theorem: Assume that √n λn → 0 and nλn → ∞. Then, under some regularity conditions, with probability tending to 1, the root-n consistent robust adaptive Lasso estimator β̂ = (β̂1T, β̂2T)T satisfies the following properties.
(i) Sparsity: β̂2 = 0.
(ii) Asymptotic normality: √n (β̂1 − β10) → N(0, I1(β10)−1) in distribution as n → ∞.
The proof of the theorem follows the same steps as in [7], under the same regularity conditions.
4 Selection of the tuning parameters
We now discuss the important issue of selecting the tuning parameters involved in the construction of the weight function and in the different regularization methods. For the weight function, increasing the value of the parameter α leads to greater downweighting. Striking a balance between the degree of robustness and efficiency requires extensive numerical study; in the present simulation study we keep α in the range (0.005, 0.05). Choosing α in a fully data-driven way appears to be a difficult problem, and solving this issue remains part of our plan for future work.
On the other hand, for the different regularization methods described in this paper, we need to choose optimal tuning parameters. Intuitively, these tuning parameters can be chosen through a variety of tools, such as cross validation (CV), generalized CV and the Bayesian information criterion (BIC).
There are well-established methods for choosing such parameters [29]. [17] and [30] used a BIC-type criterion, while [3] used 10-fold CV and showed that the resulting estimator with such tuning parameters achieves good prediction accuracy and model selection. Following this idea, we apply a 10-fold CV procedure for the selection of the tuning parameters.
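As an illustration of the 10-fold CV step, the sketch below selects the regularization parameter for the illustrative weighted adaptive Lasso of the previous section by grid search over candidate values; the grid itself is an assumed example, not a value prescribed by the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_lambda_cv(X, y, weights, lambda_grid, n_splits=10, seed=0):
    """Pick the penalty level minimizing 10-fold cross-validated prediction error.

    Reuses the illustrative robust_adaptive_lasso() sketch defined earlier.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cv_error = np.zeros(len(lambda_grid))
    for train, test in kf.split(X):
        for k, lam in enumerate(lambda_grid):
            beta = robust_adaptive_lasso(X[train], y[train], weights[train], lam)
            resid = y[test] - X[test] @ beta
            cv_error[k] += np.mean(resid ** 2) / n_splits
    return lambda_grid[int(np.argmin(cv_error))]

# Example grid of candidate penalty levels (illustrative)
lambda_grid = np.logspace(-3, 1, 30)
```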
5 Simulation studies
In this section, we present some numerical examples to illustrate the performance of the Robust Adaptive Lasso method described in Section 3 in comparison with other methods. We consider two settings in every example in terms of the pairwise correlation between predictors, i.e., r = 0.5 and 0.85. In addition, four levels of contamination of the error distribution were considered (δ = 0%, 10%, 20% and 30%), using two types of contamination, namely scale and location contamination.
The following four performance measures were calculated (a sketch of how they can be computed is given after the list):
(a) Prediction error, computed on the test data set.
(b) Bootstrap standard error, obtained using the bootstrap with B = 500 resamplings of the 1000 mean squared errors (in percentage).
(c) The average number of coefficients correctly estimated as “0”, denoted by “C”.
(d) The average number of coefficients incorrectly estimated as “0”, denoted by “I”.
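A minimal sketch of how these measures can be computed for one simulated replication, assuming the true and estimated coefficient vectors, a held-out test set and a collection of replicate mean squared errors are available:

```python
import numpy as np

def performance_measures(beta_true, beta_hat, X_test, y_test, mse_reps, B=500, seed=0):
    """Prediction error, bootstrap SE of the replicate MSEs, and the C / I counts."""
    rng = np.random.default_rng(seed)

    # (a) Prediction error on the test set
    pred_error = np.mean((y_test - X_test @ beta_hat) ** 2)

    # (b) Bootstrap standard error of the replicate mean squared errors
    boot_means = [np.mean(rng.choice(mse_reps, size=len(mse_reps), replace=True))
                  for _ in range(B)]
    boot_se = np.std(boot_means, ddof=1)

    # (c) correctly-zero and (d) incorrectly-zero coefficient counts
    zero_hat = (beta_hat == 0)
    C = np.sum(zero_hat & (beta_true == 0))   # true zeros estimated as zero
    I = np.sum(zero_hat & (beta_true != 0))   # non-zero coefficients wrongly set to zero
    return pred_error, boot_se, C, I
```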
We simulated 1000 data sets for each example from the linear regression model y = Xβ + ε. The predictors X1, …, Xp were generated from the multivariate normal distribution N(0, Σ) with Σ = (ρjk)p×p and ρjk = r|j−k|. For the distribution of the noise εi, we considered N(0, 1) with different levels of contamination from symmetric and asymmetric distributions.
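For concreteness, the following sketch generates one replication of this data-generating process, using the Scenario-I style location contamination described below as an example.

```python
import numpy as np

def simulate_dataset(n, beta, r=0.5, delta=0.1, contam_mean=-10.0, contam_sd=1.0, seed=None):
    """One simulated data set: AR(1)-correlated normal predictors and
    contaminated-normal errors (1 - delta) N(0, 1) + delta N(contam_mean, contam_sd^2)."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    Sigma = np.fromfunction(lambda j, k: r ** np.abs(j - k), (p, p))  # rho_jk = r^|j-k|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    outlier = rng.random(n) < delta
    eps = np.where(outlier,
                   rng.normal(contam_mean, contam_sd, size=n),
                   rng.normal(0.0, 1.0, size=n))
    y = X @ beta + eps
    return X, y

# Scenario-I style setup: beta = (3, 1.5, 0, 0, 2, 0, 0, 0), n = 50, 10% contamination
X, y = simulate_dataset(50, np.array([3, 1.5, 0, 0, 2, 0, 0, 0]), r=0.5, delta=0.1, seed=1)
```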
Scenario-I: We let β = (3, 1.5, 0, 0, 2, 0, 0, 0) and consider two sample sizes, n = 50 and n = 100. The error distribution is taken to be (1 − δ)N(0, 1) + δN(−10, 1), i.e., N(0, 1) contaminated by N(−10, 1) at different percentage levels δ.
Scenario-II: This scenario is the same as Scenario-I, except that the error distribution is (1 − δ)N(0, 1) + δN(0, 25).
Scenario-III: In this case we consider sample size n = 120 and covariate dimension p = 400. We set the coefficient vector β so that the first 15 components are non-zero and the remaining components are zero: β = {(3, …, 3)5, (1.5, …, 1.5)5, (2, …, 2)5, (0, …, 0)385}.
Scenario-IV: This scenario is the same as Scenario-I, except that the error distribution is an exponential(1) contaminated by an exponential(1/5).
Scenario-V: In this case, we examine the p ≫ n situation when the errors have a non-normal distribution. We use n = 120 observations and p = 400 predictors, and sample εi from the contaminated exponential distribution (1 − δ)exp(1) + δexp(1/5). The true regression coefficient vector is fixed as β = {(3, …, 3)5, (1.5, …, 1.5)5, (2, …, 2)5, (0, …, 0)385}.
Table 1 summarizes the simulation results for Scenario-I across the different levels of contamination, sample sizes, and low and high pairwise correlation between predictors. To measure the quality of the proposed technique, the median test error, bootstrap standard error and predictor selection performance were computed under each condition.
Effects of sample size
We observe that the median test error and bootstrap standard error decrease as the sample size increases from 50 to 100, in both the low and the high correlation settings. This pattern holds for all regularization methods, but the decrease for the proposed Robust adaptive Lasso is larger than for its competitors. In terms of variable selection, the number of incorrectly zeroed coefficients, denoted by “I”, decreases significantly as the sample size increases. Table 1 also shows that for n = 100 the correct selection count, denoted by “C”, improves relative to n = 50, and the proposed Robust adaptive Lasso beats all other methods in both “C” and “I”, performing just like the oracle estimator (i.e., C = 5 and I = 0).
Effects of level of contamination
Under the ideal condition (unit normal error distribution, δ = 0%), the results in Table 1 indicate that CBPR and the proposed method provide good results in terms of both prediction and variable selection; in particular, for r = 0.85 their predictor selection performance is the closest to the oracle estimator, while for n = 100 the test error of the proposed method is lower than that of all other methods.
For the 10% data contamination condition, the proposed Robust adaptive Lasso shows strong performance in terms of both prediction accuracy and variable selection; only in one cell, n = 50 and δ = 10%, is its variable selection performance not better than that of CBPR.
From Table 1 it is clear that the proposed method is superior under the 20% and 30% contamination conditions for both sample sizes and both correlation levels. Under this extreme location contamination, the proposed robust method performs just like an oracle estimator, with the smallest bootstrap standard errors and excellent predictor selection. Hence, the overall performance of the proposed Robust adaptive Lasso improves relative to the other methods as the location contamination increases.
Effects of multicollinearity
In Scenario-I we also consider two levels of correlation between predictors, r = 0.5 and 0.85. In both situations the Robust adaptive Lasso outperforms all other methods, except CBPR, which gives slightly better results, particularly in variable selection, for highly correlated, clean data (δ = 0%). As the level of contamination increases, under both low and high collinearity, the results of the proposed method become better and better. From Table 1 we can also see that the adaptive lasso of [8] gives very poor results under high correlation between predictors, especially in variable selection.
Simulation results for Scenario-II are summarized in Table 2, where we consider a scale contamination scheme in which the N(0, 1) model is contaminated by N(0, 25). The details are discussed below.
Effects of sample size
From the findings in Table 2, it is evident that the prediction errors and bootstrap standard errors of all regularization approaches blow up in the presence of scale contamination. The results show that when the sample size is doubled from 50 to 100, the test errors as well as the bootstrap standard errors decrease significantly, for fixed contamination and correlation among predictors. Among the five methods, the reduction for the proposed robust method is the largest. We also observe that the variable selection performance improves when the sample size increases, particularly for the proposed method, in terms of both “C” and “I”. In other words, the usual positive effect of sample size on prediction accuracy and standard error is present.
Effects of level of contamination
It is clear from Table 2 that the proposed robust regularization technique provides good results in terms of model error in all contamination cases. The variable selection performance of the Lasso and of the adaptive lasso of [8] is extremely poor when the contamination rate increases from 0% to 30%. The findings also indicate that the proposed method remains almost consistent in variable selection under both low and high contamination, compared to the other competitors.
Effects of multicollinearity
From Table 2 it can be seen that in both correlation settings, r = 0.5 and r = 0.85, the proposed robust method is the best overall. Interestingly, in terms of test errors and bootstrap standard errors, the RAL method is relatively better when the predictors are highly correlated, but its predictor selection under high correlation and extreme contamination (20% and 30%) is worse than under the low correlation condition. For example, with fixed (n, p) = (50, 8) and δ = 30%, the proposed robust method attains C = 4.639 and I = 0 on average when r = 0.5, but C = 4.037 and I = 0.745 when r = 0.85.
In Scenario-III the predictor dimension is larger than the number of observations, p = 400 and n = 120, but the dimension of the true model is fixed at 15. The detailed results are reported in Table 3. We did not report results for the adaptive lasso of [8] because, when p ≫ n, it is nontrivial to compute reliable initial estimates (i.e., OLS estimators) for the weights it uses. The results in Table 3 confirm the robustness of the proposed method as the level of contamination increases; its prediction error and bootstrap standard errors decrease only slowly as δ increases towards 30%.
In terms of variable selection, Table 3 shows that the other methods have a slight advantage over the proposed method in the “C” aspect, especially for low contamination and moderate correlation (r = 0.5), but under extreme contamination (δ = 30%) the proposed robust method performs better than the others. On the other hand, in the high correlation setting, the proposed technique gives satisfactory results in both “C” and “I”, except for clean data (δ = 0%).
In Scenario-IV we consider a non-normal error distribution. Table 4 presents the prediction error, bootstrap standard error and predictor selection performance when the error has a heavy-tailed exponential(1) distribution contaminated by an exponential(1/5) distribution. From Table 4 it may be seen that, as the contamination proportion approaches 30%, the prediction error of the proposed penalized regression method remains stable and does not increase as much as those of the others. In terms of predictor selection, the proposed robust method also remains better at higher levels of contamination, for fixed sample size and correlation among predictors. Additionally, Table 4 shows that for the small sample size the test errors of the different procedures are greater than in the large sample case (n = 100), for all fixed levels of contamination and correlation.
Our simulation results for Scenario-V are reported in Table 5. For the exponential error distribution, the Robust adaptive Lasso exhibits good prediction performance under both low and high correlation among predictors as the contamination approaches 30%. In terms of variable selection, we observe that the proposed method tends to dominate the other penalized regression procedures as the level of contamination increases towards 30%. These findings suggest that the proposed robust procedure, which uses the weights proposed by [23], is effective when the tails get heavier.
6 Real data application
6.1 Prostate cancer data
The data set for this subsection comes from a study by [31] and was analyzed by [29] for estimation and variable selection. It consists of 97 observations on 9 variables. We use “lcavol” as the response variable and the rest as explanatory variables. The proposed robust penalized method was applied along with four other penalized approaches (given in Table 6). The first 30 observations were used as the training data set and the rest as the testing data set to evaluate prediction ability.
The dashed entries correspond to predictors whose coefficients are estimated as “0”.
In Fig 1, the QQ-plot and boxplot show that there are three distinct outliers in the response variable; a normal model would fit the response variable well if these outliers were deleted. For the proposed approach we set the optimum tuning parameter values as q = 30 (which determines the proportion of observations in either tail subjected to possible downweighting) and α = 2.202. The weights attached to the three identified outliers are 0.0380, 0.0092 and 0.0359, respectively.
The columns of Table 6 report the coefficient estimates and test errors of the different penalized approaches together with the proposed robust method. Table 6 shows that the proposed method produces the sparsest solution, selecting two covariates, “lcp” and “lpsa”, and attains the smallest test error, 0.807.
6.2 Microarray data-riboflavin production by bacillus subtilis
Here we analyze a high-dimensional real data set (S2 File) on riboflavin production by Bacillus subtilis, previously analyzed by [32]. The continuous response variable Y measures the logarithm of the riboflavin production rate, p = 4088 is the number of covariates corresponding to the logarithms of the expression levels of genes, and there are n = 71 individuals from a genetically homogeneous sample.
Our main objective here is to test whether our method can effectively select covariates with non-zero coefficients and estimate parameters simultaneously.
From Fig 2(a), it can be observed that the frequency distribution of the response variable is somewhat positively skewed, and the boxplot in Fig 2(b) clearly shows that some outlying observations are present in the data. Since the response variable is skewed, a gamma distribution was fitted to it after checking goodness of fit with the K-S test (p-value > 0.05). For the proposed weights, we set the tuning parameter values as q = 0.5 and α = 0.312.
Percentage prediction/test errors were calculated for the five regularization techniques and compared with the test error of ridge regression (0.0829, taken as 100%). The percentage errors are shown as a bar graph in Fig 3. It can be seen from Fig 3 that the Lasso performs very poorly, with a percentage test error of 113.915%, i.e., 13.915% more than ridge regression. The percentage error of the elastic net is 85.79% (14.21% lower than ridge) and that of CBPR is 77.92% (22.08% lower). Among these methods, the largest reduction in percentage test error is achieved by the proposed penalized regression procedure, whose percentage test error is 64.29%, i.e., 35.71% lower than the ridge method.
In terms of sparsity, the numbers of non-zero estimated coefficients for the Lasso, elastic net, CBPR and RAL are 16, 32, 42 and 35, respectively, out of the 4088 predictors. Thus the proposed RAL method selects 35 covariates while attaining the minimum prediction error.
7 Conclusion
In this article we proposed a robust penalized regression method (RAL) based on a weighted log-likelihood with the adaptive Lasso penalty function. We used the weight function proposed by [23] to downweight points that are large residual outliers, which improves the effectiveness of the proposed algorithm. Four penalization methods in addition to RAL were compared: Lasso, elastic net, adaptive lasso and CBPR. The numerical simulations show that, for high percentages of contamination, RAL is more robust and outperforms the other penalization procedures in terms of prediction accuracy, bootstrapped standard errors and variable selection.
We also illustrated the proposed method in applications to real data. We considered the prostate cancer data set and a high-dimensional data set on riboflavin (vitamin B2) production by Bacillus subtilis, and evaluated the performance of the different penalized procedures based on a training/testing sample partition. The real data comparisons in Section 6 demonstrate that the proposed robust procedure (RAL) improves over existing methods in both prediction and variable selection. Thus, RAL is more robust to outliers and influential observations than the other methods.
References
- 1. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996; p.267–288.
- 2. Efron B., Hastie T., Johnstone I., and Tibshirani R. Least angle regression. The Annals of Statistics. 2004; 32(2):407–499.
- 3. Zou H. and Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320.
- 4. Fu W. J. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998; 7(3):397–416.
- 5. Tutz G. and Ulbricht J. Penalized regression with correlation-based penalty. Statistics and Computing. 2009;19(3):239–253.
- 6. Wang F. L., Chan T. H., Thambiratnam D. P., Tan A. C., and Cowled C. J. Correlation-based damage detection for complicated truss bridges using multi-layer genetic algorithm. Advances in Structural Engineering. 2012;15(5):693–706.
- 7. Fan J. and Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association.2001;96(456):1348–1360.
- 8. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association.2006;101(476):1418–1429.
- 9. Candes E. and Tao T. The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics.2007;2313–2351.
- 10. Fan J. and Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010;20(1):101. pmid:21572976
- 11. Lu C., Min H., Gui J., Zhu L., and Lei Y. Face recognition via weighted sparse representation. Journal of Visual Communication and Image Representation.2013;24(2):111–116.
- 12. Gui J., Liu T., Tao D., Sun Z., and Tan T. Representative vector machines: A unified framework for classical classifiers. IEEE transactions on cybernetics. 2016;46(8):1877–1888. pmid:26285229
- 13. Gui J., Sun Z., Ji S., Tao D., and Tan T. Feature selection based on structured sparsity: A comprehensive study. IEEE transactions on neural networks and learning systems.2017;33(5):2543–2555.
- 14. Mi J.-X., Lei D., and Gui J. A novel method for recognizing face with partial occlusion via sparse representation. Optik-International Journal for Light and Electron Optics.2013;124(24):6786–6789.
- 15. Gui J., Sun Z., Hou G., and Tan T. An optimal set of code words and correntropy for rotated least squares regression. In Biometrics (IJCB), 2014 IEEE International Joint Conference on. 1–6.
- 16. Li Y. and Zhu J. L1-norm quantile regression. Journal of Computational and Graphical Statistics.2008;17(1):163–185.
- 17. Wang H., Li G., and Jiang G. Robust regression shrinkage and consistent variable selection through the lad-lasso. Journal of Business & Economic Statistics.2007;25(3):347–355.
- 18. Wu Y. and Liu Y. Variable selection in quantile regression. Statistica Sinica.2009;45(4): 801–817.
- 19. Zou H. and Yuan M. Composite quantile regression and the oracle model selection theory. The Annals of Statistics.2008;11(8):1108–1126.
- 20. Bradic J., Fan J., and Wang W. Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology).2011;73(3):325–349.
- 21. Belloni A., and Chernozhukov V. L1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics. 2011;39(1):82–130.
- 22. Fan J., Li Q., and Wang Y. Robust estimation of high-dimensional mean regression. arXiv preprint arXiv.2014;:1410–2150.
- 23. Biswas A., Roy T., Majumder S., and Basu A. A new weighted likelihood approach. Stat,2015;4(1):97–107.
- 24. Agostinelli C. and Markatou M. A one-step robust estimator for regression based on the weighted likelihood reweighting scheme. Statistics & Probability Letters.1998:37(4):341–350.
- 25. Markatou M., Basu A., and Lindsay B. G. Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association.1998;93(442):740–750.
- 26. Basu A. and Sarkar S. The trade-off between robustness and efficiency and the effect of model smoothing in minimum disparity inference. Journal of Statistical Computation and Simulation. 1994b;50(34):173–185.
- 27. Basu A. and Sarkar S. Minimum disparity estimation in the errors-in-variables model. Statistics & Probability Letters. 1994a;20(1):69–73.
- 28. McCullagh P., and Nelder J. Generalized linear models. CRC Press;1989.
- 29. Friedman J., Hastie T., and Tibshirani R. The elements of statistical learning, volume 1. Springer Series in Statistics. Springer, Berlin;2001.
- 30. Wang H., Li B., and Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology).2009;71(3):671–683.
- 31. Stamey Thomas A., Kabalin John N., McNeal John E., et al. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. radical prostatectomy treated patients. The Journal of Urology.1989;141(5):1076–1083. pmid:2468795
- 32. Buhlmann P. and Van De Geer S. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media;2011.