Robust Adaptive Lasso method for parameter’s estimation and variable selection in high-dimensional sparse models

High dimensional data are commonly encountered in various scientific fields and pose great challenges to modern statistical analysis. To address this issue different penalized regression procedures have been introduced in the litrature, but these methods cannot cope with the problem of outliers and leverage points in the heavy tailed high dimensional data. For this purppose, a new Robust Adaptive Lasso (RAL) method is proposed which is based on pearson residuals weighting scheme. The weight function determines the compatibility of each observations and downweight it if they are inconsistent with the assumed model. It is observed that RAL estimator can correctly select the covariates with non-zero coefficients and can estimate parameters, simultaneously, not only in the presence of influential observations, but also in the presence of high multicolliearity. We also discuss the model selection oracle property and the asymptotic normality of the RAL. Simulations findings and real data examples also demonstrate the better performance of the proposed penalized regression approach.


Introduction
Variable selection plays a vital role in modern statistical modeling and machine learning, especially for models in which a large number of predictors and minimum observations, known as high dimensionality. Introducing many predictors in the regression models will result in reducing bias in model, but we wish to select a parsimonious set of important covariates for the efficient prediction. Therefore, variable selection is, then, essential to identify important variables, and produces more interpretable models with better prediction power.
The penalization methods are very useful in this field, and there are a large body of literature on this problems recently. The least absolute shrinkage and selector operator (LASSO), which was introduced by [1], is one of the key steps in coefficient estimation and predictor selection simultaneously, and was further studied by [2] and [3]. [4] proposed bridge regression that depends on L q penalty. [5] proposed correlation based penalty to encourage a grouping effect, and also does well when there is high correlation among explanatory variables. It a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

Background
Let y 1 , y 2 , . . ., y n be an i.i.d random sample from a distribution G, having the density g, is modeled by the parametric functions C θ = {F θ : θ 2 Θ & R d ; d ! 1}. The maximum likelihood estimator(MLE) of θ is obtained by maximizing the likelihood LðyÞ ¼ Q n i¼1 g y ðy i Þ. To obtain the MLE of θ, we solve the equation In the present paper, we consider the solution to the weighted likelihood equation given by The weights w θ (y i ) will be constrained to lies between 0 and 1. This approach is motivated by the aim of generating estimators that are simultaneously robust and asymptotically fully efficient [23]. This technique borders on the idea of minimum disparity estimation proposed by [26] and [27]. [24] and [25] are considered the same approach that provide a quantification of the magnitude and sign of Pearson residuals, and are generally linked to a residual adjustment function employed in minimum disparity estimation. We consider the [23] weighting scheme which is downweighting of discrepant observations, where the strength of the downweighting increases steadily with the degree of discrepancy, or extreme outliers obtain weights close to 1. In the following subsections, we briefly describe the residual function and weight function.

The residual function
Let I(.) denote an indicator function. We define F n and S n as F n ðYÞ ¼ 1 n P n i¼1 IðY y i Þ, S n ðYÞ ¼ 1 n P n i¼1 IðY ! y n Þ, these represent the empirical distribution function and survival function of the data. Let F θ (Y) = P(Y y) and S θ (Y) = P(Y ! y) be corresponding theoretical quantities. Now, the residual function τ n (y i ) proposed by [23] is given as: Where q to choose is a suitable fraction q 0.5. This tuning parameter will determine the proportion of observations on either tail that will be subjected to possible downweighting. According to the above equation, it is clear that the consideration of the distribution function in the left tail and the survival function in the right tail helps to highlight the mismatch of the data and the model in the respective tails. Therefore, treat it as a case that requires downweighting.

The weight function
Now, we have our residual function, the next objective is to construct suitable weight function. The weight function should have the following properties: (a) 0 w(τ n (y)) 1, w(0) = 1 (b) w(−1) is small, preferably close to 1.
[23] define residual adjustment function for weight as follows: HðyÞ ¼ e À ay 2 ; Where α is a positive constant. There may be different forms for the downweighting structure represented by the function H(.). The role of this function has to be extensively studied in [24] and [25]. In this paper, we have used the above defined function. Now, the weight function according to [23], can be defined as: 3 Proposed Robust Adaptive Lasso: Penalizing the negative weighted log-likelihood Consider the linear regression model Where y n×1 is the response vector, X n×p be the predictor matrix, β = (β 1 , . . ., β p ) T is the co-efficient vector, and ε = (ε 1 , . . ., ε n ) T is a vector of i.i.d random variables. We consider the following regularization problem where w(τ) is the weight function in Eq (3), l i (.) is the conditional log-likelihood, and p λ (.) is a non-negative penalty function on [0, 1] with a regularization parameter λ n ! 0. The use of weight function in Eq (3) is overcome by the difficulty of heavy tails of the error distribution. It is non-negative, bounded above 1 and twice differentiable with respect to τ. The loss function for many models is convex, if the conditional distribution like l i (.), is from a class of the exponential family ( [28]). The penalty function we will use is whereb init is an initial estimator. In the present paper we get this from ridge regression estimator when p ) n, and in usual case (n > p) we used robust tukey bisquare M-estimator as initial estimator.

Theoretical properties
Consider the constant weight vector w, the penalized weighted log-likelihood function based on n samples is, Where W n (β) represent the weighted likelihood function. We write the true parameter coefficient vector as b 0 ¼ ðb T 10 ; b T 20 Þ T , where β 10 consists of all 's' non-zero components and β 20 consists of the remaining 's-1' zero components. We write the corresponding maximizer of Eq (6) The Fisher Information matrix is Where W = w.w T is (1 × n) positive definite matrix. Let I 1 (β 10 ) = I 11 (β 10 , 0), where I 11 (β 10 , 0) is the leading (s × s) submatrix of I(β 0 ) with β 20 = 0. We also assume that if ffiffiffi n p l n ¼ O p ð1Þ, then the adaptive Lasso type estimator satisfies jjb n À b 0 jj ¼ O p n À 1 n À Á . It shows thatb n is root nconsistent if λ n 7 ! 0 ( [7]). Next we show that, when λ n is chosen properly the proposed Robust estimator has the oracle property under the same regularity conditions of [7].
Theorem: Assume that ffiffiffi n p l n 7 ! 0 and nλ n 7 ! 1. Then under the some regularity conditions, with probability tending to 1, the root n-consistent adaptive Lasso Robust estimatorb n must satisfy the following properties. (i) Sparsity;b 2n ¼ 0.

Selection of the tuning parameters
We now discuss an important issue of the tuning parameters selection regarding the construction of weight functions and in the different regularization methods. For the weight functions, increasing the value of the parameter "α" leads to greater downweighting. To make balance between the degree of robustness and efficiency, we need an extensive numerical studies. In the present simulation study we keep it in the range of (0.05, 0.005). This appears to be a difficult problem, but solving this issue remains among our plan for future work.
On the other hand, in the different regularization methods described in this paper, we need to choose an optimal tuning parameters l 0 n s. Intuitively, an optimals of these tuning parameters be chosen through a variety of tools such as cross validation (CV), generalized CV and Bayesian information criterion (BIC) approach.
There are well-established methods for choosing such parameters [29]. [17] and [30] used BIC-type criterion, and [3] using 10-fold CV technique and proved that the resulting estimator with such type of tuning parameters satisfies the good prediction accuracy and model selection. Following this idea, we apply 10-fold CV procedure for selection of tuning parameters.

Simulation studies
In this section, we introduced some numerical examples to illustrate the performance of the Robust Adaptive Lasso method described in section (3) with other methods. We set two scenarios in every example in terms of pairwise correlation between predictors, i.e, r = 0.5 and 0.85. Besides this, four levels of contamination were considered in error distribution (δ = 0%, 10%, 20% and 30%) from two types of contamination, that is, scale and location contamination.
The following four performance measures were calculated: (a) Prediction Error-computed on the test data set.
(b) Bootstrap Standard Error-by using the bootstrapp with B = 500 resamplings on the 1000 mean squared errors (in percentage).
(c) The average number of "0" estimated coefficients correctly, denoted by "C".
(d) The average number of "0" estimated coefficients incorrectly, denoted by "I".
Scenario-IV: This scenario is same as scenario-I, except that the error distribution is exponential(1) contaminated by an exponential(1/5).

Effects of sample size
We observe that, across sample sizes, the median test error and bootstrap standard error decreased with the increasing sample size from 50 to 100 in both cases of low and high correlation among predictors. This pattern of decreasing holds for all regularization methods, but the decrease in the proposed Robust adaptive Lasso is more than other competitors. On the other hand, in terms of variable selection with the increase of sample size, the ratio of incorrectly "0" selection of coefficients, denoted by "I", decreases significantly. Table 1 also shows that for, n = 100, the correct selection, denoted by "C", is improved as compared for n = 50, but the proposed Robust adaptive Lasso methods, defeated all other methods, in both cases, i.e, in "C" and "I" and performs just like oracle estimator (i.e, C = 5 and I = 0).

Effects of level of contamination
Under ideal condition (unit normal error distribution, δ = 0%), the results in Table 1 indicate that the "CBPR" and proposed method provides good results in terms of both prediction and variable selection, particularly in r = 0.85, the predictor selection performance is the nearest to oracle estimator, but in, n = 100, case, the test error of the proposed method is lower than all other methods.
For the 10% data contamination condition, the proposed Robust adaptive Lasso shows strong performance in terms of both prediction accuracy and variable selection, but in only one cell, i.e, for n = 50 and δ = 10%, the variable selection performance is not better than CBPR. From Table 1, it is clear that the proposed method shows superior performance under 20% and 30% contamination conditions for both cases of sample sizes and correlations among predictors, respectively. In this extreme location contamination, the proposed Robust method has performed just like an oracle estimator and also has minimum bootstrap standard errors with excellent predictor selection performance. Hence, the overall performance of proposed the Robust adaptive Lasso method is better than other with the increase of location contamination.

Effects of multicollinearity
In given scenario-I, we also consider two cases for correlation between predictors, i.e, r = 0.5 and 0.85. In both situations, the Robust adaptive Lasso method outperforms all other methods, except CBPR which has a little better result in case of high correlation and clean data (δ = 0%), particularly, in variable selection. But, the levels of contamination increases in both low and high collinearity. The proposed method findings become better and better. From Table 1, we can also see that the [8] gives very poor results in high correlation between predictors, especially, in variable selection.
Simulation results for scenario-II are summarized in Table 2, in which we consider scale level of contamination scheme where N(0, 1) model is contaminated by N(0, 25). The details are discussed as below.

Effects of sample size
From the findings given in Table 2, it is evident that the prediction errors and bootstrap standard errors of all regularization approaches blows up in the presence of scale contamination. The results shows that when sample size doubled from 50 to 100, the test errors and as well as bootstrap standard errors significantly decreases, for fixed contamination and correlation among predictors. Among these five methods, the reduction in the proposed Robust method is maximum. We can also observe that, the variable selection performance is also improved when the sample size increases, particularly for the proposed method, in terms of both aspects "C" and "I". Generally speaking, the usual positive effect of sample size is on prediction accuracy and standard error.

Effects of level of contamination
It is clear form Table 2, that the proposed Robust regularization technique provides good results in terms of model error in all cases of contamination. The variable selection performance of Lasso and [8] are extremely poor when the contamination rate increases from% to 30%. The findings also indicate that the proposed method remains almost consistent in variable selection under low and high contamination, as compared to other competitors.

Effects of multicollinearity
From Table 2, it can be shown that in both cases of correlation, i.e, r = 0.5 and r = 0.85, the proposed Robust method is the best one overall. But it is very interesting to see that the Robust RAL method in terms of test errors and bootstrap standard errors is relatively better when the predictors are highly correlated, but in predictor selection in high correlation and extreme contamination aspect (i.e, 20% and 30%) is worse than low correlation condition among predictors. For example, in case of fixed (n, p) = (50, 8) and δ = 30%, the proposed Robust method estimated 4.639 predictors on the average, in "C" and in aspect of "I" is 0, when r = 0.5. But on the other hand, in r = 0.85, the situation is 4.037 in aspect of "C" and 0.745 in "I", respectively.
In scenario-III, the predictors dimension is larger than sample observations, i.e, p = 400 and n = 120, but the dimension of the true model is fixed to be 15. The detailed results are depicted in Table 3. To compare the performance of the Robust Adaptive Lasso estimator, we did not report the results of the [8], because in case of, p ) n, it is nontrivial to calculate reliable initial estimates (i.e, OLS estimators) for weights used in it. From Table 3, the results (1 − δ)N(0, 1) + δN(0, 25) confirm robustness of the proposed method with increasing the level of contamination, and hence, the prediction error and bootstrap standard errors decrease slowly when δ increases towards 30%. In case of variable selection, from Table 3, it can be seen that the predictor selection results of the other methods have a little more advantage than the proposed method in the aspect of "C", especially for low level of contamination and moderate correlation, i.e, r = 0.5, but in extreme contamination condition, i.e, (δ = 30%) the performance of Robust proposed method becomes better than others. On the other hand, in high correlation set up among predictors, the proposed technique gives satisfactory results in both aspects of "C" and "I", except for clean data, i.e, (δ = 0%).

Table 2. Simulation results for the scale level of contamination in the model
In scenario-IV, we consider the non-normal error distribution. The Table 4, presents the prediction error, bootstrap standard error and predictor selection performance when the error has heavy tailed an exponential(1) distribution, contaminated by an exponential (1/5) distribution. From Table 4, it may be seen that the contamination proportion tends to the level of 30%, the prediction error of the proposed penalized regression method is stable and does not increases as compared to others. In terms of predictor selection, the proposed robust method remains better for higher level of contamination too, for fixed sample size and correlation ammong predictors. Additionally, for the small sample size Table 4 reported that the test errors for the different procedures are grater than in case of large sample size (i.e, n = 100), for all fixed cases of conatmination and correlation.
Our simulation results for scenario-V are depicted in Table 5. For the exponential error distribution, the robust adaptive Lasso exhibits good prediction performance in both low and high correlations among predictors when contamination tends to 30%. In terms of variable selection, we observe that the performance of the proposed method tends to dominate the other penalized regression procedures when the level of contamination increases towards 30%. These findigs suggest that the proposed robust procedure, by utilizing weights proposed by [23], is effective when the tails got heavier. (1 − δ)N(0, 1) + δN(−10, 1)  6 Real data application

Prostate cancer data
The data set for this subsection comes from a study by [31] and analyzed by [29] for estimation and variable selection. The data set consists of 97 observations and 9 variables. We used the Robust Adaptive Lasso method response variable which is "lcavol" and the rest are explanatory variables. The proposed Robust penalized method was applied along with four other penalized approaches (i.e, given in Table 6). The first 30 observations were used as training data set and the rest as testing data to evaluate the prediction ability. In Fig 1, the QQ-plot and boxplot shows that there are three distinct outliers in the response variable. A normal model would provide a nice fit to response variable if the outliers are deleted. We set our optimum tuning parameters values as q = 30, that is to determine the proportion of observations on either tail that will be subjected to downweighting, and "α = 2.202" for proposed approach. Beside this, the weights associated with these three identified outliers are, 0.0380, 0.0092 and 0.0359, respectively.
The columns of Table 6 present the different penalized approaches with the proposed robust method, their associated coefficient estimates and test errors. From the Table 6, we can see that the proposed method produces more sparse solution and selects two covariates, i.e, "lcp" and "lpsa", corresponding to the smallest test error, i.e, 0.807.

Microarray data-riboflavin production by bacillus subtilis
Here we analyze the high-dimensional real data set (S2 File) about riboflavin production by bacillus subtilis which was analyzed by [32]. Here the continuous response variable Y which measures the logarithm of the production rate of riboflavin, p = 4088, is the number of covariates corresponding to the logarithms of expression levels of genes and n = 71 individuals of genetically homogeneous sample. Our main objective here is to test whether our method can effectively select covariates with non-zero coefficients and estimate parameters simultaneously. From Fig 2(a), it can be observed that the frequency distribution of the response variable is somehow positively skewed and the boxplot in Fig 2(b) clearly shows that there are some outlying observations present in the data. Since, the response variable is skewed, therefore, gamma distribution was fitted on it after applying K-S test (i.e, p-value>0.05). For the purpose of the proposed weights, we set our tuning parameters values as q = 0.5 and α = 0.312.
Percentage prediction/test errors were calculated for the five regularization regression techniques, and were compared with percentage test error of the ridge regression (i.e, 0.0829). The percentage errors were shown by a bar graph in Fig 3. It can be seen from Fig 3 that Lasso performs very poorly with percentage test error which is 113.915%, and which is 13.915% more than ridge regression. The prediction error of elastic net is 14.21%, (i.e, 85.79%) lower and CBPR is 22.08%, (i.e, 77.92%) lower than ridge penalized regression. Among these methods the maximum reduction in percentage test error is observed in the proposed penalized regression procedure with percentage test error is 64.29%, which is 35.71% lower than ridge method. In terms of sparsity, the number of non-zero estimated predictors coefficients of Lasso, elastic net, CBPR and RAL are 16, 32, 42 and 35, respectively, out of the total number of 4088 predictors. Since, the proposed RAL method selects 35 covariates with minimum prediction error.

Conclusion
In this article we proposed a robust penalized regression model (RAL) using weighted log-likelihood with the adpative Lasso penalty function. We used the [23] proposed weight function to downweights the points that are large residuals outliers and improved the effectiveness of the proposed algorithm. Four penalization methods in addition to RAL, including Lasso, elastic net, adaptive lasso and CBPR were compared. The numerical simulations shows that for high percentages of contamination the RAL is more robust and outperforms the other penalization procedures interms of prediction accuracy, bootstrapped standard errors and variable selection.
We also illustrate the proposed method in an application to real data analysis. we consider the prostate cancer data set and a high-dimensional data set about riboflavin (Vatamin B 2 ) production by bacillus subtilis. We have evaluated the performance of the different penalized procedures based on training/testing sample partition. The real data comparison in section 6, demonstrate that the proposed robust procedutre (RAL) improves over existing methods in both prediction and varaible selection. Thus, RAL is more robust procedure than other methods to outliers or influential observations.