Modified Liu estimators in the linear regression model: An application to Tobacco data

Background: The problem of multicollinearity in multiple linear regression models arises when the predictor variables are correlated with each other. The variance of the ordinary least squares estimator becomes unstable in such a situation. To mitigate the problem of multicollinearity, Liu regression is widely used as a biased method of estimation with shrinkage parameter 'd'. The optimal value of the shrinkage parameter plays a vital role in the bias-variance trade-off. Limitation: Several estimators of the shrinkage parameter are available in the literature, but the existing estimators do not achieve a small mean squared error when multicollinearity is high or severe. Methodology: In this paper, some new estimators for the shrinkage parameter are proposed. The proposed estimators form a class of estimators based on quantiles of the regression coefficients. The performance of the new estimators is compared with that of the existing estimators through Monte Carlo simulation. Mean squared error and mean absolute error are used as the evaluation criteria. The Tobacco dataset is used as an application to illustrate the benefits of the new estimators and to support the simulation results. Findings: The new estimators outperform the existing estimators in most of the considered scenarios, including high and severe cases of multicollinearity. The 95% mean prediction interval of all the estimators is also computed for the Tobacco data. The new estimators give the best mean prediction interval among all the estimators. Implications of the findings: We recommend the new estimators to practitioners when high to severe multicollinearity exists among the predictor variables.



Introduction
The ordinary least squares (OLS) method of estimation is commonly used in linear regression models. When multicollinearity exists among the predictor variables, the results obtained by OLS can be misleading [1]. Ridge regression (RR) and Liu regression (LR), suggested by [2,3] respectively, are the two commonly used methods to mitigate this problem. LR is usually preferred over RR because the LR estimator is a linear function of its shrinkage parameter d [4]. The optimal value of the shrinkage parameter d in LR plays an important role in minimizing the variance. Many researchers have suggested LR estimators for estimating d, among them [4][5][6] and, very recently, [7][8][9]. The existing estimators perform well only when multicollinearity is not very high. In the case of very high or severe multicollinearity, the existing estimators do not perform well in terms of mean squared error (MSE) and mean absolute error (MAE). To overcome this problem, it was necessary to develop some new estimators.
Therefore, the objective of this paper is to propose some new estimators that are robust to very high to severe levels of multicollinearity. In this paper, the performance of some existing LR estimators is investigated and some new estimators for the shrinkage parameter d are proposed. The new estimators are compared with the existing ones through a Monte Carlo simulation based on the MSE and MAE performance criteria. The MSE and MAE of the new estimators are smaller than those of OLS and the existing LR estimators in most of the considered scenarios.
The rest of the article is organized as follows. The statistical methodology, which includes the model estimation and the existing and newly proposed LR estimators, is discussed in Section 2. The simulation design and results are discussed in Section 3. Section 4 presents the empirical application to demonstrate the benefits of the new estimators. The conclusion of the paper is given in Section 5.

Statistical methodology
Consider the following multiple linear regression model in matrix form:

$$y = X\beta + \varepsilon, \tag{1}$$

where y is the (n × 1) vector of the response variable, X is the fixed (n × p) design matrix of predictor variables and β is the (p × 1) vector of population regression coefficients. ε is the (n × 1) vector of random errors, distributed as normal with mean E(ε) = 0 and variance-covariance matrix $E(\varepsilon\varepsilon') = \sigma^2 I_n$, where $I_n$ is the (n × n) identity matrix. The OLS estimator of β is given below:

$$\hat{\beta} = (X'X)^{-1}X'y.$$

$\hat{\beta}$ is the best linear unbiased estimator when the assumptions of the classical linear regression model are satisfied [1]. However, in the presence of multicollinearity, the OLS estimator becomes inefficient and has a large variance [4]. To circumvent this situation, numerous biased estimation methods are available that provide a smaller MSE than OLS, and LR is one of them. The LR estimator defined in [2] is given below:

$$\hat{\beta}_{LIU} = (X'X + I_p)^{-1}(X'X + dI_p)\hat{\beta},$$

where d is the shrinkage parameter. In the presence of multicollinearity, $\hat{\beta}_{LIU}$ provides a smaller MSE than OLS [10]. The optimal choice of the shrinkage parameter d plays a vital role in minimizing the MSE of $\hat{\beta}_{LIU}$. Some existing LR estimators for the shrinkage parameter d are given in the following sub-section.
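The two estimators above can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names, the toy data and the value d = 0.5 are assumptions made for the example, and the Liu estimator is taken in its usual form $(X'X + I)^{-1}(X'X + dI)\hat{\beta}$.

```python
import numpy as np


def ols_estimator(X, y):
    """OLS estimate (X'X)^{-1} X'y, solved without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def liu_estimator(X, y, d):
    """Liu estimate (X'X + I)^{-1} (X'X + d I) beta_ols for a given d."""
    p = X.shape[1]
    xtx = X.T @ X
    beta_ols = ols_estimator(X, y)
    return np.linalg.solve(xtx + np.eye(p), (xtx + d * np.eye(p)) @ beta_ols)


# Toy data (illustrative only): two nearly collinear predictors.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
X = np.column_stack([x1, x1 + 0.01 * rng.standard_normal(50)])
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(50)

beta_ols = ols_estimator(X, y)
beta_liu = liu_estimator(X, y, d=0.5)  # d = 0.5 is an arbitrary choice
print("OLS:", beta_ols)
print("Liu:", beta_liu)
```

Setting d = 1 recovers the OLS estimate, while values of d below one shrink the coefficients, which is why the choice of d matters.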

Some existing LR estimators
Consider the canonical form of model (1):

$$y = Z\alpha + \varepsilon,$$

where Z = XD and $\alpha = D'\beta$. Here D is the orthogonal matrix whose columns are the eigenvectors of X'X, so that $Z'Z = D'X'XD = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ consists of the eigenvalues of the X'X matrix. Note that $MSE(\hat{\alpha}) = MSE(\hat{\beta})$, so it suffices to consider the canonical form only. The OLS estimator can be defined in canonical form as follows:

$$\hat{\alpha} = \Lambda^{-1}Z'y.$$

The LR estimator is defined as:

$$\hat{\alpha}_{LIU} = (\Lambda + I_p)^{-1}(\Lambda + dI_p)\hat{\alpha}.$$

The first estimator for d was suggested by [2]. It is defined in terms of $\hat{\alpha}_j$, the j-th element of $\hat{\alpha}$ (the OLS estimator of α), $\hat{\sigma}^2$, the unbiased estimator of the population error variance σ², and $\lambda_j$, the j-th eigenvalue of the matrix X'X. Liu in [2] also suggested a second estimator. Shukur et al. [6] considered the idea of [5,11] and suggested three estimators, and later suggested further estimators. Based on the work of [4,7], we propose three new LR estimators in the section to follow.
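The canonical transformation can be checked numerically. The sketch below is illustrative only: the simulated data, the coefficient vector and d = 0.5 are assumptions. It takes D from the eigen-decomposition of X'X and verifies that the canonical-form estimators map back to the original coordinates through β = Dα.

```python
import numpy as np

# Illustrative data (assumed for this example only).
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.standard_normal(40)

# Eigen-decomposition of X'X: the columns of D are orthonormal eigenvectors.
lam, D = np.linalg.eigh(X.T @ X)
Z = X @ D  # canonical design matrix, so that Z'Z = diag(lam)

alpha_ols = (Z.T @ y) / lam  # OLS in canonical form
d = 0.5  # arbitrary illustrative shrinkage value
alpha_liu = (lam + d) / (lam + 1.0) * alpha_ols  # Liu in canonical form

# Map back to the original coordinates: beta = D alpha.
beta_ols = D @ alpha_ols
beta_liu = D @ alpha_liu
print("Z'Z is diagonal:", np.allclose(Z.T @ Z, np.diag(lam)))
```

Because Λ is diagonal, the Liu estimator acts on each canonical coefficient separately, scaling the j-th one by (λ_j + d)/(λ_j + 1).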

Proposed method
Following the idea of [4,7], we propose a new estimator, $\hat{d}_\gamma$, where γ is the quantile probability. To obtain the minimum MSE and MAE, the new estimator $\hat{d}_\gamma$ depends on the quantile probability, whose value is selected according to the level of multicollinearity [7]. Since the shrinkage parameter must lie between zero and one, we rewrite the proposed estimator in a bounded form, Eq (17). Eq (17) satisfies the interval condition for the shrinkage parameter d suggested by [2]. To present the role of the quantile probability, we choose some specific values for γ: 0 (minimum), 0.25 (first quartile) and 0.50 (median). These choices give the three new LR estimators. The procedure for generating and analyzing data is given in the next section.

The design of an experiment
This section describes the Monte Carlo simulation experiment, a commonly used procedure in the literature for data generation and analysis. The performance evaluation criteria and the results are also discussed in this section.

The Monte Carlo simulation
In this section, the performance of LR estimators is compared through extensive simulations.
Following [12], the predictor variables are generated as:

$$x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho z_{i,p+1}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p,$$

where ρ is the degree or level of multicollinearity between the predictor variables, set to 0.90, 0.99, 0.999 and 0.9999, and $z_{ij}$ are random numbers obtained from the standard normal distribution. The n observations on the response variable are computed as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \ldots, n,$$

where $\varepsilon_i \sim N(0, \sigma^2)$ and σ² is the error variance. $\beta_0$ is taken to be identically zero. Following [11], the eigenvector corresponding to the maximum eigenvalue of the X'X matrix is taken as the vector of regression coefficients. Following [4][5][6], the factors we choose to vary in our study are given below:
Error variance: σ² = 0.5, 1, 2
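The data-generating process described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the function name `generate_data`, the values n = 100 and p = 4, and the choice to regenerate X on every call are all hypothetical.

```python
import numpy as np


def generate_data(n, p, rho, sigma2, rng):
    """Generate collinear predictors and a response with zero intercept."""
    z = rng.standard_normal((n, p + 1))
    # x_ij = sqrt(1 - rho^2) * z_ij + rho * z_{i,p+1}
    X = np.sqrt(1.0 - rho**2) * z[:, :p] + rho * z[:, [p]]
    # beta: eigenvector of X'X for the largest eigenvalue (eigh sorts ascending).
    _, vecs = np.linalg.eigh(X.T @ X)
    beta = vecs[:, -1]
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = X @ beta + eps  # beta_0 = 0
    return X, y, beta


rng = np.random.default_rng(0)
# n and p below are assumed values chosen only for illustration.
X, y, beta = generate_data(n=100, p=4, rho=0.99, sigma2=1.0, rng=rng)
print(np.round(np.corrcoef(X, rowvar=False), 3))
```

The shared component $z_{i,p+1}$ enters every predictor, so the pairwise correlations approach one as ρ approaches one.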

Performance evaluation criteria
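Following [4], the MSE and MAE criteria are used to judge the performance of the different LR estimators. The estimated MSE (EMSE) and MAE (EMAE) are defined as:

$$EMSE(\hat{\beta}) = \frac{1}{M}\sum_{i=1}^{M} (\hat{\beta}_i - \beta)'(\hat{\beta}_i - \beta), \qquad EMAE(\hat{\beta}) = \frac{1}{M}\sum_{i=1}^{M} \lvert \hat{\beta}_i - \beta \rvert,$$

where $\hat{\beta}_i$ is the estimated value of β in the i-th replication and M is the number of simulation runs. In this study we choose M = 5000. The EMSE simulation results are presented in Tables 1-3 and Fig 1, and the EMAE results in Tables 4-6. The results are discussed in the section to follow.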
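The two criteria can be computed from the replicated estimates as in the sketch below. It assumes the estimates are stored as an (M × p) array and that the absolute error is summed over the p coefficients. The function name and the tiny input are hypothetical.

```python
import numpy as np


def emse_emae(estimates, beta):
    """Estimated MSE and MAE from an (M, p) array of replicated estimates."""
    err = estimates - beta
    emse = np.mean(np.sum(err**2, axis=1))       # mean squared error norm
    emae = np.mean(np.sum(np.abs(err), axis=1))  # mean absolute error (assumed summed over j)
    return emse, emae


# Tiny hand-checkable example with M = 2 replications and p = 2 coefficients.
estimates = np.array([[1.0, 2.0], [3.0, 4.0]])
beta = np.array([1.0, 1.0])
emse, emae = emse_emae(estimates, beta)
print(emse, emae)  # 7.0 3.0
```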

Results and discussion
The EMSE and EMAE values of the new and existing LR estimators are presented in Tables 1-6 and Fig 1. The performance of the LR estimators is evaluated with respect to different factors: multicollinearity, error variance, sample size and the number of predictor variables. These factors affect the simulation design [8]. The effect of each factor on the EMSE and EMAE of the estimators is discussed below.
Multicollinearity: An increase in the level of multicollinearity increases the EMSE and EMAE of all the estimators. The performance of the OLS estimator deteriorates when multicollinearity becomes very high. LR estimators outperform OLS at all levels of multicollinearity. Among the LR estimators, the EMSE and EMAE of the proposed estimators D8-D10 are generally smaller than those of the existing estimators. The estimator D5 remains a close competitor to the proposed estimators only in the case of mild to high multicollinearity; in the case of high to severe multicollinearity, the proposed estimators perform best. Fig 1 also supports the proposed estimators.
Sample size: An increase in the sample size generally decreases the EMSE and EMAE of all the estimators. However, varying the sample size does not alter the superior performance of the proposed estimators observed in the case of multicollinearity.
Predictors: When the number of predictors increases, the EMSE and EMAE of all the estimators increase. However, the performance pattern of the estimators remains the same as in the cases of multicollinearity and sample size. The tables also show that the increase in the EMSE of the OLS estimator is relatively larger than that of all the LR estimators. LR with the Liu parameter D8 exhibits the lowest EMSE and EMAE.
Error variance or standard deviation: The EMSE and EMAE of the estimators increase as the error variance increases. However, the proposed estimators still perform better than the existing estimators.
In summary, Tables 1-6 show that the new LR estimators D8-D10 perform more efficiently than the existing LR estimators, particularly in the case of high to severe multicollinearity. The new estimators also outperform the OLS estimator substantially. Therefore, it is concluded that the new estimators D8-D10 perform best in terms of EMSE and EMAE. Among the new estimators, D8 is the most efficient and is the best choice for practitioners in the presence of high and severe multicollinearity.

Applications
In the previous section, the performance of the estimators was evaluated through a Monte Carlo simulation experiment in which some ideal conditions are assumed. In contrast to the simulation study, this section considers a numerical example, the Tobacco dataset taken from [13], to evaluate the estimators on a real-world problem.

Tobacco data
The numerical example used in this study is the Tobacco dataset taken from [13], which is used to compare the performance of the new estimators in an applied scenario. This dataset has already been used in the literature; see, e.g., [14]. It consists of 30 observations of tobacco blends. The percentage concentrations of four important components are taken as the predictor variables, and the amount of heat given off by the tobacco during the smoking process is the response variable.
The model for this dataset is defined as:

$$y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon.$$

The condition number (CN) is used to measure the severity of multicollinearity among the predictor variables [15] and is given as:

$$CN = \sqrt{\lambda_{\max} / \lambda_{\min}},$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of the matrix X'X respectively. Following [1], a rule of thumb is that multicollinearity is moderate if the CN is between 10 and 30, high if it is between 30 and 100 and severe if it is greater than 100. The CN for this dataset is 43.50096, which shows that high multicollinearity exists among the predictor variables. The Shapiro-Wilk (W) test is used to test the normality of the response variable. We obtain the test statistic W = 0.91248 with P-value = 0.06719, so normality of the response variable is not rejected at the 5% level of significance. The MSE of the OLS and Liu estimators is computed following [2]. The estimated values of d, the regression coefficients and the MSE of the estimators are presented in Table 7. This table shows that the LR estimators have a smaller MSE than OLS. Among the LR estimators, the new estimator D8 performs best and is therefore the most efficient.
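The condition number and the MSE comparison can be sketched as follows. The MSE functions assume the usual canonical-form expressions for the OLS and Liu estimators (variance plus squared bias); the function names and the small inputs are hypothetical and are not the Tobacco data.

```python
import numpy as np


def condition_number(X):
    """CN = sqrt(lambda_max / lambda_min) of X'X."""
    lam = np.linalg.eigvalsh(X.T @ X)  # eigenvalues in ascending order
    return np.sqrt(lam[-1] / lam[0])


def mse_ols(lam, sigma2):
    """MSE of OLS: sigma^2 * sum(1 / lambda_j)."""
    return sigma2 * np.sum(1.0 / lam)


def mse_liu(lam, alpha, sigma2, d):
    """MSE of the Liu estimator (assumed standard form): variance + squared bias."""
    variance = sigma2 * np.sum((lam + d) ** 2 / (lam * (lam + 1.0) ** 2))
    bias_sq = (d - 1.0) ** 2 * np.sum(alpha**2 / (lam + 1.0) ** 2)
    return variance + bias_sq


# Hypothetical inputs chosen only to exercise the functions.
X_demo = np.diag([2.0, 1.0])
lam_demo = np.array([0.01, 5.0])
alpha_demo = np.array([0.3, 0.7])
print(condition_number(X_demo))                     # 2.0
print(mse_ols(lam_demo, 1.0))                       # MSE of OLS
print(mse_liu(lam_demo, alpha_demo, 1.0, d=0.5))    # MSE of Liu with d = 0.5
```

At d = 1 the Liu MSE reduces to the OLS MSE, so any gain comes from choosing d below one when some eigenvalues are small.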

Prediction interval
In this section, the 95% mean prediction interval of all the estimators is computed for the Tobacco dataset. We consider the following values of the predictor variables: $X_0' = (X_{10}, X_{20}, X_{30}, X_{40}) = (20.6, 10.9, 33.62, 39.76)$. The 100(1-α)% mean prediction interval for the response variable is built around the predicted values $\hat{y}_0 = \hat{\beta}_1 X_{10} + \hat{\beta}_2 X_{20} + \hat{\beta}_3 X_{30} + \hat{\beta}_4 X_{40}$ and $\tilde{y}_0 = \hat{\beta}_{LIU,1} X_{10} + \hat{\beta}_{LIU,2} X_{20} + \hat{\beta}_{LIU,3} X_{30} + \hat{\beta}_{LIU,4} X_{40}$, where $\hat{\beta}$ and $\hat{\beta}_{LIU}$ are the OLS and Liu estimators respectively, and $t_{1-\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the Student's t-distribution. For details see [1,4]. The results for the 95% mean prediction interval are given in Table 8. From this table, we see that the new estimator D8 gives the best mean prediction interval among all the estimators.

Concluding remarks
In this paper, some new quantile-based LR estimators for the shrinkage parameter 'd' are proposed in order to minimize the variance and mitigate the problem of multicollinearity. A Monte Carlo simulation experiment was performed to compare the performance of the estimators, using MSE and MAE as performance measures. Multicollinearity, sample size, the number of predictor variables and error variance were the factors varied in our study. It is concluded that all the LR estimators generally perform better than the OLS estimator. Furthermore, among the LR estimators, the new estimators showed the best performance in the simulation and the application. LR is a more robust choice than OLS when the problem of multicollinearity is present. Moreover, among the new estimators, D8 performs better than the other considered estimators in many evaluated instances, particularly when multicollinearity is very high or severe. Estimators D5 and D9 were the close competitors to D8. Therefore, we recommend the use of the LR method with the new shrinkage estimator D8 over OLS when multicollinearity is present in the data.

Future research
In this research, only the problem of multicollinearity is considered. When outliers are also present in the data, the performance of the estimators will change. In future, new robust LR estimators can be developed to overcome the joint problem of multicollinearity and outliers.