Variable selection in multivariate multiple regression

Introduction In many practical situations, we are interested in the effect of covariates on correlated multiple responses. In this paper, we focus on estimation and variable selection in multi-response multiple regression models. Correlation among the response variables must be modeled for valid inference. Method We used an extension of the generalized estimating equation (GEE) methodology to simultaneously analyze binary, count, and continuous outcomes with nonlinear functions. Variable selection plays an important role in modeling correlated responses because of the large number of model parameters that must be estimated. We propose a penalized-likelihood approach based on the extended GEEs for simultaneous parameter estimation and variable selection. Results and conclusions We conducted a series of Monte Carlo simulations to investigate the performance of our method, considering different sample sizes and numbers of response variables. The results showed that our method works well compared to treating the responses as uncorrelated. We recommend using an unstructured correlation model with the Bayesian information criterion (BIC) to select the tuning parameters. We demonstrated our method using data from a concrete slump test.


Introduction
Multivariate multiple regression analysis is often used to assess covariate effects when one or multiple response variables are collected in observational or experimental studies. Many  The analysis of multivariate outcomes is more difficult when there are multiple types of outcomes. These occur frequently in the investigation of, e.g., dose-response experiments in toxicology [10,11], birth defects in teratology [12], and pain in public health research [12,13]. The methodologies used for mixed outcomes include factorization-based approaches on extensions of the general location model [14,15]. However, these approaches depend on parametric distributional assumptions. Approaches based on latent variables include [16] and [17]. Modified generalized estimating equations (GEEs) [18] have been used to model longitudinal data; these approaches are of great interest because of their simplicity.
The GEE approach of [18] provides flexible modeling of multivariate observations based on a quasi-likelihood (QL) approach. In QL modeling, one assumes the existence of the first two moments of the responses of interest. It extends the GEE methodology to simultaneously analyze binary, count, and continuous outcomes with nonlinear models that incorporate the intra-subject correlation. The method uses a working correlation matrix. The incorporation of the intra-subject correlation makes this approach attractive. However, when we apply a joint model for all responses, many regression parameters must be estimated, and some have little or no influence on the responses. Large models can be difficult to interpret, so variable selection for multi-response modeling is of great interest.
We first systematically study the GEE approach in a cross-sectional set-up with multiple responses [11,19]. Simultaneous parameter estimation and variable selection [20] has been used in many areas, including longitudinal data analysis [21]. We have extended this method to multivariate multiple regression using a penalized GEE methodology. We use the Bayesian information criterion (BIC) and generalized cross validation (GCV) to find the tuning parameters. Our simulation studies show that our methodology performs well.
The remainder of the paper is organized as follows. In the next section, we review the GEE for multiple responses and introduce our penalized GEE and the computational procedures. We discuss the distributional properties of the estimates and presents the simulation studies, subsequently and provides concluding remarks in the last section.

GEE for multiple outcomes
We now discuss the GEE model based on the marginal distributions of the response for the analysis of longitudinal data. In a cross-sectional study with multiple responses, [12] used the GEE approach to estimate the parameters. Let the observations ðy m i ; x m i Þ denote the response and covariate respectively for the mth response (m = 1, 2, . . ., M i ) measured on subject i = 1, . . ., n. The QL approach requires us to specify the first two moments of the data (y m i ). We define For each subject i, let D i be an M i × p full-rank derivative diagonal block matrix be the M i × M i working covariance matrix of y i . Here, is an M i × M i working correlation matrix parameterized with the parameter vector α. The GEE estimatorβ is asymptotically consistent as n goes to infinity.

Penalized GEE
To perform parameter estimation and variable selection simultaneously in the presence of mixed discrete and continuous outcomes, we propose a penalized version of the extended GEEs [12,19]. Penalized likelihood methods such as LASSO [22] and SCAD [20] have been successful both theoretically and in practice. All the variables are considered at the same time, which may lead to a better global submodel. The penalized GEE has the feature that the consistency of the model holds even if the working correlation is misspecified. However, to improve the statistical efficiency of the coefficient, we recommend a covariance matrix based on the estimate of the unstructured working correlation. The regression coefficients β can be estimated by solving the penalized GEEs defined by where P 0 l ðβÞ ¼ @P l ðβÞ=@ β is the vector derivative of the penalty function P λ (β) with λ being the vector of tuning parameters.
Although different penalty functions can be adopted, we consider only LASSO and SCAD. The former has the sparsity property, and the latter simultaneously achieves the three desirable properties of an ideal penalty: sparsity, unbiasedness, and continuity [20]. The LASSO penalty defined as P λ (β) = ||β|| as per [22], where as per [20], the derivative of the SCAD penalty is defined as for some a > 2 and b > 0: where a and λ are tuning parameters. Computational algorithm. To computeβ, we use the local quadratic approximation (LQA) algorithm [20]. With the aid of the LQA, the optimization of (2) can be carried out using a modified Newton-Raphson (MNR) algorithm. The estimate ofβ at the (r + 1)th iteration isβ rþ1 ¼β r À ( @Sðβ r Þ @β À nS l ðβ r Þ where S l ðb r Þ ¼ diagðP 0 l ðjb 1r jÞ=jb 1r j; . . . ; P 0 l ðjb pr jÞ=jb pr jÞ; @Sðβ r Þ @β ¼ À Given a tuning parameter λ, we repeat the above algorithm to updateβ r until we achieve convergence. Correlation structure. Many researchers (e.g., [23][24][25]) have shown that an incorrectly specified correlation structure reduces the estimation efficiency. Thus, we suggest using an unstructured correlation structure R u (α) to estimate each variance and covariance uniquely. This structure can be estimated using a residual-based moment method. Let d VðaÞ ¼Â 1=2 diagðR u ; . . . ;R u ÞÂ 1=2 be the unstructured covariance matrix estimate. We havê

Tuning parameter selection
We set a = 3.7 for SCAD penalty as per [20]. Thus, we tune λ for both LASSO and SCAD. We define the GCV [26] criterion via and the BIC (see [27] and [6]) via where D is the deviance of the model and df(λ) = tr{X(X T X + nS λ ) −1 X T }. We choose the tuning parameter λ that minimizes GCV(λ) and BIC(λ).

Properties of estimates
Let b ¼ ðβ A ; β N Þ be the true vector of the regression coefficients. Under some necessary regularity conditions [28,29] for sufficiently large n, the parameter estimates of the penalized GEE with the LASSO (λ = O p (n −1/2 )) and SCAD (λ = o p (1)) penalties are consistent and asymptotically normal, i.e., ffi ffi ffi

Performance analysis
We conducted a series of simulation studies to investigate the performance of our variable selection approach on continuous, binary, and count response outcomes using the LASSO and SCAD penalty functions. The simulations were conducted using the R software. For faster optimization of the tuning parameter λ, we used the warm-starting principle, where the initial value of β is replaced byβ ðlþdlÞ for the MNR algorithm. We select the model with minimum BIC(λ) or GCV(λ). We assess the model performance using the model error (ME) [20] as well as the standard error and the correct and incorrect deletions. The ME is due to the lack of fit of an underlying model and is denoted by MEðbÞ. Its size reflects how well the model fits the data: The ME has been expressed as the median relative model error (MRME). The relative model error is defined via where ME full is the ME calculated by fitting the data with the full model. The correct deletions are the average number of true zero coefficients correctly estimated as zero, and the incorrect deletions are the average number of true nonzero coefficients erroneously set to zero. In the tables, the estimated values for correct and incorrect deletions are reported in the columns "Correct" and "Incorrect". For comparison purposes, we estimated the covariance matrix of the response variables based on both the unstructured working correlation (UWC) and the independent working correlation (IWC). We simulated 1000 data sets consisting of n = 50 and n = 100 observations from the response model with i = 1, 2, . . . n subjects and j = 1, 2, . . ., m responses. For binary outcomes we use a logit link; for count outcomes we use a log link; and for continuous (normal) outcomes we use the identity link function. We generated the covariates X ij from the multivariate normal distribution with marginal mean 0, marginal variance 1, and AR(1) correlation with ρ x = 0.5. For the simulations, we considered the following three cases of continuous, binary, and count response outcomes with different β values and correlation ρ y between the responses and with s 2 y ¼ 1. Case 1: Three correlated cormal responses. We consider correlated normal responses (m = 3) with AR(1) true correlation. We set ρ y = 0.7 and consider two covariates (k = 2) with β = (β (1) , β (2) , β (3) ) = ((3, 1.5), (0, 0), (2, 0)). The simulation results are summarized in Table 1 for IWC and Table 2 for UWC. The tables show that the nonzero estimates of both SCAD and LASSO are close to the true values, i.e., b ð1Þ However, the standard errors of the estimates in Table 2 are lower, which can be attributed to the correlation between the responses. For both n = 50 and n = 100, the mean ME and its standard error are smaller for SCAD than LASSO. The average number of zero coefficients increases as n increases in Table 2, especially for SCAD. This indicates that SCAD performs better than LASSO.
Case 2: Two correlated normal responses and one independent binary response. We consider three outcomes (m = 3): two continuous and one binary. The continuous outcomes were generated from a normal distribution and were correlated with AR(1) true correlation. We set ρ y = 0.7 and consider the binary outcome from an independent binary observation and two covariates (k = 2) with β = (β (1) , β (2) , β (3) ) = ((3, 1.5), (0, 0), (2, 0)). The simulation results are summarized in Tables 3 and 4. The tables show that the nonzero estimates for IWC are similar to those for UWC. However, because of the large correlation (0.7) between the continuous responses, the standard errors of b ð1Þ

PLOS ONE
Variable selection in multivariate multiple regression average number of zero coefficients is higher for UWC than for IWC. As the SCAD sample size increases, the mean ME and its standard error decrease for both GCV and BIC. The LASSO estimates for b ð1Þ 3 are not close to the true value, but the SCAD estimates of the nonzero coefficients are all close to the true values. Thus, SCAD performs better than LASSO.
Case 3: Two correlated normal responses and one binary response. We consider three outcomes (m = 3): two continuous and one binary. They are generated using an unstructured correlation structure with the parameters ρ 12 = 0.3, ρ 13 = 0.4, and ρ 23 = 0.6, and we consider two covariates (k = 2) with β = (β (1) , β (2) , β (3) ) = ((3, 1.5), (0, 0), (2/3, 0)). We set the β values for the binary outcome smaller than before to avoid numerical instability. The correlated normal and binary outcomes were generated in R using the BinNor package [30] for generating multiple binary and normal variables simultaneously given marginal characteristics and association structure; it is based on the methodology of [31]. The simulation results are summarized in Tables 5 and 6. The tables show that if the sample size is increased, the mean ME and its standard error are reduced. Again, the standard errors of the nonzero parameter estimates are lower for UWC than IWC. The average numbers of zero coefficients using SCAD with BIC for all sample sizes are close to the target value of three, and for SCAD with GCV the nonzero estimated coefficients are close to the true values for n = 50 and n = 100.
Overall, Tables 1 to 6 show that the nonzero estimates are unbiased regardless of the correlation structure. However, the unstructured correlation resulted in lower standard errors compared to the estimates based on an independent working correlation. The average number of zero coefficients is higher in the unstructured case. We notice a decrease in the mean ME when the sample size increases from 50 to 100 for both LASSO and SCAD. SCAD has a smaller mean ME than LASSO in all cases. We conclude that SCAD with BIC performs well.

Case study
We now revisit the concrete slump test data set discussed in Section 1. From Fig 1, we see that slump (Y 1 ) and flow (Y 2 ) are highly correlated. We therefore used penalized GEE to perform the variable selection and parameter estimation. The resulting estimates are given in Tables 7  to 9. The second and third columns of the tables give the performance using penalized GEE with IWC for SCAD and LASSO. The fourth and fifth columns give the performance using penalized GEE with UWC. For the model selection procedures, both unweighted BIC and GCV were used to estimate the regression coefficients; their performance was similar. Therefore, we present only the results based on the unweighted BIC. Table 7 shows that SCAD with IWC selected 5 of the 7 covariates for slump (Y 1 ), whereas LASSO with IWC selected 4 covariates. The difference is that LASSO omitted fine aggregate (X 7 ). SCAD and LASSO with UWC obtained the same estimates for all the variables: they retained fine aggregate (X 7 ) but forced fly ash (X 2 ) and coarse aggregate (X 6 ) to zero. Table 8 shows that both SCAD and LASSO with IWC selected fly ash (X 2 ), water (X 4 ), and coarse aggregate (X 6 ) for flow (Y 2 ), but SCAD and LASSO with UWC selected only fly ash (X 2 ) and water (X 4 ). The standard errors of the estimates is lower with UWC. Table 9 shows that LASSO with IWC selected all the covariates except coarse aggregate (X 6 ) for CS (Y 3 ), whereas the other methods dropped coarse aggregate (X 6 ) and superplasticizer (X 5 ). Concrete slump test data with artificial binary response. For illustration purposes, we create an artificial binary response variable to indicate whether or not a specimen can sustain a heavy load before distortion. For this analysis, we consider that concrete with a compressive strength below 35 is of poor quality. We therefore convert this continuous response to a binary based on the quality. Let Y 3 = 1 if the compressive strength is above 35, and Y 3 = 0

PLOS ONE
Variable selection in multivariate multiple regression otherwise. The goal is to apply variable selection to model the correlated continuous and binary outcomes. The resulting estimates are given in Tables 10 to 12 (the columns of these  tables are the same as those for Tables 7 to 9). Table 10 shows that SCAD with IWC selected 5 of the 7 covariates for slump (Y 1 ), whereas LASSO with IWC selected 4 covariates. The difference is that LASSO omitted fine aggregate (X 7 ). These results are similar to the independent results in Table 7, which confirms the use of IWC. SCAD with UWC forced fly ash (X 2 ) to zero whereas SCAD with IWC did not. LASSO with UWC selected the same variables as LASSO with IWC. Table 11 shows that all the methods selected fly ash (X 2 ), water (X 4 ), and aggregate (X 6 ) for flow (Y 2 ). Table 12 shows that all the methods except LASSO with IWC selected 5 covariates for the binary CS (Y 3 ). The estimates obtained with UWC have lower standard errors.

Conclusion
We have considered the selection of significant variables in multivariate multiple-response regression problems. We developed an extended GEE approach to take into account the correlation among the response variables. Our approach automatically and simultaneously selects the significant variables in high-dimensional models. We also proposed an efficient algorithm to implement the method. We performed many Monte Carlo simulations to assess the performance of the method for different sample sizes. The results showed that the methodology works well, especially when the SCAD penalty function is used together with the BIC tuning criterion. The estimates of β are unbiased regardless of the choice of correlation structure. We demonstrated the approach in a case study.