Bayesian variable selection in linear quantile mixed models for longitudinal data with application to macular degeneration

This paper presents a Bayesian analysis of linear mixed models for quantile regression based on a Cholesky decomposition of the covariance matrix of the random effects. We develop a Bayesian shrinkage approach to quantile mixed regression models using a Bayesian adaptive lasso and an extended Bayesian adaptive group lasso. We also consider variable selection procedures for both fixed and random effects in a linear quantile mixed model via the Bayesian adaptive lasso and the extended Bayesian adaptive group lasso with spike and slab priors. To improve the mixing of the Markov chains, a simple and efficient partially collapsed Gibbs sampling algorithm is developed for posterior inference. Simulation experiments and an application to the Age-Related Macular Degeneration Trial data are used to demonstrate the proposed methods.


Introduction
Quantile regression (QR) in longitudinal or panel data models has received increasing attention in recent years. For example, [1] proposed a general method to estimate the QR coefficients using an L1-penalized likelihood approach. Later, [2] generalized the work of [1] to a more flexible model with endogenous variables. [3] offered an estimator of individual effects for panel data QR models, which is widely applied by practitioners because it is computationally simple. [4] studied quantile panel models with correlated random effects. [5] proposed a Stochastic Approximation of the EM (SAEM) algorithm to analyze linear quantile mixed models (LQMMs) via the asymmetric Laplace distribution. Other related papers include [6][7][8][9], among many others.
As in all regression problems, when there are many covariates in longitudinal or panel QR models, variable selection becomes necessary to avoid overfitting and multicollinearity. [10] presented penalized quantile regression models with random intercepts using Bayesian lasso priors and Bayesian adaptive lasso priors, respectively. [11] also considered a Bayesian lasso approach to jointly estimate a vector of covariate effects and a vector of random effects by introducing an l1 penalty. However, for the random effects, their method mainly focuses on penalizing the diagonal elements of the covariance matrix of the random effects after a modified Cholesky decomposition. In addition, they neglect the nonnegativity constraint on the diagonal terms of that decomposition, which can cause non-uniqueness of the decomposition [12]. We therefore first propose a new Bayesian shrinkage approach to solve these problems in LQMMs. In particular, we develop an extended Bayesian adaptive group lasso which accommodates the nonnegativity constraint on the diagonal terms in order to identify the non-zero random effects. Unfortunately, because continuous shrinkage priors never produce estimates that are exactly zero, the shrinkage approach still requires ad hoc thresholding rules to perform variable selection, as in [10] and [11]. An alternative for Bayesian variable selection is to use spike and slab priors. We therefore develop a general variable selection technique via the Bayesian adaptive lasso and the extended Bayesian adaptive group lasso with spike and slab priors to identify the significant fixed and random effects simultaneously. The paper is organized as follows. Section 2 presents the reparameterized LQMMs based on a Cholesky decomposition of the covariance matrix of the random effects. Section 3 describes the structure of our hierarchical Bayesian LQMMs, discusses our prior specifications, and develops a partially collapsed Gibbs (PCG) sampling algorithm for posterior inference.
In Section 4, we develop a variable selection procedure for both fixed and random effects in an LQMM with spike and slab priors and an efficient PCG sampler. We illustrate the performance of our methods with simulation studies and a real data example in Sections 5 and 6; the results show that the proposed approach performs very well. Section 7 concludes the paper.

The re-parameterized LQMMs
Consider n subjects with n_i observations each, (y_ij, x_ij^T, z_ij^T), for j = 1, ..., n_i and i = 1, ..., n, with N = sum_{i=1}^n n_i, where y_ij is the jth response for subject i, x_ij^T is the jth row of a known n_i × p matrix x_i, and z_ij^T is the jth row of a known n_i × q matrix z_i. The canonical linear mixed model is

y_ij = x_ij^T β + z_ij^T α_i + ε_ij,    (1)

where β is a p × 1 vector of unknown fixed population parameters and α_i is a q × 1 vector of unobservable random effects with covariance matrix D. To guarantee that D is positive semidefinite, we apply the Cholesky decomposition

D = Γ Γ^T,

where Γ = (γ_st) is a q × q lower triangular matrix with nonnegative diagonal entries. Similar to [13], writing α_i = Γ b_i with b_i ~ N(0, I_q), model (1) can be re-expressed as

y_ij = x_ij^T β + z_ij^T Γ b_i + ε_ij,    (2)

where z_ij^T Γ b_i can be rewritten as a linear function of the stacked free elements of Γ by means of a transformation matrix J_2. With model (2), the linear conditional τth (0 < τ < 1) mixed quantile estimator can be obtained by minimizing

sum_{i=1}^n sum_{j=1}^{n_i} ρ_τ(y_ij − x_ij^T β − z_ij^T Γ b_i),    (3)

where ρ_τ(u) = u(τ − I(u < 0)) is the check loss function. In a Bayesian setup, this is equivalent to assuming that the error terms ε_ij in (2) follow the asymmetric Laplace distribution (ALD), whose probability density function is

f(ε | σ, τ) = στ(1 − τ) exp{−σ ρ_τ(ε)},    (4)

where σ^{−1} and τ are the scale and skewness parameters, respectively. [14] first used the ALD to establish Bayesian QR in a linear model for independent data and proposed a random walk Metropolis-Hastings algorithm to draw samples. Since then, a great deal of literature has extended and applied QR based on the ALD, for example [15][16][17][18]. However, direct sampling using the ALD is often inconvenient and hard to generalize to more complex scenarios. We therefore employ the location-scale mixture representation of the ALD proposed by [19], which has also been used by many authors (e.g., [11, 20, 21, 22]):

ε_ij = ξ_1 v_ij + ξ_2 sqrt(σ^{−1} v_ij) z_ij,    (5)

where v_ij ~ exp(1/σ) has an exponential prior with mean 1/σ, z_ij ~ N(0, 1) has a standard normal prior and is independent of v_ij, ξ_1 = (1 − 2τ)/(τ(1 − τ)) and ξ_2^2 = 2/(τ(1 − τ)). Let v = (v_ij : i = 1, ..., n; j = 1, ..., n_i) and z = (z_ij : i = 1, ..., n; j = 1, ..., n_i). Based on the Cholesky decomposition and the mixture representation of the ALD, we obtain the following hierarchical model:

y_ij | b_i, v_ij ~ N(x_ij^T β + z_ij^T Γ b_i + ξ_1 v_ij, ξ_2^2 σ^{−1} v_ij),  b_i ~ N(0, I_q),  v_ij ~ exp(1/σ).
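As a quick numerical sanity check of the exponential-normal mixture representation of the ALD, the following sketch (an illustration, not part of the original analysis) draws errors via the mixture and confirms two known ALD facts: the τth quantile of the error is 0, and its mean is ξ_1/σ.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma = 0.3, 2.0                # skewness and (inverse-)scale parameters
xi1 = (1 - 2 * tau) / (tau * (1 - tau))
xi2 = np.sqrt(2 / (tau * (1 - tau)))

n = 500_000
v = rng.exponential(scale=1 / sigma, size=n)   # v_ij ~ Exp with mean 1/sigma
z = rng.standard_normal(n)                     # z_ij ~ N(0, 1), independent of v
eps = xi1 * v + xi2 * np.sqrt(v / sigma) * z   # mixture draws of ALD(0, sigma, tau)

# The tau-quantile of ALD(0, sigma, tau) errors is 0 and the mean is xi1/sigma.
print(np.quantile(eps, tau))   # close to 0
print(eps.mean())              # close to xi1 / sigma
```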

Shrinkage estimation of linear quantile mixed regression
We first propose a new Bayesian shrinkage approach in LQMMs using the Bayesian adaptive lasso and an extended Bayesian adaptive group lasso. A regularized τth quantile estimate of the fixed and random effect coefficients can be defined as the minimizer of

sum_{i=1}^n sum_{j=1}^{n_i} ρ_τ(y_ij − x_ij^T β − z_ij^T Γ b_i) + sum_{s=1}^p λ_1s |β_s| + sum_{k=1}^q λ_2k ||γ_k||,  subject to γ_kk ≥ 0, k = 1, ..., q,    (6)

where γ_k denotes the vector of free elements in the kth row of Γ. Motivated by [23], we place Laplace priors p(β_s | σ, λ'_1s) on the fixed effects and suppose that the error terms ε_ij come from the ALD (4). The posterior distribution of β and γ is then proportional to the product of the ALD likelihood and these priors, and it is easily observed that maximizing this posterior is equivalent to minimizing the objective (6) if the nonnegativity constraints γ_kk ≥ 0, k = 1, ..., q, are ignored. It is not obvious, however, how to incorporate these constraints within the Bayesian penalized LQMM framework. Using truncated normal priors, [24] and [25] solved a similar problem for the selection of (non-grouped) fixed effects in linear and generalized linear mixed models, respectively. We extend their idea to a Bayesian group lasso for the shrinkage of random effects with group structure, specifying conditional prior distributions for the γ_k restricted to γ_kk ≥ 0. To facilitate posterior inference, we use a scale mixture of normals representation with λ_2k = σλ'_2k, so that γ_k can be expressed hierarchically as

γ_k | η_k ~ N_k(0, η_k I_k) I(γ_kk ≥ 0),  η_k | λ²_2k ~ Gamma((k + 1)/2, λ²_2k / 2),

for k = 1, ..., q. Moving to the fixed effects, β can also be written hierarchically as

β_s | t_s ~ N(0, t_s),  t_s | λ²_1s ~ Gamma(1, λ²_1s / 2),

for s = 1, ..., p. There are two main ways to estimate the tuning parameters in Bayesian regularization: a fully Bayesian method, which places a hyperprior on them, such as a conjugate Gamma distribution [23], and an empirical Bayes method, which estimates them by maximizing the marginal likelihood [26]. Here we adopt the fully Bayesian approach and place Gamma priors on λ²_1s, λ²_2k and σ to complete the Bayesian hierarchy.
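The normal-exponential hierarchy on β_s integrates to a Laplace marginal, which is the standard Bayesian lasso construction of [23]. A minimal sketch, assuming only that hierarchy with a fixed tuning parameter λ, checks this numerically via the Laplace identity E|β_s| = 1/λ.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0            # illustrative fixed value of the tuning parameter lambda_1s
n = 500_000

# Hierarchy: t_s | lambda^2 ~ Gamma(1, rate lambda^2/2), i.e. Exp(rate lambda^2/2),
#            beta_s | t_s   ~ N(0, t_s).
t = rng.exponential(scale=2 / lam**2, size=n)
beta = rng.standard_normal(n) * np.sqrt(t)

# Marginally beta_s ~ Laplace(0, 1/lambda), so E|beta_s| = 1/lambda.
print(np.mean(np.abs(beta)))   # close to 1 / lam = 0.5
```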
We develop a partially collapsed Gibbs (PCG) sampler by integrating the full conditional distribution of β over {b_i} to sample from the posterior distribution. To improve the convergence behavior of the Gibbs sampler, [27] proposed the PCG sampler as a generalization of the blocking and collapsing methods of [28]. The key idea of PCG is to replace some of the full conditional distributions with marginal conditional distributions while preserving the target distribution. Next, we present the conditional distributions of the unknown parameters.
(1) Conditional distribution of β: after collapsing over {b_i}, the full conditional of β is multivariate normal, with precision and mean built from the weights w_ij and the working responses y_ij − ξ_1 v_ij through terms of the form x_ij w_ij (y_ij − ξ_1 v_ij).
(2) Conditional distribution of b_i: the full conditional of b_i is multivariate normal with mean m̂_{b_i} and covariance Σ̂_{b_i}.
(3) Conditional distribution of v_ij: the full conditional of v^{−1}_ij is an inverse Gaussian distribution with parameters μ' and λ'.
(4) Conditional distributions of λ²_1s and λ²_2k: the full conditionals of λ²_1s and λ²_2k are independent Gamma distributions.
(5) Conditional distributions of t_s and η_k: these are available in closed form from the normal scale-mixture representation.
(6) Conditional distribution of γ: for i = 1, ..., n and j = 1, ..., n_i, let F_ijk be the covariate vector corresponding to γ_k, k = 1, ..., q, and let γ_(k) = (γ_1^T, ..., γ_{k−1}^T, γ_{k+1}^T, ..., γ_q^T)^T with F_ij(k) the covariate matrix corresponding to γ_(k). The full conditional of γ_k is then a truncated normal distribution N_k(μ̂_{γ_k}, Σ̂_{γ_k}) I(γ_kk ≥ 0).
(7) Conditional distribution of σ: the full conditional of σ is a Gamma distribution.
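To make the mechanics of these updates concrete, here is a deliberately simplified sketch, not the authors' full LQMM sampler: a fixed-effects-only Bayesian quantile regression with σ fixed at 1 and a weak normal prior on β (the prior precision `prior_prec` is an illustrative choice). It alternates the inverse Gaussian update for v^{−1}_ij with the multivariate normal update for β implied by the ALD mixture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: median regression (tau = 0.5), true beta = (1, 2).
n, tau = 400, 0.5
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

xi1 = (1 - 2 * tau) / (tau * (1 - tau))
xi2sq = 2.0 / (tau * (1 - tau))
prior_prec = 1e-2 * np.eye(2)        # weak N(0, 100 I) prior on beta (assumption)
psi = 1.0 / (2 * tau * (1 - tau))    # = xi1**2 / xi2sq + 2, with sigma fixed at 1

beta, v, draws = np.zeros(2), np.ones(n), []
for it in range(1500):
    # (a) v_ij | rest: 1/v_ij ~ inverse Gaussian(sqrt(psi/chi_ij), psi)
    chi = np.maximum((y - X @ beta) ** 2 / xi2sq, 1e-10)
    v = 1.0 / rng.wald(np.sqrt(psi / chi), psi)
    # (b) beta | rest: multivariate normal with precision A and mean m
    w = 1.0 / (xi2sq * v)
    A = X.T @ (X * w[:, None]) + prior_prec
    m = np.linalg.solve(A, X.T @ (w * (y - xi1 * v)))
    beta = m + np.linalg.solve(np.linalg.cholesky(A).T, rng.standard_normal(2))
    if it >= 500:
        draws.append(beta.copy())

post_mean = np.mean(draws, axis=0)
print(post_mean)   # posterior mean close to (1, 2)
```

The `rng.wald` call draws from the inverse Gaussian distribution, matching the closed-form conditional for v^{−1}_ij noted above.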

Bayesian model selection in LQMMs
The shrinkage approach proposed in the previous section has potential for variable selection in LQMMs. However, the estimators are never exactly 0 because of the continuous priors on β and γ. This section develops a general model selection method with spike and slab priors to identify the relatively important effects. We first discuss the prior specification and propose the hierarchical model. Subsequently, we consider a partially collapsed Gibbs sampling algorithm for posterior computation and derive the relevant conditional distributions. We assume the following hierarchical Bayesian lasso with independent spike and slab priors for β:

β_s | p'_0s, t_s ~ p'_0s δ_0(β_s) + (1 − p'_0s) N(0, t_s),

where δ_0(·) denotes a point mass at 0 and p'_0s is the prior probability of excluding the sth fixed effect from the model; p'_0s is assigned a Beta prior with both parameters equal to 1, resulting in a noninformative uniform prior on (0, 1). Furthermore, we assume the following extended hierarchical adaptive Bayesian group lasso with independent spike and slab priors for γ:

γ_k | π_0, η_k ~ π_0 δ_0(γ_k) + (1 − π_0) N_k(0, η_k I_k) I(γ_kk ≥ 0),

where π_0 is the prior probability of excluding the kth random effect from the model, which is also given a noninformative uniform prior on (0, 1). We now present the conditional distributions of the unknown parameters.
(1) Conditional distribution of β: with the weights w_ij and working responses y_ij − ξ_1 v_ij as before, the full conditional of each β_s is a spike and slab distribution: β_s = 0 with a posterior exclusion probability, and otherwise β_s is drawn from the corresponding normal distribution.
(2) Conditional distribution of b_i: multivariate normal, as in the previous section.
(3) Conditional distribution of v_ij: the full conditional of v^{−1}_ij is an inverse Gaussian(μ', λ') distribution.
(4) Conditional distributions of λ_1s and λ_2k: the full conditionals of λ²_1s and λ²_2k are independent Gamma distributions.
(5) Conditional distributions of t_s and η_k: inverse Gamma distributions, where InverseGamma(a, b) denotes an inverse Gamma distribution with shape parameter a and scale parameter b.
(6) Conditional distributions of p'_0s and π_0: independent Beta distributions,

p'_0s | β_s ~ Beta(1 + I(β_s = 0), 2 − I(β_s = 0)),  π_0 | γ ~ Beta(1 + sum_k I(γ_k = 0), 1 + q − sum_k I(γ_k = 0)).

(7) Conditional distribution of γ: let F_ijk be the covariate vector corresponding to γ_k, k = 1, ..., q. The full conditional of γ_k is then a spike and slab distribution: γ_k = 0 with a posterior exclusion probability, and otherwise γ_k is drawn from the truncated normal N_k(μ̂_{γ_k}, Σ̂_{γ_k}) I(γ_kk ≥ 0). Here Φ(·) denotes the cumulative distribution function of the standard normal distribution, m_kk is the kth element of μ̂_{γ_k}, and ŝ²_kk is the kth diagonal element of Σ̂_{γ_k}; these enter the inclusion probability through the normalization of the truncated normal slab.
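The spike-and-slab update in (1) can be illustrated in a toy normal-means setting: the inclusion probability is the posterior weight of the slab, computed from the spike and slab marginal likelihoods, after which the coefficient is either set exactly to 0 or drawn from the slab's conditional. The sketch below uses a hypothetical helper `spike_slab_draw` with illustrative parameters `s`, `tau2`, `pi0`, not the paper's notation.

```python
import numpy as np
from scipy.stats import norm

def spike_slab_draw(yhat, s, tau2, pi0, rng):
    """One Gibbs update for beta under the prior
    beta ~ pi0 * delta_0 + (1 - pi0) * N(0, tau2),
    given the 'observation' yhat | beta ~ N(beta, s**2)."""
    m0 = norm.pdf(yhat, 0.0, s)                      # marginal under the spike
    m1 = norm.pdf(yhat, 0.0, np.sqrt(s**2 + tau2))   # marginal under the slab
    p_incl = (1 - pi0) * m1 / (pi0 * m0 + (1 - pi0) * m1)
    if rng.random() < p_incl:                        # draw from the slab part
        post_var = 1 / (1 / s**2 + 1 / tau2)
        post_mean = post_var * yhat / s**2
        return rng.normal(post_mean, np.sqrt(post_var)), p_incl
    return 0.0, p_incl                               # exact zero from the spike

rng = np.random.default_rng(3)
_, p_strong = spike_slab_draw(5.0, 0.1, 1.0, 0.5, rng)  # strong signal
_, p_null = spike_slab_draw(0.0, 0.1, 1.0, 0.5, rng)    # no signal
print(p_strong, p_null)
```

With a strong signal the inclusion probability is essentially 1, while with no signal the spike dominates and the draw is exactly zero most of the time.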
Simulation studies

With each simulation setting for β, we generate data under three different quantiles, τ = 0.1, 0.3 and 0.5, and four different error distributions: normal, normal mixture, Laplace, and Laplace mixture. We consider three sample sizes: n = 30, n_i = 5; n = 50, n_i = 5; and n = 100, n_i = 5. Therefore, we have a total of 36 simulation setups for each sample size, and we simulate 100 data sets for each setup. Priors for the Bayesian methods are taken to be weakly informative: the hyperparameters in the Gamma priors for the tuning and scale parameters are all set to 0.1. We run our PCG sampler for 20,000 iterations following an initial burn-in of 10,000 cycles. The convergence of the proposed PCG sampler is monitored using the multivariate potential scale reduction factor (MPSRF) introduced by [31]. Figs 1-6 show that the MPSRF generally becomes stable and approaches 1 after about 10,000 iterations under the normal mixture error distribution, suggesting that this burn-in size is large enough to ensure convergence of the PCG sampler. The results for the other error distributions are similar and omitted.
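The MPSRF of [31] can be computed from m parallel chains as (n−1)/n + ((m+1)/m)·λ₁, where λ₁ is the largest eigenvalue of W⁻¹(B/n), W is the average within-chain covariance, and B/n is the covariance of the chain means. A small sketch:

```python
import numpy as np

def mpsrf(chains):
    """Brooks-Gelman multivariate potential scale reduction factor.
    chains: array of shape (m, n, d) - m chains, n draws, d parameters."""
    m, n, d = chains.shape
    means = chains.mean(axis=1)                            # (m, d) chain means
    W = sum(np.cov(c, rowvar=False) for c in chains) / m   # within-chain covariance
    B_over_n = np.cov(means, rowvar=False)                 # between-chain cov / n
    lam = np.max(np.linalg.eigvals(np.linalg.solve(W, B_over_n)).real)
    return (n - 1) / n + (m + 1) / m * lam

rng = np.random.default_rng(4)
chains = rng.standard_normal((4, 2000, 3))   # well-mixed chains, same target
print(mpsrf(chains))                          # close to 1
```

For chains that have not mixed (e.g. one chain offset from the others), the statistic is well above 1, which is the diagnostic being monitored in Figs 1-6.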
We use the posterior median estimates of β and D as our point estimators, denoted by β̂ and D̂, respectively, and we consider two error measures, the mean absolute deviation (MAD) and the root mean squared error (RMSE). To check the performance of the parameter estimation, we also calculate the median (over the 100 simulations) of the mean squared error for β, MME(β̂), and the median of the quadratic loss for D, MMR(D̂). More specifically,

MAD = (1/N) sum_{i,j} |y_ij − ŷ_ij|,  RMSE = sqrt((1/N) sum_{i,j} (y_ij − ŷ_ij)²),  ME(β̂) = (β̂ − β)^T (β̂ − β),

where ŷ_ij is the predicted value of y_ij. The results are reported in Figs 7-12 and Tables 1-9. Because BL, BAL, and QR neglect the covariance structure of the random effects, we report the MMR(D̂) results only for the other three techniques. A few observations can be made from the results. Firstly, Figs 7-12 show that the proposed approaches perform well in general; adGL generally outperforms the other methods in terms of MAD and RMSE, which indicates that explicitly penalizing the off-diagonal covariance components can improve the model selection procedure. Similar findings were obtained by [32]. The adSpikeGL also performs well, though not quite as well as adGL. Secondly, Tables 1-9 show that adSpikeGL performs best in terms of MME(β̂) and MMR(D̂). Taking n = 50 as an example, adSpikeGL has the smallest MME(β̂) in 22 out of 36 simulation setups and the smallest MMR(D̂) in 24 out of 36. Although PMQ performs well for τ = 0.1, its performance drops for τ = 0.3 and τ = 0.5, especially in the dense case (Simulation 2). Thirdly, Tables 1-9 also display that our

PLOS ONE
approaches work well in terms of MAD, RMSE, MME(β̂) and MMR(D̂) even when the true error term is not ALD, indicating that the LQMM using the ALD in model (2) is merely a working model with artificial assumptions that aims to achieve equivalence between maximizing the ALD likelihood and minimizing (3). Finally, as we can observe from Figs 7-12 and Tables 1-9, in terms of the four criteria, adGL, adSpikeGL and PMQ all show an overall decreasing trend as the sample size increases, whereas BL, BAL and QR generally fail to estimate accurately, possibly because they almost completely ignore the random effects, which describe the within-subject dependence among the data, and their performance does not improve as the sample size grows.
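For reference, the criteria above can be computed as in the sketch below; the exact quadratic loss used for MMR(D̂) is not reproduced in the text, so the sketch assumes the common choice tr((D⁻¹D̂ − I)²).

```python
import numpy as np

def mad(y, yhat):
    # mean absolute deviation over all N observations
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    # root mean squared error over all N observations
    return np.sqrt(np.mean((y - yhat) ** 2))

def me_beta(beta_hat, beta):
    # squared-error loss for the fixed effects
    return float((beta_hat - beta) @ (beta_hat - beta))

def mr_d(D_hat, D):
    # a common quadratic loss for a covariance matrix (assumed form)
    M = np.linalg.solve(D, D_hat) - np.eye(D.shape[0])
    return float(np.trace(M @ M))

y, yhat = np.array([1.0, 2.0, 3.0]), np.array([1.0, 3.0, 5.0])
print(mad(y, yhat), rmse(y, yhat))            # 1.0 and sqrt(5/3)
```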

For adSpikeGL, the median posterior probability model (MPPM) criterion [33] and the highest posterior probability model (HPPM) criterion are used for model selection. Tables 10-14 and 11-15 summarize the numbers of correct and wrong zero coefficients for all cases based on MPPM and HPPM, respectively. On the whole, adSpikeGL yields promising results in most cases, even for the extreme quantile τ = 0.1, which is notoriously challenging. As the sample size grows from 30 to 100, the number of wrong zero coefficients goes to zero at all considered quantiles, and the number of correct zero coefficients also becomes larger at the moderate and median quantiles. All of this indicates the good performance of our method in variable selection.

Next, to evaluate the sensitivity of our methods, we consider four different priors for the tuning and scale parameters in Simulation 1. From Table 16, it can be observed that the differences between the four priors are quite small in general. Hence our methods are not sensitive to the choice of hyperparameters for the priors.

Real data example
In this section, we use the Age-Related Macular Degeneration Trial (ARMD) data to assess the behavior of the proposed model at different quantiles. This dataset, available in the R package nlme, consists of 4 variables. Visual acuity of 188 patients with ARMD was measured at 4 time points, resulting in 752 observations. We consider the relationship between visual acuity (VISUAL) and three explanatory variables: the value of visual acuity measured at baseline (VISUAL0), the time of the measurement (TIME), and the treatment indicator (TREAT). A detailed description is given in [34]. Because [34] analyzed the data set using a linear mixed model for the mean and found possible interactions between the time and treatment effects, we consider an LQMM (i = 1, ..., 188, j = 1, ..., 4) with fixed effects for VISUAL0, TIME, TREAT and the TIME × TREAT interaction, together with subject-specific random effects. We choose five values of τ, namely 0.1, 0.3, 0.5, 0.7 and 0.9, to describe the response distribution thoroughly, and implement the adGL, adSpikeGL, PMQ, BL, BAL and QR models on this data set. Table 17 summarizes the MAD and RMSE, computed over the N = 752 observations as in the simulations. The results in Table 17 suggest that PMQ provides smaller MAD and RMSE than the other methods at the upper quantiles (τ = 0.7 and τ = 0.9), whereas adGL and adSpikeGL perform better at the lower and median quantiles (τ = 0.1, 0.3 and 0.5). We also report the posterior median estimates and 95% credible intervals for the fixed effects and random effect variances in Tables 18 and 19, respectively. As in the simulations, adGL, adSpikeGL and PMQ tend to perform similarly, while BAL, BL and QR also behave similarly, and all methods except adSpikeGL do not shrink any fixed or random effects exactly to zero.
The results of adSpikeGL in Tables 18 and 19 show that (i) for the fixed effects, VISUAL0, TIME and TREAT are all identified as effective predictors of visual acuity at all quantiles, except for TREAT at the extreme quantile (τ = 0.1), which may be because the treatment has no significant effect on patients with poor visual acuity; and (ii) for the random effects, VISUAL0 and TREAT are deemed significant predictors of visual acuity at all quantiles, whereas TIME is not consistently identified across quantiles.

Conclusion and discussion
In this work, we presented a novel shrinkage method for linear quantile mixed models for analysing and interpreting longitudinal data. We also considered variable selection based on a Bayesian adaptive lasso and an extended Bayesian adaptive group lasso with spike and slab priors. Unlike other approaches for LQMMs, which need ad hoc methods for model selection, adSpikeGL can directly identify the significant fixed and random effects simultaneously. Several criteria were adopted to evaluate the finite sample performance of the proposed methods thoroughly in simulations and a real data example. The results reveal that the proposed methods are very competitive with existing methods. Moreover, as the sample size grows, adGL and adSpikeGL generally provide smaller MAD, RMSE, MME(β̂) and MMR(D̂) across all scenarios, and adSpikeGL also performs well in variable selection in most cases. The only exception is at τ = 0.1, where the number of correct zero coefficients of adSpikeGL did not show an increasing trend with sample size. This problem may be caused by a drawback of the ALD, which does not allow the skewness, kurtosis and tails of the distribution to vary freely. To overcome this problem, [35] recently introduced a generalized asymmetric Laplace distribution (GALD) to establish QR in linear regression. In the future, we would like to extend the GALD to linear mixed effects models. We would also like to consider other Bayesian shrinkage priors, such as the double-Pareto prior [36], the horseshoe prior [37], and the normal-gamma prior [38], although lasso-type priors are the most widely used shrinkage priors in the literature.