
Study of Bayesian variable selection method on mixed linear regression models

  • Yong Li ,

    Roles Funding acquisition, Project administration, Software

    qjsfxyly@163.com

    Affiliation School of Mathematics and Statistics, Qujing Normal University, Qujing, China

  • Hefei Liu,

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Writing – original draft

    Affiliation School of Mathematics and Statistics, Qujing Normal University, Qujing, China

  • Rubing Li

    Roles Data curation, Writing – review & editing

Affiliation School of Economics, Shanghai University of Finance and Economics, Shanghai, China

Abstract

Variable selection has always been an important issue in statistics. When a linear regression model is used to fit data, selecting appropriate explanatory variables that strongly impact the response variable has a significant effect on the prediction accuracy and interpretability of the model. This study introduces the Bayesian adaptive group Lasso method to solve the variable selection problem under a mixed linear regression model with hidden states and explanatory variables with a grouping structure. First, the definition of the hidden-state mixed linear regression model is presented. Thereafter, the Bayesian adaptive group Lasso method is used to determine the penalty function and parameters, after which the specific form of each parameter’s fully conditional posterior distribution is calculated. Moreover, the design of the Gibbs sampling algorithm is outlined. Simulation experiments are conducted to compare the variable selection and parameter estimation effects in different states. Finally, a dataset on Alzheimer’s disease is used for an application analysis. The results demonstrate that the proposed method can identify observations from different hidden states, and that the variable selection results differ markedly across states.

Introduction

Multiple observation data of each index of the sample are required in biomedical and econometric research. Such data are usually referred to as longitudinal data. The mixed linear regression model is commonly used for fitting these data. In general, mixed linear regression models contain two parts: fixed effects and random effects that are subject to an unknown distribution. The variable selection problem in a mixed linear regression model usually focuses on the variable selection in the fixed effect part.

In recent years, the class of variable selection methods with penalty functions has become very popular. These methods are based on the least absolute shrinkage and selection operator (i.e., Lasso) method proposed by Tibshirani [1]. This class of penalty methods can perform variable selection and parameter estimation simultaneously, and exhibits good stability and strong statistical properties. For example, the SCAD (smoothly clipped absolute deviation) penalty proposed by Fan and Li [2] satisfies several excellent properties, such as asymptotic unbiasedness, sparsity, and continuity. Zou [3] presented the adaptive Lasso, which exhibits consistency when the number of variables is fixed and the sample size approaches infinity. This method solves the problem of poor consistency in Lasso estimation. Moreover, the adaptive group Lasso proposed by Wang and Leng [4] assigns different adjustment parameters to different groups of regression coefficients, whereby effective variable selection and coefficient estimation can be performed and improved results obtained.

When Tibshirani proposed the Lasso method, he proved that when the prior distribution of the regression coefficients is a Laplace distribution, the Lasso estimate of the regression coefficients coincides with the maximum a posteriori estimate, which led to the new concept of the Bayesian Lasso. As the Bayesian method exhibits excellent stability and high computational efficiency, it has been rapidly extended. On this basis, Park and Casella [5] proposed a complete Bayesian model with a conditional Laplace distribution as the prior distribution, and used Gibbs sampling to estimate the posterior distribution of the parameters. Subsequently, Kyung [6] further extended this model and proposed a complete Bayesian formulation that can be combined with several variants of the Lasso. Leng [7] extended this model to the complete Bayesian adaptive Lasso and applied it to variable selection in linear models. Lykou [8] used the Bayesian Lasso method to select model variables. Khondker [9] further extended this method to the Bayesian covariance Lasso. Raman [10] proposed a Bayesian version of the group Lasso, applied it to contingency tables, and proved its stability and efficiency. Ibrahim [11] introduced the SCAD penalty and adaptive Lasso into the mixed linear regression model. Feng and Wang [12] presented the Bayesian adaptive group Lasso method and applied it to the semiparametric structural equation model. Kang and Song [13] applied the Bayesian adaptive group Lasso to the semiparametric hidden Markov model.

However, in general, the research on variable selection with a grouping structure of the explanatory variables under a mixed linear regression model with an implicit state remains lacking, and few studies have used Bayesian Lasso and its variants to solve this problem. In this study, we introduce the Bayesian adaptive group Lasso into the mixed linear regression model with hidden states to select the variables and estimate the parameters. The purpose is to explore the screening of explanatory variables in a mixed linear regression model when the samples have different states, and the explanatory variables are significant in some states and not significant in others.

The remainder of this paper is organized as follows: Section II introduces the basic form of the mixed linear regression model and its variable selection, Bayesian theory, the Bayesian Lasso and its extensions, and the MCMC sampling algorithm, with a focus on the Bayesian adaptive group Lasso method. Section III introduces the core theory of this paper. First, the data and mixed linear regression model used in this study are outlined. Thereafter, the use of the Bayesian adaptive group Lasso to estimate the parameters and select the variables under this mixed linear regression model is presented. Furthermore, the fully conditional posterior distribution of the unknown parameters involved in the Bayesian hierarchical model is derived. Finally, the specific steps of the Gibbs sampling algorithm used in this study are provided. Section IV presents the application research: the estimation accuracy for the true parameters and the variable selection accuracy of the methods and algorithms are evaluated according to the numerical simulation results, and an example is provided. Section V summarizes the paper.

Model description

Consider the following mixed linear regression model, where the observed individuals are indexed by i = 1, 2, ⋯, N, and the observation times by t = 1, 2, ⋯, T. Under the condition Sit = s, the regression model is: (1)

In the above, εit is the random error, which is independently and identically distributed as N(0, σ2); Sit is the state of the i-th sample at the t-th observation, and Sit = s means that the model is defined in the specific state s. Parameter θs is the unknown regression coefficient, also known as the fixed effect, with θs = (αs, βs)T. The explanatory variables corresponding to αs are independent of one another. βs represents the coefficients corresponding to the explanatory variables with a grouping structure.

Furthermore, αs is an L-dimensional vector, βs is a p-dimensional vector, and xit = (xit1, xit2, ⋯, xit(p+L)) is the known explanatory variable. Let the unknown random vector us be m-dimensional. Parameter us is often referred to as the random effect, and it is generally assumed that . Thus, vector zit is m-dimensional, and because the model established in this study is the longitudinal data model in the mixed linear regression family, zit is a vector in which the i-th component is 1 and the other components are 0, i = 1, 2, ⋯, N.

For state Sit, the following assumption applies: (2) where s = 1, 2, ⋯, S. Here, qs is an unknown constant with , and S is a known positive integer; that is, the total number of states is known.

As the observation values in different states affect only the specific numerical calculation, and not the theoretical form of the conditional prior of each unknown parameter or its corresponding full conditional posterior distribution, the assignment of states to the observations is handled within the iterative calculation [14]. Therefore, the subsequent theoretical part is developed within a specific state s, and for convenience of description, the subscript s is omitted. In specific state s, the model is abbreviated as (3) where θ = (α, β)T, u is the random effect, the error is distributed as N(0, σ2), and α = (α1, α2, ⋯, αL)T. All explanatory variables with a grouping structure are divided into J groups. The set of subscripts of each group is denoted Gj, j = 1, 2, ⋯, J. Thus, we can rewrite θ = (α, β)T as .
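To make the data structure concrete, the following sketch simulates observations from a two-state version of model (3). The dimensions, coefficient values, random-effect scale, and state probabilities are hypothetical illustrations, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions mirroring the model: N subjects, T times, S states
N, T, S = 100, 3, 2
L, p = 2, 9                    # ungrouped (alpha) and grouped (beta) covariates
sigma = 0.5                    # error standard deviation

# Hypothetical state-specific fixed effects theta_s = (alpha_s, beta_s)
theta = {1: np.concatenate([[-1.7, 1.3], rng.normal(size=p)]),
         2: np.concatenate([[0.8, -0.6], rng.normal(size=p)])}

states = rng.integers(1, S + 1, size=(N, T))   # S_it = s with q_s = 1/2 each
u = rng.normal(0.0, 0.3, size=N)               # subject-level random effect u_i
X = rng.normal(size=(N, T, L + p))             # explanatory variables x_it

y = np.empty((N, T))
for i in range(N):
    for t in range(T):
        s = states[i, t]
        # y_it = x_it * theta_s + z_it * u + eps_it, with z_it selecting u_i
        y[i, t] = X[i, t] @ theta[s] + u[i] + rng.normal(0.0, sigma)
```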

Bayesian inference principle

Bayesian adaptive group Lasso

In this study, the Bayesian adaptive group Lasso has the following penalty function form: (4) where the positive definite matrix is a pj-order identity matrix, and λl and γj are positive penalty parameters. Priors can be placed on λl and γj so that their corresponding full conditional posterior distributions can be calculated, and their estimated values can then be obtained by the Gibbs method [15].

We introduce the conditional Laplacian prior as the prior distribution of the coefficients of the explanatory variable [16], rewrite the model into a hierarchical structure, provide the fully conditional posterior distribution of all the parameters to be estimated, and subsequently, calculate their estimated values according to Gibbs.

The conditional Laplace prior for coefficient α is (5) where αl is the l-th component of α, which is independent and identically distributed in a univariate Laplace conditional distribution [17], with the location parameter 0 and scale parameter .

The conditional Laplace prior for coefficient β is (6) where , which denotes the components of β, is independent and identically distributed in a multivariate Laplace distribution.

Subsequently, the above Laplace prior distribution is expressed as a normal mixed distribution with an exponential mixed distribution [18, 19]:

For α: (7) For β: (8)
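The scale-mixture representation in (7) can be checked numerically: drawing τ² from an exponential distribution and then α | τ² from N(0, τ²) yields a marginal Laplace distribution [18]. A minimal sketch (the value of λ is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
n = 200_000

# tau^2 ~ Exponential with rate lam^2 / 2 (numpy parameterizes by scale = 1/rate)
tau2 = rng.exponential(scale=2.0 / lam**2, size=n)
# alpha | tau^2 ~ N(0, tau^2); marginally alpha ~ Laplace(0, 1/lam)
alpha = rng.normal(0.0, np.sqrt(tau2))

# The Laplace(0, 1/lam) variance is 2 / lam^2 = 0.5 for lam = 2
sample_var = alpha.var()
```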

For convenience of description, we stack the components of each parameter: let ε = (ε11, ε12, ⋯, εmT)T be an mT-dimensional vector, Z = (z11, z12, ⋯, zmT)T be a matrix of size mT × m, and Σ = σ2ImT be a matrix of size mT × mT. Moreover, is a matrix of size m × m.

Let ε* = Zu + ε, which is distributed in NmT(0, Σ + ZDZT). Therefore, according to the model assumption, the conditional distribution of the explained variable Y can be obtained as follows: (9)

Let Σ + ZDZT = σ*2ImT; then we can rewrite (9) as follows: (10) where the * of σ*2ImT is omitted for a succinct description. The prior for parameter σ2 is set as the inverse gamma distribution. Thus, the model can be expressed as the following hierarchical model: (11) where a, b, aλ, bλ, aγ and bγ are hyperparameters.

Gibbs sampling

The hierarchical model for Bayesian adaptive group Lasso was obtained in the previous section. It is necessary to solve the fully conditional posterior distribution of all unknown parameters to use Gibbs sampling to estimate the parameters involved in the model [20, 21].

According to the hierarchical model, all conditional posterior distributions of the parameters are obtained as follows: where .

Gibbs sampling can be used for parameter estimation once the full conditional posterior distributions of all unknown parameters have been obtained. The confidence interval criterion proposed by Li and Lin [22] is used for the variable selection. According to this method, for the coefficients α corresponding to variables without a grouping structure, if the 95% confidence interval does not cover zero, the variable is considered significant; otherwise, the variable is considered not significant and is eliminated. For the coefficients β corresponding to variables with a grouping structure, if the 95% confidence interval of the estimated coefficient of any variable in the group covers zero, the entire group of variables is eliminated.
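The interval criterion described above can be sketched as a small helper applied to MCMC draws; the synthetic draws and group structure below are illustrative assumptions, not the paper's data.

```python
import numpy as np

def select_variables(draws, groups, level=0.95):
    """Apply the interval criterion to posterior draws.

    draws  : (n_samples, n_coef) array of MCMC draws.
    groups : list of index lists; singleton lists are ungrouped coefficients.
    A group is dropped (False) as soon as any of its coefficients has a
    level-% interval covering zero; otherwise it is kept (True).
    """
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return [not any(lo[j] <= 0.0 <= hi[j] for j in g) for g in groups]

# Hypothetical draws: columns 0 and 2 clearly non-zero, columns 1 and 3
# centred on zero, so the singleton group 1 and the pair {2, 3} are dropped
rng = np.random.default_rng(0)
draws = np.column_stack([rng.normal(2.0, 0.1, 1000),
                         rng.normal(0.0, 0.1, 1000),
                         rng.normal(1.5, 0.1, 1000),
                         rng.normal(0.0, 0.1, 1000)])
flags = select_variables(draws, [[0], [1], [2, 3]])   # [True, False, False]
```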

In the Gibbs process, the specific iteration procedure is as follows:

(1) The specific state of the observed value is unknown but the total number of hidden states is known, and an initial value is assigned to the hidden states: Let . The initial value of each parameter under specific state s is:

(2) For the k-th iteration:

sample the parameters in each state s, s = 1, 2, ⋯, S:

sample from ,

sample from ,

⋯⋯

sample from ,

until all parameters in all states have converged.

Subsequently, the extracted parameters are used to calculate the full conditional probability density function of the hidden state: where and is the likelihood function of the observation in state s. Thus, the conditional probability density function of all hidden states is obtained, following which the state of each observation at this time can be obtained using distribution U(0, 1) as auxiliary sampling:

Update parameter qs:

The k-th iteration ends.

(3) Return to step (2) and perform the (k + 1)-th iteration until the target number of iterations is reached.
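The alternating structure of steps (1)–(3) — initialize, sweep through the full conditionals, discard burn-in — can be illustrated on a deliberately simplified conjugate model (a normal mean with a flat prior and an inverse-gamma variance prior). This is only a sketch of the sampling pattern, not the paper's full sampler, and all numeric settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3.0, 1.5, size=500)     # toy data
n, ybar = len(y), y.mean()

mu, sig2 = 0.0, 1.0                    # initial values, as in step (1)
a, b = 1.0, 0.1                        # inverse-gamma hyperparameters
mu_draws, sig2_draws = [], []

for k in range(3000):                  # step (2): one full-conditional sweep
    # mu | sig2, y ~ N(ybar, sig2 / n)  (flat prior on mu)
    mu = rng.normal(ybar, np.sqrt(sig2 / n))
    # sig2 | mu, y ~ Inv-Gamma(a + n/2, b + 0.5 * sum((y - mu)^2))
    shape = a + n / 2
    rate = b + 0.5 * np.sum((y - mu) ** 2)
    sig2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    mu_draws.append(mu)
    sig2_draws.append(sig2)

# step (3): discard burn-in, use posterior means of the retained draws
post_mu = np.mean(mu_draws[1000:])
post_sig2 = np.mean(sig2_draws[1000:])
```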

Simulation experiment

Model settings

The main purpose of the numerical simulations is to test the accuracy of the model parameter estimation and variable selection, the accuracy of determining the state of each observation, and the differences in the variable selection results under different states. Moreover, the effects of different sample sizes on the parameter estimation were investigated.

(1) The simulation settings are as follows:

A total of 100 experiments were conducted, each considering the following settings: number of observation times T = 3; sample sizes N = 100 and 300; number of hidden states S = 2. The probability that each observation value belonged to state 1 or state 2 was the same, namely 0.5.

In the first state: α = (−1.7, 1.3), , , and ; ; .

In the second state: , and .

The settings in the two states were considered for design matrix X:

The part corresponding to coefficient α, namely Xα, was distributed in the multivariate normal N(0,I), where I is the identity matrix.

The part corresponding to coefficient , namely , was distributed in .

As there was a strong correlation between the components of , the following settings were used: the element in row i and column k of was 0.7^|i−k|, i, k = 1, 2, 3; the element in row i and column k of was 0.6^|i−k|, i, k = 1, 2, 3; and the element in row i and column k of was 0.4^|i−k|, i, k = 1, 2, 3.
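The correlation structure above (entry ρ^|i−k| in row i, column k) can be built and used to draw the grouped design columns as follows; ρ = 0.7 matches the first group's setting, and the sample size is illustrative.

```python
import numpy as np

def ar_cov(rho, dim):
    """Covariance matrix with entry rho^|i-k| in row i, column k."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

Sigma1 = ar_cov(0.7, 3)
# Each grouped design block is then drawn as multivariate normal N(0, Sigma_j)
rng = np.random.default_rng(3)
X_group1 = rng.multivariate_normal(np.zeros(3), Sigma1, size=300)
```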

The following settings were used for the random effects:

(2) Hyperparameter and MCMC settings:

Hyperparameters a, b, aλ, bλ, aγ, and bγ in the hierarchical model (11) were set as follows [22]: (prior I) a = 1, b = 0.1, aλ = 1, bλ = 0.1, aγ = 1, and bγ = 0.01.

The number of MCMC iterations was set to 5000. Three groups of different initial values were set for all parameters to be estimated, and the EPSR values of the three parallel simulation sequences of all parameters were calculated. By 2000 iterations, the EPSR values of all parameters were less than 1.2, indicating that the chains had converged. Therefore, to ensure convergence, the samples obtained from the first 2500 iterations were discarded as burn-in, and only the subsequent 2500 draws were retained for analysis. The posterior mean was used as the estimated value of each parameter.
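The EPSR (estimated potential scale reduction) used to monitor convergence can be computed from parallel chains as below; this is a standard Gelman–Rubin sketch, and the synthetic chains are illustrative only.

```python
import numpy as np

def epsr(chains):
    """Estimated potential scale reduction for m parallel chains of length n.

    chains : array of shape (m, n), one row per chain.
    """
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
mixed = rng.normal(0.0, 1.0, size=(3, 2000))         # chains that agree
split = mixed + np.array([[0.0], [5.0], [10.0]])     # chains that disagree
r_good, r_bad = epsr(mixed), epsr(split)             # r_good near 1, r_bad large
```

A run is treated as converged when the EPSR of every parameter drops below a threshold such as 1.2, mirroring the criterion used in the simulations.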

Analysis of results

After repeating the experiment 100 times, the estimation results for each coefficient selected in the model could be summarized for the two sample sizes and the two states, as indicated in Table 1.

It can be observed from Table 1 that the model generally had a good estimation effect for each parameter, and the estimation effect on each component of α was better than that on each component of . This may be because the three components of had a strong correlation with one another and imposed penalties on the entire group, so the estimation effect of a single component was somewhat poor.

Furthermore, the accuracy of the parameter estimation increased with the increase in the sample size, indicating that the estimation effect of the model increased.

The model variable selection was also investigated, and the 95% confidence interval of the posterior mean of each parameter was calculated. For parameter α, if the confidence interval of a component covered zero, the corresponding variable was removed. For parameter , if the confidence interval of any component of covered zero, the entire group was removed.

In this study, the components of β were considered as a whole. In each experiment, for the components of the estimated β in the two states whose true values were zero, the number of components whose corresponding variables were correctly excluded was recorded. A total of 100 results were recorded in each of the two states, and their mean is reported as the average of correct zeros in Table 2. Correspondingly, the average of incorrect zeros is the mean number of coefficient components whose true values were not zero but whose corresponding variables were nevertheless excluded.

Table 2. Identification results of insignificant variables.

https://doi.org/10.1371/journal.pone.0283100.t002

It can be observed from Table 3 that, in the 100 repeated experiments, the two group vectors with true coefficients of zero in state 2 were eliminated. However, in state 1, only one group vector had a true coefficient of zero, yet two group vectors were excluded; that is, one group vector with a non-zero true coefficient was excluded from the model.

According to Tables 1–3, in state 1 only one component of the excluded group had a non-zero true value. However, according to the “confidence interval criterion”, if the confidence interval of any component covers zero, the entire corresponding group of variables should be eliminated. As only one component had a large coefficient, the group as a whole was considered not significant and was eliminated.

Sensitivity analysis

In this section, we conduct a sensitivity analysis to examine whether the proposed method is sensitive to the prior specification. We reset the hyperparameters as follows: (prior II) a = 6, b = 4, aλ = 2, bλ = 0.01, aγ = 3, and bγ = 0.05. The MCMC settings are unchanged. Table 4 presents the parameter estimation results under prior II.

The estimated results of parameters in Table 4 are similar to those in Table 1. The experimental results show that the proposed variable selection method is robust to the prior distribution hyperparameters.

Case study

To illustrate the practicability of the proposed model and method, we apply them to the study of Alzheimer’s disease. The data and further information can be found on the ADNI website (www.adni-info.org). Because many individuals had missing information and this research does not address the missing data problem, we deleted individuals with missing information. We thus selected 512 patients and collected their clinical information and basic variables at baseline, 6 months, 12 months, 24 months, and 36 months, so N = 512 and T = 5 in this model. The specific information on the response variable and the candidate covariates initially selected is shown in Table 5.

In the model, the FAQ (Functional Assessment Questionnaire) score was selected as the response variable (yit) to reflect the cognitive and behavioral abilities of the respondents. Among the 11 candidate variables, X1, X2, X3, X4 are inborn and unchangeable biological genetic information, X5, X6, X7 are changeable biological information, X8, X9 are past historical information, and X10, X11 are current social attributes that may change. Therefore, we divide the 11 variables into 4 groups: G1 = {X1, X2, X3, X4}, G2 = {X5, X6, X7}, G3 = {X8, X9}, and G4 = {X10, X11}. We roughly divide the respondents into two states: one with cognitive and behavioral disorders, and the other without, or with only slight, cognitive and behavioral disorders. We study the following issues: 1. What are the factors that affect cognitive and behavioral abilities in each state? 2. In each state, what is the relationship between the covariates and the response variable?

The above problem amounts to selecting the variables and estimating the parameters of the model . Here α is the intercept term.

Before the empirical analysis, we first standardized three variables: the FAQ score (y), age (X5), and years of education (X8). We chose the hyperparameters of prior I in the analysis. Table 6 shows the results of variable selection and parameter estimation.

From Table 6, we can see the variable selection results of the model. Under state 1, three groups of variables, G1, G2, and G4, have significant effects on the response variable, while G3 does not. Under state 2, G2 has a significant impact on the response variable, while G1, G3, and G4 do not.

Substituting the corresponding coefficients and variables into the model, we obtain the following: when the respondents have cognitive impairment, the fitted model for the FAQ score is y = 1.71 − 0.25X1 − 0.16X2 + 0.72X3 + 0.08X4 + 0.11X5 + 0.46X6 − 0.29X7 − 0.69X10 − 0.39X11. When the respondents have no cognitive impairment or mild cognitive impairment, the fitted model for the FAQ score is y = −0.84 + 0.27X5 + 1.14X6 − 0.43X7.
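For reference, the two fitted state-specific equations can be wrapped in a small helper. Covariates omitted from the input dictionary are treated as 0 purely for illustration; the function and its interface are an assumption for demonstration, not part of the paper.

```python
def faq_score(x, impaired):
    """Predicted (standardized) FAQ score from the two fitted equations.

    x        : dict mapping covariate index (1..11) to its value; missing
               covariates default to 0 for illustration only.
    impaired : True for state 1 (cognitive impairment), False for state 2.
    """
    g = lambda j: x.get(j, 0.0)
    if impaired:  # state 1: cognitive impairment
        return (1.71 - 0.25 * g(1) - 0.16 * g(2) + 0.72 * g(3) + 0.08 * g(4)
                + 0.11 * g(5) + 0.46 * g(6) - 0.29 * g(7)
                - 0.69 * g(10) - 0.39 * g(11))
    # state 2: no or mild cognitive impairment
    return -0.84 + 0.27 * g(5) + 1.14 * g(6) - 0.43 * g(7)
```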

The results of the variable selection show that different factors affect FAQ scores in different cognitive states. When a respondent is in a cognitive disorder state, the innate genetic information, changeable biological information, and current social attributes all affect the respondent’s cognitive ability (FAQ score). When a respondent is in the state of no cognitive impairment or mild cognitive impairment, only the changeable biological information has a significant impact on cognitive ability (FAQ score).

From the variable selection results, we found that the changeable biological information (X5, X6, X7) had a significant impact on cognitive ability regardless of the respondent’s state. Further analysis shows that X5 and X6 have a positive impact on the FAQ score, and X7 has a negative impact on the FAQ score. This indicates that the older the respondent, the weaker the cognitive ability, while the larger the volume of the hippocampus, the stronger the cognitive ability. In addition, under the condition of no cognitive impairment or slight cognitive impairment, the influence of the innate genetic information (X1, X2, X3, X4) and current social attributes (X10, X11) on cognitive ability is not significant, whereas in the state of cognitive impairment, the influence of these two groups of variables is significant. This is an interesting discovery. For example, does this mean that people of different genders have different risks of cognitive impairment? These results provide a novel perspective that deserves further investigation.

In addition, the original dataset gives the diagnostic status of each respondent at each test. We used the results of the last MCMC iteration to classify the status of the respondents. Through comparison, out of 2048 sample points (512 × 4 = 2048), 1962 sample points were classified correctly, a correct rate of 95.8%. This shows that our model adapts well to the dataset.

Conclusions

In this study, the Bayesian adaptive group Lasso was applied to the mixed linear regression model with hidden states: the adaptive Lasso was applied to the independent explanatory variables, and the adaptive group Lasso was applied to the variables with a grouping structure. Under the Bayesian framework, the penalty function, penalty parameters, and prior distribution of each parameter were specified, following which the concrete form of the full conditional posterior distribution of each parameter was calculated, and the specific implementation steps of the Gibbs sampling were presented. Finally, the performance of the model in parameter estimation and variable selection was discussed. The simulation analysis demonstrated that the proposed model can identify insignificant variables, eliminate insignificant variables with a grouping structure, and estimate the parameters accurately. The case study verified that the same set of variables may be significant in some states and not in others.

References

  1. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc B. 1996;58: 267–288.
  2. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96: 1348–1360.
  3. Zou H. The adaptive Lasso and its oracle properties. J Am Stat Assoc. 2006;101: 1418–1429.
  4. Wang H, Leng C. A note on adaptive group Lasso. Comput Stat Data Anal. 2008;52: 5277–5286.
  5. Park T, Casella G. The Bayesian Lasso. J Am Stat Assoc. 2008;103: 681–686.
  6. Kyung M. Penalized regression, standard errors, and Bayesian Lassos. Bayesian Anal. 2010;5: 369–411.
  7. Leng C, Tran M, Nott D. Bayesian adaptive Lasso. Ann Inst Stat Math. 2014;66: 221–244.
  8. Lykou A, Ntzoufras I. On Bayesian Lasso variable selection and the specification of the shrinkage parameter. Stat Comput. 2013;23: 361–390.
  9. Khondker ZS, Zhu H, Chu H, Lin W, Ibrahim JG. The Bayesian covariance Lasso. Stat Interface. 2013;6: 243–259. pmid:24551316
  10. Raman S, Fuchs T, Wild P, et al. The Bayesian group-Lasso for analyzing contingency tables. Proceedings of the 26th Annual International Conference on Machine Learning. 2009: 881–888.
  11. Ibrahim J, Zhu H, Garcia R, Guo R. Fixed and random effects selection in mixed effects models. Biometrics. 2011;67: 495–503. pmid:20662831
  12. Feng X, Wang G, Wang Y, Song X. Structure detection of semiparametric structural equation models with Bayesian adaptive group Lasso. Stat Med. 2015;34: 1527–1547. pmid:25640461
  13. Kang K, Song X, Hu X, Zhu H. Bayesian adaptive group Lasso with semiparametric hidden Markov models. Stat Med. 2019;38: 1634–1650. pmid:30484887
  14. Liu H, Song X. Bayesian analysis of mixture structural equation models with an unknown number of components. Struct Equ Modeling. 2018;25(1): 41–55.
  15. Liu H, Song X, Zhang B. Varying-coefficient hidden Markov models with zero-effect regions. Comput Stat Data Anal. 2022;73: 1–19.
  16. Liu H, Song X, Tang Y, Zhang B. Bayesian quantile nonhomogeneous hidden Markov models. Stat Methods Med Res. 2021;30(1): 112–128. pmid:32726188
  17. Flynn C, Hurvich C, Simonoff J. Efficiency for regularization parameter selection in penalized likelihood estimation of misspecified models. J Am Stat Assoc. 2013;108: 1031–1043.
  18. Andrews D, Mallows C. Scale mixtures of normal distributions. J R Stat Soc B. 1974;36: 99–102.
  19. Torbjorn E, Taesu K, Lee T. On the multivariate Laplace distribution. IEEE Signal Process Lett. 2006;13: 300–303.
  20. Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell. 1984;6: 721–741. pmid:22499653
  21. Hobert J, Casella G. The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. J Am Stat Assoc. 1996;91(436): 1461–1473.
  22. Li Q, Lin N. The Bayesian elastic net. Bayesian Anal. 2010;5: 151–170.