Addressing cluster-constant covariates in mixed effects models via likelihood-based boosting techniques

Boosting techniques from the field of statistical learning have grown into a popular tool for estimating and selecting predictor effects in various regression models and can roughly be separated into two general approaches, namely gradient boosting and likelihood-based boosting. An extensive framework has been proposed for fitting generalised mixed models based on boosting; however, in the case of cluster-constant covariates, likelihood-based boosting approaches tend to choose the wrong variables in the selection step, leading to wrong estimates. We propose an improved boosting algorithm for linear mixed models in which the random effects are properly weighted, disentangled from the fixed effects updating scheme and corrected for correlations with cluster-constant covariates in order to improve the quality of estimates and, in addition, reduce the computational effort. The method outperforms current state-of-the-art approaches from boosting and maximum likelihood inference, which is shown via simulations and various data examples.


Introduction
Linear mixed models [1] have proved to be a very popular tool for analysing data with repeated measurements, especially clustered longitudinal data from clinical surveys. Nevertheless, they are applicable to a much broader range of fields, and various overviews can be found in [2][3][4]. Fitting these models can be achieved with a variety of available R packages [5,6], and classical methods for inference like tests [7] or selection criteria [8,9] have been developed.
In order to use mixed models for prediction, various approaches to regularized regression such as the lasso [10,11] and boosting techniques [12] have been proposed. Lasso-type approaches can be found in [13] for linear and in [14] for generalized linear mixed models. Boosting in general can be divided into gradient boosting [15,16] and likelihood-based boosting [17,18]. Both boosting methods are capable of fitting mixed models, and for the latter an extensive framework has been proposed in [19][20][21], which is included in the R package GMMBoost [22] available on CRAN. Apart from improving prediction, component-wise boosting methods are, due to their iterative and component-wise fitting process, suitable for high-dimensional data and implicitly offer variable selection. Good insights into component-wise boosting can be found in [23] for gradient boosting and in [24] for both gradient and likelihood-based boosting. Please note that when talking about boosting, we always refer to the component-wise variant.

PLOS ONE | https://doi.org/10.1371/journal.pone.0254178 July 9, 2021

However, the bGLMM algorithm from the GMMBoost package tends to struggle with cluster-constant covariates, e.g. baseline covariates like gender or treatment group in longitudinal studies. The specified selection and updating procedure of the bGLMM algorithm tends to favour cluster-varying covariates, while the simultaneously updated random intercepts partly account for effects actually evolving from cluster-constant covariates. As shown in Fig 1, this malfunction already occurs in a very basic data example with the popular Orthodont dataset, which is available in various R packages. The dataset depicts the evolution of an orthodontal measurement of 27 children and contains two covariates. A basic linear mixed model with random intercepts returns the two coefficient estimates $\hat\beta^{\mathrm{lme}}_{\mathrm{gender}} = -2.32$ by lmer and $\hat\beta^{\mathrm{b}}_{\mathrm{gender}} = 0.00$ by bGLMM for the effect of the cluster-constant covariate gender. The reason for this difference becomes clear when looking at the random intercepts, where bGLMM tends to compensate the missing effect for gender by assigning every female subject a random intercept lowered by 2.32. Although the structure of the Orthodont dataset is very simple and does not require boosting, it is evident that the described weak spot of bGLMM is not confined to more complex datasets and thus can occur for any clustered data containing cluster-constant covariates.
We therefore propose an updated algorithm with various changes in order to avoid the phenomenon of random intercepts growing too quickly. These changes include the usage of smaller starting values and weaker random-effects updates to prevent the random effects from growing too fast, as well as undocking the random effects update from the fixed effects boosting scheme, which guarantees a fair comparison between the single covariates for the fixed effects. Most importantly, we introduce a correction step for the random effects estimation to avoid possible correlations with observed covariates. The contribution of the present work is therefore a novel and better performing boosting algorithm for mixed models regarding both estimation accuracy and runtime, particularly in the presence of cluster-constant covariates. The algorithm not only solves the described identification issues but in addition constitutes the only regularization approach for mixed models that explicitly accounts for estimation bias arising from possible correlations between random and regularized effects. While existing approaches bypass these issues by excluding affected covariates from the regularization, the presented algorithm addresses the phenomenon directly by correcting falsely estimated random effects.
The remainder of the paper is structured as follows: Section 2 formulates the underlying model and the updated boosting algorithm and gives a detailed discussion of the changes. The algorithm is then evaluated and compared to other regularization approaches using an extensive simulation study described in Section 3. As an illustrative data example we have chosen the Primary Biliary Cirrhosis data, which further underlines the strengths and weaknesses of the compared methods and is discussed in Section 4. Finally, the results and possible extensions are discussed.

Methods
We propose a novel and improved boosting algorithm for linear mixed models in the following subsections.

Model specification
For clusters $i = 1, \dots, n$ with observations $j = 1, \dots, n_i$ we consider the linear mixed model

$$y_{ij} = \beta_0 + x_{ij}^T \beta + z_{ij}^T \gamma_i + \varepsilon_{ij}, \qquad (1)$$

with covariate vectors $x_{ij} = (x_{ij1}, \dots, x_{ijp})^T$ and $z_{ij} = (z_{ij1}, \dots, z_{ijq})^T$ referring to the fixed and random effects $\beta$ and $\gamma_i$, respectively. The random components are assumed to follow normal distributions, i.e. $\varepsilon_{ij} \sim N(0, \sigma^2)$ for the model error and $\gamma_i \sim N_q(0, Q)$ for the random effects. This leads to the cluster-wise notation

$$y_i = \beta_0 1_{n_i} + X_i \beta + Z_i \gamma_i + \varepsilon_i$$

with $y_i = (y_{i1}, \dots, y_{in_i})^T$ and $\varepsilon_i = (\varepsilon_{i1}, \dots, \varepsilon_{in_i})^T$. Finally, we get the common matrix notation

$$y = \beta_0 1_N + X \beta + Z \gamma + \varepsilon.$$

In order to perform likelihood inference, let $\vartheta = (\beta_0, \beta^T, \gamma^T)^T$ denote the effects and $\phi = (\sigma^2, \tau)$ the information of the random components, where $\tau$ contains the values of $Q$. The marginal log-likelihood of the model can be obtained via

$$\ell(\vartheta, \phi) = \sum_{i=1}^{n} \log \int f(y_i \mid \vartheta, \phi)\, p(\gamma_i \mid \phi)\, d\gamma_i,$$

where $f(\cdot \mid \vartheta, \phi)$ and $p(\cdot \mid \phi)$ denote the normal densities of the model error and the random effects. Laplace approximation following [25] results in the penalized log-likelihood

$$\ell_{\mathrm{pen}}(\vartheta, \phi) = \sum_{i=1}^{n} \log f(y_i \mid \vartheta, \phi) - \frac{1}{2} \gamma^T Q_\gamma^{-1} \gamma, \qquad Q_\gamma = \mathrm{diag}(Q, \dots, Q),$$

which is maximized simultaneously for $\vartheta$ and $\phi$ by the likelihood-based boosting technique discussed in the following subsection.
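To make the model specification concrete, the following sketch simulates data from a random-intercept instance of model (1); all dimensions and parameter values are illustrative choices and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_i, p = 50, 4, 3                           # clusters, observations per cluster, fixed effects
beta0, beta = 1.0, np.array([2.0, 0.0, -1.0])  # intercept and fixed effects
sigma2, tau2 = 0.25, 0.8                       # Var(eps_ij) and Var(gamma_i) for q = 1

X = rng.normal(size=(n * n_i, p))              # covariate matrix, one row per observation
cluster = np.repeat(np.arange(n), n_i)         # cluster index i for every observation j
gamma = rng.normal(0.0, np.sqrt(tau2), n)      # random intercepts gamma_i ~ N(0, tau2)
eps = rng.normal(0.0, np.sqrt(sigma2), n * n_i)

# y_ij = beta_0 + x_ij^T beta + gamma_i + eps_ij
y = beta0 + X @ beta + gamma[cluster] + eps
```

For $q > 1$ the scalar `tau2` would be replaced by a $q \times q$ covariance matrix $Q$ and draws from a multivariate normal distribution.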

Boosting algorithm
The lbbLMM (likelihood-based boosting for linear mixed models) algorithm iteratively fits the linear mixed model (1) via component-wise likelihood-based boosting. The fitting procedure in general is carried out by Fisher scoring [26], a variant of Newton's optimization method [27], which iteratively optimizes a given cost function based on quadratic approximations. It therefore obtains updates based on first- and second-order derivatives, which are, in the context of Fisher scoring, represented by the score vector and the Fisher matrix of the underlying cost function. We first give a brief description of the algorithm and discuss the single steps in more detail in the following subsection.

Algorithm lbbLMM
• Initialize estimates with starting values $\hat\vartheta^{[0]}$ and $\hat\phi^{[0]}$. Choose the total number of iterations $m_{\mathrm{stop}}$ and the step length $\nu$.
• for $m = 1$ to $m_{\mathrm{stop}}$ do
• step 1: Select and update fixed effects. Compute candidate updates for all $p$ fixed effects, select the best performing component with respect to the specified information criterion and weakly update the corresponding coefficient.
• step 2: Update random effects. Update the random effects using an additional Fisher scoring step based on the penalized log-likelihood by calculating $s_{\mathrm{ran}}(\gamma)$ and $F_{\mathrm{ran}}(\gamma)$ and weakly updating
$$\hat\gamma^{[m]} = \hat\gamma^{[m-1]} + \nu\, C\, F_{\mathrm{ran}}(\gamma)^{-1} s_{\mathrm{ran}}(\gamma).$$
The incorporation of the correction matrix $C$ at this step is crucial; its derivation is discussed further below.
• step 3: Update variance-covariance components. Update $\hat\sigma^{2[m]}$ and $\hat Q^{[m]}$ as described below.
• end for
• Stop the algorithm at the best performing $m^\ast$ with respect to the specified information criterion. Return $\hat\vartheta^{[m^\ast]}$ and $\hat\phi^{[m^\ast]}$ as the final estimates.

Computational details of the algorithm
We give a stepwise description of the computational details of the lbbLMM algorithm. For simplicity, we omit iteration indices and hats indicating estimated values whenever appropriate.

Starting values.
The parameters actually underlying the selection process are necessarily set to zero, thus $\hat\beta^{[0]} = 0$. The initial intercept and model error are set to $\hat\beta_0^{[0]} = \bar y$ and $\hat\sigma^{2[0]} = \mathrm{Var}(y)$, and the random effects are initialized as $\hat\gamma^{[0]} = 0$ with a small covariance matrix, e.g. $\hat Q^{[0]} = \mathrm{diag}(0.1, \dots, 0.1)$. An alternative approach, which is also proposed in [19], would be to fit a standard linear mixed model for intercept and random effects, e.g. using the function lmer from the R package lme4, and to extract the starting values from the model fit.
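In code, these default starting values amount to no more than the following sketch (the function name and the use of Python are illustrative; the original implementation is in R):

```python
import numpy as np

def init_lbbLMM(y, q):
    """Illustrative starting values: zero fixed and random effects,
    intercept at the mean response, model error at Var(y) and a
    small random-effects covariance matrix."""
    beta0 = y.mean()                 # beta_0^[0] = mean of y
    sigma2 = y.var()                 # sigma^2[0] = Var(y)
    Q = np.diag(np.full(q, 0.1))     # Q^[0] = diag(0.1, ..., 0.1)
    return beta0, sigma2, Q

beta0, sigma2, Q = init_lbbLMM(np.array([1.0, 2.0, 3.0]), q=2)
```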

Fixed effects boosting process.
The computation of the $r$th update is straightforward by calculating

$$u_r = (\tilde X^T \tilde X)^{-1} \tilde X^T (y - \hat\eta),$$

where $\tilde X = (1, X_{\cdot r})$ is a $N \times 2$ matrix containing a column of ones and the $r$th column of $X$ associated with the $r$th covariate, and $\hat\eta$ denotes the current fit. This leads to $p$ possible parameter vectors $\vartheta_r$, where only the intercept and the $r$th component received an update according to $u_r$. The best performing component is the one leading to minimal $\mathrm{AIC}_r$ or $\mathrm{BIC}_r$ [28,29] given by

$$\mathrm{AIC}_r = -2 \sum_{i=1}^{n} \log f(y_i \mid \vartheta_r, \phi) + 2\,\mathrm{df}, \qquad \mathrm{BIC}_r = -2 \sum_{i=1}^{n} \log f(y_i \mid \vartheta_r, \phi) + \log(n)\,\mathrm{df}.$$

Here, $\mathrm{df} = \#\phi + \#\{i \leq p : \beta_i \neq 0\}$ denotes the model complexity according to the marginal likelihood, where $\#\phi$ is the total number of variance-covariance parameters in $\phi$.
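A stripped-down sketch of this selection step is given below; for simplicity it replaces the marginal AIC/BIC by the residual sum of squares, and the function name is ours, not the original implementation's.

```python
import numpy as np

def select_component(X, y, eta, nu=0.1):
    """One component-wise step: regress the current residual on an
    intercept plus each single covariate and keep the best candidate.
    The RSS stands in here for the marginal AIC/BIC used in the paper."""
    n, p = X.shape
    resid = y - eta
    best = None
    for r in range(p):
        Xt = np.column_stack([np.ones(n), X[:, r]])      # (1, X_r), an n x 2 matrix
        u_r = np.linalg.lstsq(Xt, resid, rcond=None)[0]  # candidate update u_r
        rss = np.sum((resid - Xt @ u_r) ** 2)
        if best is None or rss < best[0]:
            best = (rss, r, u_r)
    _, r, u_r = best
    return r, nu * u_r                                   # weak update of (beta_0, beta_r)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=100)
r, u = select_component(X, y, eta=np.zeros(100))         # selects r = 2
```

Only the selected pair $(\beta_0, \beta_r)$ receives the damped update $\nu u_r$; all other coefficients remain untouched in this iteration, which is what makes the procedure component-wise.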

Random effects update. By calculating the penalized score vector and Fisher matrix with respect to $\gamma$,

$$s_{\mathrm{ran}}(\gamma) = \frac{1}{\sigma^2} Z^T (y - \hat\eta) - Q_\gamma^{-1} \gamma, \qquad F_{\mathrm{ran}}(\gamma) = \frac{1}{\sigma^2} Z^T Z + Q_\gamma^{-1},$$

a weak and corrected update

$$\hat\gamma^{[m]} = \hat\gamma^{[m-1]} + \nu\, C\, F_{\mathrm{ran}}(\gamma)^{-1} s_{\mathrm{ran}}(\gamma)$$

for the random effects is obtained. Note that this differs from the approach in [19], as the random effects are updated separately and in addition also receive an update scaled by the step length $\nu$. The weak update ensures that the random effects do not grow too quickly compared to the fixed effects. The disentanglement of the random effects update from the fixed effects updating scheme, on the other hand, guarantees a fair comparison of the single fixed effects, in which the random effects do not play a crucial role. In addition, the Fisher matrix has block-diagonal form, making the inversion much easier and thus strongly reducing the computational effort.

Deriving the correction matrix C.
The single random intercepts or random slopes are corrected independently of each other using distinct sets of covariates. For the correction of the $s$th random effect consider $\gamma_s = (\gamma_{s1}, \dots, \gamma_{sn})^T$ with covariates $X_{cs} \in \mathrm{Mat}_{\mathbb{R}}(n, p_s)$, where $p_s$ denotes the total number of correction covariates used for the $s$th random effect and $\mathrm{Mat}_{\mathbb{R}}(n, m)$ denotes the space of all $n \times m$ matrices with values in $\mathbb{R}$. Note that $X_{cs}$ has $n$ rows, as it contains only one representative observation from each cluster. The correction matrix contains all cluster-constant covariates for random intercepts and just a column of ones for random slopes, which corresponds to centering the given random slope. The correction matrix $C_s$ for the $s$th random effect is obtained by

$$C_s = X_{cs} (X_{cs}^T X_{cs})^{-1} X_{cs}^T,$$

so that the product $(I_n - C_s)\gamma_s$ corrects the $s$th random effect for any covariates contained in the corresponding matrix $X_{cs}$ by subtracting the orthogonal projection of the $s$th random effect estimates onto the subspace spanned by the covariates $X_{cs}$. This ensures that the coefficient estimates for the random effects are uncorrelated with any observed covariate. These separate corrections are summarised in one single correction matrix by defining the block diagonal $\bar C = \mathrm{diag}(C_1, \dots, C_q)$ and computing

$$C = P^T (I_{nq} - \bar C) P,$$

where $P$ is a permutation matrix mapping $\gamma$ to $\tilde\gamma = (\gamma_1^T, \dots, \gamma_q^T)^T$, i.e. $\tilde\gamma = P\gamma$. The product $C\gamma$ then corrects every random effect simultaneously. This concept also proved useful for an improved estimation of mixed models via model-based gradient boosting [30].
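The effect of the correction can be illustrated numerically: below, hypothetical random intercepts are contaminated by a cluster-constant covariate, and projecting out the span of $X_{cs}$ leaves estimates that are exactly orthogonal to (and hence uncorrelated with) every correction covariate. All data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40                                          # number of clusters
gender = (np.arange(n) % 2).astype(float)       # a cluster-constant 0/1 covariate
X_cs = np.column_stack([np.ones(n), gender])    # correction covariates, one row per cluster
gamma_s = -2.3 * gender + rng.normal(size=n)    # random intercepts contaminated by 'gender'

# hat matrix C_s = X_cs (X_cs^T X_cs)^{-1} X_cs^T projects onto span(X_cs)
C_s = X_cs @ np.linalg.solve(X_cs.T @ X_cs, X_cs.T)
gamma_corrected = (np.eye(n) - C_s) @ gamma_s   # subtract the orthogonal projection

# corrected random effects are orthogonal to every correction covariate
print(np.abs(X_cs.T @ gamma_corrected).max())   # numerically zero
```

This mirrors the bGLMM pathology described in the introduction: the spurious shift of roughly $-2.3$ for one group is removed from the random intercepts, so it can instead be picked up by the fixed effect of the covariate.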

Updating variance-covariance components.
The covariance matrix $Q$ of the random effects is updated with an approximate EM-algorithm using the posterior curvatures $F_i$ of the random effects model [31]. An update is received by computing

$$\hat Q^{[m]} = \frac{1}{n} \sum_{i=1}^{n} \left( F_i^{-1} + \hat\gamma_i^{[m]} \hat\gamma_i^{[m]T} \right).$$

Each iteration's model error is obtained by $\hat\sigma^{2[m]} = \mathrm{Var}(y - \hat\eta^{[m]})$.
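A sketch of this update step, assuming the approximate-EM form with given posterior modes $\hat\gamma_i$ and curvatures $F_i$ (all names, shapes and values here are illustrative):

```python
import numpy as np

def update_variance_components(gammas, curvatures, y, eta):
    """Approximate EM-type update: Q from posterior modes and curvatures
    of the random effects, sigma^2 from the current residuals."""
    n = len(gammas)
    Q = sum(np.linalg.inv(F_i) + np.outer(g_i, g_i)
            for g_i, F_i in zip(gammas, curvatures)) / n
    sigma2 = np.var(y - eta)                          # Var(y - eta^[m])
    return Q, sigma2

rng = np.random.default_rng(3)
gammas = [rng.normal(size=2) for _ in range(30)]      # posterior modes gamma_i
curvatures = [4.0 * np.eye(2) for _ in range(30)]     # posterior curvatures F_i
y, eta = rng.normal(size=120), np.zeros(120)
Q, sigma2 = update_variance_components(gammas, curvatures, y, eta)
```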

Choice of steplength.
The step length $0 < \nu \leq 1$ controls the weakness of each update and is essential to avoid overfitting and to give each candidate variable an equal opportunity to be selected. We stick to the choice of $\nu = 0.1$ for both fixed and random effects updates, which is well established in the boosting community and thus allows a fair comparison with other boosting approaches. This ensures that none of the coefficient estimates grows too quickly.
Stopping iteration.

The algorithm is stopped based on AIC or BIC, i.e.

$$m^\ast = \arg\min_m \mathrm{AIC}^{[m]} \qquad \text{or} \qquad m^\ast = \arg\min_m \mathrm{BIC}^{[m]},$$

where $\mathrm{AIC}^{[m]}$ and $\mathrm{BIC}^{[m]}$ denote the information criteria after $m$ iterations. An alternative and computationally more burdensome stopping rule would rely on cluster-wise cross-validation [32], which is, however, asymptotically equivalent to the marginal AIC as used above [33].
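Since the criterion is recorded along the boosting path anyway, the stopping rule reduces to an argmin over that path; a minimal sketch with illustrative values:

```python
import numpy as np

# illustrative AIC values recorded after each boosting iteration m = 1, ..., 6
aic_path = np.array([500.0, 420.0, 390.0, 385.0, 388.0, 395.0])
m_star = int(np.argmin(aic_path)) + 1   # best performing iteration (1-based)
```

The estimates stored at iteration `m_star` are then returned as the final model fit.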

Simulation study
The algorithm is evaluated with a simulation study. The single simulation scenarios are described in the first subsection, while the results are discussed in the latter two. The primary focus is to show that the algorithm solves the identification problem of the random effects; it is thus compared to the bGLMM function of the GMMBoost package available on CRAN. Furthermore, its performance is compared to the classical method implemented in the lmer function of the lme4 package as well as the glmmLasso function of the same-named package, which is another popular approach to regularized regression with potentially high numbers of candidate variables. Please note that we did not include mboost in the comparison, as its approach to random effects is not able to estimate variance components for random intercepts, let alone covariance matrices for multiple slopes. The bGLMM function was also compared to the glmmPQL function [4] in [19]. The comparison focuses on mean squared errors of the estimates for fixed effects and the random structure as an indicator of overall performance and to address the identification problem. Variable selection properties are evaluated via true and false positive rates as well as false discovery rates. As a side note, we compare computational effort.
The second setup is a slightly altered scenario with two additional random slopes, one for a cluster-constant and one for a cluster-varying covariate, where

$$(\gamma_{0i}, \gamma_{1i}, \gamma_{2i})^T \sim N_3(0, Q), \qquad Q := \begin{pmatrix} \tau^2 & \tau^\ast & \tau^\ast \\ \tau^\ast & \tau^2 & \tau^\ast \\ \tau^\ast & \tau^\ast & \tau^2 \end{pmatrix},$$

with $\tau \in \{0.4, 0.8, 1.6\}$ and $\tau^\ast$ chosen so that $\mathrm{cor}(\gamma_{ki}, \gamma_{li}) = 0.6$ for all $k \neq l$. For $\beta = (\beta_0, \dots, \beta_p)^T$ we consider mean squared errors as an indicator of estimation accuracy,

$$\mathrm{mse}_\beta = \lVert \beta - \hat\beta \rVert^2, \qquad \mathrm{mse}_\tau = (\tau - \hat\tau)^2, \qquad \mathrm{mse}_Q = \lVert Q - \hat Q \rVert_F^2,$$

with $\lVert \cdot \rVert_F$ denoting the Frobenius norm of a given matrix, where $\mathrm{mse}_\tau$ applies to the random intercepts setup and $\mathrm{mse}_Q$ to the random slopes setup. Variable selection properties are evaluated by calculating false positives (FP), true positives (TP) and false discovery rates (FDR),

$$\mathrm{TP} = \frac{P_{\mathrm{sel}}}{P_{\mathrm{tot}}}, \qquad \mathrm{FP} = \frac{\bar P_{\mathrm{sel}}}{\bar P_{\mathrm{tot}}}, \qquad \mathrm{FDR} = \frac{\bar P_{\mathrm{sel}}}{P_{\mathrm{sel}} + \bar P_{\mathrm{sel}}},$$

where $P_{\mathrm{sel}}$ and $P_{\mathrm{tot}}$ denote the amounts of selected informative and total informative candidate variables, with $\bar P_{\mathrm{sel}}$ and $\bar P_{\mathrm{tot}}$ as the equivalents for non-informative covariates. Finally, the elapsed time is measured in seconds, where each simulation run was carried out on a 2 × 2.66 GHz 6-core Intel Xeon CPU (64GB RAM).
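The selection metrics can be computed directly from the true and estimated coefficient vectors; the helper below follows the definitions above (the convention of returning an FDR of 0 when nothing is selected is our choice):

```python
import numpy as np

def selection_metrics(beta_true, beta_hat):
    """True positive rate, false positive rate and false discovery
    rate of a variable selection."""
    informative = beta_true != 0
    selected = beta_hat != 0
    p_sel = np.sum(selected & informative)       # selected informative covariates
    p_bar_sel = np.sum(selected & ~informative)  # selected non-informative covariates
    tp = p_sel / np.sum(informative)             # P_sel / P_tot
    fp = p_bar_sel / np.sum(~informative)        # \bar P_sel / \bar P_tot
    fdr = p_bar_sel / (p_sel + p_bar_sel) if (p_sel + p_bar_sel) else 0.0
    return tp, fp, fdr

beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
beta_hat  = np.array([1.8, 0.0, -0.9, 0.3, 0.0])   # one false positive
tp, fp, fdr = selection_metrics(beta_true, beta_hat)
```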
Every single simulation setup was independently executed 100 times and, in order to account for the skewness of the mean squared error distributions, median values are reported for estimation accuracy and average values for variable selection properties and computation time. The bGLMM boosting algorithm was initialized with m_stop = 500, while lbbLMM was iterated up to m_stop = 1500. To determine the optimal penalization parameter for glmmLasso, the grid {500, 495, 490, . . ., 0} was used. All of the included regularization approaches were tuned using the BIC.

Table 1 summarizes results for estimation accuracy. In general, the lbbLMM algorithm produces very precise estimates, while the bGLMM function suffers from the described identification problem, yielding a minimum mean squared error of 2² + 4² = 20 as the cluster-constant covariates are not selected. All methods become less precise as the values of τ and p increase; only lbbLMM has stable error rates with regard to the number of candidate variables p. Overall, lbbLMM outperforms its competitors in every single scenario. Estimation accuracy of the random structure is described in Table 2. Estimates by lbbLMM behave similarly well to those by lmer, while the identification problem in bGLMM results in high error rates. While lying in the same range as lmer, lbbLMM clearly outperforms the remaining regularization approaches.

Table 3 depicts variable selection properties. While the selection quality of glmmLasso improves as p increases, both boosting approaches yield perfect properties with respect to false positives. However, the identification problem of bGLMM leads to a low true positives rate, as informative effects of cluster-constant covariates are being captured by the random intercepts. lbbLMM, on the other hand, has perfect selection properties with respect to both true and false positives.

Table 4 summarizes results for estimation accuracy in the random slopes setup. Except for the generally increased error rates, the behaviour is similar to the random intercepts setup (4). lbbLMM again performs stably as p increases and clearly outperforms the other regularization approaches. However, lmer has slightly better error rates in low-dimensional scenarios with higher values for τ, which can also be seen in Table 5. Table 6 depicts variable selection properties for the random slopes setup. Results are almost identical to the random intercepts setup described in Table 3.

Variable selection.

The presence of the identification problem in bGLMM is reflected in high errors for fixed and random effects and low true positive rates. Based on its good values for mse_β, mse_τ, mse_Q and TP, it can be stated that lbbLMM not only solves the problems occurring in bGLMM but also offers a reliable and well performing regularization approach to linear mixed models in general.

Computational effort.

Table 7 depicts average computation times for the random intercepts (4) and random slopes (5) setups. All regularization approaches show roughly linear scaling with an increasing number of candidate variables p. In most cases, glmmLasso runs noticeably faster than its two boosting competitors. However, a direct comparison is hard to interpret, as the computation time of glmmLasso strongly depends on the fineness of the grid used to determine the optimal penalization parameter. In addition, the corrupted updating process of bGLMM leads to substantially faster convergence, as its identification issue makes the algorithm capable of fitting multiple effects in one single iteration; this is also the reason why bGLMM runs faster in the random slopes setup. However, the computational effort for bGLMM is very sensitive to increasing numbers of total observations N (Table 8).

Primary biliary cirrhosis
The primary biliary cirrhosis (PBC) dataset from 1994 [34] tracks the change of the serum bilirubin level for a total of 312 PBC patients randomized into a treatment and a placebo group and additionally contains baseline covariates as well as follow-up measurements of several biomarkers. The dataset is, among others, available in the JM package [35], and Table 9 gives an overview of the single covariates included in the data and how they are coded in the model formula. The serum bilirubin level, here modelled as the response variable, is considered a strong indicator of disease progression; hence an appropriate quantification of the impact of the given covariates on the serum bilirubin level will lead to an adequate prediction model for the health status of PBC patients. Using boosting to carry out this quantification optimizes the prediction properties. For y_ij denoting the jth measurement of serum bilirubin for the ith patient, we formulate a random intercept model.

In general, the results reflect what was already observed in the simulation study. glmmLasso struggles with proper variable selection in lower dimensional scenarios, and bGLMM does not select any cluster-constant covariates due to its misspecified updating scheme. Although the effect of drug has a comparatively high p-value, the coefficient estimate by bGLMM stands out among all regularization approaches while the value for τ is simultaneously rather large, which indicates possible bias arising from wrongly identified random intercepts. The remaining variables selected by lbbLMM, on the other hand, tend to be of high impact depending on the chosen significance level. For lower choices, e.g. α ∈ {0.01, 0.005} [36], the selection by lbbLMM matches the covariates having significant impact, while the estimates simultaneously receive shrinkage from the regularization mechanism.
Regarding computational effort, bGLMM requires approximately 17 hours (60543 seconds) of computation time, around 600 times more than its direct competitor lbbLMM.

Discussion
Due to its minor and major tweaks, the updated algorithm is capable of dealing with cluster-constant covariates in linear mixed models by preventing the random effects from taking up too much of the explanatory space. In addition, it preserves the well-known advantages of boosting techniques in general by offering variable selection and good functionality even in high-dimensional setups. As an important side effect, the computational effort decreases tremendously, making the algorithm more applicable to real-world scenarios.
The primary hindrance of the lbbLMM algorithm is the missing approach to model choice, as the random effects structure has to be specified in advance and is not subject to any selection process. Although reasonable options regarding the random structure are limited in most real-world applications and could also be evaluated afterwards using appropriate information criteria, it remains an interesting question how one could incorporate proper model selection into the updating process while simultaneously preserving the advantages gained by the lbbLMM algorithm.
Canonical extensions of the successful concept include incorporating non-linear predictor functions, i.e. the estimation of smooth effects based on P-splines, or extending the algorithm from linear mixed models to generalized mixed models to allow more flexible inference for a wider class of data structures. Both have been incorporated in [37] for classical likelihood-based boosting, and it is assumed that the tweaks proposed in the present work would improve the performance of these more flexible approaches as well.