Generalized Linear Mixed Models for Binary Data: Are Matching Results from Penalized Quasi-Likelihood and Numerical Integration Less Biased?

Background Over time, adaptive Gaussian Hermite quadrature (QUAD) has become the preferred method for estimating generalized linear mixed models with binary outcomes. However, penalized quasi-likelihood (PQL) is still used frequently. In this work, we systematically evaluated whether matching results from PQL and QUAD indicate less bias in estimated regression coefficients and variance parameters via simulation. Methods We performed a simulation study in which we varied the size of the data set, probability of the outcome, variance of the random effect, number of clusters and number of subjects per cluster, etc. We estimated bias in the regression coefficients, odds ratios and variance parameters as estimated via PQL and QUAD. We ascertained if similarity of estimated regression coefficients, odds ratios and variance parameters predicted less bias. Results Overall, we found that the absolute percent bias of the odds ratio estimated via PQL or QUAD increased as the PQL- and QUAD-estimated odds ratios became more discrepant, though results varied markedly depending on the characteristics of the dataset Conclusions Given how markedly results varied depending on data set characteristics, specifying a rule above which indicated biased results proved impossible. This work suggests that comparing results from generalized linear mixed models estimated via PQL and QUAD is a worthwhile exercise for regression coefficients and variance components obtained via QUAD, in situations where PQL is known to give reasonable results.


Introduction
Increasingly, data are collected in which the standard assumption of independence between observations is not met. This could include data that consist of multiple observations on a subject over time or subjects who are clustered in some way (e.g. classes within schools, or households within neighbourhoods). As computational power has grown, analytic methods have been extended to handle increasingly complex data structures.
If the association between observations on the same cluster/ subject is not accounted for in the analytic strategy, inference associated with the estimated parameters may not be correct [1]. When the outcome is binary, one of the main analytic approaches in this context are generalized linear mixed models (GLMMs) [2].
GLMMs extend the linear mixed model to deal with outcomes with non-normal distributions. In particular, GLMMs can handle binary outcomes. In GLMMs, subject-specific random effects, usually normally-distributed, are incorporated in the model. In this way, the second order structure or correlation between subjects in the same cluster can be described and accounted for. When the outcome is binary, in GLMMs the regression coefficient is estimated conditional on the random effect [2], and as such, has a subject-specific interpretation [3,4].
To estimate the parameters in the GLMM, maximizing the exact likelihood involves an intractable integration. Several approaches have been proposed to get around this. A commonly used method is penalized quasi-likelihood (PQL), proposed by Breslow and Clayton [5]. In this implementation, a Laplace approximation is used, resulting in an approximated likelihood function with Normal distribution [5]. One advantage of PQL is that it can accommodate complex correlation structures. However, estimates can be badly biased especially with few subjects per cluster, low event rates, or high inter-cluster variability, because the method uses an approximation to the likelihood [6][7][8].
The main competitor to PQL is numerical integration via adaptive Gaussian Hermite quadrature (QUAD) [9,10]. While QUAD is not computationally efficient for multidimensional random effects (e.g. time series), it can perform adequately with few random effects [1]. While quadrature provides accurate   estimates for regression coefficients under a variety of conditions, convergence of QUAD is often a problem, particularly when variance parameters are close to zero or cluster sizes are small [11,12]. Despite many studies investigating the statistical properties of parameter estimates from generalized linear mixed models (GLMMs), it still remains somewhat unclear under what conditions good properties can be expected from either of these methods. In particular, when the number of clusters is small and the number of subjects per cluster is small, neither PQL nor QUAD are guaranteed to give good results for regression coefficients [13]. Similarly, estimated variance components are often biased with both approaches (e.g. [11]).
If matching results from GLMMs estimated via PQL and QUAD indicated relatively unbiased regression coefficients or variance parameters, this could be an easy ''diagnostic'' to perform. Both estimation methods are available in SAS and R.
In this work, we investigate systematically whether matching regression coefficients, odds ratios or variance components from PQL and QUAD suggest minimal bias in those same parameters. Moreover, we attempt to develop useable guidelines based on comparing the results from PQL and QUAD. For example, should the comparison be between estimated regression coefficients, estimated variance components or both? Moreover, how close is close enough?

Materials and Methods
Statistical simulation was used to assess whether matching results from PQL and QUAD indicate less bias.

Data generation
Our data generation algorithm was developed to generate clustered data. We imagined working in the clustered data context (e.g. children grouped in classes, or people clustered in neighbour- hoods), rather than longitudinal, repeated measures type data We generated an outcome (Y ij ) and a predictor (X ij ). Here, i denotes the cluster, and j denotes the subject within the cluster. Thus Y ij is the outcome observed for subject j from cluster i. The dichotomous independent variable, X ij , was generated from a Bernoulli distribution with probability = 0.5. To generate the corresponding dichotomous outcome variable Y ij , first the probability of the outcome was generated from the following logistic regression model: where u i was a random effect generated from a normal distribution with mean = 0 and variance = s 2 u . By including u i in the data generation step, correlation between observations in the same cluster is induced. Then the dichotomous Y ij variable was generated from a Bernouilli distribution with the probability of the outcome provided by the logistic regression (equation 1). The number of clusters, number of subjects per cluster, b 1 , variance of the random effect, and proportion of subjects with the outcome were all varied, with levels described in Table 1. For each distinct combination (n = 424) of parameters (''simulation scenarios''), 250 data sets were generated, which gave us adequately precise results, while allowing us to investigate a wide range of simulation scenarios.

Data analysis
Two GLMMs (random effects logistic regression models) were fit to the data, including the exposure as an independent variable, and allowing the intercept to vary across the clusters. The model parameters were estimated using penalized quasi-likelihood (PQL) and adaptive Gaussian Hermite quadrature (QUAD). Both models were fit using the GLIMMIX procedure in SAS version 9.2.

Measures of performance
We collected information on bias and variability of the estimated regression coefficient for X (b b 1 ), and odds ratio (exp(b b 1 )), )), as well as the estimated variance of the random intercepts (ŝ s 2 m ), as estimated via PQL or QUAD.
We quantified the proximity of the PQL and QUAD results as the absolute percent difference between the estimated odds ratios, Comparing PQL and Numerical Integration PLOS ONE | www.plosone.org according to the following formula: whereb b 1 PQL andb b 1 QUAD were the regression coefficients as estimated via PQL or QUAD, respectively. The estimated variance components were compared according to the following formula: whereŝ s 2 PQL andŝ s 2 QUAD were the variance of the random effects as estimated by PQL or QUAD, respectively.
We also quantified how close results were to the truth, via the following formulae: absolute percent bias of OR QUADã bsolute value defined as above. When s 2 or b 1 was zero we divided by 1 in formulae 5 and 6.

Data analysis of simulation results
We removed observations where PQL or QUAD did not converge. Model convergence was defined as a model that produced estimates for relevant parameters and did not return a warning. We estimated convergence for each estimation procedure as the number of simulation repetitions that did converge divided by the total attempted (n = 250).
We estimated the median absolute percent bias of the regression coefficients and random intercept variances as estimated via PQL or QUAD for each simulation scenario.
We fit linear regressions, with absolute percent bias of the estimated odds ratios and absolute percent bias of the variance component (e.g. formulae 3-6) as the outcome and measures of how close PQL and QUAD results were (e.g. formulas 1-2) as the predictors, overall and separately for each combination of data generation parameters (i.e. in 424 distinct data generation scenarios). We report the median estimated slope and interquartile range of the slope, the proportion of scenarios in which the predictor was statistically significant and the median R 2 and interquartile range of the R 2 for the models overall (i.e. across all 424 scenarios investigated), as well as by data generation parameters.
Finally, we used mixed quantile regression [14] with absolute percent bias of the estimated odds ratios and variance components (e.g. formulae 3-6) as the outcome and measures of how close PQL and QUAD results were (e.g. formulae 1 and 2) as the predictor of primary interest, and data generation parameters as covariates in the model (i.e. proportion with the outcome, data set size, data set composition, b 1 , s u 2 .) Data set composition categorized data sets as having few large clusters (when the number of clusters was 6), many small clusters (when cluster size was 2), or moderate (all other possibilities). All covariates were entered as dummy variables in the model. Intercepts and the coefficient for similarity of PQL and QUAD results were allowed to vary across simulation scenario.
All data generation and analyses were carried out using SAS/ STAT version 9.2 [15], with the exception of the linear mixed quantile regression which was performed in R version 2.14.2 [16].

Results
Tables 2 and 3 present the median and interquartile range of the absolute percent bias and mean squared error (MSE) of the regression coefficient and variance of the random intercept, respectively, as estimated via QUAD and PQL. Overall, median bias in the PQL or QUAD-estimated regression coefficient was around 30% and increased as the variance of the random effect increased, the proportion with the outcome decreased, the number of observations in the dataset decreased. (See Table 2.) On the other hand, the estimated variance of the random intercept was more biased when estimated via PQL than via QUAD (median absolute percent bias was 29% for QUAD vs. 48% for PQL). For both estimation methods, bias increased as the proportion with the outcome decreased and the size of the dataset decreased. For QUAD, bias decreased as the number of clusters increased, while for PQL the opposite was observed. (See Table 3.) Nonconvergence occurred more often with QUAD than PQL (mean proportion across all simulation scenarios was 8.8 vs. 2.3), and was especially problematic when the proportion of subjects with the outcome was 5% (mean proportion of nonconvergence was 32% for QUAD, but just 8.2 percent for PQL, data not shown). When QUAD did not converge, but PQL did converge, median bias was higher for the PQL-estimated regression coefficient (median = 0.53 with IQR = 0.33-0.88) and variance of the random effect (median = 0.72, IQR: 0.51-0.87) for the estimated. (See Table S1.) The estimated slope was generally positive when absolute percent bias in OR QUAD was the outcome (See Figure 1). The median slope overall was 8.8, suggesting that for a one percent increase in difference between OR QUAD and OR PQL , the absolute percent bias in OR QUAD increased by 8.8. However, the interquartile range was quite wide. For example, the interquartile range of slopes was 7 to 33, 4 to 24 and 2 to 20 for small, medium and large datasets respectively. The estimated slope was never statistically significantly negative. The estimated slope for the effect of absolute percent difference in OR PQL and OR QUAD on the absolute percent bias in OR PQL was statistically significantly negative for 14% (i.e. 60 out of 424) of the data scenarios investigated. The slope was more likely to be negative as the magnitude of b 1 increased, the proportion of subjects with the outcome increased, the size of the data set increased, if there were few observations per cluster, or the intercluster variability was high. (Data not shown).
Overall, in most simulation scenarios the absolute percent difference in OR PQL and OR QUAD was a statistically significant predictor of the absolute percent bias in OR PQL or OR QUAD , respectively, though more often when absolute percent bias in OR QUAD was used as the outcome. (See Figure 2.) The proportion of scenarios in which the absolute percent difference in OR PQL and OR QUAD was a statistically significant predictor decreased as the true regression coefficient increased; and increased as the intercluster variance increased. This proportion decreased as the total number of subjects increased (See Figure 2). The smallest proportion statistically significant were seen when datasets comprised 1500 observations in 6 clusters. Table 4. Results from a linear mixed quantile regression model with absolute percent bias in the odds ratio estimated via PQL or QUAD as the dependent variable and absolute percent difference in the odds ratios as estimated via PQL and QUAD as the independent variable, adjusted for data set characteristics (b 1 , s 2 u , proportion with the outcome (p), total number of observations in the data set and data set composition). A similar pattern of results was seen for the median R 2 of the linear regressions (See Figure 3), with results ranging from 0.08 to 0.45, and 0.03 to 0.31 for OR QUAD and OR PQL , respectively. The worst results were seen when s 2 u = 0, while the best results were seen when b 1 = 0.
We used a linear mixed quantile regression model was used to model the association between absolute percent difference in OR PQL and OR QUAD on the absolute percent bias in OR PQL or OR QUAD . We found that overall median absolute percent bias in OR QUAD increased by 6.5% (95% CI: 4.6-8.4) for each 1% difference in the absolute percent difference in OR PQL and OR QUAD , after adjusting for data set characteristics. However, this slope was quite variable -the variance of the random effect was 8.2. The association was less strong when absolute percent bias in OR PQL was used as the outcome: median bias in OR PQL increased by 1.2% (95% CI: 0.8-1.6) for each 1% difference in the absolute percent difference in OR PQL and OR QUAD , after adjusting for data set characteristics. This slope was less variable -the variance of the random effect was 1.3. See Table 4.
In addition to looking at bias in the odds ratios estimate via PQL and QUAD, we also considered using the regression coefficient. However, results were in general, poorer with smaller slopes, lower R 2 and smaller proportion statistically significant. (Data not shown.) When absolute percent difference in s 2 uPQL and s 2 uQUAD was used as the predictor for the absolute percent bias of s 2 uQUAD and s 2 uPQL , respectively, the estimated slope varied quite widely, especially when absolute percent bias in s 2 uPQL was used as the outcome. (See Figure 4.) The proportion of scenarios in which this was statistically significant was high (e.g. 85% and 91%, respectively). (See Figure 5.) The median proportion of variance explained by the predictor was 13% and 52%, respectively. (See Figure 6). Indeed, it seemed to be a much stronger predictor for PQL than for QUAD. was the outcome -in that case, the median slope was negative.
The slope was negative in 18% and 75% percent of simulation scenarios for QUAD and PQL, respectively. For PQL, negative slopes were more likely to occur when the variance of the random effect was bigger and when there were fewer subjects per cluster. (Data not shown.) We used a linear mixed quantile regression model was used to model the association between absolute percent difference in s 2 uPQL and s 2 uQUAD on the absolute percent bias in s 2 uPQL or s 2 uQUAD . The association was not statistically significant for s 2 uQUAD . The association was small and quite variable for s 2 uPQL , after adjusting for data set characteristics. See Table 5. The absolute difference in s 2 uPQL and s 2 uQUAD was not a very good predictor for the absolute percent bias in OR QUAD or OR PQL -fewer models were statistically significant (e.g. 29% overall for QUAD and 16% overall for PQL), R 2 was low, and the estimated slope was close to 0. (Data not shown.) The absolute percent difference in OR PQL and OR QUAD was not a good predictor of the absolute bias of s 2 uPQL or s 2 uQUAD . It was often statistically significant (e.g. 66% and 83% overall for QUAD and PQL, respectively), though R 2 was usually less than 0.3. In fact, the median slope across all scenarios was negative for PQL. (Data not shown.)

Discussion
Over time, adaptive Gaussian Hermite quadrature has become the gold standard for fitting generalized linear mixed models with binary outcomes. However, given the greater flexibility in terms of modelling correlation structures available with penalized quasilikelihood, and better convergence due to simpler estimation, PQL is still used frequently. Moreover, in some scenarios, neither approach uniformly gives good results. In this work, we systematically evaluated whether matching results from PQL and QUAD indicate less bias in estimated regression coefficients and variance parameters.
Overall, we found that the absolute percent bias of the odds ratio estimated via PQL or QUAD increased as the PQL-and QUAD-estimated odds ratios became more discrepant. While the estimated slope for the association between the absolute percent difference in the PQL-and QUAD-estimated odds ratios and the absolute percent bias of the odds ratio estimated via PQL or QUAD varied markedly depending on the characteristics of the dataset, the association for QUAD was almost always positive. In contrast, when using the absolute bias of the OR estimated via PQL as the outcome, the slope was often negative. In fact, it was negative in scenarios that are known to produce biased results for PQL -namely few subjects per cluster and high intercluster variability [5,17,18]. In these cases, the higher the discrepancy between the results, the more biased the PQL estimated odds ratio was.
We found that the absolute difference in s 2 uPQL and s 2 uQUAD was not a strong predictor for the absolute bias of s 2 uQUAD or the odds ratios estimated via PQL or QUAD. Moineddin et al. found that with two level data structures, the variance components were extremely overestimated with small groups and slightly underestimated with moderate group size for GLMM estimated via quadrature [19]. PQL has been found to underestimate the variance components when the denominator is small [7]. We found that absolute percent bias for s 2 u was greater for PQL than quadrature. For PQL, bias was worse when group size was small while for QUAD bias was worse when the number of groups was small. Given how markedly results varied depending on data set characteristics, specifying some cutpoints above which indicated biased results proved impossible. For example, when identifying odds ratios estimated via QUAD or PQL that were more than 30% biased and using the discrepancy between QUAD and PQL as the test, the area under the curve of the receiver operator curve was 66% for QUAD and 60% for PQL across all scenarios. Despite this, our results show that discrepant results may indicate increased bias.
One strength of this work was the use of simulations to systematically investigate the robustness of the association between similarity in PLQ and QUAD estimates as predictors of bias PQLand QUAD-regression coefficients and variance components. This allowed us to investigate the impact of a wide range of data set characteristics on these associations. Indeed, we varied data set size and composition, proportion of subjects with the outcome, magnitude of the effect under study, and inter-cluster variability in over 400 distinct data generation scenarios. Despite this, our scenarios were certainly not exhaustive.
Moreover, we made many simplifying decisions. We considered data sets with only one categorical predictor, only one level of clustering, and only generated data with normally distributed random intercepts, not random slopes, or more complicated correlation structures. Finally, we have only compared two methods, whereas some may also have been interested in comparing Bayesian methods of estimation [20], or other approaches.
This work suggests that comparing results from generalized linear mixed models estimated via PQL and QUAD is a worthwhile exercise for regression coefficients and variance components obtained via QUAD, in situations where PQL is known to give reasonable results. Results were less useful for results obtained via PQL. In both cases, results strongly depended on features of the data set, making it difficult to create a simple-toimplement rule.

Supporting Information
Table S1 Median (IQR) proportion of samples in which the model did not converge overall and by data generation parameters.  Table 5. Results from a linear mixed quantile regression model with absolute percent bias in s 2 u estimated via PQL or QUAD as the dependent variable and absolute percent different in s 2 u as estimated via PQL and QUAD as the independent variable, adjusted for data set characteristics (b 1 , s 2 u , proportion with the outcome (p), total number of observations in the data set and data set composition).