## Abstract

### Background

Over time, adaptive Gaussian Hermite quadrature (QUAD) has become the preferred method for estimating generalized linear mixed models with binary outcomes. However, penalized quasi-likelihood (PQL) is still used frequently. In this work, we systematically evaluated whether matching results from PQL and QUAD indicate less bias in estimated regression coefficients and variance parameters via simulation.

### Methods

We performed a simulation study in which we varied the size of the data set, the probability of the outcome, the variance of the random effect, the number of clusters, the number of subjects per cluster and the magnitude of the regression coefficient. We estimated bias in the regression coefficients, odds ratios and variance parameters as estimated via PQL and QUAD. We then assessed whether similarity of the PQL and QUAD estimates of the regression coefficients, odds ratios and variance parameters predicted less bias.

### Results

Overall, we found that the absolute percent bias of the odds ratio estimated via PQL or QUAD increased as the PQL- and QUAD-estimated odds ratios became more discrepant, though results varied markedly depending on the characteristics of the dataset.

### Conclusions

Given how markedly results varied depending on data set characteristics, it proved impossible to specify a cutoff for the discrepancy between PQL and QUAD above which results should be considered biased.

This work suggests that comparing results from generalized linear mixed models estimated via PQL and QUAD is a worthwhile exercise for regression coefficients and variance components obtained via QUAD, in situations where PQL is known to give reasonable results.

**Citation:** Benedetti A, Platt R, Atherton J (2014) Generalized Linear Mixed Models for Binary Data: Are Matching Results from Penalized Quasi-Likelihood and Numerical Integration Less Biased? PLoS ONE 9(1): e84601. doi:10.1371/journal.pone.0084601

**Editor:** Gerardo Chowell, Arizona State University, United States of America

**Received:** July 3, 2013; **Accepted:** November 15, 2013; **Published:** January 9, 2014

**Copyright:** © 2014 Benedetti et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding:** The authors have no support or funding to report.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Increasingly, data are collected in which the standard assumption of independence between observations is not met. This could include data that consist of multiple observations on a subject over time or subjects who are clustered in some way (e.g. classes within schools, or households within neighbourhoods). As computational power has grown, analytic methods have been extended to handle increasingly complex data structures.

If the association between observations on the same cluster/subject is not accounted for in the analytic strategy, inference associated with the estimated parameters may not be correct [1]. When the outcome is binary, one of the main analytic approaches in this context is the generalized linear mixed model (GLMM) [2].

GLMMs extend the linear mixed model to deal with outcomes with non-normal distributions. In particular, GLMMs can handle binary outcomes. In GLMMs, subject-specific random effects, usually normally-distributed, are incorporated in the model. In this way, the second order structure or correlation between subjects in the same cluster can be described and accounted for. When the outcome is binary, in GLMMs the regression coefficient is estimated conditional on the random effect [2], and as such, has a subject-specific interpretation [3], [4].

To estimate the parameters in the GLMM, maximizing the exact likelihood involves an intractable integration. Several approaches have been proposed to get around this. A commonly used method is penalized quasi-likelihood (PQL), proposed by Breslow and Clayton [5]. In this implementation, a Laplace approximation is used, resulting in an approximated likelihood function with Normal distribution [5]. One advantage of PQL is that it can accommodate complex correlation structures. However, estimates can be badly biased especially with few subjects per cluster, low event rates, or high inter-cluster variability, because the method uses an approximation to the likelihood [6]–[8].
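For concreteness, the integral in question can be written out for a logistic model with a single normally distributed random intercept. The notation here ($m$ clusters, $n_i$ subjects in cluster $i$, intercept $\beta_0$, normal density $\phi$) is assumed for illustration and is not taken from the original text:

```latex
L(\beta_0, \beta_1, \sigma_u^2)
  = \prod_{i=1}^{m} \int_{-\infty}^{\infty}
    \left[ \prod_{j=1}^{n_i} p_{ij}(u)^{y_{ij}} \{1 - p_{ij}(u)\}^{1 - y_{ij}} \right]
    \phi(u; 0, \sigma_u^2)\, du,
\qquad
p_{ij}(u) = \frac{\exp(\beta_0 + \beta_1 x_{ij} + u)}{1 + \exp(\beta_0 + \beta_1 x_{ij} + u)}
```

The integral over $u$ has no closed form, which is why PQL replaces it with an approximation and quadrature evaluates it numerically.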

The main competitor to PQL is numerical integration via adaptive Gaussian Hermite quadrature (QUAD) [9], [10]. While QUAD is not computationally efficient for multidimensional random effects (e.g. time series), it can perform adequately with few random effects [1]. While quadrature provides accurate estimates for regression coefficients under a variety of conditions, convergence of QUAD is often a problem, particularly when variance parameters are close to zero or cluster sizes are small [11], [12].
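To illustrate the idea behind quadrature, the following minimal Python sketch approximates the marginal probability of the outcome, E[expit(η + u)] with u ~ N(0, σ_u²), using a fixed three-point Gauss-Hermite rule. It is deliberately simplified: it is not adaptive (the nodes are not recentred or rescaled per cluster), it uses far fewer nodes than a real fit would, and the function names are hypothetical rather than part of any package.

```python
import math

# Standard three-point Gauss-Hermite nodes and weights for integrals of the
# form: integral of exp(-t^2) * f(t) dt.
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (
    math.sqrt(math.pi) / 6.0,
    2.0 * math.sqrt(math.pi) / 3.0,
    math.sqrt(math.pi) / 6.0,
)


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_probability(eta, sigma_u):
    """Approximate E[expit(eta + u)] for u ~ N(0, sigma_u^2).

    The substitution u = sqrt(2) * sigma_u * t turns the normal density
    into exp(-t^2) / sqrt(pi), which matches the Gauss-Hermite weight.
    """
    total = 0.0
    for t, w in zip(GH_NODES, GH_WEIGHTS):
        total += w * expit(eta + math.sqrt(2.0) * sigma_u * t)
    return total / math.sqrt(math.pi)
```

Adaptive quadrature applies the same weighted-sum idea, but shifts and scales the nodes for each cluster so that they sit where the integrand has most of its mass.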

Despite many studies investigating the statistical properties of parameter estimates from generalized linear mixed models (GLMMs), it remains somewhat unclear under what conditions good properties can be expected from either of these methods. In particular, when the number of clusters is small and the number of subjects per cluster is small, neither PQL nor QUAD is guaranteed to give good results for regression coefficients [13]. Similarly, estimated variance components are often biased with both approaches (e.g. [11]).

If matching results from GLMMs estimated via PQL and QUAD indicated relatively unbiased regression coefficients or variance parameters, this could be an easy “diagnostic” to perform. Both estimation methods are available in SAS and R.

In this work, we investigate systematically whether matching regression coefficients, odds ratios or variance components from PQL and QUAD suggest minimal bias in those same parameters. Moreover, we attempt to develop useable guidelines based on comparing the results from PQL and QUAD. For example, should the comparison be between estimated regression coefficients, estimated variance components or both? Moreover, how close is close enough?

## Materials and Methods

Statistical simulation was used to assess whether matching results from PQL and QUAD indicate less bias.

### Data generation

Our data generation algorithm was developed to generate clustered data. We imagined working in the clustered data context (e.g. children grouped in classes, or people clustered in neighbourhoods), rather than with longitudinal, repeated-measures data. We generated an outcome (Y_{ij}) and a predictor (X_{ij}). Here, i denotes the cluster, and j denotes the subject within the cluster. Thus Y_{ij} is the outcome observed for subject j from cluster i. The dichotomous independent variable, X_{ij}, was generated from a Bernoulli distribution with probability = 0.5. To generate the corresponding dichotomous outcome variable Y_{ij}, first the probability of the outcome was generated from the following logistic regression model:

$$\operatorname{logit}\{P(Y_{ij} = 1)\} = \beta_{0} + \beta_{1} X_{ij} + u_{i} \qquad (1)$$

where u_{i} was a random effect generated from a normal distribution with mean = 0 and variance = σ_{u}^{2}. By including u_{i} in the data generation step, correlation between observations in the same cluster is induced. Then the dichotomous Y_{ij} variable was generated from a Bernoulli distribution with the probability of the outcome provided by the logistic regression (equation 1). The number of clusters, number of subjects per cluster, β_{1}, variance of the random effect, and proportion of subjects with the outcome were all varied, with levels described in Table 1. For each distinct combination (n = 424) of parameters (“simulation scenarios”), 250 data sets were generated, which gave us adequately precise results, while allowing us to investigate a wide range of simulation scenarios.
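The data generation steps described above can be sketched in a few lines of Python. This is an illustration only, not the SAS code used in the study: the function name, the argument names and the intercept `beta0` (assumed here to be the quantity that controls the proportion of subjects with the outcome) are hypothetical, and the linear predictor is assumed to be β_{0} + β_{1}X_{ij} + u_{i}.

```python
import math
import random


def simulate_clustered_binary(n_clusters, cluster_size, beta0, beta1,
                              sigma2_u, seed=None):
    """Return (cluster, x, y) rows from a random-intercept logistic model."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_clusters):
        # Cluster-level random intercept, shared by all subjects in cluster i.
        u_i = rng.gauss(0.0, math.sqrt(sigma2_u))
        for _ in range(cluster_size):
            # Dichotomous predictor from a Bernoulli(0.5) distribution.
            x_ij = 1 if rng.random() < 0.5 else 0
            # Probability of the outcome from the logistic model.
            p_ij = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x_ij + u_i)))
            # Dichotomous outcome from a Bernoulli(p_ij) distribution.
            y_ij = 1 if rng.random() < p_ij else 0
            rows.append((i, x_ij, y_ij))
    return rows
```

For example, `simulate_clustered_binary(6, 250, -1.0, 0.5, 1.0, seed=1)` would produce one data set of 1500 observations in 6 clusters; repeating the call with different seeds would give the replicate data sets for one scenario.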

### Data analysis

Two GLMMs (random effects logistic regression models) were fit to the data, including the exposure as an independent variable, and allowing the intercept to vary across the clusters. The model parameters were estimated using penalized quasi-likelihood (PQL) and adaptive Gaussian Hermite quadrature (QUAD). Both models were fit using the GLIMMIX procedure in SAS version 9.2.

### Measures of performance

We collected information on bias and variability of the estimated regression coefficient for X ($\hat{\beta}_{1}$), and odds ratio ($\exp(\hat{\beta}_{1})$), as well as the estimated variance of the random intercepts ($\hat{\sigma}^{2}_{u}$), as estimated via PQL or QUAD.

We quantified the proximity of the PQL and QUAD results as the absolute percent difference between the estimated odds ratios $\exp(\hat{\beta}_{1,PQL})$ and $\exp(\hat{\beta}_{1,QUAD})$ (formula 1), where $\hat{\beta}_{1,PQL}$ and $\hat{\beta}_{1,QUAD}$ were the regression coefficients as estimated via PQL and QUAD, respectively.

The estimated variance components were compared in the same way (formula 2), using $\hat{\sigma}^{2}_{u,PQL}$ and $\hat{\sigma}^{2}_{u,QUAD}$, the variance of the random effects as estimated by PQL and QUAD, respectively.

We also quantified how close results were to the truth as the absolute percent bias of each estimate (formulae 3–6), with terms defined as above. When σ^{2} or β_{1} was zero we divided by 1 in formulae 5 and 6.
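As a rough illustration of these measures, the sketch below computes an absolute percent difference between two odds ratios and an absolute percent bias relative to a true value. The exact formulae are not reproduced in the text, so the choices here are assumptions: in particular, the QUAD-estimated odds ratio is used as the denominator of the percent difference, and both function names are hypothetical.

```python
import math


def abs_percent_difference(beta_pql, beta_quad):
    """Absolute percent difference between the PQL and QUAD odds ratios.

    Assumption: the QUAD-estimated odds ratio is the reference value.
    """
    or_pql = math.exp(beta_pql)
    or_quad = math.exp(beta_quad)
    return 100.0 * abs(or_pql - or_quad) / or_quad


def abs_percent_bias(estimate, truth):
    """Absolute percent bias of an estimate relative to the true value.

    Following the text, the denominator is set to 1 when the truth is zero.
    """
    denominator = abs(truth) if truth != 0 else 1.0
    return 100.0 * abs(estimate - truth) / denominator
```

For instance, `abs_percent_bias(math.exp(0.6), math.exp(0.5))` gives the percent bias of an estimated odds ratio when the true coefficient is 0.5 and the estimate is 0.6.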

### Data analysis of simulation results

We removed observations where PQL or QUAD did not converge. Model convergence was defined as a model that produced estimates for relevant parameters and did not return a warning. We estimated convergence for each estimation procedure as the number of simulation repetitions that did converge divided by the total attempted (n = 250).

We estimated the median absolute percent bias of the regression coefficients and random intercept variances as estimated via PQL or QUAD for each simulation scenario.

We fit linear regressions, with absolute percent bias of the estimated odds ratios and absolute percent bias of the variance component (e.g. formulae 3–6) as the outcome and measures of how close PQL and QUAD results were (e.g. formulae 1–2) as the predictors, overall and separately for each combination of data generation parameters (i.e. in 424 distinct data generation scenarios). We report the median estimated slope and interquartile range of the slope, the proportion of scenarios in which the predictor was statistically significant, and the median R^{2} and interquartile range of the R^{2} for the models overall (i.e. across all 424 scenarios investigated), as well as by data generation parameters.
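The per-scenario regressions are ordinary simple linear regressions, so the slope and R^{2} for one scenario can be obtained as in the following sketch. The function name is hypothetical, the study itself used SAS, and the significance test for the slope is omitted here.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares of y on x; returns (intercept, slope, r_squared).

    x: PQL/QUAD discrepancy for each converged replicate in a scenario.
    y: absolute percent bias for the same replicates.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, R^2 equals the squared correlation.
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r_squared
```

Applying this to each of the scenarios and then taking the median and quartiles of the resulting slopes and R^{2} values would give summaries of the kind reported in the Results.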

Finally, we used mixed quantile regression [14] with absolute percent bias of the estimated odds ratios and variance components (e.g. formulae 3–6) as the outcome and measures of how close PQL and QUAD results were (e.g. formulae 1 and 2) as the predictor of primary interest, and data generation parameters as covariates in the model (i.e. proportion with the outcome, data set size, data set composition, β_{1}, σ_{u}^{2}). Data set composition categorized data sets as having few large clusters (when the number of clusters was 6), many small clusters (when cluster size was 2), or moderate (all other possibilities). All covariates were entered as dummy variables in the model. Intercepts and the coefficient for similarity of PQL and QUAD results were allowed to vary across simulation scenarios.
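The data set composition covariate is a simple three-level rule, which could be coded as in the sketch below. The function name and the category labels are illustrative, and checking the number of clusters before the cluster size is an assumption about precedence that the text does not specify.

```python
def dataset_composition(n_clusters, cluster_size):
    """Categorize a simulated data set by its cluster structure."""
    if n_clusters == 6:
        return "few large clusters"
    if cluster_size == 2:
        return "many small clusters"
    return "moderate"
```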

All data generation and analyses were carried out using SAS/STAT version 9.2 [15], with the exception of the linear mixed quantile regression which was performed in R version 2.14.2 [16].

## Results

Tables 2 and 3 present the median and interquartile range of the absolute percent bias and mean squared error (MSE) of the regression coefficient and variance of the random intercept, respectively, as estimated via QUAD and PQL. Overall, median bias in the PQL- or QUAD-estimated regression coefficient was around 30% and increased as the variance of the random effect increased, the proportion with the outcome decreased, and the number of observations in the dataset decreased. (See Table 2.)

On the other hand, the estimated variance of the random intercept was more biased when estimated via PQL than via QUAD (median absolute percent bias was 29% for QUAD vs. 48% for PQL). For both estimation methods, bias increased as the proportion with the outcome decreased and the size of the dataset decreased. For QUAD, bias decreased as the number of clusters increased, while for PQL the opposite was observed. (See Table 3.)

Nonconvergence occurred more often with QUAD than PQL (mean proportion across all simulation scenarios was 8.8% vs. 2.3%), and was especially problematic when the proportion of subjects with the outcome was 5% (mean proportion of nonconvergence was 32% for QUAD, but just 8.2% for PQL; data not shown). When QUAD did not converge but PQL did, median bias was higher for the PQL-estimated regression coefficient (median = 0.53, IQR: 0.33–0.88) and variance of the random effect (median = 0.72, IQR: 0.51–0.87). (See Table S1.)

Figures 1–3 present the results from separate simple linear regressions to model the effect of the absolute percent difference in OR_{PQL} and OR_{QUAD} (equation 1) on the absolute percent bias in OR_{QUAD} (equation 4), and the absolute percent bias in OR_{PQL} (equation 3), respectively, overall and by data generation parameters.

Legend for Figure 1: the center of each box is the median of the estimated slope, the box edges are the 25^{th} and 75^{th} percentiles, and the ends of the dashed lines are the 10^{th} and 90^{th} percentiles.

Legend for Figure 3: the center of each box is the median of the R^{2}, the box edges are the 25^{th} and 75^{th} percentiles, and the ends of the dashed lines are the 10^{th} and 90^{th} percentiles.

The estimated slope was generally positive when absolute percent bias in OR_{QUAD} was the outcome (See Figure 1). The median slope overall was 8.8, suggesting that for a one percent increase in difference between OR_{QUAD} and OR_{PQL}, the absolute percent bias in OR_{QUAD} increased by 8.8. However, the interquartile range was quite wide. For example, the interquartile range of slopes was 7 to 33, 4 to 24 and 2 to 20 for small, medium and large datasets respectively. The estimated slope was never statistically significantly negative. The estimated slope for the effect of absolute percent difference in OR_{PQL} and OR_{QUAD} on the absolute percent bias in OR_{PQL} was statistically significantly negative for 14% (i.e. 60 out of 424) of the data scenarios investigated. The slope was more likely to be negative as the magnitude of β_{1} increased, the proportion of subjects with the outcome increased, the size of the data set increased, if there were few observations per cluster, or the intercluster variability was high. (Data not shown).

Overall, in most simulation scenarios the absolute percent difference in OR_{PQL} and OR_{QUAD} was a statistically significant predictor of the absolute percent bias in OR_{PQL} or OR_{QUAD}, respectively, though more often when absolute percent bias in OR_{QUAD} was used as the outcome. (See Figure 2.) The proportion of scenarios in which the absolute percent difference in OR_{PQL} and OR_{QUAD} was a statistically significant predictor decreased as the true regression coefficient increased, and increased as the intercluster variance increased. This proportion decreased as the total number of subjects increased. (See Figure 2.) The smallest proportions of statistically significant results were seen when datasets comprised 1500 observations in 6 clusters.

A similar pattern of results was seen for the median R^{2} of the linear regressions (See Figure 3), with results ranging from 0.08 to 0.45, and 0.03 to 0.31 for OR_{QUAD} and OR_{PQL}, respectively. The worst results were seen when σ^{2}_{u} = 0, while the best results were seen when β_{1} = 0.

We used a linear mixed quantile regression model to model the association between the absolute percent difference in OR_{PQL} and OR_{QUAD} and the absolute percent bias in OR_{PQL} or OR_{QUAD}. We found that overall median absolute percent bias in OR_{QUAD} increased by 6.5% (95% CI: 4.6–8.4) for each 1% increase in the absolute percent difference in OR_{PQL} and OR_{QUAD}, after adjusting for data set characteristics. However, this slope was quite variable: the variance of the random effect was 8.2. The association was less strong when absolute percent bias in OR_{PQL} was used as the outcome: median bias in OR_{PQL} increased by 1.2% (95% CI: 0.8–1.6) for each 1% increase in the absolute percent difference in OR_{PQL} and OR_{QUAD}, after adjusting for data set characteristics. This slope was less variable: the variance of the random effect was 1.3. (See Table 4.)

In addition to looking at bias in the odds ratios estimated via PQL and QUAD, we also considered using the regression coefficient. However, results were, in general, poorer, with smaller slopes, lower R^{2} and a smaller proportion statistically significant. (Data not shown.)

When absolute percent difference in σ^{2}_{uPQL} and σ^{2}_{uQUAD} was used as the predictor for the absolute percent bias of σ^{2}_{uQUAD} and σ^{2}_{uPQL}, respectively, the estimated slope varied quite widely, especially when absolute percent bias in σ^{2}_{uPQL} was used as the outcome. (See Figure 4.)

Legend for Figure 4: the center of each box is the median of the estimated slope, the box edges are the 25^{th} and 75^{th} percentiles, and the ends of the dashed lines are the 10^{th} and 90^{th} percentiles.

The proportion of scenarios in which this was statistically significant was high (e.g. 85% and 91%, respectively). (See Figure 5.) The median proportion of variance explained by the predictor was 13% and 52%, respectively. (See Figure 6.) Indeed, it seemed to be a much stronger predictor for PQL than for QUAD, although when absolute percent bias in σ^{2}_{uPQL} was the outcome, the median slope was negative.

Legend for Figure 6: the center of each box is the median of the R^{2}, the box edges are the 25^{th} and 75^{th} percentiles, and the ends of the dashed lines are the 10^{th} and 90^{th} percentiles.

The slope was negative in 18% and 75% of simulation scenarios for QUAD and PQL, respectively. For PQL, negative slopes were more likely to occur when the variance of the random effect was larger and when there were fewer subjects per cluster. (Data not shown.)

We used a linear mixed quantile regression model to model the association between the absolute percent difference in σ^{2}_{uPQL} and σ^{2}_{uQUAD} and the absolute percent bias in σ^{2}_{uPQL} or σ^{2}_{uQUAD}. The association was not statistically significant for σ^{2}_{uQUAD}. The association was small and quite variable for σ^{2}_{uPQL}, after adjusting for data set characteristics. (See Table 5.)

The absolute difference in σ^{2}_{uPQL} and σ^{2}_{uQUAD} was not a very good predictor for the absolute percent bias in OR_{QUAD} or OR_{PQL} – fewer models were statistically significant (e.g. 29% overall for QUAD and 16% overall for PQL), R^{2} was low, and the estimated slope was close to 0. (Data not shown.)

The absolute percent difference in OR_{PQL} and OR_{QUAD} was not a good predictor of the absolute bias of σ^{2}_{uPQL} or σ^{2}_{uQUAD}. It was often statistically significant (e.g. 66% and 83% overall for QUAD and PQL, respectively), though R^{2} was usually less than 0.3. In fact, the median slope across all scenarios was negative for PQL. (Data not shown.)

## Discussion

Over time, adaptive Gaussian Hermite quadrature has become the gold standard for fitting generalized linear mixed models with binary outcomes. However, given the greater flexibility in terms of modelling correlation structures available with penalized quasi-likelihood, and better convergence due to simpler estimation, PQL is still used frequently. Moreover, in some scenarios, neither approach uniformly gives good results. In this work, we systematically evaluated whether matching results from PQL and QUAD indicate less bias in estimated regression coefficients and variance parameters.

Overall, we found that the absolute percent bias of the odds ratio estimated via PQL or QUAD increased as the PQL- and QUAD-estimated odds ratios became more discrepant. While the estimated slope for the association between the absolute percent difference in the PQL- and QUAD-estimated odds ratios and the absolute percent bias of the odds ratio estimated via PQL or QUAD varied markedly depending on the characteristics of the dataset, the association for QUAD was almost always positive. In contrast, when using the absolute bias of the OR estimated via PQL as the outcome, the slope was often negative. In fact, it was negative in scenarios that are known to produce biased results for PQL – namely few subjects per cluster and high intercluster variability [5], [17], [18]. In these cases, the higher the discrepancy between the results, the more biased the PQL estimated odds ratio was.

We found that the absolute difference in σ^{2}_{uPQL} and σ^{2}_{uQUAD} was not a strong predictor for the absolute bias of σ^{2}_{uQUAD} or the odds ratios estimated via PQL or QUAD. Moineddin et al. found that with two level data structures, the variance components were extremely overestimated with small groups and slightly underestimated with moderate group size for GLMM estimated via quadrature [19]. PQL has been found to underestimate the variance components when the denominator is small [7]. We found that absolute percent bias for σ^{2}_{u} was greater for PQL than quadrature. For PQL, bias was worse when group size was small while for QUAD bias was worse when the number of groups was small.

Given how markedly results varied depending on data set characteristics, it proved impossible to specify cutpoints for the discrepancy above which results should be considered biased. For example, when identifying odds ratios estimated via QUAD or PQL that were more than 30% biased and using the discrepancy between QUAD and PQL as the test, the area under the receiver operating characteristic curve was 66% for QUAD and 60% for PQL across all scenarios. Despite this, our results show that discrepant results may indicate increased bias.
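To make the area-under-the-curve calculation concrete, the sketch below treats the discrepancy between the PQL and QUAD estimates as a diagnostic score and "more than 30% biased" as the condition to detect. It uses the rank-based (Mann-Whitney) form of the AUC; the function names are hypothetical and this is not the code used in the study.

```python
def flag_biased(abs_percent_bias_values, threshold=30.0):
    """Label each estimate as biased (True) if its bias exceeds the threshold."""
    return [b > threshold for b in abs_percent_bias_values]


def roc_auc(scores, labels):
    """AUC as the probability that a biased estimate has a larger
    discrepancy score than an unbiased one (ties count one half)."""
    positives = [s for s, is_biased in zip(scores, labels) if is_biased]
    negatives = [s for s, is_biased in zip(scores, labels) if not is_biased]
    if not positives or not negatives:
        raise ValueError("need at least one biased and one unbiased estimate")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to a discrepancy that carries no information about bias, so values of 0.66 and 0.60 indicate only modest discrimination.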

One strength of this work was the use of simulations to systematically investigate the robustness of the association between similarity in PQL and QUAD estimates and bias in PQL- and QUAD-estimated regression coefficients and variance components. This allowed us to investigate the impact of a wide range of data set characteristics on these associations. Indeed, we varied data set size and composition, proportion of subjects with the outcome, magnitude of the effect under study, and inter-cluster variability in over 400 distinct data generation scenarios. Despite this, our scenarios were certainly not exhaustive.

Moreover, we made many simplifying decisions. We considered data sets with only one categorical predictor, only one level of clustering, and only generated data with normally distributed random intercepts, not random slopes, or more complicated correlation structures. Finally, we have only compared two methods, whereas some may also have been interested in comparing Bayesian methods of estimation [20], or other approaches.

This work suggests that comparing results from generalized linear mixed models estimated via PQL and QUAD is a worthwhile exercise for regression coefficients and variance components obtained via QUAD, in situations where PQL is known to give reasonable results. Results were less useful for results obtained via PQL. In both cases, results strongly depended on features of the data set, making it difficult to create a simple-to-implement rule.

## Supporting Information

### Table S1.

**Median (IQR) proportion of samples in which the model did not converge overall and by data generation parameters.**

doi:10.1371/journal.pone.0084601.s001

(DOC)

## Author Contributions

Conceived and designed the experiments: AB. Performed the experiments: AB. Analyzed the data: AB. Wrote the paper: AB RP JA. Interpreted results: AB RP JA.

## References

- 1. Molenberghs G, Verbeke G (2005) Models for Discrete Longitudinal Data. New York: Springer.
- 2. Diggle P, Heagerty P, Liang K-Y, Zeger SL (2002) Analysis of Longitudinal Data. Oxford: Oxford University Press.
- 3. Jang JY, Kang SK, Chung HK (1993) Biological exposure indices of organic solvents for Korean workers. International Archives of Occupational & Environmental Health 65: S219–S222 15.
- 4. Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population average approaches for analyzing correlated binary data. International Statistical Review 59: 25–35.
- 5. Breslow N, Clayton D (1993) Approximate inference in generalized linear mixed models. J Am Stat Assoc 88: 9–25.
- 6. Breslow N (2003) Whither PQL? UW Working Paper Series 192.
- 7. Breslow N, Lin X (1995) Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika 82: 91.
- 8. Jang W, Lim J (2009) A Numerical study of PQL estimation biases in generalized linear mixed models under heterogeneity of random effects. Comm Stat Sim Comp 38: 692–702.
- 9. Pinheiro J, Bates D (1995) Approximations to the log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Stat 4: 12–35.
- 10. Hedeker D, Gibbons RD, Flay BR (1994) Random-effects regression models for clustered data with an example from smoking prevention research. J Consult Clin Psychol 62: 757–765.
- 11. Callens M, Croux C (2005) Performance of likelihood-based methods for multilevel binary regression models. J Stat Comput Sim 75: 1003–1017.
- 12. Ng ESW, Carpenter J, Goldstein H, Rasbash J (2006) Estimation in generalized linear mixed models with binary outcomes by simulated maximum likelihood. Stat Modelling 6: 23–42.
- 13. Austin P (2010) Estimating multilevel logistic regression models when the number of clusters is low: a comparison of different statistical software procedures. Int J Biostat 6: Article 16.
- 14. Geraci M, Bottai M (2013) Linear quantile mixed models. Statistics & Computing. doi:10.1007/s11222-013-9381-9.
- 15. SAS/STAT (2011) Version 9.2 for Windows. [computer program]. Cary, NC, USA: SAS Institute Inc.
- 16. R Development Core Team (2012) R: A language and environment for statistical computing, version 2.14.2 [computer program]. Vienna, Austria: R Foundation for Statistical Computing.
- 17. Goldstein H, Rasbash J (1996) Improved approximations for multilevel models with binary responses. J Roy Stat Soc A Sta 159: 505–513.
- 18. Rodriguez G, Goldman N (1995) An assessment of estimation procedures for multilevel models with binary responses. J Roy Stat Soc A Sta 158: 73–90.
- 19. Moineddin R, Matheson FI, Glazier RH (2007) A simulation study of sample size for multilevel logistic regression models. BMC Med Res Methodol 7: 34.
- 20. Gelman A, Hill J (2006) Data Analysis using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.