Abstract
The binary Generalized Linear Mixed Model (GLMM) is the method most commonly used by researchers to analyze clustered binary data in the biological and social sciences. The traditional approach to GLMMs can cause substantial bias in estimates because of the fixed shape imposed by the logistic and normal distribution assumptions, thereby leading to wrong and misleading decisions. This study puts forward an approach based on skew generalized t distributions, a class of potentially skewed and heavy tailed distributions. Interestingly, the traditional logistic and probit mixed models, as well as other available methods, can be accommodated within the skew generalized t-link model (SGTLM) framework. We fitted the model using the Expectation-Maximization algorithm accelerated via parameter expansion. We evaluated the performance of this approach to GLMMs through a simulation experiment by varying sample size and data distribution. Our findings indicated that the proposed methodology outperforms competing approaches in estimating population parameters and predicting random effects when the traditional link and normality assumptions are violated. In addition, empirical standard errors and information criteria proved useful for detecting spurious skewness and avoiding overly complex models for probit data. An application to respiratory infection data points to the superiority of the SGTLM, which turns out to be the most adequate model. Future studies should focus on integrating the demonstrated flexibility into other generalized linear mixed models to enhance robust modeling.
Citation: Tovissodé CF, Diop A, Glèlè Kakaï R (2021) Inference in skew generalized t-link models for clustered binary outcome via a parameter-expanded EM algorithm. PLoS ONE 16(4): e0249604. https://doi.org/10.1371/journal.pone.0249604
Editor: Luca Citi, University of Essex, UNITED KINGDOM
Received: August 15, 2020; Accepted: March 19, 2021; Published: April 6, 2021
Copyright: © 2021 Tovissodé et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: CFT is grateful to the Centre d’Excellence Africain en Sciences Mathématiques et Applications (CEA-SMA, https://ceasma-benin.org/) for funding his work. CFT was also financially supported by the African German Network of Excellence in Science (AGNES), through the "AGNES mobility grant for young scientists from sub Saharan Africa" (https://agnes-h.org/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Binary outcomes are prominent in many applied sciences, including but not limited to biological and social sciences. Moreover, in cross sectional as well as panel studies, dichotomous responses are often naturally grouped by sampling techniques or some properties of the sampling units [1]. The preferred modern method to analyze clustered binary data is through the Generalized Linear Mixed Model (GLMM) framework [2]. Indeed, when a binary outcome has been recorded repeatedly or in the presence of latent factors, GLMMs allow accounting explicitly for over-dispersion and correlation within clusters using random effects.
Let Yij denote the binary outcome (0 or 1) of the jth measurement (j = 1, 2, ⋯, ni) and Yi the collection of responses from the ith cluster, i.e. $Y_i = (Y_{i1}, \cdots, Y_{in_i})^\top$, i = 1, 2, ⋯, n. In terms of an underlying latent continuous random vector $Z_i = (Z_{i1}, \cdots, Z_{in_i})^\top$ and random effects bi = (bi1, ⋯, biq)⊤, the mixed probit model (PM) assumes that the Yij are conditionally independent and given as [3]:
$$Y_{ij} = I_{(0,\infty)}(Z_{ij}), \qquad Z_i \mid b_i \sim \mathcal{N}_{n_i}(\eta_i, I_{n_i}), \qquad b_i \sim \mathcal{N}_q(0, D) \tag{1}$$
where IA(x) is the indicator function, which equals 1 if x ∈ A and 0 otherwise; ηi is the ni−vector of linear predictors, ηi = Xi β + Wi bi; β is the p−vector of fixed effects; $X_i = (X_{i1}, \cdots, X_{in_i})^\top$ and $W_i = (W_{i1}, \cdots, W_{in_i})^\top$ are respectively known ni × p and ni × q matrices of covariates with Xij = (Xij1, ⋯, Xijp)⊤ and Wij = (Wij1, ⋯, Wijq)⊤; $I_{n_i}$ is the ni × ni identity matrix and $\mathcal{N}_q(0, D)$ denotes the q-variate normal distribution with null mean vector and an unknown q × q variance-covariance matrix D, meant to capture the dependence structure of Yi. The latent variable Zij serves as a convenient stochastic representation of the conditional outcome Yij. Equivalently, one may write P(Yij = 1|bi) = Φ(ηij) with Φ(⋅) the cumulative distribution function (cdf) of the standard normal distribution, which acts as the inverse link function mapping the linear predictor ηij to the predicted probability of the outcome Yij. Combined with the normality assumption on random effects, the systematic use of this link and of the well known alternative, the logit link, is somewhat controversial [4, 5].
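The latent variable construction of the mixed probit model can be sketched in a few lines. This is only an illustration, assuming a single random intercept (q = 1, Wij = 1); the function name `simulate_mixed_probit` and all numerical values are hypothetical and not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def simulate_mixed_probit(X, cluster, beta, d, rng):
    """Draw Y from a random-intercept probit model via its latent representation.

    X: (N, p) covariates; cluster: (N,) integer labels 0..n-1;
    beta: (p,) fixed effects; d: variance of the random intercept (D with q = 1).
    """
    n = cluster.max() + 1
    b = rng.normal(0.0, np.sqrt(d), size=n)      # b_i ~ N(0, D)
    eta = X @ beta + b[cluster]                  # eta_ij = X_ij' beta + b_i
    z = eta + rng.standard_normal(len(eta))      # Z_ij | b_i ~ N(eta_ij, 1)
    y = (z > 0).astype(int)                      # Y_ij = 1 if Z_ij > 0, else 0
    return y, eta, b

rng = np.random.default_rng(1)
n, ni = 200, 5                                   # illustrative sizes
cluster = np.repeat(np.arange(n), ni)
X = np.column_stack([np.ones(n * ni), rng.normal(size=n * ni)])
y, eta, b = simulate_mixed_probit(X, cluster, np.array([-0.5, 1.0]), 0.8, rng)
p_cond = norm.cdf(eta)                           # P(Y_ij = 1 | b_i) = Phi(eta_ij)
print(y.mean(), p_cond.mean())
```

The observed proportion of ones and the average conditional probability Φ(ηij) should be close, which is the equivalence between the latent threshold form and the inverse link form stated above.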
The link function indeed plays a critical role in GLMMs since it heavily impacts estimates, predictions and consequently interpretations [4, 6]. As a result, in the binary generalized linear model literature, aside from the logistic and probit models based on the fixed-shape logistic and normal distributions respectively, there have been increasing efforts to make the link function flexible. Many works have considered heavy tailed link functions, for instance the Semi-Nonparametric [7], Student-t [8] and generalized t [9] distributions, and elliptical scale mixtures [10, 11]. Indeed, the maximum likelihood estimators of logistic and probit regression models are not robust to outliers [7]. Heavy tailed links are less sensitive to outliers and thus allow outlier-robust inference. In particular, the link functions based on the Student t distribution incorporate observation-specific stochastic weights which can be used for outlier detection [7, 12]. Similarly, skew-probit [13], skew generalized t [9], asymmetric logistic [14], loglog and complementary loglog, power symmetric and reciprocal power symmetric [15] links, among others, were used to handle situations where the probability of a given binary response approaches zero at a different rate than it approaches one. Skew logistic distributions have also been developed (see e.g. [16]) and may be used with the same aim in mind. For example, [9] discussed a prostate cancer study where the outcome variable Y represents the presence or the penetration of cancer in or near the prostate capsule of patients. The rate at which the probability of “Y = 1” approaches one is expected to be very different (slower) from the rate at which it approaches zero [9]. For this study, the skew generalized t-link fits the data best [9], indicating that the simultaneous use of skewed and heavy tailed link functions can lead to more effective modelling of binary data.
Furthermore, although random effects are traditionally assumed to be normally distributed in GLMMs, this may not be realistic [17, 18]. Therefore, considerable effort has been devoted to making the random effects distribution in GLMMs flexible, replacing the normal distribution with, for instance, Semi-Nonparametric [19], probability integral transformation of normal [20], skew normal [21], log-normal [22, 23], Student-t [24] and scale mixtures of normal [25] distributions.
The above background shows that many approaches are available for fitting a flexible GLMM to correlated binary outcomes, although none of them attempts to explicitly account for the skewness and tail behavior of both the link function and the random effects distribution simultaneously. However, the misspecification of the link function or the random effects distribution can introduce substantial bias and reduce the accuracy of the mean response as well as heterogeneity estimates [6, 18]. Standing in a fully parametric framework, we propose a unifying approach based on skew generalized t (SGT) distributions [26], a class that includes among others the normal, the skew normal and the Student t models. The use of a skew generalized t family instead of the Student t family allows rescaling fixed effects so that they have the same interpretation as in the mixed probit model in Eq (1).
Our contributions include i) an extension of the flexible generalized t-link model built for independent binary samples proposed by [9] to deal with dependent binary samples (mixed model); ii) a parameter-expanded EM algorithm [27] for computing the maximum likelihood estimates of skew generalized t-link models for correlated binary data, extending the EM algorithm of [24] for t-link models; and iii) empirical Bayes estimators of skew t distributed random effects in mixed models for binary data.
The organization of the paper is as follows. Section 2 presents preliminary results on the SGT distributions and the truncated SGT distributions and their first two moments. Section 3 specifies the SGT-link model (SGTLM) and describes maximum likelihood estimation and cluster-specific inference based on random effects and weights. A simulation study assessing the performance of the SGTLM relative to existing methods and the application of the modelling approach to real respiratory infection data are presented in Section 4. Concluding remarks are given in Section 5.
Preliminary results
In this section, we present some useful properties of the skew generalized t distributions.
Multivariate skew generalized t distributions
Multivariate skew generalized t (SGT) distributions are special cases of multivariate skew scale mixture of normal (SSMN) distributions [28] (pages 102-103) which we first introduce. A random variable Z is said to follow a p−variate SSMN distribution with location μ, scale Ω, and shape λ, if it can be represented as [29] (page 20, Eq 3.12):
(2)
where U, called scale mixing variable, is a positive random variable with cdf FU(⋅|ν) indexed by a parameter vector ν,
is the standard half normal distribution; Z0, X and U are independent;
and δ = (1 + λ⊤ λ)−1/2 Ω1/2 λ. Different choices of the scale mixing distribution FU(⋅|ν) result in various sub-classes of skew elliptical distributions, for instance, the skew normal when P(U = 1) = 1 [28] (page 103); the skew contaminated normal when ν = (ν1, ν2)⊤, 0 < ν1 < 1, 0 < ν2 ≤ 1 and U is discrete and takes the values U = 1 with probability 1 − ν1 and U = ν2 with probability ν1 [30] (page 308); the skew slash when U ∼ Beta(ν, 1), ν > 0 [30] (page 307); and the skew generalized t when ν = (ν, ν0)⊤, ν > 0, ν0 > 0, and U follows a gamma distribution with shape ν/2 and rate ν0/2 [28] (page 105). The following result states conditions for the identifiability of SSMN distributions, a requirement for reliable inference using this class of distributions.
Lemma 1 (see S1 Appendix for a proof) The free parameters (μ, δ, Ω and ν) of a SSMN distribution with the representation in Eq (2) are identifiable if and only if i) the scale mixing distribution FU(⋅|ν) is identifiable and ii) FU(⋅|ν) does not satisfy for any element νk of ν and any distribution function H(⋅|ν−k) where ν−k is the vector ν without the element νk. If U has a probability density function (pdf) fU(u|ν) for all u > 0, then the condition ii) is equivalent to fU(⋅|ν) does not satisfy
for any pdf hU(⋅|ν−k).
On setting and defining the expectations
, and assuming that
for the required expectations, the first two central moments of a SSMN vector Z are given by [28] (pages 109-110):
(3)
(4)
The ability of the SSMN distributions to capture more data structure than the normal, the skew normal or the scale mixture of normals is reflected in the expressions for skewness (
) and kurtosis (
) indices given for the kth marginal of Z as [31]:
(5)
(6)
where δk is the kth element of δ and
is the kth diagonal element of the covariance matrix given in Eq (4).
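We notice from the expressions for the skewness Eq (5) and kurtosis Eq (6) indices that the parameter λ controls the shape of the distribution only through the working shape parameter $\delta = (1 + \lambda^\top \lambda)^{-1/2} \Omega^{1/2} \lambda$. This quantity is invariant under marginalization, i.e. by the stochastic representation in Eq (2), for any arbitrary subset of Z, the working shape parameter is the corresponding subset of δ. It is worth noticing however that the quantity δ cannot be specified unrestrictedly, independently of Ω. Indeed, we observe that $\delta = (1 + \lambda^\top \lambda)^{-1/2} \Omega^{1/2} \lambda$ implies that $\lambda = (1 + \lambda^\top \lambda)^{1/2} \Omega^{-1/2} \delta$. This in turn gives $\lambda^\top \lambda = (1 + \lambda^\top \lambda)\, \delta^\top \Omega^{-1} \delta$, which yields $\lambda^\top \lambda\, (1 - \delta^\top \Omega^{-1} \delta) = \delta^\top \Omega^{-1} \delta$. We then get $\lambda^\top \lambda = \delta^\top \Omega^{-1} \delta\, (1 - \delta^\top \Omega^{-1} \delta)^{-1}$ so that $1 + \lambda^\top \lambda = 1 + \delta^\top \Omega^{-1} \delta\, (1 - \delta^\top \Omega^{-1} \delta)^{-1}$, i.e. $1 + \lambda^\top \lambda = (1 - \delta^\top \Omega^{-1} \delta)^{-1}$ provided $\delta^\top \Omega^{-1} \delta \neq 1$. Therefore, λ is recovered from δ and Ω under the constraint $\delta^\top \Omega^{-1} \delta < 1$ as:
$$\lambda = (1 - \delta^\top \Omega^{-1} \delta)^{-1/2}\, \Omega^{-1/2} \delta \tag{7}$$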
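The mapping between λ and the working shape δ, and its inverse in Eq (7), can be checked numerically. The sketch below is illustrative only; the helper names (`sym_sqrt`, `delta_from_lambda`, `lambda_from_delta`) and the example values are hypothetical, and the symmetric matrix square root is one possible choice for Ω1/2.

```python
import numpy as np

def sym_sqrt(omega):
    """Symmetric square root of a positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(omega)
    return (vecs * np.sqrt(vals)) @ vecs.T

def delta_from_lambda(lam, omega):
    # delta = (1 + lam' lam)^(-1/2) Omega^(1/2) lam
    return sym_sqrt(omega) @ lam / np.sqrt(1.0 + lam @ lam)

def lambda_from_delta(delta, omega):
    # Eq (7): lam = (1 - delta' Omega^{-1} delta)^(-1/2) Omega^(-1/2) delta
    quad = delta @ np.linalg.solve(omega, delta)
    if quad >= 1.0:
        raise ValueError("delta' Omega^{-1} delta must be smaller than 1")
    return np.linalg.solve(sym_sqrt(omega), delta) / np.sqrt(1.0 - quad)

omega = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative scale matrix
lam = np.array([1.5, -0.7])                  # illustrative shape vector
delta = delta_from_lambda(lam, omega)
lam_back = lambda_from_delta(delta, omega)
print(delta, lam_back)
```

Any δ obtained from a valid λ automatically satisfies the constraint δ⊤ Ω−1 δ < 1, so the round trip returns the original λ up to floating point error.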
It is nevertheless possible to unrestrictedly specify δ and
(positive definite). In this case, Ω is recovered as
. Using the Sherman-Morrison identity [32] (page 121, Eq 3.1), we have
from which we get
that simplifies as
. We thus have
hence
(8)
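The steps leading to Eq (8) can be written out explicitly. In the sketch below, the symbol $\Sigma$ is a placeholder introduced here for the unrestricted positive definite matrix paired with δ, assumed to satisfy $\Sigma = \Omega - \delta\delta^\top$; the paper's own notation for this matrix may differ.

```latex
\begin{align*}
\Omega &= \Sigma + \delta\delta^\top, \\
\Omega^{-1} &= \Sigma^{-1}
  - \frac{\Sigma^{-1}\delta\delta^\top\Sigma^{-1}}{1 + \delta^\top\Sigma^{-1}\delta}
  && \text{(Sherman-Morrison)}, \\
\delta^\top\Omega^{-1}\delta
  &= \delta^\top\Sigma^{-1}\delta
  - \frac{(\delta^\top\Sigma^{-1}\delta)^2}{1 + \delta^\top\Sigma^{-1}\delta}
  = \frac{\delta^\top\Sigma^{-1}\delta}{1 + \delta^\top\Sigma^{-1}\delta}, \\
1 - \delta^\top\Omega^{-1}\delta
  &= \frac{1}{1 + \delta^\top\Sigma^{-1}\delta}, \\
\lambda &= \left(1 + \delta^\top\Sigma^{-1}\delta\right)^{1/2}\,\Omega^{-1/2}\,\delta
  && \text{(substituting into Eq (7))}.
\end{align*}
```

Since $1 + \delta^\top\Sigma^{-1}\delta > 0$ for any δ and any positive definite $\Sigma$, the constraint $\delta^\top\Omega^{-1}\delta < 1$ holds automatically under this parameterization, which is why δ and $\Sigma$ can be specified without restriction.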
In the binary data modeling framework, we shall consider δ and
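as model parameters as they turn out to be easier to estimate by means of the EM algorithm. For the multivariate Skew Generalized t (SGT) distribution, the mixing variable U is gamma distributed with shape ν/2 and rate ν0/2, with pdf [33] (page 1, Eq 1):
$$f_G(u \mid \nu/2, \nu_0/2) = \frac{(\nu_0/2)^{\nu/2}}{\Gamma(\nu/2)}\, u^{\nu/2 - 1} \exp\!\left(-\frac{\nu_0 u}{2}\right), \qquad u > 0 \tag{9}$$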
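Sampling from an SGT distribution through its stochastic representation can be sketched as follows. Because Eq (2) is not reproduced here, the code assumes the common form Z = μ + U^(−1/2) {δX + (Ω − δδ⊤)^(1/2) Z0}, with X standard half normal, Z0 standard p-variate normal and U gamma distributed with shape ν/2 and rate ν0/2; the function name `rsgt` and the parameter values are hypothetical.

```python
import numpy as np

def sym_sqrt(m):
    """Symmetric square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.sqrt(vals)) @ vecs.T

def rsgt(n, mu, omega, lam, nu, nu0, rng):
    """Draw n vectors from the ASSUMED representation
    Z = mu + U^(-1/2) * (delta * X + (Omega - delta delta')^(1/2) Z0)."""
    p = len(mu)
    delta = sym_sqrt(omega) @ lam / np.sqrt(1.0 + lam @ lam)   # working shape
    root = sym_sqrt(omega - np.outer(delta, delta))
    u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu0, size=n)     # rate nu0/2 = scale 2/nu0
    x = np.abs(rng.standard_normal(n))                         # standard half normal
    z0 = rng.standard_normal((n, p))                           # N_p(0, I_p)
    core = x[:, None] * delta + z0 @ root
    return mu + core / np.sqrt(u)[:, None], delta

rng = np.random.default_rng(42)
mu = np.array([0.5, -1.0])
omega = np.array([[2.0, 0.5], [0.5, 1.0]])
lam = np.array([2.0, -1.0])
z, delta = rsgt(200_000, mu, omega, lam, nu=8.0, nu0=8.0, rng=rng)
print(z.mean(axis=0))
```

With ν0 = ν the draws follow a skew t distribution, and with λ = 0 the skewing term vanishes and the draws follow a (generalized) t distribution.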
The p−variate SGT distribution, denoted
with ν = (ν, ν0)⊤ has pdf for
[28] (page 106):
(10)
where
(11)
is the pdf of the p−variate Generalized t (GT) distribution, z0 = Ω−1/2 (z − μ), α = λ⊤ z0, and T(⋅|ν) is the cdf of the standard univariate t distribution with ν degrees of freedom. For SGT distributions, the expectations
required for computing moments given in Eqs (3)–(6) have for t < ν the form
(12)
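The gamma moments entering Eq (12) and the resulting first two moments can be checked numerically. The sketch assumes that the required expectations are E{U^(−t/2)} for U gamma distributed with shape ν/2 and rate ν0/2, and that the mean and covariance follow from the representation Z = μ + U^(−1/2) {δX + (Ω − δδ⊤)^(1/2) Z0}; the names `kappa`, `kappa_numeric` and `ssmn_moments` are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln
from scipy.stats import gamma

def kappa(t, nu, nu0):
    """E{U^(-t/2)} for U ~ Gamma(shape nu/2, rate nu0/2); finite only for t < nu."""
    if t >= nu:
        raise ValueError("the moment is finite only for t < nu")
    return np.exp(0.5 * t * np.log(nu0 / 2.0)
                  + gammaln((nu - t) / 2.0) - gammaln(nu / 2.0))

def kappa_numeric(t, nu, nu0):
    """Same expectation by numerical integration, as an independent check."""
    pdf = gamma(a=nu / 2.0, scale=2.0 / nu0).pdf
    value, _ = quad(lambda u: u ** (-t / 2.0) * pdf(u), 0.0, np.inf)
    return value

def ssmn_moments(mu, omega, delta, nu, nu0):
    """Mean and covariance under the assumed representation (requires nu > 2)."""
    k1, k2 = kappa(1, nu, nu0), kappa(2, nu, nu0)
    mean = mu + np.sqrt(2.0 / np.pi) * k1 * delta
    cov = k2 * omega - (2.0 / np.pi) * k1 ** 2 * np.outer(delta, delta)
    return mean, cov

print(kappa(2, 6.0, 6.0), kappa_numeric(2, 6.0, 6.0))
```

For ν0 = ν and δ = 0, the covariance reduces to ν/(ν − 2) Ω, the familiar variance inflation of the Student t distribution.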
It is worth noting that the gamma mixing pdf fG(⋅|ν/2, ν0/2) satisfies condition i) of Lemma 1 but not condition ii). The SGT distribution with ν = (ν, ν0)⊤ as a parameter is thus not identifiable. However, restricting ν0 to a fixed value (so that only ν is considered as a parameter) is sufficient to ensure identifiability of the SGT family of distributions. When ν0 = ν, the p−variate SGT distribution
reduces to the p−variate Skew t (ST) distribution [28] (page 106), denoted
which is thus identifiable with pdf Stp(⋅|μ, Ω, λ, ν) and cdf STp(⋅|μ, Ω, λ, ν). If λ = 0, the SGT distribution reduces to the GT distribution which equals the Student t distribution for ν0 = ν. The following lemma formalizes the relationship between skew generalized t and skew t distributions.
Lemma 2 (see S2 Appendix for a proof) Let us consider the SGT distribution with ν = (ν, ν0)⊤ and pdf in Eq (10). Set
. The following statements hold:
- SGtp(z|μ, Ω, λ, ν) = Stp(z|μ, Ω*, λ, ν);
- If
then
.
Lemma 2 indicates that any SGT vector is a rescaled version of a ST vector. However, in the framework of binary data models, the use of a SGT distribution instead of a simple ST distribution as link function allows controlling the scale of the link function through the parameter ν0 [9]. Specifically, a skew generalized t-link allows defining a skewed and heavy-tailed binary mixed model where fixed effects have the same scale and interpretation as in the mixed probit model in Eq (1). Interestingly, the popular logit and probit links for binary data can be recast as special cases of the cdf of the SGT class of distributions. Indeed, the normal distribution is a limiting case of SGT distributions when ν0 = ν → ∞ and λ = 0. Moreover, the logistic distribution is well approximated by a rescaled Student t distribution with appropriate degrees of freedom [8] (page 228). These observations make the SGT distributions appropriate for extending the traditional probit and logistic GLMMs, accounting for skewness and heavy tail behaviors. To this end, we present in the next section some results on truncated multivariate SGT distributions since binary data can reflect truncation of latent continuous variables.
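The closeness of the logistic cdf to a rescaled Student t cdf can be illustrated numerically. In this sketch, the choice ν = 8 and the variance-matching scale are illustrative assumptions rather than values taken from the paper; the function name `max_cdf_gap` is hypothetical.

```python
import numpy as np
from scipy.stats import logistic, norm, t

def max_cdf_gap(nu, grid):
    """Largest absolute gap between the logistic cdf and a variance-matched t cdf."""
    # Scale the t variable so that its variance equals the logistic variance pi^2 / 3.
    scale = np.sqrt((np.pi ** 2 / 3.0) / (nu / (nu - 2.0)))
    return np.max(np.abs(logistic.cdf(grid) - t.cdf(grid / scale, df=nu)))

grid = np.linspace(-10.0, 10.0, 2001)
gap_t = max_cdf_gap(8.0, grid)                       # nu = 8 is an illustrative choice
# Variance-matched normal cdf for comparison (standard deviation pi / sqrt(3)).
gap_normal = np.max(np.abs(logistic.cdf(grid) - norm.cdf(grid * np.sqrt(3.0) / np.pi)))
print(gap_t, gap_normal)
```

The heavier tails of the t distribution let it track the logistic cdf more closely than a variance-matched normal cdf does, which is the sense in which the logit link sits close to the t-link family.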
Truncated multivariate skew generalized t distributions
As seen for the mixed probit model in Eq (1), models for binary data can be defined by truncating latent variables following continuous distributions. We define in this section a class of truncated multivariate skew generalized t distributions which are useful for a latent variable representation of skew generalized t-link binary data models. We also give expressions to evaluate some joint moments of a truncated multivariate skew generalized t distribution and a gamma distribution, as they prove useful in the implementation of the EM algorithm for the skew generalized t-link model.
Let represent a p−variate skew generalized t (SGT) vector restricted to a p-dimensional hyperplane
; with μ a p−vector (location), Ω a p × p positive definite matrix (scale), λ a p−vector (shape) and ν = (ν, ν0)⊤ a vector of positive scalars (degrees of freedom). The pdf of
is:
(13)
where SGtp(⋅|μ, Ω, λ, ν) is the pdf in Eq (10) and
serves for normalization. The cdf of Z is denoted
. When λ = 0, we obtain a truncated generalized t distribution denoted
with pdf
and cdf
. When ν0 = ν, we get a truncated ST distribution denoted
with pdf
and cdf
. If both λ = 0 and ν0 = ν, the truncated multivariate SGT distribution is reduced to a truncated multivariate t distribution [34] denoted
with pdf
and cdf
.
In the frame of correlated binary data models, the truncation region typically has the form
where
are real intervals of the form
or
, for
(k = 1, 2, ⋯, p). Let us consider for instance a vector Y of three binary outcomes obtained by truncating the elements of a 3−variate SGT vector
: Yk = 0 if Zk ≤ 0 and Yk = 1 if Zk > 0. In practice, however, only the binary outcomes (Y) are observable whereas the latent outcome Z is unobservable. Suppose one observes the binary outcomes y = (1, 0, 1)⊤. This implies that the corresponding value z of the latent vector Z satisfies z1 > 0, z2 ≤ 0, and z3 > 0. The conditional distribution of Z given Y = y (required for maximum likelihood estimation using the EM algorithm) is thus
truncated to the region
, i.e.
as defined in Eq (13).
We shall use the simplified notation with
to denote a truncated SGT distribution
when
is the right truncated hyperplane
. In this case, αst = SGTp(a|μ, Ω, λ, ν) with SGTp(⋅|μ, Ω, λ, ν) the cdf of the p−variate SGT distribution. This corresponds for instance to the situation where all binary outcomes are zeros. When λ = 0, the right truncated SGT distribution
is a right truncated GT distribution denoted
whose pdf and cdf are respectively denoted TGtp(⋅|μ, Ω, ν, a) and TGTp(⋅|μ, Ω, ν, a). When ν0 = ν, the right truncated SGT distribution is a right truncated ST distribution denoted
with pdf TStp(⋅|μ, Ω, λ, ν, a) and cdf TSTp(⋅|μ, Ω, λ, ν, a). If both λ = 0 and ν0 = ν, the distribution is reduced to a right truncated t distribution denoted
with pdf Ttp(⋅|μ, Ω, ν, a) and cdf TTp(⋅|μ, Ω, ν, a). In the above example, if y = (0, 0, 0)⊤, then the truncation region becomes
. Since all truncation points are zeros, we shall write in this case
with a = (0, 0, 0)⊤ using the above simplified notation.
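The correspondence between observed binary outcomes and the truncation region of the latent vector can be sketched as below. The helper names are hypothetical; the sign convention (+1 for intervals of the form (−∞, ak], −1 for (ak, ∞)) is the reflection that turns a general region into a right truncated one, and is stated here as an assumption.

```python
import numpy as np

def truncation_region(y, a=None):
    """Per-coordinate (lower, upper) bounds of the latent region implied by y.

    y_k = 1 corresponds to z_k in (a_k, inf); y_k = 0 to z_k in (-inf, a_k].
    """
    y = np.asarray(y)
    a = np.zeros(len(y)) if a is None else np.asarray(a, dtype=float)
    lower = np.where(y == 1, a, -np.inf)
    upper = np.where(y == 1, np.inf, a)
    return lower, upper

def reflect_to_right_truncation(y, a=None):
    """Signs A_k and reflected truncation points a* = A a.

    After multiplying the latent vector by diag(A), every coordinate is
    bounded from above, i.e. the region becomes a right truncated one.
    """
    y = np.asarray(y)
    a = np.zeros(len(y)) if a is None else np.asarray(a, dtype=float)
    signs = np.where(y == 1, -1.0, 1.0)
    return signs, signs * a

lower, upper = truncation_region([1, 0, 1])
signs, a_star = reflect_to_right_truncation([1, 0, 1])
print(lower, upper, signs)
```

For y = (1, 0, 1)⊤ this gives the region z1 > 0, z2 ≤ 0, z3 > 0 described above, and for y = (0, 0, 0)⊤ all signs equal +1 so that no reflection is needed.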
The implementation of an EM algorithm for a SGT distribution based binary data model requires joint moments of the form ,
,
and
for s ∈ {1, 2}, Z(1) = Z and Z(2) = Z Z⊤,
,
, α = λ⊤ Ω−1/2(Z − μ), ζ1(x) = ϕ(x)/Φ(x) with ϕ(⋅) the pdf of the standard normal distribution, and
is a hyperplane of the form
with
, (ak, ∞)} for a = (a1, ⋯, ap)⊤. Proposition 1 hereafter will be useful for the derivation of
,
,
and
.
Proposition 1 (see S3 Appendix for a proof) Let with ν = (ν, ν0)⊤,
and set α = λ⊤ Ω−1/2(Z − μ). Then, for any real r > − ν and an integrable function g(⋅) of Z:
(14)
(15) where
,
,
, and
;
,
,
,
with
,
,
and
.
By means of a simple linear transformation, we obtain from Proposition 1 the joint expectations ,
,
and
in terms of moments of a truncated multivariate skew t distribution.
Corollary 1 (see S4 Appendix for a proof) Let with ν = (ν, ν0)⊤,
,
and
with
or
. Then, on setting
,
, A = diag(A1, ⋯, Ap) with Ak = 1 if
and Ak = −1 if
, a* = Aa, μ* = A μ,
,
, λ* = Aλ and αst = STp(a*|μ*, Ω*, λ*, ν),
(16)
(17)
(18)
(19)
(20)
(21) where
,
, and we have set
and
.
To use Corollary 1 in practice, the cdf of the multivariate skew t distribution is required. The function pmst of the package sn [35] for the R software [36] is an appropriate routine.
Moments of truncated multivariate skew generalized t distributions
The evaluation of expectations involved in Corollary 1 calls for general expressions for the first and second order moments of truncated multivariate SGT distributions. These moments are required in the implementation of an EM algorithm for a SGT distribution based binary data model. The moments have been derived for truncated multivariate t distributions by [34] and were used by [24] in their implementation of the EM algorithm for a t-link GLMM. We present in this section the expressions for the first two moments of truncated multivariate SGT distributions, relying on Theorem 1 of [37] and the moments of truncated multivariate t distributions available from [34] (see also [38]).
Let with ν = (ν, ν0)⊤ and
, i.e. a p−variate SGT vector restricted to the right truncated hyperplane
. The pdf of Z is:
(22)
where αst = SGTp(a|μ, Ω, λ, ν), SGTp(⋅|μ, Ω, λ, ν) is the cdf of the p−variate SGT distribution. If μ = 0, Ω is a correlation matrix (Ω = R) and ν0 = ν, then
. In this case, the first two moments of Z can be evaluated using the following proposition which simply combines Theorem 3 in [34] with Theorem 1 in [37].
Proposition 2 (see S5 Appendix for a proof) Let with R a correlation matrix. Then,
(23)
(24) where
with Tp(⋅|μ, Ω, ν) the cdf of the p−variate t distribution with location μ, scale Ω and degrees of freedom ν;
with ith element
, t(⋅) being the pdf of the standard Student t distribution;
with
,
and
;
with ith element
,
; H* is the p × p matrix with diagonal elements
and off diagonal elements defined as
with
;
; D* is the p × p diagonal matrix with diagonal elements
, H*i denoting the ith column of H*;
,
,
,
, with δi the ith element of δ, ρij the (ij)th element of R; Hi the ith column of H;
the vector a with its (i + 1)th element (i.e. ai) deleted;
the (i + 1)th column of
with its (i + 1)th element (i.e. 1) deleted;
,
being
with its (i + 1)th row and column deleted;
;
the vector
with its (i + 1)th and (j + 1)th elements (i.e. ai and aj) deleted;
,
being
with its (i + 1)th and (j + 1)th rows and columns deleted;
being the matrix
with its (i + 1)th and (j + 1)th columns deleted, and only its (i + 1)th and (j + 1)th rows kept;
;
;
the vector a with its ith element (i.e. ai) deleted;
,
being R with its ith row and column deleted;
being the matrix
with its first and (i + 1)th columns deleted, and only its first and (i + 1)th rows kept; and
.
The following corollary gives the first two moments of a general right truncated SGT vector with ν = (ν, ν0)⊤.
Corollary 2 Let with ν = (ν, ν0)⊤. Then,
(25)
(26) where
,
is the ith diagonal element of Ω,
, R is the correlation matrix from Ω,
, a* = Λ−1(a − μ) and E{X} and E{XX⊤} are available from Proposition 2.
When ν → ∞, the truncated multivariate SGT family has the truncated multivariate skew normal family as a limiting case (see S5 Appendix for a definition and formulas for computing the first two moments).
Skew generalized t-link mixed binomial model
This section defines the skew generalized t-link model (SGTLM) and describes an Expectation-Maximization (EM) algorithm [39] accelerated using parameter expansion [27] for likelihood inference. Empirical Bayes estimators of random effects and weights are also obtained for cluster specific inference.
Model specification and marginal log-likelihood
The skew generalized t-link GLMM (SGTLM) considered in this work is defined as:
(27)
where Yij is the binary outcome of the jth measurement (j = 1, 2, ⋯, ni) on the ith cluster (i = 1, 2, ⋯, n), Zi is a latent continuous outcome which determines the observable
, and bi is a vector of q random effects associated to the cluster i. In Eq (27), ηi = Xi β + Wi bi, β, Xi and Wi are as in Eq (1);
,
,
,
is the ni−vector of all ones,
,
with υ0 > 0 and ν > 2;
and
with
and
a q × q positive definite matrix.
In the SGTLM, the distribution of a single latent outcome Zij is where
and
denotes a univariate SGT distribution with location μ, scale ω2, shape λ and degrees of freedom ν. Therefore, on denoting SGT(⋅|μ, ω2, δ, ν) the cdf of a scalar SGT distribution
, the success probability of an outcome Yij given the random effects bi is
. Unlike in the common probit model (PM) (see Eq (1)), for a given cluster i, the ni latent outcomes Zij are not independent given the random effects bi. Indeed, on using Eq (4) and setting
, the variance-covariance matrix of Zi given bi is
(28)
so that the correlation coefficient between two elements Zij and Zik of Zi with k ≠ j is
. Thus, conditional on random effects, the ni latent outcomes in Zi are uncorrelated only when δε = 0, i.e. a skewed link function implies correlated latent outcomes within a cluster i.
The positive constant υ0 in the SGTLM controls the scale of the latent variable Zij and thus the scale of the model link function. Indeed, from Eq (28), the conditional variance of Zij is . Setting υ0 = 1 would yield a skew t-link model (i.e. ν = (ν, ν)⊤). However, to make fixed effects in the SGTLM comparable with fixed effects in the common probit model (PM) characterized by a link function with a unit scale (i.e. Var{Zij|bi} = 1), we have set
(29)
The application of Eq (3) to bi shows that, as in the PM, random effects in the SGTLM have null mean vector E{bi} = 0. Using Eq (4), the variance-covariance matrix of random effects is given by
(30)
When δε = 0 and δ = 0, the SGTLM reduces to the t-link model in [24], except that υ0 = 1 therein. As ν → ∞ (so that Ui = 1), the SGTLM has as limiting case the mixed skew-probit model (SPM), which reduces to the PM for δε = 0 and δ = 0.
By Eq (2), the SGTLM has the stochastic representation
(31)
where Ui and Vi are independent. In this representation of the SGTLM in terms of more common distributions, the ni binary outcomes Yij of a cluster i are independent given the random effects bi, the scale mixing variable Ui and the half normal variable Vi. Note that given Ui and Vi, Zi|bi and bi are normally distributed and share the same Ui and Vi. As a result, the joint distribution of Zi and bi belongs to the class of SGT distributions. From the stochastic representation in Eq (31), we obtain the unconditional distributions of Zi and
as follows.
Proposition 3 (see S6 Appendix for a proof). Let us consider the latent vector Zi and the binary variable Yij in Eq (27) and define with
,
,
, and the related shape parameter
. Then
. Furthermore, the vector of binary outcomes Yi has a multivariate Bernoulli distribution with joint probability mass function,
(32) and Yij has a Bernoulli distribution with success probability
and probability mass function,
(33) where
, Aij = 1 − 2yij,
is the jth diagonal element of Ωi,
, and
with Δij and μij the jth elements of Δi and μi respectively.
Eq (32) conveniently expresses for a value yi of Yi, the likelihood as a cumulative probability of a ST distribution whose location, scale and shape parameters depend on yi, using the identities P(Zij > zij) = P(−Zij < −zij) and sign(Zij) = 2Yij − 1 where sign(⋅) returns the sign of its argument. On using Eq (4) on the distribution of Zi given in Proposition 3, the variance-covariance matrix of the outcomes at the latent scale is
(34)
Thus, in a model with a cluster-specific random intercept (q = 1) with δ = 0 and
, the latent intra-class correlation coefficient (the proportion of variance explained by clustering at latent scale) is given by
(35)
The joint distribution of Zi and bi (i.e.
) is
where
,
,
,
and
. Thus, for j = 1, 2, ⋯, ni and k = 1, 2, ⋯, q, the correlation between Zij and bk is
with
the variance of bk,
the variance of Zij and
is the covariance between Zij and bk, Wij is the jth row of Wi and
is the kth column of
,
is the kth diagonal element of
and δk is the kth element of δ.
The parameters of the SGTLM include β, δε, δ, and ν where the vech(⋅) operator returns the lower triangle elements of its matrix argument. In order to avoid non-regular likelihood problems occurring in models based on the Student distribution and its extensions (in particular when ν is close to zero) [40], we follow some recent related works [24, 41] and first consider ν as known, focusing on
. Classical inference on θ is based on the marginal likelihood of the observed data y. Using Proposition 3, the joint marginal log-likelihood of n independent clusters yi (i = 1, 2, ⋯, n) is:
(36)
From Eq (36), an optimization routine like the R function optim can be used for inference on θ. We however develop an EM algorithm to circumvent the ni-dimensional integral in Eq (32) when estimating θ.
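Direct numerical maximization of a marginal likelihood can be illustrated on a simpler special case. The sketch below does not implement Eq (36): it swaps in the random-intercept probit model (the PM with q = 1), approximates the one-dimensional integral over the random intercept by Gauss-Hermite quadrature, and uses `scipy.optimize.minimize` in place of the R function optim. The function name `neg_loglik`, the number of quadrature nodes and the simulated data are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

NODES, WEIGHTS = np.polynomial.hermite.hermgauss(20)    # rule for integrals against exp(-t^2)

def neg_loglik(theta, y, X, cluster, n_clusters):
    """Negative marginal log-likelihood of a random-intercept probit model."""
    beta, sigma = theta[:-1], np.exp(theta[-1])          # log scale keeps sigma > 0
    sign = 2.0 * y - 1.0                                 # +1 for y = 1, -1 for y = 0
    b = np.sqrt(2.0) * sigma * NODES                     # quadrature points for b_i ~ N(0, sigma^2)
    lin = (X @ beta)[:, None] + b[None, :]               # (N, K) linear predictors
    logp = norm.logcdf(sign[:, None] * lin)              # log P(y_ij | b_i = b_k)
    per_cluster = np.zeros((n_clusters, len(NODES)))
    np.add.at(per_cluster, cluster, logp)                # sum log-probabilities within clusters
    m = per_cluster.max(axis=1, keepdims=True)           # log-sum-exp for numerical stability
    ll = m[:, 0] + np.log(np.exp(per_cluster - m) @ (WEIGHTS / np.sqrt(np.pi)))
    return -ll.sum()

# Simulated data with hypothetical parameter values.
rng = np.random.default_rng(0)
n, ni = 150, 6
cluster = np.repeat(np.arange(n), ni)
X = np.column_stack([np.ones(n * ni), rng.normal(size=n * ni)])
b_true = rng.normal(0.0, 1.0, size=n)
latent = X @ np.array([0.3, -0.8]) + b_true[cluster] + rng.standard_normal(n * ni)
y = (latent > 0).astype(float)

fit = minimize(neg_loglik, x0=np.zeros(3), args=(y, X, cluster, n), method="BFGS")
print(fit.x[:2], np.exp(fit.x[2]))
```

For the SGTLM itself, each cluster contribution is an ni-dimensional skew t probability rather than a one-dimensional integral, which is the computational burden that motivates the EM algorithm.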
Model identifiability
Estimations in the skew generalized t-link model (SGTLM) may produce inconsistent results which would induce unreliable and misleading conclusions, if the model is not identifiable. It is thus of great importance to check whether different points in the parameter space can be distinguished from observations yi. We analyse in this section the identifiability of the SGTLM and indicate when it is necessary to restrict the parameter space to ensure reliable inference from observed data. We restrict attention to the case υ0 = 1 (ignoring Eq (29)) since υ0 is an artificial device only included to ensure a unit variance in the conditional link function as in the traditional probit mixed model.
Although not sufficient, the identifiability of the SGTLM requires both the marginal random effects distribution and the conditional model given random effects to be identifiable. The identifiability of the random effects distribution follows from the identifiability of multivariate skew t distributions. We survey the identifiability of the conditional model before turning to the marginal model.
• Conditional identifiability
Conditional on the random effects bi and for a fixed degrees of freedom parameter ν, the identifiability of the SGTLM reduces to the identifiability of the fixed effects skew-probit model. This follows because the skew t inverse link function is an average of the skew-probit inverse link function with respect to the gamma mixing distribution. The identifiability of the skew-probit model with one covariate has recently been shown to depend on the nature (binary/continuous) of the covariate in the model [42]. Indeed, the fixed effects skew-probit model is not identifiable in the absence of any covariate (i.e. each Xi is a column of ni ones) [42] (page 1624, Proposition 2.1) or in the presence of a binary covariate [42] (page 1626, Proposition 2.2). On the other hand, the fixed effects skew-probit model is identifiable when the covariate is continuous [42] (page 1627, Proposition 2.3). The extension to the case of multiple covariates is straightforwardly obtained by requiring the covariate matrix to be of full column rank, as in the classical linear regression model context. Whenever only binary covariates or no covariate are considered, we recommend setting δϵ = 0 so that the conditional model reduces to a classical probit model.
• Marginal identifiability
From Proposition 3, it appears that when υ0 = 1 the parameters δε and δ enter the marginal distribution of Yi only through the marginal working shape whose jth element can be written
. As a result, caution is required when learning the model parameters from some realizations yi of Yi. Indeed, if the model includes a random intercept term, i.e. Wij has the form
where
, we can partition the random effects working shape as
with
so that the jth element of the marginal working shape reads
. Therefore, only the sum δϵ + δ0 could be estimated and it would not be possible to distinguish in the observed skewness the part due to the random intercept from the part due to the conditional link function. This confounding issue may actually be avoided by considering more complex models based for instance on the fundamental skew distributions [43]. Recall that skewness is introduced in the SGTLM through a hidden standard half normal variable, namely Vi. As opposed to the unique standard half normal variable Vi used for both Zi and bi in the SGTLM, the use of two different standard half normal hidden variables for Zi and bi [44] (page 667 eq 5-6) or two different standard half multivariate normal hidden vectors [45] (page 420 eq 2.2) removes the confounding problem.
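The confounding between link skewness and random intercept skewness can be illustrated by simulation. The sketch assumes a random-intercept model with υ0 = 1 in which the intercept is (δ0 Vi + ei) / Ui^(1/2) and the link error is (δϵ Vi + εij) / Ui^(1/2), with the same Ui and Vi shared within a cluster and the variance of the symmetric part ei held fixed; this form of the stochastic representation, the function name `simulate_latent` and all numerical values are assumptions made for illustration.

```python
import numpy as np

def simulate_latent(eta, cluster, delta_eps, delta_0, sigma2_0, nu, seed):
    """Latent outcomes under an ASSUMED random-intercept skew t representation."""
    rng = np.random.default_rng(seed)
    n = cluster.max() + 1
    u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)    # U_i, with upsilon_0 = 1
    v = np.abs(rng.standard_normal(n))                       # shared half normal V_i
    e0 = rng.normal(0.0, np.sqrt(sigma2_0), size=n)          # symmetric part of the intercept
    eps = rng.standard_normal(len(eta))                      # symmetric part of the link
    b0 = (delta_0 * v + e0) / np.sqrt(u)                     # random intercept
    return eta + b0[cluster] + (delta_eps * v[cluster] + eps) / np.sqrt(u[cluster])

cluster = np.repeat(np.arange(100), 6)
eta = np.zeros(len(cluster))
# Three ways to split the same total skewness 0.6 between link and intercept.
z_link = simulate_latent(eta, cluster, 0.6, 0.0, 1.0, 5.0, seed=7)
z_intercept = simulate_latent(eta, cluster, 0.0, 0.6, 1.0, 5.0, seed=7)
z_split = simulate_latent(eta, cluster, 0.3, 0.3, 1.0, 5.0, seed=7)
y_link, y_intercept, y_split = z_link > 0, z_intercept > 0, z_split > 0
print(np.array_equal(y_link, y_intercept), np.array_equal(y_link, y_split))
```

Because the hidden variables are shared, the latent outcome depends on the two shape parameters only through their sum, so the three configurations generate the same binary data from the same random draws and cannot be told apart from the observed outcomes.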
Fortunately, the non identifiability of the skewness of conditional link function and random intercept does not affect the success probability of the response since this only depends on Δi, but not on the individual values of δϵ and δ0. However, since the conditional link scale depends on δϵ through υ0, the confounding problem affects the scale and thus the interpretation of fixed effects. Moreover, inference on the random intercept is affected since the random intercept variance and skewness depend on δ0. For example, δϵ + δ0 = 0 only indicates null marginal skewness, and in no way absence of link function and random intercept skewness which could be equally strong but of opposite signs. Thus, only a lower bound can be given to the random intercept variance: where
is the first diagonal element of
. To rule out this peculiar situation where the model is not fully marginally identifiable, some previous works on skew normal/skew t distributions have imposed the restriction δϵ = 0 (regardless of the presence of a random intercept term), for instance in the context of linear mixed effects models [30, 46, 47] (page 1492 eq 2, page 4100 eq 4, and page 309 eq 3.2 respectively), multivariate measurement error models [48] (page 35, Eq 4.11) and nonlinear mixed effects models [49] (page 7 eq 10), but no argument was given to support this choice.
In the very common situation where the mixed model includes a random intercept term, prior information on the shape of the link function and/or the random intercept is required to place a meaningful restriction on the parameter space, for instance δϵ = 0, δ0 = 0 or δϵ = δ0. In the absence of such information, we recommend the restriction δ0 = 0 because the success probability of a response may exhibit skewness irrespective of the presence of random effects. This restriction thus recovers a fixed effects skew generalized t-link model when no random effect is considered. Overall, the restriction δ0 = 0 simply expresses the inability of the SGTLM to capture any additional skewness structure from the data through the inclusion of a random intercept alone. For completeness, we develop in the next section an estimation procedure for the full model in Eq (27), since any equality restriction on δϵ and δ0 can be straightforwardly reduced to δ0 = 0.
The two restrictions discussed in this section relate to the structure of the quantity Δi and are required only for some specific data structures (models including a binary covariate or models with a random intercept only). However, even when a restriction is required on δε or the first element of δ, the quantity Δi itself can remain unrestricted. When a restriction is required, it forces the skewness in the data to be summarized either by δε or by elements of δ. Overall, the SGT-link function is always allowed to be skewed (unconditional link), but some designs do not make it possible to distinguish skewness in the conditional link function from skewness in the distribution of random effects.
Maximum likelihood inference
Estimation via the EM algorithm.
The choice of the value of υ0 in Eq (29) is in line with one of our purposes: rescale fixed effects so that they have the same interpretation as in the mixed probit model. There is no need to define υ0 depending on situations, because in our proposal, υ0 is fixed. However, since υ0 is simply a scaling factor, it may be given any positive value during estimation, as long as the estimates are rescaled after the convergence of the estimation procedure so that υ0 is finally given by Eq (29). Indeed, because routines are basically written for the skew t distributions, we used υ0 = 1 in the EM algorithm and rescaled the estimates at the end of the procedure. Let us consider the complete data . Because yi only retains the signs of elements of zi, the joint density of yi and zi is
where
with
if yij = 0 and
if yij = 1. The density of
is thus
. Hence by Bayes’s rule and in light of Eq (27) with υ0 = 1, the density of
is:
(37)
where fV(vi) = 2ϕ(vi) I(0,∞)(vi) and fG(⋅|ν/2, ν/2) is given in Eq (9). By Eq (37) and on setting
and
, the complete data log-likelihood is:
(38)
where
, and tr(⋅) is the trace operator. Let
the estimate of θ at the kth EM iteration. The E-step of the (k + 1)th iteration finds the expectation
of ℓ(⋅|ycom) given the observed data y and the current parameter estimate
:
(39)
where
and
so that we have
Note that the conditional expectation of
is 0 since given yi,
. The E-step thus reduces to the computation of the conditional expectations
,
,
,
,
,
,
,
,
,
and
. The expressions for these expectations (except
) are given in the following result, where we have dropped the superscript (k) for simplicity.
Proposition 4 (see S7 Appendix for a proof). Consider the random variables Yij, Zi, bi, Ui, and Vi as defined in Eq (27) with υ0 = 1, and an update of the model parameter θ. Let
,
,
,
,
,
,
,
,
,
. Then:
(40)
(41)
(42)
(43)
(44)
(45)
(46) where
,
,
,
, and the expectations
,
,
,
and
are to be evaluated directly using Corollary 1 applied to the conditional latent vector
,
with
if yij = 0 and
if yij = 1.
The M-step jointly maximizes over θ. This yields the following updating expressions for θ.
Proposition 5 (see S8 Appendix for a proof) Consider an identifiable SGTLM as defined in Eq (27) with υ0 = 1; and an estimate of θ. Set
,
,
,
,
and
. At EM iteration (k + 1), the updates of β, δε, δ and
are given by:
(47)
(48)
(49)
(50)
At convergence of the EM algorithm, we obtain the estimate of θ. The corresponding estimate of the variance-covariance matrix of random effects is
. More generally, when ν is not actually known, the M-step of the EM algorithm can be extended to include a profiled marginal log-likelihood maximization step. Indeed, at EM iteration k, we notice that the estimate
of θ depends on ν only through
and the profiled marginal log-likelihood Lν(⋅) for ν can be obtained by simply substituting
for θ in Eq (36). We can thus find
using a one dimensional optimization routine (e.g. optimize in R) to maximize Lν(⋅). Then, the update of the parameter θ becomes
. The use of the profiled marginal log-likelihood instead of a profiled version of
can provide substantial gain of efficiency [50] and mostly helps bypass the calculation of
which does not have any known closed form.
It is worth noting, however, that the inclusion of a profiled marginal log-likelihood maximization step would prevent the convergence of the whole estimation procedure if the marginal log-likelihood in Eq (36), as a function of ν, is unbounded for a particular dataset. This issue arises especially when ν is close to zero. Another challenge associated with the estimation of ν is computing time: this strategy requires a very fast routine to compute cumulative probabilities of the skew t distributions. As an alternative route to estimate ν, we point out the model selection approach of [51] (page 893). It consists in setting a grid of feasible values of ν and obtaining the corresponding sequence of estimates of θ. Then, the couple ν and
maximizing the marginal log-likelihood in Eq (36) is taken as the estimates of ν and θ.
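The grid strategy can be sketched on a much simpler stand-in model. The example below does not fit the SGTLM: it assumes a univariate location-scale Student t sample, fits the location and scale by the textbook EM algorithm for each fixed ν, and keeps the grid value with the largest log-likelihood. The function names (`t_loglik`, `fit_t_fixed_nu`, `select_nu`), the simulated data and the grid are all hypothetical choices made for illustration.

```python
import math
import random


def t_loglik(x, mu, sigma2, nu):
    """Log-likelihood of a location-scale Student t sample."""
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi * sigma2))
    return sum(const - (nu + 1) / 2 * math.log1p((xi - mu) ** 2 / (nu * sigma2))
               for xi in x)


def fit_t_fixed_nu(x, nu, n_iter=200):
    """EM fit of (mu, sigma2) for a fixed nu; a stand-in for the model fit."""
    n = len(x)
    mu = sum(x) / n
    sigma2 = sum((xi - mu) ** 2 for xi in x) / n
    for _ in range(n_iter):
        # E-step: conditional expectations of the latent gamma weights.
        w = [(nu + 1) / (nu + (xi - mu) ** 2 / sigma2) for xi in x]
        # M-step: weighted location and scale updates.
        mu = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        sigma2 = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, x)) / n
    return (mu, sigma2), t_loglik(x, mu, sigma2, nu)


def select_nu(x, grid):
    """Fit the model at each grid value of nu and keep the best log-likelihood."""
    fits = {nu: fit_t_fixed_nu(x, nu) for nu in grid}
    best_nu = max(fits, key=lambda nu: fits[nu][1])
    theta_hat, loglik = fits[best_nu]
    return best_nu, theta_hat, loglik


# Simulated heavy-tailed sample: normal draw divided by sqrt of a gamma weight.
rng = random.Random(42)
true_nu = 4.0
x = [rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(true_nu / 2, 2 / true_nu))
     for _ in range(300)]
grid = [2.5, 3.0, 4.0, 5.0, 7.0, 10.0, 15.0]
best_nu, theta_hat, best_ll = select_nu(x, grid)
print(best_nu, theta_hat, best_ll)
```

A coarse grid keeps the number of full fits small; refining the grid around the selected value is a natural second pass when each fit is expensive.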
Accelerating EM via parameter-expansion.
Besides its attractiveness and stability for handling incomplete data models, the EM algorithm sometimes converges slowly, which has motivated many methods to accelerate its linear convergence rate. Among popular EM accelerators, the parameter-expanded (PX) EM algorithm was proposed by [27]. Let us consider a complete data model F(ycom|θ). The PX-EM algorithm expands F(ycom|θ) to a larger model FX(ycom|Θ) parameterized by Θ = (θ⋆, α), where θ⋆ plays in FX(ycom|Θ) the role of θ in F(ycom|θ) and α is a working parameter. The use of the PX-EM algorithm requires that (i) α admits a value α0 that preserves the original complete data model and (ii) the observed-data model is preserved by a many-to-one reduction function R: Θ ↦ θ = R(Θ) which allows an unambiguous recovery of θ from Θ. We refer to [27] for more details. For the SGTLM, let us consider the following expanded complete data model obtained by including a working q × q scale matrix α into the linear predictor as ηi = Xi β⋆ + Wi α bi:
This expanded model equals the SGTLM in Eq (27) when α takes the value α0 = Iq, and has expanded parameter
where vec is the usual operator which stacks the columns of its matrix argument. Under this model, the marginal distribution of Yi remains as given in Eq (32) with
and
so that Θ reduces as
. As the observed-data model is preserved whatever the value of α, we fix α = Iq at each E-step of the EM procedure. Therefore, the E-step of the PX-EM algorithm uses Proposition 4 to obtain conditional expectations required in Eq (39) as for the classical EM algorithm. At the M-step, the estimates of δ⋆ and
are still given by Eqs (49)–(50) respectively whereas the estimates of δε⋆, β⋆ and vec(α) are:
(51)
(52)
where
,
, and
with ⊗ the direct product operator. Using the reduction function, the original model parameter estimates can be recovered as
,
,
and
. In the neighbourhood of the ML estimate of θ, the working scale estimate
becomes close to α0 = Iq [27] so that the advantage of the PX-EM algorithm over the classical EM algorithm disappears. We thus propose to stop the PX acceleration once |λmax| < ϵ with λmax the dominant eigenvalue of
and ϵ a pre-specified tolerance value (e.g. ϵ = 10^−2).
Summary of the estimation procedure.
The estimation procedure starts with a parameter , k = 0 and iterates steps 1) to 6) below until convergence, then applies the final rescaling step 7).
1) E-step: compute conditional expectations defined by Eqs (40)–(46) with
.
2) PX M-step: obtain
and
using Eqs (49)–(52) and the reduction function:
,
,
and
.
3) Test: compute λmax the dominant eigenvalue of
. If |λmax| < 10^−2 then compute the marginal likelihood
using Eq (36) and go to 4) with k = k + 1, otherwise return to 1) with k = k + 1.
4) E-step: compute conditional expectations defined by Eqs (40)–(46) with
.
5) M-step: obtain
using Eqs (47)–(50).
6) Test: compute the marginal likelihood
using Eq (36). If
then go to 7), otherwise return to 4) with k = k + 1.
7) Rescaling: compute υ0 using Eq (29) and rescale the estimates as
,
and
. Return
.
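The two-phase control flow above (PX-EM first, then classical EM until the log-likelihood stabilizes) can be illustrated on a toy model. The sketch below is not the SGTLM procedure: it uses the univariate Student t model with known ν, for which a parameter-expanded M-step with a scalar working parameter is well known. Several points are assumptions made for illustration: the departure of the working parameter from its null value, |α − 1|, stands in for |λmax|; the stopping rule of the second phase is a relative change in log-likelihood; and the rescaling step is omitted because Eq (29) has no counterpart in the toy model. All function names are hypothetical.

```python
import math
import random


def t_weights(x, mu, sigma2, nu):
    """E-step: conditional expectations of the latent gamma weights."""
    return [(nu + 1) / (nu + (xi - mu) ** 2 / sigma2) for xi in x]


def t_loglik(x, mu, sigma2, nu):
    """Observed-data log-likelihood of the location-scale Student t model."""
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi * sigma2))
    return sum(const - (nu + 1) / 2 * math.log1p((xi - mu) ** 2 / (nu * sigma2))
               for xi in x)


def fit_two_phase(x, nu, px_tol=1e-2, ll_tol=1e-8, max_iter=500):
    n = len(x)
    mu = sum(x) / n
    sigma2 = sum((xi - mu) ** 2 for xi in x) / n
    # Phase 1 (steps 1-3): PX-EM until the working parameter is near its null value.
    for _ in range(max_iter):
        w = t_weights(x, mu, sigma2, nu)
        alpha = sum(w) / n  # working scale estimate, equal to 1 at the ML estimate
        mu = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        # Reduction step: divide the expanded scale estimate by alpha.
        sigma2 = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, x)) / (n * alpha)
        if abs(alpha - 1.0) < px_tol:
            break
    ll_old = t_loglik(x, mu, sigma2, nu)
    # Phase 2 (steps 4-6): classical EM until the log-likelihood stabilizes.
    for _ in range(max_iter):
        w = t_weights(x, mu, sigma2, nu)
        mu = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        sigma2 = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, x)) / n
        ll_new = t_loglik(x, mu, sigma2, nu)
        converged = abs(ll_new - ll_old) < ll_tol * (1.0 + abs(ll_old))
        ll_old = ll_new
        if converged:
            break
    return mu, sigma2, ll_old


rng = random.Random(2021)
nu = 5.0
x = [rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(nu / 2, 2 / nu))
     for _ in range(200)]
mu_hat, sigma2_hat, ll_hat = fit_two_phase(x, nu)
print(mu_hat, sigma2_hat, ll_hat)
```

Switching to the classical M-step once the working parameter has stabilized mirrors the rationale given earlier: near the ML estimate the expanded and original updates nearly coincide, so the extra work of the expanded step no longer pays off.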
Approximate standard errors.
To allow asymptotic inference in the SGTLM, we follow the empirical information-based method of [52] (pages 132-133) to compute the asymptotic variance-covariance matrix of the ML estimate of θ under some general regularity conditions. The observed information matrix is defined to be
where
,
,
being the contribution of the single observation yi to the expected complete data log-likelihood in Eq (39). On setting
, the elements
of the score gi can be explicitly evaluated using:
(53)
(54)
(55)
(56)
Afterwards, the standard errors of estimated model parameters are approximated by square roots of diagonal elements of
and confidence intervals can be built assuming asymptotic normality.
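As a minimal sketch of this recipe, the example below assumes the usual form of the empirical information at the ML estimate, namely the sum over observations of the outer products of the individual score vectors, and applies it to a toy normal model with parameters (μ, σ²) rather than to the SGTLM scores of Eqs (53)–(56). The helper names and the small dataset are hypothetical.

```python
import math


def empirical_information(scores):
    """Sum of outer products of per-observation score vectors."""
    p = len(scores[0])
    info = [[0.0] * p for _ in range(p)]
    for g in scores:
        for a in range(p):
            for b in range(p):
                info[a][b] += g[a] * g[b]
    return info


def invert_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# Toy data and ML estimates for an i.i.d. normal sample.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
n = len(x)
mu_hat = sum(x) / n
s2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n

# Per-observation scores with respect to (mu, sigma^2), evaluated at the MLE.
scores = [((xi - mu_hat) / s2_hat,
           -0.5 / s2_hat + (xi - mu_hat) ** 2 / (2 * s2_hat ** 2))
          for xi in x]

info = empirical_information(scores)
cov = invert_2x2(info)
se = [math.sqrt(cov[k][k]) for k in range(2)]

# Wald-type 95% confidence interval for mu, assuming asymptotic normality.
ci_mu = (mu_hat - 1.96 * se[0], mu_hat + 1.96 * se[0])
print(se, ci_mu)
```

For this toy model the standard error of the mean obtained from the empirical information equals the familiar square root of σ̂²/n, which gives a quick sanity check of the construction.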
Empirical Bayes estimators of random effects and weights.
In this section, we provide the empirical Bayes estimators of cluster specific random effects and weights that are useful for evaluating individual intercepts and slopes as well as detecting outlying individuals. From Eq (27), the distribution of bi conditional on Zi = zi, Ui = ui and Vi = vi is multivariate normal with mean and covariance matrix
where
, si = δ − ri Δi,
,
, and
. The conditional mean of bi given Yi = yi is thus:
(57)
where
,
,
and the quantities
,
,
and
are to be evaluated using Corollary 1 applied to
with
,
,
,
if yij = 0 and
if yij = 1. The empirical Bayes estimators of bi can then be obtained as
.
For outlying individuals detection, individual weights Ui are predicted by [53] which is given by Eq (16) in Corollary 1 applied to Zi|Yi = yi. The empirical Bayes estimators of Ui are thus given by
. Relatively low weights (< 1) are indicative of outlying individuals.
Applications
This section presents a simulation study for assessing performance of SGTLM, and an application of the modeling approach to a real dataset.
Simulation study
We conducted a simulation to evaluate the proposed approach to the analysis of correlated binary data. The simulation experiment targeted four specific objectives. First, it assessed for different sample sizes, the abilities of the probit (PM), the skew-probit (SPM), the generalized t-link (GTLM) and the skew generalized t-link (SGTLM) models to recover population parameters when the common normality assumption for the link function is either violated or not. The widely used logistic model was not investigated as the logistic distribution can be considered as a special case of the Student t distribution [8] hence the logistic model is a special case of GTLM. Second, the experiment evaluated the extent to which asymptotic 95% confidence intervals (CI95%) can detect the presence of spurious skewness. Third, the experiment evaluated the ability of empirical Bayes estimators of random effects to predict true random effects. Finally, the simulation study assessed the ability of Akaike’s information criterion (AIC), Schwarz’s Bayesian information criterion (BIC) and Hannan-Quinn criterion (HQ) to select the correct model fit. All computations were performed in R.
Simulation design.
Mimicking the structure of the simulation model studied in [24] (page 1116), we considered the following GLMM:
where ηi = (ηi1, ⋯, ηi6)⊤, ηij = β0 + β1 X1i + b0i + b1i W1ij, bi = (b0i, b1i)⊤; X1 is a dichotomous covariate (Bernoulli distribution with success probability 0.5) and W1 is a continuous occasion-varying random covariate (standard normal distribution); β0 is an intercept, i.e. the general mean of the linear predictor ηij, and β1 is the fixed effect associated with the covariate X1, with values arbitrarily fixed to β0 = 1 and β1 = −1; b0i is a random intercept associated with the cluster i, b1i is the random slope associated with W1i;
is a 2 × 2 scale matrix with diagonal elements 0.5 and 1, and off diagonal element 0.25;
is a positive distribution with finite first two negative moments, i.e.
for t = 1, 2; and
. We considered δ = (0, δ1)⊤, i.e. a null random intercept skewness to ensure the identifiability of the model.
Under this general class of SSMN latent models, we considered two data models. The first is the probit data model where U = 1, δ1 = 0 and δε = 0 (probit link, υ0 = 1). The second is the skew generalized t-link data model with δε = −2, δ1 = 2 and ν = 5 (υ0 = 0.4598). For each data model, we considered sample sizes (n) of 100, 500 and 1000 and thus generated three sets of covariates which were used for all simulations involving each of the two data models. Under each of the six resulting simulation settings, we generated 250 datasets to which we fitted the four fitting models under evaluation (PM, SPM, GTLM, SGTLM), considering the model degrees of freedom as known and equal to ν = 5 for the SGTLM. Fixed effects (β0 and β1) and skewness parameters (δϵ and δ1) were initialized to zero whereas the scale matrix
was initialized to the 2×2 identity matrix.
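For concreteness, the probit data model of this design (U = 1, no skewness) can be simulated as sketched below. The sketch assumes the usual latent-variable convention that the response equals 1 when the latent variable, the linear predictor plus standard normal noise, is positive. It redraws the covariates on every call, whereas the study reused one covariate set per sample size, and it does not cover the skew generalized t-link data model. The function name and the seed are hypothetical.

```python
import math
import random


def simulate_probit_glmm(n_clusters, n_obs=6, beta=(1.0, -1.0),
                         sigma=((0.5, 0.25), (0.25, 1.0)), seed=1):
    """Simulate clustered binary data from a probit model with a random
    intercept and a random slope (probit data model of the design)."""
    rng = random.Random(seed)
    # Cholesky factor of the 2 x 2 random effects scale matrix.
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 ** 2)
    rows = []
    for i in range(n_clusters):
        x1 = 1 if rng.random() < 0.5 else 0          # cluster-level binary covariate
        e0, e1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        b0 = l11 * e0                                # random intercept
        b1 = l21 * e0 + l22 * e1                     # random slope
        for j in range(n_obs):
            w1 = rng.gauss(0.0, 1.0)                 # occasion-varying covariate
            eta = beta[0] + beta[1] * x1 + b0 + b1 * w1
            z = eta + rng.gauss(0.0, 1.0)            # latent variable
            y = 1 if z > 0 else 0                    # observed sign of the latent variable
            rows.append((i, j, x1, w1, y))
    return rows


data = simulate_probit_glmm(n_clusters=100)
print(len(data), sum(r[4] for r in data) / len(data))
```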
Performance measures.
In addition to estimates of fixed effects (, k = 0, 1) and skewness (
for SPM and SGTLM, l = ϵ, 1) and related empirical standard errors and CI95%, we recorded random effects variances (
of b0i and
of b1i) and covariance (σ12) and their approximate standard errors derived using the delta method [54] as implemented in the R package car [55], empirical Bayes estimates of individual random effects (
), and the AIC, BIC and HQ criteria defined as:
,
, and
where
is the maximized log-likelihood value, N is the total number of observations and Np is the number of estimated model parameters. These data were used to compute various performance measures (Table 1) including the relative bias (%Bias) and the root mean square error (RMSE) in
and
; the standard deviations (SD) of
; the quadratic mean (
) of standard errors of
; the coverage probabilities (
) of
and
, i.e. the proportion of times the CI95% for βk or δl included the true value; the arithmetic mean (
) of the square of Pearson’s correlation (coefficient of determination) between simulated and Bayes estimates of subject random effects; and the arithmetic means of information criteria AIC (
), BIC (
) and HQ (
) across the 250 simulated datasets per simulation setting.
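The performance measures can be computed along the following lines. The formulas are the standard textbook definitions and are assumed here to match those of Table 1 and of the information criteria above, which are not reproduced in this text: relative bias as 100 × (mean estimate − true value) / true value, Wald-type coverage with the 1.96 normal quantile, AIC = −2ℓ + 2Np, BIC = −2ℓ + Np log N and HQ = −2ℓ + 2Np log(log N). Function names and the small inputs are hypothetical.

```python
import math


def relative_bias(estimates, truth):
    """Relative bias in percent; undefined when the true value is zero."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - truth) / truth


def rmse(estimates, truth):
    """Root mean square error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def quadratic_mean(std_errors):
    """Quadratic mean of the estimated standard errors."""
    return math.sqrt(sum(s ** 2 for s in std_errors) / len(std_errors))


def coverage(estimates, std_errors, truth, z=1.96):
    """Proportion of Wald-type intervals that contain the true value."""
    hits = sum(1 for e, s in zip(estimates, std_errors)
               if abs(e - truth) <= z * s)
    return hits / len(estimates)


def information_criteria(loglik, n_par, n_obs):
    """AIC, BIC and HQ from the maximized log-likelihood."""
    aic = -2.0 * loglik + 2.0 * n_par
    bic = -2.0 * loglik + n_par * math.log(n_obs)
    hq = -2.0 * loglik + 2.0 * n_par * math.log(math.log(n_obs))
    return aic, bic, hq


estimates = [1.1, 0.9, 1.0]
std_errors = [0.1, 0.1, 0.02]
print(relative_bias(estimates, 1.0), rmse(estimates, 1.0))
print(coverage(estimates, std_errors, 1.0))
print(information_criteria(loglik=-100.0, n_par=5, n_obs=600))
```

Because the relative bias divides by the true value, it is not defined for parameters whose true value is zero, such as the skewness parameters under the probit data model.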
Simulation results.
Simulation results presented in Tables 2 and 3 show that under the probit data generation mechanism, the probit, the skew-probit, the generalized t (GT)-link and the skew generalized t (SGT)-link models recovered the population parameter values. Indeed, the relative bias was below 5% at all levels for fixed effects, and below 20% for variance components. We particularly noticed a high relative bias in the variance component (%Bias = 17.33) under the probit fit to probit data (i.e. the true model) with a small sample size (n = 100). This may be explained by the maximum likelihood estimation method, which is known to provide biased variance component estimates [56]; opting for a residual maximum likelihood estimation procedure can reduce this bias [56]. It is worth noting, however, that the estimation improves as the sample size increases and that the empirical standard error estimates agree with the standard deviations from the simulations. The results for n = 100 and n = 500 are consistent with findings in [24] where empirical information based standard errors approached Monte Carlo standard errors. Moreover, the 95% confidence intervals for the skewness parameters allow spurious skewness to be detected in the skew-probit and the SGT-link models, with coverage probabilities of 100%. This result can be explained by the high accuracy of information based standard errors noted above for this type of model. The power in predicting random effects varied from R2 = 0.45 to R2 = 0.47 for random intercepts and from R2 = 0.23 to R2 = 0.28 for random slopes, but was comparable for the three fitting models. Finally, on average, all model selection criteria correctly identified the probit fitting model as the parsimonious model.
Under a SGT-link data generation mechanism, the probit model performed poorly, showing large relative bias in fixed effects, which decreased from 46% for samples of size n = 100 to 18% for samples of size n = 1000 (Table 4). The SGT-link model estimates were the least biased (%Bias < 12) as well as the most accurate, with the lowest root mean square errors across all levels (Table 5). The same observations apply to variance components, which were highly biased downward for the probit, skew-probit and GT-link models (%Bias up to 90) relative to the SGT-link model (%Bias < 7). Regarding estimates of skewness parameters, the coverage probability was low (53% to 94%) for the small sample size (n = 100) and approached the nominal value (95%) for larger sample sizes (n = 500, 1000). The skew-probit model estimates (coverage probability down to 53%) were less reliable than estimates from the SGT-link model (coverage probability above 90%).
Clearly, the SGT-link model fitted non-normal data better and, accordingly, random effects prediction was better with the SGT-link model (R2 ≥ 0.49) than with the probit or the skew-probit model. Moreover, all the considered model selection criteria, namely AIC, BIC and HQ, on average correctly selected the SGT-link model as the preferred model.
Application to the respiratory infection data
To demonstrate the usefulness of the proposed approach to correlated binary data modeling, we revisited the respiratory illness data (available in geepack package [57] in R) which was used by [24] to illustrate their t-link GLMM. The respiratory illness data was obtained from a clinical study of the effect of a treatment on 111 patients with respiratory illness, recruited from two different clinical centers. The patients were examined and their respiratory state (categorized as 1 = good, 0 = poor) determined (baseline). They were then randomized to receive either placebo or an active treatment. The goal of the study was to determine whether the treatment induced a better respiratory state in treated patients. The outcome is the respiratory state measured at four visits for each patient as good (y = 1) or poor (y = 0). In addition to the treatment (treat = 0 for placebo group (P) and treat = 1 for treated group (A)), the following fixed covariates were included: the clinical center (center = 0 for the first center and center = 1 for the second center), the baseline (respiratory state at the first visit), gender (sex = 0 for female (F) and sex = 1 for male (M)) and the interaction of treatment and gender. Following [24], we assumed that the age effect is patient-specific (random slope) and thus considered the patient age centered around its median (31 years) as a random covariate. Since the fixed covariates included binary variables (treat, gender, center and baseline), a conditional skew-probit model is not identifiable given random effects and we thus set δϵ = 0 to ensure identifiability.
For the purpose of comparison, we fitted the probit, skew-probit, GT-link and SGT-link models. We initialized fixed effects β and the random slope skewness parameter δ to zero whereas the random slope scale was initialized to one. For the GT-link and the SGT-link models, we considered the model selection approach of [51] with degrees of freedom ν = 2.5, 2.6, …, 15.
As depicted in Fig 1, the profiled marginal log-likelihood for the GT-link model is unbounded, with smaller ν corresponding to better fit in accordance with the t-link model fits in [24] (ν ≤ 4). We thus set ν = 2.5 for the t-link model. For the SGT-link fit, Fig 1 indicates that the log-likelihood is bounded with a maximum at ν = 3.7, suggesting heavy tail link function and random slope distributions. The difference in the behaviours of the GT-link and SGT-link models may be explained by the implication of ν in the location of the skew t-link model through (see Eq (32)).
The maximum likelihood (ML) estimates under the probit, skew-probit, GT-link and SGT-link models (Table 6) are somewhat close for the four fitted models, which all show that respiratory illness is associated with clinical center, baseline state and treatment, with the treatment effect varying with gender. The SGT-link fit additionally indicates that, irrespective of the treatment, the respiratory state is poorer for male patients (Table 6, ) than for females. We notice for this dataset that the intercept coefficient estimate increases with model complexity and that estimates of fixed effects and their respective standard errors are shrunk toward zero for the skew-probit model relative to the probit one, and for the skew-probit model relative to the SGT-link one. The skew-probit model fit also gave a higher skewness (
) as compared with the SGT-link model fit (
). Although the estimated skewness is relatively low for both skew-probit and SGT-link models, the use of a skewed and heavy-tailed link clearly improved not only the precision of estimates but also the adequacy between data and model. Indeed, the asymptotic 95% confidence interval for δ under the SGT-link includes zero (CI95% = [−0.0010, 0.0534]), but we noticed from the simulation results that asymptotic CI95% for skewness parameters become reliable only in large samples (n ≥ 500), whereas information criteria are reliable for all tested sample sizes. Thus, based on the AIC, BIC and HQ criteria in Table 6, the SGT-link fit is the best for the respiratory illness data. The estimate of the variance of the random slope of age is
for the SGT-link fit, with close values under probit and skew-probit models. From the SGT-link fit, it appears that the treatment induced an overall better respiratory state for treated patients (with a negative coefficient β4 = −2.0398 for the placebo group). Moreover, the treatment has on average a better effect on female patients than on male patients (with a positive coefficient, β5 = 1.3425 for male patients in the placebo group). However, as noted by [24], new studies are required to check this latter trend because of the highly unbalanced proportion of males (79%) and females (21%) in the data.
Conclusion
This work has considered the skew generalized t class of distributions for both link and random effects distributions in mixed models for binary data. The objective was to improve the exploitation of binary data bearing oddities such as skewness and tails thicker/thinner than those of the normal distribution. To allow inference in such models, we developed a maximum likelihood estimation procedure based on the EM algorithm. We combined results from [34] and [37] to obtain expressions for computing moments of truncated multivariate skew t distributions. The computation used existing R functions for the multivariate skew t cumulative distribution function. Our simulation experiment showed that, irrespective of sample size, the SGT-link model outperforms the probit GLMM when the underlying data generation mechanism is not normal. We also demonstrated that the skew generalized t-link model performed better than the skew-probit and the generalized t-link GLMMs when the underlying data is both skewed and heavy tailed.
An important finding is that when the model degrees of freedom ν is small and very large values are assumed (fitting probit and skew-probit models), the estimates of fixed effects are biased, whereas when ν is large but small values are assumed, the estimates of fixed effects are not biased. Moreover, asymptotic inference using information based standard errors proved highly accurate in detecting spurious skewness in large samples (n ≥ 500), and information criteria on average selected the correct model fit for all tested sample sizes (n = 100, 500, 1000). These findings extend results in [24] on the t-link GLMM to the SGT-link GLMM, asserting that information criteria are reliable for selecting the best model for a particular dataset.
However, the simulation experiments revealed that the EM algorithm has a high computational cost. For instance, in a model with q = 2 random effects, n = 100 clusters and ni = 6 observations per cluster, the mean running time for the SGT-link model fit was 4.76 minutes, which is almost 135 times the time required by the probit model fit (2.12 seconds). Our implementation relies on the pmst function of the R package sn [35] to compute the cumulative probabilities of skew t distributions. This function applies a one-dimensional numerical integration routine of R to the multivariate normal cumulative distribution function. The use of the EM algorithm for large q values (e.g. q = 10, 15) requires the prior development of a faster routine for the computation of cumulative probabilities of skew t distributions. This will make the EM algorithm scalable for large q + ni. On multicore platforms, parallel computing can also substantially speed up computations. The expressions provided for computing moments of truncated multivariate skew t distributions are limited to models with ν > 2. The use of formulae given in [38] will extend our EM algorithm to very small degrees of freedom (1 < ν ≤ 2).
Binary data related to very rare events often require special treatment and are generally analysed using zero inflated models [58]. The development of a skew generalized t-link model with zero inflation can significantly improve the exploitation of such data. In addition to binary data, GLMMs handle other data types like count, proportional and ordinal outcomes. From the good performance demonstrated in this work and in previous related ones [9, 24], we believe that the simultaneous introduction of flexible links and random effects distributions in GLMM would benefit knowledge extraction from observed data in applied research fields where advances rely on modeling capacity.
Supporting information
S1 Appendix. Proof of Lemma 1.
This supporting information gives a proof of Lemma 1.
https://doi.org/10.1371/journal.pone.0249604.s001
(PDF)
S2 Appendix. Proof of Lemma 2.
This supporting information gives a proof of Lemma 2.
https://doi.org/10.1371/journal.pone.0249604.s002
(PDF)
S3 Appendix. Proof of Proposition 1.
This supporting information gives a proof of Proposition 1.
https://doi.org/10.1371/journal.pone.0249604.s003
(PDF)
S4 Appendix. Proof of Corollary 1.
This supporting information gives a proof of Corollary 1.
https://doi.org/10.1371/journal.pone.0249604.s004
(PDF)
S5 Appendix. Proof and limiting case of Proposition 2.
This supporting information gives a proof of Proposition 2. The first two moments of truncated multivariate skew normal distributions (limiting case as ν → ∞) are also given (required for fitting skew-probit link models).
https://doi.org/10.1371/journal.pone.0249604.s005
(PDF)
S6 Appendix. Proof of Proposition 3.
This supporting information gives a proof of Proposition 3.
https://doi.org/10.1371/journal.pone.0249604.s006
(PDF)
S7 Appendix. Proof of Proposition 4.
This supporting information gives a proof of Proposition 4.
https://doi.org/10.1371/journal.pone.0249604.s007
(PDF)
S8 Appendix. Proof of Proposition 5.
This supporting information gives a proof of Proposition 5.
https://doi.org/10.1371/journal.pone.0249604.s008
(PDF)
Acknowledgments
The authors wish to thank the editor and two referees for their relevant comments and suggestions. They are also grateful to Matthews Lazaro (Kamuzu College of Nursing, Lilongwe, Malawi) for the time he devoted to edit the manuscript for language usage, spelling, and grammar.
References
- 1. El-Saeiti IN. Performance of mixed effects for clustered binary data models. In: AIP Conference Proceedings. vol. 1643. AIP; 2015. p. 80–85.
- 2. Nelder JA, Wedderburn RW. Generalized linear models. Journal of the Royal Statistical Society: Series A (General). 1972;135(3):370–384.
- 3. McCulloch CE. Maximum likelihood variance components estimation for binary data. Journal of the American Statistical Association. 1994;89(425):330–335.
- 4. Chen MH. Skewed link models for categorical response data. In: Skew-Elliptical Distributions and Their Applications. Chapman and Hall/CRC; 2004. p. 151–172.
- 5. McCulloch CE, Neuhaus JM. Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical Science. 2011;26(3):388–402.
- 6. Czado C, Santner TJ. The effect of link misspecification on binary regression inference. Journal of statistical planning and inference. 1992;33(2):213–231.
- 7. Stewart MB. Semi-nonparametric estimation of extended ordered probit models. Stata Journal. 2004;4(1):27–39.
- 8. Liu C. Robit regression: a simple robust alternative to logistic and probit regression. In: Gelman A, Meng XL, editors. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. England: Wiley London; 2004. p. 227–238.
- 9. Kim S, Chen MH, Dey DK. Flexible generalized t-link models for binary response data. Biometrika. 2008;95(1):93–106.
- 10. Abanto-Valle CA, Dey DK. State space mixed models for binary responses with scale mixture of normal distributions links. Computational Statistics & Data Analysis. 2014;71:274–287.
- 11. Basu S, Mukhopadhyay S. Binary response regression with normal scale mixture links. BIOSTATISTICS-BASEL-. 2000;5:231–242.
- 12. Pinheiro JC, Liu C, Wu YN. Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics. 2001;10(2):249–276.
- 13. Chen MH, Dey DK, Shao QM. A new skewed link model for dichotomous quantal response data. Journal of the American Statistical Association. 1999;94(448):1172–1186.
- 14. Komori O, Eguchi S, Ikeda S, Okamura H, Ichinokawa M, Nakayama S. An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution. 2016;7(2):249–260.
- 15. Lemonte AJ, Bazán JL. New links for binary regression: an application to coca cultivation in Peru. Test. 2018;27(3):597–617.
- 16. Asgharzadeh A, Esmaeili L, Nadarajah S, Shih S. A generalized skew logistic distribution. REVSTAT–Statistical Journal. 2013;11(3):317–338.
- 17. Carlin JB, Wolfe R, Brown CH, Gelman A. A case study on the choice, interpretation and checking of multilevel models for longitudinal binary outcomes. Biostatistics. 2001;2(4):397–416. pmid:12933632
- 18. Agresti A, Caffo B, Ohman-Strickland P. Examples in which misspecification of a random effects distribution reduces efficiency, and possible remedies. Computational Statistics & Data Analysis. 2004;47(3):639–653.
- 19. Chen J, Zhang D, Davidian M. A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution. Biostatistics. 2002;3(3):347–360. pmid:12933602
- 20. Nelson KP, Lipsitz SR, Fitzmaurice GM, Ibrahim J, Parzen M, Strawderman R. Use of the probability integral transformation to fit nonlinear mixed-effects models with nonnormal random effects. Journal of Computational and Graphical Statistics. 2006;15(1):39–57.
- 21. Hosseini F, Eidsvik J, Mohammadzadeh M. Approximate Bayesian inference in spatial GLMM with skew normal latent variables. Computational Statistics & Data Analysis. 2011;55(4):1791–1806.
- 22. Broström G, Holmberg H. Generalized linear models with clustered data: Fixed and random effects models. Computational Statistics & Data Analysis. 2011;55(12):3123–3134.
- 23. Gad AM, El Kholy RB. Generalized linear mixed models for longitudinal data. International Journal of Probability and Statistics. 2012;1(3):41–47.
- 24. Prates MO, Costa DR, Lachos VH. Generalized linear mixed models for correlated binary data with t-link. Statistics and Computing. 2014;24(6):1111–1123.
- 25. Santos CC, Loschi RH. EM-Type algorithms for heavy-tailed logistic mixed models. Journal of Statistical Computation and Simulation. 2017;87(15):2940–2961.
- 26. Azzalini A, Capitanio A. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(2):367–389.
- 27. Liu C, Rubin DB, Wu YN. Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika. 1998;85(4):755–770.
- 28. Branco MD, Dey DK. A general class of multivariate skew-elliptical distributions. Journal of Multivariate Analysis. 2001;79(1):99–113.
- 29. Hugo LDV, Cabral CRB. Scale Mixtures of Skew-Normal Distributions. In: Hugo LDV, Cabral CRB, Zeller CB, editors. Finite Mixture of Skewed Distributions. Switzerland: Springer International Publishing; 2018. p. 15–36.
- 30. Lachos VH, Ghosh P, Arellano-Valle RB. Likelihood based inference for skew-normal independent linear mixed models. Statistica Sinica. 2010;20:303–322.
- 31. Capitanio A. On the canonical form of scale mixtures of skew-normal distributions; 2012. Available from: https://arxiv.org/abs/1207.0797.
- 32. Kéri G. The Sherman-Morrison formula for the determinant and its application for optimizing quadratic functions on condition sets given by extreme generators. In: Giannessi F, Pardalos P, Rapcsák T, editors. Optimization Theory. Boston: Springer; 2001. p. 119–138.
- 33. Ahmed A, Reshi J, Mir K. Structural properties of size biased Gamma distribution. IOSR Journal of Mathematics. 2013;5:55–61.
- 34. Ho HJ, Lin TI, Chen HY, Wang WL. Some results on the truncated multivariate t distribution. Journal of Statistical Planning and Inference. 2012;142(1):25–40.
- 35. Azzalini A. The R package sn: The Skew-Normal and Related Distributions such as the Skew-t (version 1.5-2); 2018. Available from: http://azzalini.stat.unipd.it/SN.
- 36. R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/.
- 37. Galarza CE, Matos LA, Lachos VH. Moments of the doubly truncated selection elliptical distributions with emphasis on the unified multivariate skew-t distribution. arXiv preprint arXiv:2007.14980. 2020.
- 38. Galarza CE, Lin TI, Wang WL, Lachos VH. On moments of folded and truncated multivariate Student-t distributions based on recurrence relations. Metrika. 2021; p. 1–26.
- 39. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological). 1977;39(1):1–22.
- 40. Fernandez C, Steel MF. Multivariate Student-t regression models: Pitfalls and inference. Biometrika. 1999;86(1):153–167.
- 41. da Silva Braga A, Cordeiro GM, Ortega EM, Silva GO. The Odd Log-Logistic Student t Distribution: Theory and Applications. Journal of Agricultural, Biological and Environmental Statistics. 2017;22(4):615–639.
- 42. Lee D, Sinha S. Identifiability and bias reduction in the skew-probit model for a binary response. Journal of Statistical Computation and Simulation. 2019;89(9):1621–1648.
- 43. Arellano-Valle RB, Genton MG. Fundamental skew distributions. Journal of Multivariate Analysis. 2005;96:93–116.
- 44. Arellano-Valle R, Bolfarine H, Lachos V. Bayesian inference for skew-normal linear mixed models. Journal of Applied Statistics. 2007;34(6):663–682.
- 45. Arellano-Valle R, Bolfarine H, Lachos V. Skew-normal linear mixed models. Journal of Data Science. 2005;3(4):415–438.
- 46. Lin TI, Lee JC. Estimation and prediction in linear mixed models with skew-normal random effects for longitudinal data. Statistics in Medicine. 2008;27(9):1490–1507. pmid:17708515
- 47. Lachos VH, Dey DK, Cancho VG. Robust linear mixed models with skew-normal independent distributions from a Bayesian perspective. Journal of Statistical Planning and Inference. 2009;139(12):4098–4110.
- 48. Lachos VH, Labra FV, Ghosh P. Multivariate skew-normal/independent distributions: properties and inference. Pro Mathematica. 2014;28(56):11–53.
- 49. Pereira MAA, Russo CM. Nonlinear mixed-effects models with scale mixture of skew-normal distributions. Journal of Applied Statistics. 2019;46(9):1602–1620.
- 50. Liu C, Rubin DB. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika. 1994;81(4):633–648.
- 51. Lange KL, Little RJ, Taylor JM. Robust statistical modeling using the t distribution. Journal of the American Statistical Association. 1989;84(408):881–896.
- 52. Meilijson I. A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society: Series B (Methodological). 1989;51(1):127–138.
- 53. Meza C, Osorio F, De la Cruz R. Estimation in nonlinear mixed-effects models using heavy-tailed distributions. Statistics and Computing. 2012;22(1):121–139.
- 54. Cox C. Delta method. Encyclopedia of Biostatistics. 2005;2.
- 55. Fox J, Weisberg S. An R Companion to Applied Regression. 3rd ed. Thousand Oaks CA: Sage; 2019. Available from: https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
- 56. Meza C, Jaffrézic F, Foulley JL. Estimation in the probit normal model for binary outcomes using the SAEM algorithm. Computational Statistics & Data Analysis. 2009;53(4):1350–1360.
- 57. Yan J. geepack: Yet Another Package for Generalized Estimating Equations. R-News. 2002;2/3:12–14.
- 58. Hall DB. Zero-Inflated Poisson and Binomial Regression with random effects: A Case Study. Biometrics. 2000;56(4):1030–1039. pmid:11129458