
An approximate-copula distribution for statistical modeling

  • Sarah S. Ji ,

    Roles Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    smji@g.ucla.edu

    Affiliation Department of Biostatistics, University of California, Los Angeles, Los Angeles, California, United States of America

  • Benjamin B. Chu,

    Roles Data curation, Formal analysis, Software, Writing – original draft, Writing – review & editing

    Affiliation Department of Biomedical Data Science, Stanford University, Stanford, California, United States of America

  • Hua Zhou,

    Roles Conceptualization, Funding acquisition, Software, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Department of Biostatistics, University of California, Los Angeles, Los Angeles, California, United States of America, Department of Computational Medicine, University of California, Los Angeles, Los Angeles, California, United States of America

  • Kenneth Lange

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Department of Computational Medicine, University of California, Los Angeles, Los Angeles, California, United States of America, Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America, Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, California, United States of America

Abstract

Copulas, generalized estimating equations, and generalized linear mixed models promote the analysis of grouped data where non-normal responses are correlated. Unfortunately, parameter estimation remains challenging in these three frameworks. Based on prior work of Tonda, we derive a new class of probability density functions that allow explicit calculation of moments, marginal and conditional distributions, and the score and observed information needed in maximum likelihood estimation. We also illustrate how the new distribution flexibly models longitudinal data following a non-Gaussian distribution. Finally, we conduct a tri-variate genome-wide association analysis on dichotomized systolic and diastolic blood pressure and body mass index data from the UK-Biobank, showcasing the modeling potential and computational scalability of the new distributional family.

Author summary

Modeling correlated responses is computationally challenging beyond the Gaussian realm. For instance, how should repeated binary outcomes in longitudinal studies be modeled? When a dataset contains both continuous and discrete responses, how can their dependence be captured in a principled and efficient way? This paper introduces a new class of probability distributions that enables flexible modeling of correlated responses of mixed type. Inspired by statistical copulas, the proposed approach is designed to remain computationally efficient even in high-dimensional settings. We refer to this framework as an approximate copula model and show that it provides a promising alternative to classical methods such as generalized linear mixed models and generalized estimating equations. To demonstrate its flexibility and scalability, we apply the approximate-copula model to genome-wide association (GWAS) data involving a mixture of continuous, binary, and count responses.

1 Introduction

1.1 Motivation

The analysis of correlated data is stymied by the lack of flexible multivariate distributions with fixed margins. Once one ventures beyond the confines of multivariate Gaussian distributions, analysis choices are limited. [8] launched the highly influential method of generalized estimating equations (GEEs). This advance allows generalized linear models (GLMs) to accommodate the correlated traits encountered in panel and longitudinal data and effectively broke the stranglehold of Gaussian distributions in analysis. The competing method of statistical copulas introduced earlier by Sklar is motivated by the same consideration [14]. Finally, generalized linear mixed models (GLMMs) [2,21] attacked the same problem. GLMMs are effective tools for modeling overdispersion and capturing the correlations of multivariate discrete data.

However, none of these three modeling approaches is a panacea. GEEs lack a well-defined likelihood, and estimation searches can fail to converge. For copula models, likelihoods exist, but are unwieldy, particularly for discrete outcomes. Copula calculations scale extremely poorly in high dimensions. Computing with GLMMs is problematic since their densities have no closed form and require evaluation of multidimensional integrals. Gaussian quadratures scale exponentially in the dimension of the parameter space. Markov Chain Monte Carlo (MCMC) can be harnessed in Bayesian versions of GLMMs, but even MCMC can be costly. For these reasons alone, it is worth pursuing alternative modeling approaches.

This brings us to an obscure paper by the Japanese mathematical statistician Tonda. Working within the framework of Gaussian copulas [15] and generalized linear models, Tonda introduces a device for relaxing independence assumptions while preserving computable likelihoods [17]. He succeeds brilliantly except for the presence of an annoying constraint on the parameter space of the new distribution class. The fact that his construction perturbs marginal distributions is forgivable.

1.2 Our contributions

The current paper has several purposes. First, by adopting a slightly different working definition, we show how to extend Tonda’s construction to lift the awkward parameter constraint. Our new definition allows explicit calculation of (a) moments, (b) marginal and conditional distributions, and (c) the score and observed information of the loglikelihood and allows (d) generation of random deviates. Tonda tackles item (a), omits items (b) and (c), and mentions item (d) only in passing. For maximum likelihood estimation (MLE), he relies on a non-standard derivative-free algorithm [13] that scales poorly in high dimensions. We present two gradient-based algorithms for extracting high-dimensional MLEs.

The key computational advantage of this new definition is that the loglikelihood contains no determinants, matrix inverses, or multidimensional integrals, in contrast to other multivariate outcome models. These features resolve computational bottlenecks in parameter estimation. As a consequence, correlated but non-continuous responses can be efficiently analyzed. We advocate gradient-based estimation methods that avoid computationally intensive second derivatives. To provide asymptotic standard errors and confidence intervals, we capitalize on sandwich estimators. These rely on the observed information matrix computed after convergence. For completeness, we derive the exact Hessian of our approximate-copula loglikelihood.

Here we illustrate the flexibility of our model by closely studying two common scenarios: (1) longitudinal data analysis with non-Gaussian repeated measurements, and (2) multivariate analysis with mixtures of continuous, binary, and discrete outcomes. Scalable software for these techniques is either nonexistent or severely limited. Our simulation studies and real data examples highlight not only the virtues of the approximate-copula model but also its limitations. For reasons to be explained, we find that the model reflects reality best when the number of components of each independent sampling unit is low or the correlations between responses within a unit are small.

1.3 Paper organization

In subsequent sections, we begin by introducing the approximate-copula model and studying its statistical properties. Then we illustrate how the model is used in practice to analyze correlated non-Gaussian variables. Getting the correlation structure of the variables right is a key step in modeling both longitudinal and multi-trait data. Next, we discuss the details of parameter estimation that enable model fitting. Finally, we present analysis results on both simulated and real data. These results showcase the speed and flexibility of the approximate-copula model. In particular, our genome-wide association (GWAS) example demonstrates its scalability.

2 Materials and methods

2.1 Notation

For the record, here are some notational conventions used in the sequel. All vectors and matrices appear in boldface. The entries of the vector $\mathbf{0}$ consist of 0's, and the standard basis vector $\mathbf{e}_i$ has all entries 0 except a 1 in entry $i$. The superscript $\top$ indicates a vector or matrix transpose. The Euclidean norm of a vector $\boldsymbol{v}$ is denoted by $\|\boldsymbol{v}\|$, and the spectral norm of a matrix $\mathbf{M}$ is $\|\mathbf{M}\|$. For a smooth real-valued function $f(\boldsymbol{x})$, we write its gradient (column vector of partial derivatives) as $\nabla f(\boldsymbol{x})$, its first differential (row vector of partial derivatives) as $df(\boldsymbol{x}) = \nabla f(\boldsymbol{x})^\top$, and its second differential (Hessian matrix) as $d^2 f(\boldsymbol{x})$. If $\boldsymbol{g}(\boldsymbol{x})$ is vector-valued with $i$th component $g_i(\boldsymbol{x})$, then the differential (Jacobi matrix) $d\boldsymbol{g}(\boldsymbol{x})$ has $i$th row $dg_i(\boldsymbol{x})$. The transpose $dg_i(\boldsymbol{x})^\top$ is the gradient of $g_i(\boldsymbol{x})$. Differentials can be constructed from directional derivatives.

2.2 Definition of the approximate-copula model

Consider $d$ independent random variables $Y_1, \ldots, Y_d$ with densities $f_i(y_i)$ relative to measures $\alpha_i$, with means $\mu_i$, variances $\sigma_i^2$, third central moments $E(Y_i - \mu_i)^3$, and fourth central moments $E(Y_i - \mu_i)^4$. Let $\boldsymbol{\Gamma} = (\gamma_{ij})$ be a $d \times d$ positive semidefinite matrix, and $\alpha$ be the product measure $\alpha_1 \times \cdots \times \alpha_d$. Inspired by [17], we let $\mathbf{D}$ be the diagonal matrix with $i$th diagonal entry $\sigma_i$ and consider the nonnegative function

$$h(\boldsymbol{y}) = \Big(1 + \frac{1}{2}\boldsymbol{r}^\top \boldsymbol{\Gamma} \boldsymbol{r}\Big) \prod_{i=1}^d f_i(y_i), \qquad \boldsymbol{r} = \mathbf{D}^{-1}(\boldsymbol{y} - \boldsymbol{\mu}).$$

Its average value is

$$\int h \, d\alpha = 1 + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}.$$

It follows that the function

$$g(\boldsymbol{y}) = \Big(1 + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}\Big)^{-1}\Big(1 + \frac{1}{2}\boldsymbol{r}^\top \boldsymbol{\Gamma} \boldsymbol{r}\Big) \prod_{i=1}^d f_i(y_i)$$

(1)

is a probability density with respect to the measure $\alpha$. Detailed derivations of Tonda's approximation are found in Sect S1.1 in S1 File. The virtue of the density (1) is that it overcomes the independence restriction and steers the estimated covariance matrix toward the sample covariance matrix of the residuals. Inclusion of the positive semidefinite matrix $\boldsymbol{\Gamma}$ is designed to achieve this goal. For example, if $Y_1$ is Gaussian and $Y_2$ is Bernoulli, then the density (1) allows one to induce correlation between the two components (one continuous and the other discrete) of the bivariate random vector $(Y_1, Y_2)$. Note that $g(\boldsymbol{y})$ is technically not a copula because it only approximately preserves the marginal distributions $f_i(y_i)$. Later, we will see how $\boldsymbol{\Gamma}$ tends to inflate marginal variances and accommodate correlation. This can be a blessing rather than a curse if correlation is present or the true marginal distributions are over-dispersed compared to the assumed marginal distributions.
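To make the construction concrete, the density (1) is cheap to evaluate pointwise. The following Python sketch (our own illustration; the authors' software is written in Julia, and all function names here are ours) evaluates it for one continuous and one binary component:

```python
import math

def approx_copula_density(y, mu, sigma, base_pdfs, Gamma):
    """Evaluate the approximate-copula density (1):
    g(y) = (1 + r'Gamma r / 2) / (1 + tr(Gamma) / 2) * prod_i f_i(y_i),
    where r_i = (y_i - mu_i) / sigma_i is a standardized residual."""
    d = len(y)
    r = [(y[i] - mu[i]) / sigma[i] for i in range(d)]
    quad = sum(Gamma[i][j] * r[i] * r[j] for i in range(d) for j in range(d))
    trace = sum(Gamma[i][i] for i in range(d))
    base = math.prod(f(yi) for f, yi in zip(base_pdfs, y))
    return (1.0 + 0.5 * quad) / (1.0 + 0.5 * trace) * base

# One continuous (standard normal) and one binary (Bernoulli(0.4)) component
phi = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
bern = lambda y: 0.4 if y == 1 else 0.6
Gamma = [[0.2, 0.1], [0.1, 0.2]]               # positive semidefinite
dens = approx_copula_density([0.3, 1], [0.0, 0.4],
                             [1.0, math.sqrt(0.4 * 0.6)], [phi, bern], Gamma)
```

Setting $\boldsymbol{\Gamma} = \mathbf{0}$ recovers the product of the base densities, mirroring the independence limit discussed above.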

2.3 Moments

Let $\boldsymbol{Y}$ be a random vector distributed according to the density (1). To calculate the mean of $Y_j$, note that our independence assumption implies

$$E(Y_j) = \Big(1 + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}\Big)^{-1} E_f\Big[Y_j\Big(1 + \frac{1}{2}\boldsymbol{r}^\top \boldsymbol{\Gamma} \boldsymbol{r}\Big)\Big].$$

Hence, if $\mathrm{sk}_j = E(r_j^3)$ is the skewness of $Y_j$, then

$$E(Y_j) = \mu_j + \frac{\gamma_{jj}\sigma_j \, \mathrm{sk}_j}{2 + \operatorname{tr}\boldsymbol{\Gamma}}, \qquad \max_j |E(Y_j) - \mu_j| \le \frac{\|\boldsymbol{\Gamma}\|}{2 + \operatorname{tr}\boldsymbol{\Gamma}} \max_j \sigma_j |\mathrm{sk}_j|$$

for any matrix norm $\|\cdot\|$ dominating the diagonal entries of $\boldsymbol{\Gamma}$ (the spectral norm, for example). The mean $E(Y_j)$ is close to $\mu_j$ when the diagonal entries of $\boldsymbol{\Gamma}$ and, hence $\operatorname{tr}\boldsymbol{\Gamma}$ itself, are small.

To calculate the covariance matrix of $\boldsymbol{Y}$, note that

$$E(Y_j Y_k) = \Big(1 + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}\Big)^{-1} E_f\Big[Y_j Y_k\Big(1 + \frac{1}{2}\boldsymbol{r}^\top \boldsymbol{\Gamma} \boldsymbol{r}\Big)\Big].$$

The indicated expectations relative to $f$ reduce to products of univariate moments. When $\mathrm{sk}_j = 0$ and $\mathrm{kt}_j = E(r_j^4)$ is the kurtosis of $Y_j$,

$$E(Y_j^2) = \mu_j^2 + \sigma_j^2 + \frac{\gamma_{jj}\sigma_j^2(\mathrm{kt}_j - 1)}{2 + \operatorname{tr}\boldsymbol{\Gamma}}.$$

Because $E(Y_j) = \mu_j$ in this case, we find that

$$\operatorname{Var}(Y_j) = \sigma_j^2\Big[1 + \frac{\gamma_{jj}(\mathrm{kt}_j - 1)}{2 + \operatorname{tr}\boldsymbol{\Gamma}}\Big].$$

Because the kurtosis $\mathrm{kt}_j \ge 1$, the multiplier of $\gamma_{jj}$ is nonnegative, and the variance is inflated for $\operatorname{tr}\boldsymbol{\Gamma}$ small.

When $j \neq k$,

$$E(Y_j Y_k) = \mu_j \mu_k + \frac{2\gamma_{jk}\sigma_j \sigma_k}{2 + \operatorname{tr}\boldsymbol{\Gamma}}.$$

Hence, the covariance and correlation satisfy

$$\operatorname{Cov}(Y_j, Y_k) = \frac{2\gamma_{jk}\sigma_j \sigma_k}{2 + \operatorname{tr}\boldsymbol{\Gamma}}, \qquad \operatorname{Corr}(Y_j, Y_k) = \frac{2\gamma_{jk}}{\sqrt{\big(2 + \operatorname{tr}\boldsymbol{\Gamma} + \gamma_{jj}(\mathrm{kt}_j - 1)\big)\big(2 + \operatorname{tr}\boldsymbol{\Gamma} + \gamma_{kk}(\mathrm{kt}_k - 1)\big)}}.$$

As a check, the quantities $E(Y_j)$, $\operatorname{Var}(Y_j)$, and $\operatorname{Cov}(Y_j, Y_k)$ reduce to the correct values $\mu_j$, $\sigma_j^2$, and 0, respectively, when $\boldsymbol{\Gamma} = \mathbf{0}$.
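The zero-skewness moments can also be checked by exact enumeration. For a pair of symmetric Bernoulli(1/2) margins, the standardized residuals are $\pm 1$ (zero skewness, kurtosis 1) and the density has only four atoms, so the mean, variance, and covariance can be computed exactly and compared with the closed forms $E(Y_j) = \mu_j$, $\operatorname{Var}(Y_j) = \sigma_j^2$, and $\operatorname{Cov}(Y_j, Y_k) = \sigma_j\sigma_k\gamma_{jk}/(1 + \operatorname{tr}\boldsymbol{\Gamma}/2)$ that hold in this case (our own worked check; the $\boldsymbol{\Gamma}$ entries below are chosen to keep the matrix positive semidefinite):

```python
# Exact moment check for d = 2 symmetric Bernoulli(1/2) margins.
g11, g22, g12 = 0.3, 0.5, 0.2          # Gamma entries (PSD for these values)
c = 1 + 0.5 * (g11 + g22)              # normalizing constant 1 + tr(Gamma)/2

atoms = {}
for y1 in (0, 1):
    for y2 in (0, 1):
        r1, r2 = (y1 - 0.5) / 0.5, (y2 - 0.5) / 0.5      # +/- 1 residuals
        quad = g11 * r1**2 + 2 * g12 * r1 * r2 + g22 * r2**2
        atoms[(y1, y2)] = 0.25 * (1 + 0.5 * quad) / c     # base mass 1/4 each

mean1 = sum(y1 * p for (y1, _), p in atoms.items())
var1 = sum((y1 - mean1) ** 2 * p for (y1, _), p in atoms.items())
cov = sum((y1 - mean1) * (y2 - 0.5) * p for (y1, y2), p in atoms.items())
```

The enumeration reproduces the undistorted mean 1/2 and variance 1/4 together with the induced covariance $\sigma_1\sigma_2\gamma_{12}/(1 + \operatorname{tr}\boldsymbol{\Gamma}/2)$.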

2.4 Marginal and conditional distributions

Let $S$ be a subset of $\{1, \ldots, d\}$ with complement $T$. To simplify notation, suppose $S = \{1, \ldots, s\}$. Now write

$$\boldsymbol{r}^\top \boldsymbol{\Gamma} \boldsymbol{r} = \boldsymbol{r}_S^\top \boldsymbol{\Gamma}_{SS} \boldsymbol{r}_S + 2\,\boldsymbol{r}_T^\top \boldsymbol{\Gamma}_{TS} \boldsymbol{r}_S + \boldsymbol{r}_T^\top \boldsymbol{\Gamma}_{TT} \boldsymbol{r}_T,$$

where $\boldsymbol{r} = \mathbf{D}^{-1}(\boldsymbol{y} - \boldsymbol{\mu})$ is the vector of standardized residuals. The marginal density of $\boldsymbol{Y}_S$ is

$$g_S(\boldsymbol{y}_S) = \Big(1 + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}\Big)^{-1}\Big(1 + \frac{1}{2}\boldsymbol{r}_S^\top \boldsymbol{\Gamma}_{SS} \boldsymbol{r}_S + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}_{TT}\Big)\prod_{i \in S} f_i(y_i).$$

To derive the conditional density of $\boldsymbol{Y}_T$ given $\boldsymbol{Y}_S = \boldsymbol{y}_S$, we divide the joint density by the marginal density of $\boldsymbol{Y}_S$. This action produces the conditional density

$$g(\boldsymbol{y}_T \mid \boldsymbol{y}_S) = w_S^{-1}\Big(1 + \frac{1}{2}\boldsymbol{r}^\top \boldsymbol{\Gamma} \boldsymbol{r}\Big)\prod_{i \in T} f_i(y_i)$$

with normalizing constant $w_S = 1 + \frac{1}{2}\boldsymbol{r}_S^\top \boldsymbol{\Gamma}_{SS} \boldsymbol{r}_S + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}_{TT}$. From this density, our well-rehearsed arguments lead to the conditional mean

$$E(Y_j \mid \boldsymbol{y}_S) = \mu_j + \frac{\sigma_j}{w_S}\Big[(\boldsymbol{\Gamma}_{TS}\boldsymbol{r}_S)_j + \frac{1}{2}\gamma_{jj}\,\mathrm{sk}_j\Big]$$

for $j \in T$. The corresponding conditional variance is

$$\operatorname{Var}(Y_j \mid \boldsymbol{y}_S) = \sigma_j^2\Big\{1 + \frac{1}{w_S}\Big[\mathrm{sk}_j(\boldsymbol{\Gamma}_{TS}\boldsymbol{r}_S)_j + \frac{1}{2}\gamma_{jj}(\mathrm{kt}_j - 1)\Big] - \frac{1}{w_S^2}\Big[(\boldsymbol{\Gamma}_{TS}\boldsymbol{r}_S)_j + \frac{1}{2}\gamma_{jj}\mathrm{sk}_j\Big]^2\Big\},$$

and the corresponding conditional covariances are

$$\operatorname{Cov}(Y_j, Y_k \mid \boldsymbol{y}_S) = \sigma_j\sigma_k\Big\{\frac{\gamma_{jk}}{w_S} - \frac{1}{w_S^2}\Big[(\boldsymbol{\Gamma}_{TS}\boldsymbol{r}_S)_j + \frac{1}{2}\gamma_{jj}\mathrm{sk}_j\Big]\Big[(\boldsymbol{\Gamma}_{TS}\boldsymbol{r}_S)_k + \frac{1}{2}\gamma_{kk}\mathrm{sk}_k\Big]\Big\}$$

for $j \in T$, $k \in T$, and $j \neq k$. It is noteworthy that to order $\|\boldsymbol{\Gamma}\|$, the conditional and marginal means agree, and the conditional and marginal covariances agree.

2.5 Generation of random deviates

To generate a random vector $\boldsymbol{Y}$ from the density (1), we first sample $Y_1$ from its marginal density

$$g_1(y_1) = \Big(1 + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}\Big)^{-1}\Big(1 + \frac{1}{2}\gamma_{11}r_1^2 + \frac{1}{2}\sum_{i > 1}\gamma_{ii}\Big)f_1(y_1)$$

and then sample the subsequent components $Y_2, \ldots, Y_d$ from their conditional distributions. If we denote the set $\{1, \ldots, k-1\}$ by $S$, then the conditional density of $Y_k$ given the previous components is

$$g(y_k \mid \boldsymbol{y}_S) = w_S^{-1}\Big(1 + \frac{1}{2}\boldsymbol{r}_{S'}^\top \boldsymbol{\Gamma}_{S'S'} \boldsymbol{r}_{S'} + \frac{1}{2}\sum_{i > k}\gamma_{ii}\Big)f_k(y_k),$$

where $S' = S \cup \{k\}$ and $w_S = 1 + \frac{1}{2}\boldsymbol{r}_S^\top \boldsymbol{\Gamma}_{SS} \boldsymbol{r}_S + \frac{1}{2}\sum_{i \ge k}\gamma_{ii}$.

When the densities $f_i$ are discrete, each stage of sampling is straightforward. Consider any random variable Z with nonnegative integer values, discrete density $p_j = \Pr(Z = j)$, and mean ν. The inverse method of random sampling reduces to a sequence of comparisons. We partition the interval $[0,1]$ into subintervals with the $i$th subinterval of length $p_i$. To sample Z, we draw a uniform random deviate U from $[0,1]$ and return the deviate j determined by the condition that U falls in the jth subinterval. The process is most efficient when the largest $p_i$ occur first. This suggests that we let k denote the least integer greater than or equal to ν and rearrange the probabilities in the order $p_k, p_{k+1}, p_{k-1}, p_{k+2}, p_{k-2}, \ldots$ This tactic is apt to put most of the probability mass first and render sampling efficient.
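The reordered inverse method can be sketched as follows. The function name and the alternating search order outward from the mean are our own conventions, and a Poisson pmf stands in for one discrete base density:

```python
import math
import random

def sample_discrete(pmf, mean, max_support=1000, rng=random):
    """Inverse-method sampling for a nonnegative-integer density.
    Probabilities are examined starting near the mean (where most mass
    usually sits), alternating outward: k, k+1, k-1, k+2, k-2, ...
    pmf(j) must return P(Z = j)."""
    k = int(mean)                          # start the search near the mean
    order = [k]
    for step in range(1, max_support):
        if k + step < max_support:
            order.append(k + step)
        if k - step >= 0:
            order.append(k - step)
    u = rng.random()
    cum = 0.0
    for j in order:
        cum += pmf(j)
        if u <= cum:
            return j
    return order[-1]                       # guard against rounding residue

# Poisson(3) pmf as a stand-in for one sampling stage
pois = lambda j: math.exp(-3.0) * 3.0 ** j / math.factorial(j)
draws = [sample_discrete(pois, 3.0) for _ in range(5000)]
```

Any examination order yields a valid sample; the reordering only shortens the expected number of comparisons.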

When the densities are continuous, each stage of sampling is probably best performed by inverse transform sampling. This requires calculating distribution functions and forming their inverses, either analytically or by Newton’s method. The required distribution functions assume the form

The integrals are available as special functions for Gaussian, beta, and gamma densities . For instance, if is the standard normal density and is the standard normal distribution, then

To avoid overburdening the text with classical mathematics, we omit further details. Additional derivations can be found in Sect S1.2 in S1 File.
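For instance, a quantile of the standard normal distribution can be recovered by applying Newton's method to the distribution function. A minimal sketch (our own, using only the error function from the standard library; the function names are ours):

```python
import math

def invert_cdf(cdf, pdf, u, x0=0.0, tol=1e-10, max_iter=100):
    """Solve cdf(x) = u by Newton's method: x <- x - (cdf(x) - u) / pdf(x)."""
    x = x0
    for _ in range(max_iter):
        step = (cdf(x) - u) / pdf(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Standard normal distribution function Phi and density phi
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
phi = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
z = invert_cdf(Phi, phi, 0.975)        # upper 2.5% quantile
```

Feeding uniform deviates through such an inverse yields draws from the corresponding continuous stage.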

2.6 Model for longitudinal data

As a concrete example, we now illustrate how to use the approximate-copula model in the longitudinal setting, with n independent subjects. The response vector for subject i consists of trait values with p covariates (features) per trait. The component $y_{ij}$ represents the measured trait of subject i at time j. The data matrix $\mathbf{X}_i$ may include both time-varying features (for example, medication use) and time-invariant features (for example, gender). Linear mixed models [5,18] are a sensible modeling choice when the trait values are continuous. When measurements are discrete, the approximate-copula model often runs orders of magnitude faster than generalized linear mixed models and leads to more plausible loglikelihoods. Sects 3.1 and 3.4 cover both simulated and real longitudinal data.

In the approximate-copula model of longitudinal data, the random vector $\boldsymbol{y}_i$ follows the approximate-copula density appearing in Eq (1). Marginal means of the base distribution are linked to covariates through

$$\boldsymbol{\mu}_i = g^{-1}(\mathbf{X}_i \boldsymbol{\beta})$$

(2)

where the inverse link function $g^{-1}$ is applied component-wise. Although it is not strictly necessary, we assume that each of the measurements follows the same base distribution. This simplifying assumption holds, for example, with repeated blood-pressure measurements.

Because each subject i can be measured at a different number of time points, the choice of positive semidefinite matrices $\boldsymbol{\Gamma}_i$ is constrained. We explore three structured options:

$$\boldsymbol{\Gamma}_i = \sum_{k=1}^m \theta_k \mathbf{V}_{ik}, \qquad (\boldsymbol{\Gamma}_i)_{jk} = \sigma^2 \rho^{|j-k|}, \qquad (\boldsymbol{\Gamma}_i)_{jk} = \sigma^2\big[\rho + (1-\rho)1_{\{j=k\}}\big].$$

These traditional choices define the variance component (VC) model decomposing $\boldsymbol{\Gamma}_i$ into a linear combination of m known positive semidefinite matrices $\mathbf{V}_{ik}$ against unknown nonnegative variance components $\theta_k$, the auto-regressive (AR1) model, and the compound symmetric (CS) model. The $\theta_k$ must be estimated in the VC model, while $\sigma^2$ and ρ must be estimated in the AR1 and CS models. Sect 2.9.2 describes how estimation is performed. The parameters supplement the regression coefficients $\boldsymbol{\beta}$ of the base distributions.
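Assuming the standard AR(1) and compound-symmetry parameterizations (the exact scaling conventions here are our reading, and the function names are ours), the three structures can be generated as follows:

```python
def ar1(d, sigma2, rho):
    """AR(1) structure: sigma^2 * rho^{|j-k|}."""
    return [[sigma2 * rho ** abs(j - k) for k in range(d)] for j in range(d)]

def cs(d, sigma2, rho):
    """Compound symmetry: sigma^2 on the diagonal, sigma^2 * rho elsewhere."""
    return [[sigma2 * (1.0 if j == k else rho) for k in range(d)]
            for j in range(d)]

def vc(Vs, thetas):
    """Variance components: sum_k theta_k * V_k for PSD V_k, theta_k >= 0."""
    d = len(Vs[0])
    return [[sum(t * V[j][k] for t, V in zip(thetas, Vs)) for k in range(d)]
            for j in range(d)]

# Random-intercept special case of the VC model: a single all-ones matrix
ones = [[1.0] * 3 for _ in range(3)]
G_vc = vc([ones], [0.5])
```

The all-ones special case of the VC model is the one compared against a random intercept GLMM later in the paper.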

2.7 Model for unstructured multivariate data

As another example, we will apply the approximate-copula model for analyzing multiple responses. The multivariate model involves n independent samples exhibiting d responses and p covariates. Each component of the approximate-copula response is allowed to have a different base distribution and a different inverse link . In contrast to the longitudinal model, we postulate a matrix of regression coefficients that capture the unique impacts of the p covariates on the d responses. Means are linked to covariates via . If we define

then we can apply the previous notation

(3)

This notational change allows us to estimate a parameter vector , with the understanding that has length p for longitudinal data and length pd for multivariate data.

Because the positive semidefinite matrices are constant across samples, one can estimate a single unstructured matrix

where is the lower-triangular Cholesky decomposition of . In the interests of parsimony, one could replace by a low-rank matrix, but we will not pursue this suggestion further. In summary, the multivariate approximate-copula model is parametrized by mean-effect parameters in addition to the nonzero parameters of the Cholesky decomposition . The requirement for all i is the only constraint on the parameter space.
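The Cholesky parameterization can be sketched directly. The packing order of the parameters and the function names below are our own conventions:

```python
def unpack_lower(l, d):
    """Unpack d(d+1)/2 parameters into a lower-triangular matrix L,
    filled row by row (our packing convention)."""
    L = [[0.0] * d for _ in range(d)]
    idx = 0
    for i in range(d):
        for j in range(i + 1):
            L[i][j] = l[idx]
            idx += 1
    return L

def gamma_from_cholesky(l, d):
    """Gamma = L L^T is positive semidefinite for any real parameter
    vector l, which turns the PSD constraint on Gamma into an
    unconstrained estimation problem."""
    L = unpack_lower(l, d)
    return [[sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]

Gamma2 = gamma_from_cholesky([1.0, 0.5, 2.0], 2)   # L = [[1, 0], [0.5, 2]]
```

Optimizing over the raw Cholesky entries is what allows the gradient-based algorithms of Sect 2.9 to proceed without projection steps.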

2.8 Genome-wide association studies

The ability to analyze correlated non-Gaussian responses is particularly pertinent to genome-wide association studies (GWAS). Most current methods for multivariate GWAS are based on Gaussian approximations [3,23]. The normality assumption makes it challenging to examine multiple traits where some are dichotomous or integer-valued. Here we illustrate how to apply approximate-copula models to GWAS data involving mixtures of continuous, binary, and discrete outcomes. In addition, our approximate-copula framework easily extends to the analysis of non-Gaussian longitudinal traits, such as repeated binary measurements. These traits are common in Electronic Health Record data. For the sake of brevity, we do not pursue this lead here.

In multivariate GWAS, n subjects are measured on q non-genetic covariates (for example sex, age, and diet) and p single-nucleotide polymorphisms (SNPs). Together the covariates influence a small set of d traits (phenotypes). Due to the high-dimensionality of the problem, typically each SNP is examined separately. If $\boldsymbol{\beta}_g$ represents the mean effect of a SNP on each of d phenotypes, then the null hypothesis

$$H_0: \boldsymbol{\beta}_g = \mathbf{0}$$

(4)

is pertinent. To test the hypothesis (4), a likelihood ratio test (LRT) is certainly possible. This involves maximizing the loglikelihood under the null hypothesis and comparing it to the likelihood under the alternative hypothesis where $\boldsymbol{\beta}_g$ varies. If $L_0$ and $L_1$ denote the respective maximum loglikelihoods, then the likelihood ratio statistic asymptotically follows the distribution

$$2(L_1 - L_0) \sim \chi_d^2.$$

(5)

SNPs can be selected by applying the Bonferroni correction to the resulting p-values. For GWAS with human subjects, the stringent cutoff p-value $5 \times 10^{-8}$ is standard.

Although this strategy is conceptually simple, it requires fitting an alternative model for every SNP. To alleviate the computational burden, we calculate the LRT on only the most promising SNPs. One can screen for the top SNPs by examining the gradient of the loglikelihood with respect to $\boldsymbol{\beta}_g$ at the maximum likelihood estimates of the null model. The norm of this gradient quantifies the signal strength of the SNP under consideration. We will derive the gradient later. Computing the gradient under the null is much faster than fitting a full likelihood model under the alternative. We show in Sect S1.6.1 in S1 File that the ordering of SNPs by gradient norm is strongly correlated with their ordering by the likelihood ratio statistic. Algorithm 1 summarizes our fast multivariate GWAS procedure. Table N in S1 File displays a QQ plot for Algorithm 1, showing that the resulting p-values are valid.

Algorithm 1. Fast likelihood-ratio tests for multivariate GWAS.
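The screen-then-test idea can be sketched for a single Gaussian trait, a deliberate simplification of the multivariate procedure (the function names, intercept-only null model, and simulated data are our own choices):

```python
import math
import random

def screen_then_test(y, snps, top_k):
    """Two-stage GWAS sketch. Stage 1 ranks SNPs by the magnitude of the
    score |g' residual| under the null (intercept-only) model; stage 2
    runs an exact likelihood-ratio test only for the top_k SNPs."""
    n = len(y)
    ybar = sum(y) / n
    resid = [yi - ybar for yi in y]                   # null-model residuals
    scores = [abs(sum(g[i] * resid[i] for i in range(n))) for g in snps]
    top = sorted(range(len(snps)), key=lambda j: -scores[j])[:top_k]
    null_rss = sum(r * r for r in resid)
    pvalues = {}
    for j in top:
        g = snps[j]
        gbar = sum(g) / n
        gc = [gi - gbar for gi in g]                  # centered genotypes
        denom = sum(gi * gi for gi in gc)
        beta = (sum(gc[i] * resid[i] for i in range(n)) / denom) if denom else 0.0
        alt_rss = sum((resid[i] - beta * gc[i]) ** 2 for i in range(n))
        lrt = n * math.log(null_rss / alt_rss)        # 2(logL_alt - logL_null)
        pvalues[j] = math.erfc(math.sqrt(lrt / 2.0))  # chi-square(1) tail
    return pvalues

random.seed(1)
n, m = 200, 50
snps = [[float(random.randint(0, 2)) for _ in range(n)] for _ in range(m)]
y = [0.8 * snps[0][i] + random.gauss(0.0, 1.0) for i in range(n)]  # SNP 0 causal
pvals = screen_then_test(y, snps, top_k=5)
```

Because the null model is fit only once, the per-SNP cost of stage 1 is a single inner product, which is what makes the screen scale to hundreds of thousands of SNPs.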

2.9 Parameter estimation

Throughout this section, the vector denotes the mean effects tied to the base distributions through Eqs (2) and (3). The vector summarizes the dependency parameters determining the positive semidefinite matrices . For example, in the longitudinal AR1 model, , and in the multivariate model, , where captures the lower triangle of the Cholesky decomposition .

2.9.1 Mean components.

Consider n independent realizations $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$ from the approximate-copula density (1). Each of these may be of a different dimension $d_i$ and possess a different mean vector $\boldsymbol{\mu}_i$, positive semidefinite matrix $\boldsymbol{\Gamma}_i$, and component densities $f_{ij}$. If $\boldsymbol{r}_i = \mathbf{D}_i^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_i)$ denotes the vector of standardized residuals for sampling unit i, then the loglikelihood of the sample is

$$L(\boldsymbol{\beta}) = \sum_{i=1}^n \Big[\sum_j \log f_{ij}(y_{ij}) + \log\Big(1 + \frac{1}{2}\boldsymbol{r}_i^\top \boldsymbol{\Gamma}_i \boldsymbol{r}_i\Big) - \log\Big(1 + \frac{1}{2}\operatorname{tr}\boldsymbol{\Gamma}_i\Big)\Big].$$

The score (gradient of the loglikelihood) with respect to $\boldsymbol{\beta}$ is clearly

$$\nabla L(\boldsymbol{\beta}) = \sum_{i=1}^n \Big[\sum_j \nabla \log f_{ij}(y_{ij}) + \frac{(d\boldsymbol{r}_i)^\top \boldsymbol{\Gamma}_i \boldsymbol{r}_i}{1 + \frac{1}{2}\boldsymbol{r}_i^\top \boldsymbol{\Gamma}_i \boldsymbol{r}_i}\Big],$$

where $d\boldsymbol{r}_i$ is the differential (Jacobi matrix) of the vector $\boldsymbol{r}_i$. An easy calculation shows that $d\boldsymbol{r}_i$ has entries

$$\frac{\partial r_{ij}}{\partial \beta_k} = -\frac{1}{\sigma_{ij}}\frac{\partial \mu_{ij}}{\partial \beta_k} - \frac{y_{ij} - \mu_{ij}}{\sigma_{ij}^2}\frac{\partial \sigma_{ij}}{\partial \beta_k}.$$

The Hessian (second differential) of the loglikelihood with respect to is

where

Calculation of the score and Hessian requires computing the derivatives of the means, standard deviations, and residuals by repeated application of the chain rule. A full example appears in Sect S1.3.6 in S1 File.

In searching the likelihood surface, it is often beneficial to approximate the observed information by a positive semidefinite matrix. This suggests replacing the observed information matrix by the expected information matrix under the base model and dropping indefinite matrices in the exact Hessian. These steps give the approximate Hessian

which is clearly negative semidefinite. These formulas provide the ingredients for implementing Newton’s method or a quasi-Newton method for updating .
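A closely related strategy in miniature: Fisher scoring for a Poisson log-linear base model replaces the observed information with the expected information $\mathbf{X}^\top \mathbf{W} \mathbf{X}$ ($\mathbf{W} = \operatorname{diag}(\boldsymbol{\mu})$), which is positive semidefinite, and step-halving safeguards the ascent. A self-contained sketch under our own toy data (this illustrates the approximate-Hessian idea, not the authors' implementation):

```python
import math

def poisson_loglik(beta, y, X):
    """Poisson log-linear loglikelihood (dropping the y! constant)."""
    ll = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))
        ll += yi * eta - math.exp(eta)
    return ll

def solve2(A, b):
    """Solve a 2 x 2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def fisher_scoring_poisson(y, X, iters=50):
    """Scoring ascent with step-halving: each step solves a system with
    the PSD expected information, so the search direction is always an
    ascent direction."""
    beta = [0.0, 0.0]
    n = len(y)
    for _ in range(iters):
        mu = [math.exp(sum(b * x for b, x in zip(beta, xi))) for xi in X]
        grad = [sum(X[i][j] * (y[i] - mu[i]) for i in range(n)) for j in range(2)]
        if max(abs(g) for g in grad) < 1e-10:
            break
        info = [[sum(X[i][j] * mu[i] * X[i][k] for i in range(n))
                 for k in range(2)] for j in range(2)]
        step = solve2(info, grad)
        t, ll0 = 1.0, poisson_loglik(beta, y, X)
        while t > 1e-8:                    # halve until the loglik improves
            trial = [b + t * s for b, s in zip(beta, step)]
            if poisson_loglik(trial, y, X) > ll0:
                beta = trial
                break
            t /= 2.0
    return beta

X = [[1.0, float(i)] for i in range(5)]
y = [1.0, 2.0, 4.0, 7.0, 12.0]
beta_hat = fisher_scoring_poisson(y, X)
```

At convergence the score vanishes, the defining property of a maximum likelihood estimate.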

2.9.2 Derivatives pertinent to VC models.

Maximization of the loglikelihood depends on the derivatives of positive semidefinite matrices through the functions and . Here we present details of the longitudinal VC model and relegate others to Sect S1.3.3 in S1 File. Recall that the longitudinal VC model involves the decomposition

of $\boldsymbol{\Gamma}_i$ into a linear combination of m known positive semidefinite matrices against unknown nonnegative variance components. Assuming there are no shared mean and variance-component parameters,

in obvious notation. The part of the loglikelihood relevant to estimation of can be expressed as

Consequently, the score and Hessian with respect to are

2.9.3 Derivatives of an unstructured Gamma matrix.

If we parametrize using the Cholesky factor as suggested in Sect 2.7, then the loglikelihood function can be written as

The directional derivatives

follow from the standard rules of differentiation [7,11]. For coding purposes it is easier to invoke reverse-mode automatic differentiation tools rather than implementing the Hessian in closed form [12]. Indeed, if we write

then the loglikelihood can be viewed as a vector-input scalar-output function . At some sacrifice of computational speed, automatic differentiation will evaluate and at a current parameter vector . These derivatives enable implementation of Newton’s method or a quasi-Newton method for fitting the multivariate approximate-copula model.

2.9.4 Nuisance parameter estimation.

Many distributions are parametrized by additional nuisance parameters that also require estimation. In general, the strategy for estimating these is similar to the strategy for estimating the mean or variance parameters. For brevity, we illustrate this procedure concretely for the Gaussian and negative binomial distributions in S1 File Sects S1.3.4 and S1.4.

2.9.5 Initialization.

Most optimization algorithms benefit from good starting values. The obvious candidate for the mean parameters is the maximum likelihood estimate delivered by the base model. For structured models, we use an MM algorithm [6] to initialize variance components. This process is described in Sect S1.3.2 in S1 File. Under the CS and AR1 models, we initialize the variance component by the crude estimate from the MM algorithm. For unstructured models, we initialize by the Cholesky decomposition of the sample correlation matrix of the residuals.

3 Results

3.1 Simulation studies under the longitudinal model

To assess estimation accuracy of the approximate-copula model, we first present simulation studies for the Poisson and negative binomial base distributions with log link function, under the VC parameterization of . Additional simulation studies with different base distributions under the AR(1), CS and VC parameterizations of are included in Sect S1.5 in S1 File.

In each simulation scenario, the non-intercept entries of the predictor matrix are independent standard normal deviates. True regression coefficients . For the negative binomial base, all dispersion parameters are . Each simulation scenario was run on 100 replicates for each sample size and number of observations per independent sampling unit.

Under the VC parameterization of $\boldsymbol{\Gamma}_i$, the choice of a single all-ones component matrix allows us to compare to the random intercept GLMM fit using MixedModels.jl. When the random effect term is a scalar, MixedModels.jl uses Gaussian quadrature for parameter estimation. We compare estimates and run-times to the random intercept GLMM fit of MixedModels.jl with 25 Gaussian quadrature points. We conduct simulation studies under two scenarios (simulations I and II). In simulation I, it is assumed that the data are generated by the approximate-copula model, and in simulation II, it is assumed that the true distribution is the random intercept GLMM.

Simulation I: In this scenario, we simulate datasets under the approximate-copula model as outlined in Sect 5 and compare MLE fits under the approximate-copula model and GLMM. The top panel of Fig 1 helps us assess estimation accuracy and how well the GLMM density approximates the approximate-copula density.

Fig 1. Simulation study under the longitudinal model.

Top panel features MSE for and under Simulation I setting with Poisson base (left) and Negative binomial base (right). The bottom panel features MSE of and under Simulation II setting with Poisson base (left) and Negative binomial base (right). Here cluster size refers to , the number of observations per sample. AC abbreviates approximate-copula and GLMM abbreviates generalized linear mixed model.

https://doi.org/10.1371/journal.pcbi.1013922.g001

As anticipated, the mean squared errors (MSEs) for all base distributions decrease as sample size increases. For data simulated under the approximate-copula model, approximate-copula MSEs are generally lower than GLMM MSEs. GLMM estimated variance components are often zero and stay relatively constant across sample sizes. This reflects the fact that the two models differ in how they handle random effects, particularly with larger sampling units.

Simulation II: In the second simulation, we generate datasets under the random intercept Poisson GLMM and compare MLE fits delivered by the two models. The bottom panel of Fig 1 now sheds light on how well the approximate-copula density approximates the GLMM density under different magnitudes of the variance components.

As expected, MSEs under GLMM analysis are now generally lower than those under approximate-copula analysis. For the Poisson and negative binomial base distributions, the bottom panel of Fig 1 indicates biases in the approximate-copula estimates for larger sampling units, similar to the bias observed in the Poisson GLMM example in the top panel.

3.2 Run times

Run times under simulations I and II are comparable. Table 1 presents average run times and their standard errors in seconds for replicates under simulation II. All computer runs were performed on a standard 2.3 GHz Intel i9 CPU with 8 cores. Run times for the approximate-copula model are reported with multi-threading across 8 cores. We note that the current version of MixedModels.jl does not support multi-threading across multiple cores.

Table 1. Run times and (standard error of run times) in seconds based on replicates under simulation II with Poisson and negative binomial (NB) Base, sampling unit size and sample size n. AC abbreviates approximate-copula and GLMM abbreviates generalized linear mixed model.

https://doi.org/10.1371/journal.pcbi.1013922.t001

Because the approximate-copula loglikelihoods contain no determinants or matrix inverses, our software experiences less pronounced increases in computation time as sample and sampling-unit sizes grow compared to the GLMM implemented in MixedModels.jl. Run times for the approximate-copula model are faster than those of MixedModels.jl for discrete outcomes (Tables 1 and E in S1 File) and slower for Gaussian distributed outcomes (Table F in S1 File). This general trend also holds on a per-core basis. This discrepancy is hardly surprising since MixedModels.jl takes into account the low-rank structure of the random intercept linear mixed model (LMM). This tactic reduces the computational complexity per sample. More detailed comparisons appear in Sect S1.5.1 in S1 File.

Finally, we study the negative binomial base distribution for the longitudinal model in more depth. We compare our negative binomial fits with those delivered by three popular R packages for GLMM estimation in Sect S1.9 in S1 File. Within Julia, MixedModels.jl explicitly warns the user against fitting GLMMs with an unknown dispersion parameter r. Our software updates r iteratively by Newton's method, holding the other parameters fixed; see Sect S1.3.4 in S1 File for more estimation details.

3.3 GWAS simulations

Simulations under the multivariate approximate-copula model demonstrate the potential of Algorithm 1 in GWAS. Specifically, we simulated subjects, covariates, independent SNPs, and correlated responses. The simulated covariates have entries drawn from . A column of 1’s is appended to to accommodate an intercept. The true effect sizes are randomly sampled from a distribution. The number of minor alleles for each SNP follows a distribution with . For simplicity, the true is simulated under the AR1 model

Under this setup we explore 3 simulation scenarios:

Simulation III: Here we assume each response is an independent sample from the approximate-copula model. Each of the d components of is randomly chosen from the Gaussian, Bernoulli, and Poisson base distributions. The variance of the Gaussian base is set at .

Simulation IV: Here the responses are also generated from the approximate-copula model, but now all d components have a Bernoulli base. This scenario is appropriate given multiple correlated case-control responses.

Simulation V: Here the responses follow a multivariate Gaussian distribution. This choice allows us to assess whether approximate-copula fitting correctly collapses to the underlying base model.

To assess power, we randomly choose causal SNPs. If denotes the effect of a causal SNP on the d phenotypes, then we impose the constraint , where s varies between 0 and 1. Thus, a causal SNP influences each of the d responses, but the magnitudes of its effect are both correlated and random.

Over 100 replicates, we compare the power of Algorithm 1 against a penalized regression (IHT) algorithm [3] and the multivariate linear mixed model GEMMA [23]. As shown in Fig 2, we achieve better power than both IHT and GEMMA when the data generative process follows the approximate-copula model (simulations III and IV). When the responses are purely Gaussian (simulation V), approximate-copula fitting offers comparable power to IHT. Finally, Sect S1.6.2 in S1 File demonstrates that the approximate-copula model produces valid p-values.

Fig 2. Power simulation for the proposed multivariate GWAS routine in Algorithm 1.

Here AC denotes approximate-copula, IHT denotes iterative hard thresholding, a penalized sparse regression method [3], and GEMMA implements a multivariate linear mixed model [23]. The colored bands represent standard deviations.

https://doi.org/10.1371/journal.pcbi.1013922.g002

3.4 Longitudinal analysis of the NHANES data

For many repeated measurement problems, a simple random intercept model is sufficient to account for correlations between different responses on the same subject. To illustrate this point and the performance of the approximate-copula model, we now turn to a bivariate example from the NHANES I Epidemiologic Followup Study (NHEFS) dataset [4]. In this example, we group the data by subject ID and jointly model the number of cigarettes smoked per day in 1971 and the number of cigarettes smoked per day in 1982 as a bivariate outcome. For fixed effects, we include an intercept and control for sex, age in 1971, and the average price of tobacco in the state of residence. The average price of tobacco is a time-dependent covariate that is adjusted for inflation using the 2008 U.S. consumer price index (CPI). Participants with missing responses or predictors were excluded from the model cohort. A total of NHANES I participants constitute the cohort.

Table 2 compares loglikelihoods and run times under the longitudinal regression model with Poisson, negative binomial, and Bernoulli base distributions, as computed by our approximate-copula software, the GLMM package MixedModels.jl (a more efficient version of the lme4 package [1]), and the GEE package EstimatingEquationsRegression.jl. For the Bernoulli base distribution, we transformed each count outcome to a binary indicator with value 1 if the number of cigarettes smoked per day is greater than the sample average and value 0 otherwise. The maximum loglikelihood of the approximate-copula model is lower than that of GLMM for the Poisson base and higher than that of GLMM for the negative binomial and Bernoulli bases. The approximate-copula model generally runs faster than GLMM but slightly slower than GEE.
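The binarization rule just described is simple to state in code; a minimal sketch (the function name is ours):

```python
def dichotomize_by_mean(counts):
    """Map each count to 1 if it exceeds the sample average, else 0,
    mirroring the binarization used for the Bernoulli base."""
    avg = sum(counts) / len(counts)
    return [1 if c > avg else 0 for c in counts]

# the average of [0, 5, 20, 40] is 16.25, so only the last two counts map to 1
assert dichotomize_by_mean([0, 5, 20, 40]) == [0, 0, 1, 1]
```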

Table 2. The loglikelihoods and run times for the longitudinal NHANES data under the approximate-copula (AC) model, GLMM, and GEE, using three different base distributions. Best values appear in bold-faced type. Note the loglikelihoods for GEE are NA since GEE is not likelihood-based. All sampling units are of size 2.

https://doi.org/10.1371/journal.pcbi.1013922.t002

Table 3 presents detailed parameter estimates for the negative binomial base. Because overdispersion is a feature of this dataset, the Poisson base distribution represents a case of model misspecification, as documented in Table G in S1 File; the negative binomial base distribution is a better choice for analysis. Under the Poisson base, the approximate-copula model inflates the variance component to account for the overdispersion. Under the negative binomial base distribution, both our model and MixedModels.jl estimate the variance component to be essentially zero, suggesting that no additional overdispersion exists in the data. The mean-effect estimates under the approximate-copula model with Poisson base are closer to the more realistic estimates under the negative binomial base than those of GLMM. The sex and age variables in the approximate-copula model have larger estimated standard errors than in GLMM or GEE. As a consequence, the predictor age is no longer statistically significant. Given its small estimated effect size, this interpretation is arguably preferable.
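A quick screen for the overdispersion driving this choice of base distribution is the ratio of sample variance to sample mean, which equals 1 in expectation for Poisson data. A minimal sketch (the function name is ours):

```python
def overdispersion_ratio(counts):
    """Ratio of sample variance to sample mean; values well above 1
    suggest a Poisson base is misspecified and a negative binomial
    base may fit the counts better."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

# heavy-tailed counts give a ratio far above 1
assert overdispersion_ratio([0, 0, 1, 2, 30]) > 10
```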

Table 3. Comparisons of parameter estimates on the NHANES data under the negative binomial approximate-copula (AC) model, GLMM, and GEE. All sampling units are of size 2. Here r is the nuisance parameter from negative binomial regression and θ is the variance component.

https://doi.org/10.1371/journal.pcbi.1013922.t003

3.5 Multivariate GWAS on UK biobank data

We also conducted a 3-trait analysis of hypertension-related phenotypes from the second release of the UK-Biobank [16]. The underlying traits, average systolic blood pressure (SBP), average diastolic blood pressure (DBP), and body mass index (BMI), are correlated, heritable, and the subject of previous association studies [3]. Although the traits are continuous, we dichotomize both SBP and DBP to illustrate a multivariate analysis with correlated non-continuous responses. Following the clinical definition of stage 2 hypertension [19], we set SBP to 1 if a patient’s average SBP is at least 140 mm Hg and DBP to 1 if a patient’s average DBP is at least 90 mm Hg. Otherwise, these traits are set to 0.
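The dichotomization can be sketched as follows (the function name is ours; the default cutoffs are the stage-2 thresholds of the cited guideline [19]):

```python
def stage2_indicators(sbp, dbp, sbp_cut=140.0, dbp_cut=90.0):
    """Dichotomize average blood pressures at the stage-2 hypertension
    cutoffs of the 2017 ACC/AHA guideline (SBP >= 140 mm Hg or
    DBP >= 90 mm Hg).  Returns the pair of 0/1 trait values."""
    return int(sbp >= sbp_cut), int(dbp >= dbp_cut)

# elevated systolic pressure only -> SBP trait is 1, DBP trait is 0
assert stage2_indicators(151.5, 84.0) == (1, 0)
```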

After quality control (see Sect S1.8 in S1 File), the data comprise autosomal SNPs measured on the subset of subjects without missing phenotypes. We split these data into 483 contiguous blocks, each containing roughly 1000 contiguous SNPs, and ran Algorithm (2.8) in parallel across the blocks. Each job finished in less than a day. Altogether, 617 SNPs passed our threshold for likelihood ratio testing. Fig 3 depicts our results in a GWAS Manhattan plot [20]. Sect S1.10 in S1 File compares our multivariate GWAS results to three separate univariate GWASes combined via a Cauchy combination test [9].
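For reference, the Cauchy combination test used in the univariate comparison has a simple closed form [9]; a minimal sketch:

```python
import math

def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test of Liu and Xie
    (2020): T = sum_i w_i * tan((0.5 - p_i) * pi), with the combined
    p-value approximated by 0.5 - atan(T) / pi.  Equal weights summing
    to 1 are used by default."""
    k = len(pvalues)
    weights = weights or [1.0 / k] * k
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(t) / math.pi

# equal p-values of 0.5 combine to exactly 0.5
assert abs(cauchy_combination([0.5, 0.5]) - 0.5) < 1e-12
```

The heavy Cauchy tail makes the statistic robust to arbitrary dependence among the per-trait tests, which is why it suits correlated GWAS p-values.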

Fig 3. A 3-trait multivariate GWAS on BMI, dichotomized SBP, and dichotomized DBP.

The black horizontal dotted line indicates the genome-wide significance threshold. The most significant SNP within a 1Mb window is labeled and colored purple. All other significant SNPs are colored blue and unlabeled. The legend on the right shows chromosome density.

https://doi.org/10.1371/journal.pcbi.1013922.g003

After pruning secondary but significant SNPs within 1Mb windows, we uncover roughly 24 association hotspots. The strongest signals come from previously known associations with BMI, such as rs1421085 on chromosome 16, rs10871777 on chromosome 18, and rs13107325 on chromosome 4. These SNPs are known to be associated with BMI independently of SBP and DBP [3]. Because dichotomizing a trait loses information, we expect most discoveries to be associated with BMI. However, we were also able to discover SNPs previously known to be associated with SBP and DBP independently of BMI, such as rs17367504 on chromosome 1 and rs2681492 and rs653178 on chromosome 12. Interestingly, five SNPs (rs4500930, rs7721099, rs2293579, rs11191548, and rs34783010) that were missed in our previous analysis are known to be associated with BMI [10]. It is possible that our previous analysis discovered nearby proxies instead. While our current analysis is hardly definitive, it demonstrates that multi-trait GWAS is indeed possible while taking proper account of inter-trait correlations.
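The 1Mb pruning step can be sketched as a greedy pass over the significant SNPs. The positions and names below are made up for illustration; this is not the paper's pruning code.

```python
def prune_hotspots(snps, window=1_000_000):
    """Greedy pruning: visit significant SNPs in order of increasing
    p-value and keep a SNP only if no already-kept SNP on the same
    chromosome lies within `window` base pairs of it.  Each SNP is a
    tuple (chromosome, position, pvalue, name)."""
    kept = []
    for chrom, pos, p, name in sorted(snps, key=lambda s: s[2]):
        if all(c != chrom or abs(pos - q) > window for c, q, _, _ in kept):
            kept.append((chrom, pos, p, name))
    return kept

# hypothetical input: two nearby chr16 hits collapse to the stronger one
snps = [(16, 53_800_000, 1e-30, "lead"),
        (16, 53_810_000, 1e-12, "proxy"),
        (1, 11_860_000, 1e-10, "other")]
assert [s[3] for s in prune_hotspots(snps)] == ["lead", "other"]
```

Each kept SNP then represents one association hotspot, with weaker neighbors treated as proxies of the lead signal.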

4 Discussion

We propose a new model for analyzing multivariate data based on Tonda’s Gaussian copula approximation. Our approximate-copula model enables the analysis of correlated responses and handles random effects needed in applications such as panel and repeated measures data. The approximate-copula model trades Tonda’s awkward parameter space constraint for a simple normalizing constant. This allows one to engage in full likelihood analysis under a tractable probability density function with no implicit integrations or matrix inverses. The approximate-copula model is relatively easy to fit and friendly to likelihood ratio hypothesis testing.

For maximum likelihood estimation, we recommend a combination of two numerical methods. The first is a block ascent algorithm that alternates between updating the mean parameters by a version of Newton’s method and updating the variance parameters by a minorization-maximization (MM) algorithm. The second method jointly updates the mean parameters and the variance components by a standard quasi-Newton algorithm. The MM algorithm converges quickly to a neighborhood of the MLE but then slows. In contrast, the quasi-Newton algorithm struggles at first and then converges quickly. Thus, we start with the block ascent algorithm and then switch to the quasi-Newton algorithm. Both algorithms and their combination are available in our Julia package.
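The switching strategy can be illustrated on a toy one-dimensional concave loglikelihood. This sketch is ours: damped ascent steps stand in for the stable block-ascent/MM phase, and Newton steps stand in for the fast quasi-Newton phase of the actual Julia implementation.

```python
def hybrid_maximize(loglik, grad, hess, x0, switch_tol=1e-2, tol=1e-10, max_iter=500):
    """Maximize a 1-D concave loglikelihood: take cheap, stable damped
    ascent steps until the per-iteration improvement drops below
    `switch_tol`, then switch to Newton steps for fast local convergence."""
    x, f = x0, loglik(x0)
    use_newton = False
    for _ in range(max_iter):
        if use_newton:
            x_new = x - grad(x) / hess(x)  # Newton step (fast near the MLE)
        else:
            x_new = x + 0.1 * grad(x)      # damped ascent step (stable far away)
        f_new = loglik(x_new)
        if abs(f_new - f) < tol:
            return x_new
        if abs(f_new - f) < switch_tol:
            use_newton = True              # progress has stalled: switch methods
        x, f = x_new, f_new
    return x

# toy concave loglikelihood with maximum at x = 3
x_hat = hybrid_maximize(lambda x: -(x - 3.0) ** 2,
                        lambda x: -2.0 * (x - 3.0),
                        lambda x: -2.0, 0.0)
assert abs(x_hat - 3.0) < 1e-6
```

The improvement-based trigger mirrors the paper's rationale: first-order ascent makes large early gains but crawls near the optimum, exactly where (quasi-)Newton steps excel.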

The approximate-copula model shines when GLMM and GEE are difficult to apply or yield unsatisfactory results. We illustrate this fact for longitudinal binary traits and multiple correlated traits of mixed types. To its credit, the approximate-copula model is likelihood-based, preserves base distributions approximately, and enables hypothesis testing of mean effects and correlations. However, our numerical tests suggest caution. In particular, performance may degrade when correlations between responses are strong or when cluster sizes are large. In this setting the normalizing constant grows and tends to shrink the variance component estimates, making them unreliable. In defense of the model, our limited evidence suggests that estimates of mean effects are hardly affected. We suspect that likelihood ratio tests of correlation are also reliable. Given the multivariate nature of the model, it also enhances statistical precision and power in the analysis of mean effects when correlations exist among the traits. In our opinion, variance components are nuisance parameters compared to mean effects in important examples such as GWAS.

In any case, we recommend univariate analysis by standard statistical tools as a prelude to multivariate analysis under the approximate-copula model. Understanding complicated data is best served by approaching it from several angles. Despite its simplifications, the approximate-copula model’s balance of speed, interpretability, and flexibility makes it a valuable complement to existing methods. We hope that other statisticians will agree, work to improve its performance, and apply it to their own data.

Web resources

Our software is freely available to the scientific community through the OpenMendel [22] platform.

Project home page: https://github.com/OpenMendel/ApproxCopula.jl

Supported operating systems: Mac OS, Linux, Windows

Programming language: Julia 1.6+

License: MIT

All commands needed to reproduce our results are available at https://github.com/sarah-ji/ApproxCopula-reproducibility

Supporting information

S1 File. Additional mathematical derivations, simulations, and real data analysis results.

https://doi.org/10.1371/journal.pcbi.1013922.s001

(PDF)

Acknowledgments

We thank Tetsuji Tonda for his pioneering contributions and Seyoon Ko for help in parallelizing our Julia code.

References

  1. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Soft. 2015;67(1).
  2. Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993;88(421):9–25.
  3. Chu BB, Ko S, Zhou JJ, Jensen A, Zhou H, Sinsheimer JS, et al. Multivariate genome-wide association analysis by iterative hard thresholding. Bioinformatics. 2023;39(4):btad193. pmid:37067496
  4. Cohen BB. Plan and operation of the NHANES I epidemiologic followup study, 1982–84. US Department of Health and Human Services, Public Health Service, National; 1987.
  5. Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. John Wiley & Sons; 2012.
  6. Lange K. MM optimization algorithms. SIAM; 2016.
  7. Lange K. A tutorial on Hadamard semidifferentials. Foundations and Trends in Optimization. 2024;6(1):1–62.
  8. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
  9. Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc. 2020;115(529):393–402. pmid:33012899
  10. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017;45(D1):D896–901. pmid:27899670
  11. Magnus JR, Neudecker H. Matrix differential calculus with applications in statistics and econometrics. John Wiley & Sons; 2019.
  12. Moses W, Churavy V. Instead of rewriting foreign code for machine learning, automatically synthesize fast gradients. In: Advances in Neural Information Processing Systems; 2020. p. 12472–85.
  13. Ohtaki M. Globally convergent algorithm without derivatives for maximizing a multivariate function. In: Proceedings of Development of Statistical Theories and their Application for Complex Nonlinear Data; 1999.
  14. Sklar M. Fonctions de répartition à n dimensions et leurs marges. Publ Inst Statist Univ Paris. 1959;8:229–31.
  15. Song PX-K, Li M, Yuan Y. Joint regression analysis of correlated data using Gaussian copulas. Biometrics. 2009;65(1):60–8.
  16. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779. pmid:25826379
  17. Tonda T. A class of multivariate discrete distributions based on an approximate density in GLMM. Hiroshima Math J. 2005;35(2).
  18. Verbeke G, Molenberghs G. Linear mixed models for longitudinal data. Springer; 1997.
  19. Whelton PK, Carey RM, Aronow WS, Casey DE Jr, Collins KJ, Dennison Himmelfarb C, et al. 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA Guideline for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. J Am Coll Cardiol. 2018;71(19):e127–248. pmid:29146535
  20. Yin L, Zhang H, Tang Z, Xu J, Yin D, Zhang Z, et al. rMVP: a memory-efficient, visualization-enhanced, and parallel-accelerated tool for genome-wide association study. Genomics Proteomics Bioinformatics. 2021;19(4):619–28. pmid:33662620
  21. Zeger SL, Karim MR. Generalized linear models with random effects; a Gibbs sampling approach. J Am Stat Assoc. 1991;86(413):79–86.
  22. Zhou H, Sinsheimer JS, Bates DM, Chu BB, German CA, Ji SS, et al. OPENMENDEL: a cooperative programming project for statistical genetics. Hum Genet. 2020;139(1):61–71. pmid:30915546
  23. Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods. 2014;11(4):407–9. pmid:24531419