## Abstract

Existing models for microbiome sequencing data, which are typically summarized as operational taxonomic units (OTUs), can only test predictors’ effects on OTUs. There is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationships into models for longitudinal OTU measures. We propose a novel approach that estimates OTU correlations from their taxonomic structure and applies this correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors’ effects and OTU correlations. We develop a two-part Microbiome Taxonomic Longitudinal Correlation (MTLC) model for multivariate zero-inflated OTU outcomes based on the GEE framework. In addition, longitudinal and other types of repeated OTU measures are integrated in the MTLC model. Extensive simulations have been conducted to evaluate the performance of the MTLC method. Compared with existing methods, the MTLC method shows robust and consistent estimation, and improved statistical power for testing predictors’ effects. Lastly, we demonstrate the proposed method by applying it to a real human microbiome study of obesity in twins.

## Author summary

Human microbiome sequencing data analysis has been a fast growing area of genomic research in recent years. Although there have been several works for detecting predictors on a single operational taxonomic unit (OTU) or multiple OTUs simultaneously, there is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationship into models to evaluate longitudinal OTU measures. Here we propose a novel approach to estimate OTU correlations based on their taxonomic structure after integrating longitudinal and other types of repeated OTU measures, and apply such correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors’ effects and OTU correlations. The method is theoretically sound and practically easy to implement, and we provide corroborating evidence from simulation and a real human microbiome study.

**Citation:** Chen B, Xu W (2020) Generalized estimating equation modeling on correlated microbiome sequencing data with longitudinal measures. PLoS Comput Biol 16(9): e1008108. https://doi.org/10.1371/journal.pcbi.1008108

**Editor:** Benjamin Althouse, Institute for Disease Modeling, UNITED STATES

**Received: **March 27, 2020; **Accepted: **June 30, 2020; **Published: ** September 8, 2020

**Copyright: ** © 2020 Chen, Xu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **All relevant data are within the manuscript and its Supporting Information files.

**Funding: **WX was funded by Natural Sciences and Engineering Research Council of Canada (NSERC Grant RGPIN-2017-06672), Princess Margaret Cancer Foundation Award. BC is a post-doctoral fellowship trainee and supported by Princess Margaret Cancer Foundation for AI and Microbiome Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

This is a *PLOS Computational Biology* Methods paper.

## Introduction

Human microbiome sequencing data analysis has been a fast-growing area of genomic research in recent years. Several studies showed that the microbial composition is associated with environmental and host factors [1–3]. The microbiome data are usually characterized by 16S ribosomal ribonucleic acid (rRNA) gene sequencing or shotgun metagenomics sequencing [4, 5]. Both sequencing technologies provide reads of bacteria counts clustered into operational taxonomic units (OTUs), where each OTU is typically mapped to a taxon at level species, genus, family, order, class, phylum, kingdom or domain in a taxonomic structure.

For each sample, OTU counts can be converted to relative abundances (RAs). Whether the OTU data are expressed as counts or as RAs, several analytical challenges prevent the application of standard regression methods to association studies between microbial composition and environmental or genetic factors. First, OTU data usually contain excessive zeros, which prevents modelling them with standard distributions. Next, for each individual, there may exist repeated measures of OTUs, such as microbiome samples collected from different locations of the human body, or multiple observations at different time points in a longitudinal setting. Furthermore, the sequencing method usually detects hundreds or thousands of OTUs, which are potentially correlated with each other [6]. Identifying correlations between taxa is a common goal in genomic surveys [7]. An accurately estimated correlation can be used to determine drivers in environmental ecology or contributions to habitat niches or disease; it is also a powerful tool for hypothesis generation, such as determining which interactions might be biologically relevant in a system and should be studied further [8]. So instead of treating each OTU as independent, it is desirable to incorporate the taxonomic information into the analysis, which reflects the correlation structure between the OTUs.

Several solutions have been proposed to answer each of these challenges. Zero-inflated microbiome data can be fitted by either zero-inflated models or two-part models [9, 10]. Repeated measures can be characterized by random effects in mixed effects models [11–15]. Modelling multiple OTUs together remains a challenging problem, although several attempts have been made. La Rosa et al. [16] and Chen et al. [17] proposed an approach which assumes that multiple OTUs follow a Dirichlet multinomial (DM) distribution. However, the DM assumption imposes a negative correlation among OTUs, whereas the true correlation can be either positive or negative. In addition, it has a fixed covariance structure which cannot flexibly handle various dispersion patterns. Tang et al. [18] proposed a zero-inflated generalized Dirichlet multinomial distribution which allows for a more general covariance structure and excessive zeros in OTU counts. To further eliminate the negative correlation assumption, they also proposed distribution-free non-parametric tests [19, 20], which are robust to any correlation structure within a cluster of taxa. However, parameter estimates of covariate effects and correlation coefficients are not available because of the non-parametric nature of these tests. Alternatively, Shi et al. [21] proposed a model for Paired-Multinomial Data which works for a pair of repeated measures or a pair of correlated OTUs. Zhang et al. [22] considered estimating pairwise correlations between OTUs. Xu et al. [23] used latent variables to account for the correlation of multiple OTUs. Zhan et al. and Koh et al. [24, 25] adopted a correlated sequence kernel association test assuming a random effect for each OTU, and Grantham et al. [26] used Bayesian factor analysis to cluster correlated OTUs into different factors. However, none of these approaches can model the taxonomic relationship between OTUs and provide estimates for a complex correlation structure.

To estimate and test the association between the predictors and OTUs while simultaneously estimating the correlation parameters between OTUs, we propose a generalized estimating equation (GEE) [27] approach which can handle multiple correlated OTUs with repeated measures. Applying GEE models to either microbiome data [28, 29] or repeated measures such as longitudinal zero-inflated data [30–32] is not new. The novel part of our method is the construction of correlation structures which represent both the taxonomic correlations and the time dependency of longitudinal OTU measures. First, we develop a correlation structure of multiple OTUs that depends solely on their taxonomic structure, so that the correlation structure can provide meaningful estimates of OTU correlations. Unlike the multinomial models, which assume negative correlations, the correlations of OTUs in the proposed model can be either positive or negative. In addition, we combine the taxonomic structure with correlations due to repeated measures, and all correlations of repeated measures can be explicitly estimated.

This paper is organized as follows. The Methodology section introduces the detailed framework, including the zero-inflated GEE models, the construction of the correlation structure on multiple OTUs with repeated measures, and parameter estimation and hypothesis testing under the Microbiome Taxonomic Longitudinal Correlation (MTLC) model. The Simulation section presents extensive simulation studies comparing the performance of the proposed approach with other models. In the Application section, the proposed model is applied to a real microbiome sequencing study. The Discussion section presents conclusions and possible improvements of our method.

## Methodology

### Taxonomic structure of OTUs

#### Numerical representation of taxonomic structure.

For known taxonomic structure of *N* OTUs, we consider its numeric representation, i.e., representing the structure by a list of numerical vectors. Throughout this paper, we call taxonomic levels from species to domain from lowest to highest. First, we find the taxonomic level at which all observed *N* OTUs belong to the same taxon but not at one level lower, and define such level as level 1. For example, if all OTUs belong to the same class but not the same order, then the level class would be level 1. Similarly, we can identify the taxonomic level at which each OTU represents a different taxon but not at one level higher, and define such level as level *I*. For example, if each OTU belongs to a different genus but not a different family, then the level genus would be level *I*. Fig 1 illustrates an example with *I* = 4 (class, order, family, and genus), where class is level 1 and genus is level 4.

For *i* = 1, …, *I*, let *M*_{i} be the number of taxa at taxonomic level *i*. By definition, *M*_{1} = 1 and *M*_{I} = *N*. For *m*_{i} = 1, …, *M*_{i}, we index each taxon at level *i* by *m*_{i} and record the number of OTUs belonging to that taxon. These counts are computed level by level from the known taxonomy.

It is easy to check that, for each *i* = 1, …, *I*, the counts sum to *N* across the *M*_{i} taxa at level *i*.

Let *n*_{i} be the vector of the *M*_{i} counts at level *i*. Then the taxonomic structure can be numerically represented by (*n*_{1}, …, *n*_{I}).

In the illustrative taxonomic structure example from Fig 1, we observe 6 correlated OTUs with *I* = 4. Then *M*_{1} = 1, *M*_{2} = 2, *M*_{3} = 3, *M*_{4} = 6, and the numerical representation of Fig 1 is *n*_{1} = 6, *n*_{2} = (3, 3), *n*_{3} = (2, 1, 3), *n*_{4} = (1, 1, 1, 1, 1, 1).
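The numerical representation can be illustrated with a short sketch. The function name `numeric_representation` and the taxon labels below are hypothetical placeholders, chosen only so that the lineages reproduce the counts quoted for Fig 1; this is not the authors' implementation.

```python
def numeric_representation(lineages):
    """Count OTUs per taxon at every level.

    lineages: one tuple per OTU, ordered from level 1 (highest) to level I (lowest).
    Returns a list [n_1, ..., n_I], where n_i lists the OTU counts of the taxa at level i
    in order of first appearance.
    """
    n_levels = len(lineages[0])
    representation = []
    for level in range(n_levels):
        counts = {}
        for lineage in lineages:
            # Use the full path down to this level so that equally named taxa
            # under different parents are kept apart.
            key = lineage[: level + 1]
            counts[key] = counts.get(key, 0) + 1
        representation.append(list(counts.values()))
    return representation


# Hypothetical lineages (class, order, family, genus) for 6 OTUs.
example_lineages = [
    ("class_A", "order_1", "family_a", "genus_1"),
    ("class_A", "order_1", "family_a", "genus_2"),
    ("class_A", "order_1", "family_b", "genus_3"),
    ("class_A", "order_2", "family_c", "genus_4"),
    ("class_A", "order_2", "family_c", "genus_5"),
    ("class_A", "order_2", "family_c", "genus_6"),
]
result = numeric_representation(example_lineages)
print(result)
```

With these placeholder lineages the counts agree with *n*_{1} = 6, *n*_{2} = (3, 3), *n*_{3} = (2, 1, 3) and *n*_{4} = (1, 1, 1, 1, 1, 1).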

#### Correlation matrix of taxonomic structure.

Following the taxonomic structure, it is natural to assume that OTUs belonging to the same taxon at a higher level may be correlated. Because all OTUs belong to the same taxon at the highest taxonomic level (e.g., the Bacteria domain), they are all correlated in principle. For *N* OTUs, there are up to *N*(*N* − 1)/2 pairwise correlations. When *N* is large, it would be infeasible to model *N*(*N* − 1)/2 correlation parameters, so we reduce the number of parameters by making reasonable assumptions under which many of the correlations are equal, according to the known taxonomic structure. Our basic assumption is that, for a cluster of OTUs, if each OTU represents a different taxon at level *i* + 1 but they all belong to the same taxon at level *i*, then all pairwise correlations of OTUs within this cluster are equal. Under this assumption, there is only one correlation parameter in the simple case when *I* = 2. When *I* > 2, there are more than two levels in the OTU taxonomic structure, in which case the pairwise correlation coefficients for different pairs of OTUs may be equal or unequal, depending on the taxa to which the OTUs belong at each level. For a pair of OTUs, if they belong to different taxa at level *i* + 1 but to the same taxon at level *i*, we call the taxon at level *i* their first common taxon. A natural extension of our basic assumption is that any two pairs of OTUs have the same correlation if and only if the first common taxa of both pairs are identical. Formally, consider two pairs of OTUs with correlations *ρ** and *ρ*^{†}, and let *T** and *T*^{†} denote the first common taxa of the first and second pair, respectively. Then we assume

$$\rho^{*} = \rho^{\dagger} \iff T^{*} = T^{\dagger}.$$

For all *N* OTUs, we define a taxonomic structure matrix to indicate which correlations are equal and which are not. The taxonomic structure matrix is an *N* × *N* symmetric matrix in which all diagonal entries share one common symbol and the off-diagonal entries are indexed by uppercase Roman numerals, i.e., I, II, III, … (see Fig 1). Each distinct index value represents a different correlation, and equal index values indicate that the corresponding correlations are estimated by the same coefficient. We use Roman numerals to avoid confusion with the Arabic numerals used elsewhere in this work, because these indices are categorical labels that do not indicate any quantity. The values of the off-diagonal entries are determined by the following steps:

1. For *i* = 1, …, *I* − 1, let **Γ**_{i} be an *N* × *N* block diagonal matrix with one block for each taxon *m*_{i} = 1, …, *M*_{i} at level *i*. Each block is a square matrix whose dimension is the number of OTUs belonging to that taxon, with one value on its diagonal entries and another on its off-diagonal entries; both values are specific to the level and the taxon. *M*_{0} has default value 0.
2. When *i* = 1, let **Γ**^{(1)} = **Γ**_{1} be the interim correlation matrix.
3. When *i* = 2, …, *I* − 1, replace the block diagonal entries of **Γ**^{(i−1)} by the corresponding blocks of **Γ**_{i} and keep all other entries the same. The interim correlation matrix after the replacement at level *i* is defined as **Γ**^{(i)}.
4. Sort all off-diagonal entries in **Γ**^{(I−1)} from largest to smallest, where the smallest value corresponds to the smallest order (order 1). Replace all off-diagonal entries by their corresponding orders in uppercase Roman numerals and define the new matrix as **Γ**. **Γ** is the taxonomic structure matrix which is numerically represented by (*n*_{1}, …, *n*_{I}).

In the above example of 6 hypothetical OTUs in Fig 1,

Applying step 2 and 3 to achieve
Applying step 4 and the final taxonomic structure matrix **Γ** is

In the taxonomic structure matrix **Γ**, the index values are illustrated in Fig 1: index I indicates the correlation of OTUs belonging to the same class but different orders; index II indicates the correlation of OTUs belonging to the same order but different families; indices III and IV indicate correlations of OTUs belonging to the same family.
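The four construction steps can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the interim label of taxon *m*_{i} at level *i* is assumed to be *M*_{0} + … + *M*_{i−1} + *m*_{i} (the exact block values are not recoverable from the text), integers 1, 2, 3, … stand in for the Roman numerals I, II, III, …, and 0 is used as a placeholder for the diagonal symbol. The function name `taxonomic_structure_matrix` is hypothetical.

```python
import numpy as np


def taxonomic_structure_matrix(levels):
    """Build the index matrix Gamma from (n_1, ..., n_I).

    levels: list of count vectors, one per taxonomic level, with taxa listed in
    nested, contiguous order (as in the numerical representation).
    """
    n_otus = sum(levels[0])
    gamma = np.zeros((n_otus, n_otus), dtype=int)
    offset = 0  # assumed label offset: M_0 + ... + M_{i-1}, with M_0 = 0
    for counts in levels[:-1]:  # levels 1, ..., I-1
        start = 0
        for m, size in enumerate(counts, start=1):
            # Overwrite the block of this taxon with its level-specific label.
            gamma[start:start + size, start:start + size] = offset + m
            start += size
        offset += len(counts)
    # Step 4: rank the distinct off-diagonal labels, smallest label -> order 1.
    off_diagonal = ~np.eye(n_otus, dtype=bool)
    labels = np.unique(gamma[off_diagonal])
    ranks = np.searchsorted(labels, gamma) + 1
    ranks[~off_diagonal] = 0  # placeholder for the diagonal symbol
    return ranks


gamma = taxonomic_structure_matrix([[6], [3, 3], [2, 1, 3], [1, 1, 1, 1, 1, 1]])
print(gamma)
```

Under these assumptions the 6-OTU example yields four distinct indices: 1 for pairs in different orders, 2 for pairs in the same order but different families, and 3 and 4 for pairs within the two families that contain more than one OTU.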

### Modelling correlations from repeated measures

#### Correlations of longitudinal data.

Repeated measures of single OTU from the same individual may be another source of correlation, e.g., OTU observation at multiple time points within the same person. Fig 2 shows repeated measures of multiple OTUs at *l* time points.

There are several ways to characterize the correlations between each pair of time points, such as exchangeable, Toeplitz and unstructured. The exchangeable structure assumes all correlations are equal. The Toeplitz structure assumes that time points with equal temporal distance have equal correlation. The unstructured model assumes each pair has a different correlation, and it is the most complicated structure in terms of correlation parameter estimation. Other correlation structures, such as autoregressive and moving average structures, are also used for longitudinal data analysis [33, 34]. In this paper, we assume the correlation structure within the same individual is pre-specified. The correlation structure matrix within the same individual following a given correlation structure is denoted by **Ω**_{T}. As in **Γ**, the diagonal entries share one common symbol, and the off-diagonal entries are indexed by lowercase Roman numerals, i.e., i, ii, etc. For example, if the longitudinal OTU observations consist of 3 time points, then **Ω**_{T} under the exchangeable structure has the single index i in all off-diagonal entries.
Alternatively, **Ω**_{T} under the Toeplitz structure has index i for pairs of adjacent time points and index ii for the pair of time points that are two steps apart.
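As a small illustration of these two structures, the sketch below builds index matrices for a given number of time points. The integers 1, 2, … stand in for the lowercase Roman numerals i, ii, …, and 0 is a placeholder for the diagonal symbol; both conventions and the function names are assumptions made for this example.

```python
import numpy as np


def exchangeable_index(n_times):
    """All pairs of distinct time points share index 1."""
    index = np.ones((n_times, n_times), dtype=int)
    np.fill_diagonal(index, 0)
    return index


def toeplitz_index(n_times):
    """Pairs with the same temporal distance share an index (the lag)."""
    t = np.arange(n_times)
    return np.abs(t[:, None] - t[None, :])


omega_exchangeable = exchangeable_index(3)
omega_toeplitz = toeplitz_index(3)
print(omega_exchangeable)
print(omega_toeplitz)
```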

#### Sample correlation.

In addition to time correlation, there may exist other types of sample correlations, such as two or more individuals from the same pedigree, or simply any repeated measures from the same individual. Without loss of generality we assume there are two repeated samples *S*_{1} and *S*_{2}. The sample correlation is then represented by a 2 × 2 correlation structure matrix **Ω**_{S} with a single off-diagonal index.

#### Combining longitudinal and sample correlation.

Let **Ω** be the correlation structure combining both longitudinal and sample correlation. **Ω** = **Ω**_{T} or **Ω**_{S} when only time point correlation or only sample correlation exists. When both correlations exist, we consider all combinations of time points and repeated samples in one large correlation structure **Ω**. For example, if there are two repeated samples at each of the 3 time points, then for each OTU there are 6 observations per individual in total, and **Ω** becomes a 6 × 6 correlation structure matrix.

### Incorporating taxonomic structure with repeated measures

Suppose **Ω** has dimension *L*. For *a* = 1, …, *N* and *b* = 1, …, *N*, **Ω**(Γ_{ab}) is an *L* × *L* correlation matrix as a function of Γ_{ab}, such that each of its entries is a correlation coefficient indexed jointly by Γ_{ab} and the corresponding entry Ω_{‥} of **Ω**.
Here Γ_{‥} and Ω_{‥} are entries of **Γ** and **Ω** from the corresponding rows and columns. We denote **Ω**(Γ_{ab}) as **Ω**^{ab} for notational simplicity.

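To integrate the repeated measures correlation structure **Ω** with the taxonomic structure **Γ**, we introduce the integrative correlation matrix

$$\boldsymbol{R} = \begin{pmatrix} \boldsymbol{\Omega}^{11} & \cdots & \boldsymbol{\Omega}^{1N} \\ \vdots & \ddots & \vdots \\ \boldsymbol{\Omega}^{N1} & \cdots & \boldsymbol{\Omega}^{NN} \end{pmatrix},$$

where **Ω**^{ab} is defined above. ***R*** is a *J* × *J* matrix where *J* = *N* × *L*, and each of its entries has the form *ρ*_{Γ‥, Ω‥}. The first subscript, Γ_{‥}, is either the diagonal symbol or an uppercase Roman numeral indexing the taxonomic structure correlation; the second subscript, Ω_{‥}, is either the diagonal symbol or a lowercase Roman numeral indexing the correlation from repeated measures of a single OTU. In the above example with *N* = 6 OTUs and *L* = 6 repeated observations per OTU, *J* = 36. The diagonal entries of ***R*** are always equal to 1, and the off-diagonal entries are estimated in the next section.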
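The block construction of the integrative matrix can be sketched as a pairing of the two index matrices. In this hedged example the function name `integrative_index` is hypothetical, integers replace the Roman numerals, and 0 is a placeholder for the diagonal symbol. Each entry of the result carries a pair (taxonomic index, repeated-measure index), and entries with the same pair would share one correlation coefficient.

```python
import numpy as np


def integrative_index(gamma, omega):
    """Return the two subscripts of every entry of the J x J integrative matrix.

    gamma: N x N taxonomic index matrix; omega: L x L repeated-measure index matrix.
    Block (a, b) of the result corresponds to OTU pair (a, b).
    """
    gamma = np.asarray(gamma)
    omega = np.asarray(omega)
    n_otus, n_repeats = gamma.shape[0], omega.shape[0]
    first = np.kron(gamma, np.ones((n_repeats, n_repeats), dtype=int))
    second = np.tile(omega, (n_otus, n_otus))
    return first, second


# Two OTUs and two time points, the smallest non-trivial case.
first, second = integrative_index([[0, 1], [1, 0]], [[0, 1], [1, 0]])
labels = [[f"{a}{b}" for a, b in zip(row_a, row_b)] for row_a, row_b in zip(first, second)]
for row in labels:
    print(row)
```

In this two-OTU, two-time-point case there are three distinct off-diagonal labels: "01" (same OTU, different time points), "10" (different OTUs, same time point) and "11" (different OTUs and different time points).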

### Microbiome Taxonomic Longitudinal Correlation (MTLC) model

After specifying the correlation matrix within one cluster of OTUs with repeated measures, in this section, we introduce how to model the association between multiple OTUs and their predictors of interest. We propose a Microbiome Taxonomic Longitudinal Correlation (MTLC) model to estimate predictor effects, correlation coefficients between OTUs, longitudinal measures and other repeated measures. We also perform a hypothesis testing of the predictor effects based on MTLC model. The estimates and tests are achieved by Generalized Estimating Equations (GEE) framework.

#### Generalized estimating equation framework.

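Let $\boldsymbol{y}_k$, $k = 1, \dots, K$, be independent clusters, where cluster $k$ has length $J_k$. For $j = 1, \dots, J_k$, let $\boldsymbol{x}_{kj}$ denote the vector of covariates with length $p$, and let $\boldsymbol{\mu}_k = (\mu_{k1}, \dots, \mu_{kJ_k})^T$ be the mean of $\boldsymbol{y}_k$. Then for each observation $y_{kj}$,

$$g(\mu_{kj}) = \boldsymbol{x}_{kj}^T \boldsymbol{\beta}, \tag{1}$$

where $g$ is a known link function and $\boldsymbol{\beta}$ are the regression parameters of the $p$ covariates $\boldsymbol{x}_{kj}$. The conditional variance of $y_{kj}$ is defined as $\mathrm{Var}(y_{kj} \mid \boldsymbol{x}_{kj}) = \nu(\mu_{kj})\phi$, where $\nu$ is the variance function depending on the distribution of $y_{kj}$, and $\phi$ is the dispersion parameter, which is $\sigma^2$ for normally distributed $y_{kj}$ and 1 for other distributions belonging to the exponential family. To estimate $\boldsymbol{\beta}$, the following generalized estimating equation is solved:

$$\sum_{k=1}^{K} \boldsymbol{D}_k^T \boldsymbol{V}_k^{-1} (\boldsymbol{y}_k - \boldsymbol{\mu}_k) = \boldsymbol{0}, \tag{2}$$

where $\boldsymbol{D}_k = \partial \boldsymbol{\mu}_k / \partial \boldsymbol{\beta}^T$ and $\boldsymbol{V}_k = \phi \boldsymbol{A}_k^{1/2} \boldsymbol{R}_k(\boldsymbol{\rho}) \boldsymbol{A}_k^{1/2}$. Here $\boldsymbol{A}_k = \mathrm{diag}\{\nu(\mu_{k1}), \dots, \nu(\mu_{kJ_k})\}$, and $\boldsymbol{R}_k(\boldsymbol{\rho})$ is the working correlation matrix following the correlation structure $\boldsymbol{R}$ constructed in section “Incorporating taxonomic structure with repeated measures”, where $\boldsymbol{\rho}$ is the collection of all correlation coefficients in $\boldsymbol{R}_k$. Clearly $\hat{\boldsymbol{\beta}}$ depends on $\boldsymbol{\rho}$ and $\phi$, which also need to be estimated. If we define the Pearson residual $e_{kj} = (y_{kj} - \hat{\mu}_{kj}) / \sqrt{\nu(\hat{\mu}_{kj})}$, then

$$\hat{\phi} = \frac{1}{\sum_{k=1}^{K} J_k - p} \sum_{k=1}^{K} \sum_{j=1}^{J_k} e_{kj}^2.$$

Next, $\hat{\boldsymbol{\rho}}$ is estimated as a function of $\phi$ and $e_{kj}$. The exact formula of $\hat{\boldsymbol{\rho}}$ depends on the correlation structure $\boldsymbol{R}$, and a few examples of $\hat{\boldsymbol{\rho}}$ under different structures are given in Liang et al. [27] and Wang [33]. Because the Pearson residuals $e_{kj}$ also depend on $\hat{\boldsymbol{\beta}}$, this yields an iterative scheme which alternates between estimating $\boldsymbol{\beta}$ for fixed values of $\hat{\phi}$ and $\hat{\boldsymbol{\rho}}$, and estimating $\phi$ and $\boldsymbol{\rho}$ for a fixed value of $\hat{\boldsymbol{\beta}}$. Under GEE theory [27], this scheme yields a consistent estimate of $\boldsymbol{\beta}$. Moreover, $\hat{\boldsymbol{\beta}}$ is asymptotically normally distributed with mean $\boldsymbol{\beta}$ and variance

$$\boldsymbol{V}_{\beta} = \left( \sum_{k=1}^{K} \boldsymbol{D}_k^T \boldsymbol{V}_k^{-1} \boldsymbol{D}_k \right)^{-1} \left\{ \sum_{k=1}^{K} \boldsymbol{D}_k^T \boldsymbol{V}_k^{-1} \mathrm{Cov}(\boldsymbol{y}_k) \boldsymbol{V}_k^{-1} \boldsymbol{D}_k \right\} \left( \sum_{k=1}^{K} \boldsymbol{D}_k^T \boldsymbol{V}_k^{-1} \boldsymbol{D}_k \right)^{-1}, \tag{3}$$

where $\mathrm{Cov}(\boldsymbol{y}_k)$ is the true underlying covariance matrix of $\boldsymbol{y}_k$. The consistent estimator of $\boldsymbol{V}_{\beta}$, $\hat{\boldsymbol{V}}_{\beta}$, is obtained by substituting $\hat{\boldsymbol{\beta}}$, $\hat{\boldsymbol{\rho}}$, $\hat{\phi}$ and $(\boldsymbol{y}_k - \hat{\boldsymbol{\mu}}_k)(\boldsymbol{y}_k - \hat{\boldsymbol{\mu}}_k)^T$ for $\boldsymbol{\beta}$, $\boldsymbol{\rho}$, $\phi$ and $\mathrm{Cov}(\boldsymbol{y}_k)$.

The GEE method yields a consistent estimator of $\boldsymbol{\beta}$ even if the structure of the working correlation matrix is not correctly specified. A misspecified $\boldsymbol{R}_k(\boldsymbol{\rho})$ only affects the efficiency of $\hat{\boldsymbol{\beta}}$. Consistent estimation of the correlation matrix $\boldsymbol{R}$, however, relies on correct specification of the correlation structure.

For testing a hypothesis of the form $H_0: \boldsymbol{C}\boldsymbol{\beta} = \boldsymbol{c}$, a Wald test statistic can be used with the form

$$W = (\boldsymbol{C}\hat{\boldsymbol{\beta}} - \boldsymbol{c})^T (\boldsymbol{C} \hat{\boldsymbol{V}}_{\beta} \boldsymbol{C}^T)^{-1} (\boldsymbol{C}\hat{\boldsymbol{\beta}} - \boldsymbol{c}), \tag{4}$$

and $W \sim \chi^2_q$ asymptotically under $H_0$, where $q$ is the rank of matrix $\boldsymbol{C}$.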
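The iterative scheme, the sandwich variance and the Wald statistic can be illustrated with a compact numerical sketch. To stay short it makes several simplifying assumptions that are not part of the MTLC model: an identity link with ν(μ) = 1, equal cluster sizes, and a plain exchangeable working correlation instead of the taxonomic structure. The function names and the simulated data are hypothetical.

```python
import numpy as np


def gee_gaussian_exchangeable(y, X, n_iter=25):
    """Sketch of a GEE fit. y: (K, J) outcomes, X: (K, J, p) covariates."""
    K, J, p = X.shape
    # Start from the independence (ordinary least squares) solution.
    beta = np.linalg.lstsq(X.reshape(-1, p), y.reshape(-1), rcond=None)[0]
    for _ in range(n_iter):
        resid = y - X @ beta  # Pearson residuals, since nu(mu) = 1
        phi = (resid ** 2).sum() / (K * J - p)
        # Sum over clusters of products e_kj * e_kj' for j < j'.
        cross = ((resid.sum(axis=1) ** 2 - (resid ** 2).sum(axis=1)) / 2).sum()
        rho = cross / ((K * J * (J - 1) / 2 - p) * phi)
        V_inv = np.linalg.inv(phi * ((1 - rho) * np.eye(J) + rho * np.ones((J, J))))
        bread = np.einsum("kjp,jl,klq->pq", X, V_inv, X)
        beta = np.linalg.solve(bread, np.einsum("kjp,jl,kl->p", X, V_inv, y))
    # Sandwich variance with Cov(y_k) replaced by the residual outer product.
    resid = y - X @ beta
    score = np.einsum("kjp,jl,kl->kp", X, V_inv, resid)
    bread_inv = np.linalg.inv(bread)
    cov = bread_inv @ (score.T @ score) @ bread_inv
    return beta, rho, phi, cov


def wald_statistic(beta, cov, C, c):
    """Wald statistic for H0: C beta = c."""
    diff = C @ beta - c
    return float(diff @ np.linalg.solve(C @ cov @ C.T, diff))


# Hypothetical data: intercept 1.0, slope 0.5, within-cluster correlation 0.5.
rng = np.random.default_rng(0)
K, J = 400, 4
x = rng.integers(0, 2, size=(K, J)).astype(float)
X = np.stack([np.ones((K, J)), x], axis=-1)
y = 1.0 + 0.5 * x + rng.normal(size=(K, 1)) + rng.normal(size=(K, J))
beta_hat, rho_hat, phi_hat, cov_hat = gee_gaussian_exchangeable(y, X)
W = wald_statistic(beta_hat, cov_hat, np.array([[0.0, 1.0]]), np.array([0.0]))
print(beta_hat, rho_hat, phi_hat, W)
```

Replacing the exchangeable matrix by a working correlation built from the integrative structure would change only the moment estimator of the correlation coefficients and the construction of the inverse covariance.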

#### Estimating predictors effects on OTUs.

Based on the GEE framework, we develop the MTLC model to assess the association between OTUs and the predictors of interest, accounting for the correlation of repeated OTU measures. To deal with the excess zeros of OTUs in the MTLC model, we first convert the quantitative OTU observations to binary outcomes (0 and 1), indicating the prevalence of the OTU in each observation. Next, we focus on the OTU relative abundance (RA) of each non-zero observation, and assume the RAs follow a normal distribution after log transformation. We use two separate GEE models, one for assessing the predictor effects on OTU prevalence, and the other for assessing the predictor effects on positive RA. The predictors’ overall effects are finally tested by combining the test statistics from these two GEE models.

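Formally, for $k = 1, \dots, K$ and $j = 1, \dots, J_k$, we assume each OTU observation $y_{kj}$ follows a mixture of a Bernoulli and a log-normal distribution. Let $y^{(0)}_{kj}$ be the prevalence indicator, which equals 0 when $y_{kj} = 0$ and 1 when $y_{kj} > 0$, and suppose it follows a Bernoulli distribution with mean $\mu^{(0)}_{kj}$. Let $y^{(+)}_{kj}$ be the log-transformed value of a positive observation $y_{kj} > 0$, and suppose it follows a normal distribution with mean $\mu^{(+)}_{kj}$ and distribution function $\Phi$. The distribution of $y_{kj}$ then places probability $1 - \mu^{(0)}_{kj}$ at zero and, with probability $\mu^{(0)}_{kj}$, takes a positive value whose log-transform is distributed according to $\Phi$.
By definition, $y^{(0)}_{kj}$ represents the OTU prevalence observations
and $y^{(+)}_{kj}$ represents the positive RAs. We use $\boldsymbol{y}^{(0)}_k$ to denote the vector of all $y^{(0)}_{kj}$, and $\boldsymbol{y}^{(+)}_k$ to denote the vector of $y^{(+)}_{kj}$ over the subset where $y_{kj} > 0$.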

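Rather than running a generalized linear model directly on $\boldsymbol{y}_k$, we apply the GEE method separately to $\boldsymbol{y}^{(0)}_k$ (the prevalence indicators) and $\boldsymbol{y}^{(+)}_k$ (the log-transformed positive RAs). For these two GEE models, the predictors’ design matrices $\boldsymbol{X}_k$ do not have to be the same in principle, although they could be the same in many practical situations. Without loss of generality, we assume the predictors are the same in each part of the GEE model in this paper. We choose the logit link function for binary outcomes and the identity link function for log-transformed non-zero outcomes, and the two parts of the GEE model are

$$\mathrm{logit}\left(\mu^{(0)}_{kj}\right) = \boldsymbol{x}_{kj}^T \boldsymbol{\beta}^{(0)} \tag{5}$$

and

$$\mu^{(+)}_{kj} = \boldsymbol{x}_{kj}^T \boldsymbol{\beta}^{(+)}. \tag{6}$$

Using the iterative scheme discussed in section “Generalized estimating equation framework” on $\boldsymbol{y}^{(0)}_k$ and $\boldsymbol{y}^{(+)}_k$, we obtain the corresponding parameter estimates $\hat{\boldsymbol{\beta}}^{(0)}$ and $\hat{\boldsymbol{\beta}}^{(+)}$.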
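Splitting one vector of relative abundances into the two outcomes used by the two GEE parts can be sketched as below. The function name `two_part_outcomes` is hypothetical, and the natural logarithm is an assumption made for this example; any fixed log base would only rescale the positive part.

```python
import numpy as np


def two_part_outcomes(relative_abundance):
    """Return the prevalence indicators, the positivity mask and the log-transformed positive RAs."""
    ra = np.asarray(relative_abundance, dtype=float)
    positive = ra > 0
    prevalence = positive.astype(int)   # outcome of the logistic part
    log_positive = np.log(ra[positive])  # outcome of the linear part (assumed natural log)
    return prevalence, positive, log_positive


prevalence, positive, log_positive = two_part_outcomes([0.0, 0.1, 0.0, 0.5])
print(prevalence, log_positive)
```

The mask `positive` records which observations enter the second part, which is also what determines the rows and columns kept in the working correlation matrix for the positive RAs.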

#### Hypothesis testing.

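To test whether the predictors affect either the prevalence of OTUs or the magnitude of the positive RAs, the null hypothesis is

$$H_0: \boldsymbol{C}^{(0)} \boldsymbol{\beta}^{(0)} = \boldsymbol{c}^{(0)} \quad \text{and} \quad \boldsymbol{C}^{(+)} \boldsymbol{\beta}^{(+)} = \boldsymbol{c}^{(+)}.$$

Assuming the same $\boldsymbol{X}_k$ for the prevalence part and the positive part of the GEE model, $\boldsymbol{\beta}^{(0)}$ and $\boldsymbol{\beta}^{(+)}$ have the same dimension $p$. Moreover, $\boldsymbol{C}^{(0)} = \boldsymbol{C}^{(+)}$ and $\boldsymbol{c}^{(0)} = \boldsymbol{c}^{(+)}$ in many practical situations. For example, if we want to test the first $q$ predictors in $\boldsymbol{X}_k$ and the remaining $p - q$ covariates are not of interest, then

$$\boldsymbol{C}^{(0)} = \boldsymbol{C}^{(+)} = \left( \boldsymbol{I}_q, \; \boldsymbol{0}_{q \times (p - q)} \right), \qquad \boldsymbol{c}^{(0)} = \boldsymbol{c}^{(+)} = \boldsymbol{0}_q.$$

For each part of $H_0$, the corresponding test statistics $W^{(0)}$ and $W^{(+)}$ are computed following Eq 4.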

It follows from section “Generalized estimating equation framework” that, under $H_0$, $W^{(0)} \sim \chi^2_q$ and $W^{(+)} \sim \chi^2_q$ asymptotically. To test the two null hypotheses jointly, we combine $W^{(0)}$ and $W^{(+)}$ using the Cauchy combination test [35], which does not require independence between $W^{(0)}$ and $W^{(+)}$. Let $p^{(0)}$ and $p^{(+)}$ be the corresponding p-values; then the Cauchy combination test statistic is

$$T = \frac{1}{2} \tan\left\{ \left(0.5 - p^{(0)}\right) \pi \right\} + \frac{1}{2} \tan\left\{ \left(0.5 - p^{(+)}\right) \pi \right\}. \tag{7}$$
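A minimal sketch of the Cauchy combination of two p-values follows. Equal weights are an assumption of this example, and the combined p-value uses the standard Cauchy tail approximation; the function name `cauchy_combination` is hypothetical.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    weights default to equal weights summing to one (an assumption of this sketch).
    Returns the test statistic and the combined p-value.
    """
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    statistic = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # Under the null, the statistic is approximately standard Cauchy.
    combined_p = 0.5 - math.atan(statistic) / math.pi
    return statistic, combined_p


stat_equal, p_equal = cauchy_combination([0.01, 0.01])
stat_mixed, p_mixed = cauchy_combination([0.001, 0.40])
print(p_equal, p_mixed)
```

When both inputs are equal, the combined p-value equals the common input; when one p-value is very small, it dominates the combination.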

#### Estimating correlation coefficients.

In our proposed MTLC model, the correlation structure is based on the OTU taxonomic structure and the correlations between repeated measures. Here we assume the two GEE models corresponding to the OTU prevalence part and the positive RA part have the same correlation structure $\boldsymbol{R}$. However, the estimated values of the correlation coefficients, $\hat{\boldsymbol{\rho}}^{(0)}$ and $\hat{\boldsymbol{\rho}}^{(+)}$, may be different for each part of the GEE model. For $\boldsymbol{y}^{(0)}_k$ and $\boldsymbol{y}^{(+)}_k$, $\hat{\boldsymbol{\rho}}^{(0)}$ and $\hat{\boldsymbol{\rho}}^{(+)}$ are estimated separately following the iterative scheme discussed in section “Generalized estimating equation framework”.

Note that GEE models do not require equal cluster sizes; unequal sizes can arise, for example, in unbalanced study designs and/or when some observations are missing. Even if $\boldsymbol{y}^{(0)}_k$ has equal size for all $k$, $\boldsymbol{y}^{(+)}_k$ may have different sizes because it collects only the positive RAs. This implies that the dimension of $\boldsymbol{R}$ may be greater than the length of $\boldsymbol{y}^{(0)}_k$ and $\boldsymbol{y}^{(+)}_k$ for some $k$. In such cases, the rows and columns of $\boldsymbol{R}$ corresponding to empty values of OTU observations need to be removed, and we denote the modified correlation structure matrices by $\boldsymbol{R}^{(0)}_k$ and $\boldsymbol{R}^{(+)}_k$, respectively, for each $k$. When applying the estimating equations in our MTLC model, we essentially use $\boldsymbol{R}^{(0)}_k$ and $\boldsymbol{R}^{(+)}_k$ as the working correlation matrices.
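Removing the rows and columns that correspond to empty observations amounts to taking a principal submatrix. The sketch below shows this for one cluster; the function name `working_correlation` and the numerical values are hypothetical.

```python
import numpy as np


def working_correlation(R, observed):
    """Keep only the rows and columns of R whose observations are present."""
    keep = np.flatnonzero(np.asarray(observed, dtype=bool))
    return R[np.ix_(keep, keep)]


# Hypothetical 4 x 4 correlation matrix and a cluster whose second observation is empty.
R_full = np.array([
    [1.0, 0.5, 0.3, 0.2],
    [0.5, 1.0, 0.2, 0.3],
    [0.3, 0.2, 1.0, 0.5],
    [0.2, 0.3, 0.5, 1.0],
])
R_cluster = working_correlation(R_full, [True, False, True, True])
print(R_cluster)
```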

## Simulation

### Simulation settings

Simulation studies are designed to simulate a zero-inflated multivariate normal distribution to reflect the correlation of −log_{10} transformed OTUs. To achieve this, we simulate both multivariate Bernoulli distribution samples $\boldsymbol{Y}^{(0)}$ and truncated multivariate normal distribution samples $\boldsymbol{Z}$ of size $K$ and length $J$. Multivariate normal distributions are truncated to generate positive samples because all −log_{10} transformed RAs should be positive. We further assume a single binary predictor $\boldsymbol{X}$, where $\boldsymbol{X}$ also has dimension $K \times J$, and the means of $\boldsymbol{Y}^{(0)}$ and $\boldsymbol{Z}$ depend on $\boldsymbol{X}$. Specifically, we simulate $\boldsymbol{Y}^{(0)}$ from a multivariate Bernoulli distribution whose mean depends on $\boldsymbol{X}\beta^{(0)}$ through the logit link of Eq 5, and $\boldsymbol{Z} \sim N_J(\boldsymbol{X}\beta^{(+)}, \boldsymbol{R})$ truncated at 0. The zero-inflated multivariate normal distribution samples are computed as $\boldsymbol{Y} = \boldsymbol{Y}^{(0)} \boldsymbol{Z}$, where the product is taken elementwise. $\boldsymbol{Y}$ is indirectly associated with $\boldsymbol{X}$ via $\boldsymbol{Y}^{(0)}$ and $\boldsymbol{Z}$.

For illustration purposes, we assume the simplest correlation structure, i.e., two correlated OTUs under the taxonomic structure and two repeated measures at different time points. The correlation matrix $\boldsymbol{R}$ is then derived following section “Incorporating taxonomic structure with repeated measures”:

$$\boldsymbol{R} = \begin{pmatrix} 1 & \rho_{01} & \rho_{10} & \rho_{11} \\ \rho_{01} & 1 & \rho_{11} & \rho_{10} \\ \rho_{10} & \rho_{11} & 1 & \rho_{01} \\ \rho_{11} & \rho_{10} & \rho_{01} & 1 \end{pmatrix}.$$

Here $\rho_{01}$ and $\rho_{10}$ denote the correlation between two time points and between two OTUs, respectively. $\rho_{11}$ represents the correlation of observations from different OTUs and different time points, which is not of primary interest. We assume the simulated multivariate Bernoulli and multivariate normal distributions follow the same correlation structure $\boldsymbol{R}$, but the correlation coefficients $\boldsymbol{\rho}^{(0)}$ and $\boldsymbol{\rho}^{(+)}$ can be different.

After obtaining the zero-inflated multivariate normal distribution samples $\boldsymbol{Y}$, we run a GEE logistic model following Eq 5 to estimate the effects of $\boldsymbol{X}$ on OTU prevalence $\boldsymbol{Y}^{(0)}$, and a GEE linear model following Eq 6 to estimate the effects of $\boldsymbol{X}$ on the non-zero RAs $\boldsymbol{Y}^{(+)}$, where $\boldsymbol{Y}^{(+)}$ is the subset of $\boldsymbol{Z}$ such that $y^{(0)}_{kj} = 1$. Under GEE theory, both $\boldsymbol{Y}^{(0)}$ and $\boldsymbol{Y}^{(+)}$ yield consistent estimates of $\boldsymbol{\beta}$ and $\boldsymbol{\rho}$. However, we simulate $\boldsymbol{Z}$ rather than $\boldsymbol{Y}^{(+)}$, and $\boldsymbol{Z}$ and $\boldsymbol{Y}^{(+)}$ may not yield the same estimates in general. To solve this issue, we simulate $\boldsymbol{Z}$ and $\boldsymbol{Y}^{(0)}$ independently, which implies that $y^{(+)}_{kj}$ has the same distribution as $z_{kj}$. Therefore, $\boldsymbol{Z}$ also yields consistent estimates of $\boldsymbol{\beta}$ and $\boldsymbol{\rho}$.

In contrast to some of the literature, where $\boldsymbol{Y}$ is simulated directly, we conduct our simulation on $\boldsymbol{Y}^{(0)}$ and $\boldsymbol{Z}$ separately. This is because, following the mixture distribution framework, we fit two separate GEE models on $\boldsymbol{Y}^{(0)}$ and $\boldsymbol{Y}^{(+)}$ rather than one model directly on $\boldsymbol{Y}$. In this way, we can clearly specify the true values of the predictor’s main effects and OTU correlations in the simulation settings, and explicitly evaluate whether the estimates of these values are unbiased. As a sensitivity analysis to evaluate the robustness of our model performance, we also simulate $\boldsymbol{Y}^{(0)}$ and $\boldsymbol{Z}$ from a (generalized) linear mixed model. Results are presented in S1 Appendix.
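The simulation design can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the correlated Bernoulli draws are generated by thresholding a latent multivariate normal vector (a Gaussian copula, which does not reproduce the target binary correlations exactly), the truncated normal draws use rejection sampling, the predictor is constant within a cluster, and all numerical values, including the intercepts and correlations, are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def simulate_zero_inflated(K, beta0, beta_plus, R0, R_plus,
                           intercept0=0.0, intercept_plus=2.0, seed=0):
    """Simulate prevalence Y0, truncated normal Z and their product Y for K clusters."""
    rng = np.random.default_rng(seed)
    J = R0.shape[0]
    X = np.repeat(rng.integers(0, 2, size=K)[:, None], J, axis=1)  # binary predictor

    # Prevalence part: threshold a latent normal vector at the normal quantile of the
    # logistic probability, so each margin has the intended Bernoulli mean.
    prob = 1.0 / (1.0 + np.exp(-(intercept0 + beta0 * X)))
    thresholds = np.vectorize(NormalDist().inv_cdf)(prob)
    latent = rng.multivariate_normal(np.zeros(J), R0, size=K)
    Y0 = (latent < thresholds).astype(int)

    # Positive part: multivariate normal truncated at 0 via rejection sampling.
    mean = intercept_plus + beta_plus * X
    Z = mean + rng.multivariate_normal(np.zeros(J), R_plus, size=K)
    rejected = (Z <= 0).any(axis=1)
    while rejected.any():
        n_redraw = int(rejected.sum())
        Z[rejected] = mean[rejected] + rng.multivariate_normal(np.zeros(J), R_plus, size=n_redraw)
        rejected = (Z <= 0).any(axis=1)

    return X, Y0, Z, Y0 * Z


# Hypothetical correlations for two OTUs at two time points (rho_01, rho_10, rho_11).
r01, r10, r11 = 0.5, 0.3, 0.2
R = np.array([
    [1.0, r01, r10, r11],
    [r01, 1.0, r11, r10],
    [r10, r11, 1.0, r01],
    [r11, r10, r01, 1.0],
])
X, Y0, Z, Y = simulate_zero_inflated(K=1000, beta0=-0.1, beta_plus=0.05, R0=R, R_plus=R)
print(Y[:3])
```

Because the prevalence and positive parts are drawn independently, the positive entries of `Y` are an unselected subset of `Z`, which mirrors the argument given above.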

### Inferences for predictor’s main effects

First, we evaluate the performance of our proposed MTLC model for estimating and testing the main effects of the predictor $\boldsymbol{X}$. Let $\beta^{(0)}$ denote the effect on OTU prevalence and $\beta^{(+)}$ denote the effect on the log_{10} transformed non-zero RA. We evaluate the unbiasedness of the estimates $\hat{\beta}^{(0)}$ and $\hat{\beta}^{(+)}$, the Type I error for testing $\beta^{(0)} = \beta^{(+)} = 0$, and the test power when $\beta^{(0)}$ and/or $\beta^{(+)} \neq 0$. OTU observations are simulated under the simulation settings discussed in section “Simulation settings” with sample size $K = 1000$ and various combinations of $\beta^{(0)}$ and $\beta^{(+)}$ values. We assume fixed positive values of $\rho_{01}$ and $\rho_{10}$ for both the multivariate normal and multivariate Bernoulli distributions. The estimates of $\beta$, Type I errors and powers are computed based on 1000 replications. The computation time is about 4 hours to complete all 1000 replications on a desktop computer with a quad-core processor and 8GB of RAM.

Next we compare our MTLC model to other models. All models are described in Table 1.

For each model, the estimated $\hat{\beta}^{(0)}$, $\hat{\beta}^{(+)}$, Type I error and power are summarized in Table 2. We find that all estimates of *β*^{(0)} and *β*^{(+)} are unbiased under the MTLC model. For the one-part models, because there is no true value of *β* as a mixture of *β*^{(0)} and *β*^{(+)}, the unbiasedness of the estimated *β* cannot be evaluated. Regarding the variation of the estimates, the 2.5 and 97.5 percentiles of the empirical distributions of $\hat{\beta}$ are shown in S1 Appendix.

Given the nominal Type I error level of 0.05, the 2P_ind and 1P_ind models have inflated Type I error, while the estimated Type I errors of all other models are close to the nominal level. Note that when only one of *β*^{(0)} and *β*^{(+)} equals 0, the Type I error estimate is still accurate. For example, when (*β*^{(0)}, *β*^{(+)}) = (0, 0.05), the GEE^{(0)} model for testing *β*^{(0)} = 0 has Type I error 0.062, which is not affected by the non-zero value of *β*^{(+)}. This further supports the independence of the linear and logistic regression parts in the two-part model.

We also evaluate the power of the different models. The power of the 2P_ind and 1P_ind models is inflated because of their Type I error inflation. Our proposed MTLC model is the most powerful in general. When one of *β*^{(0)} and *β*^{(+)} is 0, the MTLC model is slightly less powerful than whichever of the GEE^{(0)} and GEE^{(+)} models tests only the part with *β* ≠ 0. However, when both *β*^{(0)} and *β*^{(+)} are non-zero, the MTLC model is much more powerful than both the GEE^{(0)} and GEE^{(+)} models. The 1P_GEE model and the 1P_RE model have similar power. Note that the 1P_RE model is not able to accommodate negative correlations because of the nature of random effects. This is the reason that we choose *ρ*_{01} and *ρ*_{10} to be positive in the simulation settings. When the true correlations are negative, the 1P_RE model simply reduces to the 1P_ind model. Compared with the MTLC model, the power of the one-part models drops dramatically when *β*^{(0)} and *β*^{(+)} have opposite signs. This is because the positive effect cancels out the negative effect in one-part models, whereas both effects are well captured in two-part models. When *β*^{(0)} and *β*^{(+)} have the same direction, we do observe some cases in which the power of the one-part models is larger. This is related to how the excess zeros are handled in the one-part models. A detailed discussion of this issue is provided in section “Two-part vs. one-part models”.

### Estimations for the correlation coefficients

The MTLC model can also provide estimates of the correlation coefficients. First we evaluate the unbiasedness of the correlation estimates. Let *ρ*^{(0)} and *ρ*^{(+)} be the correlation coefficients in the GEE^{(0)} and GEE^{(+)} models. In the simulation settings, we fix the true values of *ρ*^{(0)} and *ρ*^{(+)}, and set *β*^{(0)} = −0.1 and *β*^{(+)} = 0.05. The specified *β* values do not affect the estimation of ***ρ***. The sample size is *K* = 1000 and the number of replications remains 1000.

The correlation structure of OTUs is based on the taxonomic structure, which is usually known in practice. However, the correlation structure of repeated measures within each OTU may not be known and usually requires subjective assumptions. One merit of the GEE model is that even if the assumed correlation structure is not correct, the estimation of the main effect *β* is not affected. The estimates are consistent under different assumptions about the correlation structure, as illustrated by Yan [36] and confirmed by our simulation study (results not shown). In addition, we evaluate the consistency of the correlation estimates when the correlation structure is misspecified.

In contrast to the correct correlation structure ***R***, we first construct a model with a correlation matrix that assumes OTUs are independent while time points are still correlated. We then construct another model with a correlation matrix that assumes time points are independent while OTUs are still correlated. When OTUs are assumed to be independent, the GEE model can only estimate the correlations between time points; when time points are assumed to be independent, the GEE model can only estimate the correlations between OTUs. The correlation estimates are summarized in Table 3.

From Table 3, the correlation estimates under the true correlation structure are all unbiased. When the correlation structure is not correctly specified, the model may not estimate all correlation coefficients of the correct structure. More interestingly, for those correlation coefficients that can be estimated under the misspecified structure, the estimates remain unbiased. This implies that if we are not interested in estimating all correlations in the correct correlation structure, we can simplify the structure. For example, if a correlation coefficient is not of interest, we can set it to 0 without affecting the estimation of the remaining coefficients.

The correlation structure only contains two OTUs and two time points, so the GEE correlation estimates are essentially pairwise correlations, and thus they can be compared with corresponding Pearson correlation coefficients. Both results are consistent as expected. The merit of our MTLC model is that when the correlation structure is more complicated and the pairwise Pearson correlation is not available, it may still provide unbiased estimation of the correlation matrix.
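The link between a moment-type correlation estimate and the Pearson coefficient in the pairwise case can be checked numerically. The sketch below is a simplified illustration with simulated bivariate normal pairs and intercept-only means. It is not the estimating procedure of geepack or of the MTLC model, and the true correlation 0.3 is an arbitrary choice.

```python
import math
import random

def simulate_pairs(k, rho, seed=0):
    """Draw k pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return pairs

def pearson(pairs):
    """Sample Pearson correlation coefficient."""
    k = len(pairs)
    mx = sum(x for x, _ in pairs) / k
    my = sum(y for _, y in pairs) / k
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def moment_estimate(pairs):
    """Average product of standardized residuals from intercept-only fits."""
    k = len(pairs)
    mx = sum(x for x, _ in pairs) / k
    my = sum(y for _, y in pairs) / k
    sdx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs) / k)
    sdy = math.sqrt(sum((y - my) ** 2 for _, y in pairs) / k)
    return sum(((x - mx) / sdx) * ((y - my) / sdy) for x, y in pairs) / k

pairs = simulate_pairs(k=1000, rho=0.3)
r_pearson = pearson(pairs)
r_moment = moment_estimate(pairs)
print(r_pearson, r_moment)
```

With a single pair of margins the two quantities are algebraically the same. With a structured matrix, where one parameter is shared by many pairs, a model-based estimate pools across pairs and a single pairwise Pearson coefficient no longer corresponds to it.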

### Two-part vs. one-part models

For one-part models, if we take the −log_{10} transformation of both the non-zero RAs and the zeros, then every 0 becomes ∞. One common remedy is to replace each 0 with a small positive value, such as 10^{−5}. However, we find that the power of the one-part model tests is sensitive to this arbitrary value. In Table 4, we replace −log_{10} 0 by 6, 5, 4 and 3 and compare the corresponding test powers with those of the MTLC model. We only present the 1P_GEE model, as we have shown in Table 2 that the 1P_RE model has similar power to 1P_GEE.

Table 4 indicates that there is no optimal choice of the value for replacing zero RAs. For each value selected, depending on (*β*^{(0)}, *β*^{(+)}), there may be situations in which the one-part model has comparable or even slightly better power than the corresponding two-part model (e.g., 0.650 vs. 0.609 when (*β*^{(0)}, *β*^{(+)}) = (0.1, 0) and 0 is replaced by 10^{−6}), but the power loss is much larger for some other values of *β* (e.g., 0.138 vs. 0.421 when (*β*^{(0)}, *β*^{(+)}) = (0, 0.05) and 0 is replaced by 10^{−6}). We conclude that our MTLC model has superior and robust power compared to the one-part models, and we suggest that readers avoid one-part models in practice when the OTU data contain excessive zeros.
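The two ways of handling zeros discussed above can be sketched as follows. The RA values are made up for illustration, and coding presence as 1 in the binary part is an assumption about the indicator rather than something stated in this section.

```python
import math

def one_part_outcome(ras, pseudo):
    """-log10 transform with every zero replaced by the pseudo-value."""
    return [pseudo if ra == 0 else -math.log10(ra) for ra in ras]

def two_part_outcome(ras):
    """Split RAs into a presence indicator and the transformed positive part."""
    present = [1 if ra > 0 else 0 for ra in ras]
    positive = [-math.log10(ra) for ra in ras if ra > 0]
    return present, positive

ras = [0.0, 0.1, 0.01, 0.0, 0.001]  # hypothetical relative abundances

# The one-part outcome, and hence its mean, shifts with the pseudo-value.
one_part_means = {}
for pseudo in (6, 5, 4, 3):
    y = one_part_outcome(ras, pseudo)
    one_part_means[pseudo] = sum(y) / len(y)
    print(pseudo, round(one_part_means[pseudo], 3))

# The two-part representation does not involve a pseudo-value at all.
present, positive = two_part_outcome(ras)
print(present, [round(v, 3) for v in positive])
```

The one-part mean moves from 3.6 to 2.4 as the pseudo-value goes from 6 to 3, although the data are unchanged. The two-part outcome has no such tuning constant.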

## Application

We apply our proposed MTLC model to a twin study described in Turnbaugh et al. [37]. The full dataset is provided in the supporting information S1 Data. The data consist of 54 families, and each family has a pair of twins. Each individual has at most two observations, at two time points. The primary research question is to assess the association between obesity status (lean, overweight or obese) and OTUs, and to estimate the correlations between the two time points, between the twins in each pair, and between OTUs. For illustration purposes, we only analyze OTUs within the order *Clostridiales*, which consists of 9 OTUs at the genus level. The taxonomic structure of these 9 OTUs is shown in Fig 3.

From Fig 3, all 9 OTUs first share a common taxon (*Clostridiales*) at the order level, and each of the 9 OTUs belongs to a different taxon at the genus level. We define the order level as level 1, the family level as level 2 and the genus level as level 3, thus *I* = 3. Accordingly, the numerical representation of the taxonomic structure is *n*_{1} = 9, *n*_{2} = (4, 1, 4), *n*_{3} = (1, 1, 1, 1, 1, 1, 1, 1, 1).
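The numerical representation can be turned into a table of which OTU pairs share a family. The sketch below labels each pair by the deepest taxon it shares. It is only an illustration of the grouping implied by *n*_{2} = (4, 1, 4): the labels are hypothetical, and it does not reproduce the four-step construction or the notation of the taxonomic structure matrix.

```python
from collections import Counter

def family_index(n2):
    """Expand family sizes, e.g. (4, 1, 4), into one family id per OTU."""
    return [fam for fam, size in enumerate(n2, start=1) for _ in range(size)]

def pair_labels(n2):
    """Label every OTU pair by the deepest taxon the two OTUs share."""
    fam = family_index(n2)
    n = len(fam)
    labels = [[None] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                labels[a][b] = "same_otu"
            elif fam[a] == fam[b]:
                labels[a][b] = f"within_family_{fam[a]}"
            else:
                labels[a][b] = "between_families"
    return labels

labels = pair_labels((4, 1, 4))
# Count distinct unordered pairs (upper triangle) per label.
counts = Counter(labels[a][b] for a in range(9) for b in range(a + 1, 9))
print(counts)
```

Of the 36 distinct pairs, 6 fall within each of the two four-genus families and 24 connect different families within the order. The single-genus family contributes no within-family pair, which leaves three kinds of OTU correlation.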

Next, following the 4 steps described in section “Taxonomic structure of OTUs”, the taxonomic structure matrix is

Because each OTU is observed at two time points for a pair of twins, the repeated measure correlation structure following section “Modelling correlations from repeated measures” is

The dimensions of **Γ** and **Ω** are *N* = 9 and *L* = 4, so, as described in section “Incorporating taxonomic structure with repeated measures”, the integrative correlation matrix ***R*** has dimension *J* = *N* × *L* = 36. For *a* = 1, …, 9 and *b* = 1, …, 9, the entries of ***R*** are defined case by case through four conditions. The integrative correlation matrix is then

To apply the proposed MTLC model, all OTU observations are summarized as ***Y***. ***X*** is the single binary predictor denoting obesity status (lean vs. obese/overweight). Both ***Y*** and ***X*** have dimension *K* × *J*, where *K* = 54 and *J* = 36. Some pedigrees consist of only one individual instead of a pair of twins, and OTUs are observed at one instead of two time points for some individuals, hence missing values exist in the matrix ***Y***. Next, ***Y*** is separated into ***Y***^{(0)} and ***Y***^{(+)}, representing OTU prevalences and positive RAs. We assume that each element of ***Y***^{(0)} follows a Bernoulli distribution and each element of ***Y***^{(+)} follows a log-normal distribution, each with its own mean. Then, under the MTLC model, ***Y*** and ***X*** are related through Eqs 8 and 9, in which *α*^{(0)} and *α*^{(+)} are intercept parameters that are not of primary interest. Our goal is to estimate the effects of obesity status, *β*^{(0)} and *β*^{(+)}, and to test *H*_{0}: *β*^{(0)} = *β*^{(+)} = 0. *β*^{(0)} and *β*^{(+)} are estimated separately under Eq 2, and *H*_{0} is tested by the combined test statistic *W*_{MTLC} following Eq 7.
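For orientation, a two-part model with one binary predictor is commonly written in the form below. This is a sketch under assumed notation, not a reproduction of Eqs 8 and 9: the subscripts *k* (family) and *j* (column of ***Y***), the symbols *μ*^{(0)} and *μ*^{(+)}, and the choice of logit and identity links are assumptions based on the logistic and linear regression parts and the −log_{10} transformation described in the text.

```latex
\begin{align}
\operatorname{logit}\bigl(\mu^{(0)}_{kj}\bigr) &= \alpha^{(0)} + \beta^{(0)} X_{kj},
  & \mu^{(0)}_{kj} &= E\bigl[Y^{(0)}_{kj}\bigr], \\
\mu^{(+)}_{kj} &= \alpha^{(+)} + \beta^{(+)} X_{kj},
  & \mu^{(+)}_{kj} &= E\bigl[-\log_{10} Y^{(+)}_{kj}\bigr].
\end{align}
```

In this form, *β*^{(0)} is a log odds ratio for the binary part and *β*^{(+)} is a mean difference on the −log_{10} scale for the positive part, so the two coefficients are not directly comparable in magnitude.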

We summarize the estimates of the obesity effects on OTUs and the corresponding p-values for testing *H*_{0} in Table 5. We compare the MTLC model with the other models listed in Table 1. Under our MTLC model, obesity shows a significant overall association with these OTUs. Specifically, it shows a significant association with the prevalence of OTUs, but no significant association with the non-zero RAs. None of the other models detects the overall significance. The computation time is less than 30 seconds for the twin study dataset.

Correlation estimates are presented in Table 6. Two of the coefficients are the correlation between the two time points and the correlation between the two twins. The other three are OTU correlations: the correlation between OTUs from different families within the same order *Clostridiales*, and the correlations within the same family, *Lachnospiraceae* or *Ruminococcaceae*.

When Pearson correlations are available (for the time-point and twin correlations), they are quite consistent with the correlation estimates under the GEE models. However, Pearson correlation is not available for the OTU correlations because of the complicated taxonomic structure, and only our proposed MTLC model can estimate these correlations.

## Discussion

In this paper, we develop and implement a novel approach to model the correlations of OTUs based on the biological taxonomic structure. The proposed MTLC model can incorporate the taxonomic structure with repeated measures from longitudinal data. It has accurate Type I error, unbiased estimation of model parameters and robust power performance under a variety of situations. Compared to existing methods, our method is more powerful and can provide unbiased estimation of the correlation coefficients between multiple OTUs and repeated measures.

The MTLC model allows considerable flexibility in constructing the correlation matrix. It not only allows different correlation matrices for the logistic regression part and the linear regression part, but also puts no constraint on the range of each correlation coefficient, which may take any value from −1 to 1. In contrast, the random effect in a mixed effects model naturally leads to a positive correlation, because the same random effect is shared by the correlated samples. When the true correlations are negative, the mixed effects model (e.g., Chen et al. [13]) simply reduces to ordinary linear and logistic regression models with an independence assumption, which results in incorrect Type I errors, as we have shown in section “Inferences for predictor’s main effects”. In summary, the MTLC model provides a reliable analytical framework for longitudinal microbiome data analysis.
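The positivity of the correlation induced by a shared random effect follows from a short calculation. The block below uses a generic random-intercept model with assumed notation (cluster *i*, observation *j*, independent *b*_{i} and *e*_{ij}); it is a textbook illustration, not the specific model of Chen et al. [13].

```latex
\begin{align}
Y_{ij} &= \mu + b_i + e_{ij}, \qquad
  b_i \sim N(0, \sigma_b^2), \quad e_{ij} \sim N(0, \sigma_e^2), \\
\operatorname{Cov}(Y_{ij}, Y_{ij'}) &= \operatorname{Var}(b_i) = \sigma_b^2
  \qquad (j \neq j'), \\
\operatorname{Corr}(Y_{ij}, Y_{ij'}) &= \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2} \in [0, 1).
\end{align}
```

Because a variance cannot be negative, the implied correlation cannot be negative either. The closest such a model can get to a negative correlation is the boundary case with zero random-effect variance, which is the independence model.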

Our methodology for constructing the correlation matrix of the taxonomic structure imposes no constraint on the number of OTUs, denoted by *N*. Based on the computation times in our simulation and application studies, the MTLC model runs fast overall. However, when *N* is large (e.g., *N* > 1000), the correlation matrix has a high dimension, which may cause computational issues and make the MTLC model time consuming to fit. In such cases, we suggest reducing the dimension by selecting a subgroup of OTUs. For example, if the OTUs are from the same phylum but different classes, the MTLC model can be applied to each class separately, or only to the classes of interest, instead of to the whole phylum.

We have shown that the correlation estimation is consistent under the MTLC model, but the accuracy of the estimation is unclear. Yan [36] proposed standard error estimates for the correlation coefficients under the GEE approach. When the corresponding Pearson correlations are also available, we have found that the standard error under the GEE approach may depart from the standard error of the Pearson correlations. Because the underlying distribution of the correlation estimates is unknown, the standard error estimates lack theoretical justification. Further studies are required to estimate the standard errors of the correlation coefficients accurately under our MTLC model.

The MTLC model assumes that the −log_{10} transformed positive RAs follow a normal distribution. Clearly this is not the only approach to modelling RA data, and there is no universal answer for choosing the “best” approach. Liu et al. [38] gave an overview of modelling zero-inflated non-negative continuous data in general and proposed a few alternative distributions for the positive part of RAs. For example, the zero-inflated beta distribution is another commonly used approach [13, 39], because the beta distribution ranges from 0 to 1, exactly matching the range of RAs.

When *β*^{(0)} and *β*^{(+)} have opposite signs, the predictor’s effects are described as “dissonant”. In this scenario, the higher power of the two-part models in our simulation studies agrees with the existing literature [9, 40]. In the microbiome context, an example of this scenario is an antibiotic treatment that is effective in reducing the risk of carrying some specific bacteria, but may result in the growth of these bacteria once they survive, due to antibiotic resistance [41, 42].

For the proposed method, the dimension *p* of the predictors’ design matrix *X*_{k} is assumed to be less than the number of clusters *K*. For a high dimensional predictor space, e.g., gene expressions in a genome-wide association study, it is possible to encounter *p* ≥ *K*. In such cases regression models cannot be applied directly, and dimension reduction techniques are needed. Traditional approaches such as principal component analysis and penalized regression, including ridge regression and LASSO, as well as some machine learning based feature selection methods, could be incorporated into the proposed method to deal with high dimensional predictors. We plan to extend the proposed method to this high dimensional setting.

We have treated repeated longitudinal measures as a few discrete time points in our MTLC model. When there are more time points for each sample and the exact observation time for each sample is continuous, it is a natural extension of our current work to consider time as a continuous variable and OTU observations as a function of time. Further investigation of functional data analysis techniques can be explored and integrated with the OTU correlation structure developed in this paper.

## Supporting information

### S1 Data. Data for the real microbiome sequencing study in Application section.

https://doi.org/10.1371/journal.pcbi.1008108.s001

(XLS)

## Acknowledgments

The authors would like to thank Dr. Lillian L. Siu, Dr. Bryan Coburn, Dr. Pierre Schneeberger, Dr. Osvaldo Espin-Garcia and Dr. Jeffrey Rosenthal for helpful discussions and suggestions at different stages of our study.

## References

- 1. Kinross JM, von Roon AC, Holmes E, Darzi A, Nicholson JK. The human gut microbiome: implications for future health care. Current Gastroenterology Reports. 2008;10:396–403.
- 2. Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nature Reviews Genetics. 2012;13:260–270.
- 3. Gerber GK. The dynamic microbiome. FEBS Letters. 2014;588(22):4131–4139.
- 4. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
- 5. Kuczynski J, Lauber CL, Walters WA, Wegener L, Clemente PJC, et al. Experimental and analytical tools for studying the human microbiome. Nature Reviews Genetics. 2012;13:47–58.
- 6. Mandal S, Van Treuren W, White RA, Eggesbo M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease. 2015;26(1).
- 7. Friedman J, Alm EJ. Inferring Correlation Networks from Genomic Survey Data. PLoS Computational Biology. 2012;8(9):e1002687.
- 8. Weiss S, Treuren WV, Lozupone C, Faust K, Friedman J, et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. The ISME Journal. 2016;10:1669–1681.
- 9. Xu L, Turpin W, Paterson AD, Xu W. Assessment and Selection of Competing Models for Zero-Inflated Microbiome Data. PLoS ONE. 2015;10(7):e0129606.
- 10. Kaul A, Mandal S, Davidov O, Peddada SD. Analysis of Microbiome Data in the Presence of Excess Zeros. Frontiers in Microbiology. 2017;8:2014.
- 11. Su L, Tom BDM, Long DL, Yiu S, Farewell VT. Two-Part and Related Regression Models for Longitudinal Data. Annual Review of Statistics and Its Application. 2017;4(1):283–315.
- 12. Anthea M. Random Effects Modeling and the Zero-Inflated Poisson Distribution. Communications in Statistics—Theory and Methods. 2014;43(4):664–680.
- 13. Chen EZ, Li H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016;32(17):2611–2617.
- 14. Zhang X, Mallick H, Tang Z, Zhang L, Cui X, et al. Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics. 2017;18(4):1–10.
- 15. Zhang X, Pei YF, Zhang L, Guo B, Pendegraft AH, et al. Negative Binomial Mixed Models for Analyzing Longitudinal Microbiome Data. Frontiers in Microbiology. 2018;9:1683.
- 16. La Rosa PS, Brooks JP, Deych E, Boone EL, Edwards DJ, et al. Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE. 2012;7(12):e52078.
- 17. Chen J, Li H. Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics. 2013;7(1):418–442.
- 18. Tang ZZ, Chen G. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics. 2018;20(4):698–713.
- 19. Tang ZZ, Chen G, Alekseyenko AV, Li H. A general framework for association analysis of microbial communities on a taxonomic tree. Bioinformatics. 2017;33(9):1278–1285.
- 20. Tang ZZ, Chen G. Robust and Powerful Differential Composition Tests for Clustered Microbiome Data. Statistics in Biosciences. 2019.
- 21. Shi P, Li H. A Model for Paired-Multinomial Data and Its Application to Analysis of Data on a Taxonomic Tree. Biometrics. 2017;73(4):1266–1278.
- 22. Zhang Y, Han SW, Cox LM, Li H. A multivariate distance–based analytic framework for microbial interdependence association test in longitudinal study. Genetic Epidemiology. 2017;41(8):769–778.
- 23. Xu L, Peterson AD, Xu W. Bayesian latent variable models for hierarchical clustered count outcomes with repeated measures in microbiome studies. Genetic Epidemiology. 2017;41(3):221–232.
- 24. Zhan X, Xue L, Zheng H, Plantinga A, Wu MC, et al. A small–sample kernel association test for correlated data with application to microbiome association studies. Genetic Epidemiology. 2018;42(8):772–782.
- 25. Koh H, Li Y, Zhan X, Chen J, Zhao N. A Distance–Based Kernel Association Test Based on the Generalized Linear Mixed Model for Correlated Microbiome Studies. Frontiers in Microbiology. 2018;10:458.
- 26. Grantham NS, Guan Y, Reich BJ, Borer ET, Gross K. MIMIX: a Bayesian Mixed–Effects Model for Microbiome Data from Designed Experiments. Journal of the American Statistical Association: Application and Case Studies. 2019;0(0):1–11.
- 27. Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika. 1986;73(1):13–22.
- 28. Kelly BJ, Imai I, Bittinger K, Laughlin A, Fuchs BD, et al. Composition and dynamics of the respiratory tract microbiome in intubated patients. BMC Microbiome. 2016;4(7).
- 29. Seekatz AM, Rao K, Santhosh K, Young VB. Dynamics of the fecal microbiome in patients with recurrent and nonrecurrent Clostridium difficile infection. BMC Genome Medicine. 2016;8(47).
- 30. Ballinger GA. Using Generalized Estimating Equations for Longitudinal Data Analysis. Organizational Research Methods. 2004;7(2):127–150.
- 31. Shults J, Ratcliffe SJ. Analysis of multi-level correlated data in the framework of generalized estimating equations via xtmultcorr procedures in Stata and qls functions in Matlab. Statistics and Its Interface. 2009;2(2):187–196.
- 32. Lee AH, Xiang L, Hirayama F. Modeling Physical Activity Outcomes: A Two-part Generalized-estimating-equations Approach. Epidemiology. 2010;21(5):626–630.
- 33. Wang M. Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments. Advances in Statistics. 2014;2014(303728):1–11.
- 34. Zadlo T. On longitudinal moving average model for prediction of subpopulation total. Statistical Papers. 2015;56(3):749–771.
- 35. Liu Y, Xie J. Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures. Journal of the American Statistical Association. 2020;115(529):393–402.
- 36. Yan J. The R Package geepack for Generalized Estimating Equations. Journal of Statistical Software. 2006;15(2).
- 37. Turnbaugh PJ, Hamady M, Yatsunenko T, Canterel BL, Duncan A, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–484.
- 38. Liu L, Shih YCT, Strawderman RL, Zhang D, Johnson BA, et al. Statistical Analysis of Zero-Inflated Nonnegative Continuous Data: A Review. Statistical Science. 2019;34(2):253–279.
- 39. Chai H, Jiang H, Lin L, Liu L. A marginalized two-part Beta regression model for microbiome compositional data. PLoS Computational Biology. 2018;14(7):e1006329.
- 40. Lachenbruch PA. Comparisons of two-part models with competitors. Statistics in Medicine. 2001;20:1215–1234.
- 41. Costelloe C, Metcalfe C, Lovering A, Mant D, Hay AD. Effect of antibiotic prescribing in primary care on antimicrobial resistance in individual patients: systematic review and meta-analysis. British Medical Journal. 2010;340(7756):1120.
- 42. Munita JM, Arias CA. Mechanisms of Antibiotic Resistance. Microbiology Spectrum. 2016;4(2).