Generalized estimating equation modeling on correlated microbiome sequencing data with longitudinal measures

Existing models for assessing microbiome sequencing such as operational taxonomic units (OTUs) can only test predictors’ effects on OTUs. There is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationship into models to evaluate longitudinal OTU measures. We propose a novel approach to estimate OTU correlations based on their taxonomic structure, and apply such correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors’ effects and OTU correlations. We develop a two-part Microbiome Taxonomic Longitudinal Correlation (MTLC) model for multivariate zero-inflated OTU outcomes based on the GEE framework. In addition, longitudinal and other types of repeated OTU measures are integrated in the MTLC model. Extensive simulations have been conducted to evaluate the performance of the MTLC method. Compared with the existing methods, the MTLC method shows robust and consistent estimation, and improved statistical power for testing predictors’ effects. Lastly we demonstrate our proposed method by implementing it into a real human microbiome study to evaluate the obesity on twins.


Author summary
Human microbiome sequencing data analysis has been a fast growing area of genomic research in recent years. Although there have been several works for detecting predictors on a single operational taxonomic unit (OTU) or multiple OTUs simultaneously, there is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationship into models to evaluate longitudinal OTU measures. Here we propose a novel approach to estimate OTU correlations based on their taxonomic structure after integrating longitudinal and other types of repeated OTU measures, and apply such correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors' effects and OTU correlations. The method is theoretically sound and practically easy to implement, and we provide corroborating evidence from simulation and a real human microbiome study. This is a PLOS Computational Biology Methods paper.

Introduction
Human microbiome sequencing data analysis has been a fast-growing area of genomic research in recent years. Several studies showed that the microbial composition is associated with environmental and host factors [1][2][3]. The microbiome data are usually characterized by 16S ribosomal ribonucleic acid (rRNA) gene sequencing or shotgun metagenomics sequencing [4,5]. Both sequencing technologies provide reads of bacteria counts clustered into operational taxonomic units (OTUs), where each OTU is typically mapped to a taxon at level species, genus, family, order, class, phylum, kingdom or domain in a taxonomic structure. For each sample, OTU counts can be converted to relative abundances (RAs). No matter the OTU data is in format of counts or RAs, there are a few analytical challenges which prevent the application of standard regression methods on association study between microbial composition and the environmental or genetic factors. First, the OTU data usually contains excessive zeros, which prevents modelling the OTU data by using standard types of distributions. Next, for each individual, there may exist repeated measures of OTUs, such as microbiome samples collected from different locations of human body, or multiple observations at different time points in longitudinal setting. Furthermore, the sequencing method usually detects hundreds or thousands of OTUs, which are potentially correlated with each other [6]. Identifying correlations between taxa is a common goal in genomic survey [7]. An accurate estimated correlation can be used to determine drivers in environmental ecology or contribution to habitat niches or disease; it is also a powerful tool to help researchers with hypothesis generation, such as determining which interactions might be biologically relevant in their system, and should be given further study [8]. So instead of considering each OTU as independent, it is desirable to incorporate the taxonomic information into the analysis, which reflects the correlation structure between the OTUs.
Several solutions have been proposed to answer each of these challenges. Zero-inflated microbiome data can be fitted by either zero-inflated models or two-part models [9,10]. Repeated measures can be characterized by random effects in mixed effects models [11][12][13][14][15]. Modelling multiple OTUs together remains a challenging problem, although several attempts have been made. La Rosa et al. [16] and Chen et al. [17] proposed an approach which assumes that multiple OTUs follow Dirichlet multinomial (DM) distribution. However, the DM assumption imposes a negative correlation among OTUs where the true correlation can be both positive and negative. In addition, it has a fixed covariance structure which cannot flexibly handle various dispersion patterns. Tang et al. [18] proposed zero-inflated generalized Dirichlet multinomial distribution which allows for a more general covariance structure and excessive zeros in OTU counts. To further eliminate the negative correlation assumption, they also proposed distribution-free non-parametric tests [19,20], which are robust to any correlation structures within a cluster of taxa. However, parameter estimates of covariate effects and correlation coefficients were not available due to the non-parametric essence. Alternatively, Shi et al. [21] proposed a model for Paired-Multinomial Data which works for a pair of repeated measures or a pair of correlated OTUs. Zhang at al. [22] considered estimating pairwise correlations between OTUs. Xu et al. [23] used latent variables to account for the correlation of multiple OTUs. Zhan et al. and Koh at al. [24,25] adopted correlated sequence kernel association test assuming a random effect for each OTU, and Grantham et al. [26] used Bayesian factor analysis to cluster correlated OTUs into different factors. However, none of these approaches can model the taxonomic relationship between OTUs and provide estimations for complex correlation structure.
In order to estimate and test the association between the predictors and OTUs as well as simultaneously estimating the correlation parameters between OTUs, we propose a generalized estimating equation (GEE) [27] approach which can handle multiple correlated OTUs with repeated measures. Applying GEE model to either microbiome data [28,29] or repeated measures such as longitudinal zero-inflated data [30][31][32] is not new. The novel part of our method is to develop and construct correlation structures which can truly represent the taxonomic correlations and time dependency of longitudinal OTU measures. First, we develop a correlation structure of multiple OTUs solely depending on their taxonomic structure, so that the correlation structure can provide meaningful estimates of OTU correlations. Not like the multinomial models which assume negative correlations, the correlation of OTUs in the proposed model can be both positive and negative. In addition, we incorporate the taxonomic structure with correlations due to repeated measures, and all correlations of repeated measures can be explicitly estimated.
We organize this paper as following. In Methodology section, the detailed methodology framework is introduced including the zero-inflated GEE models, the construction of correlation structure on multiple OTUs with repeated measures, parameter estimation and hypothesis testing under the Microbiome Taxonomic Longitudinal Correlation (MTLC) model. Extensive simulation studies for comparing the performance of the proposed approach to other models are presented in Simulation section. In Application section, the proposed model is applied into a real microbiome sequencing study. The conclusion and further improvements of our method are discussed in Discussion section.

Taxonomic structure of OTUs
Numerical representation of taxonomic structure. For known taxonomic structure of N OTUs, we consider its numeric representation, i.e., representing the structure by a list of numerical vectors. Throughout this paper, we call taxonomic levels from species to domain from lowest to highest. First, we find the taxonomic level at which all observed N OTUs belong to the same taxon but not at one level lower, and define such level as level 1. For example, if all OTUs belong to the same class but not the same order, then the level class would be level 1. Similarly, we can identify the taxonomic level at which each OTU represents a different taxon but not at one level higher, and define such level as level I. For example, if each OTU belongs to a different genus but not a different family, then the level genus would be level I.
It is easy to check that for i = 1, . . ., I, Then the taxonomic structure can be numerically represented by (n 1 , . . ., n I ).
In the illustrative taxonomic structure example from Correlation matrix of taxonomic structure. Following the taxonomic structure, it is natural to assume that OTUs belonging to same taxa at higher levels may have some correlation. Because all OTUs belong to the same taxa at the highest taxonomic level (e.g., Bacteria intuition is to reduce the number of parameters by making some reasonable assumptions such that many of the correlations are equal, according to the known taxonomic structure. The basic assumption we made is that for a cluster of OTUs, if each OTU represents a different taxon at level i + 1 but they all belong to the same taxon at level i, then all pairwise correlations of OTUs within this cluster should be equal. Under this assumption, there is only one correlation parameter in the simple case when I = 2. When I > 2, there are more than two levels in the OTU taxonomic structure, in which case the pairwise correlation coefficients for different pairs of OTUs may be equal or unequal, depending on the taxa which the OTUs belong to at each level. For a pair of OTUs, if they belong to different taxa at level i + 1 but the same taxa at level i, we call the taxon at level i as its first common taxon. For any two pairs of OTUs. A natural extension of our basic assumption is that two pairs of OTUs are assumed to have same correlation if and only if the first common taxa of both pairs are identical. Formally, let P � and P y be two pairs of OTUs, which have correlation ρ � and ρ † . t m � i ;i � is the first common taxon of P � , and t m y i ;i y is the first common taxon of P y . Then we assume For all N OTUs, we define a taxonomic structure matrix to indicate which correlations are equal and which are not. The taxonomic structure matrix is an N × N symmetric matrix, where all diagonal entries are denoted by D, and off-diagonal entries are indexed by uppercase Roman numbers, i.e., I; II; III (see Fig 1). Each different index value represents a different correlation, and equal index value indicates the corresponding correlations are estimated by the same coefficient. We use Roman numbers to avoid any confusion with other Arabic numerals used elsewhere throughout our work, because these indices are categorical numbers which do not indicate any quantity. The values of off-diagonal entries are determined by the following steps: 1. For i = 1, . . ., I − 1, Let Γ i be an N × N block diagonal matrix, 4. Sort all off-diagonal entries in Γ (I−1) from largest to smallest, where the smallest value corresponds to smallest order (order 1). Replace all off-diagonal entries by their corresponding orders in uppercase Roman numbers and define the new matrix as Γ. Γ is the taxonomic structure matrix which is numerically represented by (n 1 , . . ., n I ).
In the above example of 6 hypothetical OTUs in Applying step 2 and 3 to achieve Applying step 4 and the final taxonomic structure matrix Γ is In taxonomic structure matrix Γ, the index values are illustrated in Fig 1: index I indicates correlation of OTUs belonging to the same class but different orders; index II indicates correlation of OTUs belonging to the same order but different families; index III and IV indicate correlations of OTUs belonging to the same family. There are several different ways to characterize the correlations between each pair of time points, such as exchangeable, Toeplitz and unstructured. Exchangeable structure assumes all correlations are equal to each other. Toeplitz structure assumes time points with equal temporal distance have equal correlation. Unstructured model assumes each pair has different correlations and it is the most complicated structure in terms of correlation parameter estimation. Besides that, other correlation structures such as autoregressive, moving averages are also used for longitudinal data analysis [33,34]. In this paper, we assume the correlation structure within the same individual is pre-specified. The correlation structure matrix within same individual following a given correlation structure is denoted by O T . The diagonal entries are denoted by

Modelling correlations from repeated measures
Alternatively, O T assuming Toeplitz structure is Sample correlation. In addition to time correlation, there may exist other types of sample correlations, such as two or more individuals from the same pedigree, or simply any repeated measures from the same individual. Without loss of generality we assume there are two repeated samples S 1 and S 2 . Then sampling correlation is represented by correlation structure matrix O S :

Incorporating taxonomic structure with repeated measures
Suppose O has dimension L. For a = 1, . . ., N and b = 1, . . ., N, O(Γ ab ) is an L × L correlation matrix as a function of Γ ab , such that : Γ ‥ and O ‥ are entries of Γ and O from corresponding rows and columns. We denote O(Γ ab ) as O ab for notation simplicity.
To integrate repeated measures correlation structure O with taxonomic structure Γ, we introduce the integrative correlation matrix and each of its entry has the form r ðG ‥ ;O ‥ Þ . The first subscript, Γ ‥ , is either D or an uppercase Roman number indexing taxonomic structure correlation; the second subscript, O ‥ , is either D or a lowercase Roman number indexing correlation from repeated measures of single OTU. In the above example, The diagonal entries of R, r ðD;DÞ always equal to 1, and the off-diagonal entries are estimated in the next section.

Microbiome Taxonomic Longitudinal Correlation (MTLC) model
After specifying the correlation matrix within one cluster of OTUs with repeated measures, in this section, we introduce how to model the association between multiple OTUs and their predictors of interest. We propose a Microbiome Taxonomic Longitudinal Correlation (MTLC) model to estimate predictor effects, correlation coefficients between OTUs, longitudinal measures and other repeated measures. We also perform a hypothesis testing of the predictor effects based on MTLC model. The estimates and tests are achieved by Generalized Estimating Equations (GEE) framework. Generalized estimating equation framework. Let y k 's be independent clusters for k = 1, . . .K, and each cluster y k ¼ ðy k1 ; . . . y kJ k Þ has length J k . For j = 1, . . .J k , let x kj denote the vector of covariates with length p, and μ k ¼ ðm k1 ; . . . ; m kJ k Þ is the mean of y k . Then for each observation y kj , where g is a known link function and β are the regression parameters of the p covariates x kj . The conditional variance of y kj is defined as Var(y kj |x kj ) = ν(μ kj )ϕ, where ν is the variance function depending on the distribution of y kj , and ϕ is the dispersion parameter being σ 2 for normally distributed y kj and 1 for other distributions belonging to exponential family. For estimating β, the following generalized estimating equation is solved: . . . m kJ k �Þ, and R k (ρ) is the working correlation matrix following the correlation structure R constructed in section "Incorporating taxonomic structure with repeated measures", where ρ is the collection of all correlation coefficients in R k . Clearlyβ depends on ρ and ϕ, which also needs to be estimated.
If we define the Pearson residual e kj ¼ ðy kj À m kj Þ= ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Next,ρ is estimated as a function of ϕ and e kj . The exact formula ofρ depends on the correlation structure R, and a few examples ofρ under different structures are given in Liang et al [27] and Wang [33]. Because the Pearson residuals e kj 's also depend onβ, it yields an iterative scheme which switches between estimating β from fixed value of� andρ and estimating ϕ and ρ for a fixed value ofβ. Under GEE theory [27], this scheme yields a consistent estimate for β. Moreoverβ is asymptotically normally distributed with mean β and variance where Cov(y k ) is the true underlying covariance matrix of y k . The consistent estimator of V β , V β , is achieved by replacingβ,ρ,� and fy k À μ k ðβÞgfy k À μ k ðβÞg 0 for β, ρ, ϕ and Cov(y k ). GEE method yields consistent estimator of β, even if the structure of working correlation matrix is not correctly specified. The misspecified R k (ρ) only affects the efficiency ofβ. The consistent estimation of correlation matrix R k ðρÞ, however, relies on correct specification of the correlation structure.
For testing a hypothesis of H 0 : Cβ = c, a Wald test statistic can be used with the form and W! d w 2 ðqÞ , where q is the rank of matrix C. Estimating predictors effects on OTUs. Based on the GEE framework, we develop the MTLC model to assess the association between OTUs and the predictors of interest, accounting for the correlation of repeated OTU measures. To deal with the excess zeros of OTUs using MTLC model, first we convert quantitative OTU observations to binary outcomes (0 and 1), indicating the prevalence of OTU in each observation. Next, we focus on the OTU relative abundance (RA) of each non-zero observation, and assume the RAs following normal distribution after log transformation. We use two separate GEE models, one for assessing the predictor effects on OTU prevalence, and the other for assessing the predictor effects on positive RA. The predictors' overall effects are finally tested by combining the test statistics from these two GEE models.
Formally, for k = 1, . . .K and j = 1, . . ., J k , we assume each OTU observation y kj follows a mixture of Bernoulli and log-normal distribution: suppose y ð0Þ kj follows a Bernoulli distribution with Pðy ð0Þ kj ¼ 1Þ ¼ m ð0Þ kj , and y ðþÞ kj follows a normal distribution such that y ðþÞ kj � Nðm ðþÞ kj ; s 2 Þ, then the distribution function of y kj is and y ðþÞ kj represents the positive RAs because log 10 y kj ¼ y ðþÞ kj for all y kj > 0. We use y ð0Þ k to denote the vector of all y ð0Þ kj , and y ðþÞ k to denote the the subset of y ðþÞ kj where y kj > 0. Rather than running generalized linear model directly on y k , we apply GEE method separately on y ð0Þ k and y ðþÞ k . For these two GEE models, the predictors' design matrices X k do not have to be the same in principal, although they could be the same in many practical situations. Without loss of generality we simply assume the predictors are same in each part of the GEE model in this paper. We choose logit link function for binary outcomes and identity link function for log transformed non-zero outcomes, and the two parts of the GEE model are It follows section "Generalized estimating equation framework" that W ð0Þ ! d w 2 ðq ð0Þ Þ and W ðþÞ ! d w 2 ðq ðþÞ Þ . Besides, for jointly testing two null hypotheses by the combined test on W (0) and W (+) , we adopt Cauchy combination test [35], which does not require the independence assumption between W (0) and W (+) . Let p (0) and p (+) be the corresponding p-values, then the Cauchy combination test statistic is Estimating correlation coefficients. In our proposed MTLC model, the correlation structure is based on OTU taxonomic structure and characterizing correlations between repeated measures. Here we assume the two GEE models corresponding to the OTU prevalence part and positive RA part have the same correlation structure R. However, the estimated values of correlation coefficients,ρ ð0Þ andρ ðþÞ , may be different for each part of the GEE model. For y ð0Þ k and y ðþÞ k ,ρ ð0Þ andρ ðþÞ are estimated separately following the iterative scheme discussed in section "Generalized estimating equation framework".
It needs to be noted that GEE models do not require each cluster has equal cluster size, which could happen, for example, in unbalanced study designs and/or when some observations are missing. Even if y ð0Þ k has equal size for all k, y ðþÞ k may have different sizes as it is a collection of only positive RAs. It implies that the dimension of R may be greater than the length of y ð0Þ k and y ðþÞ k for some k. In such case, the rows and columns in R corresponding to empty values of OTU observations need to be removed, and we denote the modified correlation structure matrices by R ð0Þ k ðrÞ and R ðþÞ k ðρÞ correspondingly for each k. When applying the estimating equations in our MTLC model, we essentially use R ð0Þ k ðρÞ and R ðþÞ k ðρÞ as the working correlation matrices.

Simulation settings
Simulation studies are designed to simulate zero inflated multivariate normal distribution to reflect the correlation of −log 10 transformed OTUs. To achieve this, we simulate both multivariate Bernoulli distribution samples Y (0) and truncated multivariate normal distribution samples Z of size K and length J. Multivariate normal distributions are truncated to generate positive samples because all −log 10 transformed RAs should be positive. We further assume a single binary predictor X, where X also has dimension K × J, and the mean of Y (0) and Z depend on X. Specifically, we simulate Y ð0Þ � Bernoulli J ð expðXβ ð0Þ Þ 1þexpðXβ ð0Þ Þ Þ, and Z � N J (Xβ (+) , R) truncated at 0. The zero-inflated multivariate normal distribution samples are computed as Y = Y (0) Z. Y is indirectly associated with X via Y (0) and Z.
For illustration purpose, we assume the simplest correlation structure, i.e., two correlated OTUs under taxonomic structure and two repeated measures at different time points). The correlation matrix R is then derived following section "Incorporating taxonomic structure with repeated measures": R ¼ r ðD;DÞ r ðD;iÞ r ðI;DÞ r ðI;iÞ r ðD;iÞ r ðD;DÞ r ðI;iÞ r ðI;DÞ r ðI;DÞ r ðI;iÞ r ðD;DÞ r ðD;iÞ r ðI;iÞ r ðI;DÞ r ðD;iÞ r ðD;DÞ : r ðD;DÞ ¼ 1, r ðD;iÞ and r ðI;DÞ denote the correlation between two time points and between two OTUs. r ðI;iÞ represents the correlation of observations from different OTU and different time points, which is not of primary interest. We assume the simulated multivariate Bernoulli and multivariate normal distribution follow the same correlation structure R, but the correlation coefficientsρ ð0Þ andρ ðþÞ can be different.
After achieving the zero-inflated multivariate normal distribution samples Y, we run a GEE logistic model following Eq 5 to estimate the effects of X to OTU prevalence Y (0) , and GEE linear model following Eq 6 to estimate X effects to the non-zero RAs Y (+) , where Y (+) is the subset of Z such that y ðþÞ kj ¼ z kj jðy ð0Þ kj ¼ 1Þ. Under GEE theory, both Y (0) and Y (+) yield consistent estimations of β and ρ. However, we simulate Z rather than Y (+) , where Z and Y (+) may not yield same estimations in general. To solve this issue, we simulate Z and Y (0) independently, which implies that y ðþÞ kj has the same distribution as z kj . Therefore, Z also yields consistent estimations of β and ρ.
Different from some literature that Y is directly simulated, we conducted our stimulation on Y (0) and Z separately. This is because following the mixture distribution framework, we conduct two separate GEE models on Y (0) and Y (+) rather than one model directly on Y. In this way, we can clearly specify the true values of predictor's main effects and OTU correlations in simulation settings, and evaluate if the estimations of these values are unbiased explicitly. As a sensitivity analysis to evaluate the robustness of our model performance, we also simulate Y (0) and Z from (generalized) linear mixed model. Results are presented in S1 Appendix.

Inferences for predictor's main effects
First, we evaluate the performance of our proposed MTLC model for estimating and testing the main effects or the predictor X. Let β (0) denote the effects on OTU prevalence and β (+) denote the effects on the log 10 transformed none-zero RA. We evaluate the unbiasedness of estimatedb ð0Þ ,b ðþÞ , Type I error for testing β (0) = β (+) = 0 and test power when β (0) and/or β (+) 6 ¼ 0. OTU observations are simulated under the simulation settings discussed in section "Simulation settings" with sample size K = 1000 and various combinations of β (0) and β (+) values. We assume r ðD;iÞ ¼ r ðI;DÞ ¼ 0:3 and r ðI;iÞ ¼ 0 for both the multivariate normal and multivariate Bernoulli distribution. β, Type I errors and powers are estimated based on 1000 replications. The computation time is about 4 hours to complete all 1000 replications on a desktop computer with quad-core processor and 8GB of RAM.
Next we compare our MTLC model to other models. All models are described in Table 1.
For each model, the estimatedb ð0Þ ,b ðþÞ , Type I error and power are summarized in Table 2. We find all estimates of β (0) and β (+) are unbiased under MTLC model. For the one-part models, because there is no true value of β as a mixture of β (0) and β (+) , the unbiasedness of estimated β cannot be evaluated. Regarding the variations of estimatedb, the 2.5 and 97.5 percentile of the empirical distributions ofb are shown in S1 Appendix.
Given the true Type I error at 0.05, 2P_ind and 1P_ind model have inflated Type I error, and all other estimated Type I errors are accurate. It needs to be noted that when only one of β (0) and β (+) equal to 0, the Type I error estimation is still accurate. For example, when (β (0) , β (+) ) = (0, 0.05), the GEE (0) model for testing β (0) = 0 has Type I error 0.062, which is not affected by the non-zero value of β (+) . It further confirms the independence of the linear and logistic regression parts in the two-part model. We also evaluate the power performance of different models. The power of 2P_ind and 1P_ind model are inflated due to Type I error inflation. Our proposed MTLC model is most powerful in general. When one of β (0) and β (+) is 0, the MTLC model is slightly less powerful than one of GEE (0) and GEE (+) model which only tests the part that β 6 ¼ 0. However, when both β (0) and β (+) are non-zero, the MTLC model is much more powerful than both GEE (0) and GEE (+) model. The 1P_GEE model and 1P_RE model have similar powers. It needs to be noted that the 1P_RE model is not able to accommodate negative correlations due to the natural or random effects. This is the reason that we choose ρ 01 and ρ 10 to be positive in the simulation settings. When the true correlations are negative, the 1P_RE model simply reduces to 1P_ind model. Comparing to the MTLC model, the power of the one-part models drops dramatically when β (0) and β (+) have opposite sign. This is because the positive effect cancels out the negative effects in one-part models, but both effects are well captured in two-part models. When β (0) and β (+) have same direction, we do observe some cases that the power of one-part models are larger. This is related to how to deal with the excess zeros in the one-part models. Detailed discussion about this issue is provided in section "Two-part vs. one-part models".
The correlation structure of OTUs is based on the taxonomic structure, which is usually known in practice. However, the correlation structure of repeated measures within each OTU may not be known and usually requires subjective assumptions. One merit of GEE model is that even if the assumption of correlation structure is not correct, it does not affect the estimation of main effect β. Theb estimations are consistent under different assumptions of correlation structure, as illustrated by Yan [36] and confirmed by our simulation study (results not shown). Besides that, we evaluate the consistency of correlation estimations under wrong correlative structure setting.
In contrast to the correct correlation structure R, we first construct a model with a correlation matrix assuming that OTUs are independent while time points are still correlated. After that, we construct another model with correlation matrix assuming that time points are independent while OTUs are still correlated. When OTUs are assumed to be independent, the GEE model may only estimate r ðD;iÞ ; when time points are independent, the GEE model may only estimate r ðI;DÞ . The correlation estimations are summarized in Table 3.
From Table 3, the correlation estimates under true correlation structure are all unbiased. When the correlation structure is not correctly specified, it may not estimate all correlation coefficients for the correct correlation structure, but more interestingly, for those correlation coefficients which can be estimated under the misspecified structure, the estimation remains to be unbiased. It implies that if we are not interested in estimating all correlations in the correct correlation structure, we can simplify the correlation structure. For example, because the estimation of r ðI;iÞ is not of interest, we can set it to 0 without affecting the estimation of r ðD;iÞ and r ðI;DÞ . The correlation structure only contains two OTUs and two time points, so the GEE correlation estimates are essentially pairwise correlations, and thus they can be compared with corresponding Pearson correlation coefficients. Both results are consistent as expected. The merit of our MTLC model is that when the correlation structure is more complicated and the pairwise Pearson correlation is not available, it may still provide unbiased estimation of the correlation matrix.

Two-part vs. one-part models
For one-part models, if we take −log 10 transformation of both the non-zero RAs and 0, then all 0 becomes 1. To solve this issue, one common approach is to change all 0 to some small value close to 0, such as 10 −5 . However, we find the one-part model test powers are sensitive to this arbitrary small value. In Table 4, we replace −log 10 0 by 6, 5 4 and 3 and compare corresponding test powers with the MTLC model. We only present the 1P_GEE model as we have shown in Table 2 that the 1P_RE model has similar power to 1P_GEE. Table 4 indicates that there is no optimal choice of the value for replacing 0 RAs. For each value selected, depending on (β (0) , β (+) ), there may exist some situations such that the onepart model has comparable power or even slightly better power than corresponding two-part model (e.g., 0.650 vs. 0.609 when (β (0) , β (+) ) = (0.1, 0) and replacing 0 by 10 −6 ), but the power loss is much more significant for some other values of β (e.g., 0.138 vs. 0.421 when (β (0) , β (+) ) = (0, 0.05) and replacing 0 by 10 −6 ). We conclude that our MTLC models has superior and robust power performance compared to the one-part models, and suggest readers avoid using the one-part models in practice when there are excessive numbers of 0s in OTU data.

Application
We implement our proposed MTLC model on a twin study described in Turnbaugh et al. [37]. The full dataset is provided in the supporting information S1 Data. The data consists of 54 families and each family has a pair of twins. Each individual has at most two observations at two time points. The primary research question is to assess the association between obesity status (lean, overweight or obese) and OTUs, and estimate the correlations between two time  points, each pair of twins and OTUs. For illustration purpose, we only analyze OTUs within the order Clostridiales, which consists of 9 OTUs at genus level. The taxonomic structure of these 9 OTUs are shown in Fig 3.  From Fig 3, all 9 OTUs begin to belong to the same taxa (Clostridiales) at level order, and each of the 9 OTUs belongs to a different taxon at level genus. We define level order as level 1, level family as level 2 and level genus as level 3, thus I = 3. Accordingly, the numerical representation of the taxonomic structure is n 1 = 9, n 2 = (4, 1, 4), n 3 = (1, 1, 1, 1, 1, 1, 1, 1, 1).
Next, following the 4 steps described in section "Taxonomic structure of OTUs", the taxonomic structure matrix is Γ ¼   D  I  I  I III III III III III  I  D  I  I III III III III III   I  I  D  I III III III III III  I  I  I  D III III III III III  III III III III D III III III III   III III III III III D II II II  III III III III III II D II II   III III III III III II II D II  III III III III III II II II Because each OTU is observed at two time points for a pair of twins, the repeated measure correlation structure following section "Modelling correlations from repeated measures" is The dimension of Γ and O are N = 9 and L = 4, so as described in section "Incorporating taxonomic structure with repeated measures", the integrative correlation matrix R has The integrative correlation matrix is then . . .
To apply the proposed MTLC model, all OTU observations are summarized as Y. X is the single binary predictor denoting obesity status (lean vs. obese/overweight). Both Y and X have dimension K × J where K = 54 and J = 36. Some pedigrees only consist one individual instead a pair of twins, and OTUs are observed at one instead of two time points for some individuals, hence missing values exist in the matrix Y. Next, Y is separated as Y (0) and Y (+) representing OTU prevalences and positive RAs. We assume each y ð0Þ kj follows Bernoulli distribution with mean m ð0Þ kj and y ðþÞ kj follows log normal distribution with mean m ðþÞ kj . Then under MTLC model, Y and X have the following relationship: α (0) and α (+) are intercept parameters which are not our primary interest. Our goal is to estimate the effects of obesity status β (0) and β (+) , and test H 0 : β (0) = β (+) = 0. β (0) and β (+) are estimated separately under Eq 2, and H 0 is tested by the combined test statistic W MTLC following Eq 7.
We summarize the estimates of obesity effects for predicting OTUs and corresponding pvalues for testing H 0 in Table 5. We compare the MTLC model with the other models listed in Table 1. Using our MTLC model, obesity has shown significant overall association with these OTUs. Specially, it has shown significant association with the prevalence of OTUs, but no significant association with the non-zero RAs. All other models do not detect the overall significance. The computation time is less than 30 seconds for the twin study dataset.
Correlation estimates are presented in Table 6. r ðD;iÞ and r ðD;iiÞ are correlation between the two time points and correlation between the two twins. r ðI;DÞ , r ðII;DÞ and r ðIII;DÞ are OTU correlations, representing correlation from different family but within the same order Clostridiales, and correlation within the same family Lachnospiraceae or Ruminococcaceae. When Pearson correlations are available (r ðD;iÞ and r ðD;iiÞ ), they are quite consistent with the correlation estimates under GEE models. However, Pearson correlation is not available for OTU correlations due to the complicated taxonomic structure, and only our proposed MTLC model can estimate these correlations.

Discussion
In this paper, we develop and implement a novel approach to model the correlations of OTUs based on the biological taxonomic structure. The proposed MTLC model can incorporate the taxonomic structure with repeated measures from longitudinal data. It has accurate Type I error, unbiased estimation of model parameters and robust power performance under a variety of situations. Compared to existing methods, our method is more powerful and can provide unbiased estimation of the correlation coefficients between multiple OTUs and repeated measures.
The MTLC model allows for sufficient flexibility of the correlation matrix construction. It not only allows different correlation matrices for the logistic regression part and linear regression part, but also put no constraint on the range of each correlation coefficient, i.e., any positive or negative value from -1 to 1. In contrast, the random effect in mixed effect model naturally leads to a positive correlation, because the same random effect adds to a few correlated samples. When the true correlations are negative, the mixed effects model (e.g., Chen et al. [13]) is simply reduced to ordinary linear and logistic regression model with independence assumption, which results in incorrect Type I errors as we have shown in section "Inferences for predictor's main effects". In summary, the MTLC model provides a reliable analytical framework for longitudinal microbiome data analysis.
Our methodology for constructing correlation matrix of taxonomic structure imposes no constraints to the number of OTUs, which is denoted by N. Based on the computation time shown in our simulation and application study, we find the MTLC model runs fast overall. However, when N is large, (e.g., N > 1000), the correlation matrix has a high dimension, and it may cause computational issues and become time consuming to implement the MTLC model. In such case, we suggest a dimension reduction by selecting a subgroup of OTUs. For example, if OTUs are from the same phylum but different classes. Our MTLC model can be implemented on each class separately or focus on the classes of interest, instead on the whole phylum.
We have shown that the correlation estimation is consistent under MTLC model, but the estimation accuracy is not clear. Yan [36] proposed standard error estimations of the correlation coefficients under GEE approach. When corresponding Pearson correlations are also available, we have found the standard error under GEE approach may depart from the standard error of Pearson correlations. Because the underlying distribution of the correlation estimates is unknown, it lacks theoretical justifications of the standard error estimates. Further studies are required for estimating the accurate standard errors of correlation coefficients under our MTLC model.
The MTLC model assumes −log 10 transformed positive RAs following normal distribution. Clearly this is not the only approach to modelling the RA data, and there is no universal answer for choosing the "best" approach. Liu et al. [38] gave an overview for modelling zeroinflated non-negative continous data in general and proposed a few alternative distributions for the positive part of RAs. For example, zero-inflated beta distribution is another commonly used approach [13,39], because beta distribution has range from 0 to 1 exactly matching the range of RAs.
When β (0) and β (+) have opposite signs, the predictor's effects are described as "dissonant". Under this scenario, the two-part models showing more powerful results in the simulation studies coincides with existing literature [9,40]. In microbiome context, an example of this scenario is that, an antibiotic treatment may be effective in reducing the risk of carrying some specific bacteria, but may result in the growth of these bacteria once they survive due to antibiotic resistance [41,42].
For the proposed method, the dimension of predictors' design matrix X k , p, is assumed to be less than the number of clusters K. For high dimensional predictor space, e,g., gene expressions in genome-wide association study, it is possible to encounter the situation of p � K. In such cases regression models cannot be directly applied, and dimension reduction techniques need to be used. Traditional approaches such as principal component analysis and penalized regression including ridge regression and LASSO, as well as some machine learning based feature selection methods can be considered to be incorporated into the proposed method to deal with high dimensional predictors. We are planning to extend the proposed method to deal with such high dimensional predictors situation.
We have treated repeated longitudinal measures as a few discrete time points in our MTLC model. When there are more time points for each sample and the exact observation time for each sample is continuous, it is a natural extension of our current work to consider time as a continuous variable and OTU observations as a function of time. Further investigation of functional data analysis techniques can be explored and integrated with the OTU correlation structure developed in this paper.
Supporting information S1 Data. Data for the real microbiome sequencing study in Application section. (XLS) S1 Appendix. Additional simulation results. (PDF)