A marginalized two-part Beta regression model for microbiome compositional data

Haitao Chai; Hongmei Jiang; Lu Lin; Lei Liu

doi:10.1371/journal.pcbi.1006329

Abstract

In microbiome studies, an important goal is to detect differential abundance of microbes across clinical conditions and treatment options. However, the microbiome compositional data (quantified by relative abundance) are highly skewed, bounded in [0, 1), and often have many zeros. A two-part model is commonly used to separate zeros and positive values explicitly by two submodels: a logistic model for the probability of a species being present in Part I, and a Beta regression model for the relative abundance conditional on the presence of the species in Part II. However, the regression coefficients in Part II cannot provide a marginal (unconditional) interpretation of covariate effects on the microbial abundance, which is of great interest in many applications. In this paper, we propose a marginalized two-part Beta regression model which captures the zero-inflation and skewness of microbiome data and also allows investigators to examine covariate effects on the marginal (unconditional) mean. We demonstrate its practical performance using simulation studies and apply the model to a real metagenomic dataset on mouse skin microbiota. We find that under the proposed marginalized model, without loss in power, the likelihood ratio test controls the type I error better than under conventional methods.

Author summary

Semi-continuous compositional data are typically analyzed using two-part models which separately describe the probability of zero values and the distribution of positive values. The second part of the model provides a conditional interpretation of covariate effects on the positive response. However, it is of great interest in many applications to assess the covariate effect on the marginal mean of the response. For this purpose, we propose a marginalized two-part model by reparameterizing the marginal mean in Part II. We show by simulation studies that the proposed marginalized two-part model outperforms conventional methods in terms of controlling the Type I error and maximizing the power. We apply our method to a microbiota dataset and find results consistent with our simulation studies.

Citation: Chai H, Jiang H, Lin L, Liu L (2018) A marginalized two-part Beta regression model for microbiome compositional data. PLoS Comput Biol 14(7): e1006329. https://doi.org/10.1371/journal.pcbi.1006329

Editor: Dan Knights, University of Minnesota, UNITED STATES

Received: July 13, 2017; Accepted: June 26, 2018; Published: July 23, 2018

Copyright: © 2018 Chai et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data are publicly available at https://www.nature.com/articles/ncomms3462#supplementary-information.

Funding: This research was partly supported by AHRQ R01 HS 020263 and National Natural Science Foundation of China (Grant No. 11571204). HC was supported by the China Scholarship Council (201606220131) as a visiting student at Northwestern University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In recent years, metagenomics studies have been growing rapidly due to the advances of next-generation sequencing (NGS) technologies [1]. Microbiota have been known to be associated with various diseases, e.g., obesity and diabetes [2, 3], Crohn’s disease [4], bacterial vaginosis [5], and cancer [6, 7].

The microbial abundance is usually measured in read counts. However, such quantities are not directly comparable across samples due to the uneven total sequence counts of samples. Therefore, the read counts are often normalized to relative abundances which sum to 1 for all microbes in a sample [8]. Relative abundance can be characterized by a point mass at zero and a right-skewed continuous distribution with a positive support, the so-called “semi-continuous” or “zero-inflated continuous” data. The zero values indicate that certain microbes are absent in the sample, or the rare microbes are present but missed due to undersampling, while the continuous distribution with a positive support describes the levels of relative abundance among the present microbes.

The relative abundance is often described by a two-part model [9], which separates zeros and positive values explicitly by two submodels: a logistic model for the probability of the outcome being positive in Part I and a (generalized) linear regression model for the amount of the (transformed) positive value in Part II. An important issue in such two-part models is to determine the distributional form in Part II. The nonzero relative abundance data are non-normally distributed and bounded in (0, 1). The Beta distribution has been used to model this outcome, and a two-part Beta regression model can thus be developed [10–12]. It includes two sets of parameters, one in the logistic regression for the presence of a microbe, and the other in the Beta regression for the relative abundance conditional on the presence of the microbe. These two sets of parameters are interpreted as effects on the presence of a microbe and on the level of relative abundance given that the microbe is present, respectively. That is, there is a conditional interpretation in Part II. However, it is often of great interest to have a straightforward interpretation of covariate effects on the overall marginal (unconditional) mean. For example, [13] proposed a marginalized two-part log-normal model by parameterizing covariate effects directly in terms of the marginal mean.

As conventional two-part Beta regression models do not provide an unconditional interpretation of covariate effects, we propose a marginalized two-part Beta regression model for microbiome abundance data which parameterizes covariate effects in terms of the marginal mean. The proposed model not only accounts for the zero-inflated nature of the microbiome data but also yields more interpretable effect estimates.

Of note, an alternative to describe zero-inflated data is the Tobit model [14] where zero values are considered as left censored observations of the underlying true negative values (of Normal or other distributions accommodating negative values). However, the Tobit model is not appropriate for the Beta distribution which does not have a support of negative values. Consequently, the Tobit model cannot be applied directly to the relative abundance data.

Models

In this section, we introduce the conventional two-part Beta regression model and the proposed marginalized two-part Beta regression model. We also describe how each model assesses the overall impact of covariates on the marginal mean, and demonstrate that the proposed model outperforms the conventional model.

Two-part Beta regression model

We begin with the conventional two-part model with a Beta component in Part II [10–12]. For a given operational taxonomic unit (OTU), let Y_i denote its semi-continuous relative abundance for subject i, where 0 ≤ Y_i < 1 and i = 1, 2, …, n. Specifically, a two-part Beta regression model has the following form:

$$Y_i \sim \begin{cases} 0 & \text{with probability } 1 - p_i, \\ \mathrm{Beta}\bigl(\mu_i\phi,\ (1-\mu_i)\phi\bigr) & \text{with probability } p_i, \end{cases}$$

where the density function of the Beta distribution is parameterized as

$$f(y; \mu_i, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu_i\phi)\,\Gamma\bigl((1-\mu_i)\phi\bigr)}\, y^{\mu_i\phi - 1}(1-y)^{(1-\mu_i)\phi - 1}, \qquad 0 < y < 1,$$

with μ_i (0 < μ_i < 1) and ϕ (ϕ > 0) being the mean and dispersion parameters of the Beta distribution, respectively, and p_i is the probability that the observation Y_i is from the Beta distribution. The two-part model describes the probability p_i in the logistic component and the conditional mean in the Beta component as functions of covariates,

$$\operatorname{logit}(p_i) = X_i^T\alpha, \tag{1}$$

$$\operatorname{logit}(\mu_i) = X_i^T\beta, \tag{2}$$

where α and β are vectors of regression coefficients, and X_i = (1, x_i1, …, x_ip)^T is the (p + 1)-dimensional covariate vector (including an intercept) for the i-th subject. We assume identical covariates for both parts of the model for simplicity of notation. One can instead allow for different sets of covariates for the two parts.
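To make the likelihood of the conventional model concrete, the following is a minimal Python sketch of its log-likelihood. It assumes logit links for both p_i and μ_i and the Beta(μ_iϕ, (1 − μ_i)ϕ) parameterization; the function names (`expit`, `beta_logpdf`, `ctp_loglik`) are illustrative and are not taken from the authors' SAS code.

```python
import math

def expit(z):
    """Inverse logit, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def beta_logpdf(y, mu, phi):
    """Log density of Beta(mu*phi, (1-mu)*phi) at a point 0 < y < 1."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y))

def ctp_loglik(alpha, beta, phi, X, y):
    """Log-likelihood of the conventional two-part Beta model.

    alpha, beta: coefficient lists (intercept first); X: list of covariate
    rows, each starting with 1.0; y: relative abundances in [0, 1).
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = expit(sum(a * x for a, x in zip(alpha, x_i)))   # Part I: P(Y_i > 0)
        mu = expit(sum(b * x for b, x in zip(beta, x_i)))   # Part II: E(Y_i | Y_i > 0)
        if y_i == 0:
            total += math.log1p(-p)
        else:
            total += math.log(p) + beta_logpdf(y_i, mu, phi)
    return total
```

Because the zero and positive observations contribute separate factors, the conventional likelihood splits into a logistic part in α and a Beta part in (β, ϕ), which is why the two parts can be fitted separately.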

Marginalized two-part Beta regression model

To obtain interpretable covariate effects on the marginal mean, we propose the following marginalized two-part Beta regression model. Let v_i = E(Y_i) be the marginal mean of Y_i. The first part of the proposed marginalized two-part model is the same as Part I in the conventional two-part model,

$$\operatorname{logit}(p_i) = X_i^T\alpha. \tag{3}$$

In Part II, the marginal (unconditional) mean v_i, instead of the conditional mean μ_i, is modeled as a function of covariates:

$$\operatorname{logit}(v_i) = X_i^T\gamma, \tag{4}$$

where γ is the vector of regression coefficients for the marginal mean.

As we can see, the marginalized two-part model not only captures zero-inflation and skewness as the conventional two-part model, but also allows us to examine covariate effects on the overall marginal mean.

As shown in S1 Text, the likelihood of the conventional two-part model can be reparameterized in terms of α, γ and ϕ to give the likelihood of the marginalized model. However, the interpretation of covariate effects is different in the two frameworks, as elaborated in the next subsection.
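The reparameterization can be sketched as follows. This is a short derivation under the assumption that both p_i and v_i are modeled on the logit scale; it is not a reproduction of S1 Text, and the shorthand expit is introduced here for illustration.

```latex
\begin{align*}
v_i &= E(Y_i) = p_i\,\mu_i, \\
\mu_i &= \frac{v_i}{p_i}
       = \frac{\operatorname{expit}(X_i^{T}\gamma)}{\operatorname{expit}(X_i^{T}\alpha)},
\qquad \operatorname{expit}(z) = \frac{e^{z}}{1+e^{z}}, \\
L(\alpha,\gamma,\phi) &= \prod_{i:\,y_i = 0} (1-p_i)
   \prod_{i:\,y_i > 0} p_i\, f\!\left(y_i;\ \frac{v_i}{p_i},\ \phi\right).
\end{align*}
```

Since the Beta mean must satisfy 0 < μ_i < 1, the parameters must be such that v_i < p_i for every subject.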

The estimation of the marginalized two-part model can be carried out in SAS Proc NLMIXED (the main code is shown in S1 Code). To obtain starting values for the estimation, a logistic model and a Beta regression model are fitted for the binary part and the positive part, respectively. The estimates from these two models are then used as starting values for the marginalized two-part model. The convergence of the estimation is determined by a threshold value of 1 × 10⁻⁸ for the relative gradient, a common convergence criterion in SAS Proc NLMIXED. This criterion is satisfied in our simulations for all replicates, and in the real data analysis for all 131 OTUs.
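For readers who do not use SAS, the objective being maximized can be sketched in Python as a negative log-likelihood. This is an illustrative sketch only, assuming logit links for p_i and v_i and the conditional mean μ_i = v_i / p_i; the name `mtp_negloglik` is hypothetical, and the code is not a translation of S1 Code.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(coefs, row):
    """Linear predictor for one subject."""
    return sum(c * x for c, x in zip(coefs, row))

def mtp_negloglik(alpha, gamma, phi, X, y):
    """Negative log-likelihood of the marginalized two-part Beta model.

    Returns math.inf when the parameters are infeasible, i.e. when
    phi <= 0 or the implied conditional mean v_i / p_i is not in (0, 1).
    """
    if phi <= 0:
        return math.inf
    nll = 0.0
    for x_i, y_i in zip(X, y):
        p = expit(dot(alpha, x_i))   # probability of a positive value
        v = expit(dot(gamma, x_i))   # marginal mean E(Y_i)
        mu = v / p                   # implied conditional mean
        if not 0.0 < mu < 1.0:
            return math.inf
        if y_i == 0:
            nll -= math.log1p(-p)
        else:
            a, b = mu * phi, (1.0 - mu) * phi
            nll -= (math.log(p) + math.lgamma(phi)
                    - math.lgamma(a) - math.lgamma(b)
                    + (a - 1.0) * math.log(y_i)
                    + (b - 1.0) * math.log1p(-y_i))
    return nll
```

A function of this form could be handed to any general-purpose numerical optimizer. Unlike the conventional model, the likelihood no longer factors into separate pieces for the two parts, because α enters the Beta component through μ_i.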

Interpretation of covariate effects

For the conventional model.

Using the conventional two-part model shown in Eqs (1) and (2), β_j is interpreted as the effect of a unit increase in the j-th covariate on the logit of the conditional mean of Y_i given that Y_i is positive. In many applications, however, the primary interest is to examine the impact of covariates on the overall marginal mean E(Y_i). For the conventional two-part model, we have

$$E(Y_i) = p_i\,\mu_i = \frac{\exp(X_i^T\alpha)}{1+\exp(X_i^T\alpha)} \cdot \frac{\exp(X_i^T\beta)}{1+\exp(X_i^T\beta)}. \tag{5}$$

Along the lines of [15], we can assess the effect of the j-th continuous covariate x_ij on the unconditional mean as (6) where with α_j and β_j being the coefficients corresponding to x_ij in the conventional two-part model and X_i(−j), α_(−j), and β_(−j) be the corresponding vectors with the j-th covariate removed.

A straightforward calculation shows that (6) can be equivalently written as (7) where

As the logit transformation is a monotonically increasing function on the interval (0, 1), a hypothesis test of the covariate effects on the marginal mean is equivalent to one on its logit transformation. In Eq (7), the logit transformation of the marginal mean abundance is independent of covariate x_ij if both α_j and β_j are zero. However, if α_j and β_j have opposite signs, the logit transformation of the marginal mean abundance may still be independent of covariate x_ij even when neither coefficient is zero. Furthermore, the coefficients c₁(α_j, β_j) and c₂(α_j, β_j) in Eq (7) are functions of α_j and β_j. Thus, the independence between the marginal mean and covariate x_ij cannot be tested simply through the hypothesis α_j = 0 and β_j = 0, e.g., by the likelihood ratio test. Instead, the Delta method has to be used for the hypothesis test based on Eq (7), which depends on X_i(−j) in a complicated way.
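The cancellation argument can be checked with a short derivation. The following sketch starts from E(Y_i) = p_i μ_i with logit links for p_i and μ_i; the shorthand a_i and b_i is introduced here, and the final expression may be arranged differently from the authors' Eq (7).

```latex
\begin{align*}
a_i &= X_i^{T}\alpha, \qquad b_i = X_i^{T}\beta, \\
E(Y_i) &= \frac{e^{a_i + b_i}}{(1+e^{a_i})(1+e^{b_i})}, \qquad
1 - E(Y_i) = \frac{1 + e^{a_i} + e^{b_i}}{(1+e^{a_i})(1+e^{b_i})}, \\
\operatorname{logit} E(Y_i) &= a_i + b_i - \log\!\left(1 + e^{a_i} + e^{b_i}\right), \\
\frac{\partial \operatorname{logit} E(Y_i)}{\partial x_{ij}}
 &= \frac{\alpha_j\,(1 + e^{b_i}) + \beta_j\,(1 + e^{a_i})}{1 + e^{a_i} + e^{b_i}}.
\end{align*}
```

The numerator is zero when α_j = β_j = 0 and is nonzero when the two coefficients share a sign, but with opposite signs the two terms can offset each other, and whether they do depends on the remaining covariates through a_i and b_i.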

When the interest is to assess the effect of a discrete variable on response, e.g., placebo vs. treatment, Eq (7) no longer applies. Without loss of generality, consider a binary covariate x_ik taking value 0 or 1. Similar to [15], the difference in the logit transformation of the marginal mean with x_ik = 1 vs. x_ik = 0 is used to evaluate the impact on the expected marginal mean response.

Under the conventional two-part model, the difference between the logit transformations with x_ik = 1 and x_ik = 0 is (8) where

It is worth noting that b₁(α_k), b₂(β_k), and b₃(α_k, β_k) all equal 0 if α_k and β_k are 0. As with a continuous covariate, the logit transformation of the marginal mean abundance does not depend on the binary covariate x_ik if both α_k and β_k are zero. However, even when neither coefficient is zero, the transformed mean abundance may still be independent of the binary covariate x_ik when α_k and β_k have opposite signs. Eq (8) indicates that the independence between the response and the binary covariate x_ik cannot be ascertained by directly testing α_k = 0 and β_k = 0 by, e.g., the likelihood ratio test, as shown in the simulation studies and the real data analysis.
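For a binary covariate, the same reasoning can be written out explicitly. This is a sketch that again assumes E(Y_i) = p_i μ_i with logit links; the symbols A_i, B_i and Δ_i are introduced here for illustration, and the result is not written in the b₁, b₂, b₃ form of the authors' Eq (8).

```latex
\begin{align*}
A_i &= X_{i(-k)}^{T}\alpha_{(-k)}, \qquad B_i = X_{i(-k)}^{T}\beta_{(-k)}, \\
\operatorname{logit} E(Y_i \mid x_{ik} = x)
  &= (A_i + \alpha_k x) + (B_i + \beta_k x)
     - \log\!\left(1 + e^{A_i + \alpha_k x} + e^{B_i + \beta_k x}\right), \\
\Delta_i &= \operatorname{logit} E(Y_i \mid x_{ik}=1) - \operatorname{logit} E(Y_i \mid x_{ik}=0) \\
  &= \alpha_k + \beta_k
     - \log\frac{1 + e^{A_i + \alpha_k} + e^{B_i + \beta_k}}{1 + e^{A_i} + e^{B_i}}.
\end{align*}
```

Here Δ_i = 0 when α_k = β_k = 0, and Δ_i has the common sign of α_k and β_k when they agree. When the signs differ, an increase in the presence probability can be offset by a decrease in the conditional mean, so Δ_i can be zero with both coefficients nonzero.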

For the marginalized model.

In the marginalized two-part model given by Eqs (3) and (4), the effect of a continuous covariate x_ij on the marginal mean E(Y_i) can be characterized by

$$\frac{\partial\, \operatorname{logit}\bigl(E(Y_i)\bigr)}{\partial x_{ij}} = \gamma_j, \tag{9}$$

where γ_j is the coefficient corresponding to x_ij in Eq (4). Thus, the effect of the covariate x_ij on the marginal mean abundance is determined by its coefficient in the marginalized model. With the marginalized two-part model, we can estimate the coefficient γ_j as well as test the effect on the marginal mean.

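As for a binary covariate in the marginalized two-part model, the difference in the logit transformation of the marginal mean with x_ik = 1 vs. x_ik = 0 can be expressed as

$$\operatorname{logit}\bigl(E(Y_i \mid x_{ik} = 1)\bigr) - \operatorname{logit}\bigl(E(Y_i \mid x_{ik} = 0)\bigr) = \gamma_k. \tag{10}$$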

One can see that the effect of a binary covariate x_ik on the marginal mean abundance is determined by its coefficient γ_k in the marginalized two-part model. The logit transformation of the marginal mean abundance with x_ik = 1 is bigger than that with x_ik = 0 when γ_k is positive, and the reverse is true when γ_k is negative.

Results

In this section, simulation studies and real data analysis are presented to assess the performance of the proposed marginalized and the conventional two-part models. Results show that the proposed model outperforms the conventional model, which is consistent with the theoretical results.

Simulation studies

In this section, we conduct simulation studies to evaluate the finite-sample performance of the proposed marginalized two-part model. To test the effect of the covariate on the overall marginal mean E(Y_i), likelihood ratio tests (LRT) are performed and compared under the marginalized two-part (MTP) model and the conventional two-part (CTP) model. In addition, the two sample T-test and the Wilcoxon rank sum test are also compared.

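We assume that, in both parts, there is only one binary covariate x₁, which is generated from the Bernoulli distribution with p = 0.5. As the interpretation of covariate effects in the preceding section shows, however, the proposed model also applies to multiple covariates. The response y_i is generated as follows:

$$y_i \sim \begin{cases} 0 & \text{with probability } 1 - p_i, \\ \mathrm{Beta}\bigl(\mu_i\phi,\ (1-\mu_i)\phi\bigr) & \text{with probability } p_i, \end{cases}$$

with logit(p_i) = α₀ + α₁x_i1 and logit(v_i) = γ₀ + γ₁x_i1, where μ_i = v_i / p_i is the conditional mean given that y_i is positive and ϕ is the dispersion parameter of the Beta distribution.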

In the simulation studies, 1000 samples of sizes 200 and 400 are generated. We set the parameters as α₀ = 1.5, γ₀ = −2.5, and ϕ = 1, while α₁ and γ₁ take different values depending on which of the two criteria is under study: the type I error or the power.
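The data-generating step can be sketched in Python as below. The sketch assumes the marginalized model with logit(p_i) = α₀ + α₁x_i, logit(v_i) = γ₀ + γ₁x_i and μ_i = v_i / p_i; the function name `simulate_mtp` and the seed are illustrative choices, not part of the original study.

```python
import math
import random

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate_mtp(n, alpha0, alpha1, gamma0, gamma1, phi, rng):
    """Draw n pairs (x, y) from a marginalized two-part Beta model
    with one Bernoulli(0.5) covariate."""
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p = expit(alpha0 + alpha1 * x)   # probability of a positive value
        v = expit(gamma0 + gamma1 * x)   # marginal mean E(y | x)
        mu = v / p                       # conditional mean of positive values
        if not 0.0 < mu < 1.0:
            raise ValueError("parameters imply a conditional mean outside (0, 1)")
        if rng.random() < p:
            y = rng.betavariate(mu * phi, (1.0 - mu) * phi)
        else:
            y = 0.0
        data.append((x, y))
    return data

# One replicate of size 200 under the setting alpha1 = 1, gamma1 = 0.
rng = random.Random(2018)
sample = simulate_mtp(200, 1.5, 1.0, -2.5, 0.0, 1.0, rng)
```

Under γ₁ = 0 the two covariate groups share the same marginal mean expit(−2.5) even though their proportions of zeros differ when α₁ ≠ 0, which is exactly the null scenario the type I error study targets.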

First, we evaluate the type I error for testing the null hypothesis H₀: the binary covariate x₁ has no effect on the overall marginal mean of y_i. In the MTP model, this is equivalent to testing γ₁ = 0, as shown in Eq (10). However, testing α₁ = 0 and β₁ = 0 in the CTP model is not equivalent to testing H₀. Specifically, according to Eq (8), even though neither of the coefficients is zero, the binary covariate x₁ may still have no effect on the marginal mean. This means that the conventional model cannot control the type I error for testing H₀ when both α₁ and β₁ are non-zero.

The results are shown in Fig 1. Type I errors are calculated under two settings: α₁ = 0, γ₁ = 0 and α₁ = 1, γ₁ = 0. For each setting, two α-levels are considered: 0.01 and 0.05. As we can see from Fig 1, under the first setting (α₁ = 0, γ₁ = 0), all the methods control the type I error reasonably well. Under the setting α₁ = 1, γ₁ = 0, the LRT under the MTP and the T-test control the type I error well, while the LRT under the CTP and the Wilcoxon test do not, with the LRT under the CTP model performing worst. This is because, in this setting, testing α₁ = 0 and β₁ = 0 in the CTP model is not equivalent to testing the null hypothesis H₀.
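A small calculation illustrates why the CTP test rejects in the second setting. With a single binary covariate, both parts of either model are saturated, so the marginalized parameters map exactly onto conventional ones. The sketch below, with the hypothetical helper `implied_ctp_slope`, assumes logit links and μ = v / p, and computes the Part II slope β₁ that the CTP model would need to reproduce a given marginalized setting.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(q):
    """Log odds of a probability 0 < q < 1."""
    return math.log(q / (1.0 - q))

def implied_ctp_slope(alpha0, alpha1, gamma0, gamma1):
    """Slope beta1 on the conditional mean implied by the marginalized
    parameters when the only covariate is binary."""
    mu0 = expit(gamma0) / expit(alpha0)                      # conditional mean, x = 0
    mu1 = expit(gamma0 + gamma1) / expit(alpha0 + alpha1)    # conditional mean, x = 1
    return logit(mu1) - logit(mu0)

# Simulation setting alpha1 = 1, gamma1 = 0 with alpha0 = 1.5, gamma0 = -2.5.
beta1 = implied_ctp_slope(1.5, 1.0, -2.5, 0.0)
```

For this setting `beta1` evaluates to roughly −0.13: the covariate raises the presence probability and lowers the conditional mean by just enough to leave the marginal mean unchanged. Both α₁ and β₁ are therefore nonzero under H₀, so a joint test of α₁ = β₁ = 0 rejects far more often than the nominal level.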

Fig 1. Type I errors of the four methods.

The results in the upper panels correspond to the setting α₁ = 0, γ₁ = 0 and the lower panels correspond to setting α₁ = 1, γ₁ = 0. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05. The dashed horizontal line in each panel represents the correct level. The results for sample size 400 can be found in S1 Fig in Supporting information.

https://doi.org/10.1371/journal.pcbi.1006329.g001

The powers under two different settings, α₁ = 0, γ₁ = 1 and α₁ = 1, γ₁ = 1, are shown in Fig 2. As we can see, the LRT under the CTP and the MTP are the most powerful methods with the power close to 1 in all settings. The Wilcoxon test performs a little worse than the LRT while the T-test has the lowest power.

Fig 2. Powers of the four methods.

The upper panels show the powers corresponding to the setting α₁ = 0, γ₁ = 1 and the lower panels show the powers corresponding to the setting α₁ = 1, γ₁ = 1. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05. The powers for sample size 400 are shown in S2 Fig in Supporting information.

https://doi.org/10.1371/journal.pcbi.1006329.g002

We also estimate the coefficients in the MTP model under the setting α₁ = 1, γ₁ = 1. The results in Table 1 demonstrate that the biases are negligible and the coverage probabilities are acceptably close to the nominal level 0.95 for all the model parameters. In addition, we observe small differences between the empirical standard errors and our estimates. The mean squared errors for sample size 400 are smaller than those for sample size 200.

Table 1. Estimates of the coefficients in the marginalized two-part model under the setting α₁ = 1, γ₁ = 1.

https://doi.org/10.1371/journal.pcbi.1006329.t001

According to the simulation results, the LRT under the MTP model has the best performance: it controls the type I error reasonably well and also achieves the best power. The T-test has similar performance in error control but is not as powerful as the LRT under the MTP model. The LRT under the CTP model is powerful; however, it fails to control the type I error. The Wilcoxon test performs worse than the LRT under the MTP model in both error control and power.

To assess the robustness of the proposed method, we consider a setting where positive responses are generated from another distribution. The only covariate x_i is generated from the Uniform distribution on (0, 1), while the response y_i has the following distribution:

$$y_i \sim \begin{cases} 0 & \text{with probability } 1 - p_i, \\ \mathrm{Bin}(100, \mu_i)/100 & \text{with probability } p_i, \end{cases}$$

where logit(p_i) = α₀ + α₁x_i and the overall marginal mean v_i of the response is v_i = p_i μ_i.

Instead of the Beta distribution, positive responses are generated from the Binomial distribution Bin(100, μ_i) and then divided by 100 to make them bounded in (0, 1). As in the previous simulation, we set logit(v_i) = γ₀ + γ₁x_i, so that μ_i = v_i / p_i. The probability of having exactly 0 successes in 100 trials is (1 − μ_i)¹⁰⁰, which is negligible with a proper choice of the parameters α and γ. Thus almost all the zero values in this zero-inflated Binomial data are structural zeros.
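How small the sampling-zero probability is can be checked numerically. The sketch below assumes logit(p_i) = α₀ + α₁x_i, logit(v_i) = γ₀ + γ₁x_i and μ_i = v_i / p_i, uses the intercepts α₀ = 2 and γ₀ = −0.5 from this robustness study, and scans a grid of covariate values in [0, 1]; the helper name `prob_sampling_zero` is illustrative.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_sampling_zero(x, alpha0, alpha1, gamma0, gamma1, trials=100):
    """P(Bin(trials, mu) = 0) for the conditional mean mu = v / p at covariate x."""
    p = expit(alpha0 + alpha1 * x)
    v = expit(gamma0 + gamma1 * x)
    mu = v / p
    if not 0.0 < mu < 1.0:
        raise ValueError("parameters imply a conditional mean outside (0, 1)")
    return (1.0 - mu) ** trials

# Largest probability of a sampling zero over the four slope settings.
grid = [k / 100.0 for k in range(101)]
worst = max(
    prob_sampling_zero(x, 2.0, a1, -0.5, g1)
    for a1 in (0.0, 1.0)
    for g1 in (0.0, 1.0)
    for x in grid
)
```

Under these assumptions the conditional mean stays between about 0.40 and 0.71, so `worst` is many orders of magnitude below any practically relevant probability, in line with the statement that nearly all zeros are structural.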

In this simulation study, 1000 samples of sizes 200 and 400 are generated. The parameters are set as α₀ = 2, γ₀ = −0.5, while α₁ and γ₁ may have different values in order to calculate the type I errors and the powers.

The type I errors are calculated under two settings: α₁ = 0, γ₁ = 0 and α₁ = 1, γ₁ = 0. For each setting, two α-levels are considered: 0.01 and 0.05. As we can see from Fig 3, under both settings, the proposed marginalized model controls the type I error reasonably well. The conventional model controls the type I error under the setting α₁ = 0, γ₁ = 0 while it fails under the setting α₁ = 1, γ₁ = 0, similar to Fig 1.

Fig 3. Type I errors for the CTP model and the MTP model.

The results in the upper panels correspond to the setting α₁ = 0, γ₁ = 0 and the lower panels correspond to the setting α₁ = 1, γ₁ = 0. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05. The dashed horizontal line in each panel represents the correct α-level.

https://doi.org/10.1371/journal.pcbi.1006329.g003

As shown in Fig 4, both the marginalized model and the conventional model have power equal to 1 under all settings.

Fig 4. Powers for the CTP model and the MTP model.

The powers in the upper panels correspond to the setting α₁ = 0, γ₁ = 1 and the lower panels correspond to setting α₁ = 1, γ₁ = 1. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05.

https://doi.org/10.1371/journal.pcbi.1006329.g004

From the simulation studies we can conclude that the proposed marginalized two-part Beta regression model is powerful and controls the type I error well. It is also robust against model misspecification.

Real data analysis

In this section, the proposed marginalized two-part model and the conventional two-part model are applied to a real metagenomic dataset on mouse skin microbiota to investigate the effects of immunization on the relative abundances of 131 core OTUs [16, 17]. The data are publicly available at https://www.nature.com/articles/ncomms3462#supplementary-information. In addition to the likelihood ratio tests under CTP and MTP, the T test and the Wilcoxon rank sum test are also included for comparison. All the tests are carried out with Bonferroni’s correction.
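The multiplicity adjustment can be sketched in a few lines of Python. Bonferroni's correction compares each p-value with the family-wise level divided by the number of tests, which is 131 here. The family-wise level of 0.05 in the sketch is an assumption for illustration, since the level is not stated in this section, and the function name is hypothetical.

```python
def bonferroni_reject(p_values, level=0.05):
    """Flag the tests that remain significant after Bonferroni's correction.

    level is the family-wise error rate (0.05 is an assumed default).
    """
    m = len(p_values)
    threshold = level / m   # per-test significance threshold
    return [p <= threshold for p in p_values]

# Per-OTU threshold when 131 OTUs are tested at an assumed level of 0.05.
per_otu_threshold = 0.05 / 131
```

With 131 OTUs, an individual test must reach a p-value below about 0.00038 to be declared significant under this assumed level.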

The skin dataset contains the relative abundances of the most common 131 OTUs for 261 mouse skin samples, including 78 non-immunized and 183 immunized individuals. The skin data contain a large proportion of zero abundances: the percentage of zeros per OTU ranges from 0 to 68.97%, with mean 33.03% and median 33.72% (see S3 and S4 Figs). The positive values are highly right-skewed, and the logit transformations in the MTP model and the CTP model capture the skewness (see S5 Fig).

Fig 5 shows the results for these four methods. The LRT under the marginalized two-part model detects significant effects of immunization on 45 (namely, 31 + 14) OTUs. The LRT under the conventional two-part model gives significant results for all these 45 OTUs and for 14 (namely, 8 + 4 + 2) additional OTUs. The T test identifies 31 of these 45 OTUs and another 7 (namely, 4 + 3) OTUs. Similar to the LRT under the conventional two-part model, the Wilcoxon test identifies all these 45 OTUs and 21 (namely, 9 + 8 + 4) additional OTUs. Finally, 60 OTUs are not identified by any of the methods.
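The counts above can be cross-checked with a small tally. In the sketch below, the region counts are taken from the text, but the assignment of the counts outside the MTP set to specific method combinations is inferred from the sums quoted for each method rather than read from Fig 5 directly.

```python
# Region counts of the Venn diagram; keys list the methods that identify
# the OTUs in that region (assignment inferred from the per-method sums).
regions = {
    ("MTP", "CTP", "Wilcoxon", "T"): 31,
    ("MTP", "CTP", "Wilcoxon"): 14,
    ("CTP", "Wilcoxon", "T"): 4,
    ("CTP", "Wilcoxon"): 8,
    ("CTP",): 2,
    ("Wilcoxon",): 9,
    ("T",): 3,
}

def total_for(method):
    """Number of OTUs identified by one method across all regions."""
    return sum(n for methods, n in regions.items() if method in methods)

totals = {m: total_for(m) for m in ("MTP", "CTP", "Wilcoxon", "T")}
identified = sum(regions.values())   # OTUs found by at least one method
not_identified = 131 - identified    # OTUs found by no method
```

Under this reading the tally gives 45 OTUs for the MTP model, 59 for the CTP model, 66 for the Wilcoxon test and 38 for the T test, with 71 OTUs identified by at least one method and 60 by none, matching the figures reported in the text.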

Fig 5. Venn diagram for the OTUs.

Among all the 131 OTUs, 60 OTUs are not identified by any method and the other 71 OTUs are identified by at least one method. For example, “31” in the intersection of all sets indicates that 31 OTUs are identified by all methods, while “4”, located in the intersection of three sets, indicates that 4 OTUs are identified by three methods, namely, the T test, the CTP model, and the Wilcoxon test.

https://doi.org/10.1371/journal.pcbi.1006329.g005

The LRT under the CTP model and the Wilcoxon test identify more OTUs than the LRT under the MTP due to their failure to control the type I error, as shown in the simulation studies (Fig 1). Indeed, all of the 14 OTUs identified by the CTP but not by the MTP have significant coefficients in Part I of the two-part model. Out of the 21 OTUs that are identified by the Wilcoxon test but not by the MTP, 17 have significant coefficients in Part I of the two-part model. This corresponds to the setting α₁ = 1, γ₁ = 0, where both the CTP and the Wilcoxon test have much higher type I errors than the MTP (see the lower panels of Fig 1). Because it is less powerful than the MTP (Fig 2), the T test identifies fewer OTUs than the MTP.

Table 2 shows the 10 most significant OTUs from the MTP model. As in [17], for OTUs which cannot be classified at the species level, the next highest classifiable taxonomic level (denoted by ‘o’, ‘f’ and ‘g’ for order, family, and genus, respectively) is displayed. We use a number in the superscript to distinguish among different OTUs with the same classification name. The detailed results of all the 45 OTUs identified by the proposed MTP model are shown in S1 Table.

Table 2. Top 10 OTUs identified by the MTP model.

https://doi.org/10.1371/journal.pcbi.1006329.t002

Moreover, for most of the 131 OTUs, the proposed marginalized two-part model fits the observed data better than the conventional two-part model. Fig 6 shows the density curves of the observed relative abundances, the predicted relative abundances using the MTP model, and the predicted relative abundances using the CTP model for two OTUs. As we can see, the MTP model fits the observed data much better than the CTP model.

Fig 6. Density curves for two OTUs.

The blue curve shows the density of the observed data. The green curve shows the density of predictions from the MTP model while the red curve represents the density of predictions from the CTP model.

https://doi.org/10.1371/journal.pcbi.1006329.g006

Discussion

In this paper, we propose a marginalized two-part Beta regression model for semi-continuous microbiome compositional data. This model allows investigators to obtain covariate effects on the marginal mean of the outcome. It takes into account the compositional and zero-inflation nature of the microbiome relative abundance data. It also has an unconditional interpretation of the covariate effect on the marginal mean. Our proposed marginalized two-part model has satisfactory performance in both simulation studies and real data analysis.

For count outcomes exhibiting many zeros, a zero-inflated Poisson (ZIP) regression model or a zero-inflated negative binomial (ZINB) model is often employed to examine the relation between covariates and the response. To model the overall population mean count directly, the marginalized ZIP model and the marginalized ZINB model were proposed by [18] and [19], respectively. However, in the case of bounded count data, the ZIP model is questionable, while the zero-inflated binomial (ZIB) model and its extension for over-dispersion, the zero-inflated beta-binomial (ZIBB) model, are available in [20–22]. It is of interest to develop a marginalized modeling approach for ZIB or ZIBB.

More recently, there has been increasing interest in analyzing correlated zero-inflated semi-continuous data. The correlation may stem from the structure of clustered data or from longitudinal data where repeated measures are correlated for the same subject. Typically, random effects are included to account for the correlations between observations [10, 15, 23–25]. However, a similar limitation exists in these two-part random effects models, as they cannot account for covariate effects on the marginal mean. Recently, Smith et al. [26] proposed a marginalized two-part model for longitudinal semicontinuous data based on the log-skew normal distribution for positive values. In future studies, we will extend our marginalized two-part model to correlated semi-continuous data bounded by 0 and 1.

Finally, it is of interest to consider different microbiomes together, taking into account the constraint that the relative abundances of all OTUs sum to 1. Scealy and Welsh [27, 28] considered Kent models for such compositional data. It merits further consideration to incorporate zero values in the Kent model framework.

Supporting information

S1 Text. Likelihood derivation.

https://doi.org/10.1371/journal.pcbi.1006329.s001

(PDF)

S1 Code. SAS code.

The main SAS codes for the conventional two-part model and the proposed marginalized two-part model are shown in this section.

https://doi.org/10.1371/journal.pcbi.1006329.s002

(PDF)

S1 Fig. Type I errors for the sample size 400.

This figure shows the type I errors of the four methods for sample size 400. The results in the upper panels correspond to the setting α₁ = 0, γ₁ = 0 and the lower panels correspond to setting α₁ = 1, γ₁ = 0. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for significance level 0.05. The dashed horizontal line in each panel represents the significance level.

https://doi.org/10.1371/journal.pcbi.1006329.s003

(TIF)

S2 Fig. Powers for the sample size 400.

This figure shows the powers of the four methods for sample size 400. The upper panels show the powers corresponding to the setting α₁ = 0, γ₁ = 1 and the lower panels show the powers corresponding to the setting α₁ = 1, γ₁ = 1. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for significance level 0.05.

https://doi.org/10.1371/journal.pcbi.1006329.s004

(TIF)

S3 Fig. Zero-inflation of the skin data.

The figure shows the distributions of relative abundances of 6 OTUs. From the upper panel to the lower panel and from the left to the right, the proportions of zero values for these 6 OTUs are 0.77%, 3.45%, 4.97%, 14.18%, 29.89%, and 48.28%, respectively.

https://doi.org/10.1371/journal.pcbi.1006329.s005

(TIF)

S4 Fig. The figure shows the percentages of zero abundance in the 261 mouse skin samples for all 131 core OTUs.

The lower quartile and the upper quartile of the percentages are 20.11% and 48.28%, respectively.

https://doi.org/10.1371/journal.pcbi.1006329.s006

(TIF)

S5 Fig. Skewness of the skin data.

The figure shows the histogram of the relative abundance for 6 OTUs. The first one in every panel is the histogram of the OTU in the original scale, while the second one in every panel shows the histogram after logit transformation.

https://doi.org/10.1371/journal.pcbi.1006329.s007

(TIF)

S1 Table. The detailed results of the MTP model.

The table shows the detailed results of all the 45 OTUs that are identified by the proposed MTP model.

https://doi.org/10.1371/journal.pcbi.1006329.s008

(PDF)

References

1. Gilbert JA, Meyer F, Bailey MJ. The Future of microbial metagenomics (or is ignorance bliss?). Isme Journal. 2011;5(5):777–779. pmid:21107444
2. Everard A, Cani PD. Diabetes, obesity and gut microbiota. Best Practice & Research Clinical Gastroenterology. 2013;27(1):73–83.
3. Musso G, Gambino R, Cassader M. Obesity, diabetes, and gut microbiota: the hygiene hypothesis expanded? Diabetes Care. 2010;33(10):2277–2284. pmid:20876708
4. Lewis JD, Chen EZ, Baldassano RN, Otley AR, Griffiths AM, Lee D, et al. Inflammation, Antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease. Cell Host & Microbe. 2015;18(4):489–500.
5. Srinivasan S, Hoffman NG, Morgan MT, Matsen FA, Fiedler TL, Hall RW, et al. Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. Plos One. 2012;7(6):e37818. pmid:22719852
6. Garrett WS. Cancer and the microbiota. Science. 2015;348(6230):80–86. pmid:25838377
7. Schwabe RF, Jobin C. The microbiome and cancer. Nature Reviews Cancer. 2013;13(11):800–812. pmid:24132111
8. Tyler AD, Smith MI, Silverberg MS. Analyzing the human microbiome: a “how to” guide for physicians. American Journal of Gastroenterology. 2014;109(7):983–993. pmid:24751579
9. Manning WG. A two-part model of the demand for medical care: preliminary results from the health insurance study. Health, Economics, and Health Economics. 1981; p. 103–123.
10. Chen EZ, Li H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016;32(17):2611–2617. pmid:27187200
11. Ospina R, Ferrari SLP. A general class of zero-or-one inflated beta regression models. Computational Statistics & Data Analysis. 2012;56(6):1609–1623.
12. Peng X, Li G, Liu Z. Zero-Inflated Beta Regression for Differential Abundance Analysis with Metagenomics Data. Journal of Computational Biology. 2015;23(2):102–110.
13. Smith VA, Preisser JS, Neelon B, Maciejewski ML. A marginalized two-part model for semicontinuous data. Statistics in Medicine. 2014;33(28):4891–4903. pmid:25043491
14. Tobin J. Estimation of Relationships for Limited Dependent Variables. Econometrica. 1958;26(1):24–36.
15. Liu L, Strawderman RL, Cowen ME, Shih YC. A flexible two-part random effects model for correlated medical costs. Journal of Health Economics. 2010;29(1):110–123. pmid:20015560
16. Ban Y, An L, Jiang H. Investigating microbial co-occurrence patterns based on metagenomic compositional data. Bioinformatics. 2015;31(20):3322–3329. pmid:26079350
17. Srinivas G, Möller S, Wang J, Künzel S, Zillikens D, Baines JF, et al. Genome-wide mapping of gene-microbiota interactions in susceptibility to autoimmune skin blistering. Nature Communications. 2013;4(9):2462. pmid:24042968
18. Long DL, Preisser JS, Herring AH, Golin CE. A marginalized zero-inflated Poisson regression model with overall exposure effects. Statistics in Medicine. 2014;33(29):5151–5165. pmid:25220537
19. Preisser JS, Das K, Long DL, Divaris K. Marginalized zero-inflated negative binomial regression with application to dental caries. Statistics in Medicine. 2016;35(10):1722–1735. pmid:26568034
20. Skrondal A, Rabe-Hesketh S. Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models. London: Chapman & Hall; 2004.
21. Albert JM, Wang W, Nelson S. Estimating overall exposure effects for zero-inflated regression models with application to dental caries. Statistical Methods in Medical Research. 2014;23(3):257–278. pmid:21908419
22. Gilthorpe MS, Frydenberg M, Cheng Y, Baelum V. Modelling count data with excessive zeros: The need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data. Statistics in Medicine. 2009;28(28):3539–3553. pmid:19902494
23. Olsen MK, Schafer JL. A Two-Part Random-Effects Model for Semicontinuous Longitudinal Data. Journal of the American Statistical Association. 2001;96(454):730–745.
24. Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data with clumping at zero. Statistical Methods in Medical Research. 2002;11(4):341–355. pmid:12197301
25. Liu L, Strawderman RL, Johnson BA, O’Quigley JM. Analyzing repeated measures semi-continuous data, with application to an alcohol dependence study. Statistical Methods in Medical Research. 2016;25(1):133–152. pmid:22474003
26. Smith VA, Neelon B, Preisser JS, Maciejewski ML. A marginalized two-part model for longitudinal semicontinuous data. Statistical Methods in Medical Research. 2017;26(4):1949–1968. pmid:26156962
27. Scealy JL, Welsh AH. Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society. 2011;73(3):351–375.
28. Scealy JL, Welsh AH. Fitting Kent models to compositional data with small concentration. Statistics & Computing. 2014;24(2):165–179.
