## Abstract

In microbiome studies, an important goal is to detect differential abundance of microbes across clinical conditions and treatment options. However, the microbiome compositional data (quantified by relative abundance) are highly skewed, bounded in [0, 1), and often have many zeros. A two-part model is commonly used to separate zeros and positive values explicitly by two submodels: a logistic model for the probability of a species being present in Part I, and a Beta regression model for the relative abundance conditional on the presence of the species in Part II. However, the regression coefficients in Part II cannot provide a marginal (unconditional) interpretation of covariate effects on the microbial abundance, which is of great interest in many applications. In this paper, we propose a marginalized two-part Beta regression model which captures the zero-inflation and skewness of microbiome data and also allows investigators to examine covariate effects on the marginal (unconditional) mean. We demonstrate its practical performance using simulation studies and apply the model to a real metagenomic dataset on mouse skin microbiota. We find that, under the proposed marginalized model, the likelihood ratio test controls the type I error better than conventional methods, without loss of power.

## Author summary

Semi-continuous compositional data are typically analyzed using two-part models which separately describe the probability of zero values and the distribution of positive values. The second part of the model provides a conditional interpretation of covariate effects on the positive response. However, it is of great interest in many applications to assess the covariate effect on the marginal mean of the response. For this purpose, we propose a marginalized two-part model by reparameterizing Part II in terms of the marginal mean. We show by simulation studies that the proposed marginalized two-part model outperforms conventional methods in terms of controlling the type I error and maximizing the power. We apply our method to a microbiota dataset and find results consistent with our simulation studies.

**Citation: **Chai H, Jiang H, Lin L, Liu L (2018) A marginalized two-part Beta regression model for microbiome compositional data. PLoS Comput Biol 14(7): e1006329. https://doi.org/10.1371/journal.pcbi.1006329

**Editor: **Dan Knights, University of Minnesota, UNITED STATES

**Received: **July 13, 2017; **Accepted: **June 26, 2018; **Published: ** July 23, 2018

**Copyright: ** © 2018 Chai et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **The data are publicly available at https://www.nature.com/articles/ncomms3462#supplementary-information.

**Funding: **This research was partly supported by AHRQ R01 HS 020263 and National Natural Science Foundation of China (Grant No. 11571204). HC was supported by the China Scholarship Council (201606220131) as a visiting student at Northwestern University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

In recent years, metagenomics studies have been growing rapidly due to the advances of next-generation sequencing (NGS) technologies [1]. Microbiota have been known to be associated with various diseases, e.g., obesity and diabetes [2, 3], Crohn’s disease [4], bacterial vaginosis [5], and cancer [6, 7].

The microbial abundance is usually measured in read counts. However, such quantities are not directly comparable across samples due to the uneven total sequence counts of samples. Therefore, the read counts are often normalized to relative abundances which sum to 1 for all microbes in a sample [8]. Relative abundance can be characterized by a point mass at zero and a right-skewed continuous distribution with a positive support, the so-called “semi-continuous” or “zero-inflated continuous” data. The zero values indicate that certain microbes are absent in the sample, or the rare microbes are present but missed due to undersampling, while the continuous distribution with a positive support describes the levels of relative abundance among the present microbes.

The relative abundance is often described by a two-part model [9], which separates zeros and positive values explicitly by two submodels: a logistic model for the probability of the outcome being positive in Part I and a (generalized) linear regression model for the amount of the (transformed) positive value in Part II. An important issue in such two-part models is to determine the distributional form in Part II. The nonzero relative abundance data are non-normally distributed and bounded in (0, 1). The Beta distribution has been used to model this outcome, and a two-part Beta regression model can thus be developed [10–12]. It includes two sets of parameters, one in the logistic regression for the presence of a microbe, and the other in the Beta regression for the relative abundance conditional on the presence of the microbe. These two sets of parameters are interpreted as effects on the presence of a microbe and on the level of relative abundance given that the microbe is present, respectively. That is, there is a conditional interpretation in Part II. However, it is often of great interest to have a straightforward interpretation of covariate effects on the overall marginal (unconditional) mean. For example, Smith et al. [13] proposed a marginalized two-part log-normal model by parameterizing covariate effects directly in terms of the marginal mean.

As conventional two-part Beta regression models do not provide an unconditional interpretation of covariate effects, we propose a marginalized two-part Beta regression model for microbiome abundance data which parameterizes covariate effects in terms of the marginal mean. The proposed model not only accounts for the zero-inflated nature of the microbiome data but also yields more interpretable effect estimates.

Of note, an alternative to describe zero-inflated data is the Tobit model [14] where zero values are considered as left censored observations of the underlying true negative values (of Normal or other distributions accommodating negative values). However, the Tobit model is not appropriate for the Beta distribution which does not have a support of negative values. Consequently, the Tobit model cannot be applied directly to the relative abundance data.

## Models

In this section, we introduce the conventional two-part Beta regression model and the proposed marginalized two-part Beta regression model. We also describe how each model can be used to assess the overall impact of covariates on the marginal mean, and show why the proposed model is better suited to this purpose than the conventional model.

### Two-part Beta regression model

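We begin with the conventional two-part model with a Beta component in Part II [10–12]. For a given operational taxonomic unit (OTU), let *Y*_{i} denote its semi-continuous relative abundance for subject *i*, where 0 ≤ *Y*_{i} < 1 and *i* = 1, 2, …, *n*. Specifically, a two-part Beta regression model has the following form:

$$
Y_i \sim
\begin{cases}
0 & \text{with probability } 1 - p_i, \\
\mathrm{Beta}\big(\mu_i \phi,\ (1 - \mu_i)\phi\big) & \text{with probability } p_i,
\end{cases}
$$

where the density function of the Beta distribution is parameterized as

$$
f(y_i; \mu_i, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu_i \phi)\,\Gamma\big((1 - \mu_i)\phi\big)}\, y_i^{\mu_i \phi - 1} (1 - y_i)^{(1 - \mu_i)\phi - 1}, \qquad 0 < y_i < 1,
$$

with *μ*_{i} (0 < *μ*_{i} < 1) and *ϕ* (*ϕ* > 0) being the mean and dispersion parameters of the Beta distribution, respectively, and *p*_{i} being the probability that the observation *Y*_{i} is from the Beta distribution. The two-part model describes the probability *p*_{i} in the logistic component and the conditional mean *μ*_{i} in the Beta component as functions of covariates,

$$
\mathrm{logit}(p_i) = \log\frac{p_i}{1 - p_i} = X_i^{T}\boldsymbol{\alpha}, \tag{1}
$$

$$
\mathrm{logit}(\mu_i) = \log\frac{\mu_i}{1 - \mu_i} = X_i^{T}\boldsymbol{\beta}, \tag{2}
$$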
where **α** and **β** are vectors of regression coefficients, and *X*_{i} = (1, *x*_{i1}, …, *x*_{ip})^{T} is the (*p* + 1)-dimensional covariate vector (including an intercept) for the *i*-th subject. We assume identical covariates for both parts of the model for simplicity of notation. One can instead allow for different sets of covariates for the two parts.
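As an illustration of the model just described, the log-likelihood of the conventional two-part model could be evaluated as in the following minimal Python sketch. It assumes the mean/dispersion parameterization Beta(*μϕ*, (1 − *μ*)*ϕ*) and logit links in both parts; the function names (`expit`, `beta_logpdf`, `ctp_loglik`) are hypothetical and are not taken from the authors' SAS code in S1 Code.

```python
import math


def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def beta_logpdf(y, mu, phi):
    """Log density of Beta(mu*phi, (1-mu)*phi) at 0 < y < 1 (assumed parameterization)."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y))


def ctp_loglik(alpha, beta, phi, X, y):
    """Log-likelihood of the conventional two-part Beta model.

    X is a list of covariate rows (each starting with 1 for the intercept);
    y holds the matching relative abundances in [0, 1).
    """
    total = 0.0
    for row, yi in zip(X, y):
        p = expit(sum(a * x for a, x in zip(alpha, row)))   # Part I: P(Y > 0)
        mu = expit(sum(b * x for b, x in zip(beta, row)))   # Part II: E(Y | Y > 0)
        if yi == 0.0:
            total += math.log1p(-p)                          # contribution of a zero
        else:
            total += math.log(p) + beta_logpdf(yi, mu, phi)  # contribution of a positive value
    return total
```

Zeros contribute only through Part I, while positive values contribute through both parts, which is why the two sets of coefficients can be estimated separately in the conventional model.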

### Marginalized two-part Beta regression model

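To obtain interpretable covariate effects on the marginal mean, we propose the following marginalized two-part Beta regression model. Let *v*_{i} = *E*(*Y*_{i}) be the marginal mean of *Y*_{i}. The first part of the proposed marginalized two-part model is the same as Part I in the conventional two-part model,

$$
\mathrm{logit}(p_i) = X_i^{T}\boldsymbol{\alpha}. \tag{3}
$$

In Part II, the marginal (unconditional) mean *v*_{i}, instead of the conditional mean *μ*_{i}, is modeled as a function of covariates:

$$
\mathrm{logit}(v_i) = X_i^{T}\boldsymbol{\gamma}. \tag{4}
$$

Since *v*_{i} = *p*_{i}*μ*_{i}, the conditional mean in the Beta component is recovered as *μ*_{i} = *v*_{i}/*p*_{i}.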

As we can see, the marginalized two-part model not only captures zero-inflation and skewness, as the conventional two-part model does, but also allows us to examine covariate effects on the overall marginal mean.
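The marginalized likelihood can be sketched in Python as follows. This is an illustration only, assuming logit links for both *p*_{i} and *v*_{i}, the relation *μ*_{i} = *v*_{i}/*p*_{i} implied by *E*(*Y*_{i}) = *p*_{i}*μ*_{i}, and the Beta(*μϕ*, (1 − *μ*)*ϕ*) parameterization. The function name `mtp_loglik` and the choice to return negative infinity when the implied *μ*_{i} leaves (0, 1) are assumptions made for this sketch, not part of the authors' SAS implementation.

```python
import math


def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def mtp_loglik(alpha, gamma, phi, X, y):
    """Log-likelihood of the marginalized two-part Beta model (illustrative)."""
    total = 0.0
    for row, yi in zip(X, y):
        p = expit(sum(a * x for a, x in zip(alpha, row)))   # P(Y > 0)
        v = expit(sum(g * x for g, x in zip(gamma, row)))   # marginal mean E(Y)
        mu = v / p                                          # conditional mean E(Y | Y > 0)
        if not 0.0 < mu < 1.0:
            return float("-inf")    # parameters outside the admissible region
        if yi == 0.0:
            total += math.log1p(-p)
        else:
            a, b = mu * phi, (1.0 - mu) * phi
            total += (math.log(p) + math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
                      + (a - 1.0) * math.log(yi) + (b - 1.0) * math.log1p(-yi))
    return total
```

The admissibility check reflects a constraint of the marginal parameterization: the marginal mean can never exceed the probability of a positive value, so *v*_{i} < *p*_{i} must hold for every subject.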

In the S1 Text, we show that the likelihood of the conventional two-part model can be reparameterized in terms of **α**, **γ** and *ϕ* in the marginalized model. However, the interpretation of covariate effects is different in the two frameworks, as elaborated in the next subsection.

The estimation of the marginalized two-part model can be carried out in SAS Proc NLMIXED (the main code is shown in S1 Code). To obtain starting values for the estimation, a logistic model and a Beta regression model are fitted for the binary part and the positive part, respectively. The estimates from these two models are then used as starting values for the marginalized two-part model. Convergence is determined by a threshold of 1 × 10^{−8} for the relative gradient, a common convergence criterion in SAS Proc NLMIXED. This criterion is satisfied in our simulations for all replicates, and in the real data analysis for all 131 OTUs.

### Interpretation of covariate effects

#### For the conventional model.

Using the conventional two-part model shown in Eqs (1) and (2), *β*_{j} is interpreted as the effect of a unit increase in the *j*-th covariate on the logit of the conditional mean of *Y*_{i} given that *Y*_{i} is positive. In many applications, however, the primary interest is to examine the impact of covariates on the overall marginal mean *E*(*Y*_{i}). For the conventional two-part model, we have

$$
E(Y_i) = p_i \mu_i = \frac{\exp(X_i^{T}\boldsymbol{\alpha})}{1 + \exp(X_i^{T}\boldsymbol{\alpha})} \cdot \frac{\exp(X_i^{T}\boldsymbol{\beta})}{1 + \exp(X_i^{T}\boldsymbol{\beta})}. \tag{5}
$$

Along the lines of [15], we can assess the effect of the *j*-th continuous covariate *x*_{ij} on the unconditional mean as
(6)
where
with *α*_{j} and *β*_{j} being the coefficients corresponding to *x*_{ij} in the conventional two-part model and *X*_{i(−j)}, *α*_{(−j)}, and *β*_{(−j)} being the corresponding vectors with the *j*-th covariate removed.

A straightforward calculation shows that (6) can be equivalently written as (7) where

As the logit transformation is a monotonically increasing function on the interval (0, 1), the hypothesis test of the covariate effects on the marginal mean is equivalent to that on its logit transformation. In Eq (7), the logit transformation of the marginal mean abundance is independent of covariate *x*_{ij} if both *α*_{j} and *β*_{j} are zero. However, if *α*_{j} and *β*_{j} have opposite signs, the logit transformation of the marginal mean abundance may still be independent of covariate *x*_{ij} even when neither coefficient is zero. Furthermore, the coefficients *c*_{1}(*α*_{j}, *β*_{j}) and *c*_{2}(*α*_{j}, *β*_{j}) in Eq (7) are functions of *α*_{j} and *β*_{j}. Thus, the independence between the marginal mean and covariate *x*_{ij} cannot be tested simply through the hypothesis *α*_{j} = 0 and *β*_{j} = 0, e.g., by the likelihood ratio test. Instead, the Delta method has to be applied to the quantity in Eq (7), which depends on *X*_{i(−j)} in a complicated way.
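The cancellation between *α*_{j} and *β*_{j} can be seen from a direct chain-rule calculation. The following is a sketch under the logit links of Eqs (1) and (2), written in terms of *p*_{i} and *μ*_{i}; it is not necessarily in the same algebraic form as Eqs (6) and (7) and does not use the coefficients *c*_{1} and *c*_{2}.

```latex
\begin{aligned}
E(Y_i) &= p_i \mu_i, \qquad
\frac{\partial p_i}{\partial x_{ij}} = \alpha_j\, p_i (1 - p_i), \qquad
\frac{\partial \mu_i}{\partial x_{ij}} = \beta_j\, \mu_i (1 - \mu_i), \\[4pt]
\frac{\partial E(Y_i)}{\partial x_{ij}}
  &= p_i \mu_i \left[ \alpha_j (1 - p_i) + \beta_j (1 - \mu_i) \right], \\[4pt]
\frac{\partial\, \mathrm{logit}\, E(Y_i)}{\partial x_{ij}}
  &= \frac{1}{E(Y_i)\,\{1 - E(Y_i)\}} \, \frac{\partial E(Y_i)}{\partial x_{ij}}
   = \frac{\alpha_j (1 - p_i) + \beta_j (1 - \mu_i)}{1 - p_i \mu_i}.
\end{aligned}
```

Because 1 − *p*_{i} and 1 − *μ*_{i} are both positive, the numerator can vanish with non-zero coefficients only when *α*_{j} and *β*_{j} have opposite signs, and whether it does depends on the values of *p*_{i} and *μ*_{i}, and hence on the other covariates.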

When the interest is to assess the effect of a discrete variable on response, e.g., placebo vs. treatment, Eq (7) no longer applies. Without loss of generality, consider a binary covariate *x*_{ik} taking value 0 or 1. Similar to [15], the difference in the logit transformation of the marginal mean with *x*_{ik} = 1 vs. *x*_{ik} = 0 is used to evaluate the impact on the expected marginal mean response.

Under the conventional two-part model, the difference between the logit transformations with *x*_{ik} = 1 and *x*_{ik} = 0 is
(8)
where

It is worth noting that *b*_{1}(*α*_{k}), *b*_{2}(*β*_{k}), and *b*_{3}(*α*_{k}, *β*_{k}) are all equal to 0 if *α*_{k} and *β*_{k} are 0. As with a continuous covariate, the logit transformation of the marginal mean abundance does not depend on the binary covariate *x*_{ik} if both *α*_{k} and *β*_{k} are zero. However, even when neither coefficient is zero, the transformed mean abundance may still be independent of the binary covariate *x*_{ik} when *α*_{k} and *β*_{k} have opposite signs. Eq (8) indicates that the independence between the response and the binary covariate *x*_{ik} cannot be ascertained by directly testing *α*_{k} = 0 and *β*_{k} = 0 by, e.g., the likelihood ratio test, as shown in the simulation studies and the real data analysis.
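A small numerical illustration of this point, using one of the simulation settings reported later in the paper (*α*_{0} = 1.5, *α*_{1} = 1, *γ*_{0} = −2.5, *γ*_{1} = 0): the Python sketch below computes the conventional Part II coefficients implied by these values through *μ* = *v*/*p*. The variable and function names are hypothetical, and the single binary covariate is an assumption matching the simulation design.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(u):
    return math.log(u / (1.0 - u))


# Marginalized-model parameters: the covariate affects presence (alpha1 = 1)
# but has no effect on the marginal mean (gamma1 = 0).
alpha0, alpha1 = 1.5, 1.0
gamma0, gamma1 = -2.5, 0.0


def conditional_mean(x):
    """Conditional mean mu = v / p for covariate value x (0 or 1)."""
    p = expit(alpha0 + alpha1 * x)
    v = expit(gamma0 + gamma1 * x)
    return v / p


# Implied Part II coefficients of the conventional two-part model.
beta0 = logit(conditional_mean(0))
beta1 = logit(conditional_mean(1)) - beta0


def marginal_mean_ctp(x):
    """Marginal mean rebuilt from the conventional parameters, E(Y) = p * mu."""
    return expit(alpha0 + alpha1 * x) * expit(beta0 + beta1 * x)


print(f"implied beta1 = {beta1:.3f}")
print(f"marginal mean at x=0: {marginal_mean_ctp(0):.6f}")
print(f"marginal mean at x=1: {marginal_mean_ctp(1):.6f}")
```

Here the implied *β*_{1} is negative (roughly −0.13) while *α*_{1} is positive, and the two marginal means coincide. A joint test of *α*_{1} = *β*_{1} = 0 would correctly reject in this setting even though the covariate has no effect on the marginal mean.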

#### For the marginalized model.

In the marginalized two-part model of Eqs (3) and (4), the effect of a continuous covariate *x*_{ij} on the marginal mean *E*(*Y*_{i}) can be characterized by

$$
\frac{\partial\, \mathrm{logit}\big(E(Y_i)\big)}{\partial x_{ij}} = \gamma_j, \tag{9}
$$
where *γ*_{j} is the coefficient corresponding to *x*_{ij} in Eq (4). Thus, the effect of the covariate *x*_{ij} on the marginal mean abundance is determined by its coefficient in the marginalized model. With the marginalized two-part model, we can estimate the coefficient *γ*_{j} as well as test the effect on the marginal mean.

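For a binary covariate in the marginalized two-part model, the difference in the logit transformation of the marginal mean with *x*_{ik} = 1 vs. *x*_{ik} = 0 can be expressed as

$$
\mathrm{logit}\big(E(Y_i \mid x_{ik} = 1)\big) - \mathrm{logit}\big(E(Y_i \mid x_{ik} = 0)\big) = \gamma_k. \tag{10}
$$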

One can see that the effect of a binary covariate *x*_{ik} on the marginal mean abundance is determined by its coefficient *γ*_{k} in the marginalized two-part model. The logit transformation of the marginal mean abundance with *x*_{ik} = 1 is bigger than that with *x*_{ik} = 0 when *γ*_{k} is positive, and the reverse is true when *γ*_{k} is negative.

## Results

In this section, simulation studies and real data analysis are presented to assess the performance of the proposed marginalized and the conventional two-part models. Results show that the proposed model outperforms the conventional model, which is consistent with the theoretical results.

### Simulation studies

In this section, we conduct simulation studies to evaluate the finite-sample performance of the proposed marginalized two-part model. To test the effect of the covariate on the overall marginal mean *E*(*Y*_{i}), likelihood ratio tests (LRT) are performed and compared under the marginalized two-part (MTP) model and the conventional two-part (CTP) model. In addition, the two sample T-test and the Wilcoxon rank sum test are also compared.

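We assume that, in both parts, there is only one binary covariate *x*_{1}, which is generated from the Bernoulli distribution with *p* = 0.5. However, according to the interpretation of the covariate effects in the preceding section, the proposed model can also be applied with multiple covariates. The response *y*_{i} is generated as follows:

$$
y_i \sim
\begin{cases}
0 & \text{with probability } 1 - p_i, \\
\mathrm{Beta}\big(\mu_i \phi,\ (1 - \mu_i)\phi\big) & \text{with probability } p_i,
\end{cases}
\qquad
\mathrm{logit}(p_i) = \alpha_0 + \alpha_1 x_{i1}, \quad
\mathrm{logit}(v_i) = \gamma_0 + \gamma_1 x_{i1},
$$

where *μ*_{i} = *v*_{i}/*p*_{i} is the conditional mean given that *y*_{i} is positive and *ϕ* is the dispersion parameter of the Beta distribution.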

In the simulation studies, 1000 samples of sizes 200 and 400 are generated. We set the parameters as *α*_{0} = 1.5, *γ*_{0} = −2.5, and *ϕ* = 1, while *α*_{1} and *γ*_{1} take different values according to which of the two criteria is under study: the type I error or the power.
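One replicate of this data-generating mechanism could be drawn as in the following Python sketch. It assumes the conditional mean is obtained as *μ* = *v*/*p* and that positive values follow Beta(*μϕ*, (1 − *μ*)*ϕ*); the function name `simulate_mtp` and the random seed are arbitrary choices for illustration, not the authors' code.

```python
import math
import random


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_mtp(n, alpha, gamma, phi, rng):
    """Draw n (x, y) pairs from the marginalized two-part Beta model
    with a single Bernoulli(0.5) covariate."""
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p = expit(alpha[0] + alpha[1] * x)   # probability of a positive value
        v = expit(gamma[0] + gamma[1] * x)   # marginal mean E(y)
        mu = v / p                           # conditional mean, must lie in (0, 1)
        if rng.random() < p:
            y = rng.betavariate(mu * phi, (1.0 - mu) * phi)
        else:
            y = 0.0
        data.append((x, y))
    return data


rng = random.Random(2018)  # arbitrary seed for reproducibility
sample = simulate_mtp(200, alpha=(1.5, 1.0), gamma=(-2.5, 0.0), phi=1.0, rng=rng)
n_zero = sum(1 for _, y in sample if y == 0.0)
print(f"{n_zero} zeros among {len(sample)} observations")
```

With *α*_{0} = 1.5 and *γ*_{0} = −2.5 the implied conditional mean stays well below 1, so the Beta parameters are valid for every setting of *α*_{1} and *γ*_{1} in {0, 1} considered in the paper.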

First, we evaluate the type I error for testing the null hypothesis *H*_{0}: *the binary covariate x_{1} has no effect on the overall marginal mean of y_{i}*. In the MTP model, this is equivalent to testing *γ*_{1} = 0, as shown in Eq (10). However, testing *α*_{1} = *β*_{1} = 0 in the CTP model is not equivalent to testing *H*_{0}. Specifically, according to Eq (8), even though neither of the coefficients is zero, the binary covariate *x*_{1} may still have no effect on the marginal mean. This means that the conventional model cannot control the type I error for testing *H*_{0} when both *α*_{1} and *β*_{1} are non-zero.
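For reference, the p-value of a likelihood ratio test can be computed from the maximized log-likelihoods of the full and null models. The Python sketch below assumes the usual asymptotic chi-square reference distribution, with 1 degree of freedom for testing *γ*_{1} = 0 in the MTP model and 2 degrees of freedom for jointly testing *α*_{1} = *β*_{1} = 0 in the CTP model; these degrees of freedom are an assumption of this sketch rather than something stated in the text, and the function name `lrt_pvalue` is hypothetical.

```python
import math


def lrt_pvalue(loglik_full, loglik_null, df):
    """P-value of a likelihood ratio test for df = 1 or df = 2."""
    stat = max(0.0, 2.0 * (loglik_full - loglik_null))
    if df == 1:
        # Chi-square(1) survival function: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Chi-square(2) survival function: exp(-stat / 2).
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")
```

The two closed forms avoid any dependence on a statistics library; for other degrees of freedom a general chi-square survival function would be needed.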

The results are shown in Fig 1. Type I errors are calculated under two settings: *α*_{1} = 0, *γ*_{1} = 0 and *α*_{1} = 1, *γ*_{1} = 0. For each setting, two *α*-levels are considered: 0.01 and 0.05. As we can see from Fig 1, under the first setting (*α*_{1} = 0, *γ*_{1} = 0), all the methods control the type I error reasonably well. Under the setting *α*_{1} = 1, *γ*_{1} = 0, the LRT under the MTP and the T-test control the type I error well, while the LRT under the CTP and the Wilcoxon test cannot control the type I error, especially the LRT under the CTP model. This is because, in this setting, testing *α*_{1} = *β*_{1} = 0 in the CTP model is not equivalent to testing the null hypothesis *H*_{0}.

**Fig 1.** The results in the upper panels correspond to the setting *α*_{1} = 0, *γ*_{1} = 0 and the lower panels correspond to the setting *α*_{1} = 1, *γ*_{1} = 0. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05. The dashed horizontal line in each panel represents the correct level. The results for sample size 400 can be found in S1 Fig in Supporting information.

The powers under two different settings, *α*_{1} = 0, *γ*_{1} = 1 and *α*_{1} = 1, *γ*_{1} = 1, are shown in Fig 2. As we can see, the LRT under the CTP and the MTP are the most powerful methods with the power close to 1 in all settings. The Wilcoxon test performs a little worse than the LRT while the T-test has the lowest power.

**Fig 2.** The upper panels show the powers corresponding to the setting *α*_{1} = 0, *γ*_{1} = 1 and the lower panels show the powers corresponding to the setting *α*_{1} = 1, *γ*_{1} = 1. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05. The powers for sample size 400 are shown in S2 Fig in Supporting information.

We also estimate the coefficients in the MTP model under the setting *α*_{1} = 1, *γ*_{1} = 1. The results in Table 1 demonstrate that the biases are negligible and the coverage probabilities are acceptably close to the nominal level 0.95 for all the model parameters. In addition, we observe small differences between the empirical standard errors and our estimates. The mean squared errors for sample size 400 are smaller than those for sample size 200.

According to the simulation results, the LRT under the MTP model has the best performance: it controls the type I error reasonably well and also achieves the best power. The T-test performs similarly in error control but is not as powerful as the LRT under the MTP model. The LRT under the CTP model is powerful; however, it fails to control the type I error. The Wilcoxon test performs worse than the LRT under the MTP model in both error control and power.

To assess the robustness of the proposed method, we consider a setting where positive responses are generated from another distribution. The only covariate *x*_{i} is generated from the Uniform distribution on (0, 1), and the response *y*_{i} has the following distribution:

$$
y_i \sim
\begin{cases}
0 & \text{with probability } 1 - p_i, \\
\mathrm{Bin}(100, \mu_i)/100 & \text{with probability } p_i,
\end{cases}
$$

where

$$
\mathrm{logit}(p_i) = \alpha_0 + \alpha_1 x_i,
$$

and the overall marginal mean *v*_{i} of the response is

$$
v_i = \frac{\exp(\gamma_0 + \gamma_1 x_i)}{1 + \exp(\gamma_0 + \gamma_1 x_i)}.
$$

Instead of the Beta distribution, positive responses are generated from the Binomial distribution Bin(100, *μ*_{i}) and then divided by 100 to make them bounded in (0, 1). As in the previous simulation, we set *μ*_{i} = *v*_{i}/*p*_{i}. The probability of having exactly 0 successes in 100 trials is (1 − *μ*_{i})^{100}, which is negligible with a proper choice of the parameters **α** and **γ**. Thus almost all the zero values in this zero-inflated Binomial data are structural zeros.

In this simulation study, 1000 samples of sizes 200 and 400 are generated. The parameters are set as *α*_{0} = 2, *γ*_{0} = −0.5, while *α*_{1} and *γ*_{1} take different values in order to calculate the type I errors and the powers.
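The misspecified data-generating mechanism could be sketched in Python as follows, again assuming *μ* = *v*/*p* and logit links for *p* and *v*. The Binomial draw is written as a sum of Bernoulli trials so that only the standard library is needed; the function name `simulate_zib` is hypothetical.

```python
import math
import random


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_zib(n, alpha, gamma, rng, trials=100):
    """Draw n (x, y) pairs where positive responses are Binomial proportions."""
    data = []
    for _ in range(n):
        x = rng.random()                       # Uniform(0, 1) covariate
        p = expit(alpha[0] + alpha[1] * x)     # probability of the Binomial part
        v = expit(gamma[0] + gamma[1] * x)     # marginal mean
        mu = v / p                             # success probability of the Binomial part
        if rng.random() < p:
            successes = sum(1 for _ in range(trials) if rng.random() < mu)
            y = successes / trials
        else:
            y = 0.0                            # structural zero
        data.append((x, y))
    return data


sample = simulate_zib(200, alpha=(2.0, 1.0), gamma=(-0.5, 0.0), rng=random.Random(7))
print(sum(1 for _, y in sample if y == 0.0), "zeros among", len(sample))
```

With *α*_{0} = 2 and *γ*_{0} = −0.5 the success probability *μ* is around 0.4 or larger, so a Binomial draw of exactly zero successes in 100 trials is extremely unlikely, in line with the remark that nearly all zeros are structural.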

The type I errors are calculated under two settings: *α*_{1} = 0, *γ*_{1} = 0 and *α*_{1} = 1, *γ*_{1} = 0. For each setting, two *α*-levels are considered: 0.01 and 0.05. As we can see from Fig 3, under both settings, the proposed marginalized model controls the type I error reasonably well. The conventional model controls the type I error under the setting *α*_{1} = 0, *γ*_{1} = 0 while it fails under the setting *α*_{1} = 1, *γ*_{1} = 0, similar to Fig 1.

**Fig 3.** The results in the upper panels correspond to the setting *α*_{1} = 0, *γ*_{1} = 0 and the lower panels correspond to the setting *α*_{1} = 1, *γ*_{1} = 0. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05. The dashed horizontal line in each panel represents the correct *α*-level.

As shown in Fig 4, both the marginalized model and the conventional model have power equal to 1 under all settings.

**Fig 4.** The powers in the upper panels correspond to the setting *α*_{1} = 0, *γ*_{1} = 1 and the lower panels correspond to the setting *α*_{1} = 1, *γ*_{1} = 1. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for level 0.05.

From the simulation studies, we can conclude that the proposed marginalized two-part Beta regression model is powerful and controls the type I error well. It is also robust against the model misspecification considered here.

### Real data analysis

In this section, the proposed marginalized two-part model and the conventional two-part model are applied to a real metagenomic dataset on mouse skin microbiota to investigate the effects of immunization on the relative abundances of 131 core OTUs [16, 17]. The data are publicly available at https://www.nature.com/articles/ncomms3462#supplementary-information. In addition to the likelihood ratio tests under CTP and MTP, the T test and the Wilcoxon rank sum test are also included for comparison. All the tests are carried out with Bonferroni’s correction.
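The Bonferroni correction across OTUs amounts to comparing each p-value with the nominal level divided by the number of tests. A minimal Python sketch is given below; the nominal level of 0.05 is an assumption, since the level used in the real data analysis is not stated in the text, and the function name is hypothetical.

```python
def bonferroni_reject(pvalues, level=0.05):
    """Return one boolean per test: True if significant after Bonferroni correction."""
    m = len(pvalues)
    threshold = level / m          # per-test significance threshold
    return [p <= threshold for p in pvalues]


# Example with 131 tests (one per OTU): only the first p-value clears the threshold.
example = [1e-5] + [0.5] * 130
print(sum(bonferroni_reject(example)), "of", len(example), "tests rejected")
```

With 131 OTUs and a nominal level of 0.05, the per-OTU threshold would be about 3.8 × 10^{−4}.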

The skin dataset contains the relative abundances of the most common 131 OTUs for 261 mouse skin samples, including 78 non-immunized and 183 immunized individuals. The skin data contain a large proportion of zero abundances: the percentage of zeros per OTU ranges from 0 to 68.97%, with an average of 33.03% and a median of 33.72% (see S3 and S4 Figs). The positive values are highly right-skewed, and the logit transformations in the MTP model and the CTP model capture the skewness (see S5 Fig).

Fig 5 shows the results for these four methods. As we can see, the LRT under the marginalized two-part model finds significant effects of immunization on 45 (namely, 31 + 14) OTUs. The LRT under the conventional two-part model has significant results for all these 45 OTUs and 14 (namely, 8 + 4 + 2) additional OTUs. The T test identifies 31 of these 45 OTUs and another 7 (namely, 4 + 3) OTUs. Similar to the LRT under the conventional two-part model, the Wilcoxon test identifies all these 45 OTUs and 21 (namely, 9 + 8 + 4) additional OTUs. Finally, 60 OTUs are not identified by any method.

**Fig 5.** Among all the 131 OTUs, 60 OTUs are not identified by any method and the other 71 OTUs are identified by at least one method. For example, “31” in the intersection of all sets indicates that 31 OTUs are identified by all methods, while “4”, located in the intersection of three sets, indicates that 4 OTUs are identified by three methods, namely, the T test, the CTP model, and the Wilcoxon test.
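The counts reported for the four methods can be cross-checked with a few lines of Python. The assignment of the counts 8, 2, 9, and 3 to specific regions of the Venn diagram is inferred here from the sums quoted in the text (it is not read off the figure itself), so it should be treated as an assumption of this sketch.

```python
# Each key lists the methods that identify an OTU; each value is the region count.
regions = {
    ("MTP", "CTP", "T", "Wilcoxon"): 31,
    ("MTP", "CTP", "Wilcoxon"): 14,
    ("CTP", "T", "Wilcoxon"): 4,
    ("CTP", "Wilcoxon"): 8,     # inferred placement
    ("CTP",): 2,                # inferred placement
    ("Wilcoxon",): 9,           # inferred placement
    ("T",): 3,                  # inferred placement
}

methods = ("MTP", "CTP", "T", "Wilcoxon")
totals = {m: sum(n for key, n in regions.items() if m in key) for m in methods}
identified = sum(regions.values())
not_identified = 131 - identified

print(totals)
print("identified by at least one method:", identified)
print("identified by no method:", not_identified)
```

Under this reading, the per-method totals are 45 (MTP), 59 (CTP), 38 (T test), and 66 (Wilcoxon), and the regions add up to the 71 identified OTUs, leaving 60 of the 131 unidentified, which agrees with the text.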

The LRT under the CTP model and the Wilcoxon test identify more OTUs than the LRT under the MTP due to their failure to control the type I error, as shown in the simulation studies (Fig 1). In fact, all 14 OTUs identified by the CTP but not by the MTP have significant coefficients in Part I of the two-part model. Out of the 21 OTUs that are identified by the Wilcoxon test but not by the MTP, 17 have significant coefficients in Part I of the two-part model. This corresponds to the setting *α*_{1} = 1, *γ*_{1} = 0, where both the CTP and the Wilcoxon test have much higher type I errors than the MTP (see the lower panels of Fig 1). Because it is less powerful than the MTP (Fig 2), the T test identifies fewer OTUs than the MTP.

Table 2 shows the 10 most significant OTUs from the MTP model. As in [17], for OTUs which cannot be classified at the species level, the next highest classifiable taxonomic level (denoted by ‘o’, ‘f’ and ‘g’ for order, family, and genus, respectively) is displayed. We use a number in the superscript to distinguish among different OTUs with the same classification name. The detailed results of all the 45 OTUs identified by the proposed MTP model are shown in S1 Table.

Moreover, for most of the 131 OTUs, the proposed marginalized two-part model fits the observed data better than the conventional two-part model. Fig 6 shows the density curves of the observed relative abundances, the predicted relative abundances using the MTP model, and the predicted relative abundances using the CTP model for two OTUs. As we can see, the MTP model fits the observed data much better than the CTP model.

**Fig 6.** The blue curve shows the density of the observed data. The green curve shows the density of predictions from the MTP model while the red curve represents the density of predictions from the CTP model.

## Discussion

In this paper, we propose a marginalized two-part Beta regression model for semi-continuous microbiome compositional data. This model allows investigators to obtain covariate effects on the marginal mean of the outcome. It takes into account the compositional and zero-inflation nature of the microbiome relative abundance data. It also has an unconditional interpretation of the covariate effect on the marginal mean. Our proposed marginalized two-part model has satisfactory performance in both simulation studies and real data analysis.

For count outcomes exhibiting many zeros, a zero-inflated Poisson (ZIP) regression model or a zero-inflated negative binomial (ZINB) model is often employed to examine the relation between covariates and the response. To model the overall population mean count directly, the marginalized ZIP model and the marginalized ZINB model were proposed by Long et al. [18] and Preisser et al. [19], respectively. However, in the case of bounded count data, the ZIP is questionable, while the zero-inflated binomial (ZIB) model and its extension for over-dispersion, the zero-inflated beta-binomial (ZIBB) model, are available [20–22]. It is of interest to develop a marginalized modeling approach for ZIB or ZIBB.

More recently, there has been increasing interest in analyzing correlated zero-inflated semi-continuous data. The correlation may stem from the structure of clustered data or from longitudinal data where repeated measures are correlated for the same subject. Typically, random effects are included to account for the correlations between observations [10, 15, 23–25]. However, a similar limitation exists in these two-part random effects models, as they cannot account for covariate effects on the marginal mean. Recently, Smith et al. [26] proposed a marginalized two-part model for longitudinal semicontinuous data based on the log-skew normal distribution for positive values. In future studies, we will extend our marginalized two-part model to correlated semi-continuous data bounded by 0 and 1.

Finally, it is of interest to consider different microbiomes together, taking into account the constraint that the relative abundances of all OTUs sum to 1. Scealy and Welsh [27, 28] considered Kent models for such compositional data. It merits further consideration to incorporate zero values in the Kent model framework.

## Supporting information

### S1 Code. SAS code.

The main SAS codes for the conventional two-part model and the proposed marginalized two-part model are shown in this section.

https://doi.org/10.1371/journal.pcbi.1006329.s002

(PDF)

### S1 Fig. Type I errors for the sample size 400.

This figure shows the type I errors of the four methods for sample size 400. The results in the upper panels correspond to the setting *α*_{1} = 0, *γ*_{1} = 0 and the lower panels correspond to setting *α*_{1} = 1, *γ*_{1} = 0. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for significance level 0.05. The dashed horizontal line in each panel represents the significance level.

https://doi.org/10.1371/journal.pcbi.1006329.s003

(TIF)

### S2 Fig. Powers for the sample size 400.

This figure shows the powers of the four methods for sample size 400. The upper panels show the power corresponding to the setting *α*_{1} = 0, *γ*_{1} = 1 and the lower panels show the power corresponding to the setting *α*_{1} = 1, *γ*_{1} = 1. In each setting, the left panel shows the results for significance level 0.01 and the right panel shows the results for significance level 0.05.

https://doi.org/10.1371/journal.pcbi.1006329.s004

(TIF)

### S3 Fig. Zero-inflation of the skin data.

The figure shows the distributions of relative abundances of 6 OTUs. From the upper panel to the lower panel and from the left to the right, the proportions of zero values for these 6 OTUs are 0.77%, 3.45%, 4.97%, 14.18%, 29.89%, and 48.28%, respectively.

https://doi.org/10.1371/journal.pcbi.1006329.s005

(TIF)

### S4 Fig. The figure shows the percentages of zero abundance in the 261 mouse skin samples for all 131 core OTUs.

The lower quartile and the upper quartile of the percentages are 20.11% and 48.28%, respectively.

https://doi.org/10.1371/journal.pcbi.1006329.s006

(TIF)

### S5 Fig. Skewness of the skin data.

The figure shows the histogram of the relative abundance for 6 OTUs. The first one in every panel is the histogram of the OTU in the original scale, while the second one in every panel shows the histogram after logit transformation.

https://doi.org/10.1371/journal.pcbi.1006329.s007

(TIF)

### S1 Table. The detailed results of the MTP model.

The table shows the detailed results of all the 45 OTUs that are identified by the proposed MTP model.

https://doi.org/10.1371/journal.pcbi.1006329.s008

(PDF)

## References

- 1. Gilbert JA, Meyer F, Bailey MJ. The Future of microbial metagenomics (or is ignorance bliss?). Isme Journal. 2011;5(5):777–779. pmid:21107444
- 2. Everard A, Cani PD. Diabetes, obesity and gut microbiota. Best Practice & Research Clinical Gastroenterology. 2013;27(1):73–83.
- 3. Musso G, Gambino R, Cassader M. Obesity, diabetes, and gut microbiota: the hygiene hypothesis expanded? Diabetes Care. 2010;33(10):2277–2284. pmid:20876708
- 4. Lewis JD, Chen EZ, Baldassano RN, Otley AR, Griffiths AM, Lee D, et al. Inflammation, Antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease. Cell Host & Microbe. 2015;18(4):489–500.
- 5. Srinivasan S, Hoffman NG, Morgan MT, Matsen FA, Fiedler TL, Hall RW, et al. Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. Plos One. 2012;7(6):e37818. pmid:22719852
- 6. Garrett WS. Cancer and the microbiota. Science. 2015;348(6230):80–86. pmid:25838377
- 7. Schwabe RF, Jobin C. The microbiome and cancer. Nature Reviews Cancer. 2013;13(11):800–812. pmid:24132111
- 8. Tyler AD, Smith MI, Silverberg MS. Analyzing the human microbiome: a “how to” guide for physicians. American Journal of Gastroenterology. 2014;109(7):983–93. pmid:24751579
- 9. Manning WG. A two-part model of the demand for medical care: preliminary results from the health insurance study. Health, Economics, and Health Economics. 1981; p. 103–123.
- 10. Chen EZ, Li H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016;32(17):2611–2617. pmid:27187200
- 11. Ospina R, Ferrari SLP. A general class of zero-or-one inflated beta regression models. Computational Statistics & Data Analysis. 2012;56(6):1609–1623.
- 12. Peng X, Li G, Liu Z. Zero-Inflated Beta Regression for Differential Abundance Analysis with Metagenomics Data. Journal of Computational Biology. 2015;23(2):102–110.
- 13. Smith VA, Preisser JS, Neelon B, Maciejewski ML. A marginalized two-part model for semicontinuous data. Statistics in Medicine. 2014;33(28):4891–4903. pmid:25043491
- 14. Tobin J. Estimation of Relationships for Limited Dependent Variables. Econometrica. 1958;26(1):24–36.
- 15. Liu L, Strawderman RL, Cowen ME, Shih YC. A flexible two-part random effects model for correlated medical costs. Journal of Health Economics. 2010;29(1):110–123. pmid:20015560
- 16. Ban Y, An L, Jiang H. Investigating microbial co-occurrence patterns based on metagenomic compositional data. Bioinformatics. 2015;31(20):3322–3329. pmid:26079350
- 17. Srinivas G, Möller S, Wang J, Künzel S, Zillikens D, Baines JF, et al. Genome-wide mapping of gene-microbiota interactions in susceptibility to autoimmune skin blistering. Nature Communications. 2013;4(9):2462. pmid:24042968
- 18. Long DL, Preisser JS, Herring AH, Golin CE. A marginalized zero-inflated Poisson regression model with overall exposure effects. Statistics in Medicine. 2014;33(29):5151–5165. pmid:25220537
- 19. Preisser JS, Das K, Long DL, Divaris K. Marginalized zero-inflated negative binomial regression with application to dental caries. Statistics in Medicine. 2016;35(10):1722–1735. pmid:26568034
- 20. Skrondal A, Rabe-Hesketh S. Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models. London: Chapman & Hall; 2004.
- 21. Albert JM, Wang W, Nelson S. Estimating overall exposure effects for zero-inflated regression models with application to dental caries. Statistical Methods in Medical Research. 2014;23(3):257–278. pmid:21908419
- 22. Gilthorpe MS, Frydenberg M, Cheng Y, Baelum V. Modelling count data with excessive zeros: The need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data. Statistics in Medicine. 2009;28(28):3539–3553. pmid:19902494
- 23. Olsen MK, Schafer JL. A Two-Part Random-Effects Model for Semicontinuous Longitudinal Data. Journal of the American Statistical Association. 2001;96(454):730–745.
- 24. Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data with clumping at zero. Statistical Methods in Medical Research. 2002;11(4):341–355. pmid:12197301
- 25. Liu L, Strawderman RL, Johnson BA, O’Quigley JM. Analyzing repeated measures semi-continuous data, with application to an alcohol dependence study. Statistical Methods in Medical Research. 2016;25(1):133–152. pmid:22474003
- 26. Smith VA, Neelon B, Preisser JS, Maciejewski ML. A marginalized two-part model for longitudinal semicontinuous data. Statistical Methods in Medical Research. 2017;26(4):1949–1968. pmid:26156962
- 27. Scealy JL, Welsh AH. Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society. 2011;73(3):351–375.
- 28. Scealy JL, Welsh AH. Fitting Kent models to compositional data with small concentration. Statistics & Computing. 2014;24(2):165–179.