Inferring Dynamic Signatures of Microbes in Complex Host Ecosystems

The human gut microbiota comprise a complex and dynamic ecosystem that profoundly affects host development and physiology. Standard approaches for analyzing time-series data of the microbiota involve computation of measures of ecological community diversity at each time-point, or measures of dissimilarity between pairs of time-points. Although these approaches, which treat data as static snapshots of microbial communities, can identify shifts in overall community structure, they fail to capture the dynamic properties of individual members of the microbiota and their contributions to the underlying time-varying behavior of host ecosystems. To address the limitations of current methods, we present a computational framework that uses continuous-time dynamical models coupled with Bayesian dimensionality adaptation methods to identify time-dependent signatures of individual microbial taxa within a host as well as across multiple hosts. We apply our framework to a publicly available dataset of 16S rRNA gene sequences from stool samples collected over ten months from multiple human subjects, each of whom received repeated courses of oral antibiotics. Using new diversity measures enabled by our framework, we discover groups of both phylogenetically close and distant bacterial taxa that exhibit consensus responses to antibiotic exposure across multiple human subjects. These consensus responses reveal a timeline for equilibration of sub-communities of micro-organisms with distinct physiologies, yielding insights into the successive changes that occur in microbial populations in the human gut after antibiotic treatments. Additionally, our framework leverages microbial signatures shared among human subjects to automatically design optimal experiments to interrogate dynamic properties of the microbiota in new studies. Overall, our approach provides a powerful, general-purpose framework for understanding the dynamic behaviors of complex microbial ecosystems, which we believe will prove instrumental for future studies in this field.

Here, π are mixture weights (probabilities) such that K k=1 π k = 1, θ k are the parameters for mixture component k, and φ(·; ·) is a probability density function for the mixture components.
For the MC-TIMME model, the mixture components are the continuous-time and continuous-valued prototype signatures. Each prototype signature is characterized by a set of variables that specify the shape of the signature. The number of variables used to model the shape of each signature is adaptively learned by MC-TIMME. Additionally, latent (hidden) variables are used to assign the observed sets of sequencing counts to prototype signatures. Figure 1 depicts the MC-TIMME generative model structure in graphical model form with plate notation. The model is hierarchical, with variables at higher levels parameterizing prior distributions for variables at lower levels.
The data generation model is as follows. Assume that we have observed counts of sequencing reads for O s refOTUs in S subjects at T s time-points, with each data point denoted by y sot . Prototype signatures are associated with mixture weights π and parameters Θ. Each refOTU o in subject s is assigned to a prototype signature via a variable z so | π ∼ Multinomial(π). For a given prototype signature k, θ k is a parameter vector specifying a continuous-time and continuous-valued function, as described in Section 3. We assume that the conditional distribution for the observation y sot is the Negative Binomial Distribution (NBD), with mean parameterized by the function f (t, θ k ), and variance controlled by parameters : Here, γ so and ψ st are offset terms that specialize the underlying prototype signature to an individual signature for refOTU o in subject s (see Section 2). Thus, the data is assumed to be generated via a conditional Generalized Linear Model (GLM) using the Negative Binomial Distribution, with parameterization dependent on the prototype signature to which the data is assigned. This variable lies within three overlapping plates, which represent repetition over S subjects, O s refOTUs, and T s time-points. Arrows pointing into y sot indicate that this variable depends directly on several others. The variable z so represents a probabilistic assignment of refOTU o in subject s to a particular prototype signature. The variables γ so and ψ st are offset terms that specialize the prototype signature to a refOTU specific individual signature. The vector of variables controls the variability of observed data generated by each individual signature. The variables µ k0 , δ kX , δ kµ and λ k specify the shape of prototype signature k.
The variables c kµ and c kλ control the dimensionality of the prototype signature (e.g., the number of active variables). The infinity sign in the corner of the plate surrounding these variables indicates that there are a potentially unlimited number of prototype signatures. Note that the model also uses an additional level of hyperparameters for the prior distributions, which are not shown for clarity.
The mixture weights π are assumed to have a stick-breaking prior, π ∼ Stick(α), where α is a concentration parameter. The stick-breaking prior generates a countably infinite partition of the unit interval [0,1], with the "evenness" of the partition controlled by the concentration parameter α. The stick-breaking prior on π is defined constructively as: See [1] for further details regarding the stick-breaking prior formulation for the Dirichlet Process.

Signature offset terms
The term ψ st specifies the offset from a prototype signature, specific to subject s at time-point t. This offset serves to standardize the samples to compensate for differing numbers of sequencing reads. This offset is estimated from data using the formula: The term γ so specifies the offset from a prototype signature, specific to refOTU o in subject s. This offset serves to standardize refOTU o according to its average abundance over time, and thus highlights the shape of the individual signature, rather than its absolute magnitude, for comparing individual signatures across refOTUs. This offset is estimated from data using the formula: Here, ι = 0.25 to avoid taking the logarithm of zero.
3 Dynamical model for antibiotic pulse experiments

Specification
We assume that dynamics may differ over five intervals delimited by the antibiotic treatments in the experiments: (a) pre-antibiotic, (b) antibiotic treatment one, (c) post-antibiotic treatment one, (d) antibiotic treatment two, and (e) post-antibiotic treatment two. The main dynamics of interest, with regards to the extent of recovery or alterations in the commensal flora, occur on intervals (c) and (e). We assume that dynamics for prototype signature k on interval (c), the period after the first antibiotic treatment, are specified by the following differential equation: Solving for η kc (t), we get: The dynamical model on interval (c) thus specifies a relaxation process with initial value X kb that reaches to equilibrium level µ kc with relaxation time λ kc .
For the second post-antibiotic exposure interval, (e), we assume a similar model but with initial value X kd , equilibrium level µ ke , and relaxation time λ ke : For the pre-antibiotic exposure interval (a), and the two antibiotic exposure intervals (b) and (d), we assume constant values for the equilibrium levels: For the parameter controlling NBD variance, we assume it is equal to 1 on intervals (a), (c) and (e), and equal to 2 on the antibiotic treatment intervals.
We assume the parameters of adjacent intervals are dependent via Gaussian random walks: We specify the relaxation time parameters λ kc and λ ke as follows. To reflect the time-scale of the experiments, it's desirable to constrain these parameters within specified ranges. This is accomplished using softmax functions: {λ max − λ min } + λ min ξ kc ∼ Normal(β λc , ρ 2 λc ) Here, λ max and λ min specify maximum and minimum values for the relaxation time parameters.
We assume that λ kc and λ ke are depedent via a Gaussian random walk:

Dimensionality adaptation
To adapt the structural complexity of the dynamical model for prototype signatures, we introduce dimensionality changing variables, c kµ and c kλ . These variables are both discrete-valued, and act as "switches" to control the number of distinct equilibrium levels or relaxation time parameters utilized by each prototype signature.
The variable c kµ specifies one of four configurations for the equilibrium levels: The variable c kλ specifies one of two configurations for the relaxation time parameters:

Specification
For the parameters 1 and 2 , which control the NBD noise model variance, we assume a normal prior on transformed values of 1 and 2 , with hyperparameters m and σ : For α, the DP concentration parameter, we assume a gamma prior with hyperparameters ω α : For β 0 , the mean for pre-antibiotic equilibrium levels µ k0 , we assume a normal prior with hyperparameters m β 0 and σ β 0 : β 0 ∼ Normal(m β 0 , σ 2 β 0 ) For ρ 0 , the variance for pre-antibiotic equilibrium levels µ k0 , we assume an inverse χ 2 distribution with hyperparameters v ρ 0 and ν ρ 0 : For ρ µ and ρ X , the variances for δ kµ and δ kX variables, we assume inverse χ 2 distributions with corresponding v and ν hyperparameters: For β λc , the mean for ξ kc , which controls the relaxation time on interval (c), we assume a normal distribution with hyperparameters m β λc and σ β λc : For ρ λc , the variance for ξ kc , we assume an inverse χ 2 distribution with hyperparameters v ρ λc and ν ρ λc : For ρ λe , the variance for ξ ke , we assume an inverse χ 2 distribution with hyperparameters v ρ λe and ν ρ λe : For c kµ and c kλ , the parameters controlling dimensionality of prototype signature dynamics models, we assume a multinomial distribution with parameters π µ and π µ , and Dirichlet distribution priors with hyperparameters ω πµ and ω π λ .

Setting hyperparameters
For the concentration parameter α ∼ Gamma(ω α ), we set its hyperparameters ω α1 = ω α2 = 10 −5 for a weak prior with mean 1 and variance 10 5 . See Section 7.1 for a sensitivity analysis of these hyperparameters. We set the means and variances of prior distributions empirically from the data as follows. To set the hyperparameters, m and σ , specifying the priors for the variables controlling the NBD variances in the data noise model, we first computed maximum likelihood estimates for NBD variance parameters using the antibiotic exposure time-points for each refOTU. We then log transformed these estimates, and set m 1 and σ 1 equal to the means and variances of the log transformed estimates. The same procedure was done, using the pre-antibiotic exposure time-points, to set m 2 and σ 2 .
We set the hyperparameters specifying priors for the variables controlling the prototype signature dynamical models as follows. We first log transformed the counts data after adding a small increment of ι = 0.25 to avoid taking the log of zero.
For the antibiotic exposure interval (a), we computed the mean, µ soa and variance σ 2 soa of the transformed data for each refOTU o in each subject s. We set m β 0 = mean( µ soa ) and σ 2 β 0 = var( µ soa ). For the two post-antibiotic exposure intervals (c) and (e), we performed a non-linear least squares fit to estimate the parameters in equation 2 or 3 on the transformed data for each refOTU in each subject. The estimates for the equilibrium levels µ soc and µ soe , and relaxation time parameters λ soc and λ soe were then used to set σ ρµ , m β λc , σ β λc , and σ β λe respectively, using the means or variances of the estimated parameters.
For the antibiotic exposure intervals (b) and (d), we computed the means, X sob and X sod , for each refOTU and subject. We then set σ 2 ρ X equal to the variance of estimated increments X sob − µ soa and X sod − µ soc .
The hyperparameter ν in the inverse χ 2 distribution can be interpreted as a prior sample size. We set ν = 5, a small prior sample size, for a weak prior.
The hyperparameters ω πµ and ω π λ control the Dirichlet prior on the dimensionality changing variables c kµ and c kλ . These hyperparameters can be interpreted as prior sample sizes. We set ω πµ = (1, 1, 1, 1) and ω π λ = (1, 1), to yield symmetric Dirichlet distributions specifying a prior sample size of 1. See Section 7.1 for a sensitivity analysis of these hyperparameters.

Model inference
The inference task for MC-TIMME is to estimate the posterior distribution of all model variables given the observed data. We developed an efficient Markov Chain Monte Carlo (MCMC) algorithm for this task. Our solution uses a combination of Gibbs sampling, Metropolis-Hastings (MH)/Reversible Jump (RJ), and auxiliary variable augmentation steps.
The MCMC sampling loop is as follows: 1. Sample z so , the assignment of refOTU o in subject s to a prototype signature.
5. Sample β and ρ variables controlling means and variances of dynamical model variables.
6. Sample π µ and π µ variables specifying the distribution on dimensionality changing variables.

Step 1: sampling assignments of refOTUs to prototype signatures
For the first step in the MCMC loop, we use the Gibbs sampling method outlined in [2]. We sample the assignment z so for refOTU o in subject s, conditional on all the data and other model variables including assignments of the other refOTUs. In this step there are two cases: refOTU o in subject s is assigned to an existing prototype signature k, or it is assigned to a new prototype signature. The update equations are then: Here, n −so k is the number of refOTUs assigned to prototype signature k excluding refOTU o in subject s, z −so is the assignments of all refOTUs excluding refOTU o in subject s, and O is the total number of refOTUs across all subjects. The variable θ * represents the parameters for a new prototype signature. The data likelihood for y so is given by: The Gibbs sampling procedure thus either assigns the refOTU to an existing prototype signature, or generates a new prototype signature. Because the prior distribution for the prototype signature variables is non-conjugate, it is not possible to analytically integrate out θ * . Thus, we follow the method outlined by [2], and sample θ * from its prior distribution. Because the sample from the prior distribution is unbiased, this method will sample correctly from the underlying posterior distribution.

Step 2: sampling prototype signature shape variables
For the second step in the MCMC loop, sampling prototype signature shape variables, µ k0 , δ k , λ k , c kµ and c kλ , we use MH/RJ steps.
For a particular prototype signature k, conditional on λ k and c k , we have a Generalized Linear Model (GLM) using the NBD. We can write the parameterization of prototype signature k at time t in terms of canonical GLM parameters as: The NBD variance is given by: Here, h(·) denotes the link function, which in this case equals the natural logarithm function. The variable A kt is a vector of regression coefficients that depends on the relaxation time and dimensionality changing parameters, and G kt is the vector of independent variables for the GLM. Note that A kt and G kt will change dimension depending on the setting of c k . We leverage this formulation of the model as a conditional GLM to generate good linear approximations to use as MH proposal distributions, using the method described by [3]. These approximations are directly connected to Weighted Least Squares (WLS) methods, which are often used for maximum likelihood estimates for GLM parameters.
The procedure for creating a WLS estimate is as follows. We first create transformed data values y sot for all data points assigned to prototype signature k: We then use the transformed data values to generate a diagonal weight matrix W k : The moments for the proposal distribution used in the MH steps are then given by: The vector d represents the prior mean for the GLM, i.e., d 1 = β 0 and zero otherwise. The matrix Λ is the prior covariance for the GLM, i.e., Λ 11 = ρ 2 0 and zero otherwise. The MCMC iteration number is indexed by j, and the notation (j − 1 → j) indicates a move from state j − 1 to j in the MCMC sampler. To match dimensionality, the dimension of the target state is always used, e.g., A For each prototype signature k, the MH/RJ move from state j − 1 to j will then sample c Here, κ λ are parameters used to tune the acceptance rate to ≈ 0.23, the conventionally used acceptance rate for multi-dimensional MH (see [4] for details).
The acceptance probability q(j − 1 → j) for the move is given by: kλ | β, ρ, π µ , π λ ) is the product of prior distributions for µ k , δ k , λ k , c kµ and c kλ , as described in Section 3.
The term J (j−1→j) indicates the jumping probability, and is given by the product of the proposal distributions for ξ The function l k (y; µ k0 δ k , ) is the data likelihood restricted to data assigned to prototype signature k: l k (y; µ k0 , δ k , ) = s,o : zso=k t Neg-Bin(y so ; e η k (t)+γso+ψst , )

Step 3: sampling variables controlling NBD variance
We use MH steps for sampling 1 and 2 . In this case, because we are sampling a single variable, we use a simple proposal distribution of Normal(ξ The parameter κ i is used to tune the acceptance rate to ≈ 0.44, the conventionally used acceptance rate for single dimensional MH (see [4] for details).
The acceptance probability for the i MH steps is given by:

Step 4: sampling the concentration parameter
For sampling the concentration parameter α, we use an efficient auxiliary variable sampling method as described in [5].

Step 5: sampling means and variances of dynamics variables
Sampling β and ρ is accomplished via straightforward Gibbs sampling steps [4], as these variables have conjugate priors.

5.6
Step 6: sampling variables specifying the distribution on dimensionality changing variables Sampling π µ and π µ is accomplished via straightforward Gibbs sampling steps [4], as these variables have conjugate priors.

Experimental design algorithm
We use a greedy optimization algorithm with the Bayesian D-optimality function g(·) defined in the main manuscript, to generate experimental designs ( Figure 2). Learning of sequential or cross-subject experimental designs was performed as follows. For sequential designs, a small set of initial time-points T e was used for estimating hyper-parameters: the first and last days of each antibiotic treatment, the first and last days of the entire series, and the first day before and after each antibiotic treatment. MCMC samples were collected using the time-points T e , which were then used as input into GreedyExpDesign to predict the next set of n time-points to sample. These time-points were then concatenated with T e , and the process described above was repeated until the desired total number of timepoints was selected. For efficiency, we chose n = 6-8, depending on the number of time-points sampled in the original dataset. For cross-subject designs, data from all time-points from subject A were used as input to MC-TIMME, and MCMC samples were collected. These MCMC samples were then used as input to GreedyExpDesign, to predict time-points to sample for subject B. This procedure was repeated for all pairs of subjects A and B. For each strategy, we produced a design that used 75% of the time-points from the original experiments. To eliminate bias due to the chosen starting set T e , we required all designs to include this set of time-points. To estimate predictive accuracy, we used values for all refOTUs at the selected time-points, to fit the MC-TIMME model, and then calculated predictive error on the remaining 25% of held-out time-points. We used root mean squared error (RMSE) as the measure of predictive error, which is the square root of the sum of squared differences between actual and predicted sequencing counts, averaged over the number of refOTUs and time-points.

Sensitivity testing of model dimensionality adaptation
The MC-TIMME model adapts in dimensionality in two ways: (1) the number of equilibrium levels and relaxation time parameters used by a prototype signature, and (2) the number of prototype signatures used to model the ecosystem. The first type of adaptation is controlled by the discrete-valued variables c kµ and c kλ for each prototype signature k. These variables are multinomial distributed with parameter vectors π µ and π λ . We placed Dirichlet priors on the multinomial distributions. The second type of adaptation is governed by a Dirichlet Process with concentration parameter α. We placed a gamma prior with hyperparameters ω α on the concentration parameter α. The expected number of prototype signatures K is given by K ≈ α ln(N/α + 1) [7], where N is the total number of refOTU time-series (756 for the dataset we analyzed).
To test the sensitivity of MC-TIMME's dimensionality adaptation to hyperparameter settings, we varied the hyperparameters on c kµ , c kλ , and α. For each test, we ran the algorithm 8 times, and then computed the median and 95% credible intervals for Signature Diversity scores. In addition, we computed Normalized Mutual Information (NMI) and Rand Index (RI) scores. The NMI and RI scores are external criteria of clustering quality, which measure how well a particular cluster (i.e., assignment of refOTU time-series to prototype signatures) compare to a gold standard [6]; NMI and RI values of 1 indicate perfect correspondance between the clusterings. In this case, we used 8 runs of MC-TIMME using the default hyperparameter settings as the gold standard. The median and credible intervals for the NMI and RI scores were computed based on all pairs of scores between the default runs and a given test run. Figure 3 shows the results of our sensitivity testing.
For c kµ , we tested a non-symmetric Dirichlet prior "skewed" toward π µ1 = 0.75 with the Dirichlet prior equal to (3, 1, 1, 1), e.g., favoring no change in equilibrium levels, with 3 : 1 odds instead of the default of 1 : 1. Most of the signature diversity and NMI/RI score medians were virtually identical, and there were no significant deviations in scores as assessed by overlap of credible intervals. Only the SD1 λ score showed a non-significant trend of being slightly elevated under the "skewed" prior.
For c kλ , we tested a non-symmetric Dirichlet prior "skewed" toward π λ1 = 0.75 with the Dirichlet prior equal to (3, 1), e.g., favoring no change in the relaxation time parameter on the second antibiotic exposure interval, with 3 : 1 odds instead of the default of 1 : 1. All the signature diversity and NMI/RI score medians were virtually identical, and there were no significant deviations in scores as assessed by overlap of credible intervals.
For the concentration parameter α, we tested two extreme values for the gamma prior mean, 10 −3 and 10 3 . These settings represent 1000X decreases or increases over the default mean of 1. For both the 10 −3 and 10 3 settings, most signature diversity and NMI/RI score medians were virtually identical, and there were  1, 1, 1), c λ non-unif. = setting of Dirichlet prior on π λ to (3, 1), alpha 10 −3 and 10 3 = setting of gamma prior mean on α to 10 −3 or 10 3 . NMI = Normalized Mutual Information, RI = Rand Index, SD3 = Signature Diversity type 3 score (inter-ecosystem), SD1 µ and SD1 λ = Signature Diversity type 1 scores (intra-signature diversity) for subjects D-F, SD2 = Signature Diversity type 2 scores (intra-ecosystem diversity) for subjects D-F. no significant deviations in scores as assessed by overlap of credible intervals.
Our tests indicate that the model dimensionality (complexity) inferred by MC-TIMME showed little sensitivity to changes in hyperparameter settings. A contributing factor to this result is the hierarchical design of the MC-TIMME model. The model does not require direct setting of parameters for priors on the variables affecting dimensionality adaptation, but instead uses a second level of hyperparameters. In general, such hierarchical Bayesian models are less sensitive to prior distribution choices, as the parameters of prior distributions are not fixed, and thus can adapt based on the underlying data [4].

Data simulations
We tested the ability of MC-TIMME to recover a "gold standard" set of signatures, by simulating data with different amounts of added noise. Because a gold standard experimental dataset was not available, we used the recovered signatures as described in the main manuscript as the reference signatures; we believe this represents a more realistic scenario than a "toy" example with a small number of signatures. For each level of noise, 8 simulated datasets were created by generating counts data from the reference signatures using a negative binomial distribution (NBD) noise model. The NBD parameters controlling variance were set equal to their empirically determined values from the Dethlefsen et al. data (1.0X setting, or ≈ 60% coefficient of variation for counts). We also tested a lower level of noise (0.5X setting) and a higher level of noise (1.5X setting). We computed the percent error for Signature Diversity scores and median relaxation time parameter estimates, with 8 runs of the original dataset used as the gold standard. Normalized Mutual Information (NMI) and Rand Index (RI) scores were also computed, as described in the previous section. Figures 4 and 5 depict the results of our data simulations. The simulations using noise levels equal to those in the original dataset ("1.0X ") are indicative of the performance of MC-TIMME in the presence of a realistic amount of noise. However, noise levels in the original data are likely to be higher than in more recent 16S phylotyping data sets, as Dethlefsen et al. produced data using the now obsolete Roche 454 FLX chemistry and with a relatively small number of sequencing reads per sample. In any case, the "1.0X " simulations produced error rates for Signature Diversity scores of <≈ 10%, and for median relaxation time parameters on the first post-antibiotic exposure interval of ≈ 25 − 30%. The error rates for the median relaxation time parameters on the second post-antibiotic exposure interval were higher, at ≈ 30 − 45%. These higher error rates were not unexpected, as the second post-antibiotic exposure intervals were sampled at considerably fewer time-points than the first in the original experiments. For all simulations, NMI scores were reduced by <≈ 20%, and RI scores were reduced by < 3%. The NMI score is likely to be more representative of signature assignment consistency; the RI score has been shown to "saturate" with many clusters in data [6]. We note that the number of prototype signatures was systematically lower with increasing amounts of noise (not shown), suggesting that some fine details of signature shapes were lost as noise was added.  Each bar represents the percent error in estimating Signature Diversity scores or median relaxation time parameters, relative to the original dataset. SD3 = Signature Diversity type 3 score (inter-ecosystem), SD1 µ and SD1 λ = Signature Diversity type 1 scores (intra-signature diversity) for subjects D-F, SD2 = Signature Diversity type 2 scores (intra-ecosystem diversity) for subjects D-F, λ 1 = median relaxation time parameter for the first post-antibiotic exposure interval for subjects D-F, λ 2 = median relaxation time parameter for the second post-antibiotic exposure interval for subjects D-F.
Overall, our data simulations show that MC-TIMME can accurately estimate model parameters for complex microbial ecosystems in the presence of realistic amounts of experimental noise. Signature diversity scores exhibited the lowest error rates, likely because these scores are estimated using all refOTUs from each subject, or from all subjects in the case of the SD3 score. Relaxation time parameter estimates, which are estimated from prototype signatures comprised of potentially small groups of refOTUs, exhibited higher error rates. These results provide estimates of bounds on distinguishability of differences in model parameters in the presence of noise. However, we note that these results may vary with the complexity of the microbial ecosystems used as the reference "gold standard."

Subject effects
To test the effects on model estimation of excluding data from a particular subject, we ran MC-TIMME on all pairs of subjects. Each test consisted of 8 MCMC runs. We compared the test runs to 8 runs on the complete dataset, using all 3 subjects. Figures 6 and 7 display the results of these tests. As depicted in Figure 6, the error rate for estimating Signature Diversity scores was <≈ 10% when subject D or F was excluded. However, the error rate was higher when subject E was excluded. The error rates for the relaxation time parameters followed similar trends, but were overall higher than for the Signature Diversity scores, as noted in the previous section. Subjects D and F were least similar, in terms of sharing of prototype signatures (SD3 score of 0.70), and subjects E and F were most similar (SD3 score of 0.31). Inclusion of subject E with subject D or F lowered the error rate, as subject E shared a substantial number of prototype signatures with both subjects D and F. This sharing of prototype signatures effectively increases statistical power, because estimates of model parameters are derived from groups of refOTUs rather than single instances.
Excluding any given subject did not affect the relative ordering of Signature Diversity scores. Further, the effect of excluding any subject was about the same as having a realistic level of noise present in the data, as shown in the previous section. Thus, given the level of noise in the dataset analyzed, MC-TIMME did not appear to lose resolution when data from one subject was excluded. However, studies with more subjects will be necessary to fully explore the effects of the number of subjects on model performance. For a large number of subjects, it may prove useful to extend MC-TIMME to use an adaptive Hierarchical Dirichlet Process prior to leverage similarities or differences among groups of subjects [8]. Row labels indicate the subjects that were included (e.g., DE = subjects D and E). NMI = Normalized Mutual Information, RI = Rand Index, SD3 = Signature Diversity type 3 score (inter-ecosystem; note that this applies only to the included subjects), SD1 µ and SD1 λ = Signature Diversity type 1 scores (intra-signature diversity) for subjects D-F, SD2 = Signature Diversity type 2 scores (intraecosystem diversity) for subjects D-F.