On the Origins and Control of Community Types in the Human Microbiome

Microbiome-based stratification of healthy individuals into compositional categories, referred to as “enterotypes” or “community types”, holds promise for drastically improving personalized medicine. Despite this potential, the existence of community types and the degree of their distinctness have been highly debated. Here we adopted a dynamic systems approach and found that heterogeneity in the interspecific interactions or the presence of strongly interacting species is sufficient to explain community types, independent of the topology of the underlying ecological network. By controlling the presence or absence of these strongly interacting species we can steer the microbial ecosystem to any desired community type. This open-loop control strategy still holds even when the community types are not distinct but appear as dense regions within a continuous gradient. This finding can be used to develop viable therapeutic strategies for shifting the microbial composition to a healthy configuration.


Author Summary
We coexist with a vast number of microbes that live in and on our bodies, and play important roles in physiology and disease. Two interesting phenomena have been observed in the human microbiome. The first is the stratification of healthy individuals based on the relative abundances of their microbes, which holds promise for drastically improving personalized medicine. The second is the astounding success of fecal microbial transplantation in treating certain diseases related to disordered microbiomes. Surprisingly, neither phenomenon has been analytically or quantitatively understood, despite a few early qualitative attempts. This work shows that through a dynamic systems and control theoretical approach the success of fecal microbial transplantation can be explained and that the microbiome-based stratification can arise from something as simple as the existence of strongly interacting species.

Introduction
Rather than simple passengers in and on our bodies, commensal microorganisms have been shown to play key roles in our physiology and in the evolution of several chronic diseases [1,2]. Many scientific advances have been made through the work of large-scale, consortium-driven metagenomic projects, such as the Metagenomics of the Human Intestinal Tract (Meta-HIT) [3] and the Human Microbiome Project (HMP) [4,5]. In particular, the HMP has analyzed the largest cohort and set of distinct, clinically relevant body habitats to date, in order to characterize the ecology of human-associated microbial communities [4]. These results thus delineate the range of structural and functional configurations normal in the microbial communities of a healthy population, enabling future characterization of the translational applications of the human microbiome.
A recent study proposed that a healthy gut microbiome falls within one of three distinct community types, which the authors termed "enterotypes" [6]. More specifically, the authors calculated the relative abundance profiles of microbiota at the genus level and then performed standard cluster analysis, finding three distinct clusters (enterotypes). Each enterotype is dominated by a particular genus (Bacteroides, Prevotella, or Ruminococcus) but is not affected by gender, age, body mass index, or nationality of the host. These results suggest that enterotyping could be an efficient way to stratify healthy human individuals. The development of personalized microbiome-based therapies would then simplify to shifting an unhealthy microbiome to one of the distinct healthy configurations.
A meta-analysis, however, suggested that enterotypes, or in general community types, could be an artifact of the small sample size in [6] and what one should expect is a continuous gradient with dense regions rather than distinct clusters [7]. The level of discreteness or continuity of the community types remains unclear. Interestingly, samples in the dense regions of this gradient are either highly abundant or deficient in Bacteroides [7], indicating that community types could still emerge as the dense regions within a continuous gradient. Indeed, some recent work actually supports the notion of distinct community types [8][9][10][11][12].
We still lack consensus on the nature and origins of community types [13][14][15][16][17]. In principle the presence of community types could be explained by several different mechanisms. For instance, there may be true multi-stability, i.e. multiple stable states with the same set of microbial species present in the same environment [18]. Although this type of multi-stability has been well discussed in macro-ecological systems [19], its detection in host-associated microbial communities is rather difficult and has not been demonstrated experimentally [15]. Host heterogeneity is another possible mechanism, leading to host-specific microbial dynamics (parameterized by host-specific intra- and inter-species interactions). If those interactions, which serve as parameters of the host-associated microbial ecosystems, can be classified into distinct groups, then we can numerically demonstrate that distinct community types will naturally emerge (S1 Text Sec. 6.2 and 7.1). However, the presence of classifiable microbial dynamics has not been experimentally detected. Moreover, the overwhelming success of Fecal Microbiota Transplantation (FMT) in treating recurrent Clostridium difficile infection (rCDI) suggests that host heterogeneity is likely playing a minor role in terms of its effect on intra- and inter-species interactions [20][21][22].
Here we proposed a simple mechanism, without assuming multi-stability or host heterogeneity, to explain the origins of community types. In particular, using a dynamic systems approach, we studied compositional shift as a function of species collection and demonstrated that with heterogeneous interspecific interactions, a phenomenon often observed in macroecology [23][24][25], community types can naturally emerge. Interestingly, this result is independent of the topology of the underlying ecological network. To our knowledge, this is the first quantitative attempt to explore the analytical basis of community types. Furthermore, community types, even when they weakly exist, can be manipulated efficiently by controlling the Strongly Interacting Species (SISs) only. This provides theoretical justification for translational applications of the human microbiome. Note that in this paper we use the term species in the general context of ecology, i.e. a set of organisms adapted to a particular set of resources in the environment, rather than the lowest taxonomic rank. One could think of organizing microbes by genus or operational taxonomical units as well.

Dynamic Model
The human microbiome is a complex and dynamic ecosystem [26]. When modeling a dynamic system we should first decide how complex the model needs to be so as to capture the phenomenon of interest. A detailed model of the intestinal microbiome would include mechanistic interactions among cells, spatial structure of the human intestinal tract, as well as host-microbiome interactions [27][28][29][30]. That level of detail however is not necessary for this study, because we are primarily interested in exploring the impact that any given species has on the abundance of other species. To achieve that, a population dynamics model such as the canonical Generalized Lotka-Volterra (GLV) model is sufficient [15,31]. Indeed, GLV dynamics leveraging current metagenome data has already been used for predictive modeling of the intestinal microbiota [32][33][34]. Consider a collection of $n$ species in a habitat with the population of species $i$ at time $t$ denoted as $x_i(t)$. The GLV model assumes that the species populations follow a set of ordinary differential equations
$$\dot{x}_i(t) = r_i x_i(t) + x_i(t) \sum_{j=1}^{n} a_{ij} x_j(t), \quad i = 1, \ldots, n, \tag{1}$$
where $\dot{(\cdot)} = \frac{d}{dt}(\cdot)$. Here $r_i$ is the growth rate of species $i$, $a_{ij}$ (when $i \neq j$) accounts for the impact that species $j$ has on the population change of species $i$, and the terms $a_{ii} x_i^2$ are adopted from Verhulst's logistic growth model [35]. By collecting the individual populations $x_i(t)$ into a state vector $x(t) = [x_1(t), \cdots, x_n(t)]^T$, Eq (1) can be represented in the compact form
$$\dot{x}(t) = \mathrm{diag}(x(t)) \, \big(r + A x(t)\big), \tag{2}$$
where $r = [r_1, \cdots, r_n]^T$ is a column vector of the growth rates, $A = (a_{ij})$ is the interspecific interaction matrix, and $\mathrm{diag}$ generates a diagonal matrix from a vector. Hereafter we drop the explicit time dependence of $x$.
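As a rough illustration of the GLV dynamics in Eq (2), the sketch below integrates a hypothetical two-species community in pure Python. The fixed-step Runge-Kutta integrator, the function names, and all parameter values are assumptions made for this example; they are not taken from the study.

```python
# Minimal GLV sketch (illustrative only; not the authors' implementation).

def glv_rhs(x, r, A):
    """Right-hand side of Eq (2): dx_i/dt = x_i * (r_i + sum_j a_ij * x_j)."""
    n = len(x)
    return [x[i] * (r[i] + sum(A[i][j] * x[j] for j in range(n))) for i in range(n)]

def rk4_step(x, r, A, dt):
    """One classical fourth-order Runge-Kutta step with a fixed step size (assumed)."""
    k1 = glv_rhs(x, r, A)
    k2 = glv_rhs([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], r, A)
    k3 = glv_rhs([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], r, A)
    k4 = glv_rhs([xi + dt * ki for xi, ki in zip(x, k3)], r, A)
    return [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def simulate(x0, r, A, t_end=100.0, dt=0.01):
    """Integrate the GLV dynamics from x0 up to time t_end."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        x = rk4_step(x, r, A, dt)
    return x

# Hypothetical two-species parameters, chosen only for illustration.
r = [1.0, 0.5]
A = [[-1.0, 0.2],
     [0.1, -1.0]]
x_final = simulate([0.3, 0.3], r, A)
```

For these assumed parameters the trajectory settles at the point where $r + Ax = 0$, which is the non-trivial steady state of the model.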
Next we discuss the notion of fixed point, or equivalently steady state, in the GLV dynamics. This notion is important in the context of the human microbiome, as the measurements taken of the relative abundance of intestinal microbiota in the aforementioned studies typically represent steady behavior [4,6]. In other words, the intestinal microbiota is a relatively resilient ecosystem [36,37], and until the next large perturbation (e.g. antibiotic administration or dramatic change in diet) is introduced, the system remains stable for months and possibly even years [38][39][40]. The fixed points of system Eq (2) are those solutions $x$ that satisfy $\dot{x} = 0$. The solution $x = 0$ (i.e. all species have zero abundance) is a trivial steady state. The set of non-trivial steady states contains those solutions $x^*$ such that $r + Ax^* = 0$. When the matrix $A$ is invertible, it follows that the non-trivial steady state $x^* = -A^{-1} r$ is unique [41].
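To make the steady-state formula concrete, the following worked example evaluates $x^* = -A^{-1} r$ for a hypothetical two-species community. The numerical values are assumptions chosen for illustration and do not come from the study.

```latex
% Hypothetical two-species example; all numbers are illustrative assumptions.
\begin{aligned}
r &= \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}, \qquad
A = \begin{bmatrix} -1 & 0.2 \\ 0.1 & -1 \end{bmatrix}, \qquad
\det A = 0.98 \neq 0, \\
x^* &= -A^{-1} r
     = \frac{1}{0.98} \begin{bmatrix} 1 & 0.2 \\ 0.1 & 1 \end{bmatrix}
       \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}
     = \frac{1}{0.98} \begin{bmatrix} 1.1 \\ 0.6 \end{bmatrix}
     \approx \begin{bmatrix} 1.122 \\ 0.612 \end{bmatrix}.
\end{aligned}
```

Both entries are positive, so in this example the unique non-trivial steady state is also a feasible one, with every species present at positive abundance.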
Our study ultimately investigated the impact that different collections of microbial species have on their steady state abundances. In Fig 1 we presented a detailed analysis showing that if we introduce a new species into the ecosystem in Eq (2), the shift of the steady state is proportional to the interaction strengths between the newly introduced species and the previously existing ones. Similarly, if two communities with the same dynamics differ by only one species, then it is the interaction strength of that species with regard to the rest of the community that dictates how far apart the steady states of the two communities will be. This analytical result indicates that heterogeneity of interspecific interactions could lead to the clustering of steady states, and hence the emergence of community types.
To systematically investigate how changes in species collection affect the steady state shift in the GLV dynamics, we assumed that two microbial species will interact in the same fashion regardless of the host. Otherwise, if the interactions are host specific and the dynamics are classifiable, we can show that distinct community types will emerge almost trivially (S1 Text Sec. 6.2 and 7.1).

Metacommunity and Local Communities
Consider a universal species pool, also referred to as a metacommunity [42], indexed by a set of integers $S = \{1, \ldots, n\}$, an $n \times n$ matrix $A$ representing all possible pairwise interactions between species, and a vector $r$ of size $n$ containing the growth rates for all the $n$ species. The global parameters for the metacommunity are completely defined by the triple $(S, A, r)$. We consider $q$ Local Communities (LCs), defined by sets $S^{[\nu]}$ that are subsets of $S$, denoting the species present in LC $\nu$ with $\nu = 1, \ldots, q$. This modeling procedure is inspired by the fact that alternative community assembly scenarios could give rise to the compositional variations observed in the human microbiome [42]. These LCs represent microbial communities in the same body site across different subjects. For simplicity, we assume that each LC contains only $p$ species ($p < n$), randomly selected from the metacommunity.
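The random assembly of local communities described above can be sketched as follows. This is a minimal illustration using Python's standard library; the function name, the seed, and the use of `random.sample` are assumptions of this example rather than details reported by the study.

```python
import random

def sample_local_communities(n, p, q, seed=0):
    """Draw q local communities, each a sorted list of p distinct species
    labels taken uniformly at random from the metacommunity S = {1, ..., n}."""
    rng = random.Random(seed)          # seeded for reproducibility (assumed)
    species_pool = range(1, n + 1)     # species are labeled 1..n as in the text
    return [sorted(rng.sample(species_pool, p)) for _ in range(q)]

# Sizes used in the study's first experiment: 500 LCs of 80 species out of 100.
communities = sample_local_communities(n=100, p=80, q=500)
```

Each entry of `communities` plays the role of one species set of a local community; different entries generally overlap heavily, since 80 of the 100 species are drawn each time.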
The GLV dynamics for each LC is given by
$$\dot{x}^{[\nu]} = \mathrm{diag}\big(x^{[\nu]}\big) \, \big(r^{[\nu]} + A^{[\nu]} x^{[\nu]}\big), \tag{3}$$
where the LC specific interaction matrix and growth vector are defined as $A^{[\nu]} = A_{S^{[\nu]}, S^{[\nu]}}$ and $r^{[\nu]} = r_{S^{[\nu]}}$, respectively. That is, $A^{[\nu]}$ is obtained from $A$ by only taking the rows and columns of $A$ that are contained in the set $S^{[\nu]}$. A similar procedure is performed in order to obtain $r^{[\nu]}$. Finally, for each $x^{[\nu]}$ there is a corresponding vector in $\mathbb{R}^n$ that has the abundances for species $S^{[\nu]}$ of LC $\nu$ in the context of the metacommunity species pool $S$. To reveal the origins of community types in the human microbiome, we decomposed the universal interaction matrix as
$$A = N H \circ G \, s, \tag{4}$$
which contains four components. (i) $N \in \mathbb{R}^{n \times n}$ is the nominal interspecific interaction matrix where each element is sampled from a normal distribution with mean 0 and variance $\sigma^2$, i.e. $[N]_{ij} \sim \mathcal{N}(0, \sigma^2)$. (ii) $H \in \mathbb{R}^{n \times n}$ is a diagonal matrix that captures the overall interaction strength heterogeneity of different species. When studying the impact of interaction strength heterogeneity the diagonal elements of $H$ will be drawn from a power-law distribution with exponent $-\alpha$, i.e. $[H]_{ii} \sim \mathcal{P}(\alpha)$, which are subsequently normalized so that the mean of the diagonal elements is equal to 1. This is to ensure that the average interaction strength is bounded. For studies that do not involve interaction strength heterogeneity $H$ is simply the identity matrix. (iii) $G \in \mathbb{R}^{n \times n}$ is the adjacency matrix of the underlying ecological network: $[G]_{ij} = 1$ if species $i$ is affected by the presence of species $j$ and 0 otherwise. For details on the construction of $G$ for different network topologies see S1 Text Sec. 3.2.2. Note that the Hadamard product ($\circ$) denotes element-wise matrix multiplication; in Eq (4) it is applied between the product $NH$ and $G$. (iv) The last component $s$ is simply a scaling factor between 0 and 1. Finally, we set $[A]_{ii} = -1$. The presence of the scaling factor $s$ and setting the diagonal elements of $A$ to $-1$ are to ensure an asymptotic stability condition for the GLV dynamics (S1 Text Sec. 4.2, 4.3.3, and 4.5). The elements in the global growth rate vector $r$ are taken from the uniform distribution, $[r]_i \sim \mathcal{U}(0, 1)$. Details concerning the distributions $\mathcal{N}$, $\mathcal{P}$ and $\mathcal{U}$ can be found in S1 Text Sec. 3.1.1.
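The construction of the universal interaction matrix and its restriction to a local community can be sketched as below. This is a hedged illustration, not the authors' code: the function names are hypothetical, and the power-law sampler assumes a Pareto-type density proportional to $h^{-\alpha}$ on $[1, \infty)$, since the exact definition of $\mathcal{P}(\alpha)$ is given only in S1 Text.

```python
import random

def build_interaction_matrix(n, alpha=None, sigma=1.0, s=0.07, G=None, seed=0):
    """Assemble A = (N H) o G * s with diagonal entries set to -1.
    alpha=None means no heterogeneity (H is the identity).
    G=None means a complete graph (all entries equal to 1)."""
    rng = random.Random(seed)
    # (i) nominal interactions, i.i.d. normal with standard deviation sigma
    N = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n)]
    # (ii) heterogeneity: inverse-transform sampling of an ASSUMED power law
    if alpha is None:
        h = [1.0] * n
    else:
        raw = [(1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]
        mean_raw = sum(raw) / n
        h = [v / mean_raw for v in raw]      # normalize to mean 1
    # (iii) network topology
    if G is None:
        G = [[1] * n for _ in range(n)]
    # H is diagonal, so N H scales column j of N by h[j]; (iv) scale by s
    A = [[N[i][j] * h[j] * G[i][j] * s for j in range(n)] for i in range(n)]
    for i in range(n):
        A[i][i] = -1.0
    return A, h

def restrict(A, r, species):
    """Keep only the rows and columns of A (and entries of r) for the given
    1-based species labels, as in the definition of the LC-specific parameters."""
    idx = [k - 1 for k in species]
    A_nu = [[A[i][j] for j in idx] for i in idx]
    r_nu = [r[i] for i in idx]
    return A_nu, r_nu

A, h = build_interaction_matrix(n=6, alpha=1.6, seed=42)
rng_r = random.Random(7)
r = [rng_r.uniform(0.0, 1.0) for _ in range(6)]
A_nu, r_nu = restrict(A, r, [1, 3, 4])
```

Because the heterogeneity factor multiplies whole columns, a species with a large `h[j]` strongly affects every species it interacts with, which is the sense in which the text later speaks of strongly interacting species.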

Origins of Community Types
We first studied the role of interspecific interaction strength heterogeneity on the emergence of community types. In order to achieve this, we chose the complete graph topology, i.e. each species interacts with all other species. This eliminates any structural heterogeneity. The nominal interaction strengths were taken from a normal distribution $\mathcal{N}(0, 1)$, the scaling component was set to $s = 0.07$ (the value given in Materials and Methods), and the interaction strength heterogeneity was varied from low heterogeneity (α = 7) to a high level of heterogeneity (α = 1.01). Fig 2a displays the distributions of the diagonal elements of the interaction heterogeneity matrix $H$ at various heterogeneity levels. For each level of heterogeneity we constructed 500 LCs, each with 80 species randomly drawn from a metacommunity of 100 species. Fig 2b illustrates the global interaction matrix $A$ as a weighted network. With low heterogeneity all the link weights are of the same order of magnitude. As the heterogeneity increases fewer nodes contain highly weighted links, until there is only one node with highly weighted links when α = 1.01. These nodes with highly weighted links correspond to SISs. Fig 2c presents the results of Principal Coordinates Analysis (PCoA) of the steady states associated with the 500 different LCs as a function of α. For low interaction heterogeneity (α = 7) the classical clustering measure, the Silhouette Index, is less than 0.1, suggesting a lack of clustering in the data. As the heterogeneity increases the steady states can be seen to separate in the first two principal coordinate axes. At one point (α = 2.0) three clusters is the optimal number of clusters. Then as α continues to decrease the optimal number of clusters becomes two. The fact that there are three clusters when α = 2.0 is not special, as a different number of optimal clusters can be observed with different model parameters or different clustering measures (see S1 Text Sec. 7.2) [7].
While the precise number of clusters is not important here, what is important is the fact that the degree of interaction strength heterogeneity controls the degree to which the clusters appear to be distinct. For low levels of interaction strength heterogeneity the clusters appear to be more like dense regions within a continuous gradient. As the heterogeneity increases, the clusters become more distinct. Indeed, having two clusters for α = 1.01 is to be expected, because one of the clusters is associated with all the LCs that contain the single SIS, and the other LCs that do not contain the single SIS constitute the other cluster.
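The Silhouette Index used above as the clustering measure can be sketched as follows, using its standard textbook definition. The function name and the toy data are assumptions of this example; the study's own computation (S1 Text) may differ in details such as the treatment of singleton clusters.

```python
def silhouette_index(D, labels):
    """Mean silhouette value given a symmetric distance matrix D (list of lists)
    and one cluster label per sample. Requires at least two clusters and
    samples that are not all identical within and across clusters."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    scores = []
    for i, c_i in enumerate(labels):
        own = clusters[c_i]
        if len(own) == 1:
            scores.append(0.0)   # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(D[i][j] for j in members) / len(members)
                for c, members in clusters.items() if c != c_i)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: four points on a line forming two well-separated groups.
points = [0.0, 1.0, 10.0, 11.0]
D = [[abs(a - b) for b in points] for a in points]
good = silhouette_index(D, [0, 0, 1, 1])   # natural grouping
bad = silhouette_index(D, [0, 1, 0, 1])    # mismatched grouping
```

Values near 1 indicate compact, well-separated clusters, while values near 0 (such as the value below 0.1 reported for α = 7) indicate that the samples do not fall into distinct groups.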
The overall trend observed in Fig 2c is unaffected if the complete graph is replaced by an Erdős-Rényi (ER) random graph, or if the total number of LCs is increased (S1 and S2 Figs). The result is also generally unaffected by the specifics of the nominal distribution (S1 Text Sec. 7.2.1), the mean degree of the ER graph (S1 Text Sec. 7.2.2), or the number of species in the LCs (S1 Text Sec. 7.2.3). Of course, each LC can be invaded by other species that are currently absent. If this migration occurs relatively fast, then all LCs will converge to roughly the same species collection and the clustering will disappear. Hence in our modeling approach we have to assume that the migration occurs at a relatively slow time scale, and the time interval between species invasions is too long to disrupt the clustering. We also note that if heterogeneous interactions are placed at random in the network the clustering of steady states does not arise (S3 Fig). Our results are also robust (in the control theoretical sense) to stochasticity and the migration of existing species [43]. Robustness to migration is illustrated in S4 and S5 Figs, and robustness to stochastic disturbances is illustrated in S6-S8 Figs (see S1 Text Sec. 4.4 for analytical robustness results).
We can explain the above results as follows: for low interaction strength heterogeneity all of the matrices $A^{[\nu]}$ are very similar. In other words, despite containing different sets of species, all the LCs have very similar dynamics. Thus, clustering of steady states is not to be expected. As the heterogeneity of interaction strength increases, however, some of the LCs will have species that are associated with the highly weighted columns in $A$, i.e. the SISs. [...] contain species 23 or 81. Most of the LCs in the yellow cluster contain SISs 23 and 51. Hence, each community type is well characterized by a unique combination of SISs. Note that none of the SISs are dominating species. These findings, along with the analysis in Fig 1, suggest that heterogeneity in interaction strengths or the presence of SISs leads to the clustering of steady states, i.e. the emergence of community types.
We then studied the impact of structural heterogeneity on community types. Four different scenarios are illustrated in Fig 4: (a) a complete graph topology as in Fig 2; (b) an ER random graph as in S1 Fig; (c) a power-law out-degree network; (d) a power-law out-degree network with no interaction strength heterogeneity. Fig 4a, 4b and 4c support the main result shown in Fig 2, i.e. increasing interaction strength heterogeneity leads to the emergence of distinct community types. Fig 4d displays rather unexpected results as it suggests that structural heterogeneity alone does not lead to distinct community types. It is only with the inclusion of interaction strength heterogeneity that structurally heterogeneous microbial ecosystems can display strong clustering in their steady states as shown in Fig 4c. This result is rather surprising, because structural heterogeneity is observed in many real-world complex networks [44][45][46] and has been shown to affect many dynamical processes over complex networks [47][48][49].
Note that in the preparation of Fig 4 the steady state abundances were normalized to get relative abundances of the species and the Jensen-Shannon distance metric was used for clustering analysis [50]. The trends discussed above also hold when, instead of the Silhouette Index, the Variance Ratio Criterion is used as the clustering measure, or the Euclidean distance is used for clustering, or when absolute abundances are analyzed along with the Euclidean distance being used (S9, S10 and S11 Figs). S11 Fig corresponds to the analytical results in Fig 1, where absolute abundances and the Euclidean distance are implicitly used.
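The normalization to relative abundances and the Jensen-Shannon distance mentioned above can be sketched as follows. The base-2 logarithm (which bounds the distance by 1) and the function names are assumptions of this example; the exact definition used in the study is given in S1 Text Sec. 5.1.

```python
import math

def normalize(x):
    """Convert absolute abundances into relative abundances that sum to 1."""
    total = sum(x)
    return [v / total for v in x]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))   # guard against tiny negative round-off

# Hypothetical abundance profiles over the same three species.
sample_a = normalize([2.0, 1.0, 1.0])
sample_b = normalize([0.5, 0.0, 1.5])
d_ab = js_distance(sample_a, sample_b)
```

Species absent from a local community simply enter with abundance 0, so profiles from communities with different species sets can still be compared over the full metacommunity.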

Control of Community Types
With the knowledge that each community type can be associated with a specific collection of SISs, we tested the hypothesis that a local community could be steered to a desired community type by controlling the combination of SISs only. Our results for three different scenarios are shown in Fig 5a for α = 1.6. The local community that was controlled in each scenario is shown in magenta and is denoted LC*, which initially belongs to the blue cluster. For Scenario 1, LC* had the SISs 23 and 81 removed, with species 60 and 51 simultaneously introduced with random initial abundances drawn from $\mathcal{U}(0, 1)$. Recall that species 60 and 51 are the SISs present in the orange cluster. This swap of SISs shifts LC* to a slightly different state (green dot) within the blue cluster. The GLV dynamics were then simulated and the trajectory goes from the blue cluster to the orange cluster. This result was independent of the initial condition of species 60 and 51 (Fig 5b). This open-loop control of the community type by manipulating a set of SISs also works at lower levels of heterogeneity (Fig 5c and 5d). Here we use the term open-loop to contrast closed-loop control where inputs are designed with feedback so as to continuously correct the system of interest. These findings imply that the SISs, despite their low abundances, can be used to effectively control a microbial community to a desired community type.
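The species swap of Scenario 1 can be mimicked on a toy system. The sketch below uses a hypothetical three-species metacommunity (all parameter values, function names, and the fixed-step integrator are assumptions of this example) and checks that the final state after the swap does not depend on the initial abundance of the introduced species. Absent species are represented by zero abundance, which the GLV dynamics preserve.

```python
import random

def glv_rhs(x, r, A):
    """GLV right-hand side: dx_i/dt = x_i * (r_i + sum_j a_ij * x_j)."""
    n = len(x)
    return [x[i] * (r[i] + sum(A[i][j] * x[j] for j in range(n))) for i in range(n)]

def simulate(x0, r, A, t_end=100.0, dt=0.01):
    """Fixed-step RK4 integration (an assumption; any ODE solver would do)."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        k1 = glv_rhs(x, r, A)
        k2 = glv_rhs([a + 0.5 * dt * b for a, b in zip(x, k1)], r, A)
        k3 = glv_rhs([a + 0.5 * dt * b for a, b in zip(x, k2)], r, A)
        k4 = glv_rhs([a + dt * b for a, b in zip(x, k3)], r, A)
        x = [a + dt / 6.0 * (p + 2.0 * q + 2.0 * u + v)
             for a, p, q, u, v in zip(x, k1, k2, k3, k4)]
    return x

def swap(x, remove, introduce, abundance):
    """Open-loop intervention: delete one species and seed another (0-based)."""
    y = list(x)
    y[remove] = 0.0
    y[introduce] = abundance
    return y

# Hypothetical 3-species metacommunity (illustrative values only).
r = [1.0, 0.5, 0.8]
A = [[-1.0, 0.3, -0.4],
     [0.1, -1.0, 0.0],
     [0.2, 0.0, -1.0]]

# The controlled community initially holds species 1 and 2; species 3 is absent.
x_star = simulate([0.5, 0.5, 0.0], r, A)

# Replace species 2 by species 3 at three random initial abundances.
rng = random.Random(1)
finals = [simulate(swap(x_star, 1, 2, rng.uniform(0.0, 1.0)), r, A)
          for _ in range(3)]
```

In this toy setting every run ends at the same steady state of the new species collection, which mirrors the independence from initial conditions reported for Fig 5b. It does not reproduce the clusters of the study, which require the full 100-species model.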
In Scenario 2 we tested if the same result could be obtained by removing the six most abundant species from LC* and introducing the six most abundant species from the orange cluster at exactly the same abundance level as an arbitrary local community in the orange cluster. The state after this dominating species swap (red dot) starts close to the orange cluster, because the six most abundant species from a local community in that cluster were copied. The trajectory does not ultimately converge near the orange cluster, but goes toward the blue cluster instead. The trajectory, however, does not ultimately converge in the blue cluster because it does not contain any of the most abundant species present in the blue cluster.
In Scenario 3 we explored how the open-loop control methodology just presented could also be used to conceptually justify the success of FMT in treating patients with rCDI [20][21][22]. This scenario begins by removing 20 species from LC* (the top two SISs and 18 of the most abundant species) so as to emulate the effect of broad-spectrum antibiotics, resulting in an altered community (blue dot). Then the GLV dynamics were simulated and the local community converged to a new steady state (black dot), representing the CDI state. To emulate an oral capsule FMT, 1% of the species abundances from an arbitrary LC in the orange cluster, i.e. the donor, was added to the CDI state, resulting in a slightly altered community (gray dot). The GLV dynamics were then simulated until the final steady state was reached (white dot). As expected the post-FMT steady state is in the orange cluster, the same cluster that is associated with the donor's LC. Note that if during the FMT the SISs in the donor's LC were not transplanted then the patient's post-FMT steady state does not converge in the orange cluster (S12 Fig). The above results indicate that the presence of SISs simplifies the open-loop control design. However, the existence of community types is not a prerequisite for deploying this control methodology. The possibility for open-loop control of the human microbiome will likely be body site specific. Our work focused on the gut specifically because this microbial community is very likely dominated by microbe-microbe and/or host-microbe interactions, rather than external disturbances. It is yet to be determined what factors drive the dynamics in other body sites.
Fig 5. [...] Fig 2, but with Euclidean distance used so that a projection matrix could be found to show the trajectories in the 2D principal coordinate plane (S1 Text Sec. 5.6). We aim to steer a local community (denoted as LC*, shown in magenta) in the blue cluster to the orange cluster. Three different scenarios are presented, per the three numbers above the arrows. Scenario 1: SISs swap. The SISs (23 and 81) of LC* were replaced by the SISs present in the orange cluster (60 and 51). The initial abundances of species 60 and 51 were drawn from $\mathcal{U}(0, 1)$, resulting in an altered community (green dot), and the GLV dynamics were simulated until the steady state was reached (white dot), which is located in the orange cluster as desired. Scenario 2: Dominating Species (DS) swap. The six most abundant species in LC* were removed and replaced by the six most abundant species from a local community in the orange cluster, with the initial condition after the switch of species shown as the red dot, and the dynamics were simulated until steady state was reached (white dot). Scenario 3: Fecal Microbiota Transplantation (FMT). The two SISs and 18 of the most abundant species (for a total of 20) were removed from LC* with the initial condition shown in blue (post-antibiotic state). Then the GLV dynamics were simulated (gray line) and the system converged to the black dot (CDI state). Then 1% of the steady abundances from an arbitrary LC in the orange cluster were added to the CDI state (gray dot, emulating oral capsule FMT) and the dynamics were then simulated until steady [...]

Discussion
In this work we studied compositional shift as a function of species collection using a dynamic systems approach, aiming to offer a possible mechanism for the origins of community types. We found that the presence of interaction strength heterogeneity or SISs is sufficient to explain the emergence of community types in the human microbiome, independent of the topology of the underlying ecological network. The presence of heterogeneity in the interspecific interaction strengths in natural communities has been well studied in macroecology [23][24][25]51]. Extensive studies are still required to explore this interesting direction in the human microbiome. While preliminary analysis is promising, all existing temporal metagenomic datasets are simply not sufficiently rich to infer the interspecific interaction strengths among all of the microbes present in and on our bodies [15] even at the genus level, let alone the species level. Recent studies have tried to overcome this issue by only investigating the interactions between the most abundant species [34]. Our results, however, suggest that SISs need not be the most abundant ones and can still play an important role in shaping the steady states of microbial ecosystems. Ignoring the lack of sufficient richness, system identification analysis with regularization and cross-validation [32,52] of the largest temporal metagenomic dataset to date [39] does not disprove the existence of SISs. To the contrary, it supports this assertion (see S13 Fig). Permutation of the time series however also results in the identification of interaction strength heterogeneity (see S14 and S15 Figs). Hence, the presence of SISs needs to be systematically studied with novel system identification methods and perhaps further validated with co-culture experiments [15]. 
For example, we could first use metabolic network models to predict levels of competition and complementarity among species [53], which could then be used as prior information to further improve system identification [54].
Note that our notion of SIS is fundamentally different from that of keystone species, which are typically understood as species that have a disproportionately deleterious effect (relative to their abundance) on the community upon their removal [55]. One can apply a brute-force leave-one-out strategy to evaluate the "degree of keystoneness" of any species in a given community [56]. Even without any interaction strength heterogeneity, a given community may still have a few keystone species. The SISs defined in this work are those species that have very strong impacts (either positive or negative) on the species that they directly interact with. The presence of SISs requires the presence of interaction strength heterogeneity. We emphasize that an SIS is not necessarily a keystone species. In fact, without any special structure embedded in the interaction matrix (and hence the ecological network), there is no reason why the removal of any SIS would cause a mass extinction. It does have a profound impact on the steady-state shift, which is exactly what we expected from our analytical results presented in Fig 1. Our findings also have important implications as we move forward with developing microbiome-based therapies, whether it be through drastic diet changes, FMT, drugs, or even engineered microbes [57][58][59][60][61][62][63]. Indeed, our results suggest that a few strongly interacting microbes can determine the steady state landscape of the whole microbial community. Therefore, it may be possible to control the microbiome efficiently by controlling the collection of SISs present in a patient's gut. Finer control may be possible through the engineering of microbes. This will involve a detailed mechanistic understanding of the metabolic pathways associated with the microbes of interest. As discussed in Fig 1, given a new species, the parameters $b$, $c$, $d$, $s$ could be chosen such that the new steady state is feasible and stable (S1 Text Sec. 4.3.1).
Then, with the knowledge of the appropriate parameters $b$, $c$, $d$, $s$ it would be possible to introduce a known microbe with those characteristics or engineer one to have the desired properties. We emphasize that the stability and control of the microbial ecosystem must be studied at the macroscopic scale using a systems and control theoretic approach. This is similar to what is carried out in aerospace applications. The design of wings and control surfaces for an aircraft incorporates sophisticated fluid dynamic models. The control algorithms for planes however are often derived from simple linearized reduced order dynamic models where linear control techniques can be easily deployed [64]. Taken together, our results indicate that the origins and control of community types in the human microbiome can be explored analytically if we combine the tools of dynamic systems and control theory, opening new avenues to translational applications of the human microbiome.

Materials and Methods
The methods section begins with a toy example to illustrate the step-by-step construction of the universal interaction matrix $A = NH \circ G\,s$ in Eq (4), starting from (i) the nominal matrix $N$. Given that $H$ is diagonal, it scales the columns of $N$. If one thinks of $A$ as the adjacency matrix of a digraph, then $H$ scales all of the edges leaving a node. Thus one can consider $H$ as controlling the interaction strength heterogeneity of $A$. Given the Hadamard product with $G$, the off-diagonal elements of $G$ that are zero will result in the corresponding off-diagonal elements of $A$ being zero as well.
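The construction steps can be followed on a small worked example. The $2 \times 2$ matrices below are hypothetical values chosen for this illustration and are not the numbers of the toy example in the original methods.

```latex
% Hypothetical 2x2 illustration of A = (N H) \circ G \, s; all numbers are assumed.
\begin{aligned}
N &= \begin{bmatrix} 0.5 & -1.0 \\ 2.0 & 0.3 \end{bmatrix}, \quad
H = \begin{bmatrix} 1.5 & 0 \\ 0 & 0.5 \end{bmatrix}, \quad
G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad
s = 0.1, \\
NH &= \begin{bmatrix} 0.75 & -0.5 \\ 3.0 & 0.15 \end{bmatrix}
\quad \text{(column $j$ of $N$ scaled by $[H]_{jj}$)}, \\
NH \circ G &= \begin{bmatrix} 0.75 & 0 \\ 3.0 & 0.15 \end{bmatrix}
\quad \text{(entries with $[G]_{ij} = 0$ removed)}, \\
NH \circ G \, s &= \begin{bmatrix} 0.075 & 0 \\ 0.3 & 0.015 \end{bmatrix}, \qquad
A = \begin{bmatrix} -1 & 0 \\ 0.3 & -1 \end{bmatrix}
\quad \text{(diagonal set to $-1$)}.
\end{aligned}
```

In this illustration species 1 has the larger heterogeneity weight, so its outgoing edge (the entry in column 1, row 2) dominates the off-diagonal structure of $A$.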
In the first study (Fig 2), to explore the impact of interaction heterogeneity on steady state shift, we varied the exponent $-\alpha$ of the power-law distribution of $[H]_{ii}$ to generate five different universal interaction matrices $A$ of dimension 100 × 100. For each universal interaction matrix $A$, the nominal component $N$ consists of independent and identically distributed elements sampled from a normal distribution $\mathcal{N}(0, 1)$. The topology for this study was a complete graph and thus all the elements in $G$ are equal to 1. The heterogeneity element $H$ is constructed in two steps. First, five different vectors $\bar{h}(\alpha) \in \mathbb{R}^{100}$ are constructed where each element is sampled from a power-law distribution $\mathcal{P}(\alpha)$ for $\alpha \in \{7, 3, 1.6, 1.2, 1.01\}$. Then, each of the $\bar{h}(\alpha)$ is normalized to have a mean of 1, $h = \bar{h} / \mathrm{mean}(\bar{h})$. Finally the heterogeneity matrix is defined as $H = \mathrm{diag}(h)$. For this study $s = 0.07$, ensuring uniform asymptotic stability for the case of low heterogeneity (see S1 Text Theorem 17). The final step in the construction of $A$ is to set the diagonal elements to $-1$.
For each α the following simulation steps were taken. There are a total of 100 species, $S = \{1, 2, \ldots, 100\}$, in the metacommunity, and each of the 500 local communities contains 80 species, randomly chosen from S. The MATLAB command used to perform this step is randperm. The initial conditions for the 500 local communities, $x^{[\nu]}(0)$, were sampled from $\mathcal{U}(0, 1)$. The dynamics were then simulated for 100 seconds using the MATLAB command ode45. If any of the 500 simulations crashed due to instability, or if the norm of the terminal discrete time derivative was greater than 0.01, then that local community was excluded from the rest of the study. Simulations that finished without crashing and with a small terminal discrete time derivative were deemed steady. Fewer than 1% of simulations were deemed unstable in the preparation of Fig 2. It is worth noting that, with the dynamics constructed as described above, the abundance profiles of our synthetic data do not exhibit the heavy-tailed distribution that is observed in the HMP gut data [4].
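A rough Python analogue of these simulation steps is sketched below. Several choices are assumptions made for illustration only: the dynamics are taken to be of generalized Lotka-Volterra form with all growth rates r set to 1, a fixed-step fourth-order Runge-Kutta integrator stands in for ode45, the interaction matrix is a homogeneous stand-in (H equal to the identity), and only 5 local communities are simulated instead of 500. The function names are hypothetical.

```python
import numpy as np

def glv_rhs(x, r, A):
    # ASSUMED model form: dx_i/dt = x_i * (r_i + sum_j a_ij * x_j).
    return x * (r + A @ x)

def simulate_local_community(A, r, present, rng, t_end=100.0, dt=0.05):
    """Integrate one local community; absent species start (and stay) at zero."""
    x = np.zeros(A.shape[0])
    x[present] = rng.random(present.size)        # initial abundances ~ U(0, 1)
    x_prev = x
    for _ in range(int(round(t_end / dt))):
        x_prev = x
        k1 = glv_rhs(x, r, A)
        k2 = glv_rhs(x + 0.5 * dt * k1, r, A)
        k3 = glv_rhs(x + 0.5 * dt * k2, r, A)
        k4 = glv_rhs(x + dt * k3, r, A)
        x = x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    # Norm of the terminal discrete time derivative.
    terminal_rate = np.linalg.norm(x - x_prev) / dt
    return x, terminal_rate

rng = np.random.default_rng(1)
n, n_local = 100, 80
A = 0.07 * rng.standard_normal((n, n))           # stand-in for A with H = I
np.fill_diagonal(A, -1.0)
r = np.ones(n)                                   # ASSUMPTION: unit growth rates

steady = []
for _ in range(5):                               # 500 local communities in the text
    present = rng.permutation(n)[:n_local]       # analogue of MATLAB randperm
    x, rate = simulate_local_community(A, r, present, rng)
    if np.all(np.isfinite(x)) and rate <= 0.01:  # keep only steady simulations
        steady.append(x)

print(len(steady), "of 5 local communities deemed steady")
```

Because a species with zero abundance has zero growth under this model form, restricting a local community to 80 species only requires setting the other 20 initial abundances to zero.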
The networks presented in the second row of Fig 2 were constructed by considering A as the weighted adjacency matrix of the network. Note that arrows showing directionality and self loops were suppressed. The links were color coded in proportion to the absolute value of the entries in A.
For the last row of Fig 2 a clustering analysis was performed. For each α the steady state abundances of the 500 local communities were normalized so that we have 500 synthetic microbial samples. Then k-medoids clustering was performed for $k \in \{1, 2, \ldots, 10\}$ using the Jensen-Shannon distance metric (S1 Text Sec. 5.1). Silhouette analysis was performed to determine the optimal number of clusters, and the clustering results were illustrated in the 2-dimensional principal coordinates plot.
For S1 Fig the same steps as for the preparation of Fig 2 were performed, but with G representing the adjacency matrix of an Erdős-Rényi digraph with mean degree of 20 (mean in-degree of 10 and mean out-degree of 10) and s = 1/[…]. Details on the construction of an Erdős-Rényi digraph can be found in S1 Text Section 3.
Fig 4d shows results for networks with a power-law out-degree distribution with mean out-degree of 10 and no interaction strength heterogeneity, i.e. H is the identity matrix. For this study the Silhouette Index was constructed from normalized steady state data using the Jensen-Shannon distance. S9 Fig is the same as Fig 4, but instead of the Silhouette Index, the variance ratio criterion is used with the Jensen-Shannon distance, from normalized steady state abundance (S1 Text Sec. 5.4). In S10 Fig the Silhouette Index is determined from the Euclidean distance with normalized steady state abundance. Finally, in S11 Fig the Silhouette Index is determined from the Euclidean distance with the absolute steady state abundance.
Fig 5 contains a PCoA of the results from Fig 2, but with the Euclidean distance used instead of the Jensen-Shannon distance, making PCoA equivalent to principal component analysis. This enables us to project the open-loop control trajectories into the principal coordinates (S1 Text Sec 5.6). This procedure was also used in the preparation of S12 Fig.
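The clustering procedure described for the last row of Fig 2 can be illustrated with the Python sketch below. It is only a sketch under stated assumptions: the Jensen-Shannon distance is taken to be the square root of the Jensen-Shannon divergence (natural logarithm), k-medoids is implemented as a simple alternating assignment and update scheme that may differ from the variant used in the paper, the Silhouette Index is evaluated only for k ≥ 2, and the 40 synthetic samples are hypothetical. All function names are illustrative.

```python
import numpy as np

def js_distance(p, q):
    # ASSUMPTION: distance = sqrt of the Jensen-Shannon divergence.
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # terms with a = 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))

def distance_matrix(samples):
    n = samples.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = js_distance(samples[i], samples[j])
    return D

def k_medoids(D, k, rng, n_iter=100):
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)       # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids

def silhouette_index(D, labels):
    clusters = np.unique(labels)
    if clusters.size < 2:
        return 0.0
    scores = np.zeros(D.shape[0])
    for i in range(D.shape[0]):
        own = labels == labels[i]
        if own.sum() == 1:
            continue                      # singleton cluster: score left at 0
        a = D[i, own].sum() / (own.sum() - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Hypothetical data: two groups of 20 normalized abundance profiles over 4 taxa.
rng = np.random.default_rng(0)
centers = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]])
samples = np.vstack([c + 0.02 * rng.random((20, 4)) for c in centers])
samples /= samples.sum(axis=1, keepdims=True)

D = distance_matrix(samples)
scores = {}
for k in range(2, 6):
    labels, _ = k_medoids(D, k, rng)
    scores[k] = silhouette_index(D, labels)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The number of clusters is then chosen as the k with the largest Silhouette Index. Since this simple k-medoids variant depends on its random initialization, several restarts per k would be advisable in practice.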
S13-S15 Figs contain system identification analyses for temporal gut microbiome data of two subjects [39]. The data are publicly available from the metagenomics analysis server MG-RAST:4457768.3-4459735.3 and can also be accessed (as we did) from Qiita (http://qiita.ucsd.edu) under study ID 550. The processed data were downloaded as the biom file "67_otu_table.biom" (2014-11-17 13:18:50.591389). The Operational Taxonomic Units (OTUs) were then grouped from the genus level and up, depending on the availability of known classifications for OTUs, and converted to a txt file using MacQIIME version 1.9.0-20140227 with the command summarize_taxa.py with the options -L 6 -a true. Data were collected over 445 days with 336 fecal samples from Subject A and 131 fecal samples from Subject B.
Details on the system identification algorithm are now given. The dynamics in Eq (2) can be approximated in discrete time as [32]
$$e_i(k) + \log\big(x_i(t_{k+1})\big) - \log\big(x_i(t_k)\big) = r_i + \sum_{j=1}^{n} a_{ij}\, x_j(t_k) \qquad (5)$$
for i = 1, 2, ..., n, where k = 1, 2, ..., N − 1 is the sample index, N is the total number of samples, $t_k$ is the time stamp of sample k, and e is an error term that arises because of the assumption that x(t) is constant over each interval $t \in [t_k, t_{k+1})$. Eq (5) can be rewritten in terms of the regressor vector $\phi(k) = [1, x_1(t_k), x_2(t_k), \ldots, x_n(t_k)]^T$, the parameter vector $\theta_i = [r_i, a_{i1}, a_{i2}, \ldots, a_{in}]$ and the log difference $y_i(k) = \log(x_i(t_{k+1})) - \log(x_i(t_k))$ as
$$e_i(k) + y_i(k) = \theta_i\, \phi(k).$$
The identification problem can then be defined as finding the parameter matrix estimate
$$\hat{\Theta} = \arg\min_{\Theta}\; \|Y - \Theta \Phi\|_F^2 + \lambda \|\Theta\|_F^2,$$
where $\Theta = [\theta_1^T, \theta_2^T, \ldots, \theta_n^T]^T$ stacks the parameter vectors, Y is the matrix with entries $[Y]_{ik} = y_i(k)$, $\Phi = [\phi(1), \phi(2), \ldots, \phi(N-1)]$ is the regressor matrix, $\|\cdot\|_F$ denotes the Frobenius norm, and $\lambda \geq 0$ is the Tikhonov regularization term [65]. The minimal solution to the above problem can be given directly as
$$\hat{\Theta} = Y \Phi^T \left(\Phi \Phi^T + \lambda I\right)^{-1},$$
where I is the identity matrix.
Next we discuss how missing data and zero reads were handled and how λ was chosen. The difference equation in Eq (5) only uses sample data over two consecutive time samples.
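The Tikhonov-regularized least-squares identification described above can be sketched in Python as follows, assuming the standard closed-form ridge solution $\hat{\Theta} = Y\Phi^T(\Phi\Phi^T + \lambda I)^{-1}$ and samples spaced one day apart. The function names build_regression and ridge_solution, the two-taxon synthetic trajectory, and the value λ = 0.01 are all hypothetical and serve only to exercise the formulas; this is not the code used for S13-S15 Figs.

```python
import numpy as np

def build_regression(X):
    """X: (n, N) array of positive abundances; column k is the sample at t_k.
    Consecutive columns are assumed to be one day apart."""
    logX = np.log(X)
    Y = logX[:, 1:] - logX[:, :-1]                   # log differences y_i(k)
    ones = np.ones((1, X.shape[1] - 1))
    Phi = np.vstack([ones, X[:, :-1]])               # phi(k) = [1, x(t_k)]^T
    return Y, Phi

def ridge_solution(Y, Phi, lam):
    # Theta_hat = Y Phi^T (Phi Phi^T + lam I)^{-1}; since the Gram matrix is
    # symmetric, solve for the transpose instead of forming an explicit inverse.
    gram = Phi @ Phi.T + lam * np.eye(Phi.shape[0])
    return np.linalg.solve(gram, Phi @ Y.T).T

# Hypothetical synthetic data generated from a noisy discrete-time model.
rng = np.random.default_rng(0)
r_true = np.array([0.5, 0.3])
A_true = np.array([[-1.0, 0.2],
                   [-0.3, -1.0]])
X = np.empty((2, 200))
X[:, 0] = [0.2, 0.4]
for k in range(199):
    noise = 0.1 * rng.standard_normal(2)
    X[:, k + 1] = X[:, k] * np.exp(r_true + A_true @ X[:, k] + noise)

Y, Phi = build_regression(X)
Theta_hat = ridge_solution(Y, Phi, lam=0.01)
r_hat, A_hat = Theta_hat[:, 0], Theta_hat[:, 1:]     # row i is [r_i, a_i1, ..., a_in]
print(r_hat)
print(A_hat)
```

Each row of the estimate holds the growth rate and interaction coefficients of one taxon, so the column-wise spread of the interaction part can then be inspected in the way the figure captions describe.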
Because Eq (5) pairs each sample with its successor, Y and the regressor matrix Φ are constructed only from samples for which data from the next day are also available. Also, given that logarithms are used, when a sample has zero reads for a given taxon, a read value of one is inserted. Relative abundances are then computed before the logarithm is taken. Finally we discuss how the regularization parameter is chosen. For S13 and S14 Figs the following cross-validation is performed. For Subjects A and B two-thirds of the data were used for training and one-third for testing. More precisely, for each λ two-thirds of the data from Subject A and two-thirds of the data from Subject B were used to identify their corresponding dynamical constants. Then the combined error from the two test sets was used to find the optimal λ. The regularization value used in S15 Fig is the same as that used in S13 Fig.
Note that it is rather counter-intuitive that for α = 1.01 the Silhouette Index suggests that there are two clusters, while PCoA suggests three clusters. We emphasize that, as a typical ordination method, PCoA just produces a spatial representation of the entities in the dataset, rather than an actual determination of cluster membership.
As compared to Fig 5a, during the FMT the SISs (60 and 51) of the donor's local community in the orange cluster were not transplanted to the CDI state (black dot). This FMT resulted in a slightly altered community (gray dot) and the system eventually evolved to a steady state (white dot) that is not in the orange cluster. Hence the FMT failed. (EPS)
S13 Fig. System identification, Tikhonov regularization λ = 0.0423. System identification was performed on the stool samples from the longitudinal data in [39] for two subjects as described in S1 Text, where λ was determined by cross-validation. (a) Visualization of microbial taxa in terms of relative abundances versus the day the sample was taken. (b) Heat map of the interaction matrix for the top 100 SISs. (c) Histogram of the standard deviation (SD) of the columns of the interaction matrix. (d) List of the top ten SISs in descending interaction strength (defined by the SD of each column in the interaction matrix) with relative abundances over all samples shown as a box plot. The banded structure shown in the heat map supports the assertion that SISs do exist in the gut microbiome. However, this banded structure is also seen when the dates of the sample collections are permuted, see S14 and S15 Figs. (EPS)
S14 Fig. System identification, day swap, Tikhonov regularization λ = 0.0057. System identification was performed on the stool samples from the longitudinal data in [39], but with the collection dates permuted; λ was determined by cross-validation on the permuted data. (a) Visualization of microbial taxa in terms of relative abundances versus the day the sample was taken (samples not permuted). (b) Heat map of the interaction matrix for the top 100 SISs. (c) Histogram of the standard deviation (SD) of the columns of the interaction matrix. (d) List of the top ten SISs in descending interaction strength (defined by the SD of each column in the interaction matrix) with relative abundances over all samples shown as a box plot. Even though the sample days have been permuted, the banded structure still persists. (EPS)
S15 Fig. System identification, day swap, Tikhonov regularization λ = 0.0423. System identification was performed on the stool samples from the longitudinal data in [39], but with the collection dates permuted; λ was selected to be the same as in S13 Fig. (a) Visualization of microbial taxa in terms of relative abundances versus the day the sample was taken (samples not permuted). (b) Heat map of the interaction matrix for the top 100 SISs. (c) Histogram of the standard deviation (SD) of the columns of the interaction matrix. (d) List of the top ten SISs in descending interaction strength (defined by the SD of each column in the interaction matrix) with relative abundances over all samples shown as a box plot. For the permuted data, when λ is larger than the optimal value from the cross-validation, the identification method is biased towards making the most abundant species also the SISs. (EPS)
S1 Text. Detailed treatment of necessary mathematical components. S1 Text contains discussions on: random variables, random matrices, dynamic stability, clustering, ordination, modeling, and more simulation results. (PDF)