Variable habitat conditions drive species covariation in the human microbiota

Two species with similar resource requirements respond in a characteristic way to variations in their habitat—their abundances rise and fall in concert. We use this idea to learn how bacterial populations in the microbiota respond to habitat conditions that vary from person-to-person across the human population. Our mathematical framework shows that habitat fluctuations are sufficient for explaining intra-bodysite correlations in relative species abundances from the Human Microbiome Project. We explicitly show that the relative abundances of closely related species are positively correlated and can be predicted from taxonomic relationships. We identify a small set of functional pathways related to metabolism and maintenance of the cell wall that form the basis of a common resource sharing niche space of the human microbiota.

Metagenomic studies of species abundances provide data in the form of counts. That is, the number of sampled sequences assigned to species i = 1, . . . , N is given by the integer count n i . We assume the counts are proportional to the true species abundances, but the constant of proportionality (which may vary between experiments) is unknown. As a result, we follow the standard procedure by analyzing the relative species abundances [1]: Relative species abundances are compositional, which means that x i ≥ 0 and i x i = 1. The normalization removes a degree of freedom (i.e. we do not know the total population size), and imposes significant restrictions on the types of modeling and analyses that can be performed using relative abundances. Therefore, this section will review the principles of compositional data analysis that will guide our analysis of the microbiota [2].

Working with compositions
Aitchison introduced a framework for the statistical analysis of compositions in 1982 [3]. A compositional vector x of length N lies in the simplex S N = {x ∈ R N + | i x i = 1}. For example, x i could be the relative abundance of a species in a community, as described above. According to Aitchison, the fundamental principle of compositional data analysis is [4]: Any meaningful function of a composition can be expressed in terms of ratios of the components of the composition. Perhaps equally important is that any function of a composition not expressible in terms of ratios of the components is meaningless.
That is, because the total population size is unknown, all theory and methods of analyses may only use ratios of the relative abundances, which satisfy x i /x j = n i /n j and, therefore, are insensitive to variations in the total population size. Analyzing compositional data in a manner that does not respect this principle may lead to incorrect conclusions, such as attributing spurious correlations between species to improper causes [1, 3,5,6].

Transforms for compositional data
In practice, analyzing compositions amounts to computing a transformation of the data of the form y = G log x where G is a matrix that maps the vector of 1's to the vector of 0's (i.e. G1 = 0) [2,5]. The mathematics of compositions ensures that performing analyses on the transformed data, instead of directly on the relative abundances, respects the fundamental principle of compositional data analysis. One common choice for G is the centering matrix C = I − N −1 11 T , where I is the N × N identity matrix. This implements the 'centered log-ratio' (CLR) transform. In this work, however, we have chosen to use the 'additive log-ratio' (ALR) transform, which is The additive log-ratio transform maps a compositional vector of length N to a vector of real numbers of length (N − 1) with elements y i = log(x i+1 /x 1 ) for i = 1, . . . , N − 1. The decrease in the dimension of the vector reflects the loss of a degree of freedom due to ignorance of the total population size. Figure S1 presents a comparison of the distances and principal components computed from the ALR transformed relative abundances to the true Aitchison distance. For the comparison, we computed the Isometric Log-Ratio (ILR) transformation of the relative abundances using [7]: A) The Aitchison distance is a metric for relative abundance data based on log-ratios. The ALR transform is not isometric (meaning, it does not preserve distances exactly) but the distances computed with the ALR transformed relative abundances are highly correlated the Aitchison distance (R = 0.87). b) The first two principal coordinates computed using the Aitchison distance. c) The first two principal coordinates computed using the ALR transformed relative abundances (reproduced in Figure S7E).

THEORETICAL MODEL OF SPECIES COMPOSITION
A maximum diversity hypothesis Relative species abundances provide no information about the total population size in a community. Unfortunately, the total population size plays an important role in most theoretical models of population dynamics (socalled 'density dependent' effects). Therefore, our goal in this section is to develop a model of relative species abundances using a theory that does not include any explicit density dependent effects. In this section, we present the theory as a generative model in which known environmental conditions and species properties are used to deduce the relative species abundances. In practice, we will use the model in the opposite direction; that is, observed relative species abundances across many different environments will be used to infer properties of the species. We will describe the 'inverse model' in the next section.
Rather than starting from an existing model based on absolute abundances, we postulate a fundamental prin-ciple acting directly on the relative abundances of the species in a community. The intuition for our model comes from a classic idea in ecology (e.g. MacArthur [8]) that the equilibrium species composition of a community ensures that every niche is being fully utilized. This idea can be treated explicitly in some consumer-resource models but we will consider a more general model in which the niches are abstract properties of the community [8][9][10].
Our model for relative species abundances is based on a postulate we refer to as the 'maximum diversity hypothesis': Maximum Diversity Hypothesis: The equilibrium relative species abundances in a community maximize the diversity of the community while ensuring that all effective resources are fully utilized.
To formalize the model, we suppose that a particular community has M effective resources that define the dimensions of the niche space. Note that we use the term 'effective resources' in an abstract way that captures all of the abiotic and biotic factors that affect the species composition within a community. Each effective resource µ has an availability V µ , which varies between different environments. It is the variation in the availabilities of the effective resources that drives variation in species composition between communities. In a community composed only of species i, an amount V iµ of resource µ will be utilized. In other words, V iµ describes the ability of species i to utilize resource µ. Note that V iµ can be positive, in which case species i depletes resource µ, or it could be negative, in which case species i adds more of resource µ to the environment. For example, a bacterium may secrete a metabolite that is utilized by other species. Finally, we quantify the diversity of a community using the Shannon entropy H[x] = − i x i log x i [11,12]. Following the maximum diversity hypothesis, the equilibrium relative species abundances can be obtained by maximizing H[x] subject to subject to constraints V µ = i V iµ x i and i x i = 1.
We can solve for the equilibrium relative abundances by maximizing the Lagrangian: where γ and the λ µ 's are Lagrange multipliers. The solution is given by where the Lagrange multipliers are chosen so that There is one degree of freedom for each of the N species, but a degree of freedom must be lost due to the sum constraint. As a result, the dimension of the niche space is, at most, M ≤ N − 1. From now on, we will assume that M = N − 1 for simplicity; later, we will discuss how to remove irrelevant dimensions during data analysis. The Lagrange multipliers λ µ encode the availabilities of the various resources. The values of the Lagrange multipliers have an interpretation that we can borrow from economics. A Lagrange multiplier is called a 'shadow price' and is given by λ µ = ∂H ∂Vµ evaluated at the optimum [13]. That is, the shadow price describes the number of units that diversity changes if the availability of resource µ increases by one unit. The shadow price can be positive or negative. If the shadow price of resource µ is positive then an increase in the amount of resource µ will lead to an increase in the diversity of the community. By contrast, if the shadow price of resource µ is negative then an increase in the amount of resource µ will lead to a decrease in the diversity of the community.

Population dynamics that maximize diversity
Alternatively, we can view Equation (S4) as the steadystate solution of an appropriately chosen set of dynamical equations. We stress that there are many types of dynamical models that lead to the same equilibrium, and our goal in this section is only to discuss the simplest dynamics that fulfill the maximum diversity hypothesis.
First, we suppose that the fitness of a species can be represented as a weighted sum of its traits (i.e. resource utilizations) via F * i = µ λ µ V iµ . Then, we can write a simple equation for the dynamics of y = G log x as: where G is the matrix of the additive log-ratio transform. This equation reaches equilibrium at y * = GF * = G log x * and, thus, leads to the same equilibrium abundances as the maximum diversity hypothesis. This equation can also be written as log x * = F * + constant. Left multiplication by the non-invertible matrix G sends any constant term to zero, thereby removing any dependence on the total population size. The equations for the population dynamics can be written in multiple forms that highlight different aspects of the model. For example, Equation (S6) is equivalent to: so that the community follows the gradient of the Kullback-Leibler divergence from its equilibrium configuration [14]. Similarly, the time derivatives of the relative abundances (rather than the additive log-ratio transformed abundances) can be obtained using the chain rule, and are given by a replicator equation of the form [15]: whereF = C log x is the vector of centered log-ratio transformed relative abundances.

MODELING VARIABILITY IN THE HUMAN MICROBIOTA
The previous section presented a theoretical model for determining the relative abundances of the species in a community based on a maximum diversity hypothesis. The hypothesis postulates that the equilibrium relative species abundances in a community maximize the diversity of the community while ensuring that all effective resources are fully utilized. This section will explore the consequences of this model for variability in the composition of the human microbiota across different body sites, and across the population.
The maximum diversity hypothesis implies that the additive log-ratio transform of the relative species abundances in a community is given by: The (N −1)×(N −1) matrixṼ := GV describes intrinsic characteristics of the species, and does not vary from one environment to another. By contrast, the vector λ encodes the information about the availabilities of the effective resources in a given environment. Thus, λ varies from person-to-person, as well as across bodysites within the same individual. Variation from person-to-person of the composition of the microbiota in a single bodysite (such as the gut) is generally much smaller than the differences between bodysites within a single individual. Therefore, we propose a simple statistical model in which λ is a random variable drawn from a distribution that depends on the bodysite. We hypothesize that, conditioned on the bodysite s, λ is normally distributed via λ|s ∼ N (λ s , Σ s ) where Σ s is a diagonal covariance matrix. For example, if we could determine the λ's for the gut microbiomes of a large sample of people, we would find that λ µ has a variance Σ µµ|s = σ 2 µ|s and that any two resources, say λ µ and λ ν , are uncorrelated. Moreover, the additive log-ratio (ALR) transformed relative species abundances taken from that bodysite (e.g. the gut) would be normally distributed according to: Thus, the covariance matrix (Ψ s ) computed from the ALR transformed relative abundances is given by Ψ s = VΣ sṼ T for each body site s.
The total covariance matrix of the ALR transformed abundances (i.e. the covariance matrix computed without using the bodysite labels) corresponds to: (S12) withλ µ = s p s λ µ|s , and where p s is the fraction of samples corresponding to bodysite s. In other words, Λ describes differences in the availabilities of the effective resources between different bodysites, whileΣ describes the variability in the availabilities of the resources within the bodysites. We hypothesize thatΣ is approximately diagonal, whereas Λ is likely not diagonal. As a result, it is possible to learn about the species characteristics by finding a matrixṼ such thatṼ −1 Ψ s (Ṽ T ) −1 is approximately diagonal for each bodysite s. This trick does not tell us about Λ, however, which has too many degrees of freedom to be determined from the data. Therefore, we expect that information about species characteristics (i.e. aboutṼ) can be obtained by analyzing intra-bodysite variability in species composition, but not inter-bodysite variability. Principal components analysis (PCA) is commonly applied to metagenomic data as a tool for uncovering an underlying structure, such as clustering of particular environments (e.g., [16]). PCA decomposes the covariance matrix of ALR transformed relative abundances via Ψ =Ṽ P CA Σ P CAṼ T P CA , where Σ P CA is diagonal and V P CAṼ T P CA = I. From Equation 10, we see that PCA is only relevant as a generative model if Λ is diagonal and V is orthogonal. Therefore, applying PCA to the covariance matrix computed without regard to the bodysites generally misses the structure that can be found by focusing only on intra-bodysite variability. In the next section, we describe a type of 'generalized PCA' called Common Components Analysis (CoCA) that uses the relationship Ψ s =ṼΣ sṼ T to infer the species characteristics from metagenomic data with labeled bodysites.

COMMON COMPONENTS ANALYSIS
The model that we have presented makes it possible to use metagenomic data collected from a large sample of different individuals and bodysites to infer the matrix of resource utilizations (V) and the Lagrange multipliers (i.e., shadow prices) (λ) that reflect the availabilities of the resources. The main assumptions are: (1) variation in species composition between samples (i.e., in different individuals and/or bodysites) is driven by variation in the availabilities of the effective resources, (2) the habitat variation can be captured by treating the shadow prices as random variables with a distribution that depends on the bodysite, and (3) the resource utilization matrix V is sparse so that any particular species is unlikely to be able to utilize all of the different resources.

Inference Algorithm
Common components analysis (CoCA) is an approach to simultaneous non-orthogonal approximate diagonalization [17][18][19]. We have assumed that the distribution of ALR transformed species abundances is conditioned on the bodysite as y|s =Ṽλ s with λ s ∼ N (λ s , Σ s ). Moreover, we assume that the bodysite labels are known. As in PCA, we make the assumption that Σ s is diagonal, but we do not assume thatṼ is orthogonal. This formulation of CoCA aims to find a single, common set of factors that explains the variation within each bodysite s = 1, . . . , S. The factors are such that the productṼ −1 Ψ s (Ṽ −1 ) T is approximately diagonal for each s.
Within a given bodysite, we have y|s ∼ N (Ṽλ s |ηI + VΣ sṼ T ). Here, η is a small noise term that accounts for experimental errors. In the following, we assume that the experimental errors are small relative to the intrabodysite variation in the relative abundances so that they can be neglected, but we have included the term here for completeness. Maximizing the log-likelihood is equivalent to minimizing the KL-divergence between the assumed distribution and the empirical distribution. The distribution of y given s is multivariate normal, so the KL-divergence is given by: Here,ȳ s and Ψ s are the empirical mean and covariance matrix of y = G log x in bodysite s, respectively. We can immediately see thatλ s =Ṽ −1ȳ s . Plugging this in, rearranging, and neglecting constant terms gives, which is the appropriate negative log-likelihood for a single bodysite. The fraction of samples coming from bodysite s is p s , and the bodysite labels are known. Therefore, the total negative log-likelihood is a weighted sum of each of the individual negative log-likelihoods. The matricesṼ and {Σ s } S s=1 can be inferred by minimizing this conditional negative log-likelihood: We can simplify this by redefining the objective function in terms of W T =Ṽ −1 , giving: The derivatives of the objective function can be calcu-lated as: The objective function can be minimized by alternating gradient descent updates W(t + 1) = norm(W(t) − w (∆W(t) + ρ∆W(t − 1))) and Σ −1 95 is a momentum term and norm(·) normalizes the columns of the matrix [20]. Normalizing the columns of W is an arbitrary choice that resolves an inherent ambiguity in which it is not possible to determine the relative magnitudes ofṼ and Σ s . Alternating the updates of W and Σ −1 s makes it easier to tune the step sizes w and s using the 'bold driver' method [21] where we only accept gradient steps that decrease the objective function, and we increase the step size (e.g. w ← 1.1 * w ) if a step in in the W direction is accepted and decrease the step size (e.g. w ← 0.5 * w ) if a step in the W direction is rejected, and similarly for s . The gradient descent is continued until convergence.

Sparse Recovery
The CoCA algorithm described above yields an estimate forṼ = (W T ) −1 = GV. We would like to be able to determine V but the matrix G is not invertible. However, if we assume that V is sparse then it is possible to determine V fromṼ. In the context of the model, this assumption means that any individual species is unlikely to be able to utilize every possible resource. To recover V fromṼ, we solve the problem: where ||V|| 1 = iµ |V iµ |. The solutions to this problem are all of the form V iµ = z µ +Ṽ (i−1),µ (1 − δ i1 ) for i = 1, . . . , N , where z µ can, in principle, take on any real value. Because we want the solution with a minimum L 1 norm, it is sufficient to test z µ = 0 and (the only sparse solutions) and to choose the one with minimum norm. This is a tractable search over N (N −1) possibilities in the worst case and can be done easily for reasonable system sizes.

Number of Components
We have assumed that the matrixṼ has dimension (N − 1) × (N − 1) and is full-rank or, equivalently, that the number of effective resources is M = N − 1. This assumption could be relaxed by including the term accounting for experimental errors (i.e., setting η > 0). Even without the full error model, however, one may find that many of the components have low weight. That is, there may be many µ's with small contributions to the intra-bodysite variances (e.g., small s p s Σ µµ|s ). This is similar to the usual implementation of PCA where many eigenvectors of the covariance matrix may be associated with small eigenvalues. The lowly weighted components are often neglected to obtain a low-rank approximation of the data.
In the case of CoCA, however, choosing which components to keep depends on the objective. For example, to describe the variation within bodysite s we would want to keep the components with the largest values of Σ µµ|s because we don't care about the other bodysites. By contrast, components can be chosen to obtain a low dimensional representation that clusters the bodysites by choosing components with large interbodysite differences and small intra-bodysite variances. One way to choose these components is to sort them by s p s (λ µ|s −λ µ ) 2 / s p s Σ µµ|s and keep only the components with the largest scores. This is the technique used to choose the components in Figure 2e of the main text.

Implementation
Code for Common Components Analysis was written in Python (version 3.43) and uses NumPy. The source code is available at: https://sites.google.com/site/charleskennethfisher/home /programs-and-data/

DATA USED IN THIS STUDY
We analyzed data on relative species abundances collected as part of the Human Microbiome Project (HMP) [22][23][24][25]. Data on species composition from the HMP were downloaded from the MG-RAST server (project 385) [26,27]. These metagenomic data correspond to 1606 human microbiota samples separated into four main body sites: gut, oral, skin, and vaginal. All unclassified and non-bacterial species were removed from the data. Each species (i) was assigned a rank (r l i ) in each bodysite (l) based on decreasing relative abundance; e.g. the most abundant species in sample l was assigned r l i = 1. Then, the median rank for each species was computed over all 1606 samples. The 100 species with the smallest median ranks (i.e., the 100 most abundant species) were selected for further study in order to make computational studies more feasible. Selecting species in this manner (i.e., by typical rank) ensures that every species is present in each bodysite. After performing the species selection procedure, two of the oral microbiota samples (MG-RAST ID 4472526.3 and 4472527.3) did not have enough species counts and were discarded. Thus, we analyzed data on the relative abundances of 100 species in 1604 samples taken from 4 different bodysites. Species compositions were computed using a pseudocount of 0.5 via: to ensure that x i > 0, which is necessary for taking the logarithm as part of the ALR transform. The raw species counts used in this study are available in a text file along with the code at the url provided in the Implementation section.

SUMMARY: INFERENCE AND MODEL VALIDATION
Inference: 3. Solve the sparse recovery problem GV =Ṽ for V.

4.
Verify that the CoCA model validation steps fail with randomized covariance matrices.

Strategy for randomization
Validating the model underlying CoCA requires a test that the covariances matrices (Ψ s ) are, indeed, approximately simultaneously diagonalizable. This requires two separate tests. First, we need to see if it is even possible to find an (N − 1) × (N − 1) matrixṼ and diagonal matrices Σ s such that Ψ s =ṼΣ sṼ T for each body site s ∈ {gut, oral, skin, vaginal}. Second, we need show that it is usually not possible to simultaneously diagonalize 4 randomly generated covariance matrices. This will establish that simultaneous diagonalizability is a special property of the observed covariance matrices.
In order to test the second point, we ran the CoCA algorithm multiple times using randomly generated covariance matrices. These randomized covariances matrices were constructed as follows. Let Σ (s) P CA denote the matrix with the eigenvalues of Ψ s along the diagonal. The randomized covariance matrix is given by Ψ s = QΣ (s) P CA Q T where Q is a random orthogonal matrix [28]. This randomization procedure ensures that all of the random covariance matrices for a given bodysite have the same eigenvalues, but different eigenvectors. This is the simplest null model for testing simultaneous diagonalizability.
Due to computational expense of CoCA (at least, given the current implementation), it was not possible to perform the thousands of iterations that would be necessary to compute accurate p-values. Therefore, we ran the CoCA algorithm on 20 random realizations of the covariance matrices and report the mean and standard deviation, or histograms, for each of the quantities that we computed from the observed data, as noted in the figure legends.

Results
A plot of the objective function during gradient descent is shown in Figure S2. The objective function computed from the observed covariance matrices converges to a value many standard deviations below the objective functions obtained with randomized covariance matrices. The best fit parameters from CoCA reproduce the observed covariances with reasonable accuracy, as illustrated by the Pearson correlations of R 2 = 0.99 for the gut samples, R 2 = 1.00 for the oral samples, R 2 = 0.78 for the skin samples, and R 2 = 0.86 for the vaginal samples (see top row of Figure 3, Main Text). Note that the body sites with larger sample sizes (N gut = 391, N oral = 833, N skin = 245, N vaginal = 135) have better fits. Moreover, the correlations between the observed and predicted covariances are many standard deviations larger than those obtained from the randomizations (see middle and bottom rows of Figure 3, Main Text). These results demonstrate that the observed covariances are, indeed, approximately simultaneously diagonalizable.
OnceṼ had been inferred, the shadow prices can be estimated as λ =Ṽ −1 y. If the CoCA model is correct, then the shadow prices from a particular body site s should be uncorrelated and should have covariance matrix Σ s . The top row of Figure S4 shows a comparison between Var[λ µ |s] and Σ µµ|s . The Pearson correlations between the observed and predicted values are equal R 2 = 1.00 for each of the body sites. However, the middle and bottom rows of Figure S3 show that this relationship breaks down for the randomized covariance matrices because they are not simultaneously diagonalizable.
As a final check to ensure that the CoCA model is appropriate for the HMP data, we computed histograms of the correlations between shadow prices (i.e., Corr[λ µ , λ ν |s]) within each body site. Ideally, all of the correlations would be zero but, in practice, there are weak correlations between some of the niches, as shown in the top line of Figure S4. Histograms of the shadow price correlations obtained from the randomization experiments are shown in the bottom line of Figure S4 for comparison. Clearly, the correlations from the observed shadow prices are very weak compared to those obtained via randomization. FIG. S4: Intra-body site correlations between common components. Histograms of the correlations between λ µ and λ ν , conditioned on body site, computed from observed covariance matrices (top row) and randomized covariance matrices (bottom row). These plots show that the niche availabilities obtained from the observed data are approximately uncorrelated, whereas those inferred from randomized covariance matrices are not.
The final claim that we test via randomization is that there is a significant correlation between the distance between species computed from their taxonomic classifications and the distance between species computed from their common components. The taxonomic distance between two species i and j was calculated as D tax ij = 5−O ij where O ij is the number of taxonomic levels shared by species i and j. For example, two species that belong to the same genus have taxonomic distance D tax ij = 0, whereas two species that belong to different phyla have taxonomic distance D tax ij = 5. The taxonomic classifications used in this study are included as a text file in the Supplementary Information. Each species can be represented as a vector in the CoCA derived niche space This vector describes the ability of a species to utilize each of the effective resources, weighted by the their average variabilities. The ecological distance between two species i and j was computed from their common components using the 'correlation distance,' D CoCA where θ ij is the angle between vectors v i and v j . The correlation between the taxonomic distance and the ecological distance computed for the observed covariance matrixes is R = 0.67. For comparison, a histogram of the correlations obtained from the randomization experiments is shown in Figure S6. The observed correlation is many standard deviations from the values obtained though randomization.
FIG. S5: Correlation between taxonomic and ecological distances. Histogram of the correlation between the taxonomic distance and ecological distances computed from randomized covariance matrices. The correlation obtained with the observed data is shown as a dotted red line. The true correlation lies far outside the distribution obtained from randomization.

COMPARISON WITH PRINCIPAL COMPONENTS ANALYSIS
Principal Components Analysis (PCA) is a technique that is often applied to find low-dimensional representations of microbiota data. Common Components Analysis (CoCA) shares some similarities with PCA so it is useful to compare the two techniques. The generative model for PCA is quite simple: the observed data y arise from a Gaussian distributed latent variable λ via y =Ṽλ. Here, the elements of, say λ µ and λ ν , are uncorrelated andṼ is orthogonal. Thus, the main difference between the generative model for PCA and the generative model for CoCA is that λ is assumed to be Gaussian distributed in PCA, whereas λ is drawn from a mixture of Gaussians in CoCA ( Figure S5). As a result, PCA finds a set of directions that explain the total variation over all of the samples, whereas CoCA finds a set of directions that explain the intra-bodysite variances. Figure S7 presents a comparison of PCA and CoCA on the HMP data. The assumptions of the generative model for PCA are clearly violated by the HMP data; the bodysites form coherent clusters and, therefore, cannot be driven by Gaussian distributed resources. Nevertheless, PCA can still be used as a technique for dimensionality reduction and data visualization. Both PCA and CoCA are able to explain compositional variation in the microbiota with a few components ( Figure S7a,d). However, the two largest principal components fail to separate all four bodysites ( Figure S7e). PCA fails to separate the bodysites because it finds directions that explain total variability, which is a sum of inter-bodysite differences and intra-bodysite variation. Thus, directions with the largest principal values may have both large inter-bodysite differences and large intra-bodysite variation. Separating the bodysites, however, requires one to identify directions with large inter-bodysite differences and small intra-bodysite variation. Thus, CoCA is able to identify directions that cluster the bodysites more effectively.
FIG. S6: Schematic comparison of CoCA and PCA. Each species corresponds to a point in a high dimensional niche space. Fluctuations in the availabilities of the niches from person-to-person cause fluctuations in the relative abundances of the species. If the distribution of niche availabilities does not depend on the bodysite (e.g. gut, skin, etc) then the log-ratio transformed abundances are Gaussian distributed, and the structure of niche space can be inferred using Principal Components Analysis (PCA) by finding the set of axes with the largest variation. If the distribution of the niche availabilities does depend on the bodysite, however, then the log-ratio transformed abundances are drawn from mixture of Gaussians and maximum likelihood fitting of the model identifies a common set of axes, or common components, that approximately diagonalize the covariance matrices in each of the bodysites. The KEGG orthologies (Brite hierarchy 00001) for every available strain of each of the 100 species were downloaded from the KEGG database (http://www.genome.jp/kegg/kegg2.html, [29]). For each enzyme α (EC number) in the KEGG orthology and for each species i, we computed the fraction f iα of strains that possessed that enzyme. For example, EC:1.1.1.1 (Alcohol dehydrogenase) was present in every strain of Streptococcus pneumoniae giving it a value of f iα = 1. In this way, we constructed a matrix describing the presence/absence of each of the enzymes in all 100 species.
Enzymes in the KEGG database are also organized into functional pathways such as Lysine biosynthesis and Glycolysis/Gluconeogenesis. For each pathway, we computed a distance between species as: We performed a simple regression analysis to assess which of the KEGG pathways were associated with the ecological distance computed with CoCA by analyzing the linear model: where η ij is the residual noise. It is important to note that we make a simplifying assumption that the noise (η ij ) terms are identically and independently distributed normal random variables.

Selecting Relevant Pathways with Bayesian Linear Regression
We assessed the relevance of each pathway using a framework based on Bayesian statistics. The goal is to compute a 'posterior' probability that the coefficient associated with a pathway is not equal to zero (i.e., CoCA is the matrix of squared distances computed from CoCA). However, because it is computationally challenging to compute exact posterior probabilities for variable selection we a use recently described approach called the Bayesian Ising Approximation (BIA) [30,31]. For the sake of completeness, we provide a brief description of the BIA below -more detailed results are described by Fisher and Mehta [30].
It will be helpful to introduce the half vectorization operator vech(mat) that takes the elements below the diagonal from each column in the matrix mat and stacks them into a vector. Using this notation, we define y = vech(D 2 CoCA ) and the design matrix X as the matrix with columns vech(D 2 path ). Moreover, we standardize y and each column of X to have zero mean and unit variance, which eliminates the need for the constant term in Equation S23.
Bayesian methods combine the information from the data, described by the likelihood function, with a priori knowledge, described by a prior distribution, to construct a posterior distribution that describes one's knowledge about the parameters after observing the data. In the case of linear regression, the likelihood function is a Gaussian: In this work, we will use standard conjugate prior distributions for β and σ 2 given by P (β, σ 2 |s) = P (σ 2 )P (β|σ 2 , s) where: Here, we have introduced a vector (s) of indicator variables so that β j = 0 if s j = −1 and β j = 0 if s j = +1 for the j th pathway. We also have to specify a prior for the indicator variables, which we will set to a flat prior P (s) ∝ 1 for simplicity. In principle, a 0 , b 0 and the penalty parameter on the regression coefficients, λ, are free parameters that must be specified ahead of time to reflect our prior knowledge. We will discuss these parameters in the next section.
We have set up the problem so that identifying which pathways are relevant is equivalent to identifying those features for which s j = +1. Therefore, we need to compute the posterior distribution for s, which can be determined from Bayes' theorem using P λ (s|y) ∝ dβdσ 2 P (y|β, σ 2 )P λ (β, σ 2 |s)P (s). While the integral can be computed exactly, computing the marginal distributions (i.e. P λ (s j |y)) is not computationally feasible. Therefore, we use an approach called the Bayesian Ising Approximation (BIA). The BIA approximates the posterior distribution of the indicator variables using an Ising model described by: where the external fields (h i ) and couplings (J ij ) are defined as: Here, r(z 1 , z 2 ) is the Pearson correlation coefficient between variables z 1 and z 2 . In writing this expression we have assumed that the hyperparameters a 0 and b 0 are small. The BIA approximation expansion converges as long as: where r = p −1 (p − 1) −1 i =j r 2 (X i , X j ) is the root mean square correlation between features.
To perform feature selection, we are interested in computing marginal probabilities P λ (s j = 1|y) (1 + m j (λ))/2, where we have defined the magnetizations m j (λ) = s j . While there are many techniques for calculating the magnetizations of an Ising model, we focus on the mean field approximation which leads to a selfconsistent equation: This mean field approximation provides a computationally efficient tool that approximates Bayesian feature selection for linear regression, requiring only the calculation of the Pearson correlations and solution of Equation S28.
As with other approaches to penalized regression, our expressions depend on a free parameter (λ) that determines the strength of the prior distribution. As it is usually difficult, in practice, to choose a specific value of λ ahead of time it is often helpful to compute the feature selection path; i.e. to compute m j (λ) over a wide range of λ's. Indeed, computing the variable selection path is a common practice when applying other feature selection techniques such as LASSO regression [32]. To obtain the mean field variable selection path as a function of = 1/λ, we notice that lim →0 m j ( ) = 0 and so define the recursive formula: i.e., KEGG pathway) is relevant for computing the ecological distance between species as a function of the variance of the prior distribution (i.e., the inverse of the regularization parameter). The pathways with a posterior probability greater than 0.95 when the inverse regularization parameter is one (i.e, λ * /λ = 1) are shown in red.
The feature selection path computed for the KEGG pathways using the BIA is shown in Figure S8. This is a plot of P λ (β j = 0|X) as a function of λ * /λ. We focus on the point where λ = λ * , which corresponds to the weakest prior distribution for which the BIA is applicable. We defined pathways as significantly associated if they had a posterior probability greater than 0.95. For comparison, posterior probabilities computed using Monte Carlo simulations of the exact posterior at λ = λ * are shown Figure S9. The 17 significant pathways are: Aminoacyl-tRNA biosynthesis, Carbon fixation in photosynthetic organisms, Carbon metabolism, Citrate cycle (TCA cycle), Cysteine and methionine metabolism, Glutathione metabolism, Glycolysis/Gluconeogenesis, Homologous recombination, Lipoic acid metabolism, Lipopolysaccharide biosynthesis, Lysine biosynthesis, One carbon pool by folate, Porphyrin and chlorophyll metabolism, Pyrimidine metabolism, Pyruvate metabolism, Thiamine metabolism, and Vitamin B6 metabolism. A linear regression using just these relevant pathways has a correlation coefficient of R 2 = 0.47.
The BIA is, by definition, an approximation to the posterior probabilities. Although previous results suggest that the approximation is quite good [30], we performed Monte Carlo simulations of the exact posterior distribution with λ = λ * to validate that our conclusions were not highly sensitive to the approximation. A comparison of the results from BIA and Monte Carlo simulations is shown in Figure S9. Four additional pathways (i.e., Carbon fixation pathways in prokaryotes, Pentose phosphate pathway, Purine metabolism, RNA degradation) achieve the 0.95 significance threshold according to the Monte Carlo simulations, but all 17 pathways identified as relevant by the BIA were also identified as relevant by Monte Carlo.
FIG. S9: Comparison of BIA to Monte Carlo simulations. Posterior probabilities estimated using the BIA compared to those computed with Monte Carlo simulations for λ = λ * . Four pathways (shown) reach a posterior probability of 0.95 for Monte Carlo, but not for BIA. All pathways that reached the 0.95 threshold for relevance with the BIA also reached the relevance threshold with Monte Carlo.