Abstract
How can we figure out how the different microbes interact within microbiomes? To combine theoretical models and experimental data, we often fit a deterministic model for the mean dynamics of a system to averaged data. However, in the averaging procedure a lot of information from the data is lost—and a deterministic model may be a poor representation of a stochastic reality. Here, we develop an inference method for microbiomes based on the idea that both the experiment and the model are stochastic. Starting from a stochastic model, we derive dynamical equations not only for the average, but also for higher statistical moments of the microbial abundances. We use these equations to infer distributions of the interaction parameters that best describe the biological experimental data—improving identifiability and precision. The inferred distributions allow us to make predictions but also to distinguish between fairly certain parameters and those for which the available experimental data does not give sufficient information. Compared to related approaches, we derive expressions that also work for the relative abundance of microbes, enabling us to use conventional metagenome data, and account for cases where not a single host, but only replicate hosts, can be tracked over time.
Citation: Zapién-Campos R, Bansept F, Traulsen A (2024) Stochastic models allow improved inference of microbiome interactions from time series data. PLoS Biol 22(11): e3002913. https://doi.org/10.1371/journal.pbio.3002913
Academic Editor: Isabel Gordo, Instituto Gulbenkian de Ciência, PORTUGAL
Received: February 17, 2024; Accepted: October 24, 2024; Published: November 21, 2024
Copyright: © 2024 Zapién-Campos et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. The data and software, including Jupyter Notebooks, used to generate the results of this paper are available in Zenodo (https://doi.org/10.5281/zenodo.13958305). The mice microbiome data OMM12 are from https://doi.org/10.3389/fmicb.2019.02999 whose authors may be contacted at stecher@mvp.lmu.de.
Funding: We are grateful for funding from the German Science Foundation (Deutsche Forschungsgemeinschaft, DFG) within the Collaborative Research Center 1182 (Project-ID 261376515), project A4.1 (A.T.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Numerous studies have shown how important the microbiome is for its host, ranging from development to health [1,2]. The promise of manipulating the microbiome relies on having understood the ecological and evolutionary processes operating on it [3]. Although metagenomics studies have widely characterized microbiome samples [4], their connection to mathematical models and eco-evolutionary theories lags behind. Part of the gap is explained by an intrinsic difficulty in analyzing microbiome data [5], in particular, the inverse problem of robustly inferring model parameters (and thus interactions between microbes) from data. Despite this difficulty, researchers have striven to enable the widespread use of parameter inference software in microbiome studies [6,7]. Pioneering work using linear regression to infer interactions of the linearized Lotka–Volterra model [8] showed that matching the microbiome composition dynamics does not imply matching the true value of interactions in simulations [5]. This apparent contradiction stems from 2 challenges. First, in some models the values of individual parameters cannot be told apart; this structural identifiability problem occurs even for infinite noiseless data [9]. Remien and colleagues showed, for example, that a Lotka–Volterra model of relative abundances is only locally identifiable; thus, without absolute abundance data, interactions cannot be uniquely inferred in their deterministic model [10]. Second, as Cao and colleagues [5] discuss extensively, incomplete data and the high dimensionality of the parameter space limit inference in practical ways. In addition, measurement noise in the data makes the inference problem more challenging. There are indications that stochastic models, which track more statistical information than deterministic models, can overcome these challenges to some extent [11]. Using stochastic modeling, parameters were successfully inferred in systems biology [9,12] and cancer studies [13].
Here, we combine Bayesian inference—where probability distributions are inferred for the parameter values [14]—and stochastic modeling (akin to [9,12,13]) to improve parameter inference in microbiome studies. We propose a computational workflow that goes from microscopic transition rates in a mathematical model—describing ecological and evolutionary events (such as birth, migration, mutation, or speciation)—to macroscopic dynamics of the statistical moments of microbiome composition [9,12,13], see Fig 1. This Bayesian inference workflow, which naturally bypasses known limitations of linear regression (i.e., point value) inference [5], is sufficiently flexible to test different mathematical models and microbiome samples while quantifying the parameter uncertainty stemming from data limitations (Fig 1C), including measurement noise. We use 2 classical ecological models—logistic growth and the Lotka–Volterra model—to illustrate its application on data sets describing absolute or relative abundances of microbes. For the relative abundance models, we show that our workflow overcomes non-identifiability of communities with a small number of types, enabling parameter inference from conventional metagenome data. The workflow outlined here bridges a gap between microbiome data and theoretical modeling by addressing fundamental and practical aspects to infer microbial interactions.
Fig 1. (A) Mathematical models serve as a link between parameters and data, either to simulate biological processes or to infer parameters from data. (B) Data sets are obtained by longitudinal sampling of the same hosts or of an ensemble of hosts. (C) Workflow from microscopic rates of a model and experimental data to inference of parameter values by ABC. The microscopic rates describe possible eco-evolutionary events (such as birth, migration, mutation, or speciation), leading to macroscopic patterns (statistical moments of abundance). Data sets describe absolute abundances (counts) or relative abundances (frequencies) of microbes. To quantify the probability of parameter values given a data set, prior knowledge about the parameters is updated to a posterior distribution based on the agreement of the model with the data. Note that because the model describes the dynamics continuously, no correlation between time points is needed. Figure created in BioRender.com under a CC-BY-NC-ND license.
Results
Developing an inference workflow
We propose a parameter inference workflow grounded on a mechanistic description of the dynamics of absolute abundances in a microbiome (Fig 1C). For simplicity, let us define a vector n, where each element corresponds to the population of a microbial type. We can write down microscopic transition rates T describing changes in the microbiome composition of one host, n, to other compositions n′, given the set of parameters θ,
\[ T(\mathbf{n}' \mid \mathbf{n}, \theta). \tag{1} \]
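Now, instead of tracking the microbiome composition n in a single host, we can describe how the probability of a microbiome composition n in an ensemble of hosts, P(n, t), changes with time,
\[ \frac{\partial P(\mathbf{n}, t)}{\partial t} = \sum_{\mathbf{n}' \neq \mathbf{n}} \Big[ \underbrace{T(\mathbf{n} \mid \mathbf{n}', \theta)\, P(\mathbf{n}', t)}_{\text{probability influx}} - \underbrace{T(\mathbf{n}' \mid \mathbf{n}, \theta)\, P(\mathbf{n}, t)}_{\text{probability outflux}} \Big]. \tag{2} \]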
This expression, called the master equation [15], allows us to compute statistical information about the microbiome composition beyond its mean behavior. Here, the probability influx and outflux terms indicate an increase or decrease in the probability of composition n caused by transitions from and to other microbiome compositions (n′). Therefore, the dynamics of the microbiome depend on the ecological and evolutionary processes contained in the transition rates.
Using the master equation, we derive equations for the statistical moments of the microbiome composition in an ensemble of hosts—namely, the product of the master equation by a variable of interest (gk, where k is an identifying index) summed over all possible microbiome compositions n,
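\[ \frac{\partial \langle g_k \rangle}{\partial t} = \sum_{\mathbf{n}} g_k\, \frac{\partial P(\mathbf{n}, t)}{\partial t}. \tag{3} \]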
This is a way to average a variable from the model. For example, computing the average abundance of microbial type i implies setting gk = ni. If we set gk = ninj, we obtain an equation for the co-moment of microbial types i and j. The resulting equations describe the expected macroscopic dynamics of the microbiome: tracking a large stochastic system without an explosive computational burden. Now, to extract sufficient statistical information from the model we can derive several of these equations, even as many as the number of free parameters. For example, in a Lotka–Volterra model with S microbial types, there are S growth rates and S² intra- and inter-specific interactions, amounting to S + S² parameters. We could derive S + S² equations to match the number of parameters, including S equations for the first moments 〈nk〉, S for the second moments 〈nk²〉, and S(S − 1) for the co-moments 〈nknl〉 and covariances 〈nk,nl〉 (see the Methods). Note that each equation can depend on the vector of other moments, i.e., 〈gk〉 = f(〈g〉,θ,t). While in some models moments will depend on moments of equal or lower order (“closed equations”), in others they will depend on even higher order moments, leading to an infinite system of interdependent equations. Because closure is required to solve any system of equations, we illustrate how to approximate higher-order moments in the Methods. Note that in spite of the large system of equations to solve, our approach exploits the fact that, except for “closed equations,” many equations are linear and thus quickly solved by conventional ODE solvers. Here, we presented only the generic derivation of the workflow; a step-by-step derivation from microscopic rates up to second-order moments for the logistic growth and Lotka–Volterra models can be found in the Methods. Such models include conventional ecological events, such as growth, death, immigration, and direct and indirect interactions.
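As a minimal illustration of closed moment equations, the sketch below uses a toy linear birth-death-immigration process for a single microbial type. This is an assumption made for illustration, not one of the models analyzed in this paper: a type with abundance n increases by one at rate birth · n + immigration and decreases by one at rate death · n. Multiplying the master equation by n or n² and summing over n gives two closed equations, which are integrated here with a hand-written fixed-step Runge–Kutta scheme. All function and parameter names are hypothetical.

```python
def moment_rhs(m1, m2, birth, death, immigration):
    """Time derivatives of <n> (m1) and <n^2> (m2) for the toy process.

    Assumed rates: n -> n + 1 at birth * n + immigration,
                   n -> n - 1 at death * n.
    """
    net = birth - death
    dm1 = net * m1 + immigration
    dm2 = 2.0 * net * m2 + (birth + death + 2.0 * immigration) * m1 + immigration
    return dm1, dm2


def rk4_step(m1, m2, h, birth, death, immigration):
    """One classical fourth-order Runge-Kutta step of size h."""
    a1, b1 = moment_rhs(m1, m2, birth, death, immigration)
    a2, b2 = moment_rhs(m1 + 0.5 * h * a1, m2 + 0.5 * h * b1, birth, death, immigration)
    a3, b3 = moment_rhs(m1 + 0.5 * h * a2, m2 + 0.5 * h * b2, birth, death, immigration)
    a4, b4 = moment_rhs(m1 + h * a3, m2 + h * b3, birth, death, immigration)
    return (m1 + h * (a1 + 2.0 * a2 + 2.0 * a3 + a4) / 6.0,
            m2 + h * (b1 + 2.0 * b2 + 2.0 * b3 + b4) / 6.0)


def solve_moments(n0, birth, death, immigration, t_end, steps=2000):
    """Integrate both moments from a known initial abundance n0 up to t_end."""
    h = t_end / steps
    m1, m2 = float(n0), float(n0) ** 2  # every host starts with exactly n0 cells
    for _ in range(steps):
        m1, m2 = rk4_step(m1, m2, h, birth, death, immigration)
    return m1, m2


if __name__ == "__main__":
    mean, second = solve_moments(n0=10, birth=1.0, death=1.2, immigration=5.0, t_end=100.0)
    print("mean:", mean, "variance:", second - mean ** 2)
```

For death > birth, this toy system relaxes to the stationary mean immigration / (death − birth) and variance immigration · death / (death − birth)², that is, 25 and 150 for the values used above. Nonlinear models such as logistic growth or Lotka–Volterra would additionally need a moment closure, which this sketch does not cover.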
We now have the elements to infer the parameters θ from microbiome data. The focus now switches to the fitting method, with 2 possibilities: likelihood-based methods such as Markov Chain Monte Carlo (MCMC) [16] or likelihood-free methods such as Approximate Bayesian Computation (ABC) [17], which use the dynamical equations instead. Here, we opt for ABC as true likelihoods of stochastic models can rarely be derived [14]; however, MCMC assuming a pseudo-likelihood (e.g., a Gaussian likelihood) can be a promising alternative to optimize computational efficiency. The idea of ABC is to identify feasible parameter values by comparing the data to dynamical model predictions [14]. Specifically, for any given set of parameter values θ, a distance metric between the numerical solution of the equations for the moments, 〈gk〉, and the equivalent moments from data, 〈ĝk〉, is estimated, e.g.,
\[ d = \sqrt{\sum_i \sum_k \left( \langle g_k \rangle_i - \langle \hat{g}_k \rangle_i \right)^2} \tag{4} \]
for the Euclidean distance (the effect of rescaling some moments is shown in S1 Fig), where the sum over i refers to the data points, and the sum over k refers to the different moments. If this distance is smaller than a threshold ε, the set θ is considered to be a valid parameter estimate. By testing sets of parameters sampled according to an expectation—the prior distribution—and recording those below the threshold ε, a posterior distribution of the parameters reflecting the uncertainty of the inference can be obtained (Fig 1C). With a smaller threshold ε, this posterior can become the new prior and the process can be iterated to narrow down the parameter distributions. This method is called Approximate Bayesian Computation—Sequential Monte Carlo (ABC-SMC). We show how to choose prior distributions of the parameters in Tables 3–5.
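The accept/reject step described above can be sketched as a plain ABC rejection sampler. The example below is only a sketch under stated assumptions: it reuses a toy linear birth-death-immigration process (not a model from this paper), treats the birth rate as known, draws the death and immigration rates from uniform priors, and uses a Euclidean distance in which each moment is rescaled by its data value. It does not implement the sequential (ABC-SMC) refinement, and all names are hypothetical.

```python
import math
import random


def moment_rhs(m1, m2, birth, death, immigration):
    """d<n>/dt and d<n^2>/dt for the toy birth-death-immigration process."""
    net = birth - death
    return (net * m1 + immigration,
            2.0 * net * m2 + (birth + death + 2.0 * immigration) * m1 + immigration)


def moments_at(times, n0, birth, death, immigration, h=0.05):
    """Integrate the moments with RK4 and report (m1, m2) at each sampling time."""
    m1, m2, t, out = float(n0), float(n0) ** 2, 0.0, []
    for t_target in times:
        steps = max(1, round((t_target - t) / h))
        dt = (t_target - t) / steps
        for _ in range(steps):
            a1, b1 = moment_rhs(m1, m2, birth, death, immigration)
            a2, b2 = moment_rhs(m1 + 0.5 * dt * a1, m2 + 0.5 * dt * b1, birth, death, immigration)
            a3, b3 = moment_rhs(m1 + 0.5 * dt * a2, m2 + 0.5 * dt * b2, birth, death, immigration)
            a4, b4 = moment_rhs(m1 + dt * a3, m2 + dt * b3, birth, death, immigration)
            m1 += dt * (a1 + 2.0 * a2 + 2.0 * a3 + a4) / 6.0
            m2 += dt * (b1 + 2.0 * b2 + 2.0 * b3 + b4) / 6.0
        t = t_target
        out.append((m1, m2))
    return out


def distance(model, data):
    """Euclidean distance over time points and moments, rescaled by the data."""
    return math.sqrt(sum(((g - g_hat) / g_hat) ** 2
                         for model_t, data_t in zip(model, data)
                         for g, g_hat in zip(model_t, data_t)))


def abc_rejection(data, times, n0, birth, prior, eps, n_samples, seed=0):
    """Keep every prior draw whose predicted moments lie within eps of the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        death = rng.uniform(*prior["death"])
        immigration = rng.uniform(*prior["immigration"])
        d = distance(moments_at(times, n0, birth, death, immigration), data)
        if d < eps:
            accepted.append((death, immigration, d))
    return accepted


# Hypothetical setup: 6 sampling times, known birth rate and initial abundance.
TIMES = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
N0, BIRTH = 10, 1.0
PRIOR = {"death": (1.05, 1.5), "immigration": (1.0, 10.0)}
```

A call such as `abc_rejection(data, TIMES, N0, BIRTH, PRIOR, eps=0.3, n_samples=2000)`, with `data` holding the moments estimated from replicates, returns the accepted parameter pairs, whose spread approximates the posterior. In the sequential scheme, these accepted pairs would seed the next round with a smaller ε.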
Properties of microbiome data
Given a microbiome data set of abundances with replicates, all statistical moments can be estimated from it. Concretely, this is done by averaging the variable of interest gk, over all replicates in each specific time point (Fig 1C). For example, for gk = ni the replicates of ni are summed over and divided by the number of replicates, while for gk = ninj, the products of ni and nj for each replicate are computed, then summed over and divided by the number of replicates.
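The averaging over replicates described above can be sketched as follows for a single time point. The data layout (`counts[r][i]` holding the abundance of type i in replicate r) and the function name are assumptions made for this illustration.

```python
def empirical_moments(counts):
    """Estimate first moments and second (co-)moments from replicates.

    counts[r][i] is the abundance of microbial type i in replicate r,
    all sampled at the same time point.
    Returns (first, second) with first[i] = <n_i> and second[i][j] = <n_i n_j>.
    """
    n_rep = len(counts)
    n_types = len(counts[0])
    # First moments: sum each type over replicates, divide by their number.
    first = [sum(rep[i] for rep in counts) / n_rep for i in range(n_types)]
    # Second moments and co-moments: average the per-replicate products.
    second = [[sum(rep[i] * rep[j] for rep in counts) / n_rep
               for j in range(n_types)]
              for i in range(n_types)]
    return first, second


if __name__ == "__main__":
    # Four replicates, two microbial types (made-up numbers).
    replicates = [[2, 4], [4, 8], [6, 0], [0, 4]]
    first, second = empirical_moments(replicates)
    covariance = second[0][1] - first[0] * first[1]
    print(first, second, covariance)
```

Note that the products are computed within each replicate before averaging; multiplying the averaged abundances instead would discard the covariance between types.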
Microbiome data is nowadays typically produced by metagenome sequencing. Conventionally, for technical reasons, metagenomics only quantifies the relative abundance of each microbial type in a sample (Fig 1C) [18]. More recently, some studies have measured absolute numbers of culturable [19] and non-culturable microbes in samples [20]. We call these counts absolute abundances.
Our former equations only track moments of absolute abundance, 〈gk〉. As Gloor and colleagues [18] show, inferring parameters from relative abundance (xk) data using these would lead to spurious correlations (Fig 2A and 2B). To find equivalent expressions for the statistical moments of relative abundance, we define nΣ≡∑j nj, the total microbiome population, and the dynamical equation for its first moment, 〈nΣ〉, to be used as a scaling factor. A transformation to moments of relative abundances, 〈γk〉, is given by
(5)
Because relative abundance data sets lack information about the scaling factor, its initial condition, 〈nΣ〉 at t = 0, must be inferred as a free parameter, one parameter more than for absolute abundance data. This scaling factor can sometimes be the quantity of interest itself [21]. Note that because the relative abundances add up to one, ∑k xk = 1, the number of independent equations for the microbial types decreases by 1, but the number of parameters per type remains the same. A detailed derivation of transformations to relative abundance for the logistic growth and Lotka–Volterra models is shown in the Methods.
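On the data side, the constraint ∑k xk = 1 can be made concrete with a short sketch. It converts replicate counts to relative abundances and estimates their moments; the function name and data layout are assumptions for illustration, and the model-side transformation of Eq (5) is not reproduced here.

```python
def relative_moments(counts):
    """Moments of relative abundance x_k = n_k / sum_j n_j across replicates.

    counts[r][k] is the abundance of type k in replicate r at one time point.
    Returns (first, second) with first[k] = <x_k> and second[k][l] = <x_k x_l>.
    """
    rel = []
    for rep in counts:
        total = sum(rep)
        if total <= 0:
            raise ValueError("each replicate needs a positive total abundance")
        rel.append([n / total for n in rep])
    n_rep, n_types = len(rel), len(rel[0])
    first = [sum(x[k] for x in rel) / n_rep for k in range(n_types)]
    second = [[sum(x[k] * x[l] for x in rel) / n_rep for l in range(n_types)]
              for k in range(n_types)]
    return first, second


if __name__ == "__main__":
    first, second = relative_moments([[2, 2], [1, 3]])
    print(sum(first))                 # first moments add up to one
    print(sum(second[0]), first[0])   # each row of co-moments adds up to <x_k>
```

Because the relative abundances add up to one in every replicate, the first moments add up to one and each row of second moments adds up to the corresponding first moment. These identities are why one equation per moment order carries no independent information.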
Fig 2. (A, B) Time series comparison between simulations (dots, derived from only 4 replicates) and equations for the statistical moments (lines) of absolute (nk) and relative abundance (xk) sharing true parameters (found in Tables 1 and 2). Two models with 3 microbial types (S = 3) were tested, (A) logistic growth with immigration and death 3S + 1 = 9 parameters and (B) Lotka–Volterra S + S² = 12 parameters. Inferred parameter posteriors from the relative abundance are compared to true parameters (dashed lines) and priors (black distributions). All microbial types shared the same priors (Tables 3 and 4). (C) The inferred interactions for the Lotka–Volterra model resembled the true interactions, qualitatively (arrowheads) and quantitatively (arrow thickness), with various certainties (grayscale, defined by the ratio of SD of posterior to prior). (D) For both data sets, the most probable model was identified correctly. The settings for the inference are listed in Table 6 (a.u. = time units are determined by the rates, see Tables 1 and 2). Networks on the left of A and B were created in BioRender.com under a CC-BY-NC-ND license. The data underlying this figure can be found in https://doi.org/10.5281/zenodo.13958305.
While some studies track the microbiome of the same host over time, in many microbiome studies, replicate hosts are sampled at different time points and pooled together to produce a single time series (Fig 1B). This is the case when hosts are sacrificed while sampling as in experimental studies of Drosophila melanogaster, Caenorhabditis elegans [22], and Hydra vulgaris [19,23]. In contrast to deterministic models, the workflow shown here can deal with pooled hosts as it accounts for stochastic demography. Concretely, akin to the concept of biological replicates, if the parameter values and initial conditions are the same in each host sampled, we can account for their emerging demographic differences, i.e., expected differences in microbiome composition resulting from a stochastic reality.
Finally, our workflow does not make assumptions about the experimental technology to obtain microbial abundance data. However, it is important to be aware of potential biases introduced while obtaining and preprocessing raw data [24].
The advantages of our workflow for inference
Deriving dynamical equations for the moments is more cumbersome than writing down deterministic equations. Nevertheless, the additional effort pays off for inference in at least 2 ways:
- Firstly, the dynamics of the moments use more information contained in the data, increasing the chance of estimating the true parameter values (Fig 3).
- Secondly, the larger number of equations and their structural differences can improve the structural identifiability of the parameters, guaranteeing that for infinite noiseless data their unique value can be known.
Fig 3. We inferred all parameter values (Table 2) from simulated absolute abundance data as in Fig 2. While our workflow used the same setup as in Fig 2, the linear regression method was based on [8] without time-dependent perturbations or regularization. Our Bayesian workflow successfully “locates” the true parameter values, along with their uncertainty, even if the linear regression method does not. The initial parameter guess for linear regression was close to the true value (1.5 for growth rates, and −10⁻⁴ and 0 for intra- and inter-specific interactions). For our workflow, we used the same parameter priors as in Fig 2, summarized in Table 4. The data underlying this figure can be found in https://doi.org/10.5281/zenodo.13958305.
Using identifiability software, Browning and colleagues [9] showed that parameters can become identifiable when dynamical moments are considered. Such gain depends on the combined effect of the number of equations, sampled time points, and latent (non-measured) variables. We used GenSSI [25], a Matlab package that uses generating series and identifiability tableaus to test the identifiability of a model (Fig 2 and Methods). Its expansion of the dynamical model around sampled points to extract the information available about the parameters is one of the most used methods for nonlinear systems [26,27]. We found that for absolute abundance, Lotka–Volterra is globally identifiable, while logistic growth has a finite number of possible parameter values and is thus locally identifiable. The relative abundance models retained these identifiability categories, improving on the local identifiability reported for a deterministic Lotka–Volterra model [10].
Overall, statistical moments can improve structural identifiability [9], narrowing down the success of inference to the properties—quality and amount—of the data. In the following, we illustrate this practical aspect with guarantees of improved identifiability and inference of parameters from absolute and relative abundance microbiome data.
Inference from simulated and empirical data
We tested our inference workflow in 2 ways. Firstly, we inferred parameters from simulated relative abundance data (Fig 2) to compare each inferred value to its known true value. Our approach, with 3 microbial types, proved successful in models with and without inter-specific interactions, namely, data from Lotka–Volterra and logistic growth simulations. In fact, in contrast to linear regression using a deterministic model [8], our approach “located” the true Lotka–Volterra interaction values every time (Fig 3). The uncertainty of the parameters reflected the limitations of the data, e.g., death rates being more uncertain as a result of data only tracking a growth phase (Fig 2A). Beyond parameter values and certainty, we were able to identify the correct data-generating model each time. Importantly, only 6 time points and only 4 replicates were included in each data set, a realistic scenario for experimental studies.
Measurement noise can increase the uncertainty of inferred values. To test our workflow, we inferred parameters from simulated Lotka–Volterra data with increasing amounts of noise (Figs 4 and S2). Although noise can be influenced by many factors [24], we focused on a case where a shared noise distribution affects each microbial type at each time point. Inferring parameters from relative abundance data led to larger uncertainty than inferring from absolute abundance data. However, in both cases, uncertainty was reduced by having more replicates and/or time points (Fig 4), with the number of time points having a stronger effect.
Fig 4. We inferred all parameters from simulated data as shown in Fig 2 (see Table 2). For simplicity, we show the effect of noise on a single parameter with true value I3,2 = −4.1 · 10⁻⁵ (dashed line). The effect on all parameters is shown in S2 Fig. We simplified the nuances of empirical noise [24] assuming a scenario where all microbial abundances are affected proportionally. Concretely, a uniform noise distribution was shared among all microbial types and constant through time. For low noise, data could be altered by up to ±5%, while for medium and high noise, by up to ±10% and ±20%. Noise was sampled independently for each microbial type at each time point, affecting the absolute abundances from which relative abundances were computed. (A) A larger number of replicates and/or time points help reduce the increased uncertainty caused by noise. In particular, the number of time points has a stronger effect than the number of replicates. (B) The uncertainty obtained from relative abundance data is consistently larger than from absolute abundance. Still, more replicates and/or sampling time points help to reduce the uncertainty. The data underlying this figure can be found in https://doi.org/10.5281/zenodo.13958305.
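The noise scenario described above can be sketched in a few lines. This is a sketch of one reading of that description, assuming the noise acts as an independent multiplicative factor drawn uniformly from [1 − level, 1 + level] for each microbial type at each time point; the function names and the example abundances are hypothetical.

```python
import random

# Maximum proportional change for each noise scenario (from the text).
NOISE_LEVELS = {"low": 0.05, "medium": 0.10, "high": 0.20}


def add_uniform_noise(abundances, level, rng):
    """Perturb each absolute abundance by an independent factor in [1 - level, 1 + level]."""
    return [n * (1.0 + rng.uniform(-level, level)) for n in abundances]


def to_relative(abundances):
    """Convert absolute abundances to relative abundances that sum to one."""
    total = sum(abundances)
    return [n / total for n in abundances]


if __name__ == "__main__":
    rng = random.Random(0)
    true_counts = [1000.0, 250.0, 4000.0]  # made-up abundances at one time point
    for name, level in NOISE_LEVELS.items():
        noisy = add_uniform_noise(true_counts, level, rng)
        print(name, noisy, to_relative(noisy))
```

Applying the noise to the absolute abundances first and normalizing afterwards follows the order stated above, so the noisy relative abundances still add up to one.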
The encouraging results from simulations led us to apply our inference workflow to experimental data. We used for this a thoroughly measured time series of replicates with a small number of microbial types: the absolute abundance data of OMM12, a reduced mouse microbiome [28] (Fig 5). This data set tracks the growth of microbes in the gut from a germ-free state. We used the logistic growth model, which describes transient and equilibrium stages, to illustrate our approach. The inferred posteriors suggested the growth rates of Akkermansia muciniphila, Bacteroides caecimuris, Bifidobacterium longum, and Muribaculum intestinale to be most certain, with average doubling times ranging from hours to days. Meanwhile, except for B. caecimuris, the average death and immigration rates were less certain, ranging from ≈4 · 10⁵ to 1.4 · 10⁶ cells per day. Most of the certainties obtained from empirical data (Fig 5) are smaller than those from simulations (Fig 2), highlighting the limits of the model tested and inference from noisy, empirical data. However, in each case, we obtained a set of parameters, capturing interactions between microbes, with some level of certainty. Our results point to selection as the ecological driver of the OMM12 dynamics, despite a possible compatibility of this data with a neutral hypothesis once it has reached steady state [29]. In this case, neutrality would imply that the parameter posteriors overlap between all microbial types, which is not the case (Fig 5).
Fig 5. The parameters of a logistic growth with immigration and death model were inferred from a mouse dataset. The Oligo-Mouse-Microbiota (OMM12) data set [28] tracks a 12-species defined mouse microbiome (S = 12), where the absolute abundances in the same individuals were sampled from feces 11 times over 99 days. (A) We analyzed the first 21 days, for which 4 replicates are available; we show here the abundance of all 12 types averaged over the 4 replicates. We use the underlying data to infer the parameters of a logistic growth model with growth, death, and immigration, with in total S + S² = 156 moments used for the inference. (B) Of the 3S + 1 = 37 parameters inferred, we show only the posteriors of the 5 most certain ones (defined by the ratio of SD of posterior to prior as a relative comparison of the certainty gained between parameters). All microbial types shared the same uniform priors (black lines, Table 5) to have a fair measure of the parameter uncertainty reduced. (C) The parameters inferred for each species varied widely with various certainties. For the shared carrying capacity, we found an average N ≈ 1.45 · 10⁷ bacterial cells, ±3.49 · 10⁵ cells, and uncertainty of 0.0582. A system of 156 equations was solved (S = 12 first moments and S² = 144 second moments and co-moments). The settings for the inference are listed in Table 6. The data underlying this figure can be found in https://doi.org/10.5281/zenodo.13958305.
Discussion
Our work is motivated by the goal of understanding how microbes interact and the need to quantify the uncertainty of parameters (interactions) inferred from microbiome data. Although point-value inference methods have been used previously [8], several issues limit their quantitative application, restricting them to recreating qualitative patterns of data [5]. A major issue is that models often have more parameters than equations [5,8,30], limiting the information to constrain the large number of interactions to infer. We propose a solution by deriving equations for the statistical moments of the microbiome composition (even as many as the number of parameters) to make better use of the information contained in the data. Supporting this idea, statistical moments have improved inference from molecular [12] and cancer data [13]. Browning and colleagues [9] found that statistical moments improve the structural identifiability of parameters in simulations, which we confirmed for logistic growth and Lotka–Volterra models of absolute and relative abundance [10]. Our approach is driven by a mechanistic spirit, where microscopic rates must be written down first, based on hypothetical mechanisms and stating assumptions. As opposed to approaches where analytic solutions (or expensive stochastic simulations) are needed, here, a numerical solution is sufficient to quantify the distance between equations and data, despite the large number of parameters, microbial types, and population size [17]. This allows our workflow to handle diverse models, where formal model comparison is possible [12,17].
The workflow is not limited by the properties of the microbiome abundance data [18]. As we have shown, analyzing data sets describing the relative abundance of microbial types—even if the total absolute abundance is dynamic [30]—is possible. Such is the nature of metagenomic sequencing data—the most common method to characterize microbiomes [5]. In addition, by tracking statistical moments of the microbiome, our approach naturally accounts for the diverse types of experimental samplings, such as those where ensembles of hosts are used to obtain a single time series. Concretely, compared to other methods, we track the demographic variation between hosts explicitly and assign the remaining variation to external “environmental” noise. Measurement noise can be incorporated using knowledge about its distribution, including: shape, dynamics, and how each microbial type is affected [9,10,13]. Missing this information, we did not consider a noise model for the OMM12 data set. However, we showed that parameter uncertainty from simulated noisy data could be reduced by increasing the number of time points first, or the number of replicates second. Still, other considerations including the interval between time points and capturing the transient dynamics could be important to overcome the effect of noise in empirical data. For example, having more replicates might be more beneficial for nearly steady-state dynamics.
Our workflow assumes that samples originate from the same environmental and initial conditions. To date, such replicates are more easily obtained in laboratory conditions [28]. However, antibiotics, microbiome transplantations, or other perturbation treatments could be explored as means to force the generation of replicates. Alternatively, models of higher taxonomic levels, where microbial compositions are more similar [24], could be written.
Although by design, our workflow deals with common longitudinal (time series)—even sparse—data, analyzing a single time point (snapshot) is in principle possible. For example, if the microbiome composition is assumed to be at a steady state, the inference method’s aim is to find parameters making the dynamical equations for the moments equal to zero. This does not mean that the moments are zero, but that their rate of change is. This differs from quasi-steady data, which is common in microbiome studies [31,32], but less informative than non-steady data. However, single time points are not expected to be as effective as multiple time points. As our results illustrate, given the various sources of uncertainty, non-steady data leads to better parameter inference, in particular, those time intervals of “high activity” where many compositional changes occur [5]. As Cao and colleagues [5] proposed, several of these intervals could be analyzed simultaneously to improve the inference.
Bayesian inference can suffer from the curse of dimensionality in large and diverse systems [17]. By combining statistical moments readily solved numerically and data of sufficient quality, we believe our workflow can overcome this to some extent, exploring the parameter space in a feasible time. Physical and biological constraints can focus the parameter exploration further [33]. We implemented an Approximate Bayesian Computation with Sequential Monte Carlo in our workflow using tools from the Python package pyABC [34] (Table 6). Other optimizations, or combinations with methods such as Markov Chain Monte Carlo [16], could greatly improve its wider application [14]. As proof-of-principle, we applied our workflow to 2 simulated relative abundance datasets and recovered the true parameter values. We also applied it to a reduced microbiome in mice [28], where we estimated values and certainties of parameters describing logistic growth, a quantitative characterization of the microbes in situ. Our contribution builds towards the aim of developing tools to enable the widespread use of parameter inference in microbiome studies, where large progress has been made (see, e.g., [6,7]).
Although Lotka–Volterra and logistic growth are meaningful ecological models to investigate first, other alternatives can be tested as well. For example, a model of logistic growth with linear environmental noise, different from our demographic-noise-only model, suggests that environmental perturbations determine many properties of the microbiome composition [31,32]. Despite only considering time-independent pairwise interactions, our workflow can incorporate higher-order interactions [35], as well as time-dependent [36] or time-delayed interactions [37] in the transition rates. Even multilayer networks [38], i.e., assorted interactions, can be modeled as we illustrate separating positive and negative interactions in a Lotka–Volterra model. Similarly to the model comparison between Lotka–Volterra and logistic growth in our results, contrasting alternative models could point to the underlying mechanisms operating in microbiomes.
In summary, we presented a Bayesian inference workflow bridging microbiome data to theoretical modeling. We used the ability of stochastic models to track statistical quantities beyond mean behaviors [9,11–13], enabling us to exploit useful information contained in dynamical data. This workflow can be facilitated by existing automated software to derive statistical moments from dynamical models [39]. By inferring from data sets of microbial absolute and relative abundances, we showed its robustness—identifying likely interactions and certainty of parameters in simulated and empirical data. Because mechanistic rates serve as stepping stones of the workflow, similar microscopic models could replace the 2 classical ecological models that we illustrated—including experimentally informed models.
Methods
Derivation of dynamical equations for the microbiome moments
To track the statistical moments of a model, e.g., averages, variances, and covariances, we have to account for the stochasticity of events. Thus, we need to describe the probability of each microbiome composition. The change in probability of each microbiome composition is described by the master equation,
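\[
\frac{\partial P(\mathbf{n},t)}{\partial t} = \sum_i \Big[ P(\mathbf{n}-\mathbf{e}_i,t)\,T(\mathbf{n}-\mathbf{e}_i\to\mathbf{n}) + P(\mathbf{n}+\mathbf{e}_i,t)\,T(\mathbf{n}+\mathbf{e}_i\to\mathbf{n}) - P(\mathbf{n},t)\,T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_i) - P(\mathbf{n},t)\,T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_i) \Big] \tag{6}
\]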
where n is the vector of absolute microbial abundances, and ei is the amount of change, a vector with one in the i-th entry and zero otherwise.
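Dynamical equations for the statistical moments can be obtained from the master equation by multiplication and subsequent summation; e.g., for the first moment 〈nk〉, equivalent to the average, we have
\[
\frac{d\langle n_k\rangle}{dt} = \sum_{n_1=0}^{\infty}\sum_{n_2=0}^{\infty}\cdots\; n_k\,\frac{\partial P(\mathbf{n},t)}{\partial t} \tag{7}
\]
where, for clarity, we make the summations explicit. For the second moment 〈nk²〉, we have
\[
\frac{d\langle n_k^2\rangle}{dt} = \sum_{n_1=0}^{\infty}\sum_{n_2=0}^{\infty}\cdots\; n_k^2\,\frac{\partial P(\mathbf{n},t)}{\partial t} \tag{8}
\]
and for the co-moments 〈nknl〉,
\[
\frac{d\langle n_k n_l\rangle}{dt} = \sum_{n_1=0}^{\infty}\sum_{n_2=0}^{\infty}\cdots\; n_k n_l\,\frac{\partial P(\mathbf{n},t)}{\partial t} \tag{9}
\]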
For models with a finite carrying capacity, the upper sum limit is changed to a finite number.
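The moments tracked by these equations have direct empirical counterparts when several replicate hosts are sampled at the same time point. The sketch below computes the first moments and the matrix of second moments and co-moments from such replicates. The function name `empirical_moments` and the example counts are hypothetical and only serve as illustration.

```python
def empirical_moments(replicates):
    """Return (<n_k>, <n_k n_l>) estimated across replicate hosts.

    replicates: list of abundance vectors, one vector per replicate host,
    all sampled at the same time point.
    """
    n_hosts = len(replicates)
    n_types = len(replicates[0])
    # First moments: average abundance of each microbial type.
    first = [sum(host[k] for host in replicates) / n_hosts
             for k in range(n_types)]
    # Raw second moments (diagonal) and co-moments (off-diagonal).
    second = [[sum(host[k] * host[l] for host in replicates) / n_hosts
               for l in range(n_types)]
              for k in range(n_types)]
    return first, second


# Three hypothetical replicate hosts with two microbial types each.
hosts = [[2, 4], [4, 0], [6, 2]]
first, second = empirical_moments(hosts)
```

Note that these are raw moments, matching 〈nk〉, 〈nk²〉, and 〈nknl〉, rather than variances and covariances.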
Logistic growth with immigration and death
Let us exemplify the preceding steps with a logistic growth model. Similarly to Allouche and Kadmon [40], let us define the microscopic transition rates for one microbial population i,
(10a)
(10b)
where N is the maximum number of microbes in a host (shared carrying capacity), fi is the maximum growth rate, and ϕi and mi are the death and immigration rates for each type i. We assume small death rates ϕi (Table 1), following the typical logistic growth concept, where only birth occurs. However, close to ∑j nj ≈ N, death also occurs. In this limit, the model resembles a death-birth process in which the microbial abundances (n) slowly move towards an equilibrium that is influenced less by the initial abundances and more by the rates of birth, death, and immigration [41].
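A stochastic model of this kind can be simulated with the Gillespie algorithm. The sketch below is an illustration only: the transition rates in the code are an assumed form that is consistent with the description above (birth and immigration slowed down by the shared carrying capacity, death proportional to abundance) and are not guaranteed to match Eqs (10) exactly. The function name and all parameter values are hypothetical.

```python
import random


def gillespie_logistic(n0, f, m, phi, N, t_end, rng):
    """Simulate one host until t_end and return the final abundances.

    ASSUMED rates (illustrative, not necessarily those of Eqs (10)):
      gain of type i: (f[i] * n[i] + m[i]) * (1 - sum(n) / N)
      loss of type i: phi[i] * n[i]
    """
    n = list(n0)
    t = 0.0
    s = len(n)
    while True:
        free = 1.0 - sum(n) / N  # fraction of unoccupied capacity
        gains = [(f[i] * n[i] + m[i]) * free for i in range(s)]
        losses = [phi[i] * n[i] for i in range(s)]
        rates = gains + losses
        total = sum(rates)
        if total <= 0.0:
            return n
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            return n
        # Choose which event happens, proportionally to its rate.
        u = rng.random() * total
        acc = 0.0
        for idx, rate in enumerate(rates):
            acc += rate
            if u < acc:
                break
        if idx < s:
            n[idx] += 1       # birth or immigration of type idx
        else:
            n[idx - s] -= 1   # death of type idx - s


rng = random.Random(7)
N = 200
finals = [gillespie_logistic([5, 5], f=[1.0, 0.8], m=[0.1, 0.1],
                             phi=[0.01, 0.01], N=N, t_end=30.0, rng=rng)
          for _ in range(20)]  # 20 replicate hosts
```

Repeating the simulation over replicate hosts, as in the last lines, yields samples from which statistical moments can be estimated and compared with the moment equations.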
Now, we illustrate how to derive dynamical equations for the moments. Let us start with the first moment,
(11)
where the first 4 lines describe birth or death of a microbe of type k and the last 4 lines describe birth or death of a microbe of type i different from k. Note that, by definition, compositions beyond the boundaries (ni > N or ni < 0) have zero probability, so the summation indices of the corresponding terms go up to ni = N − 1, or start from ni = 1, respectively.
After appropriate transformations of variables to only deal with P(n,t) and re-indexing, we obtain
(12)
Note that the last 4 terms reduce to zero, and that at the boundaries T(n→n−ek) = 0 for nk = 0 and T(n→n+ek) = 0 for nk = N, which allows including nk = 0 and nk = N in the summations. Simplifying, we find
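\[
\frac{d\langle n_k\rangle}{dt} = \big\langle T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_k)\big\rangle - \big\langle T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_k)\big\rangle \tag{13}
\]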
and substituting the transition rates T(n→n+ei) and T(n→n−ei) from Eqs (10) leads to
(14)
For other moments and models similar derivations can be done.
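The re-indexing used in the derivation of the first moment can be illustrated on the terms that involve type k. The lines below are a sketch in the notation of the master equation, assuming that compositions outside the allowed range have zero probability so that boundary terms drop out. They are meant as a reading aid, not as a reproduction of Eqs (11) and (12).

```latex
\begin{align*}
\sum_{\mathbf{n}} n_k\, P(\mathbf{n}-\mathbf{e}_k,t)\, T(\mathbf{n}-\mathbf{e}_k \to \mathbf{n})
  &= \sum_{\mathbf{n}} (n_k+1)\, P(\mathbf{n},t)\, T(\mathbf{n} \to \mathbf{n}+\mathbf{e}_k)
  && \text{(shift } n_k \to n_k+1\text{)} \\
\sum_{\mathbf{n}} n_k\, P(\mathbf{n}+\mathbf{e}_k,t)\, T(\mathbf{n}+\mathbf{e}_k \to \mathbf{n})
  &= \sum_{\mathbf{n}} (n_k-1)\, P(\mathbf{n},t)\, T(\mathbf{n} \to \mathbf{n}-\mathbf{e}_k)
  && \text{(shift } n_k \to n_k-1\text{)}
\end{align*}
% Subtracting the loss terms n_k P(n,t) T(n -> n +/- e_k) leaves the factors
% (n_k + 1) - n_k = +1 and (n_k - 1) - n_k = -1:
\begin{equation*}
\frac{d\langle n_k\rangle}{dt}
  = \sum_{\mathbf{n}} \big[\, T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_k) - T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_k) \,\big]\, P(\mathbf{n},t)
\end{equation*}
```

For events of a type i different from k, the shift acts on ni and leaves the factor nk unchanged, so gain and loss terms cancel exactly. The same shifts applied to nk² and nknl produce the factors (2nk + 1), (−2nk + 1), and nl or nk that appear in the equations for the second moment and the co-moments.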
For the second moment, we find
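\[
\frac{d\langle n_k^2\rangle}{dt} = \big\langle (2n_k+1)\,T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_k)\big\rangle - \big\langle (2n_k-1)\,T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_k)\big\rangle \tag{15}
\]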
which after substituting T(n→n+ei) and T(n→n−ei) from Eqs (10) reduces to
(16)
For the co-moments, we find
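\[
\frac{d\langle n_k n_l\rangle}{dt} = \big\langle n_l\,[\,T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_k) - T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_k)\,]\big\rangle + \big\langle n_k\,[\,T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_l) - T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_l)\,]\big\rangle, \quad k \neq l \tag{17}
\]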
which after substituting T(n→n+ei) and T(n→n−ei) from Eqs (10) leads to
(18)
Because each equation depends on even higher moments, e.g., d〈nknl〉/dt depends on 〈nknlnj〉, it is not possible to solve this system of equations without additional assumptions. However, one can find approximate expressions in which lower moments replace higher moments. For example, 〈nk³〉 and 〈nknlnj〉 are approximated as functions of the lower moments 〈nk²〉, 〈nknl〉, and 〈nj〉. This technique, called moment closure approximation, leads to a closed system of ODEs, and we use it in our approach. Various approximations stemming from numerical observations, physical considerations, or heuristics have been used with great success [42] and are available in automated software tools (e.g., MomentClosure.jl [39]). Kuehn [42] reviews this technique thoroughly. Here, we illustrate the approximation of third-order moments as the product of one co-moment and one moment, 〈nknlnj〉≈〈nknl〉〈nj〉. The key to our approximation is that the covariance between a pair of microbes and a single microbe is close to zero, 〈nknl,nj〉≈0, with 3 possible ways to distribute k, l, and j. We used 〈nknlnj〉≈〈nknl〉〈nj〉 and 〈nk³〉≈〈nk²〉〈nk〉 throughout. The validity of our approximation can be tested by checking the covariances 〈nknl,nj〉 of the experimental data set, choosing a different approximation otherwise [42].
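The covariance check mentioned above can be sketched as follows for replicate hosts sampled at one time point. The function name `closure_covariance` and the example counts are hypothetical. The code simply evaluates the exact identity 〈nknlnj〉 = 〈nknl〉〈nj〉 + 〈nknl,nj〉, so a covariance that is small relative to the closed value indicates that the closure is reasonable for those data.

```python
def mean(values):
    return sum(values) / len(values)


def closure_covariance(replicates, k, l, j):
    """Return (covariance, third_moment, closed_value) for indices k, l, j.

    covariance   = <n_k n_l n_j> - <n_k n_l><n_j>
    closed_value = <n_k n_l><n_j>, the moment closure approximation.
    """
    third = mean([host[k] * host[l] * host[j] for host in replicates])
    pair = mean([host[k] * host[l] for host in replicates])
    single = mean([host[j] for host in replicates])
    closed = pair * single
    return third - closed, third, closed


# Three hypothetical replicate hosts with two microbial types each.
hosts = [[2, 4], [4, 0], [6, 2]]
cov, third, closed = closure_covariance(hosts, k=0, l=0, j=1)
relative_mismatch = abs(cov) / closed
```

With only three small hosts the mismatch is large, as expected. The approximation is intended for data in which abundances are large and replicates are numerous.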
Lotka–Volterra
Now, for a model with intra- and inter-specific interactions, let us define the transition rates,
(19a)
(19b)
where A and B are matrices with non-negative entries containing the interactions, satisfying Ai,j = 0 if Bi,j > 0, and Bi,j = 0 if Ai,j > 0. Ecologically, interactions in A promote growth, while those in B lead to death. Note that interactions (i,j and j,i) can be asymmetrical. Finally, fi is the intrinsic growth rate.
For the first moment, similarly to Eq (13), we have
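\[
\frac{d\langle n_k\rangle}{dt} = \big\langle T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_k)\big\rangle - \big\langle T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_k)\big\rangle \tag{20}
\]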
which after substituting T(n→n+ei) and T(n→n−ei) from Eqs (19),
(21)
takes the form of the conventional, deterministic Lotka–Volterra equations for the abundance with growth rate fk and interaction matrix Ak,j−Bk,j.
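The structure of the first-moment equation can be sketched numerically. The code below assumes that the gain rate of type k is nk(fk + ∑j Ak,j nj) and that its loss rate is nk ∑j Bk,j nj. This is one form consistent with the description of A, B, and f above, and it is stated here as an assumption rather than as the exact content of Eqs (19). Under this assumption, the change of 〈nk〉 combines fk〈nk〉 with the co-moments 〈nknj〉 weighted by Ak,j − Bk,j. The function name and all numbers are hypothetical.

```python
def lv_first_moment_rhs(first, second, f, A, B):
    """Right-hand side of d<n_k>/dt under the ASSUMED Lotka-Volterra rates.

    first[k]     : <n_k>
    second[k][j] : <n_k n_j>
    Returns f_k <n_k> + sum_j (A_kj - B_kj) <n_k n_j> for every k.
    """
    s = len(first)
    return [f[k] * first[k]
            + sum((A[k][j] - B[k][j]) * second[k][j] for j in range(s))
            for k in range(s)]


f = [1.0, 0.5]
A = [[0.0, 0.01], [0.0, 0.0]]    # growth-promoting interactions
B = [[0.02, 0.0], [0.03, 0.01]]  # death-inducing interactions
first = [10.0, 20.0]
second = [[110.0, 190.0], [190.0, 420.0]]
rhs = lv_first_moment_rhs(first, second, f, A, B)

# Deterministic limit: no fluctuations, so <n_k n_j> = <n_k><n_j>.
factorized = [[a * b for b in first] for a in first]
rhs_deterministic = lv_first_moment_rhs(first, factorized, f, A, B)
```

When the co-moments factorize, as in the last two lines, the expression reduces to the deterministic Lotka–Volterra equations. The difference between `rhs` and `rhs_deterministic` is the contribution of fluctuations.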
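For the second moment, similarly to Eq (15), we have
\[
\frac{d\langle n_k^2\rangle}{dt} = \big\langle (2n_k+1)\,T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_k)\big\rangle - \big\langle (2n_k-1)\,T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_k)\big\rangle \tag{22}
\]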
which after substituting T(n→n+ei) and T(n→n−ei) from Eqs (19) leads to
(23)
For the co-moments, similarly to Eq (17), we derive
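\[
\frac{d\langle n_k n_l\rangle}{dt} = \big\langle n_l\,[\,T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_k) - T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_k)\,]\big\rangle + \big\langle n_k\,[\,T(\mathbf{n}\to\mathbf{n}+\mathbf{e}_l) - T(\mathbf{n}\to\mathbf{n}-\mathbf{e}_l)\,]\big\rangle, \quad k \neq l \tag{24}
\]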
which after substituting T(n→n+ei) and T(n→n−ei) from Eqs (19) reduces to
(25)
As previously, a moment closure approximation is required to solve the system of equations. We used 〈nknlnj〉≈〈nknl〉〈nj〉 and 〈nk³〉≈〈nk²〉〈nk〉.
From absolute to relative abundance
The preceding equations account for the change in absolute abundance. To focus on relative abundance data, we define the relative abundance as follows:
(26)
and
(27)
to serve as a scaling factor. Thus,
(28)
Let us find the transformation to relative abundances for the first moment. Using the definition of the covariance, 〈a,b〉 = 〈ab〉 − 〈a〉〈b〉, such that 〈ab〉 = 〈a〉〈b〉 + 〈a,b〉, we have
(29)
Rearranging, the transformation is given by
(30)
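The role of the covariance in this transformation can be checked numerically on replicate hosts. The sketch below computes relative abundances as nk/nΣ, with nΣ the total abundance in each host, and verifies that the mean absolute abundance equals the product of the mean relative abundance and the mean total abundance plus their covariance. The function name and the example counts are hypothetical, and every host is assumed to contain at least one microbe.

```python
def mean(values):
    return sum(values) / len(values)


def first_moment_decomposition(replicates, k):
    """Return (<n_k>, <rel_k><n_total>, covariance(rel_k, n_total))."""
    totals = [sum(host) for host in replicates]               # n_total per host
    rel = [host[k] / t for host, t in zip(replicates, totals)]  # n_k / n_total
    mean_nk = mean([host[k] for host in replicates])
    product = mean(rel) * mean(totals)
    cov = mean([x * t for x, t in zip(rel, totals)]) - product
    return mean_nk, product, cov


# Three hypothetical replicate hosts with two microbial types each.
hosts = [[2, 4], [4, 0], [6, 2]]
mean_nk, product, cov = first_moment_decomposition(hosts, k=0)
```

In this small example the covariance term is about 4% of the mean abundance. It shrinks further when populations are large, which is the regime in which the approximations of this section apply.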
For second-order moments, we use that and approximate
. Then, using the chain rule
(31)
Rearranging, the transformations are given by
(32)
and
(33)
where the differential equation for 〈nΣ〉 is given by
(34)
A close look at the dynamics of the covariances shows their contribution is negligible in large populations. To see this, let us write
(35)
after the appropriate transformations of variable to only deal with P(n,t) and re-indexing, we find
(36)
Note that if , then, if either nk ≫ 1, such that nk±1≈nk or
, the terms from the previous equation simplify, leading to
(37)
Similar arguments lead to the conclusion that
(38)
and
(39)
These approximations of the covariances are sensible in microbiomes, where nΣ,nk≫1 is often the case. Moreover, in the infinite population limit, covariances must be zero.
Putting everything together, the approximate change of the first moment of relative abundance in large populations is given by
(40)
while for the second moments of relative abundance
(41)
and
(42)
Like Joseph and colleagues [30], we see that the second term of each equation serves as a “correction factor” because relative abundances must add up to one at all times. Finally, to solve these equations in terms of relative abundances only, changes of variable such as , etc., are needed all along. These approximations are valid in large populations, where the covariances between relative abundance terms and nΣ are much smaller than the product of their averages.
True parameters in simulations
Tables 1 and 2 contain the parameters used to simulate data.
The growth and death rates as well as the immigration parameters were chosen for illustration only; thus, time units are arbitrary. We used a relatively small population size for simplicity; larger population sizes can be easily tested.
The interaction parameters as well as the growth rates were chosen for illustration only; thus, time units are arbitrary. We used relatively small initial populations for simplicity; larger initial populations can be easily tested.
Inference settings
Tables 3–5 contain the probability priors for the inference of simulated and experimental data. Table 6 contains the settings for the inference of all data.
A combination of uninformative (uniform) and informative (normal) priors was used for illustration. These priors span a wide range of values to test the ability of the inference workflow to find the true parameters in simulations (Table 1). 𝒰(a, b) indicates a uniform distribution in the range from a to b. 𝒩(a, b) indicates a normal distribution with mean a and standard deviation b.
A combination of uninformative (uniform) and informative (normal) priors was used for illustration. These priors span a wide range of values to test the ability of the inference workflow to find the true parameters in simulations (Table 2). 𝒰(a, b) indicates a uniform distribution in the range from a to b. 𝒩(a, b) indicates a normal distribution with mean a and standard deviation b.
Available evidence and back-of-the-envelope calculations (marked by *) were used to propose wide priors. 𝒰(a, b) indicates a uniform distribution in the range from a to b. 𝒩(a, b) indicates a normal distribution with mean a and standard deviation b.
These settings were chosen to decrease the computing time while still robustly minimizing the distance between data and model. We used tools from the Python package pyABC [34], mainly ABCSMC. The maximum number of generations, the minimum mismatch threshold (ε), and the minimum change in ε between generations are all stopping criteria (marked by *). LSODA is a numerical solver capable of selectively adapting to the stiffness of a system of differential equations. NA: not applicable.
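The three stopping criteria can be summarized in a few lines of plain Python. This is a sketch of the logic only, not pyABC's internal implementation; the function name `stopping_reason` and the threshold values are hypothetical, and the values of Table 6 are not reproduced here. To our understanding, pyABC exposes comparable options on `ABCSMC.run` (`max_nr_populations`, `minimum_epsilon`, and `min_eps_diff`).

```python
def stopping_reason(eps_history, max_generations, min_eps, min_eps_change):
    """Return the name of the first stopping criterion met, or None.

    eps_history: mismatch thresholds of the generations completed so far.
    """
    generations = len(eps_history)
    if generations >= max_generations:
        return "max_generations"
    if generations >= 1 and eps_history[-1] <= min_eps:
        return "min_eps"
    if generations >= 2 and eps_history[-2] - eps_history[-1] < min_eps_change:
        return "min_eps_change"
    return None


# Hypothetical thresholds and epsilon trajectories.
settings = dict(max_generations=10, min_eps=0.1, min_eps_change=0.5)
still_running = stopping_reason([5.0, 3.0], **settings)
stalled = stopping_reason([5.0, 3.0, 2.9], **settings)
converged = stopping_reason([5.0, 0.05], **settings)
exhausted = stopping_reason([5.0] * 10, **settings)
```

In this sketch, the inference continues while the function returns `None` and stops as soon as any one of the criteria is met.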
Supporting information
S1 Fig. Effect of 2 alternative distance metrics on the inference outcome of the Lotka–Volterra model.
We inferred all parameters from relative abundance simulated data as shown in Fig 2. However, for the distance metric between model and data, Eq (4), statistical moments were rescaled or not. While for absolute abundance, the second-order moments and co-moments are naturally larger than the first-order moments, for relative abundance data, the opposite is true. Rescaling the moments can modify their importance during the inference process. To test this, we took the square root (of the squared errors) of second-order moments and co-moments for absolute abundance, and of first-order moments for relative abundance data. The posteriors of rescaled and non-rescaled moments largely overlap, with non-rescaled moments (our approach in the other figures) leading to more certainty. The data underlying this figure can be found in https://doi.org/10.5281/zenodo.13958305.
https://doi.org/10.1371/journal.pbio.3002913.s001
(TIF)
S2 Fig. Effect of data measurement noise on the uncertainty of inferred Lotka–Volterra parameters.
We inferred all parameters from simulated data as shown in Fig 2. To show the effect of noise on all parameters, we computed the L-2 norm of relative errors of the parameters (Table 2). We simplified the nuances of empirical noise assuming a scenario where all microbial abundances are affected proportionally. Concretely, a uniform noise distribution was shared among all microbial types and constant through time. For low noise, data could be altered by up to ±5%, while for medium and high noise, by up to ±10% and ±20%. Noise was sampled independently for each microbial type at each time point, affecting their absolute abundance from which relative abundances were computed. The data underlying this figure can be found in https://doi.org/10.5281/zenodo.13958305.
https://doi.org/10.1371/journal.pbio.3002913.s002
(TIF)
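The noise scenario described for S2 Fig and the error measure used there can be sketched as follows. The function names, the seed, and the example numbers are hypothetical. The noise level is the maximal proportional change, e.g., 0.05 for low noise, and it is drawn independently for each microbial type from a uniform distribution before relative abundances are computed.

```python
import math
import random


def add_uniform_noise(abundances, level, rng):
    """Perturb each absolute abundance by a factor in [1 - level, 1 + level]."""
    return [n * (1.0 + rng.uniform(-level, level)) for n in abundances]


def to_relative(abundances):
    """Convert absolute abundances to relative abundances that sum to one."""
    total = sum(abundances)
    return [n / total for n in abundances]


def relative_error_norm(inferred, true):
    """L2 norm of the relative errors of the inferred parameters."""
    return math.sqrt(sum(((i - t) / t) ** 2 for i, t in zip(inferred, true)))


rng = random.Random(3)
clean = [120.0, 60.0, 20.0]  # hypothetical absolute abundances at one time point
noisy = add_uniform_noise(clean, level=0.05, rng=rng)  # "low noise" scenario
noisy_relative = to_relative(noisy)

error = relative_error_norm(inferred=[1.1, 1.8], true=[1.0, 2.0])
```

Applying `add_uniform_noise` separately at every time point reproduces the assumption that noise is independent across time points while its distribution stays constant.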
Acknowledgments
We thank the Theoretical Biology Department in the MPI Plön, the Collaborative Research Centre 1182: Origin and Function of Metaorganisms and Wenying Shou for fruitful discussions.
References
- 1. Esser D, Lange J, Marinos G, Sieber M, Best L, Prasse D, et al. Functions of the microbiota for the physiology of animal metaorganisms. J Innate Immun. 2019;11(5):393–404. pmid:30566939
- 2. Gilbert JA, Blaser MJ, Caporaso JG, Jansson JK, Lynch SV, Knight R. Current understanding of the human microbiome. Nat Med. 2018;24(4):392–400. pmid:29634682
- 3. Fischbach MA. Microbiome: focus on causation and mechanism. Cell. 2018;174(4):785–790. pmid:30096310
- 4. Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature. 2017;551(7681):457–463. pmid:29088705
- 5. Cao HT, Gibson TE, Bashan A, Liu YY. Inferring human microbial dynamics from temporal metagenomics data: Pitfalls and lessons. Bioessays. 2017;39(2):1600188. pmid:28000336
- 6. Bucci V, Tzen B, Li N, Simmons M, Tanoue T, Bogart E, et al. MDSINE: Microbial Dynamical Systems INference Engine for microbiome time-series analyses. Genome Biol. 2016;17:1–17.
- 7. Gibson TE, Kim Y, Acharya S, Kaplan DE, DiBenedetto N, Lavin R, et al. Microbial dynamics inference at ecosystem-scale. bioRxiv. 2021:12.
- 8. Stein RR, Bucci V, Toussaint NC, Buffie CG, Rätsch G, Pamer EG, et al. Ecological modeling from time-series inference: insight into dynamics and stability of intestinal microbiota. PLoS Comput Biol. 2013;9(12):e1003388. pmid:24348232
- 9. Browning AP, Warne DJ, Burrage K, Baker RE, Simpson MJ. Identifiability analysis for stochastic differential equation models in systems biology. J R Soc Interface. 2020;17(173):20200652. pmid:33323054
- 10. Remien CH, Eckwright MJ, Ridenhour BJ. Structural identifiability of the generalized Lotka–Volterra model for microbiome studies. R Soc Open Sci. 2021;8(7):201378. pmid:34295510
- 11. Pieschner S, Hasenauer J, Fuchs C. Identifiability analysis for models of the translation kinetics after mRNA transfection. J Math Biol. 2022;84(7):56. pmid:35577967
- 12. Fröhlich F, Thomas P, Kazeroonian A, Theis FJ, Grima R, Hasenauer J. Inference for stochastic chemical kinetics using moment equations and system size expansion. PLoS Comput Biol. 2016;12(7):e1005030. pmid:27447730
- 13. Johnson KE, Howard G, Mo W, Strasser MK, Lima EA, Huang S, et al. Cancer cell population growth kinetics at low densities deviate from the exponential growth model and suggest an Allee effect. PLoS Biol. 2019;17(8):e3000399. pmid:31381560
- 14. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MP. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface. 2009;6(31):187–202. pmid:19205079
- 15. Gardiner CW. Handbook of Stochastic Methods. 3rd ed. Berlin: Springer, NY; 2004.
- 16. Valderrama-Bahamóndez GI, Fröhlich H. MCMC techniques for parameter estimation of ODE based models in systems biology. Front Appl Math Stat. 2019;5:55.
- 17. Sunnåker M, Busetto AG, Numminen E, Corander J, Foll M, Dessimoz C. Approximate bayesian computation. PLoS Comput Biol. 2013;9(1):e1002803. pmid:23341757
- 18. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front Microbiol. 2017;8:2224. pmid:29187837
- 19. Wein T, Dagan T, Fraune S, Bosch TC, Reusch TB, Hülter NF. Carrying capacity and colonization dynamics of Curvibacter in the Hydra host habitat. Front Microbiol. 2018;9:443. pmid:29593687
- 20. Galazzo G, Van Best N, Benedikter BJ, Janssen K, Bervoets L, Driessen C, et al. How to count our microbes? The effect of different quantitative microbiome profiling approaches. Front Cell Infect Microbiol. 2020;10:403. pmid:32850498
- 21. Wang Y, Chen S, Hu J, Zhou D. Determining cell population size from cell fraction in cell plasticity models. arXiv. 2024.
- 22. Petersen C, Dierking K, Johnke J, Schulenburg H. Isolation and Characterization of the Natural Microbiota of the Model Nematode Caenorhabditis elegans. J Vis Exp. 2022;186:e64249. pmid:36063004
- 23. Franzenburg S, Fraune S, Altrock PM, Kuenzel S, Baines JF, Traulsen A, et al. Bacterial colonization of Hydra hatchlings follows a robust temporal pattern. ISME J. 2013;7:781–790. pmid:23344242
- 24. Rausch P, Rühlemann M, Hermes BM, Doms S, Dagan T, Dierking K, et al. Comparative analysis of amplicon and metagenomic sequencing methods reveals key features in the evolution of animal metaorganisms. Microbiome. 2019;7:1–19.
- 25. Chiş O, Banga JR, Balsa-Canto E. GenSSI: a software toolbox for structural identifiability analysis of biological models. Bioinformatics. 2011;27(18):2610–2611. pmid:21784792
- 26. Chis OT, Banga JR, Balsa-Canto E. Structural identifiability of systems biology models: a critical comparison of methods. PLoS ONE. 2011;6(11):e27755. pmid:22132135
- 27. Walter E, Lecourtier Y. Global approaches to identifiability testing for linear and nonlinear state space models. Math Comput Simul. 1982;24(6):472–482.
- 28. Eberl C, Ring D, Münch PC, Beutler M, Basic M, Slack EC, et al. Reproducible colonization of germ-free mice with the oligo-mouse-microbiota in different animal facilities. Front Microbiol. 2020;10:2999. pmid:31998276
- 29. Sieber M, Pita L, Weiland-Brauer N, Dirksen P, Wang J, Mortzfeld B, et al. Neutrality in the Metaorganism. PLoS Biol. 2019;17(6):e3000298. pmid:31216282
- 30. Joseph TA, Shenhav L, Xavier JB, Halperin E, Pe’er I. Compositional Lotka-Volterra describes microbial dynamics in the simplex. PLoS Comput Biol. 2020;16(5):e1007917. pmid:32469867
- 31. Grilli J. Macroecological laws describe variation and diversity in microbial communities. Nat Commun. 2020;11(1):1–11.
- 32. Descheemaeker L, De Buyl S. Stochastic logistic models reproduce experimental time series of microbial communities. Elife. 2020;9:e55650. pmid:32687052
- 33. Gellner G, McCann K, Hastings A. Stable diverse food webs become more common when interactions are more biologically constrained. Proc Natl Acad Sci U S A. 2023;120(31):e2212061120. pmid:37487080
- 34. Schälte Y, Klinger E, Alamoudi E, Hasenauer J. pyABC: Efficient and robust easy-to-use approximate Bayesian computation. J Open Source Softw. 2022;7(74):4304.
- 35. Grilli J, Barabás G, Michalska-Smith MJ, Allesina S. Higher-order interactions stabilize dynamics in competitive network models. Nature. 2017;548(7666):210–213. pmid:28746307
- 36. Li A, Cornelius SP, Liu YY, Wang L, Barabási AL. The fundamental advantages of temporal networks. Science. 2017;358(6366):1042–1046. pmid:29170233
- 37. Yang Y, Foster KR, Coyte KZ, Li A. Time delays modulate the stability of complex ecosystems. Nat Ecol Evol. 2023;7(10):1610–1619. pmid:37592022
- 38. Wang Y, Yang Y, Li A, Wang L. Stability of multi-layer ecosystems. J R Soc Interface. 2023;20(199):20220752.
- 39. Sukys A, Grima R. MomentClosure.jl: automated moment closure approximations in Julia. Bioinformatics. 2022;38(1):289–290.
- 40. Allouche O, Kadmon R. A general framework for neutral models of community dynamics. Ecol Lett. 2009;12(12):1287–1297. pmid:19845727
- 41. Zapién-Campos R, Sieber M, Traulsen A. The effect of microbial selection on the occurrence-abundance patterns of microbiomes. J R Soc Interface. 2022;19(187):20210717. pmid:35135298
- 42. Kuehn C. Moment closure–a brief review. Control of self-organizing nonlinear systems. 2016:253–271.
- 43. Gibson B, Wilson DJ, Feil E, Eyre-Walker A. The distribution of bacterial doubling times in the wild. Proc Biol Sci. 2018;285(1880):20180789. pmid:29899074