mbtransfer: Microbiome intervention analysis using transfer functions and mirror statistics

Time series studies of microbiome interventions provide valuable data about microbial ecosystem structure. Unfortunately, existing models of microbial community dynamics have limited temporal memory and expressivity, relying on Markov or linearity assumptions. To address this, we introduce a new class of models based on transfer functions. These models learn impulse responses, capturing the potentially delayed effects of environmental changes on the microbial community. This allows us to simulate trajectories under hypothetical interventions and select significantly perturbed taxa with False Discovery Rate guarantees. Through simulations, we show that our approach effectively reduces forecasting errors compared to strong baselines and accurately pinpoints taxa of interest. Our case studies highlight the interpretability of the resulting differential response trajectories. An R package, mbtransfer, and notebooks to replicate the simulation and case studies are provided.


Introduction
Figure 1 gives three examples of microbial community dynamics under environmental perturbations.Part (a) shows the shift in the gut microbiome of a subject from David and others (2013) during a five-day shift to an animal-based diet.Part (b) shows the postpartum change in the vaginal microbiome from one participant tracked by Costello and others (2022).Part (c) gives the dynamics of an aquaculture microbiome in a tank following shifts in environmental pH, as examined by Yajima and others (2022).The shifts in these few cases represent general phenomena -the interventions they describe have reproducible effects on the microbiome, consistently altering the abundance of specific taxa on a predictable time scale.Similar microbial community studies are widespread in microbiome research efforts, especially those with the long-term goal of engineering microbial systems that promote health in a dynamic environment.
Statistical inference of intervention effects on the microbiome must account for temporal dependenceotherwise, there is a risk of overinflating the effective sample size.We will see in Section 3 that, though the practice of two-sample testing of intervention effects is common, it leads to inflated false discovery rates.Several microbial community dynamics models have been proposed in response to this issue.Among the most widely used is the generalized Lotka-Volterra (gLV) model, which discretizes an ordinary differential equation model for competitive predator-prey dynamics, optionally including covariates to model environmental influences (Gerber, 2014;Gibbons and others, 2017).Specifically, let y () be the microbial community profile at time , and let w () be the state of the associated intervention.Then, the gLV supposes y()  = y () + w () +  (), and it is typically estimated by first log-transforming the observed taxonomic abundances log (1 + y  ) and fitting an elastic net regression of log (1 + y +1 ) − log (1 + y  ) onto w  .The main limitations of this model are (1) that it assumes linearity in the relationship of log (1 + y +1 )−log (1 + y  ) onto w  and (2) it can only refer to the immediate past of and w  .Moreover, it does not quantify the uncertainty or stability of any estimated effects.
To address this, Bucci and others (2016); Gibson and others (2021) and Silverman andothers (2018, 2022) developed explicit probabilistic models of community dynamics.Silverman and others (2022)'s models, MALLARD and fido, are based on a multinomial logistic-normal autoregressive and multinomial logisticnormal Gaussian Process model, respectively.The environmental shifts can be included as covariates to support intervention analysis.MDSINE and MDSINE2 (Bucci and others, 2016;Gibson and others, 2021) are negative binomial dynamical systems models that extend the gLV and focus on the discovery of interspecies interactions and perturbation effects.Autoregressive dynamics are clustered using a Dirichlet Process, and a Gaussian Process prior is used to regularize species abundance trajectories.These models are closely related to our work in their application of a dynamical systems model and inference of environmental intervention effects.We provide a quantitative comparison in Section 3, and Supplementary Section 7.2 summarizes existing methods.
We make two contributions to the tapestry of currently available models.First, we demonstrate that nonparametric transfer function models lead to more accurate forecasts than current models, especially when environmental shifts are large.We leverage an existing gradient boosting package (Chen and others, 2018) and achieve competitive performance, most likely due to their relatively weak modeling assumptions and our data-rich setting.Second, we provide an algorithm to support the precise inference of intervention effects on individual taxa at specific temporal lags.Decoupling community dynamical modeling from inference makes our interpretations of environmental effects more robust to model misspecification.Section 2 explains how to guarantee False Discovery Rate (FDR) control of the selected taxa using only a symmetry assumption.
The key ingredients of our approach are transfer function models (Box and Tiao, 1975), which summarize community dynamics, and mirror statistics (Dai and others, 2020), which enable precise inference.Transfer functions relate an "input" series to an "output" one.These models were originally developed to support intervention analysis in time series data, for example, the influence of a new automobile emissions regulation on local ozone levels.Section 2 adapts this framework to the high-dimensional microbiome setting.Mirror statistics are an approach to selective inference that leverages data splitting to rank differential effects while controlling the FDR, and we develop an instance of this algorithm using partial dependence profiles of the fitted boosting models.This approach is analogous to recent microbiome approaches based on knockoffs (Xie and Lederer, 2021;Zhu and others, 2021), but it does not depend on the simulation of appropriate knockoff features.
Taken together, transfer function models and mirror-based inference provide answers to the following central questions in microbiome intervention analysis: 1. Which taxa are the most affected by the intervention?Our mirror statistics identify taxa with differential trajectories across counterfactual environmental conditions.
2. When are these taxa affected?We can distinguish between transient and sustained shifts in the community by simulating alternative counterfactuals from our fitted transfer function models.

Which factors mediate the shift? Flexible transfer function models can detect interactions between interventions and environmental features.
The mbtransfer R package computes artifacts directly related to these questions.Specifically, it supports training transfer function models, testing for significant taxon-level effects, and simulation under counterfactual alternatives.The package's implementation and documentation, including the code to reproduce the data analysis of Section 4, can be found at https://go.wisc.edu/crj6k6.

Method
Section 2.1 discusses a flexible generalization of transfer function models (Box and Tiao, 1975).Here, the input series is a measure of the intervention strength.The resulting model can summarize and simulate intervention effects on microbiome communities, accounting for baseline composition and mediating host features.Section 2.2 develops mirror statistics (Dai and others, 2020) to formally infer which taxa are the most strongly influenced by the interventions and whether the effects are immediate or delayed.The overall workflow supports statistically-guaranteed discovery of intervention effects in microbial time series.

Transfer function models
Transfer function models were introduced as a linear autoregressive model applied to two concurrent time series, a series   ∈ R that serves as the intervention and a series   ∈ R that changes in response.We consider the generalization, where we have used the following notation: • z () ∈ R  : The characteristics of subject  that do not vary over time.
•  ()  : Random error in the abundance of taxon  for timepoint  in subject .In Section 2.2, this noise is assumed symmetric.
We learn each   separately using gradient boosting models (Friedman, 2001;Chen and Guestrin, 2016).For training, we extract nonoverlapping temporal segments, and the last  lags of y where we used the convention that ŷ ′ = y  ′ if  ′ ≤  is observed and ŷ ′ = f+ℎ ′ ŷ(−+ℎ ′ ):(+ℎ ′ −1) , w(−+ℎ ′ +1):(+ℎ ′ ) , z for intermediate predictions The two advantages of this formulation are: (1) it can estimate nonlinear relationships between past microbial community profiles, interventions, and host features with taxon 's current abundance, and (2) it can detect interaction effects between covariates that improve predictive power and which may have valuable scientific interpretations.Note that each taxon  is trained separately.On the one hand, this means that information is not shared between related taxa.On the other, this allows us to use existing, reliable boosting implementations, and if many taxa are of interest, the implementation is easily parallelizable.

Mirror statistics
The transfer function model in equation (1) summarizes the effects of interventions w  on taxonomic abundances y  .However, this model may not provide statistical guarantees and can lead to ambiguous results.To address this, we propose a mirror statistics implementation.First, we randomly split subjects into subsets,  (1) and  (2) .Then, for each split , we train models f() .Next, we estimate the counterfactual difference between interventions w(−+2):(+1) = 1  and w(−+2):(+1) = 0  for each taxon  using: () . ( This equation is a partial dependence profile applied to the fitted model of taxon  (Friedman, 2001;Biecek and Burzykowski, 2021).Note that this definition toggles all  interventions.For an isolated intervention, we can use (0, . . ., 0, 1, 0, . . ., 0) instead of 1  to get the analogous statistic.We also define the corresponding mirror statistic as: which measures the consistency between estimated effects across separate splits.We assume that for taxon  with no true intervention effects, PD is symmetrically distributed around 0. This assumption is plausible because in the absence of an intervention effect on taxon , any differences between ) are due to noise.Given mirror statistics   , we estimate the false discovery proportion using the same procedure of Dai and others (2020), viewing PD ()  as analogous to β() .Specifically, we compute: where the choice of  defines a selection set Ĵ1 for the current pair of splits.Given FDR control level , we choose the largest  such that FDP () ≤ .We aggregate across multiple splits to improve power, following

Simulations
We perform simulation experiments to examine the effectiveness of mbtransfer relative to competitive baselines and to identify data characteristics that lead to better or worse performance.We evaluate both forecasting ability and inferential quality.

Data generating mechanism
We simulate data from a negative binomial vector autoregressive model: Here, ∈ R  , and NB refers to a negative binomial distribution applied coordinate-wise to each taxon  using a mean-dispersion parameterization.  ∈ R × parameterizes the lag- autoregressive dynamics between pairs of taxa and   ∈ R × parameterizes the lag- effect of the  interventions.We have chosen a negative binomial generative mechanism because this distribution has previously been found to fit 16S rRNA gene sequencing data well, especially after correcting for library size differences (Calgaro and others, 2020).  represents an interaction between host characteristics and the intervention, where some taxa may be more strongly affected by an intervention when their host has particular features.
The parameters   ,   , and   are simulated as follows.The first  1 taxa have true intervention effects and the remaining  0 =  −  1 rows of   and   are set to 0. Among the nonnull taxa  ∈  1 , we draw where  encodes the signal strength.Using two intervals ensures that nonnull effects are bounded away from 0. Entries  , are drawn similarly, except entire rows  ,• are set to 0 with an additional probability   .Such rows represent taxa with real intervention effects but no interaction with host characteristics, represented by  () ∼  0,  2  .Finally, we simulate  ∈ R × as a sparsified version of a random, low-rank matrix.Specifically, we first set Ã(0) ∼   where  ∈ R × has entries drawn independently from  0,  2  .Entries of Ã(0) are randomly set to 0 with probability   , yielding Ã(1) , and the result is normalized: We simulate random, one-dimensional interventions  If  start + ℓ > , we truncate the intervention series at .A visualization of the trajectories for null and nonnull taxa is given in Figure 2.For inference, we compare the mirror algorithm to DESeq2 (Love and others, 2014) with the formula ∼ intervention +  () +  () × intervention.DESeq2 is a negative binomial-based generalized linear model originally developed for hypothesis testing in bulk RNA-seq data.Nonetheless, it is often recommended in 16S sequencing analysis and has exhibited strong performance in benchmarks against methods specifically built for 16S gene sequencing data (Callahan and others, 2016;Calgaro and others, 2020).Supplementary Section 7.1 provides further details for examining and reproducing the simulation setup.
Given this simulation mechanism, we vary the following data parameters:

Model settings and metrics
Given these data, we gather performance metrics associated with normalization, forecasting, and inference approaches.For normalization, we consider working with the original, untransformed data, the DESeq2 size-factor normalized data (Love and others, 2014), and the size-factor normalized data followed by an asinh transformation (Callahan and others, 2016;Jeganathan and Holmes, 2021).The latter two transformations account for potentially different library sizes across simulated samples and the fact that negative binomial data can be heavily skewed.For forecasting, we apply MDSINE2, fido with kernel parameters  =  = 0.5 and  =  = 1, and mbtransfer with  =  = 2 and  =  = 4.As mentioned above, MDSINE2 is an alternative to the gLV model designed to account for microbiome community dynamics (Gibson and others, 2021), and fido is a logistic-normal Gaussian Process model (Silverman and others, 2022) (see Supplementary Section 7.2 for details).All the models were provided with host and perturbation-related covariates.For MDSINE2, we forecast by integrating the learned dynamics over future timepoints, setting the initial conditions equal to the current test sample's microbiome community profile2.We consider 3 normalizations × 5 models × 27 datasets, but exclude MDSINE2 on runs with 400 taxa due to consistently long computation times.This results in 378 simulation configurations.We compare the cross-validated forecasting performance of the models across the simulation settings.We divide the 50 simulated subjects into  = 4 folds.For each iteration , models are trained with the y from all subjects except those in the holdout fold.On holdout folds, we reveal all timepoints up to the first intervention  * in the currently held-out subject.The trained models then forecast the community profiles up to a time horizon of  = 5.We provide access to intermediate interventions w +ℎ , but not community compositions y () +ℎ for ℎ >  * .For each iteration , we compute the mean absolute error across lags, holdout subjects, and taxa: () .
We also evaluate inferential quality using false discovery proportions and power.Specifically, for instantaneous effects at lag ℎ = 0, we compute the false discovery proportion and power as: where Ĵ (0) are the taxa flagged as having immediate intervention effects and  1 (0) are the rows of  0 with at least one nonnull effect: ∪  { :  0, ≠ 0}.For delayed effects, we must account both for taxa with nonzero entries of   for  > 0 and also those taxa that, though not directly influenced through   , are indirectly shifted by autoregressive links   with taxa that are affected by the intervention.To this end, we recursively define: where 1  1( ℎ−) ∈ {0, 1}  is an indicator over taxa that are nonnull at lag ℎ − . 0 (ℎ) is defined as the complement of  1 (ℎ).The mirror-selected taxa at delay ℎ are denoted Ĵ (ℎ), and they can be compared with  1 (ℎ) and  0 (ℎ) to define FDP (ℎ) and Power (ℎ).
2We use md2.integrate as discussed in this MDSINE2 documentation.

Results
Figure 4 summarizes cross-validated forecasting performance on DESeq2-asinh transformed data.We also discuss alternative transformations below.Error rates increase with the proportion of nonnull taxa 1− 0 and signal strength .This increase is likely a consequence of the high variance shifts in y  during interventions for these settings.MDSINE2's performance is consistently worse than either fido or mbtransfer's.Figure 3 sheds light on this.It shows prediction error for individual holdout subjects in one of the simulation settings; residual error in other simulation settings is qualitatively similar.We average errors across all taxa and truncate those with a magnitude greater than 50.This figure shows that minor errors in MDSINE2's initial forecast become amplified at larger horizons.This behavior is not universal, but its effects on the subset of subjects where it does appear are strong enough to explain MDSINE2's deterioration in our simulation setup.In retrospect, such behavior is unsurprising -MDSINE2 can only refer to one step in the past, and it must have either exponential growth or decay until the community reaches its carrying capacity.Though this behavior does not affect inferences for taxon-perturbation relationships, which are the main focus of (Gibson and others, 2021), it can limit the usefulness of the forecasts needed to simulate hypothetical trajectories.In contrast, mbtransfer and fido can refer to historical windows, supporting more realistic intervention analysis: The second day of a microbiome intervention does not necessarily have the same consequences as the first day.
When the intervention effect has a smaller magnitude or is limited to fewer taxa, fido and mbtransfer perform comparably.In other cases, mbtransfer is more accurate.We interpret this by noting that, despite its ability to incorporate interventions as covariates, fido's Gaussian Process assumption enforces smoothness in the predicted values.This smoothness prevents the model from capturing the sharp changes in abundance within these simulation settings.Since the simulation's true autoregressive dynamics have  =  = 3, the fitted mbtransfer models are misspecified.Interestingly, the  =  = 2 model slightly outperforms the  =  = 4 model.In this context, the reduction in variance from having a slightly less flexible model outweighs the reduction in bias from modeling larger lags.
Analogous results for alternative transformations are available in Supplementary Figures 10 and 11.When the data are not asinh transformed, the mbtransfer model performs worse than either MDSINE2 or fido.This reversal is consistent with the use of a squared-error loss in the underlying gradient boosting model, which is not adapted to count data.fido should be preferred if data must be modeled on the original scale.However, we note that transformations are often well-justified in microbiome analysis, and an increasing number of formal methods implement them (McKnight and others, 2018;Chen and others, 2018;Jiang and others, 2021).Finally, Supplementary Figure 12 gives the average computation time across folds.MDSINE2 is slower than either fido or mbtransfer.fido and mbtransfer have comparable computation times except when using DESeq2-asinh transformed data.In this setting, fido is noticeably faster, but mbtransfer provides better forecasts.
Figure 5 summarizes inferential performance.When considering longer time horizons, all methods have improved FDR control because more taxa become truly nonnull as instantaneous effects propagate across the community.However, DESeq2 never appears to control the FDR at the prespecified level of  = 0.2.Though DESeq2 assumes a negative binomial generative mechanism across taxa, it treats samples as independent.This misspecification likely contributes to the poor FDR control seen in this simulation.For larger , the mirror statistics may appear conservative, with many FDP ≪ 0.2.These settings often correspond to high power, though, and the signal may have simply become easy to detect.Mirrors control the FDR when the DESeq2-asinh transformation is applied, and the number of taxa is large.This is expected, because the DESeq2-asinh transformation improves forecasts, and the need for a large number of taxa is consistent with Proposition 3.3 of Dai and others (2020), which demonstrates FDR control only asymptotically.We attribute the strong performance of mirror statistics to the fact that its false discovery rate control is adapted to the current dataset of interest rather than a previously defined probabilistic model.

Data Analysis
We next illustrate mbtransfer using three microbiome datasets.The data are drawn from two human and one animal microbiome studies.They include an experimentally-defined intervention, one that arises in the natural progression of a prospective study, and a shift in continuous ecosystem parameters.The studies also vary in their taxonomic richness, number of subjects, and total number of timepoints.Despite the various domains and intervention types considered, each study focuses on how environmental change remodels the microbiome.David and others (2013) investigated the sensitivity of the human gut microbiome to brief diet interventions.To this end, they recruited 20 participants and randomly assigned them to either "plant" or "animal" interventions.Subjects in the two groups were required to maintain a plant-or animal-based diet during a five-day intervention window.Samples were collected for two weeks surrounding the intervention, typically at a daily frequency.Ultimately, 8 -15 samples were collected for each participant since some timepoints were never successfully sampled.We linearly interpolate timepoints onto an even, daily sampling grid, motivated by the cubic spline interpolation adopted by Ruiz-Perez and others (2019).Regularly spaced timepoints are a fundamental limitation of discrete, autoregressive models.Initially, the data contained 17310 taxa.We filter to those present in at least 40% of the samples, reducing the number of taxa to 191a drastic reduction, but one consistent with distinguishing a "core" microbiome for more focused analysis (Shade and Handelsman, 2012;Neu and others, 2021).

Diet and the gut microbiome
We fit mbtransfer with  =  = 2 and w , setting two intervention series3 and omitting any host features  () .Figure 6 shows in-and out-of-sample forecasts.Forecasting performance deteriorates out of sample, highlighting the between-participant heterogeneity in this study.This challenge is most pronounced within the lowest quantile of abundance.Nonetheless, out-of-sample forecasts are still clearly correlated with the truth.In both the in and out-of-sample contexts, forecasts on shorter time horizons are more accurate.In addition, performance seems better in the more highly abundant taxa.The gradient boosting model's least squares training objective likely deteriorates in sparse data.
We compute mirror statistics for time lags ℎ = 1, . . ., 4 to evaluate the effect of a four day shift to an animal diet.Supplementary Figure 15 shows the associated mirror statistic distributions.The increasing magnitude across lags for some taxa suggests that the diet intervention effects are not instantaneous but build up over consecutive treatment days.To support interpretation, Figure 7a shows the median difference between counterfactual trajectories for a subset of significant taxa.The taxa were chosen by applying principal component analysis to the simulated trajectory differences, projecting onto the first component, and selecting every sixth taxon according to that ordering.Some taxa (e.g., OTU000006) have more immediate but transient effects, while others (e.g., OTU000065) have more gradual but sustained changes.Further, in several taxa (e.g., OTU000118, OTU000012), a long-run increase follows an initial decrease, which is corroborated by the associated subject-level data.Within taxa, we found that the first and third quartiles of the counterfactual differences across subjects tended to agree.This suggests that the model has not learned interactions between the intervention effects and past composition -the effect of the diet intervention may be uniform across various initial community states.The main benefit of a transfer function modeling approach is the model's capacity to learn different shapes of counterfactual trajectories while still controlling a precise notion of FDR.
For comparison, the original, interpolated data for a subset of taxa is shown in Figure 7b.These views are consistent with the counterfactual trajectories, but they are obfuscated by the higher degree of sampling noise and require more space.Our results are in line with David and others (2013), but we can clearly describe ecosystem dynamics by modeling temporal dependence between the diet intervention and microbiome community profiles.

Birth and the vaginal microbiome
We next re-analyze data from Costello and others (2022), which studied how birth influences the composition of the mother's vaginal microbiome.Supplementary Figure 13 shows in and out-of-sample forecasting accuracy.Compared to Figure 6 in the previous analysis, in and out-of-sample performances are more comparable, reflecting the larger sample size of this study.The derived mirror statistics are shown in Supplementary Figure 16.Compared to the diet intervention, more of the ecosystem is shifted by the birth intervention.We generate four counterfactual trajectories for all subjects to understand how birth influences individual taxa and whether any effects are modulated by contraception use.Specifically, we compute f +ℎ y ( * −−1):( * −1) ), w( * −): * , z for w( * −): * ) ∈ {1  , 0  } representing presence or absence of the birth event and z ∈ {0, 1} denoting re-initiation of contraceptive use following birth.Figure 8a suggests the absence of an interaction with contraceptive use.This may be a consequence of the fact that 57% of subjects were missing any data on contraceptive use -though the examples discussed in Costello and others (2022) consider plausible mechanisms for how contraception can influence the postpartum microbiome, our model does not detect a generalizable enough association to learn the interaction.
Like in the diet intervention, we can distinguish between response trajectories.Members of genus Lactobacillus are clearly depleted, while other taxa appear to take advantage of the postpartum environment.For example, Porphyromonas appears briefly during the same window that the Lactobacilli disappear.Figure 8b compares these trajectories with real data.As before, we see that the learned trajectories denoise the original data, and consistent with the lack of interaction, we do not observe obvious, systematic associations between postpartum community trajectory and contraception use.

pH and the aquaculture microbiome
We next use mbtransfer in a problem with continuous intervention values.Yajima and others (2022) studied the taxonomic composition of the eel aquaculture microbiome, collecting water samples every 24 hours for 128 days from five aquaculture tanks.We can view the tank's pH and eel activity scores as continuous inputs w  to a transfer function model.Based on the five tanks' longitudinal data, Yajima and others (2022) concluded that the microbiome composition changes over time and is related to various environmental factors.Moreover, there was a substantial shift in pH in three of the tanks, and we analyze how this shift influenced community composition.
After preprocessing the 16S data, we have 345 samples and 128 taxa.We interpolate the sampling times to fill in some missing days.Then, we fit a mbtransfer with  = 4 and  = 4.We simulate counterfactual pH series that are constant for ten timepoints, with values varying between pH = 2 and = 9. Figure 9 shows scatterplots of pH vs. abundance for a subset of taxa that were found to be significant when contrasting the two extremes, pH = 2 and 9.Moreover, we can also simulate taxa trajectories for each counterfactual pH series (Figure 9b).The effects seem additive, with counterfactual abundances smoothly varying as a function of pH.The taxa with the clearest associations (e.g., X_0001, X_0010) appear to have larger variation in their simulated trajectories, which agrees with their being more sensitive to changes in pH.

Software
We created the mbtransfer R package for analyzing interventions using transfer function modeling.This package provides various functionalities, including the creation of an S4 object called ts_inter, handling intervention windows, conducting counterfactual simulations, and performing inference based on mirror  Once the model is trained, we can forecast future community composition.predict() will fill in any time points that are present in the intervention slot, which allows forecasts across different time horizons.For example, the block below fills in predicted compositions from the ninth time point on, ts_preds <-list() ts_preds <-predict(fit, subset_values(ts, 1:8)) Following model evaluation, we can identify taxa with instantaneous or delayed intervention effects.This can be achieved by simulating counterfactual alternatives and employing partial dependence mirror statistics to control the false discovery rate.We can obtain a list of intervention series by specifying the characteristics of the hypothetical interventions (e.g., step interventions starting at specific time points and with defined lengths).

Discussion
mbtransfer adapts transfer function models to the dynamic microbiome context.The approach is flexible and interpretable, enabling intervention analysis without assuming a restrictive functional form and supporting the simulation of counterfactual trajectories.We have complemented our modeling approach with a formal inferential mechanism, leveraging recent advances in selective inference.A simulation study illuminated our method's properties across data-generating settings, and our data analysis highlighted its practical application in three contrasting microbiome studies.
We anticipate several directions for further study.First, we have focused attention on developing mirror statistics for detecting temporal effects, but the same strategy could be generalized to support inference for inter-species relationships and host-microbiome interactions.Indeed, the construction of mirror statistics via partial dependence profiles depends only on having access to a simulator  that can generate hypothetical        15 for the aquaculture case study in Section 4.3.Note that we plot a taxon in the selected panel if it is significant for any lag.This explains why some taxa are selected despite having higher-lag mirror distributions that are symmetric around zero.
∈ R  : The (potentially transformed) abundances of all taxa  ∈ {1, . . ., } at time  in subject .This vector has th coordinate   ∈ R  : The strength of  different interventions at time  in subject .

Figure 1 :
Figure 1: Examples of microbial community shifts in response to environmental change.Part (a) describes the gut microbiome of a subject undergoing a diet intervention (David and others, 2013), (b) shows remodeling of a mother's vaginal microbiome following birth(Costello and others, 2022), and (c) profiles an aquaculture tank microbiome together with environmental pH(Yajima and others, 2022).Section 4 explores data from these studies in depth.

Figure 2 :
Figure 2: Time series simulated according to the mechanism described in Section 3.1.Each panel shows the trajectories for one taxon, and each row is the time series for one subject.The colors of the tiles encode changes in abundance.Red borders indicate the samples where the intervention is present.The first row of taxa (tax1, tax2, tax3) are nonnull with negative, positive, and negative effects, respectively, and the bottom row of taxa are all null.

Figure 3 :
Figure 3: A comparison of forecasting residuals across four folds (rows) in one simulation run suggests that forward integrating the MDSINE2 model can lead to exponential increases in forecasting errors.

Figure 4 :
Figure 4: Forecasting errors across simulation settings.The -axis corresponds to the average of   across folds.Within each panel, the signal strength  increases from left to right.Column panels give the proportion of taxa affected by the intervention, and rows have different numbers of taxa.Errors beyond 3× the interquartile range of those from fido and mbtransfer have been omitted from the view, excluding some outliers from MDSINE2.Runs that did not complete within 72 hours are not included -this explains the missing boxplot for MDSINE2 in the 200 taxa, 20% nonnull panel.The fido package is comparable to mbtransfer when the intervention strength is weak but deteriorates when the intervention is strong.

Figure 5 :
Figure 5: Inferential performance in the simulation experiment.Rows index different normalization methods and the total number of taxa.The target FDR in each case has been set to 0.2.Columns have varying proportions of nonnull hypotheses and compare DESeq2-based inferences with those from mirror statistics.DESeq2 does not provide FDR control for lag one effects in any simulation context.mbtransfer's mirror algorithm controls the FDR when using the DESeq2-asinh transformation data and when the number of taxa is sufficiently large.

Figure 6 :
Figure 6: Forecasting error for an mbtransfer model applied to the diet intervention dataset of Section 4.1.The -axis is faceted by quantiles of abundance and the -axis is faceted by time horizon ℎ.In-sample error refers to errors made on new timepoints for individuals who appeared in the training data, while out-of-sample predictions are made on individuals who did not appear during training.Performance is strongest in shorter time horizons and for more abundant taxa.

Figure 7 :
Figure 7: (a) Counterfactual difference in simulated trajectories for a subset5of the selected taxa in the diet data in Section 4.1.(b) Subject-level data from a subset of taxa appearing in (a).Each heatmap row is a subject, and each column is a timepoint.These data are consistent with the interpretations from the counterfactual simulation.For example, OTU000006 has more transient increases in abundance (e.g., Animal1, and Animal6) while OTU000065 has more prolonged departures (e.g., Animal3 and Animal 9).

Figure 8 :
Figure8: (a) Counterfactual differences for a subset of selected taxa from the re-analysis of(Costello and  others, 2022).Counterfactual differences are computed for each subject in the data, and bands represent the first and third quartiles of differences across subjects.Since the bands for birth control reinitiation overlap, we conclude that the model does not learn the interaction effects between the intervention and contraception use.(b) The corresponding subject-level data.Rows are grouped according to the birth control reinitiation survey response.

Figure 9 :
Figure 9: (a) Associations between pH and abundance for a subset of taxa that are selected by the mirror algorithm.Taxa have been sorted in decreasing order according to the magnitude of their associated mirror statistics.(b) Counterfactual trajectories under pH shifts.A sustained pH shift is imagined starting at day 42.Different values of pH tend to influence the magnitude, but not the shape, of the forecast trajectories

Figure 12 :
Figure12: Computation times for methods considered in the simulation experiment.fido is fast on untransformed, count data, which is the context in which it was originally designed.mbtransfer is comparable to fido on transformed data.Both packages are an order of magnitude faster than MDSINE2.

Figure 13 :
Figure 13: The analog of Figure 6 for the postpartum case study in Section 4.2.Each point is one sample.In-sample error refers to errors from future timepoints of subjects observed in the training data.Out-ofsample errors are those on previously unobserved subjects.As before, errors predictions are most accurate on nearby time lags and more abundant taxa.

Figure 14 :
Figure 14: The analog of Figure 6 for the aquaculture case study in Section 4.3.We only show in-sample errors, because our analysis only considers three unique tanks.

Figure 15 :
Figure 15: The distribution of mirror statistics   for all selected and a subset of unselected taxa.Larger statistics indicate strong, consistent lag-0 effects (specifically,   (0) for taxon ) across data splits.The selection threshold is chosen adaptively according to a false discovery proportion estimate.The unselected taxa shown are those with the largest median   , and we have shown as many as possible while limiting the total number of boxplots to 100.

Figure 16 :
Figure 16: The analog of Supplementary Figure 15 for the mirror statistics in the postpartum case study in Section 4.2.Mirror statistics appear more concentrated, likely a consequence of the larger sample size and stronger effects visible in this dataset.

Figure 17 :
Figure17: The analog of Supplementary Figure15for the aquaculture case study in Section 4.3.Note that we plot a taxon in the selected panel if it is significant for any lag.This explains why some taxa are selected despite having higher-lag mirror distributions that are symmetric around zero.