A joint modeling approach for longitudinal microbiome data improves ability to detect microbiome associations with disease

Changes in the composition of the microbiome over time are associated with myriad human illnesses. Unfortunately, the lack of analytic techniques has hindered researchers’ ability to quantify the association between longitudinal microbial composition and time-to-event outcomes. Prior methodological work developed the joint model for longitudinal and time-to-event data to incorporate time-dependent biomarker covariates into the hazard regression approach to disease outcomes. The original implementation of this joint modeling approach employed a linear mixed effects model to represent the time-dependent covariates. However, when the distribution of the time-dependent covariate is non-Gaussian, as is the case with microbial abundances, researchers require different statistical methodology. We present a joint modeling framework that uses a negative binomial mixed effects model to determine longitudinal taxon abundances. We incorporate these modeled microbial abundances into a hazard function with a parameterization that not only accounts for the proportional nature of microbiome data, but also generates biologically interpretable results. Herein we demonstrate the performance improvements of our approach over existing alternatives via simulation as well as a previously published longitudinal dataset studying the microbiome during pregnancy. The results demonstrate that our joint modeling framework for longitudinal microbiome count data provides a powerful methodology to uncover associations between changes in microbial abundances over time and the onset of disease. This method offers the potential to equip researchers with a deeper understanding of the associations between longitudinal microbial composition changes and disease outcomes. This new approach could potentially lead to new diagnostic biomarkers or inform clinical interventions to help prevent or treat disease.


Introduction
Here we provide a tutorial for how to apply our joint modeling methodology for longitudinal microbiome data [1] using the R programming language. To implement this methodology, we modified the existing rstanarm R package [2] to include functionality necessary for our model. More specifically, we updated the joint modeling implementation [3] by: • Adding functionality for an offset term in the longitudinal submodel • Parameterizing the event submodel to include relative abundances rather than the predicted counts from the longitudinal submodel • Introducing a scaling parameter to modify the scale of the relative abundances in the event submodel These changes are currently available via the development version of rstanarm and can be installed in R using the following commands.
if (!require("remotes")) { install.packages("remotes") } remotes::install_github("stan-dev/rstanarm", INSTALL_opts = "--no-multiarch") Additional instructions for installing this software are available on the rstanarm GitHub page. The documentation for the rstanarm software is available via CRAN or can be built using the above command by including the build_vignettes = TRUE argument (note that this significantly increases installation time).

Pregnancy microbiome dataset
We use a pregnancy microbiome dataset to illustrate how to perform joint modeling on longitudinal microbiome and time-to-event data. The dataset is derived from a case-control study by DiGiulio et al. examining the microbial compositions of various body sites of women during pregnancy [4]. We use a pre-formatted version of this dataset from a paper by Zhang, et al. on longitudinal modeling of microbiome data [5]. In this tutorial we provide the basic steps necessary to analyze how changes in specific taxon abundances in the vaginal microbiome during pregnancy affect the time-to-delivery.

Pre-processing
In general, the pre-processing necessary for analysis will be highly dependent on the dataset and personal preferences. The taxonomy table for this dataset contains entries with special characters that cannot be used to label rows and columns, so we remove those characters from the data.
dat.tax.tab <apply(dat.tax.tab, 2, function(x) gsub("\\[|\\]", "", x)) dat.tax.tab <apply(dat.tax.tab, 2, function(x) gsub(":|-", "_", x)) The row names of the taxonomy table and the column names of the OTU table must be identical for importing the data into phyloseq. The column names in the OTU table were prefixed with an 'X' when loaded to create valid data frame column names. To have matching and valid row and column names between these two tables, we introduce a new 'OTU_' prefix for the OTU labels.
otu.names <-paste0("OTU_", rownames(dat.tax.tab)) rownames(dat.tax.tab) <-otu.names colnames(dat.otu.tab) <-otu.names To import the data tables to a phyloseq object, we create the respective objects for the metadata (or sample data), OTU, and taxonomy tables. library(phyloseq) SAM <-sample_data(dat.samp.tab) OTU <-otu_table(dat.otu.tab, taxa_are_rows = FALSE) TAX <-tax_table(as.matrix(dat.tax.tab)) We then import the data to a phyloseq object and perform a few basic operations. The tax_glom function combines the samples at a given classification level -we will analyze data at the genus level. The subset_samples function is used to select only the vaginal samples we will use for analysis and to remove samples collected postpartum. We have also loaded the tidyverse R package, a collection of tools such as ggplot2 for plotting and dplyr for data formatting that improve the efficiency of data analysis [8].

library(tidyverse)
ps.genus <phyloseq(OTU, TAX, SAM) %>% tax_glom(taxrank = "Genus") %>% # Combine by genus subset_samples(BodySite == "Vaginal_Swab") %>% # Look only at vaginal samples subset_samples(GDColl <= GDDel) # Remove samples collected after delivery We relabel the taxa using the genus names now that the OTU abundances have been combined at the genus level. If you choose not to relabel the taxa names manually, the phyloseq software will label the combined OTUs based on the OTU with the largest abundance within the genus.

Formatting
Formatting metadata prior to exploratory analyses can be helpful, but determining the optimal formatting for your model might be an iterative process. We remove samples that were collected prior to t = 0, the beginning of gestation, and create and simplify variables necessary for the longitudinal and event submodels. Additionally, we adjust the time scale of our data to weeks rather than days or months. Data with a high resolution (e.g., gestational days) will have less drastic changes per unit of time. Conversely, a low resolution time scale (e.g., gestational months) could have time units with multiple samples which could reduce the model's ability to detect changes over time.

Preliminary analyses
Our methodology works best as a tool to verify and quantify relationships identified via preliminary analyses.
In this section we discuss basic methods for determining taxa that warrant further examination. Note that this is only a small collection of simple but useful analyses -more advanced analyses could provide additional insights.

Taxa abundances over time
In this example we look at the ten most abundant taxa sorted by mean relative abundance over time. Note that it could be informative to explore the taxa data sorted by other summary statistics such as sum or variance. We compare the trajectories for the top 10 genera over time using the time related outcome of preterm birth, which is defined as a time-to-delivery < 37 weeks.

Clustering microbiome samples
Clustering microbiome samples can reveal common microbial composition patterns and illustrate how those patterns might relate to outcomes. There are many approaches for clustering microbiome data that are outside the scope of this tutorial but should be further researched prior to analyzing a microbiome data set. For this example we perform partitioning around medoids (PAM) clustering on the Bray-Curtis dissimilarity matrix for all samples using four clusters.

show_colnames = FALSE, annotation_colors = anno.colors )
Cluster membership for the samples is shown along the top bar, and the preterm outcome is shown along the second bar with pink indicating samples from women who delivered preterm. Clusters 1 and 2 contain the majority of samples, which are dominated by high abundances of Lactobacillus, and do not have a large proportion of preterm outcomes (18% and 23%, respectively). Cluster 3 is a small group that consists of Gardnerella-dominant samples, all of which are associated with preterm outcomes. Cluster 4 contains Prevotella-dominant samples with 59% of the samples having a preterm outcome. These clustering results further support examining the three most abundant taxa and indicate that increased abundances of Gardnerella and Prevotella might be positively associated with preterm outcomes.

Verifying distribution fit
The longitudinal submodel assumes that the taxon follows a negative binomial distribution. The negative binomial distribution is a discrete probability distribution that is well-suited for non-negative and overdispersed microbiome count data. However, a negative binomial distribution will not be an optimal representation of taxa that exhibit bimodality or zero-inflation, which could impact the accuracy of joint modeling results. To assess how well a negative binomial distribution represents taxon abundances, we plot a histogram of the taxon abundances and overlay a plot of the density of counts sampled from a negative binomial distribution with the sample mean and dispersion observed in our data.
As an example of how to verify the distribution of taxa within the dataset, we will examine the two most abundant taxa, Lactobacillus and Prevotella.

Lactobacillus
We first calculate the sample parameters from the observed data for Lactobacillus. The sample dispersion is defined as θ = µ 2 σ 2 −µ , where µ is the sample mean and σ 2 is the sample variance. samp.mean <mean(taxa.data$Lactobacillus) samp.var <var(taxa.data$Lactobacillus) samp.disp <-samp.meanˆ2 / (samp.varsamp.mean) Using these parameter estimates, we draw 1000 samples from a negative binomial distribution. (Note that the dispersion parameter θ can also be referred to as the size or shape parameter.) We overlay a density plot of these samples onto a histogram of the observed values.
nb.dist <data.frame(samp = rnbinom(1000, size = samp.disp, mu = samp.mean)) ggplot(taxa.data, aes(x = Lactobacillus)) + geom_histogram(aes(y = ..density..), # Use density instead of counts color = "black", fill = "white" ) + geom_density(data = nb.dist, aes(samp), alpha = .2, fill = "red") Looking at this plot, we see more zero abundance Lactobacillus samples in our data than expected for a negative binomial distribution with the observed sample mean and sample dispersion. Because the distribution of Lactobacillus does not closely align with the expected density, the longitudinal submodel would not provide a good fit for the data.

Prevotella
We construct the same plot using Prevotella abundances.

Removing low abundance taxa
The previous analyses highlight highly abundant taxa that are typically of interest in microbiome data analysis. Some researchers might be interested in analyzing a large number of taxa or low abundance taxa with a detectable effect on the time-to-event outcome. In these instances, we recommend filtering out very low abundance taxa that would be insufficient for joint modeling analysis. A few options would be to exclude taxa that: • have a mean/median/sum below a specified threshold • are zero in a given number of samples (e.g., taxon is zero in more than 10% of samples) • are below a threshold in a given number of samples (e.g., taxon is below 1% abundance in more than 10% of samples) Filtering taxa is relatively easy using the prune_taxa() function in the phyloseq R package. However, note that calculation of the total counts (library sizes) must be performed prior to filtering taxa.

Joint modeling analysis
In this section we outline the steps to perform joint modeling on longitudinal microbiome data with time-toevent outcomes using the formatted data. To construct this model we use the rstanarm R package, which simplifies the process of analyzing Bayesian regression models in R via Stan.

Longitudinal submodel
The longitudinal submodel is a negative binomial mixed effects model and serves as a predictive model for the relative abundances included in the event component of the joint model. The fixed effects included in the model should be methodically chosen using variable selection and/or exploratory analyses. For a longitudinal model the random effects are typically represented with a subject-specific random intercept and a time covariate for the random slope. Although the event submodel will use the gestational weeks variable GWColl, the longitudinal submodel for Prevotella is more accurate when using the trimester variable TrimColl for the random slope. Finally, we include an offset term for the log library size of each sample to adjust for variation of total counts across samples. long.model <-Prevotella~GWColl + History_of_preterm_delivery + Preeclampsia + non_hispanic_white + (TrimColl | SubjectID) + # random effects offset(log(LibrarySize)) # offset term long.data <-long.data %>% select( Prevotella, GWColl, History_of_preterm_delivery, Preeclampsia, non_hispanic_white, TrimColl, SubjectID, LibrarySize ) Because the model predictions from the longitudinal submodel are incorporated into the event submodel, it is especially important to test the predictive fit of the longitudinal model. For testing purposes, the model can be built using external software, such as the BhGLM [9] or brms [10] R packages, or using the stan_glmer or stan_mvmer included in rstanarm. We use stan_mvmer here, which is the function used within the joint modeling function.

summary(mvmer.fit)
Note that it is preferable to use multiple chains but we use only one here for simplicity. Longitudinal model results can be viewed using the summary function, the output of which we exclude here for brevity.
The longitudinal model predictions do not require high accuracy but should have a relatively good fit. We plot the predicted relative abundances against the observed relative abundances and find that we have sufficient model predictions.

Event submodel
The event submodel is the foundation of the joint model into which the longitudinal model values will be introduced. In this example we model the time-to-delivery GWDel outcome. Variable selection for the event submodel should be performed using standard methods for Cox proportional hazards modeling. Specifically, users should verify that covariates included in the model do not violate the proportional hazards assumption using a function such as cox.zph in the survival R package [11].

Joint model for longitudinal microbiome data
Performing joint model analysis is relatively straightforward once the longitudinal and event submodels are defined. Aside from the scale_assoc parameter we have introduced, arguments for the stan_jm function are detailed in the rstanarm documentation. However, we will highlight important recommendations and requirements for some arguments based on our methodology.
• Scaling parameter: Our update to the joint modeling function in rstanarm introduces the scale_assoc argument, which represents the scaling parameter φ discussed in detail in our manuscript. This parameter multiplicatively scales the predicted relative abundances included in the event submodel.
Although the optimal value for this parameter might vary based on the data and specific taxon you are modeling, scale_assoc=10 is often sufficient for detecting the taxon abundance effect size and maintaining an interpretable result. • Data-dependent arguments: These arguments are specific to the dataset and user-defined variables.
-formulaLong, dataLong -formula and data frame for the longitudinal submodel -formulaEvent, dataEvent -formula and data frame for the event submodel -time_var -the time variable for the longitudinal data which must be included as a fixed effect in the longitudinal model and must be on the same time scale as the time-to-event outcome -id_var -the variable that identifies subjects which must be included in both longitudinal and event data • Model required arguments: To perform joint modeling with longitudinal microbiome data as presented in our manuscript, these argument settings are required.
family = neg_binomial_2 -defines the distribution for the longitudinal submodel assoc = muvalue -defines how longitudinal model values are parameterized in the joint model • Bayesian modeling parameters: -adapt_delta and max_treedepth -These are tuning parameters specific to the MCMC implementation used in rstanarm. The software provides warning messages when these values should be modified from their default values. -init_r -This argument sets the magnitude of the range for the initial parameter values, which will be uniformly sampled from the range [-init_r,init_r]. The default value of init_r=2 can result in initialization values that are too large for the event model parameters and cause the model building to fail. We recommend setting this argument to a smaller value, such as init_r=0.5, for this modeling approach.
We use the following command to construct the joint model as outlined above for our example pregnancy microbiome dataset.
jm.fit <-stan_jm( formulaLong = long.model, dataLong = long.data, formulaEvent = event.model, dataEvent = event.data, time_var = "GWColl", id_var = "SubjectID", family = neg_binomial_2, assoc = "muvalue", scale_assoc = 10, adapt_delta = 0.9, max_treedepth = 15, init_r = 0.5, seed = 54321 ) The print function shows a short summary of the posterior distributions for the estimated parameters. A more detailed summary of the model results can be printed using the summary function (not shown here for brevity). These results show that the posterior median for the taxon abundance effect size (listed in the results as Long1|muvalue in the event submodel) isα = 0.403. The hazard ratio for this parameter can be calculated as exp(0.403) = 1.5. Because we have scaled our relative abundances by 10, a unit increase for this model is equivalent to a 10% increase in relative abundance. Therefore, the predicted hazard ratio indicates that for a 10% increase in Prevotella abundance the hazard of delivery increases 1.5-fold.
We would like to note that in the manuscript, we use the posterior mean as the predictor for the taxon abundance effect size. Both the posterior mean and median are commonly used as parameter point estimates in Bayesian modeling. Ideally, as is the case with this example, the posterior mean and median will be approximately equal.
We plot the posterior distributions for model parameters using the bayesplot package [12].