Themetagenomics: Exploring Thematic Structure and Predicted Functionality of 16s rRNA Amplicon Data

Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). Elucidating such relations is often difficult as microbiome data are compositional, sparse, and have high dimensionality. Also, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to sample characteristics such as host status. Preserving the configuration of co-occurring microbes rather than detecting specific indicator species is more likely to facilitate biologically meaningful interpretations. Additionally, analyses that use taxonomic relative abundances to predict the abundances of different gene functions aggregate predicted functional profiles across taxa. This precludes straightforward identification of predicted functional components associated with subsets of co-occurring taxa. We provide an approach to explore co-occurring taxa using “topics” generated via a topic model and link these topics to specific sample features (e.g., disease state). Rather than inferring predicted functional content based on overall taxonomic relative abundances, we instead focus on inference of functional content within topics, which we parse by estimating interactions between topics and pathways through a multilevel, fully Bayesian regression model. We apply our methods to three publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease dataset from Gevers et al., an oral cancer dataset from Schmidt et al., and a time-series dataset from David et al. Using our topic model approach to uncover latent structure in 16S rRNA amplicon surveys, investigators can (1) capture groups of co-occurring taxa termed topics; (2) uncover within-topic functional potential; (3) link taxa co-occurrence, gene function, and environmental/host features; and (4) explore the way in which sets of co-occurring taxa behave and evolve over time. These methods have been implemented in a freely available R package: https://github.com/EESI/themetagenomics.


48
Introduction 49 High-throughput sequencing now permits for the analysis of multiple large datasets on the 50 microbiome and diseases of interest. Historically, researchers have sought to reduce the 51 dimensionality of the data and/or perform feature selection to identify species (or other taxa) of 52 interest that are correlated with sample/community-level attributes (which we will refer to as 53 "phenotypic" attributes or "phenotypes") like host health status. Unfortunately, these 54 phenotype-associated species may co-occur with the same or different proportions across 55 samples within the same phenotype. Capturing these configurations is of interest to us, as we 56 contend it is more informative than merely finding specific taxa [1,2].

57
Nevertheless, obtaining meaningful configurations or subsets of taxa is often a daunting task.

58
These high-dimensional microbiome datasets include categorical and numeric features 59 associated with each sample. These, in turn, may be linked to a set of taxonomic abundances

83
Methods that predict functional profiles from 16S rRNA survey data usually report the overall 84 function of a sample and do not provide granularity on how each subcommunity provides 85 specific functions (Fig 1). Standard methods that predict function from 16S rRNA survey data 86 include PICRUSt, Tax4fun,Piphillin,. These simulate gene abundances 87 from the OTU relative abundance profile by assigning pre-existing gene ontologies, based on 88 whole genome sequences, to the OTUs. The simulation is trivial for known microbes, but for 89 novel OTUs, gene content is interpolated through its neighbors' genes. These are determined 90 via an unsupervised phylogenetic tree reconstruction. However, after the gene abundance 91 profiles are simulated for an entire sample, a user cannot view which functional content 92 associates with which taxa, nor how subcommunities contribute to function.

101
The topics-over-OTUs distribution is hierarchically clustered to infer relationships between 102 clusters of co-occurring OTUs and topics (bottom). The result is the ability to identify key topics 8 132  Our pipeline aims to concisely summarize high-dimensional data in the form of OTU 136 abundances as low-dimensional sets of co-occurring taxa (topics) with their corresponding 137 predicted functional potential. When additional high-dimensional data is available (e.g.,

138
predicted gene function abundances), interpretability becomes increasingly difficult. Although interest (e.g., those that contain differentially-enriched functional profiles) can easily be 157 identified in our pipeline via a fully Bayesian multilevel regression model. We also apply our 158 approach to empirical time-series data where we characterized events in terms of sets of 159 correlated topics to explore how the taxonomic configurations evolved over time.

165
Here we explore the use of themetagenomics on publicly available datasets studying Crohn's 166 disease microbiota (Gevers et al. [21]), oral cancer microbiota (Schmidt et al. [22]), and the 167 variation of microbiota as a function of time (David et al. [23]). With the larger Gevers et al.

168
Crohn's dataset, we validate the ability of themetagenomics to capture microbial profile 169 "signatures" (configurations of taxa which are groups with specific ratios/relative abundances 170 of co-occurring taxa). We show that (1)

198
To assess generalizability, we used a training/testing approach. We randomly selected 80% of 199 samples as our training set; the remaining 20% were set aside for testing (Table S1). Class labels 200 were binary, with positive (CD+) and negative (CD-) clinical diagnoses acting as the positive 201 and negative classes, respectively. For classifying CD diagnosis, we hypothesized that using 202 topics as predictors would outperform using relative abundances of OTUs, since the relative 203 abundance-based predictors are sparser, whereas topic modeling performs dimensionality 204 reduction, resulting in a relatively smaller set of topics that are less sparse relative to OTUs.

205
There was little difference between the topic model with at least 25 topics and the OTU table to 206 train the classifier (S1 Fig, Table S2). During testing, however, using topics as features 207 outperformed relative abundances, particularly in the F1 score, with relative abundances 208 achieving 80.8% and at least 25 topics achieving greater than 82.1% (Table S3). Using OTU 209 relative abundances as predictive features resulted in a larger proportion of false negatives, 210 which was likely due to its reliance on few, relatively rare taxa. Topics, on the other hand, are 211 less reliant on rare taxa because dimensionality reduction generates less sparse features (S2 212 appendix).

214
To identify topics of interest that were strongly associated with phenotype, we again 215 implemented themetagenomics on the Crohn's disease dataset, using the same binary 216 indicator for CD diagnosis as above. We then performed posterior inference. The primary 13 217 output of the topic model, as with any Bayesian analysis, is a posterior distribution of quantities 218 that estimate latent variables-of-interest (e.g., the frequencies of topics, θ, in a particular sample)

219
given the observed data (e.g., OTU abundances). Posterior inference involves sampling these 220 latent variables-of-interest from the posterior distribution of the fitted topic model to calculate 221 expected means and assess uncertainty in those expectations.

222
With the posterior distribution, we identified topics-of-interest based on their "topic-sample-223 effects" -the regression coefficients that represent differences in topic frequencies between CD+ 224 and CD-samples. We performed permutation tests to ensure that detected topic-sample-effects  From the posterior topics-over-OTUs distribution (β) for the K25 model, we identified OTUs 260 highly associated with CD, that is, OTUs with high frequency in high-ranking-topics (CD+ 261 associated topics T19, T13, T25, T11; CD-associated topics T14, T2, T12, T15) in more than 99%

262
of posterior samples (arbitrary threshold). We categorized these OTUs as CD+ associated OTUs,

280
We fit BioMiCo using 25 and 50 assemblages and compared its ability to distinguish CD from 281 control using held-out testing data (same train/test splits as described previously) and then 282 compared these results to the prediction performance of the STM. Testing performance was 283 similar between the two approaches (Table S3, S6). The balanced accuracy was highest for the 284 25-topic STM model, but the STM's performance varied as a function of topic number. F1 score, 285 however, was much worse for BioMiCo due to its low precision.

286
For the 25-assemblage model, there were roughly four assemblages with high posterior 287 probability for CD samples and low posterior probability for controls. If we focused on the taxa 288 with the top-10 highest posterior probability of belonging to these assemblages, no more than 2 289 taxa were present in the top-10 highest probability taxa in the STM's CD-topics that were most 290 associated with CD, suggesting little correspondence between the composition of assemblages 291 and topics. Alternatively, when focusing on assemblages with high posterior probability for 292 control but not CD, one assemblage had 4 genera in common with the STM's topic 13:

316
Like Gevers et al., we identified an increase in membrane transport associated with CD+ 317 subjects' gut microbiome; however, using themetagenomics, we were able to pinpoint the 318 specific topics associated with the enrichment of these functional categories, T2 and T12 (Fig   319  3A). We then could link enrichment of membrane transport genes to the taxa that were also 320 enriched in this topic. For example, topics T2 and T12 were dominated by Enterobacteriaceae.

321
These Enterobacteriaceae-enriched topics were also enriched for siderophore and secretion system related genes. Like T2 and T12, T15 was highly associated with CD+; however, it was

341
The strongest topic-pathway interaction was found in T19 for genes encoding bacterial motility 342 proteins. For T19, three motility-related pathways (bacterial motility proteins, bacterial chemotaxis, flagellar assembly) had topic-pathway interactions that did not span 0 at 80% 344 uncertainty, suggesting that T19 was more enriched in cell motility genes relative to all other 345 topics. The pathways inferred from T19 are consistent with this taxonomic profile, which 346 consisted of motile bacteria belonging to Lachnospiraceae, Roseburia, and Clostridiales.

347
Enrichment of two lipopolysaccharide (LPS) synthesis categories were associated with CD+ 348 topics; however, one of these categories was specific for only T15 (Table S4).

383
Only pathways present in both the themetagenomics analysis of 16S rRNA data and

384
HUMAnN2 analysis of the metagenomics shotgun sequencing data are shown. Green=associated samples with positive cancer diagnosis, Purple=associated with healthy 386 samples. Fig 4B. Pathway category-topic interaction regression coefficients for metagenomic 387 data. Topics were generated based on KOs that belonged to high frequency taxa in the 388 themetagenomics pipeline. Fig 4C. Example topic-pathway heatmaps, similar to Fig 4A and 4B 389 from four of the 100 permuted metagenomic datasets using in the permutation test. Fig 4D. 390 Distribution of root-mean-squared-error (RMSE) scores (between the topic-pathway interaction 391 regression coefficients between themetagenomics and the metagenomic data) from the 100 392 permuted metagenomic datasets. The RMSE score (0.56) for the unpermuted metagenomic 393 dataset is delineated by the red dotted line.

395
To compare the results from themetagenomics to gene function abundances inferred from 396 metagenomic shotgun sequencing for each topic, we first identified high frequency taxa (those 397 with frequencies greater than 1% in that topic) then identified all reads belonging to these taxa 398 in the metagenomic shotgun data. To identify pathway-topic enrichment/depletion, we then 399 applied a multilevel regression model. The results indicate that the taxa belonging to a topic are 400 associated with an enrichment/depletion of genes present in the shotgun data ( Fig 4B). Notably,

401
LPS biosynthesis proteins and porphyrin metabolism pathways were depleted in multiple 402 topics in both sets of results. The relative enrichment/depletion of phosphotransferase system 403 genes was also similar.

404
We performed a permutation test to determine whether the similarities in gene 405 enrichments/depletions between themetagenomics and the metagenomic data were spurious. We randomly permuted the topic and gene pathway labels in the metagenomic data, refit the 407 multilevel regression model, and then calculated the root mean square error (RMSE) for each 408 topic-pathway interaction regression weight between the themetagenomics and permuted 409 metagenomic models. After 100 replicate simulations, the RMSE for the unpermuted 410 metagenomic model was smaller than every permuted metagenomic model (p < 0.05) (Fig 4C-411  D). Therefore, the apparent similarities in the gene enrichment/depletion profiles between 412 themetagenomics and the shotgun data were not due to random chance, indicating that 413 using predicted gene enrichment/depletion from 16S rRNA amplicon surveys resulted in 414 similar within-topic predicted functional profiles to those obtained by directly measuring 415 functional content via metagenomic shotgun sequencing.

418
The David et al. dataset contains fecal and salivary 16S rRNA gene surveys from two subjects.

419
We focused on fecal samples from subject B. We compared our results to the three profiles

422
The topic model approach identified 3 distinct gut configurations. In the topic correlation 423 network (Fig 5A), we identified a small subnetwork of three topics (marked by violet bracket)

424
and two large subnetworks that contained 24 and 14 topics each (red and green brackets, given day). There were two clear periods of rapid change in cluster frequency, specifically when transitioning from cluster 1 to 2 (days 152-154) and clusters 2 to 3 (day 161). Our intervals are 449 similar to the original study's transition points at days 144-145 and 162-163, where the shift 450 from a topic cluster 1 to topic cluster 2 corresponded with subject B's food poisoning diagnosis.

451
The transition between topic clusters 1 and 2 is abrupt and likely occurred around day 153.

452
Taxonomically, this transition is marked by a shift from Bacteroideaceae (posterior 453 frequency=0.338), Lachnospiraceara (0.276), and Rumunococcaceae (0.266) to Enterbacteriaceae 454 (0.246) and Clostridiaceae (0.195) families (Fig 5D). In particular, day 153 was distinctive for 455 topic 20. This rare topic was not correlated with any other topics and hence did not belong to 456 any topic cluster. While its taxonomic profile was quite similar to topic cluster 1, it was 457 distinctly enriched for Enterobacteriaceaea spp., which is consistent with the subject's Salmonella 458 diagnosis. Topic 20 likely marks the event of initial exposure to the pathogen.

459
The distribution of topic assignments for topic cluster 2 followed the order in which its topics 460 were positioned in the topic correlation network (the linear chain of topics) (Fig 5E) shifted toward cluster 3, which was similar to cluster 1 in terms of Bacteroidaceae (0.369),but 472 enriched in Lachnospiraceae (0.360) and depleted in Rumunoicoccaceae (0.165) (Fig 5D).

497
We present our approach at a time when easily-to-interpret analyses for complex microbiome 498 data are direly needed. Current methods often link the relative abundance of a single OTU to a 499 sample information of interest (e.g., disease state). These methods routinely identify important 500 subsets of taxa but ignore OTU co-occurrence and ratios. Network methods can overcome this 501 concern, but typically don't incorporate phenotypic information within the model;

502
consequently, they are incapable of directly linking sections of the OTU correlation network 503 with sample metadata of interest. Constrained ordination methods, such as canonical 504 correspondence analysis, do in fact couple inter-community distance with sample information, 505 but the user is limited to specific distance metrics (e.g., Chi-squared) and must follow key 506 assumptions (e.g., the distributions of taxa along environmental gradients are unimodal) [26].

507
Moreover, interpretation of biplots becomes increasingly difficult as more covariates are 508 included. While linking key taxa to functional content can be accomplished via sparse canonical 509 correlation analysis [27], this approach is susceptible to many of the interpretability problems 27 510 found in other ordination approaches, and exploring inferred relationships in the context of 511 taxonomic co-occurrence is not straightforward.

512
The ability to make meaningful inferences using current methods is further limited by the fact 513 that microbiome data is often inadequately sampled (thus justifying some type of normalization 514 procedure), compositional (due to normalization), sparse, and overdispersed. Thus, recent work 515 has explored the use of Dirichlet-Multinomial models, which are well equipped at managing 516 overdispersed count data [28][29][30]. The fact that Dirichlet-Multinomial conjugacy is exploited in 517 many topics models hints at their applicability for relative abundance data. We selected the 518 recently developed STM for our workflow because of its ability to not only utilize sample data 519 as prior information as in the Dirichlet-Multinomial regression topic model [31], but also 520 capture topic correlation structure and apply partial pooling over samples or regularization 521 across regression weights. predicted functional content, as is typical, the apparent relationship between membrane 535 transport genes and specific configurations of bacteria would be lost.

536
We have also shown that our approach drastically reduces the dimensionality of two high-

567
For our purposes, the posterior distribution of unobserved (latent) parameters given the 568 observed data is given by:

570
The generative process is formulated by first specifying the probability and, for each of the samples, is assumed to follow logistic normal distributions, where is a matrix of regression coefficients that estimate the degree of influence a Γ × ( -1) 575 covariate X p has on ; and is a covariance matrix. In addition to , the probability Σ × 576 , (OTU occurs in Topic ) = β , ∑ = 1 , = 1

577
For each topic, β k is assumed to be Dirichlet distributed. Finally, both topic assignments z m,n for 578 each OTU w m,n , along with each OTU, obey multinomial distributions,

580
For the relationships between topic model nomenclature and our terminology, see Table   581 1. The posterior distribution is estimated by a semi-collapsed variational expectation 582 maximization procedure. Convergence is reached when the relative change in the variational 583 objective (i.e., the estimated lower bound) in successive iterations falls below a predetermined 584 tolerance.