Fig 1.
Workflow of data transformation to prediction of host phenotype.
First, taxon-taxon co-occurrence (binary) data from the American Gut Project (A) are input into the GloVe embedding algorithm (B) to produce a taxon-by-property transformation matrix (C), where each taxon is an Amplicon Sequence Variant (ASV). Then, we take the dot product between a sample-by-ASV table of interest (D) and the transformation matrix (C) to project that table into embedding space (E). The ASVs in the columns of D and those in the rows of C must match exactly. The embedded table (E) is used to train a random forest model (F), optionally along with sample-associated lifestyle and dietary information (G), to predict the IBD status of the host (H). As points of comparison, random forest models are also built on the same sample-by-ASV table (D) after PCA transformation (I) or after normalization without embedding (J).
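The projection step (D times C gives E) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `project_to_embedding`, the ASV labels, and all numbers are hypothetical toy values, and plain Python lists stand in for the matrix library one would normally use.

```python
def project_to_embedding(sample_table, sample_asvs, transform, transform_asvs):
    """Project a sample-by-ASV table into embedding space via a dot product.

    sample_table:   one row per sample, columns ordered as sample_asvs
    transform:      one row per ASV (ordered as transform_asvs),
                    one column per embedding dimension ("property")
    """
    # The ASV sets must match exactly, otherwise the product is meaningless.
    if sorted(sample_asvs) != sorted(transform_asvs):
        raise ValueError("ASVs in the table and the transformation matrix must match")

    # Reorder the transformation rows to follow the table's column order.
    row_of = {asv: i for i, asv in enumerate(transform_asvs)}
    ordered = [transform[row_of[asv]] for asv in sample_asvs]

    n_asvs = len(sample_asvs)
    n_dims = len(ordered[0])
    return [
        [sum(row[j] * ordered[j][k] for j in range(n_asvs)) for k in range(n_dims)]
        for row in sample_table
    ]


# Hypothetical toy data: 2 samples, 3 ASVs, 2 embedding dimensions.
sample_asvs = ["asv_a", "asv_b", "asv_c"]
sample_table = [[1.0, 0.0, 2.0],
                [0.0, 3.0, 1.0]]

transform_asvs = ["asv_c", "asv_a", "asv_b"]  # deliberately in a different order
transform = [[0.5, 1.0],   # asv_c
             [1.0, 0.0],   # asv_a
             [0.0, 2.0]]   # asv_b

embedded = project_to_embedding(sample_table, sample_asvs, transform, transform_asvs)
```

Aligning by ASV label before multiplying guards against a silent error: two matrices with the same shape but differently ordered ASVs would still multiply, yet produce a wrong embedding.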
Fig 2.
Transforming ASV tables into GloVe embedding space before training a model produces more accurate host phenotype predictions (IBD vs. healthy control) and makes models more robust to hyperparameter choice.
Each point represents a triplet of hyperparameter choices in a random forest model: the number of trees, the depth of each tree, and the weight on a positive prediction of IBD. Each model was trained on the data input type indicated by color (normalized, non-embedded counts in purple; PCA-embedded data in pink; GloVe-embedded data in blue). Models trained on GloVe-embedded data produce higher ROC AUCs with less variance across hyperparameter choices.
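As a rough sketch of what "a triplet of choices" means, the hyperparameter grid can be enumerated as a Cartesian product. The candidate values below are hypothetical, and the dictionary keys mirror the argument names of scikit-learn's `RandomForestClassifier` (`n_estimators`, `max_depth`, `class_weight`); the caption does not name the library, so that pairing is an assumption.

```python
from itertools import product

# Hypothetical candidate values; the source does not list the actual grid.
n_trees_options = [50, 100, 200]
max_depth_options = [2, 5, None]        # None means unlimited depth
positive_weight_options = [1, 5, 10]    # weight on the IBD (positive) class

# One dictionary per (trees, depth, weight) triplet, i.e. one point in the figure.
grid = [
    {
        "n_estimators": n_trees,
        "max_depth": depth,
        "class_weight": {0: 1, 1: weight},
    }
    for n_trees, depth, weight in product(
        n_trees_options, max_depth_options, positive_weight_options
    )
]
```

Each dictionary would be used to train one model per data input type, so that the spread of ROC AUCs across the grid can be compared between input types.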
Fig 3.
The embeddings and the predictive model were trained on the American Gut training set, and the model was tested on the American Gut held-out test set (A). Models trained on GloVe-embedded data have higher ROC AUC but slightly lower Precision-Recall AUC on the held-out test set (B).
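The two metrics reported here measure different things, which is why they can disagree. The sketch below computes both on hypothetical toy labels and scores. ROC AUC is computed as the fraction of positive/negative pairs ranked correctly, and average precision is used as one common estimator of the area under the precision-recall curve; the source does not state which estimator was used, so that choice is an assumption.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Hypothetical toy data: 1 = IBD, 0 = healthy control.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

auc = roc_auc(labels, scores)            # 3 of 4 positive/negative pairs ranked correctly
ap = average_precision(labels, scores)   # precision is 1/1 at rank 1 and 2/3 at rank 3
```

ROC AUC weighs all positive/negative pairs equally, whereas average precision is dominated by the top of the ranking and is sensitive to class imbalance, so a model can improve on one while slipping on the other.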
Fig 4.
Embeddings were trained on American Gut data, and the predictive models were trained and tested on the Halfvarson dataset (A). Transforming microbiome data into GloVe embedding space (100 features) prior to model training produces more accurate models than using ASVs directly (26,251 features) (B).
Fig 5.
Two models, one embedding-based and one ASV-based, were trained on American Gut data and tested on two independent query datasets (A). Embedding-based models significantly outperform ASV-based models when tested on the Halfvarson dataset (B) and the Schirmer dataset (C).
Fig 6.
Relationship between embedding space and phylogeny.
A: Hierarchical clustering of ASVs using similarity between property vectors matches phylogenetic tree topology. Histograms (light blue) show branch score distances (A1) and symmetric distances (A2) between permuted phylogenetic trees and hierarchically clustered trees based on property similarity. The dotted red line shows the distance between the true phylogenetic tree and the property-based tree. B: 2D PCoA ordination using cosine distance of 1500 randomly selected ASVs in embedding space. Ten pairs of ASVs were selected based on their high Spearman correlation in the American Gut ASV table across all samples (r > 0.7). Each pair is denoted by shape. Color shows the family classification of the selected pairs, and all other ASVs are shown in grey. C: t-SNE ordination using cosine distance of 5000 randomly selected ASVs, colored by genus. Three genera highlighted in the dashed circle exhibit similar co-occurrence patterns.
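Both ordinations in this figure start from pairwise cosine distances between ASV property vectors. A minimal sketch of that distance computation follows; the ASV names and two-dimensional vectors are hypothetical (the embeddings in the paper have 100 dimensions), and the ordination steps themselves (PCoA, t-SNE) are not shown.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between two property vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical property vectors for three ASVs.
property_vectors = {
    "asv_a": [1.0, 2.0],
    "asv_b": [2.0, 4.0],   # same direction as asv_a, different magnitude
    "asv_c": [2.0, -1.0],  # orthogonal to asv_a
}

# Symmetric pairwise distance matrix, the usual input to PCoA or t-SNE.
names = list(property_vectors)
distances = [
    [cosine_distance(property_vectors[i], property_vectors[j]) for j in names]
    for i in names
]
```

Cosine distance ignores vector magnitude, so two ASVs whose property vectors point in the same direction are treated as close even if one vector is much longer than the other.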
Fig 7.
Dimensions in GloVe embedding space correlate with some metabolic pathway annotations, but dimensions in PCA embedding space do not (A). Each column in each heat map represents a metabolic pathway from KEGG (e.g., ko00983). Each row is a dimension in either GloVe or PCA embedding space. Significant metabolic pathway correlations of the four properties strongly associated with IBD in both the Halfvarson and AGP datasets (B). Each point represents a metabolic pathway, the x-axis shows the pathway’s broad category, and the y-axis shows the strength of the correlation.