Fig 1.
MicroSLAM motivation and approach.
A) Flow chart of the microSLAM modeling approach (diagram created in BioRender). B–G) Two bacterial species with different population structures. First row: high population structure species Phocaeicola dorei (260 IBD cases, 218 controls); second row: low population structure species Blautia massiliensis (73 IBD cases, 44 controls). B&E) Heatmap of the gene-by-gene correlation matrix based on gene presence/absence across IBD samples, using the top 1000 most variable genes. Red: high positive correlation; blue: high negative correlation. C&F) Heatmap of the sample-by-sample GRM (1 minus Manhattan distance of gene presence/absence profiles). Dark green: high similarity; white: low similarity. D&G) Q-Q plot of p-values from tests of association between case/control status and presence/absence of individual genes in the pangenome. Tests use either microSLAM or a standard logistic regression that does not adjust for population structure (glm). The diagonal line shows the p-values expected under the null hypothesis of no association. Pangenome profiling of the metagenomes was done using MIDAS v3 [11].
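The GRM definition in panels C&F can be illustrated with a minimal sketch. It assumes that the Manhattan distance is divided by the number of genes so that similarities fall between 0 and 1; the caption does not state the normalization, and the function name and toy profiles below are hypothetical, not taken from the microSLAM code.

```python
def grm_from_presence_absence(profiles):
    """Return a sample-by-sample similarity matrix.

    profiles: list of equal-length lists of 0/1 values, one list per sample,
    where 1 means the gene is present in that sample.
    Assumption: similarity = 1 - (Manhattan distance / number of genes).
    """
    n_genes = len(profiles[0])
    grm = []
    for a in profiles:
        row = []
        for b in profiles:
            # For binary profiles the Manhattan distance counts differing genes.
            manhattan = sum(abs(x - y) for x, y in zip(a, b))
            row.append(1.0 - manhattan / n_genes)
        grm.append(row)
    return grm


# Toy data: three samples, four genes (illustrative values only).
profiles = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
grm = grm_from_presence_absence(profiles)
for row in grm:
    print(row)
```

In this toy example the first two samples differ at one of four genes, so their similarity is 0.75, while the first and third samples differ at every gene and have similarity 0.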
Fig 2.
MicroSLAM detects both strain and gene associations.
A) The GRM for Ruminococcus B gnavus with hosts sorted by their estimated b values and annotated by their disease status. B) PCoA from the R. gnavus gene presence/absence colored by host disease status (as in A). C) Histogram of permutation test statistics (τ-values) from the τ test for R. gnavus. The line denotes the observed value of τ. D) ROC plot for the microSLAM τ test model for R. gnavus. The statistic τ quantifies population structure. E) Gene presence/absence plot for a subset of genes associated with the random effect b for R. gnavus. Samples are ordered by b and annotated by their disease status.
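The comparison in panel C, an observed statistic set against a histogram of permuted statistics, is commonly summarized as an empirical p-value. The sketch below shows one standard convention (one-sided, with one added to the numerator and denominator); whether microSLAM uses exactly this formula is an assumption, and the function name and numbers are hypothetical.

```python
def permutation_p_value(observed, permuted):
    """One-sided empirical p-value for an observed statistic.

    observed: statistic computed on the real labels.
    permuted: statistics computed after shuffling the labels.
    Assumption: larger values are more extreme; the +1 terms count the
    observed statistic as one of the permutations, so p is never 0.
    """
    exceed = sum(1 for t in permuted if t >= observed)
    return (1 + exceed) / (1 + len(permuted))


# Illustrative values only, not results from the paper.
permuted_stats = [0.10, 0.40, 0.25, 0.05, 0.90, 0.30, 0.15, 0.20, 0.35]
p = permutation_p_value(0.80, permuted_stats)
print(p)
```

Here one of the nine permuted statistics is at least as large as the observed value, giving p = (1 + 1) / (1 + 9).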
Fig 3.
Simulations show that microSLAM improves power and false positive rates.
A) The false positive rate of the microSLAM τ test was estimated using simulations with varying GRMs but no trait associations. We simulated gene presence/absence and GRMs for 1000 iterations (τ test simulation 1; Methods). A histogram of p-values for the τ tests shows that 5.4% of tests have a p-value < 0.05. B) Power of the τ test for simulations with a range of values for the odds ratio relating the simulated trait y to presence of the trait-associated strain (τ test simulation 2; Methods), repeated for different numbers of samples (N). C) False positive rates of the β tests for glm and microSLAM were estimated using simulations with varying levels of population structure (τ) but no trait associations. We simulated gene presence/absence using the GRMs for the 71 species in the IBD compendium (β test simulation 1; Methods). The false positive rate increases with τ for the glm and is generally above the targeted level (0.05; horizontal line), while it decreases with τ and is generally below 0.05 for microSLAM. D) Power for 3 simulated species with different τ values and numbers of samples (N). For a subset of genes, presence/absence is simulated based on the trait using a range of odds ratios; other genes have presence probabilities that do not depend on the trait (β test simulation 2; Methods).
Fig 4.
MicroSLAM identifies novel IBD associations.
We analyzed all 71 species in our IBD compendium for three types of associations with case/control status: relative abundance (Kraken2 + Bracken: amount of the species predicts disease), population structure (microSLAM τ: strain predicts disease), and gene family (microSLAM β: gene presence/absence predicts disease). A) Venn diagram showing the number of species with significant IBD associations of each type. For genes, we counted a species if it had at least one significant gene family; species varied in the number of hits (S2 Table). All tests are localFDR adjusted for multiple testing. B) Boxplots showing the AUC ROC from τ test models for all 71 species, stratified by bacterial class. C) UHGG species tree for all 71 species, colored by order. The τ value, p-value for the τ test, number of significant genes, and number of samples for each species are plotted in the outer rings. D) Volcano plot for β tests with significant genes (localFDR < 0.2) colored by bacterial order (S3 Table). E) Bar plot of COG categories for the 83 genes with significant β tests.
Table 1.
A seven-gene operon present in all nine F. prausnitzii clades in UHGG v2.
Fig 5.
Investigation of F. prausnitzii fructoselysine PTS system operon.
A) 52 representative genomes selected from NCBI and colored by the dRep secondary cluster (Selection of Faecalibacterium prausnitzii genomes, Methods). B) Comparison of the S. Typhimurium operon to the operon in F. prausnitzii D. C) Graphic of the F. prausnitzii fructoselysine PTS system operon and its products (made in BioRender). D) P-values for the F. prausnitzii fructoselysine PTS system operon genes in microSLAM β tests across the four F. prausnitzii species defined by UHGG. The flanking genes are much less significant than the genes within the operon. Subunit D (the most significant gene in the microSLAM analysis) is located at index 0, and all other indices are relative to this gene.