Comprehensive Meta-analysis of Ontology Annotated 16S rRNA Profiles Identifies Beta Diversity Clusters of Environmental Bacterial Communities

doi:10.1371/journal.pcbi.1004468

Fig 1.

Overview of Microbial Community meta-analysis.

The diagram lists all major components of our framework and their relation to each other. The results of each step are shown next to each component. Tools/scripts are shown in typewriter font. a) Data acquisition and database creation: we collect data samples from four different sources and unify their representations such that they can be integrated in a single relational database. b) The given web page provides a user-friendly way to generate highly customized BIOM tables to facilitate user-specific meta-analyses. c) Adaptive rarefaction, taking input either from the database or from a BIOM table, produces all-against-all beta diversity distance matrices for the provided samples. d) Conventional hierarchical clustering. e) Post hoc enrichment test for EnvO annotations. The final outputs are a spreadsheet documenting enriched clusters (precision, recall, F-measure, cluster coefficients, EnvO-terms, etc.) and interactive, annotated clustering visualizations.
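Steps c) and d) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sample names and OTU counts are hypothetical, Bray-Curtis dissimilarity stands in for whichever beta diversity metric the framework actually uses, and naive average linkage stands in for the unspecified clustering method. It is not the framework's own implementation.

```python
from itertools import combinations

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two OTU count dicts (stand-in metric)."""
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total

def distance_matrix(samples):
    """All-against-all distances, keyed by the unordered pair of sample names."""
    return {frozenset((s, t)): bray_curtis(samples[s], samples[t])
            for s, t in combinations(samples, 2)}

def average_linkage(samples, dist):
    """Naive average-linkage agglomerative clustering; returns the merge list."""
    clusters = [(name,) for name in samples]
    merges = []

    def d(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [dist[frozenset((x, y))] for x in c1 for y in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: d(*p))
        merges.append((c1, c2, d(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical samples, each already rarefied to 100 sequences.
samples = {
    "gut_1": {"otu_a": 50, "otu_b": 50},
    "gut_2": {"otu_a": 60, "otu_b": 40},
    "soil_1": {"otu_c": 70, "otu_d": 30},
    "soil_2": {"otu_c": 50, "otu_d": 50},
}
dist = distance_matrix(samples)
merges = average_linkage(samples, dist)
```

In this toy input the two gut samples merge first, the two soil samples second, and the two resulting clusters last, which is the kind of dendrogram structure the enrichment test in step e) then inspects.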

Fig 2.

EnvO subgraph for environmental Material.

Node size reflects the number of samples assigned to the EnvO-term (logarithmic scale, see size legend, right). Node colors are shades of the overarching ecosystem color (see left legend). Multiple inheritance of EnvO-terms is reflected by several colors arranged in concentric rings.

Table 1.

Ecosystem definitions based on EnvO categories.

Fig 3.

Alpha diversity box plots for different ecosystems.

Based on our dataset, we observe that soil, marine and plant-associated environments in general host more diverse communities. Thanks to the applied sub-categorization, we can further break down ecosystems to inspect diversity in different soil types (shown in supplementary S1 Fig). We calculate Phylogenetic Distance Alpha diversity from samples rarefied to 1140 sequences.
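Rarefying a sample to a fixed depth can be sketched as follows. The OTU names and counts are hypothetical, and the function name `rarefy` is introduced only for this illustration. The phylogenetic alpha diversity calculation itself requires a tree and is not shown.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Subsample `depth` sequences without replacement from an OTU count dict.

    Returns None when the sample holds fewer than `depth` sequences,
    so that such samples can be excluded from the comparison.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand counts into one entry per sequence, then draw without replacement.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))

# Hypothetical sample with 1500 sequences, rarefied to 1140.
sample = {"otu_a": 900, "otu_b": 500, "otu_c": 100}
rarefied = rarefy(sample, 1140)
too_small = rarefy({"otu_a": 10}, 1140)
```

Drawing every sample down to the same depth keeps diversity estimates comparable across samples with very different sequencing effort, at the cost of discarding samples below the chosen depth.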

Fig 4.

Comprehensive clustering of 10,313 samples with at least 2000 sequences.

Clusters enriched in EnvO-terms are identified and color-coded automatically if F₁-score > 0.5. Note that in the dendrogram, the entire clade is colored by the color of the enriched EnvO-term. The human/animal associated and soil clusters are supported by many independent studies, whereas freshwater and geothermal clusters are largely driven by findings of a single study. Study color, ecosystem colors and EnvO associations are visualized in the colorbars below the dendrogram. EnvO-annotation colors are shades of the associated ecosystem color (see legend).

Fig 5.

Illustration for Algorithm 1.

Given a hierarchical clustering of samples that are annotated with Ontology terms (colored boxes, ancestry relations are shown with black lines), it detects enriched ontological categories on various levels of abstraction in each possible cluster: while analyzing the indicated cluster (black box, emphasized triangle), all present categories (and their ancestral categories) are characterized by their F-measure. E03 and especially E20 (parent of E02 and E03) are relatively specific for this cluster, as evidenced by a relatively high F-measure, whereas E01, E10 and E02 are mostly present outside the cluster, reflected by a small number of True Positives. Abbreviations: TP = True Positives, FP = False Positives, TN = True Negatives, FN = False Negatives.
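The per-cluster scoring described in this caption can be sketched as below. This is an illustrative reading of the caption, not Algorithm 1 itself: the sample names and annotations are hypothetical, only the parent relation stated in the caption (E20 as parent of E02 and E03) is encoded, and the helper names `ancestors` and `f_measures` are assumptions.

```python
def ancestors(term, parents):
    """Return `term` together with all of its ancestors in the ontology."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def f_measures(cluster, annotations, parents):
    """F-measure of every term (or ancestral term) present in `cluster`."""
    # Expand each sample's annotations to include ancestral terms.
    expanded = {s: set().union(*(ancestors(t, parents) for t in terms))
                for s, terms in annotations.items()}
    present = set().union(*(expanded[s] for s in cluster))
    scores = {}
    for term in present:
        tp = sum(1 for s in cluster if term in expanded[s])
        fp = len(cluster) - tp
        fn = sum(1 for s in expanded
                 if s not in cluster and term in expanded[s])
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        scores[term] = 2 * precision * recall / (precision + recall)
    return scores

# Hypothetical annotations; E20 is the parent of E02 and E03.
parents = {"E02": ["E20"], "E03": ["E20"]}
annotations = {
    "s1": ["E03"], "s2": ["E03"], "s3": ["E02"],
    "s4": ["E02"], "s5": ["E01"], "s6": ["E01"],
}
scores = f_measures({"s1", "s2", "s3"}, annotations, parents)
```

For the cluster {s1, s2, s3}, the parent term E20 scores highest because it covers all three members while missing only one sample outside the cluster, whereas E02 scores low because half of its samples lie outside.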

Table 2.

Clusters from enriched in Environmental Ontology terms (as determined by).

Fig 6.

PCoA plot (principal components 1 and 2) for the same samples as in Fig 4.

The scatter plot shows relatively cohesive and distinct ecosystems. While large studies often constitute the bulk of ecosystem clusters, detailed inspection shows support from further, smaller studies. Data points for certain ecosystems have been separated in the subgraphs b) to e). a) PCoA scatter plot including all samples from all environments. The first component largely separates human and environmental samples, while the second component helps to identify clusters for soil, marine, freshwater and plant-associated samples. Misannotations of insect-associated samples (wrongly annotated as Soil) are shown in the red shape. b) The two main marine clusters, “Marine 1” and “Marine 2” (corresponding to the clusters in Fig 4 with the same name), are identifiable through the composite Ecosystem coloring: Marine sediments, shown in cyan/yellow, mostly form “Marine 2” due to their dual membership in soil and marine environments; in contrast, “Marine 1” samples are solely colored cyan. Hypersaline samples (red) appear widespread and non-cohesive. c) Freshwater samples, colored by EnvO-ID. Several environments (freshwater biome, aquarium, freshwater lake) appear strongly related, while samples from permafrost and sinkholes are outliers. d) Plant samples split according to the two main contributing studies, QiimeDB 1792 and 2019, respectively. Each cluster receives further support from small and medium-sized studies. e) Soil samples. Composite environments form sub-clusters.

Table 3.

Cluster coefficients for homogeneity (cluster compactness) and separation for selected ecosystems and -subsystems (including all samples).

Fig 7.

Comparison of Fisher’s exact test and F-measure.

We perform a grid search over various significance thresholds for both tests. The blue mesh shows disagreement of the tests (in %), and the stacked bars in green and red indicate, respectively, to what extent disagreement stems from Fisher’s exact test claiming significance but not F-measure and vice versa. Under the most commonly used thresholds (−log₁₀(p) score for Fisher’s exact test being 2, 3, or 5), F-measure is the stricter test (completely green bars), as its significant cases are a strict subset of those of Fisher’s exact test.
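How the two tests can disagree on a single cluster/term pair can be shown with a small sketch. The contingency counts are hypothetical, the one-sided (over-representation) form of Fisher’s exact test is an assumption, and the thresholds (−log₁₀(p) = 3, F-measure > 0.5) are just one point of the kind of grid described above.

```python
from math import comb

def fisher_enrichment_p(tp, fp, fn, tn):
    """One-sided Fisher's exact test p-value for over-representation."""
    n_total = tp + fp + fn + tn
    n_term = tp + fn       # samples carrying the term
    n_cluster = tp + fp    # samples inside the cluster
    upper = min(n_term, n_cluster)
    # Hypergeometric tail: probability of at least `tp` term samples in the cluster.
    tail = sum(comb(n_term, k) * comb(n_total - n_term, n_cluster - k)
               for k in range(tp, upper + 1))
    return tail / comb(n_total, n_cluster)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def both_tests(tp, fp, fn, tn, neg_log_p=3, f_min=0.5):
    """Return (Fisher significant?, F-measure significant?) for one table."""
    fisher_sig = fisher_enrichment_p(tp, fp, fn, tn) <= 10 ** -neg_log_p
    f_sig = f_measure(tp, fp, fn) > f_min
    return fisher_sig, f_sig

# Hypothetical case: a term with 200 samples among 10,000; a cluster of 100
# samples contains 20 of them (about 2 would be expected by chance).
fisher_sig, f_sig = both_tests(tp=20, fp=80, fn=180, tn=9720)
```

In this example the term is strongly over-represented, so Fisher’s exact test calls it significant, yet the cluster captures only a tenth of the term’s samples and the F-measure stays far below 0.5. This is the kind of case that would fall into the green bars.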
