Fig 1.
F-informed multidimensional scaling (MDS) for microbial ecology data analysis.
(a) Schematic overview of analyzing a dataset, represented as a design matrix X of N samples with S features or taxa. In a balanced design, samples are grouped by a replicates of the same experimental condition. Note that compositionality requires the sum of the elements xij across features to be a fixed constant (e.g., 1). Exploratory analysis of X is performed through ordination, followed by statistical inference, including hypothesis testing (e.g., PERMANOVA) or differential abundance analysis. (b) Computational process diagram of reducing the dimensionality of X for visualization using the label set y, so that the embedding remains consistent with inference results based on the F-statistic.
Table 1.
PERMANOVA test results for group differences in semi-synthetic datasets using the original structure (px) and the principal coordinates analysis representation (pz). The datasets were generated using SparseDOSSA [52] (section “Semisynthetic data” of Methods).
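As an illustration of the kind of test reported in Table 1, the following is a minimal sketch of a one-way PERMANOVA on a precomputed distance matrix, using the standard pseudo-F construction from sums of squared distances. The function names `pseudo_f` and `permanova`, the default of 999 permutations, and the fixed seed are assumptions made for this example and are not taken from the paper. Applying it to the original dissimilarity matrix and to distances computed on an embedding would give values analogous to px and pz, respectively.

```python
import numpy as np


def pseudo_f(D, labels):
    """Pseudo-F statistic of a one-way PERMANOVA from a square distance matrix D."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    n = labels.shape[0]
    groups = np.unique(labels)
    a = groups.shape[0]
    D2 = D ** 2
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = D2[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = D2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(idx.shape[0], k=1)].sum() / idx.shape[0]
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(D, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value). n_perm and seed are illustrative."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    f_obs = pseudo_f(D, labels)
    f_perm = np.array(
        [pseudo_f(D, rng.permutation(labels)) for _ in range(n_perm)]
    )
    # The observed labeling counts as one permutation, hence the +1 terms.
    p_value = (np.sum(f_perm >= f_obs) + 1) / (n_perm + 1)
    return f_obs, p_value
```

For one-dimensional data with Euclidean distances, this pseudo-F coincides with the classical one-way ANOVA F-ratio, which gives a convenient sanity check.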
Fig 2.
Majorization algorithm optimizes the F-informed MDS objective function.
(A) Changes in each term of the objective function (Eq 5), including the raw stress and confirmatory terms, are plotted against training epochs for different hyperparameter values. A semisynthetic dataset of N = 200, replicate 1, was generated using SparseDOSSA [52] (see the “Semisynthetic data” section of the Methods). (B) The PERMANOVA p-value under the two-dimensional representation (pz) is plotted against epoch and hyperparameter value until the stopping criteria are met. (C) The number of epochs until termination is plotted against the hyperparameter, which ranges from 0.2 to 1. Error bars indicate the standard deviation of triplicates.
Fig 3.
F-informed MDS is robust to the choice of hyperparameters.
(A) Shepard plots comparing pairwise distances from F-informed MDS (F-MDS) and supervised MDS (superMDS) under both zero (metric MDS) and nonzero hyperparameter settings. Distances in the original high-dimensional space were computed using Bray-Curtis dissimilarity; distances in the two-dimensional embedding used the Euclidean metric. (B) For each method, the Pearson correlation coefficient was calculated between the original and embedded distances. (C) Normalized stress (Stress-1) was computed for each embedding. The analysis used semisynthetic datasets with N = 200 (see the “Semisynthetic data” section of the Methods). Error bars indicate the standard deviation across triplicate datasets.
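The two quantities in panels (B) and (C) can be sketched as follows, assuming Bray-Curtis dissimilarities for the original data and Euclidean distances for the embedding, as stated in the caption. The function names are hypothetical. The Stress-1 normalization shown here divides by the sum of squared embedded distances (Kruskal's convention); this is an assumption, and the normalization used in the paper may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr


def stress1(d_orig, d_emb):
    """Kruskal-style Stress-1 between two condensed distance vectors.

    Assumption: normalized by the sum of squared embedded distances.
    """
    d_orig = np.asarray(d_orig, dtype=float)
    d_emb = np.asarray(d_emb, dtype=float)
    return np.sqrt(np.sum((d_orig - d_emb) ** 2) / np.sum(d_emb ** 2))


def shepard_metrics(X, Z):
    """Return (Pearson r, Stress-1) for data X (N x S) and embedding Z (N x 2)."""
    d_orig = pdist(X, metric="braycurtis")  # original-space dissimilarities
    d_emb = pdist(Z, metric="euclidean")    # embedded-space distances
    r = pearsonr(d_orig, d_emb)[0]
    return r, stress1(d_orig, d_emb)
```

Plotting `d_orig` against `d_emb` as a scatter plot gives the Shepard plot itself; points close to the diagonal correspond to well-preserved pairwise distances.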
Fig 4.
Different quality metrics confirm consistent preservation of the semisynthetic data pattern with F-informed MDS.
Six dimension reduction methods were evaluated for their preservation of (A) local and (B) global structure by calculating trustworthiness and continuity using two nearest-neighbor numbers, k = 14 (local) and k = 150 (global). The methods were also evaluated using (C) global distortion metrics (Stress-1 and Shepard diagram correlation) and (D) statistical inference metrics, including the ratio in statistical significance (F-rank-ratio) and the correlation in F-ratios (F-correlation), using a randomly permuted label set. The following hyperparameters were applied to each method: the F-MDS hyperparameter, the number of neighbors n (UMAP, supervised (-S) and unsupervised (-U)), perplexity “perp” (t-SNE), and the number of shortest dissimilarities n (Isomap). Three semisynthetic datasets of N = 200 were generated as described in section “Semisynthetic data” of Methods. Error bars indicate the standard deviation across the three datasets.
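Trustworthiness and continuity can be computed from neighbor ranks in the original and embedded spaces. The sketch below follows the rank-based definitions of Venna and Kaski and takes two precomputed square distance matrices, so that a non-Euclidean dissimilarity can be used for the original data. The function names are hypothetical, ties are broken by index order, and the normalization for k ≥ N/2 (needed for settings such as k = 150 with N = 200) is the worst-case bound derived from the same definition; whether the paper uses exactly this normalization is an assumption.

```python
import numpy as np


def neighbor_ranks(D):
    """ranks[i, j] = position of j among the neighbors of i (self has rank 0)."""
    D = np.array(D, dtype=float)
    np.fill_diagonal(D, -np.inf)  # force each point to rank itself first
    order = np.argsort(D, axis=1, kind="stable")
    n = D.shape[0]
    ranks = np.empty((n, n), dtype=int)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks


def trustworthiness(D_high, D_low, k):
    """Penalize points that enter the k-neighborhood only in the embedding."""
    n = D_high.shape[0]
    if not 0 < k < n - 1:
        raise ValueError("k must satisfy 0 < k < N - 1")
    r_high = neighbor_ranks(D_high)
    r_low = neighbor_ranks(D_low)
    # Intruders: among the k nearest in the embedding, but not in the original.
    intruders = (r_low <= k) & (r_high > k)
    penalty = np.sum((r_high - k)[intruders])
    # Worst-case penalty (times 2) used as the normalizer.
    if k < n / 2:
        norm = n * k * (2 * n - 3 * k - 1)
    else:
        norm = n * (n - k) * (n - k - 1)
    return 1.0 - 2.0 * penalty / norm


def continuity(D_high, D_low, k):
    """Continuity is trustworthiness with the two spaces swapped."""
    return trustworthiness(D_low, D_high, k)
```

Both scores lie in [0, 1], with 1 indicating that every k-neighborhood is preserved; small k probes local structure and large k probes global structure.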
Fig 5.
Different quality metrics confirm consistent preservation of the algal microbiome pattern with F-informed MDS.
Seven dimension reduction methods were evaluated for their preservation of (A) local and (B) global structure by calculating trustworthiness and continuity using two nearest-neighbor numbers, k = 3 (local) and k = 27 (global). The methods were also assessed based on (C) global distortion metrics (Stress-1 and Shepard plot correlation) and (D) the ratio of p-values (F-rank-ratio) and the correlation in F-ratios (F-correlation), using a randomly permuted label set. A dataset of N = 36 bacterial communities was analyzed as described in section “Real microbiome community” of Methods. The following hyperparameters were applied: the F-MDS hyperparameter, the number of neighbors nU (UMAP, supervised (-S) and unsupervised (-U)), perplexity “perp” (t-SNE), and the number of shortest dissimilarities nI (Isomap).
Fig 6.
F-informed MDS can visualize group-based data discrimination.
Two-dimensional representations of (A) a simulated dataset and (B) an algal microbiome dataset, generated using metric MDS and F-informed MDS (F-MDS) with two hyperparameter settings (0.5 and 1). Ellipses in the corresponding group colors are drawn at a confidence level of 0.68.
Fig 7.
Multi-class data is discriminated based on statistical inference.
Comparison of metric MDS and F-informed MDS visualizations using a four-dimensional, three-class simulated dataset. Each group in the dataset follows a normal distribution with the same covariance but a different mean (S9 Fig). Ellipses in the corresponding group colors are drawn at a confidence level of 0.68. A semisynthetic dataset of N = 75 was generated (see the “Semisynthetic data” section of the Methods).