Fig 1.
Necessary properties for embedding applications.
Rows denote biological applications (tasks), and columns denote the geometric properties that each task requires, i.e., the key properties that the task assumes are preserved or represented in the embedding.
Fig 2.
Distortion of necessary properties in embeddings.
(a) (i) Distribution of Jaccard distance of cell neighbors in PCA-preprocessed 2D embeddings and the relevant PCA space, as compared to ambient space. (ii) Distribution of Jaccard distance of cell neighbors in PCA-preprocessed 2D embeddings, as compared to the higher dimensional PCA space. (b) (i) Boxplot of correlations of cell type neighbor rankings to ambient space for the PCA-preprocessed 2D embeddings and the relevant PCA space. (ii) Boxplot of correlations of cell type neighbor rankings to the relevant higher dimensional PCA space for the PCA-preprocessed 2D embeddings. Embeddings generated n = 3 times. (c) Selection of equidistant groups with “near” or “far” distances in ambient space. UMAP embedding of the data in gray circles, with orange circles denoting all cells within the previously determined equidistant groups.
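As an illustration of the neighbor-overlap measure in panel (a), the following is a minimal sketch of a per-cell Jaccard distance between k-nearest-neighbor sets in two spaces. The function names, the choice of k = 30, the Euclidean metric, and the random stand-in data are all assumptions made for this example; the settings actually used are given in Methods in S1 Text.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row (Euclidean), excluding the row itself."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighbor_jaccard_distance(X_ref, X_emb, k=30):
    """Per-cell Jaccard distance between neighbor sets in a reference space and an embedding."""
    ref = knn_indices(X_ref, k)
    emb = knn_indices(X_emb, k)
    out = np.empty(X_ref.shape[0])
    for i in range(X_ref.shape[0]):
        a, b = set(ref[i]), set(emb[i])
        out[i] = 1.0 - len(a & b) / len(a | b)      # 0 = identical sets, 1 = disjoint sets
    return out

# Hypothetical data: a random "ambient" matrix and a crude 2D stand-in for an embedding.
rng = np.random.default_rng(0)
ambient = rng.normal(size=(200, 50))
embedding_2d = ambient[:, :2]
jd = neighbor_jaccard_distance(ambient, embedding_2d, k=30)
jd_self = neighbor_jaccard_distance(ambient, ambient, k=30)
```

A value of 0 means a cell keeps exactly the same neighbors in both spaces, and a value of 1 means none of its neighbors are shared.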
Fig 3.
Distortion of mixing patterns.
(a) Left plot shows “Log-normalized” ambient (blue) and 2D embedding (orange) distributions of mixing (fraction of cell neighbors in the same condition), where 1.0 is no mixing. Corresponding UMAP shown next to it. Right plot shows “Variance-Stabilized and Scaled” ambient (blue) and 2D embedding (orange) distributions of mixing (fraction of cell neighbors in the same condition). Corresponding UMAP shown next to it. (b) Left plot shows “MNN Integrated” ambient (blue) and 2D embedding (orange) distributions of mixing (fraction of cell neighbors in the same condition) for CEL-Seq cells. Corresponding UMAP shown next to it. Right plot shows “Scanorama Integrated” ambient (blue) and 2D embedding (orange) distributions of mixing (fraction of cell neighbors in the same condition) for CEL-Seq cells. Corresponding UMAP shown next to it.
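The mixing measure plotted here, the fraction of a cell's neighbors that share its condition, can be sketched as follows. This is an illustrative implementation only: the function name, k = 10, the Euclidean metric, and the synthetic data are assumptions, not the settings used for the figure.

```python
import numpy as np

def same_condition_fraction(X, conditions, k=10):
    """Fraction of each cell's k nearest neighbors that share its condition label."""
    conditions = np.asarray(conditions)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # exclude the cell itself
    nn = np.argsort(d2, axis=1)[:, :k]
    return np.mean(conditions[nn] == conditions[:, None], axis=1)

rng = np.random.default_rng(1)

# Hypothetical case 1: two conditions that occupy far-apart regions (no mixing).
separated = np.vstack([rng.normal(size=(50, 5)), rng.normal(size=(50, 5)) + 100.0])
labels = np.array(["A"] * 50 + ["B"] * 50)
frac_separated = same_condition_fraction(separated, labels)

# Hypothetical case 2: both conditions drawn from the same distribution (well mixed).
mixed = rng.normal(size=(100, 5))
frac_mixed = same_condition_fraction(mixed, labels)
```

In the separated case every cell scores 1.0, matching the caption's "1.0 is no mixing"; in the mixed case the scores fall toward the overall proportion of each condition.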
Fig 4.
Distortion in cluster validation and relationships.
(a) Prediction of cell label for 30% of the dataset(s) based on the labels of the 50 nearest neighbors. (b) Distributions of cell type inter- and intra-type distances for the ambient or reduced space (bottom). K-S distance shown as measure of separation, where higher values denote greater separation (see Methods in S1 Text).
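As a sketch of how a K-S distance can summarize separation in panel (b), the example below computes the two-sample Kolmogorov-Smirnov statistic between intra-type and inter-type pairwise distances. It assumes that these are the two distributions being compared and that distances are Euclidean; the helper names and toy data are hypothetical, and the exact procedure is described in Methods in S1 Text.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

def intra_inter_distances(X, labels):
    """Split all pairwise Euclidean distances into same-type and different-type pairs."""
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt(np.sum(diff ** 2, axis=-1))
    iu = np.triu_indices(X.shape[0], k=1)           # each unordered pair once
    same = (labels[:, None] == labels[None, :])[iu]
    return d[iu][same], d[iu][~same]

# Hypothetical data: two cell types placed far apart in a 5-dimensional space.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(40, 5)), rng.normal(size=(40, 5)) + 50.0])
types = np.array(["type1"] * 40 + ["type2"] * 40)
intra, inter = intra_inter_distances(X, types)
separation = ks_statistic(intra, inter)
```

The statistic ranges from 0 (identical distributions) to 1 (no overlap), so higher values correspond to greater separation, as stated in the caption.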
Fig 5.
Distortion in density-based visuals and analysis.
(a) Top row (left to right) displays UMAP embedding with n_neighbors = 5, embedding contour plot colored by condition, same contour with just in utero cells, same contour with just ex utero cells. Bottom row shows same plots for UMAP embedding with n_neighbors = 50. (b) Top row shows same plots for t-SNE embedding with perplexity of 5. Bottom row shows same plots for t-SNE embedding with perplexity of 50. Numbers denote comparisons between plots, dashed lines denote a difference, and solid lines denote the same appearance.
Fig 6.
Distortion in trajectory inference and continuous relationships.
(a) Velocyto RNA velocity embeddings for UMAPs made with n_neighbors = 17 or 50. Cell types of interest highlighted in gray. (b) Velocyto RNA velocity embeddings for t-SNEs made with perplexity of 17 or 50.
Fig 7.
Embedding properties are arbitrary.
Elephant-shaped embeddings [62,63] shown on the left, with corresponding correlations of data embeddings to ambient space shown in right-hand plots, for inter- and intra-type distance metrics. Metrics calculated over n = 5 embeddings. Colors denote cell types, delineated in Fig W in S1 Text.
Fig 8.
Relative contrast of the L1 and L2 metrics.
Violin plots display kernel density estimates of the distribution of log2(Relative Contrast) ratio values for each dataset, computed for n = 5 random subsets of 1,000 HVGs selected for each dataset from its top 2,000 HVGs. Relative contrast was calculated as described in Methods in S1 Text and [30], in the ambient (gene) space. Distributions are shown across datasets of increasing sample size (cell number). Box plots are overlaid in black, with the median denoted by the white dot. Whiskers denote 1.5× the interquartile range. HVG, highly variable gene; NSC, neural stem cell; VMH, ventromedial hypothalamus.
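For orientation, the sketch below shows one way such a quantity could be computed. It assumes the definition of relative contrast common in the distance-concentration literature, (Dmax - Dmin) / Dmin for the distances from a query point to all other points, and it assumes the plotted value is log2 of the L1 relative contrast divided by the L2 relative contrast. Both are assumptions for illustration; the definition actually used is the one in Methods in S1 Text and [30]. Function names and data are hypothetical.

```python
import numpy as np

def relative_contrast(X, query, p):
    """(Dmax - Dmin) / Dmin for Lp distances from `query` to the rows of X (assumed definition)."""
    d = np.sum(np.abs(X - query) ** p, axis=1) ** (1.0 / p)
    d = d[d > 0]                                    # ignore exact duplicates of the query
    return (d.max() - d.min()) / d.min()

def log2_contrast_ratio(X, p_num=1, p_den=2):
    """Per-cell log2 ratio of relative contrast under two Lp metrics (assumed: L1 over L2)."""
    ratios = []
    for i in range(X.shape[0]):
        others = np.delete(X, i, axis=0)            # every cell except the query
        rc_num = relative_contrast(others, X[i], p_num)
        rc_den = relative_contrast(others, X[i], p_den)
        ratios.append(np.log2(rc_num / rc_den))
    return np.array(ratios)

# Hypothetical stand-in for a cells x genes matrix restricted to a random HVG subset.
rng = np.random.default_rng(3)
cells_by_genes = rng.normal(size=(100, 1000))
log2_ratio = log2_contrast_ratio(cells_by_genes)

# Small hand-checkable case: distances 1 and 3 from the origin give (3 - 1) / 1 = 2.
rc_check = relative_contrast(np.array([[1.0, 0.0], [3.0, 0.0]]), np.array([0.0, 0.0]), p=2)
```

Under these assumptions, positive values indicate that L1 distinguishes nearest from farthest neighbors more sharply than L2 does for that cell.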