Fig 1.
Dimension reduction framework.
Z + ϵ represents the data: an (often lower-dimensional) signal Y is embedded in data space as Z and perturbed by noise ϵ. X is the lower-dimensional embedding of Z + ϵ produced by a DR technique. X should be constructed to preserve Y, rather than Z + ϵ.
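As a minimal sketch of this framework, the data Z + ϵ can be simulated from a low-dimensional signal Y. The sizes, the Gaussian signal, the random orthonormal embedding map `Q`, and the noise scale are all illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: n samples, signal dimension r, data dimension p.
n, r, p = 200, 2, 10

# Y: the low-dimensional signal (Gaussian here purely for illustration).
Y = rng.normal(size=(n, r))

# Embed Y in data space with a random map that has orthonormal columns.
# The choice of a linear, orthonormal embedding is an assumption.
Q, _ = np.linalg.qr(rng.normal(size=(p, r)))  # shape (p, r)
Z = Y @ Q.T                                   # shape (n, p), rank r

# Add isotropic Gaussian noise to obtain the observed data Z + eps.
eps = rng.normal(scale=1.0, size=(n, p))
data = Z + eps
```

A DR technique sees only `data`; the argument of the framework is that its output X should be judged against `Y` (or equivalently `Z`) rather than against `data`.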
Fig 2.
Low-dimensional simulated examples.
Visualizations of the signals used to simulate data.
Table 1.
Dimensionality details and optimal perplexity for each data set.
Fig 3.
Trustworthiness vs. perplexity (links sd = 1).
t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 40 when comparing against the original data and 80 when comparing against the signal alone.
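For readers unfamiliar with the metric, the following is a sketch of the standard trustworthiness score written directly in NumPy. It penalizes points that are among the k nearest neighbors in the embedding but not in the reference space. The function name, the Euclidean metric, and the default `k` are assumptions made for illustration; the reference can be either the original data or the signal.

```python
import numpy as np

def _pairwise_dist(A):
    # Euclidean distance matrix; O(n^2 p) memory, fine for small examples.
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)

def trustworthiness(reference, embedding, k=5):
    """Trustworthiness of `embedding` with respect to `reference` (k < n / 2)."""
    n = reference.shape[0]
    d_ref = _pairwise_dist(reference)
    d_emb = _pairwise_dist(embedding)
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(d_ref, np.inf)
    np.fill_diagonal(d_emb, np.inf)

    # ranks[i, j] = rank of j among i's neighbors in reference space (1 = nearest).
    order_ref = np.argsort(d_ref, axis=1)
    rows = np.arange(n)[:, None]
    ranks = np.empty((n, n), dtype=int)
    ranks[rows, order_ref] = np.arange(1, n + 1)[None, :]

    # k nearest neighbors of each point in the embedding.
    nn_emb = np.argsort(d_emb, axis=1)[:, :k]

    # Only neighbors whose reference rank exceeds k contribute a penalty.
    penalty = np.maximum(ranks[rows, nn_emb] - k, 0).sum()
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

Calling `trustworthiness(data, X)` scores an embedding X against the original data, while `trustworthiness(signal, X)` scores the same embedding against the signal.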
Fig 4.
Trustworthiness-maximizing representations (links sd = 1).
Trustworthiness-maximizing t-SNE outputs. Comparing against the signal resulted in a representation that better captured the two links.
Fig 5.
The experiment was repeated at various levels of noise. For each level of noise, the trustworthiness-maximizing perplexity was recorded twice: once when comparing against the original data and once when comparing against the signal. The optimal perplexity was consistently greater when comparing against the signal.
Fig 6.
Trustworthiness vs. perplexity (high-dimensional clusters sd = 3).
t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 55 when comparing against the original data and 60 when comparing against the signal alone.
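The perplexity sweep described here could be sketched as follows with scikit-learn's `TSNE` and `trustworthiness`. The helper `sweep_perplexity`, the toy three-cluster data, the perplexity grid, and the neighborhood size `k` are hypothetical stand-ins, not the settings behind the figure.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness

def sweep_perplexity(data, signal, perplexities, k=10, seed=0):
    """Score each t-SNE output against the data and against the signal."""
    scores = {}
    for perp in perplexities:
        X = TSNE(n_components=2, perplexity=perp,
                 random_state=seed).fit_transform(data)
        scores[perp] = (
            trustworthiness(data, X, n_neighbors=k),    # vs. original data
            trustworthiness(signal, X, n_neighbors=k),  # vs. signal only
        )
    best_vs_data = max(scores, key=lambda q: scores[q][0])
    best_vs_signal = max(scores, key=lambda q: scores[q][1])
    return scores, best_vs_data, best_vs_signal

# Toy example (assumed): three clusters in 20 dimensions with sd = 3 noise.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 30)
centers = rng.normal(scale=10.0, size=(3, 20))
signal = centers[labels] + rng.normal(scale=0.5, size=(90, 20))
data = signal + rng.normal(scale=3.0, size=(90, 20))

scores, best_vs_data, best_vs_signal = sweep_perplexity(
    data, signal, perplexities=[5, 15, 25]
)
```

The two maximizers need not coincide, which is the comparison the figure reports. On real data the signal is not observed and would have to be estimated first.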
Fig 7.
Trustworthiness-maximizing representations (high-dimensional clusters sd = 3).
Trustworthiness-maximizing t-SNE outputs. Both outputs depict a similar clustering.
Fig 8.
Optimal perplexity (high-dimensional clusters).
The experiment was repeated at various levels of noise. For each level of noise, the trustworthiness-maximizing perplexity was recorded twice: once when comparing against the original data and once when comparing against the signal. The optimal perplexity was consistently greater when comparing against the signal.
Fig 9.
Scree plot for scRNA-seq data set.
A PCA projection was used to extract the signal. To determine the appropriate number of dimensions for the projection, a scree plot was drawn.
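A sketch of this step, assuming the signal is taken to be the scores on the first r principal components: the function below returns the proportion of variance per component (the quantities a scree plot displays) together with the rank-r projection. The function name and the SVD-based implementation are illustrative choices.

```python
import numpy as np

def scree_and_project(data, r):
    """Return explained-variance proportions and the first r PC scores."""
    centered = data - data.mean(axis=0)
    # Thin SVD of the centered data; singular values come sorted descending.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)   # values plotted in a scree plot
    scores_r = centered @ vt[:r].T    # n x r projection used as the signal
    return explained, scores_r
```

Plotting `explained` against the component index gives the scree plot; r is then read off where the curve flattens, which remains a judgment call.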
Fig 10.
Trustworthiness vs. perplexity for r = 5 (scRNA-seq).
t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 40 when comparing against the original data and 120 when comparing against the signal alone.
Fig 11.
Trustworthiness-maximizing representations for r = 5 (scRNA-seq).
Trustworthiness-maximizing t-SNE outputs. Both outputs depict a similar clustering with slightly varying cluster positioning. The perplexity = 40 representation depicts tighter clustering but is outperformed on metrics of both local and global performance, suggesting that its over-clustering and cluster positioning are misleading.
Fig 12.
Trustworthiness vs. perplexity for r = 10 (scRNA-seq).
t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 50 when comparing against the original data and 60 when comparing against the signal alone.
Fig 13.
Trustworthiness-maximizing representations for r = 10 (scRNA-seq).
Trustworthiness-maximizing t-SNE outputs. Both outputs depict a similar clustering with slightly varying cluster positioning.
Fig 14.
Trustworthiness vs. n_neighbors for UMAP (scRNA-seq).
UMAP outputs were calculated with varying n_neighbors values. Local performance was measured via trustworthiness. The trustworthiness-maximizing n_neighbors was 190 when comparing against the original data and 300 when comparing against the signal alone.
Fig 15.
A PCA projection was used to extract the signal. To determine the appropriate number of dimensions for the projection, a scree plot was drawn.
Fig 16.
Trustworthiness vs. n_neighbors for UMAP (PBMC).
UMAP outputs were calculated with varying n_neighbors values. Local performance was measured via trustworthiness. The trustworthiness-maximizing n_neighbors was 50 when comparing against the original data and 70 when comparing against the signal alone.
Fig 17.
UMAP representations for different values of n_neighbors. The cell types were assigned through study of known marker genes. The n_neighbors = 50 and n_neighbors = 70 representations best separated the different cell types. The n_neighbors = 50 representation is more tightly clustered than the n_neighbors = 70 representation. The relative positioning of the NK and CD8 T cells differs between the two.
Fig 18.
Plot of dendritic cells (PBMC).
Dendritic cells (DC) extracted from the UMAP representations constructed with different values of n_neighbors. The n_neighbors = 15 and n_neighbors = 50 representations show two clusters, while the n_neighbors = 70 representation may show three.
Fig 19.
PCA applied to dendritic cells (PBMC).
PCA was applied to the subset of dendritic cells. The first two principal components suggest that the dendritic cells belong to three different clusters. The points were assigned to clusters by a three-cluster k-means clustering of the PCA projection.
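The procedure in this caption might look like the following sketch, which uses scikit-learn's `PCA` and `KMeans`. The `cells` matrix is synthetic toy data standing in for the dendritic-cell expression matrix, and the `n_init` and `random_state` values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for the dendritic-cell subset: 3 groups of 40 cells.
centers = rng.normal(scale=6.0, size=(3, 30))
cells = np.vstack([c + rng.normal(size=(40, 30)) for c in centers])

# Project onto the first two principal components.
pcs = PCA(n_components=2).fit_transform(cells)

# Three-cluster k-means on the 2-D PCA projection.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)
```

The resulting `labels` can then be used to color the same cells in any other representation, for example a UMAP output.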
Fig 20.
Dendritic cells colored according to PCA projection (PBMC).
Dendritic cells (DC) extracted from the UMAP representations, colored according to the k-means clustering of the PCA projection of the dendritic cells. Of the three representations, the n_neighbors = 70 representation separated the purple points the least.