Calibrating dimension reduction hyperparameters in the presence of noise

doi:10.1371/journal.pcbi.1012427

Fig 1.

Dimension reduction framework.

Z + ϵ represents the observed data, constructed from an (often lower-dimensional) signal Y that is embedded in data space as Z and perturbed by noise ϵ. X is the lower-dimensional embedding of Z + ϵ produced by a dimension reduction (DR) technique. X should be constructed to preserve Y, rather than Z + ϵ.
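
As a rough illustration of this framework, the sketch below simulates a signal Y, embeds it in data space as Z, and adds noise. The circle-shaped signal, the random orthonormal embedding map, the dimensions, and the noise level are all assumptions chosen for illustration, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n points, signal dimension r, data dimension p.
n, r, p = 200, 2, 10

# Y: the low-dimensional signal (here, points on the unit circle).
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
Y = np.column_stack([np.cos(theta), np.sin(theta)])

# Z: the signal embedded in data space. A random map with orthonormal
# columns (an assumption) preserves all pairwise distances of Y.
Q, _ = np.linalg.qr(rng.normal(size=(p, r)))  # shape (p, r)
Z = Y @ Q.T                                   # shape (n, p)

# Z + eps: the observed data, with isotropic Gaussian noise (sd = 0.1).
eps = rng.normal(scale=0.1, size=(n, p))
data = Z + eps
```

A DR technique would then be applied to `data`, while its output would ideally be judged against `Y`.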

Fig 2.

Low-dimensional simulated examples.

Visualizations of the signals used to simulate data.

Table 1.

Summary of results.

Dimensionality details and optimal perplexity for each data set.

Fig 3.

Trustworthiness vs. perplexity (links sd = 1).

t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 40 when comparing against the original data and 80 when comparing against the signal alone.
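
For reference, trustworthiness is commonly defined as follows (the Venna and Kaski form, which is also the one documented for scikit-learn). That the captions use exactly this form, and the choice of neighborhood size k, are assumptions here.

```latex
T(k) = 1 - \frac{2}{n k (2n - 3k - 1)}
       \sum_{i=1}^{n} \sum_{j \in U_k(i)} \bigl( r(i, j) - k \bigr)
```

Here n is the number of points, U_k(i) is the set of points that are among the k nearest neighbors of point i in the low-dimensional output but not in the reference space, and r(i, j) is the rank of point j among the neighbors of point i in the reference space. The reference space is either the original data or the signal, which is why the two comparisons can favor different perplexities.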

Fig 4.

Trustworthiness-maximizing representations (links sd = 1).

Trustworthiness-maximizing t-SNE outputs. Comparing against the signal resulted in a representation that better captured the two links.

Fig 5.

Optimal perplexity (links).

The experiment was repeated at various levels of noise. For each level of noise, the trustworthiness-maximizing perplexity was recorded when comparing against the original data and the signal. The optimal perplexity was consistently greater when comparing against the signal.
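
The procedure described in this caption could be sketched roughly as follows, using scikit-learn's `TSNE` and `trustworthiness`. The toy signal, noise levels, perplexity grid, and neighborhood size `k` are illustrative assumptions, and `optimal_perplexities` is a hypothetical helper rather than the authors' code.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness


def optimal_perplexities(signal, data, perplexities, k=10, seed=0):
    """Return the trustworthiness-maximizing perplexity when scoring
    against the original data and against the signal, in that order."""
    vs_data, vs_signal = {}, {}
    for perp in perplexities:
        # One t-SNE run per perplexity, scored against both references.
        emb = TSNE(n_components=2, perplexity=perp,
                   random_state=seed).fit_transform(data)
        vs_data[perp] = trustworthiness(data, emb, n_neighbors=k)
        vs_signal[perp] = trustworthiness(signal, emb, n_neighbors=k)
    return max(vs_data, key=vs_data.get), max(vs_signal, key=vs_signal.get)


rng = np.random.default_rng(0)
n = 80
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
# Toy signal (a circle lying in 3-D data space), not one of the paper's data sets.
signal = np.column_stack([np.cos(theta), np.sin(theta), np.zeros(n)])

perplexities = (5, 15, 25)
results = {}
for sd in (0.1, 0.5):  # assumed noise levels
    data = signal + rng.normal(scale=sd, size=signal.shape)
    results[sd] = optimal_perplexities(signal, data, perplexities)
```

Each entry of `results` maps a noise level to the pair (optimal perplexity against the data, optimal perplexity against the signal), which is the quantity plotted against noise in figures of this kind.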

Fig 6.

Trustworthiness vs. perplexity (high-dimensional clusters sd = 3).

t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 55 when comparing against the original data and 60 when comparing against the signal alone.

Fig 7.

Trustworthiness-maximizing representations (high-dimensional clusters sd = 3).

Trustworthiness-maximizing t-SNE outputs. Both outputs depict a similar clustering.

Fig 8.

Optimal perplexity (high-dimensional clusters).

The experiment was repeated at various levels of noise. For each level of noise, the trustworthiness-maximizing perplexity was recorded when comparing against the original data and the signal. The optimal perplexity was consistently greater when comparing against the signal.

Fig 9.

Scree plot for scRNA-seq data set.

A PCA projection was used to extract the signal. To determine the appropriate number of dimensions for the projection, a scree plot was drawn.
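
A minimal sketch of this step, assuming the signal is taken to be the scores on the first r principal components. The function names, the toy data, and the choice r = 3 are hypothetical; the real scRNA-seq matrix is not reproduced here.

```python
import numpy as np


def scree(data):
    """Proportion of variance explained by each principal component."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def pca_signal(data, r):
    """Scores of the data on its first r principal components."""
    centered = data - data.mean(axis=0)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :r] * s[:r]


rng = np.random.default_rng(0)
# Toy data: three strong latent directions plus weak noise in 20 dimensions.
latent = rng.normal(size=(300, 3)) * np.array([5.0, 3.0, 2.0])
data = latent @ rng.normal(size=(3, 20)) + rng.normal(scale=0.1, size=(300, 20))

ratios = scree(data)            # values to plot against component index
signal = pca_signal(data, r=3)  # r read off from the elbow of the scree plot
```

Plotting `ratios` against the component index gives the scree plot; the elbow suggests how many components to keep.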

Fig 10.

Trustworthiness vs. perplexity for r = 5 (scRNA-seq).

t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 40 when comparing against the original data and 120 when comparing against the signal alone.

Fig 11.

Trustworthiness-maximizing representations for r = 5 (scRNA-seq).

Trustworthiness-maximizing t-SNE outputs. Both outputs depict a similar clustering with slightly varying cluster positioning. The perplexity = 40 representation depicts tighter clustering but is outperformed on metrics of both local and global performance, suggesting that its over-clustering and cluster positioning are misleading.

Fig 12.

Trustworthiness vs. perplexity for r = 10 (scRNA-seq).

t-SNE outputs were calculated with varying perplexities. Local performance was measured via trustworthiness. The trustworthiness-maximizing perplexity was 50 when comparing against the original data and 60 when comparing against the signal alone.

Fig 13.

Trustworthiness-maximizing representations for r = 10 (scRNA-seq).

Trustworthiness-maximizing t-SNE outputs. Both outputs depict a similar clustering with slightly varying cluster positioning.

Fig 14.

Trustworthiness vs. n_neighbors for UMAP (scRNA-seq).

UMAP outputs were calculated with varying n_neighbors values. Local performance was measured via trustworthiness. The trustworthiness-maximizing n_neighbors was 190 when comparing against the original data and 300 when comparing against the signal alone.

Fig 15.

Scree plot for PBMC data set.

A PCA projection was used to extract the signal. To determine the appropriate number of dimensions for the projection, a scree plot was drawn.

Fig 16.

Trustworthiness vs. n_neighbors for UMAP (PBMC).

UMAP outputs were calculated with varying n_neighbors values. Local performance was measured via trustworthiness. The trustworthiness-maximizing n_neighbors was 50 when comparing against the original data and 70 when comparing against the signal alone.

Fig 17.

Cell types (PBMC).

UMAP representations for different values of n_neighbors. The cell types were assigned through study of known marker genes. The n_neighbors = 50 and n_neighbors = 70 representations best separated the different cell types. The n_neighbors = 50 representation is more tightly clustered than the n_neighbors = 70 representation. The relative positioning of the NK and CD8 T cells differs between the n_neighbors = 50 and n_neighbors = 70 representations.

Fig 18.

Plot of dendritic cells (PBMC).

Dendritic cells (DC) extracted from the UMAP representations constructed with different values of n_neighbors. The n_neighbors = 15 and n_neighbors = 50 representations show two clusters, while the n_neighbors = 70 representation may show three clusters.

Fig 19.

PCA applied to dendritic cells (PBMC).

PCA was applied to the subset of dendritic cells. The first two principal components suggest that the dendritic cells belong to three different clusters. The points were assigned to clusters by a three-cluster k-means clustering of the PCA projection.
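
A rough sketch of this check, assuming scikit-learn's `PCA` and `KMeans`. The toy matrix stands in for the dendritic-cell expression data, which is not reproduced here, and the sizes and seeds are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy stand-in for a cells-by-genes matrix: three groups of 40 cells each.
centers = rng.normal(scale=5.0, size=(3, 30))
cells = np.vstack([c + rng.normal(size=(40, 30)) for c in centers])

# Project onto the first two principal components.
pcs = PCA(n_components=2).fit_transform(cells)

# Three-cluster k-means on the 2-D PCA projection.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)
```

The resulting `labels` can then be used to color the same cells in each UMAP representation, so that one can see how well each n_neighbors value keeps the PCA-derived clusters apart.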

Fig 20.

Dendritic cells colored according to PCA projection (PBMC).

Dendritic cells (DC) extracted from the UMAP representations colored according to the k-means clustering upon the PCA projection of the dendritic cells. The n_neighbors = 70 representation separated the purple points the least among the three representations.
