Fig 1.

The denoising process using DEWÄKSS.

The procedure is as follows: (i) The PCA of the expression matrix X is computed. (ii) A cell-to-cell similarity structure is inferred from a k-Nearest Neighbor Graph (kNN-G). (iii) An invariant and independent mapping function is constructed. (iv) The objective function, with a defined optimum for denoising, minimizes the mean squared error (MSE) between the predicted weighted average of the neighbor cells’ state and the observed cell state. DEWÄKSS tunes its objective function with two main input parameters, the number of PCs and the number k of neighbors.
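
Steps (i) to (iv) can be illustrated with a minimal NumPy sketch. This is not the DEWÄKSS implementation: it assumes uniform neighbor weights, brute-force Euclidean distances in PC space, and a single averaging step, and the function names (pca_scores, knn_weights, self_supervised_mse, grid_search) are hypothetical. Excluding each cell from its own neighborhood is used here as one simple way to keep the prediction independent of that cell's observed value.

```python
import numpy as np

def pca_scores(X, n_pcs):
    """Project cells (rows of X) onto the first n_pcs principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]

def knn_weights(Z, k):
    """Row-normalized kNN weight matrix with a zero diagonal (assumes k < n cells)."""
    n = Z.shape[0]
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                      # a cell is never its own neighbor
    idx = np.argsort(d2, axis=1)[:, :k]               # k nearest cells per row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0 / k  # uniform weights (assumption)
    return W

def self_supervised_mse(X, W):
    """MSE between the neighbor average W @ X and the observed X."""
    return float(np.mean((W @ X - X) ** 2))

def grid_search(X, pcs_grid, k_grid):
    """Evaluate the objective on a grid of (PCs, k) and return the minimizer."""
    results = {}
    for n_pcs in pcs_grid:
        Z = pca_scores(X, n_pcs)
        for k in k_grid:
            results[(n_pcs, k)] = self_supervised_mse(X, knn_weights(Z, k))
    best = min(results, key=results.get)
    return best, results
```

For an expression matrix X with cells as rows, grid_search(X, [10, 50], [15, 100]) returns the (PCs, k) pair with the lowest self-supervised error together with all evaluated values.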

Fig 2.

Benchmarking DEWÄKSS on a predefined benchmark test [29].

A) The average Pearson correlation coefficients between cells that are known to be in the same group, calculated as in Tian et al. [29] for the RNAmix_CEL-seq2 and RNAmix_Sort-seq benchmark datasets. DEWÄKSS yields highly correlated expression profiles for “cells” in the same group, robustly across different normalization methods. B) Self-supervised hyperparameter grid search results (RNAmix_Sort-seq normalized by FTT). The number of neighbors is on the x-axis and the number of PCs is indicated by color. For the optimal configuration, the number of neighbors is shown by the dotted black line and the number of PCs by the solid black line. C) Optimization behavior using the optimal PCs = 4 found in (B) for 5 to 200 neighbors. The lowest prediction error for each diffusion trajectory (line) is marked by a circle, outlined in green if it is reached at i = 1 iteration and in black if it is reached at i > 1 iterations. The optimal value is marked by a diamond. The number of diffusion steps decreases as the number of neighbors increases. Trajectories are truncated at 9 diffusion steps. The prediction error decreases as the number of neighbors increases from 5 to 80, and then increases.
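
The metric in panel A can be sketched as follows. This is a generic illustration of an average within-group Pearson correlation, not the exact procedure of Tian et al. [29]; the function name mean_within_group_correlation is hypothetical, and it assumes cells are rows of X and that at least one group contains two or more cells.

```python
import numpy as np

def mean_within_group_correlation(X, groups):
    """Average Pearson correlation over all pairs of distinct cells sharing a group label."""
    groups = np.asarray(groups)
    C = np.corrcoef(X)                          # cell-by-cell correlation matrix
    same = groups[:, None] == groups[None, :]   # True where two cells share a group
    np.fill_diagonal(same, False)               # drop each cell's correlation with itself
    return float(C[same].mean())
```

Comparing this value before and after denoising indicates whether cells from the same group have become more similar.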

Table 1.

Optimal hyperparameters selected by DEWÄKSS self-supervised objective function.

Fig 3.

Denoising celltype-annotated data from Hrvatin et al. [33] using metrics from Arisdakessian et al. [27].

The dataset contains 33 annotated cell types in 48267 cells. A) Optimal denoising of the expression data with DEWÄKSS requires 100 PCs and k = 150 neighbors. B) We benchmark six different denoising pipelines: (i) in-house preprocessing (pp; section 4.4), (ii) DeepImpute, (iii) DEWÄKSS, (iv) MAGIC, (v) DrImpute and (vi) SAVER. To be able to run DrImpute and SAVER, we down-sample the data to 10% of the annotated cells. After preprocessing and denoising, the data are clustered with the Leiden algorithm [19] using 300 PCs and 150 neighbors (resolution is set to r = 1 for DEWÄKSS, r = 2 for preprocessed (pp) and DeepImpute, r = 4 for DrImpute and SAVER, and r = 0.5 for MAGIC). Algorithm performance is measured with the Fowlkes-Mallows metric and the silhouette score on two representations of the data, PCA and UMAP.
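
For reference, the Fowlkes-Mallows metric compares a clustering against annotated labels by counting pairs of cells. The sketch below uses the standard pair-counting definition in pure Python; the function name fowlkes_mallows is illustrative and this is not the code used to produce the figure.

```python
from collections import Counter
from math import comb, sqrt

def fowlkes_mallows(labels_true, labels_pred):
    """FM = TP / sqrt((TP + FP) * (TP + FN)), counted over pairs of cells."""
    # Pairs placed together in both the annotation and the clustering (TP).
    pairs_both = sum(comb(n, 2) for n in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in the annotation (TP + FN).
    pairs_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    # Pairs placed together in the clustering (TP + FP).
    pairs_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())
    if pairs_true == 0 or pairs_pred == 0:
        return 0.0
    return pairs_both / sqrt(pairs_true * pairs_pred)
```

The score is 1 when the clustering reproduces the annotation up to a relabeling of clusters and 0 when no annotated pair is kept together.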

Table 2.

Optimal configurations found by hyperparameter search on DEWÄKSS on seven real-world single-cell datasets.

Fig 4.

Mouse bone marrow (BM) data denoising.

A) The numbers of principal components needed to explain 99% and 90% of the variance in the data for different hyperparameter values for DEWÄKSS and MAGIC. DEWÄKSS is run with optimal parameters (k = 100, PCs = 50, i = i_minMSE), with oversmoothed parameters (k = 100, PCs = 50, i = i_minMSE), with robust parameters (k = 10, PCs = 13 selected using MCV as in S1 Section, i = i_minMSE), and as X base, where normalized expression values are used instead of PCs with (k = 100, i = i_minMSE). MAGIC is run with defaults (d = 15, PCs = 100, k = 15), with early stopping t1 (t = 1), and with d30 (d = 30). B) Expression of erythroid marker gene Klf1, myeloid marker Mpo, and stem cell marker Ifitm1 in DEWÄKSS optimal and DEWÄKSS oversmoothed data. C) The objective function output as a function of diffusion steps for the optimal number of PCs = 50. The minimum MSE is found for 100 neighbors and 1 diffusion step, i.e., using only the selected 100 neighbors. The MSE increases in each iteration.
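
The quantity in panel A, the number of principal components needed to reach a given share of the variance, can be computed as in the sketch below. It assumes cells are rows of X and that X has nonzero total variance; the function name n_pcs_for_variance is hypothetical.

```python
import numpy as np

def n_pcs_for_variance(X, fraction):
    """Smallest number of PCs whose cumulative explained variance reaches `fraction`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, largest first
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index where the cumulative share is >= fraction, converted to a count.
    return int(np.searchsorted(cumulative, fraction) + 1)
```

A denoised matrix that needs far fewer components than the input to reach 90% or 99% has had much of its variance removed, which is the comparison the panel draws between methods and settings.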

Fig 5.

Differentially expressed genes (DEGs) between bulk and single-cell data.

The top panel shows the delta AUROC for single-cell DEGs, ordered by adjusted p-value, for each deletion strain. The bottom panel shows the delta Jaccard index between bulk DEGs and single-cell DEGs at FDR = 0.01. “Preprocessed” denotes data that are count normalized and log-transformed with no denoising applied. Delta is taken between the denoised and preprocessed values. All computed metrics can be found in S2 Table.
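
The delta Jaccard index in the bottom panel can be illustrated with a short sketch. It assumes the delta is the denoised value minus the preprocessed value, which the caption does not state explicitly; the function names and the gene identifiers in the usage note are hypothetical.

```python
def jaccard_index(a, b):
    """Size of the intersection of two gene sets divided by the size of their union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def delta_jaccard(bulk_degs, denoised_degs, preprocessed_degs):
    """Change in overlap with bulk DEGs (assumed sign: denoised minus preprocessed)."""
    return jaccard_index(bulk_degs, denoised_degs) - jaccard_index(bulk_degs, preprocessed_degs)
```

With bulk DEGs {g1, g2, g3, g4}, denoised DEGs {g1, g2, g3} and preprocessed DEGs {g1, g5}, the two Jaccard indices are 3/4 and 1/5, so under this sign convention a positive delta means that denoising moved the single-cell DEG set closer to the bulk set.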

Fig 6.

Computational performance of all tested methods on selected datasets.

Runtime (minutes) is plotted against the total number of values (cells * genes) in the dataset, to account for differing numbers of genes in each dataset. The complete results table is available in S1 Table.
