DSAVE: Detection of misclassified cells in single-cell RNA-Seq data

doi:10.1371/journal.pone.0243360

Table 1.

List of single-cell datasets used in this study.

Fig 1.

Typical use case for the DSAVE BTM variation score and DSAVE cell divergence.

Ovals represent data while rounded rectangles represent data processing. The DSAVE BTM variation score and cell divergence are both applied to cell populations defined by clustering, using the original UMI count data in combination with cell clustering assignments. DSAVE allows for an iterative approach where the user can remove/reassign cells, experiment with clustering parameters, and assess the outcome, both in terms of total cell variation within the cluster (the BTM variation score) and detected misclassified cells. When the results are satisfactory the user can finalize the curation and proceed to further data analysis. The BTM variation score calculation requires a DSAVE template, which is explained further below in the methods section.

Fig 2.

Overview of the calculation of the DSAVE variation score.

Oval shapes represent data, while rounded rectangles represent calculations.

Fig 3.

Schematic visualization of how the variation per gene expression range is calculated.

1000 points are logarithmically distributed between 10 and 1000 CPM. For each point, a range is determined based on the template cell population. The range (shaded region) for a point (the star in the figure) is defined to cover 500 genes such that the geometric mean of their expression lies as close to the point as possible. The bounds 10–1000 CPM were determined empirically. For genes below 10 CPM, the spread in variation was generally high in comparison to the difference in variation between the aligned and SNO cell populations. Above 1000 CPM, the low frequency of genes resulted in large expression ranges that were no longer appropriate to represent with a single CPM value. The range width of 500 genes was also determined empirically; 500 genes produce a stable variation metric while still maintaining a reasonable representation of the CV distribution.
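The window selection described in this caption can be sketched as follows. This is a minimal illustration, not the DSAVE implementation: the function name `expression_windows`, its parameters, and the choice to measure closeness in log space are assumptions made for the example. It assumes the input is the mean expression (in CPM) of each gene in the template cell population.

```python
import numpy as np

def expression_windows(mean_cpm, n_points=1000, lo=10.0, hi=1000.0, width=500):
    """For each log-spaced CPM point, pick the window of `width` genes
    (contiguous in expression-sorted order) whose geometric mean is
    closest to the point. Closeness in log space is an assumption."""
    cpm = np.sort(np.asarray(mean_cpm, dtype=float))
    cpm = cpm[cpm > 0]  # the geometric mean needs positive values
    if cpm.size <= width:
        raise ValueError("need more genes than the window width")

    # Log geometric mean of every sliding window, via a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(np.log(cpm))))
    window_log_gm = (csum[width:] - csum[:-width]) / width

    # Points logarithmically distributed between lo and hi.
    points = np.logspace(np.log10(lo), np.log10(hi), n_points)
    log_points = np.log(points)

    # Window means are non-decreasing, so a binary search finds the
    # two neighbouring candidates; keep whichever is closer.
    idx = np.clip(np.searchsorted(window_log_gm, log_points),
                  1, window_log_gm.size - 1)
    left, right = window_log_gm[idx - 1], window_log_gm[idx]
    starts = np.where(log_points - left <= right - log_points, idx - 1, idx)
    return points, starts, cpm
```

Each entry of `starts` is the index of the first gene in the window for the corresponding point, so `cpm[starts[k]:starts[k] + width]` holds the genes assigned to point `k`.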

Fig 4.

Cell pool size needed for stable average gene expression.

A-C. DSAVE total cell pool variation estimation of 6 cell populations for different gene expression ranges, compared with the average variation of a bulk sample and the average variation of the mean of 4 bulk samples. D. Gene expression density in log₁₀ scale for 6 cell populations from different datasets in the range of 0.5–4000 CPM. The graph shows that the highly expressed genes are few in comparison to the lowly expressed genes, and that this distribution varies between cell populations.

Fig 5.

Investigation of BTM variation.

A. Variation as a function of gene expression for 3 cell populations and their SNO counterparts. The cell populations have not been aligned. B, C. Variation as a function of gene expression for 4 aligned cell populations and their SNO counterparts. All SNO cell populations now have virtually identical variation, meaning that any difference in total variation between aligned cell populations corresponds to the difference in BTM variation. D. The difference between the variation for the observed and SNO cell population as a function of gene expression. These curves represent the BTM variation.

Fig 6.

Evaluation of the DSAVE variation score.

A. Technical validation of the DSAVE variation score. All cell populations were generated in the same way as the SNO cell population, except that the probability for each gene was multiplied by a noise factor f. The noise factor was calculated as f = 2^(N·a), where N is a random variable drawn from a standard normal distribution and a is a positive parameter that describes the magnitude of the noise. The probabilities were then normalized to a sum of 1. The figure shows an increasing score with increasing BTM variation, and demonstrates that the score is similar when the same noise level is applied, regardless of cell type or number of reads. B. BTM variation (DSAVE score) for different datasets. C. Comparison between cell populations with 50% B cells and 50% T cells, and their pure counterparts, for a single patient. A specialized template with 1346 cells was used here due to small cell population sizes. D. Relative importance of variation factors calculated from 5 datasets. The graph shows which factors (dataset, cell type, and tissue of origin; indicated by red, blue, and green bars, respectively) can explain differences in the DSAVE variation score between cell populations.
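The noise procedure in panel A can be illustrated with a short sketch. Several things here are assumptions rather than details from the caption: the noise factor is read as 2 raised to the power N·a, the noise is drawn independently for every gene in every cell, and counts are generated by multinomial sampling from the gene probabilities. The function name `noisy_counts` and its parameters are hypothetical.

```python
import numpy as np

def noisy_counts(gene_probs, umis_per_cell, a, rng):
    """Simulate cells by sampling counts from gene probabilities that
    have been perturbed by a multiplicative noise factor f = 2**(N*a).

    Assumptions (not from the source): multinomial sampling, and an
    independent noise draw per gene and per cell."""
    p = np.asarray(gene_probs, dtype=float)
    p = p / p.sum()
    counts = np.empty((len(umis_per_cell), p.size), dtype=np.int64)
    for i, n_umi in enumerate(umis_per_cell):
        # N ~ standard normal; a >= 0 sets the noise magnitude.
        f = 2.0 ** (rng.standard_normal(p.size) * a)
        q = p * f
        q /= q.sum()  # renormalize the probabilities to a sum of 1
        counts[i] = rng.multinomial(n_umi, q)
    return counts
```

With `a = 0` the factor is 1 for every gene, so the sketch reduces to plain sampling from the unperturbed probabilities; larger values of `a` add more gene-level variation on top of the sampling noise.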

Fig 7.

Evaluation of the DSAVE cell-wise variation metric.

A. The distribution of cell divergence for the cells in three cell populations compared to their SNO counterparts. The cells are sorted by cell divergence. B. Decrease in variation upon removing the 500 most divergent cells from each dataset. C. UMI counts vs cell divergence for T cells from the HCA CB dataset. D. Fraction of counts belonging to mitochondrial genes vs cell divergence for T cells from the BC cell population. E, F. The divergence of T cells (E) and follicular B cells (F) from the LC dataset, showing potentially misclassified cells.
