CDSeq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression data

doi:10.1371/journal.pcbi.1007510

Fig 1.

Schematic of the CDSeq approach.

Heterogeneous samples consist of different cell types. The bulk RNA-Seq profile represents a weighted average of the expression profiles of the constituent cell types. CDSeq takes as input the bulk RNA-Seq data for a collection of samples and performs complete deconvolution, which outputs estimates of both the cell-type-specific expression profiles and the cell-type proportions for each sample. The figure depicts a simple scenario of six biological samples comprising four cell types, each with gene expression measurements on eight genes.
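
The weighted-average relationship described in this caption can be illustrated with a minimal sketch. It is a simplification that ignores the read-level sampling of the probabilistic model; the function name `mix_profiles` and the toy values are hypothetical and are not taken from the CDSeq software.

```python
def mix_profiles(proportions, profiles):
    """Return a bulk profile as the proportion-weighted average of cell-type profiles.

    proportions: list of K weights, one per cell type, summing to 1
    profiles:    list of K lists, each holding expression values for G genes
    """
    n_genes = len(profiles[0])
    bulk = [0.0] * n_genes
    for weight, profile in zip(proportions, profiles):
        for g in range(n_genes):
            bulk[g] += weight * profile[g]
    return bulk


# Hypothetical toy values: two cell types measured on three genes.
profiles = [[10.0, 0.0, 4.0], [2.0, 8.0, 4.0]]
bulk = mix_profiles([0.25, 0.75], profiles)
```

Complete deconvolution runs in the opposite direction: given only bulk profiles for many samples, it estimates both `proportions` and `profiles`.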

Fig 2.

Graphical representation of the CDSeq probabilistic model.

The light blue nodes, α, β, denote the hyperparameters that are assumed to be known. The dark blue nodes, ℓ, g_ij, r_ij, denote the values of observable random variables (either measured in the study or established in previous studies) whereas the white nodes, η, Φ, θ, c_ij, are unobservable random variables that need to be inferred from data. The outer box represents samples where M is the sample size, and the inner box denotes the RNA-Seq data of a sample where N is the total number of reads from the sample (see S1 Methods for details).

Table 1.

Deconvolution methods for comparison.

Fig 3.

Deconvolution of synthetic mixtures.

We ran CDSeq with six cell types, α = 5, β = 0.5, and 700 MCMC runs. (A) Difference (“residual”) between estimated and true cell-type proportion plotted against true proportion for CDSeq (green) and CIBERSORT (red). Each plotted point represents the value for a single sample. (B) Radar plot of RMSE for estimates of sample-specific cell-type proportions. CDSeq (green); CIBERSORT (red). (C) Difference (“residual”) between estimated and true log2 gene expression level (log2(RPKM)) plotted against true log2 gene expression level for CDSeq (green) and csSAM (red). Each plotted point represents a single gene, 22,498 genes in total. (D) Radar plot of RMSE for gene expression levels (RPKM). CDSeq (green); csSAM (red).
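
The two quantities plotted in these panels, the residual and the RMSE, can be computed as in the following minimal sketch. The helper names and the toy proportions are hypothetical; the source does not specify how the authors implemented these calculations.

```python
import math


def residuals(estimated, true):
    """Element-wise difference: estimated value minus true value."""
    return [e - t for e, t in zip(estimated, true)]


def rmse(estimated, true):
    """Root-mean-square error between estimated and true values."""
    res = residuals(estimated, true)
    return math.sqrt(sum(r * r for r in res) / len(res))


# Hypothetical toy values: proportions of one cell type across three samples.
true_props = [0.10, 0.40, 0.50]
est_props = [0.13, 0.36, 0.50]

prop_residuals = residuals(est_props, true_props)
prop_rmse = rmse(est_props, true_props)
```

For the expression panels, the same functions would be applied to log2(RPKM) values for the residuals and to RPKM values for the RMSE.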

Fig 4.

Performance comparisons on synthetic mixtures.

(A) RMSEs of sample-specific cell-type proportion estimates; (B) RMSEs of cell-type-specific gene expression profile (GEP) estimates.

Fig 5.

Deconvolution of mixed RNA from cultured cell lines.

We ran CDSeq with four cell types, α = 5, β = 0.5, and 700 MCMC runs. (A) Difference (“residual”) between estimated and true cell-type proportion plotted against true proportion for CDSeq (green) and CIBERSORT (red). Each plotted point represents the value for a single sample. (B) Radar plot of RMSE for estimates of sample-specific cell-type proportions. CDSeq (green); CIBERSORT (red). Total RMSE summed over cell types is 17% smaller for CDSeq than for CIBERSORT. (C) Difference (“residual”) between estimated and true log2 gene expression level (log2(RPKM)) plotted against true log2 gene expression level for CDSeq (green) and csSAM (red). Each plotted point displays the expression value of a single gene, 19,653 genes in total. (D) Radar plot of RMSE for gene expression levels. CDSeq (green); csSAM (red). Total RMSE of gene expression (summed over cell types) is 16% smaller for CDSeq than for csSAM.

Fig 6.

Performance comparisons on experimental mixtures.

(A) RMSEs of sample-specific cell-type proportion estimates; (B) RMSEs of cell-type-specific GEP estimates.

Fig 7.

Deconvolution of mixed liver, lung and brain cell lines.

Comparisons with CIBERSORT and csSAM on mixtures of liver, brain and lung cells. (A) Residuals of proportion estimates; (B) Radar plot of RMSE for proportion estimates; (C) Residuals of GEP estimates; (D) Radar plot of RMSE for GEP estimates.

Fig 8.

Comparison of CDSeq using the quasi-unsupervised strategy with CIBERSORT on deconvolution of B cells and T cells in lymphoma samples.

We ran CDSeq with 22 cell types, α = 0.5, β = 0.5, and 700 MCMC runs. We considered an anonymous CDSeq-identified cell type to match one of the B cell (blue dots) or T cell subtypes (red dots) if the Pearson correlation of their GEPs exceeded 0.6. (A) Correlation between estimated GEPs and true GEPs; (B) CDSeq estimated proportions versus flow cytometry; (C) CIBERSORT estimation versus flow cytometry.
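
The matching rule in this caption (Pearson correlation of GEPs above 0.6) can be sketched as follows. The caption states only the threshold, so assigning each anonymous cell type to its single best-correlated reference is an assumption made for illustration. The function names and toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def match_cell_types(estimated_geps, reference_geps, threshold=0.6):
    """Map each estimated cell type to its best reference above the threshold.

    Assumption: only the best-correlated reference is kept per estimated type.
    """
    matches = {}
    for est_name, est_profile in estimated_geps.items():
        best_name, best_r = None, threshold
        for ref_name, ref_profile in reference_geps.items():
            r = pearson(est_profile, ref_profile)
            if r > best_r:
                best_name, best_r = ref_name, r
        if best_name is not None:
            matches[est_name] = (best_name, best_r)
    return matches


# Hypothetical toy profiles over four genes.
reference_geps = {"B cell": [1.0, 2.0, 3.0, 4.0], "T cell": [4.0, 3.0, 2.0, 1.0]}
estimated_geps = {"type_1": [1.1, 2.0, 2.9, 4.2], "type_2": [2.0, 2.1, 1.9, 2.0]}
matches = match_cell_types(estimated_geps, reference_geps)
```

In this toy input, `type_2` correlates only weakly with either reference and is therefore left unmatched.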

Fig 9.

Deep deconvolution of PBMC data.

We applied CDSeq using the quasi-unsupervised learning strategy and ran CDSeq with 22 cell types, α = 50, β = 20. The black line is the linear regression line; the dashed line is the x = y line; R is the correlation coefficient; and P is the p-value for testing the null hypothesis of no correlation.

Fig 10.

Estimating the number of cell types.

The maximum of the log posterior provides an estimate of the number of cell types. (A) synthetic data; (B) mixed RNA data. In each data set, the method correctly estimated the number of cell types.
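
The selection rule in this caption amounts to taking the candidate number of cell types with the largest log posterior. A minimal sketch follows, assuming the log posterior has already been computed for each candidate; the function name and the numeric values are hypothetical placeholders, not results from the paper.

```python
def estimate_num_cell_types(log_posteriors):
    """Return the candidate number of cell types with the highest log posterior.

    log_posteriors: dict mapping candidate number of cell types -> log posterior
    """
    return max(log_posteriors, key=log_posteriors.get)


# Hypothetical placeholder values, one per candidate number of cell types.
log_posteriors = {2: -1520.4, 3: -1498.7, 4: -1475.2, 5: -1481.9, 6: -1490.3}
best_k = estimate_num_cell_types(log_posteriors)
```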
