STREAK: A supervised cell surface receptor abundance estimation strategy for single cell RNA-sequencing data using feature selection and thresholded gene set scoring

doi:10.1371/journal.pcbi.1011413

Fig 1.

STREAK schematic. Step (1) is the co-expression analysis on the training data; step (2) is gene set scoring followed by clustering and thresholding to produce cell-specific estimated receptor abundance profiles.
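
The two steps in this schematic can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the published implementation: the function names, the `top_k` parameter, the mean z-score (a stand-in for the gene set scoring method) and the fixed two-cluster 1-D k-means (a stand-in for the clustering step) are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev

def _ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0

def select_gene_set(train_rna, train_adt, top_k):
    """Step 1 (assumed form): keep the top_k genes most co-expressed
    with the receptor's CITE-seq signal in the training cells."""
    rho = {gene: spearman(expr, train_adt) for gene, expr in train_rna.items()}
    return sorted(rho, key=rho.get, reverse=True)[:top_k]

def score_cells(target_rna, gene_set):
    """Step 2a (stand-in): per-cell mean z-score over the gene set."""
    n_cells = len(target_rna[gene_set[0]])
    scores = [0.0] * n_cells
    for gene in gene_set:
        expr = target_rna[gene]
        mu, sd = mean(expr), pstdev(expr) or 1.0  # avoid dividing by zero
        for i, value in enumerate(expr):
            scores[i] += (value - mu) / sd / len(gene_set)
    return scores

def threshold_scores(scores, iters=20):
    """Step 2b (stand-in): two-cluster 1-D k-means; scores that fall in
    the lower cluster are set to zero, the rest are kept unchanged."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        low = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high = [s for s in scores if abs(s - lo) > abs(s - hi)]
        if not high:  # all scores identical: a single cluster
            break
        lo, hi = mean(low), mean(high)
    return [s if abs(s - lo) > abs(s - hi) else 0.0 for s in scores]
```

Here `train_rna` and `target_rna` are assumed to map gene names to per-cell expression lists, and `train_adt` holds the matching CITE-seq values for one receptor.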

Table 1.

Joint scRNA-seq/CITE-seq datasets used for method evaluation.

Table 2.

Number of cells in each training and target data subset, and the total number of cells, for each of the six datasets under the 5-fold cross-validation evaluation approach.

Fig 2.

Percentage of receptors for which a given technique (STREAK, SPECK, normalized RNA transcript, Random Forest or Support Vector Machines) generates estimates with the highest average rank correlation with the associated CITE-seq data. Averages are taken over five training subsets of 12,000 cells from the Hao data, 10,000 cells from the Unterman data, 1,682 cells from the MALT data, 3,940 cells from the Mouse Spleen data, 7,422 cells from the Monocytes data and 994 cells from the MPEM data.
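
The summary statistic in this caption can be illustrated with a short Python sketch. It assumes the per-subset rank correlations have already been computed; the function name `percent_best` and the nested-dictionary layout are hypothetical and chosen only for this example.

```python
from collections import Counter
from statistics import mean

def percent_best(correlations):
    """correlations: {method: {receptor: [rank correlation per training subset]}}.
    Returns {method: percentage of receptors for which that method has the
    highest average correlation}. Only receptors estimated by every method
    are compared; ties go to the method listed first (an assumption)."""
    methods = list(correlations)
    receptors = set.intersection(*(set(correlations[m]) for m in methods))
    if not receptors:
        return {m: 0.0 for m in methods}
    wins = Counter()
    for receptor in receptors:
        avg = {m: mean(correlations[m][receptor]) for m in methods}
        wins[max(avg, key=avg.get)] += 1
    return {m: 100.0 * wins[m] / len(receptors) for m in methods}
```

Restricting the comparison to receptors shared by all methods mirrors the captions that evaluate a subset of receptors when one method does not cover the full panel.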

Fig 3.

Percentage of receptors for which a given technique (STREAK, SPECK, normalized RNA transcript, Random Forest or Support Vector Machines) generates estimates with the highest average rank correlation with the associated CITE-seq data. Results are computed on the subset of receptors supported by the cTP-net method.

Fig 4.

Percentage of receptors for which STREAK or a given unsupervised abundance estimation technique (SPECK or the normalized RNA transcript) generates estimates with the highest average rank correlation with the associated CITE-seq data.

Fig 5.

Percentage of receptors for which a given supervised abundance estimation technique (STREAK, Random Forest or Support Vector Machines) generates estimates with the highest average rank correlation with the associated CITE-seq data.

Fig 6.

Percentage of receptors for which a given supervised abundance estimation technique (STREAK, cTP-net, Random Forest or Support Vector Machines) generates estimates with the highest average rank correlation with the associated CITE-seq data, restricted to the subset of receptors with defined cTP-net expression.

Fig 7.

Percentage of receptors with the highest average Spearman rank correlation between CITE-seq data and abundance profiles estimated using STREAK, SPECK, the normalized RNA approach, cTP-net or RF under the cross-training evaluation approach (Fig 7A–7C). The horizontal axis of each plot indicates the number of target cells evaluated from the Unterman data.

Fig 8.

Correlation versus correlation scatter plots for the PBMC Hao data. Each point corresponds to a receptor from a maximum sample size of 217 receptors. A LOESS (locally estimated scatterplot smoothing) curve is fitted to show the smoothed conditional mean. Individual correlations are computed using the Spearman rank correlation metric.

Fig 9.

Correlation versus correlation (computed using the Spearman rank correlation metric) scatter plots for the Mouse Spleen data. Each point corresponds to a receptor from a maximum sample size of 107 receptors.

Fig 10.

Average rank correlations between CITE-seq data and receptor abundance values estimated using STREAK and all comparative methods, evaluated using the 5-fold cross-validation approach with training data consisting of 12,000 cells from the Hao data. Asterisks mark receptors for which the STREAK estimate has the highest correlation with the corresponding CITE-seq data among all evaluated methods. The limits of the color gradient scale are set by the minimum and maximum average correlation values across all comparative methods combined.

Fig 11.

Average rank correlations between CITE-seq data and receptor abundance values estimated using STREAK and comparative methods, evaluated using the 5-fold cross-validation approach with training data consisting of 3,940 cells from the Mouse Spleen data.

Fig 12.

Average rank correlations between CITE-seq data and abundance values estimated using STREAK and comparative methods under the cross-training approach, with training on a subset of 5,000 cells from the Hao data and evaluation on a subset of 50,000 cells from the Unterman data.

Fig 13.

Sensitivity analysis comparing gene set scoring with and without thresholding. The plots show the frequency of receptors with the highest average rank correlation between CITE-seq data and abundance values estimated using STREAK (i.e., gene set scoring followed by thresholding) or VAM (i.e., gene set scoring alone), evaluated using the 5-fold cross-validation approach. Training data range from 1,000 to 12,000 cells for the Hao data (Fig 13A), 1,000 to 10,000 cells for the Unterman data (Fig 13B) and 1,000 to 1,682 cells for the MALT data (Fig 13C).
