Table 1.
Two CyTOF benchmark data sets for analysis.
Fig 1.
Framework of the DGCyTOF model for identifying canonical cell populations and new cell type populations.
A) Flowchart of DGCyTOF. The single-cell input comprises labeled and unlabeled data from the CyTOF database. Cell type identification involves four steps. (1) For labeled cells, a supervised deep learning model automatically identifies canonical cell populations, that is, cell types gated by protein markers; see (B) for details. (2) For new cell populations, a novel graphic-clustering method integrating UMAP and HDBSCAN learns feature representations while preserving the data structure in a network of cell-to-cell interactions, and assigns clusters to identify new cell populations. (3) The cell types obtained from classification and clustering are adjusted between layers (1) and (2) through a feedback loop that uses an iterative calibration system to reduce false-negative errors in the integrated cell identification system. (4) In the final step, a three-dimensional (3D) visualization tool displays the cell clusters, projecting all cell type labels into an independent 3D space so that cell types can be clearly depicted and differentiated. B) A three-layer artificial neural network forms the deep classification-learning model for identifying canonical cell populations. C) A calibration-feedback learning system for cell type correction. The deep learning model in Fig 1A identifies many known cell types (here called existing clusters). A correlation threshold, computed by averaging the Spearman correlation, determines whether a cell belongs to one of these known cell populations. If the correlation of a filtered cell with the cells of a given canonical cell population is greater than the correlation threshold for that population, the cell is reallocated to that canonical population.
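The calibration rule described for panel C can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rank_rows`, `spearman_matrix`, `calibrate`) are hypothetical, and two details are assumptions because the caption does not specify them. The threshold is taken as the mean pairwise Spearman correlation among cells already in the population, and a candidate's score is taken as its mean Spearman correlation with those cells.

```python
import numpy as np

def rank_rows(X):
    """Rank the marker values within each row; tied values share their mean rank."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        order = np.argsort(row, kind="mergesort")
        ranks = np.empty(row.size)
        ranks[order] = np.arange(1, row.size + 1)
        _, inv, counts = np.unique(row, return_inverse=True, return_counts=True)
        out[i] = (np.bincount(inv, weights=ranks) / counts)[inv]
    return out

def spearman_matrix(A, B):
    """Spearman correlation between every row of A and every row of B.

    Computed as the Pearson correlation of the ranks. Rows with constant
    marker values are not handled (their correlation is undefined).
    """
    Ra, Rb = rank_rows(A), rank_rows(B)
    Ra = Ra - Ra.mean(axis=1, keepdims=True)
    Rb = Rb - Rb.mean(axis=1, keepdims=True)
    Ra = Ra / np.linalg.norm(Ra, axis=1, keepdims=True)
    Rb = Rb / np.linalg.norm(Rb, axis=1, keepdims=True)
    return Ra @ Rb.T

def calibrate(population, candidates):
    """Return (mask, threshold); mask[i] is True if candidate i is reallocated."""
    within = spearman_matrix(population, population)
    upper = np.triu_indices(len(population), k=1)
    # Assumption: threshold = mean pairwise correlation inside the population.
    threshold = within[upper].mean()
    # Assumption: candidate score = mean correlation with the population's cells.
    score = spearman_matrix(candidates, population).mean(axis=1)
    return score > threshold, threshold

# Toy data: rows are cells, columns are marker intensities.
population = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 5, 4], [2, 1, 3, 4, 5]])
candidates = np.array([[10, 20, 30, 40, 50], [5, 4, 3, 2, 1]])
mask, threshold = calibrate(population, candidates)
```

In this toy case the first candidate follows the population's marker ordering and is reallocated, while the second has the reverse ordering and is left unassigned.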
Table 2.
Contingency table for calculating the receiver operating characteristic curve.
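As a brief illustration of how a contingency table yields one point on a ROC curve, the sketch below computes the false-positive and true-positive rates from the four cell counts. The function name `roc_point` and the example counts are hypothetical and are not taken from Table 2.

```python
def roc_point(tp, fp, fn, tn):
    """Return (false-positive rate, true-positive rate) for one contingency table."""
    tpr = tp / (tp + fn)  # sensitivity: fraction of actual positives that are detected
    fpr = fp / (fp + tn)  # 1 - specificity: fraction of actual negatives that are flagged
    return fpr, tpr

# Illustrative counts only.
fpr, tpr = roc_point(tp=40, fp=10, fn=10, tn=40)
```

Sweeping the classification threshold and recomputing the table at each setting traces out the full curve.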
Fig 2.
Cell population identification by DGCyTOF in the analysis of CyTOF1 and CyTOF2 datasets.
Fig 2A shows the 32 known cell types identified by deep classification learning in dataset CyTOF1, and Fig 2C shows the 13 known cell types identified in CyTOF2. Fig 2B and 2D show the spectral clustering used to identify and visualize unknown cell populations in the two datasets.
Fig 3.
Comparison of confusion matrices and their associated receiver operating characteristic (ROC) curves for the real labels of the CyTOF1 and CyTOF2 datasets, as assessed using the DGCyTOF model.
(A, B) Confusion matrices for the (A) CyTOF1 dataset and (B) CyTOF2 dataset; (C, D) ROC curves for the (C) CyTOF1 dataset and (D) CyTOF2 dataset.
Table 3.
Comparison of methods by average performance in identifying known cell types in the training and testing data, using different measurements, for the CyTOF1 and CyTOF2 datasets.
Table 4.
Comparison of machine-learning methods by different measurements for CyTOF Dataset 1 (13 biomarkers, 24 labeled cell types).
Table 5.
Comparison of machine-learning methods by different measurements for CyTOF Dataset 2 (32 biomarkers, 14 labeled cell types).
Fig 4.
Visualizations of cell populations in databases CyTOF1 and CyTOF2.
(A-B) Two-dimensional visualization of cell embeddings produced by dimension-reduction techniques for databases CyTOF1 (A) and CyTOF2 (B). (C-D) Cell subtypes labeled by DGCyTOF for databases CyTOF1 (C) and CyTOF2 (D).
Table 6.
Calibration of cell types utilizing calibration feedback for CyTOF1 and CyTOF2 data.