Abstract
Identifying cell types by clustering single-cell RNA sequencing (scRNA-seq) data is a significant step in single-cell analysis. However, great challenges remain due to the inherent high dimensionality, noise, and sparsity of scRNA-seq data. In this study, scPEDSSC, a deep sparse subspace clustering method based on proximity enhancement, is put forward. The self-expression matrix (SEM), learned from a deep auto-encoder with the two-part generalized gamma (TPGG) distribution, is adopted to generate the similarity matrix along with its second power. Compared with eight state-of-the-art single-cell clustering methods on twelve real biological datasets, the proposed method scPEDSSC achieves superior performance on most datasets, which has been verified through a number of experiments.
Author summary
The rapid advancement of single-cell RNA sequencing technologies has shed new light on the study of complex biological phenomena. A crucial step in single-cell transcriptome analysis is to group cells belonging to the same cell type from gene expression data, i.e., clustering a noisy, sparse and high-dimensional dataset with far fewer cells than genes. To address these problems, we propose a deep sparse subspace clustering method based on proximity enhancement. The raw sequencing data are first preprocessed using four different similarity measures and the corresponding Laplace scores to initially reduce their dimensionality. Afterwards, the self-expression matrix (SEM), learned from a deep auto-encoder with the two-part generalized gamma (TPGG) distribution, is adopted to generate the similarity matrix along with its second power. The clustering results are finally obtained using spectral clustering. Experimental comparisons with eight state-of-the-art methods on multiple datasets demonstrate the effectiveness and reliability of method scPEDSSC in clustering scRNA-seq data.
Citation: Wei X, Wu J, Li G, Liu J, Wu X, He C (2025) scPEDSSC: proximity enhanced deep sparse subspace clustering method for scRNA-seq data. PLoS Comput Biol 21(4): e1012924. https://doi.org/10.1371/journal.pcbi.1012924
Editor: Jason M. Haugh, North Carolina State University, UNITED STATES OF AMERICA
Received: July 5, 2024; Accepted: March 3, 2025; Published: April 28, 2025
Copyright: © 2025 Wei et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data and source code of scPEDSSC are available at https://github.com/gxsdcode/scPEDSSC.
Funding: This work was supported by the National Natural Science Foundation of China (No. 62366007 to JW), Guangxi Natural Science Foundation (No. 2022GXNSFAA035625 to JW), the National Natural Science Foundation of China (No. 62302107 to JL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Single-cell RNA sequencing (scRNA-seq) is an emerging high-throughput sequencing technology. By detecting gene expression at single-cell resolution, it overcomes an inherent defect of traditional sequencing, which averages the expression of cell groups and thus cannot reflect the actual state of each cell [1,2]. ScRNA-seq technology can provide significant support for exploring intercellular heterogeneity and gaining insight into biological processes [3]. Cell type identification is one of the fundamental upstream tasks for conducting these studies [4]; hence it is essential to differentiate the varieties of cells in scRNA-seq data. Great attention has been paid to devising new, efficient and reliable clustering methods, since traditional ones cannot cope with the high noise and dropout rates inherent in scRNA-seq data [5–7].
It has been acknowledged that deep learning approaches provide a unique opportunity to model noisy and complex scRNA-seq data [8]. In recent years, many deep learning-based clustering methods have been put forward. In 2019, Tian et al. [9] proposed the scDeepCluster method, which adds Gaussian noise to each coding layer and applies deep embedding clustering to generate the final cell clusters. In 2022, the scBKAP method presented by Wang et al. [10] conducted bisecting K-means clustering on dimensionality-reduced single-cell data generated from an autoencoder network and a dimensionality reduction model, MPDR. In 2023, Du et al. [11] proposed scCCL, a self-supervised contrastive learning method for clustering scRNA-seq data, which uses a momentum encoder to extract features from augmented data and implements contrastive learning in instance-level and cluster-level modules to obtain higher-order embedding representations. In the same year, He et al. [12] put forward method scMCKC, which performs denoising and dimensionality reduction with a zero-inflated negative binomial model-based autoencoder, and conducts weighted soft K-means clustering on the latent space using pairwise constraints with a priori information.
Since the high noise present in scRNA-seq data makes it challenging to explore group structure in high-dimensional space, subspace clustering has been adopted to capture global structural information and yield more reliable similarity [8]. In 2019, Zheng et al. [13] proposed a similarity learning-based method, SinNLRR, which learns non-negative and low-rank constrained similarity matrices for the purpose of dimensionality reduction and clustering. In 2021, Liang et al. [14] devised method SSRE, which computes the linear representation between cells based on sparse subspace theory and generates a sparse representation of the cell-to-cell similarity. Later, Wang et al. [8] indicated that subspace-based models ignore the abundant distribution and manifold information contained in scRNA-seq data, i.e., the learnt feature representation cannot fully capture the deep relationships among subspaces. Their scDSSC method combines noise reduction and dimensionality reduction for scRNA-seq data, modelling the data with a zero-inflated negative binomial (ZINB) distribution and constructing the similarity matrix from the learned hidden-layer self-expression one. However, a recent study [15] has indicated that normalized scRNA-seq data exhibit two statistical features, a bimodal expression pattern and right-skewness, which may not be modeled by the ZINB distribution. In this paper, the two-part generalized gamma (TPGG) distribution is introduced to model scRNA-seq data with these statistical features. The main contributions are as follows:
- Devise a deep auto-encoder by introducing the two-part generalized gamma distribution to better extract the features of the gene expression matrix.
- Explore the potential relationships between cells by calculating their second-order proximity, making the self-expression matrix contain more comprehensive information between cells.
- Propose a Proximity Enhancement based Deep Sparse Subspace Clustering method (scPEDSSC) to cluster cells with scRNA-seq data. It constructs the similarity matrix from the enhanced hidden-layer self-expression one, and then performs spectral clustering on it to acquire cell clusters.
- Conduct extensive comparative trials on twelve real datasets; the results demonstrate the effectiveness of the proposed method compared with the state-of-the-art approaches.
Materials and methods
Suppose that there is an m × n gene expression matrix X, where the rows denote a group of different types of cells C, the columns denote a set of genes G, and each entry $x_{ij}$ represents the expression level of gene j in cell i (i = 1, 2, …, m; j = 1, 2, …, n). A cell clustering method tries to partition the m cells into a set of K clusters $C = \{C_1, C_2, \ldots, C_K\}$, so that cells of the same type are categorized into the same cluster.
Based on the above notations and definitions, a novel deep sparse subspace clustering method, scPEDSSC, is put forward. As shown in Fig 1, we begin with preprocessing the original gene expression data, i.e., dropping the genes that are not expressed in any cell and selecting a given number of genes with high Laplace scores. Then a self-expression matrix is generated by training a deep auto-encoder on the preprocessed gene expression data. Next, a similarity matrix is constructed from the self-expression one enhanced with its second-order proximity. Finally, spectral clustering is conducted to produce a group of clusters. The critical techniques of method scPEDSSC are described as follows.
Step 1: Data preprocessing. Step 2: Learning the self-expression matrix. Step 3: Constructing and enhancing similarity matrix. Step 4: Spectral clustering.
Data preprocessing
Since low-expressed genes fail to provide valid information for clustering in most cases, they are filtered out from the given gene expression matrix X so as to reduce the dimensionality of the data [16–18]. We begin with dropping the genes that are not expressed in any cell. Then each row is normalized with the L2 norm to eliminate the expression scale differences between cells. Next, four gene-gene similarity matrices, $S^{spa}$, $S^{pea}$, $S^{spe}$, and $S^{cos}$, are created by calculating four correlation coefficients, namely sparse representation, Pearson, Spearman, and Cosine, on the normalized expression matrix [14]. For each gene, four Laplace scores are computed based on the four similarity matrices. Finally, the top T genes with the higher harmonic mean of the four Laplace scores are retained. For convenience of description, the preprocessed gene expression matrix is still denoted by X.
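For illustration, the gene-screening step can be sketched in Python with the classical Laplacian-score formulation (He et al., 2005). This is only an illustrative sketch, not the authors' implementation: it builds similarity graphs from three of the four measures (the sparse-representation similarity of [14] is omitted for brevity), and the convention of retaining genes with a higher harmonic-mean score follows the description above.

```python
import numpy as np
from scipy.stats import rankdata

def laplacian_scores(X, W):
    """Laplace (Laplacian) score of every gene (column of X, cells x genes)
    on a similarity graph W over the cells (He et al., 2005)."""
    d = W.sum(axis=1)                       # node degrees
    L = np.diag(d) - W                      # graph Laplacian
    mu = (X.T @ d) / d.sum()                # degree-weighted mean per gene
    F = X - mu[None, :]                     # centered gene vectors
    num = np.sum(F * (L @ F), axis=0)       # f^T L f, one value per gene
    den = np.sum(F * (d[:, None] * F), axis=0) + 1e-12
    return num / den

def select_genes(X, T):
    """Keep the top-T genes ranked by the harmonic mean of Laplace scores
    from Pearson, Spearman, and cosine similarity graphs."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # L2-normalize cells
    sims = [
        np.abs(np.corrcoef(Xn)),                    # Pearson
        np.abs(np.corrcoef(rankdata(Xn, axis=1))),  # Spearman (Pearson on ranks)
        np.clip(Xn @ Xn.T, 0, None),                # cosine
    ]
    scores = np.array([laplacian_scores(Xn, W) for W in sims])
    hmean = scores.shape[0] / np.sum(1.0 / (scores + 1e-12), axis=0)
    keep = np.argsort(hmean)[::-1][:T]              # higher score is retained
    return X[:, keep], keep
```

The exact similarity construction and score convention should be taken from [14]; this sketch only fixes the overall shape of the step.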
Learning the self-expression matrix
Due to the limitations of the sequencing technique, scRNA-seq data exhibit high sparsity. Therefore, the theory of sparse subspaces [19], an approach for uncovering the internal structure of complex data in an unsupervised manner, is applied to represent the similarity between cells. The calculation of the self-expression matrix is a critical step in clustering, i.e., the expression profile of a cell is mathematically described as a linear combination of the expression profiles of the cells predicted to be of the same type [8]. It is able to capture global structural information and create more reliable similarity. Nevertheless, it is a challenging task to extract robust descriptive features from the high-dimensional scRNA-seq data. In this section, a deep autoencoder neural network is constructed to project them into a low-dimensional space, so as to acquire low-dimensional representations with rich non-linear features. As illustrated in Step 2 of Fig 1, two three-layer fully-connected neural networks are adopted as encoder and decoder, with $e_i$ and $d_i$ (i = 1, 2, 3) neurons on the i-th layer of the encoder and decoder, respectively. The hidden layer, extracted from the preprocessed expression matrix through the encoder, is adopted to calculate the self-expression matrix. The loss function can be formulated as follows:
$$Z = E(X), \qquad \hat{X} = D(ZM) \tag{1}$$

$$L_{SE} = \lVert Z - ZM \rVert_F^2 + \lambda \lVert M \rVert_1 \tag{2}$$

where $\hat{X}$ denotes the reconstructed data, M is the self-expression matrix, E(⋅) and D(⋅) represent two nonlinear mappings, i.e., the encoding and decoding processes, and Z is the low-dimensional embedding features. The term $\lVert M \rVert_1$ imposes a sparsity restriction on the self-expression matrix.
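The sparse self-expression objective of Eq (2) can be illustrated in isolation. The sketch below treats the embedding Z as given (one cell per column) and minimizes the objective by proximal gradient descent (ISTA) with a zero-diagonal constraint; in scPEDSSC the matrix is learned jointly with the autoencoder, so this stand-alone solver is only an illustration of the objective, not the training procedure.

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of t * ||.||_1 (entry-wise soft-thresholding)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def learn_self_expression(Z, lam=0.1, n_iter=500):
    """Minimize ||Z - Z M||_F^2 + lam * ||M||_1 over M with diag(M) = 0,
    using ISTA; Z is d x m with one cell per column."""
    m = Z.shape[1]
    M = np.zeros((m, m))
    step = 1.0 / (np.linalg.norm(Z, 2) ** 2 + 1e-12)   # safe step size
    for _ in range(n_iter):
        grad = Z.T @ (Z @ M - Z)          # (half-)gradient of the quadratic term
        M = soft_threshold(M - step * grad, step * lam)
        np.fill_diagonal(M, 0.0)          # a cell must not represent itself
    return M
```

With cells drawn from a low-dimensional subspace, the learned M reconstructs each cell from the others while staying sparse.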
It is crucial to select an appropriate probability distribution function to model the distributional properties of scRNA-seq data. The ZINB distribution has been applied in most models [8,18,20], for it simulates the sparsity of single-cell data well. However, it has been discovered that the non-zero values in normalized scRNA-seq data usually present two features, bimodality and right-skewness [15,21], which are neglected by the ZINB distribution. Therefore, in this study, the TPGG distribution [15], which takes both features into full consideration, is employed. As shown in Fig 1, four additional fully-connected layers (denoted with four different colors) are applied in the decoder to simulate the TPGG distribution, as represented in Eq (3):

$$f_{TPGG}(x; \pi, \alpha, \beta, \gamma) = \begin{cases} 1 - \pi, & x = 0 \\ \pi \cdot f_{GG}(x; \alpha, \beta, \gamma), & x > 0 \end{cases} \tag{3}$$

where π (π ∈ [0,1]) is the parameter of the Bernoulli distribution, fitting the probability of observing a positive-versus-zero outcome, and α, β, and γ (α > 0, β > 0, γ > 0) are the shape and scale parameters of the generalized gamma distribution, as shown in Eq (4):

$$f_{GG}(x; \alpha, \beta, \gamma) = \frac{\gamma}{\beta^{\alpha}\,\Gamma(\alpha/\gamma)}\, x^{\alpha - 1}\, e^{-(x/\beta)^{\gamma}} \tag{4}$$

here Γ(⋅) denotes the gamma function. As indicated in Fig 1, the autoencoder is utilized to estimate the four parameters, which are set as the decoder outputs through four fully connected layers. The rules of forward propagation are illustrated as follows:

$$H = D_{-1}(ZM), \quad \Pi = \mathrm{sigmoid}(W_{\Pi} H), \quad A = \sigma(W_A H), \quad B = \sigma(W_B H), \quad Y = \sigma(W_Y H) \tag{5}$$

In Eq (5), the first equation represents the process of forward propagation, where $D_{-1}$ denotes the penultimate layer of the decoder network, applied to the self-expressed embedding of the preprocessed gene expression matrix X. σ(⋅) is the activation function, and the ReLU function is used here. W denotes a weight matrix. Π, A, B, and Y represent the four inferred parameter matrices outputted by the decoder. Then the negative log-likelihood of TPGG is used to construct the loss function, connecting the inputs and outputs efficiently, as follows:
$$L_{TPGG} = -\sum_{i,j} \log f_{TPGG}\big(x_{ij};\, \Pi_{ij}, A_{ij}, B_{ij}, Y_{ij}\big) + \frac{\eta}{2} \sum_{l \in S} \lVert W_l \rVert_F^2 \tag{6}$$

where S denotes the index set {0, ⋯, D−2, Π, A, B, Y} of the weight matrices in the network, and η is a regularization coefficient. The regularization term attempts to prevent the effect of static noise on the optimization objective and the irrelevant components of the learnable parameters. Thus, the final loss function of the presented model is formulated as below:

$$L = \lambda_1 L_{TPGG} + \lambda_2 \lVert Z - ZM \rVert_F^2 + \lambda_3 \lVert M \rVert_1 \tag{7}$$

here $\lambda_1$, $\lambda_2$, $\lambda_3$ are three hyperparameters. Based on the loss function L, the model is trained with learning rate lr.
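The negative log-likelihood term of Eq (6) can be sketched as follows. The generalized gamma density is written here in the Stacy parameterization; the exact parameterization used by scPEDSSC follows [15] and may differ, so this sketch is illustrative only.

```python
import numpy as np
from scipy.special import gammaln

def tpgg_nll(x, pi, alpha, beta, gamma):
    """Negative log-likelihood of a two-part generalized gamma model:
    P(x = 0) = 1 - pi; for x > 0, pi * GG(x), with the Stacy density
    GG(x) = gamma * x**(alpha-1) * exp(-(x/beta)**gamma)
            / (beta**alpha * Gamma(alpha/gamma))  (an assumed form)."""
    eps = 1e-10
    zero = x <= eps
    xs = np.where(zero, 1.0, x)            # placeholder to keep log() finite
    log_gg = (np.log(gamma) - alpha * np.log(beta) - gammaln(alpha / gamma)
              + (alpha - 1.0) * np.log(xs)
              - np.where(zero, 0.0, x / beta) ** gamma)
    ll = np.where(zero, np.log(1.0 - pi + eps), np.log(pi + eps) + log_gg)
    return -ll.sum()
```

As a sanity check, a π matching the empirical fraction of positive entries yields a lower NLL than a badly mismatched one.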
Constructing similarity matrix from enhanced self-expression one
As mentioned above, although the learned self-expression matrix is able to capture the global structural information among cells, some inherent higher-order relations [22] remain unextracted. Therefore, the self-expression matrix is enhanced by taking its second power. Let matrix M be the learned m × m self-expression matrix, where both rows and columns represent cells, and each entry M[i,j] measures the relationship from the i-th cell to the j-th one. The intuition behind taking the second power is that the direct relationship from cell i to cell j may be strengthened through the transitivity of relationships; the enhancement is proportional to the number of intermediary cells transmitting relationships and to the strength of the relationships with those intermediaries. Let $\tilde{M}$ denote the enhanced matrix, where $\tilde{M}[i,j]$ (i, j = 1, 2, …, m) is calculated as Eq (8):

$$\tilde{M}[i,j] = M[i,j] + \sum_{k=1}^{m} M[i,k]\, M[k,j] \tag{8}$$
Let us take Fig 2 as an example, where the relationships among cells $c_1$, $c_2$, $c_3$, and $c_4$ are denoted with directed edges. In Fig 2A, a potential direct relationship may be created between cells $c_1$ and $c_2$ through intermediary cells $c_3$ and $c_4$. Its strength is then set to 0.0 + 0.1 × 0.1 + 0.2 × 0.2 = 0.05. Similarly, in Fig 2B, the strength of the relationship between cells $c_1$ and $c_2$ is updated to 0.1 + 0.1 × 0.1 + 0.2 × 0.2 = 0.15.
Given the enhanced self-expression matrix $\tilde{M}$, the similarity matrix S is constructed as follows:

$$S = \frac{\lvert \tilde{M} \rvert + \lvert \tilde{M} \rvert^{\mathsf{T}}}{2} \tag{9}$$
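The enhancement of Eq (8), together with a symmetrization of the enhanced matrix (a common choice in subspace clustering, assumed here for the construction labelled Eq (9)), amounts to a few lines:

```python
import numpy as np

def enhance(M):
    """Second-order proximity: M~[i,j] = M[i,j] + sum_k M[i,k] * M[k,j]."""
    return M + M @ M

def similarity(M):
    """Symmetric similarity from the enhanced self-expression matrix
    (assumed symmetrization: S = (|M~| + |M~|^T) / 2)."""
    Mt = enhance(M)
    return (np.abs(Mt) + np.abs(Mt).T) / 2.0

# The Fig 2A example: no direct edge, two two-step paths of
# strength 0.1 * 0.1 and 0.2 * 0.2 through two intermediary cells.
M = np.zeros((4, 4))
M[0, 2], M[2, 1] = 0.1, 0.1
M[0, 3], M[3, 1] = 0.2, 0.2
print(enhance(M)[0, 1])   # approximately 0.05, matching the example
```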
Spectral clustering
Given the constructed similarity matrix S, spectral clustering, which has the advantages of model simplicity and robustness, is adopted to cluster the cells. It begins with decomposing the similarity matrix S with the Singular Value Decomposition (SVD) algorithm and normalizing the left singular vectors with the L2 norm and the max norm. Let U denote the matrix of normalized left singular vectors; the matrix $UU^{\mathsf{T}}$ is obtained and, for the convenience of description, still denoted as S. Then the Laplace matrix L = D − AM is constructed to acquire its eigenvalues and eigenvectors, where AM is the adjacency matrix generated by performing the K-Nearest Neighbor (KNN) algorithm on matrix S (K = 10) [8], and D is the degree matrix. Finally, the K-means algorithm is employed to acquire the clustered cells, where the number of clusters is set to the actual number of labels. A detailed illustration of spectral clustering can be found in previous literature [23,24].
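The clustering stage above can be sketched as follows. Several details are assumptions for the sketch: the SVD post-processing keeps the leading singular vectors before forming UUᵀ, the KNN graph keeps each cell's K strongest similarities, and K-means runs on the bottom eigenvectors of the unnormalized Laplacian; the authors' exact choices may differ.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(S, n_clusters, knn=10, seed=0):
    """Cluster cells from an m x m similarity matrix S."""
    m = S.shape[0]
    # SVD post-processing: leading left singular vectors,
    # row-normalized (L2, then max norm), refined as U U^T
    U, _, _ = np.linalg.svd(S)
    U = U[:, :n_clusters]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    U = U / (np.abs(U).max() + 1e-12)
    A0 = np.abs(U @ U.T)
    # KNN adjacency: keep each cell's K strongest links, then symmetrize
    A = np.zeros((m, m))
    for i in range(m):
        order = np.argsort(A0[i])[::-1]
        nb = order[order != i][:knn]
        A[i, nb] = A0[i, nb]
    A = (A + A.T) / 2.0
    # unnormalized Laplacian; embed cells with its bottom eigenvectors
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, :n_clusters]
    _, labels = kmeans2(emb, n_clusters, minit='++', seed=seed)
    return labels
```

On a block-diagonal similarity matrix, the pipeline recovers the blocks as clusters.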
Results
In this section, real scRNA-seq datasets were adopted to compare the performance of method scPEDSSC with eight state-of-the-art methods: two traditional methods, NMF [6] and SIMLR [7]; four deep learning-based methods, scCCL [11], scBKAP [10], scMCKC [12], and scDCC [25]; and two subspace clustering methods, SSRE [14] and scDSSC [8]. The source code of the comparison methods was acquired from the literature. All of the experiments were conducted on an Intel Core i7-12700 (2.10 GHz) with 16 GB RAM. The operating system was Windows 11, and the deep learning framework was TensorFlow 1.2.1 for method scBKAP and PyTorch (under Python 3.8) for the other methods.
Datasets
Twelve real scRNA-seq datasets were collected from public databases or published studies. The number of cells ranges from hundreds to thousands, and the number of genes ranges from thousands to tens of thousands. The details of the datasets are exhibited in Table 1.
Evaluation metrics and parameter settings
As performed in previous studies [8,14,15,25], two widely used evaluation metrics, the Adjusted Rand Index (ARI) [37] and Normalized Mutual Information (NMI) [38], were adopted to quantitatively evaluate the clustering performance. Both evaluate clustering by assessing the agreement between the genuine class labels and the predicted cluster ones; the larger they are, the better a clustering result is. Given a group of m cells C, let $P = \{P_1, P_2, \ldots, P_{K_1}\}$ denote the genuine partition of C into $K_1$ subsets, and let $Q = \{Q_1, Q_2, \ldots, Q_{K_2}\}$ denote the predicted partition of C into $K_2$ subsets. The calculation of ARI is as Eq (10):

$$\mathrm{ARI} = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)} \tag{10}$$

where a represents the number of pairs of cells in C that are in the same subset in both P and Q, b denotes the number of pairs that are in the same subset in P but in different subsets in Q, c equals the number of pairs that are in different subsets in P but in the same subset in Q, and d denotes the number of pairs that are in different subsets in both P and Q. NMI is calculated as in Eqs (11)–(14):

$$\mathrm{NMI} = \frac{\mathrm{MI}(P, Q)}{\max\big(H(P), H(Q)\big)} \tag{11}$$

$$\mathrm{MI}(P, Q) = \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} p(i,j) \log \frac{p(i,j)}{p(i)\, p(j)} \tag{12}$$

$$H(P) = -\sum_{i=1}^{K_1} p(i) \log p(i) \tag{13}$$

$$H(Q) = -\sum_{j=1}^{K_2} p(j) \log p(j) \tag{14}$$

where MI(P, Q) represents the mutual information of P and Q, H(P) (resp. H(Q)) represents the entropy of P (resp. Q), $p(i) = |P_i|/m$, $p(j) = |Q_j|/m$, and $p(i,j) = |P_i \cap Q_j|/m$.
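The two metrics can be computed directly from the pair counts and contingency probabilities described above. The sketch below normalizes NMI by the maximum of the two entropies, which is one of several common conventions and an assumption here:

```python
from itertools import combinations
from math import log

def ari(true, pred):
    """Adjusted Rand Index from the four pair counts a, b, c, d."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true)), 2):
        same_t, same_p = true[i] == true[j], pred[i] == pred[j]
        if same_t and same_p:
            a += 1
        elif same_t:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    denom = (a + b) * (b + d) + (a + c) * (c + d)
    return 2.0 * (a * d - b * c) / denom if denom else 1.0

def nmi(true, pred):
    """NMI = MI(P, Q) / max(H(P), H(Q))  (assumed normalization)."""
    m = len(true)
    P = {t: {i for i, x in enumerate(true) if x == t} for t in set(true)}
    Q = {q: {i for i, x in enumerate(pred) if x == q} for q in set(pred)}
    H = lambda part: -sum((len(s) / m) * log(len(s) / m) for s in part.values())
    mi = sum((len(Pi & Qj) / m)
             * log((len(Pi & Qj) / m) / ((len(Pi) / m) * (len(Qj) / m)))
             for Pi in P.values() for Qj in Q.values() if Pi & Qj)
    return mi / max(H(P), H(Q))
```

A perfect clustering yields ARI = NMI = 1, while a partial agreement gives values strictly between 0 and 1.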
The parameters of method scPEDSSC were set as follows: T = 2000, $\lambda_1$ = 0.2, $\lambda_2$ = 1.0, $\lambda_3$ = 0.5, $e_1$ = $d_1$ = 256, $e_2$ = $d_2$ = 32, $e_3$ = $d_3$ = 10, and lr = 0.001, which were ascertained through a large number of experimental tests, as shown in S1 and S2 Tables. The parameters of the other methods were set as in the literature [6–8,10–12,14,25].
Cell type identification and analysis by clustering
In Table 2, the scPEDSSC method is compared with the other methods based on the Normalized Mutual Information. During the experiments, the number of clusters is set to the actual number of labels, i.e., the number of predicted clusters equals the number of genuine cell types. The last row, AVG_Rank, indicates the average rank among the comparative methods; it has the same meaning in the subsequent table, and a smaller AVG_Rank means better performance. As can be seen from the table, the proposed method scPEDSSC achieved the best results on half of the datasets, ranking 2nd on Ting, Deng, Vento and CITE_CBMC, and 3rd on Tasic and HumanLiver. It earned an average rank of 1.6, indicating that it performs better than the other methods in general.
Table 3 illustrates the comparison results in terms of the Adjusted Rand Index. It can be observed that method scPEDSSC still performs the best on most (seven out of twelve) datasets, and its smallest AVG_Rank demonstrates that it has better performance in general than other comparison methods.
Visualization of cell clustering
As mentioned above, spectral clustering is applied on the constructed similarity matrix S, which records the potential correlations among cells. To illustrate these relationships more intuitively, the heatmaps of the similarity matrices for six datasets of different sizes are exhibited in Fig 3. Redder color indicates a stronger correlation, while bluer color indicates a weaker one. From this figure it can be seen that the cells are indeed distributed in different low-dimensional subspaces, and cells belonging to the same subspace have strong relationships with each other.
In Fig 4, the clustering results of the comparison methods on the Darmanis dataset were visually compared using scatter plots. Specifically, t-distributed Stochastic Neighbor Embedding (t-SNE), a popular dimensionality reduction and visualization technique, was applied to the similarity matrix S. It is clearly shown that the scPEDSSC method demonstrates a superior clustering effect to the other methods.
Further, the clustering results of method scPEDSSC on the twelve datasets are depicted in Fig 5. Figs 5A–5G display satisfying clustering visualizations, i.e., the number of clusters is exactly the same as the actual number of cell types, and there is little overlap between different clusters. For the remaining five datasets with many more cell types, poorer clustering visualizations are presented, as in Figs 5H–5M. The reason may be that as the number of cell types increases, the learned hidden feature information contained in the similarity matrix becomes insufficient for distinguishing different cell types.
Ablation experiments
In this section, we validate the effectiveness of introducing the Laplace score based data preprocessing, the TPGG distribution, and the enhanced self-expression matrix. Let DP denote the method of replacing “Laplace score based Data preprocessing” with “a conventional preprocessing implemented using the Scanpy Python package,” TP denote the method of replacing the TPGG distribution with the ZINB one, and ESM denote the method of removing the enhanced self-expression matrix. In Fig 6, the NMI scores are compared for the four methods on datasets Song, Darmanis, Haber, and Tasic. From this figure it can be seen that, the scPEDSSC method can acquire the highest NMI score among the comparative ones on each dataset. Taking dataset Darmanis as an example, the NMI scores of methods DP, TP, ESM, and scPEDSSC are 0.6569, 0.8436, 0.8572, and 0.8614, respectively. Fig 7 demonstrates the ARI values of the four methods on the four datasets. The ARI obtained by the scPEDSSC method is still higher than those of the other three ones on the four datasets.
Conclusion and discussion
Distinguishing the various cell types in scRNA-seq data has been regarded as one of the crucial upstream tasks for conducting cell-related studies. In this paper, a deep sparse subspace clustering method, scPEDSSC, is proposed based on proximity enhancement. It begins with screening genes in terms of Laplace scores. Then it constructs a self-expression matrix by training a deep auto-encoder that adopts the TPGG distribution. The self-expression matrix is further enhanced to produce a similarity matrix for conducting spectral clustering. Twelve real biological datasets were adopted to compare method scPEDSSC with eight state-of-the-art single-cell clustering ones. The experimental results indicate that the proposed method scPEDSSC performs better than the other comparison methods in general.
However, during the experiments, it was noticed that the performance of method scPEDSSC is affected negatively by the number of clusters and cells, i.e., the learned hidden feature information is insufficient for distinguishing different cell types when the cluster number or the cell number is large. This may be because the probability distribution function cannot model the distributional properties of scRNA-seq data very well in those cases. A more appropriate probability distribution function should be devised, which will be studied in future work.
Supporting information
S1 Table. The NMI and ARI scores under different $e_1$, $e_2$, $e_3$, $d_1$, $d_2$, $d_3$, and lr ($\lambda_1$ = 0.2, $\lambda_2$ = 1.0, $\lambda_3$ = 0.5).
https://doi.org/10.1371/journal.pcbi.1012924.s001
(XLSX)
S2 Table. The NMI and ARI scores under different $\lambda_1$, $\lambda_2$, and $\lambda_3$ ($e_1$ = $d_1$ = 256, $e_2$ = $d_2$ = 32, $e_3$ = $d_3$ = 10, lr = 0.001).
https://doi.org/10.1371/journal.pcbi.1012924.s002
(XLSX)
Acknowledgments
The authors are grateful to Profs. Junyi Li, Bin Yu, Xin Gao, Tian Tian, Jie Zhang, JianPing Zhao, ChunHou Zheng, YanSen Su, Xiangtao Chen, Jiawei Luo, and Min Li for kindly offering the source codes and the biological datasets.
References
- 1. Song L, Pan S, Zhang Z, Jia L, Chen WH, Zhao XM. STAB: a spatio-temporal cell atlas of the human brain. Nucleic Acids Res. 2021;49(D1):D1029–37. pmid:32976581
- 2. Tang F, Barbacioru C, Nordman E, Li B, Xu N, Bashkirov VI, et al. RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nat Protoc 2010;5(3):516–35. pmid:20203668
- 3. Li J, Yu C, Ma L, Wang J, Guo G. Comparison of Scanpy-based algorithms to remove the batch effect from single-cell RNA-seq data. Cell Regen 2020;9(1):1–8. pmid:32632608
- 4. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 2019;20(5):273–82. pmid:30617341
- 5. Elowitz MB, Levine AJ, Siggia ED, Swain PS. Stochastic gene expression in a single cell. Science 2002;297(5584):1183–6. pmid:12183631
- 6. Shao C, Höfer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics 2017;33(2):235–42. pmid:27663498
- 7. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017;14(4):414–6. pmid:28263960
- 8. Wang H, Zhao J, Zheng C, Su Y. scDSSC: deep sparse subspace clustering for scRNA-seq data. PLoS Comput Biol 2022;18(12):e1010772. pmid:36534702
- 9. Tian T, Wan J, Song Q, Wei Z. Clustering single-cell RNA-seq data with a model-based deep learning approach. Nat. Mach. Intell 2019;1(4):191–8. https://www.nature.com/articles/s42256-019-0037-0
- 10. Wang X, Gao H, Qi R, Zheng R, Gao X, Yu B. scBKAP: a clustering model for single-cell RNA-Seq data based on bisecting K-means. IEEE/ACM Trans Comput Biol Bioinform 2023;20(3):2007–15.
- 11. Du L, Han R, Liu B, Wang Y, Li J. ScCCL: Single-cell data clustering based on self-supervised contrastive learning. IEEE/ACM Trans Comput Biol Bioinform 2023;20(3):2233–41.
- 12. He Y, Chen X, Tu NH, Luo J. Deep multi-constraint soft clustering analysis for single-cell RNA-seq data via zero-inflated autoencoder embedding. IEEE/ACM Trans Comput Biol Bioinform 2023;20(3):2254–65. pmid:37022218
- 13. Zheng R, Li M, Liang Z, Wu FX, Pan Y, Wang J. SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics 2019;35(19):3642–50. pmid:30821315
- 14. Liang Z, Li M, Zheng R, Tian Y, Yan X, Chen J, et al. SSRE: cell type detection based on sparse subspace representation and similarity enhancement. Genomics Proteomics Bioinformatics 2021;19(2):282–91. pmid:33647482
- 15. Zhao S, Zhang L, Liu X. AE-TPGG: a novel autoencoder-based approach for single-cell RNA-seq data imputation and dimensionality reduction. Front Comput Sci (Berl) 2023;17(3):173902. pmid:36320820
- 16. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Hemberg M. SC3—consensus clustering of single-cell RNA-Seq data. Nat Methods 2017;14(5):483–6. pmid:28346451
- 17. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018;15(7):539–42. pmid:29941873
- 18. Zhang Z, Cui F, Wang C, Zhao L, Zou Q. Goals and approaches for each processing step for single-cell RNA sequencing data. Brief Bioinform. 2020;(1):bbaa314. pmid:33316046
- 19. Elhamifar E, Vidal R. Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans Pattern Anal Mach Intell 2012;35(11):2765–2781. pmid:24051734
- 20. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019;10(1):390. pmid:30674886
- 21. Lun ATL, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol 2016;17(1):75. pmid:27122128
- 22. Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web; 2015. pp. 1067–77.
- 23. Bach F, Jordan M. Learning spectral clustering. In: Advances in Neural Information Processing Systems; 2003.
- 24. Ye X, Zhao J, Chen Y, Guo LJ. Bayesian adversarial spectral clustering with unknown cluster number. IEEE Trans Image Process 2020;29:8506–18. pmid:32813658
- 25. Tian T, Zhang J, Lin X, Wei Z, Hakonarson H. Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data. Nat Commun 2021;12(1):1873. pmid:33767149
- 26. Ting DT, Wittner BS, Ligorio M, Jordan NV, Shah AM, Miyamoto DT, et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep 2014;8(6):1905–18. pmid:25242334
- 27. Goolam M, Scialdone A, Graham SJL, Macaulay IC, Jedrusik A, Hupalowska A, et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in four-cell mouse embryos. Obstet Gynecol Surv 2016;71(7):411–12.
- 28. Deng Q, Ramsköld D, Reinius B, Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 2014;343(6167):193–6. pmid:24408435
- 29. Engel I, Seumois G, Chavez L, Samaniego-Castruita D, White B, Chawla A, et al. Innate-like functions of natural killer T cell subsets result from highly divergent gene programs. Nat Immunol 2016;17(6):728–39. pmid:27089380
- 30. Song Y, Botvinnik OB, Lovci MT, Kakaradov B, Liu P, Xu JL, et al. Single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation. Mol Cell. 2017;67(1):148–161.e5. pmid:28673540
- 31. Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol 2014;32(10):1053–8. pmid:25086649
- 32. Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, et al. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci U S A 2015;112(23):7285–90. pmid:26060301
- 33. Haber AL, Biton M, Rogel N, Herbst RH, Shekhar K, Smillie C, et al. A single-cell survey of the small intestinal epithelium. Nature 2017;551(7680):333–9. pmid:29144463
- 34. Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci. 2016;19(2):335–46. pmid:26727548
- 35. Vento-Tormo R, Efremova M, Botting RA, Turco MY, Vento-Termo M, Meyer KB, et al. Single-cell reconstruction of the early maternal–fetal interface in humans. Nature 2018;563(7731):347–53. pmid:30429548
- 36. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8(1):14049. pmid:28091601
- 37. Meilă M. Comparing clusterings: an information based distance. J Multivar Anal 2007;98(5):873–95.
- 38. Strehl A, Ghosh J. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 2003;3(3):583–617.