Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE

Yuta Hozumi; Guo-Wei Wei

doi:10.1371/journal.pone.0311791

Abstract

Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Correlated clustering and projection (CCP) was recently introduced as an effective method for preprocessing scRNA-seq data. CCP utilizes gene-gene correlations to partition the genes and, based on the partition, employs cell-cell interactions to obtain super-genes. Because CCP is a data-domain approach that does not require matrix diagonalization, it can be used in many downstream machine learning tasks. In this work, we utilize CCP as an initialization tool for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE). Using 21 publicly available datasets, we have found that CCP significantly improves UMAP and tSNE visualization and dramatically improves their accuracy. More specifically, CCP improves UMAP by 22% in ARI, 14% in NMI and 15% in ECM, and improves tSNE by 11% in ARI, 9% in NMI and 8% in ECM.

Citation: Hozumi Y, Wei G-W (2024) Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE. PLoS ONE 19(12): e0311791. https://doi.org/10.1371/journal.pone.0311791

Editor: Andrea Tangherloni, Bocconi University: Universita Bocconi, ITALY

Received: February 20, 2024; Accepted: September 24, 2024; Published: December 13, 2024

Copyright: © 2024 Hozumi, Wei. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data are publicly available from the Gene Expression Omnibus database with the accession numbers GSE75748, GSE82187, GSE94820, GSE67835, GSE84133, GSE109774, GSE85241, GSE74672, SCP1749. We have uploaded the data and code to reproduce our work to figshare, under https://doi.org/10.6084/m9.figshare.26501389.v7.

Funding: National Institutes of Health (NIH) grants R01GM126189, R01AI164266, and R35GM148196; National Science Foundation (NSF) grants DMS-2052983, DMS-1761320, and IIS-1900473; National Aeronautics and Space Administration (NASA) grant 80NSSC21M0023; Michigan State University (MSU) Foundation; Bristol-Myers Squibb 65109; Pfizer. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that profiles the transcriptome of individual cells within a tissue or organ, aiming to gain understanding of gene expression, gene regulation, cell-cell interaction, spatial transcriptomics, signal transduction pathways, and more [1]. The typical workflow of scRNA-seq involves cell isolation, RNA extraction, library preparation, sequencing, and data analysis. Through technological advances in the experimental procedures, the read quality has improved, and over 10,000 samples can now be sequenced at once. Despite these improvements, the data still contain nonuniform noise, are often unlabeled, and are high dimensional. To analyze such complex data, a standard data analysis pipeline involves data preprocessing, gene expression quantification, normalization and batch correction, dimensionality reduction, cell type identification, differential gene expression analysis, and pathway and functional analysis [2–7]. An effective dimensionality reduction must be employed in order to have a meaningful downstream analysis.

Two of the most popular dimensionality reduction methods for scRNA-seq are principal component analysis (PCA) and nonnegative matrix factorization (NMF). PCA projects the data onto a sequence of orthogonal components. The first component is called the principal component, where the variance of the projected data is maximized. The subsequent ith component is orthogonal to the first i − 1 components, and maximizes the variance of the residual data projected onto the ith component [8]. PCA aims to obtain a lower dimensional representation of the data to identify important gene patterns. Many PCA derivatives have also been used for scRNA-seq data analysis [9–12]. In particular, the popular package Seurat [13] utilizes supervised PCA to find an optimal projection that incorporates the local structure of the reference data for the downstream analysis. However, because PCA requires matrix diagonalization and its projected data contain negative values, it is difficult to interpret. In contrast, NMF has an additional constraint such that the low dimensional representation is nonnegative. Each component, often called a metagene in scRNA-seq analysis, is a linear combination of the original genes [14]. NMF has seen numerous extensions to the original formulation, including robustness to noise and manifold regularization [15–21]. Through the nonnegative constraint, NMF is highly interpretable, and may be more suitable for downstream analysis.

Deep learning and ensemble methods are another class of approaches that have become popular for single-cell RNA-seq analysis. Single-cell variational inference (scVI) utilizes deep neural networks to obtain information from similar cells and genes to approximate the distribution of underlying gene expression values [22]. Single-cell cluster using marker genes (SCMcluster) utilizes known marker genes to guide feature selection and perform ensemble clustering [23]. AutoCell [24] utilizes a variational autoencoding network that combines the Gaussian mixture model and graph embedding to model the high dimensional scRNA-seq data. Diffusion models [25–28], generative adversarial networks (GAN) [27], language models [29–32], transformers [33–37], ensemble methods [38–40] and more [41–44] have also been used for scRNA-seq analysis. Though these methods perform well, they rely on careful curation of data and often require large amounts of data for pretraining.

Visualization of the data is also an important aspect of the scRNA-seq analysis pipeline. After data preprocessing and feature extraction through dimensionality reduction, the visualization of data commonly involves uniform manifold approximation and projection (UMAP) or t-distributed stochastic neighbor embedding (tSNE) [45, 46]. UMAP obtains its visualization by constructing a weighted k-nearest-neighbor graph of the original space and minimizing the edge-wise cross entropy between this graph and the weighted graph of the low dimensional embedding. Through the utilization of stochastic gradient descent, UMAP demonstrates a notable improvement in speed and scalability compared to other Laplacian eigenmap-based algorithms. tSNE computes data similarity by constructing a conditional probability distribution among pairs of data points. It employs the Student’s t-distribution to derive the probability distribution of the low-dimensional embedded space. Subsequently, it minimizes the Kullback-Leibler (KL) divergence between these two distributions to obtain the visualization [46, 47]. These visualization methods are widely utilized in popular analysis pipelines such as Scanpy for Python and Seurat for R. They serve as crucial tools to ensure proper feature selection and facilitate comprehensive data exploration.

We proposed correlated clustering and projection (CCP) as a general approach for dimensionality reduction [48, 49]. CCP is a data-domain method that completely bypasses matrix diagonalization. It partitions genes into clusters based on their similarities and then projects the genes in each cluster into a super-gene, which is a measure of accumulated gene-gene correlations among cells. The gene partition can be realized by either the standard k-means or the k-medoids algorithm, using either covariance distance or correlation distance. The flexibility rigidity index (FRI) is used for the nonlinear projection [50]. The resulting super-genes are all non-negative and highly interpretable. Recently, CCP has been applied to the clustering and classification of scRNA-seq datasets [49], where it showed an improvement over PCA across 14 scRNA-seq datasets. This indicates CCP’s promising performance in handling single-cell RNA sequencing data.

The aim of this study is to explore CCP’s potential as the primary dimensionality reduction method for visualization, particularly focusing on its application in initializing UMAP and tSNE, two highly effective visualization tools in scRNA-seq analysis. Additionally, we introduce a novel method for handling low-variance (LV) genes. Instead of discarding low-variance genes like many other methods, we group them together into a single category. This grouping is achieved by projecting them into one descriptor using FRI. Dropping low-variance genes is problematic because scRNA-seq data often contain cell types in unequal proportions. Moreover, there are numerous genes with low expression, and removing too many genes may result in overlooking cell outliers. The LV-gene addresses this issue by consolidating low-variance genes into one descriptor, thereby increasing their predictive power. Through experimentation on 21 publicly available datasets, we evaluated CCP-assisted UMAP and CCP-assisted tSNE. Our findings show that CCP enhances UMAP by 22% in Adjusted Rand Index (ARI), 14% in Normalized Mutual Information (NMI), and 15% in Element-Centric Measure (ECM). Similarly, CCP improves tSNE by 11% in ARI, 9% in NMI, and 8% in ECM.

Methods and algorithms

Consider a scRNA-seq dataset $\mathcal{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_M\}$ with $\mathbf{z}_m \in \mathbb{R}^{I}$, where M is the number of cells and I is the number of genes. CCP finds an N-dimensional representation $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_M\}$ with $\mathbf{x}_m \in \mathbb{R}^{N}$, in which 1 ≤ N << I, by using a data-domain two-step strategy: gene clustering and gene projection. Fig 1 shows the workflow of CCP, and the details of each step are outlined below. First, the genes are clustered according to their similarities. Then, for each gene cluster, the genes are projected into 1 descriptor called the super-gene. The resulting component $x_m^n$ can be regarded as the nth super-gene for the mth cell. Then, subsequent analysis, such as 2D visualization using UMAP and tSNE, can be performed.


Fig 1. Workflow of CCP.

From the input gene expression matrix, the genes are partitioned into groups, according to their similarity. Then, each group of genes is projected into 1 descriptor called super-gene. UMAP and tSNE can further be applied to the super-gene to visualize the cells in 2 dimensions.

https://doi.org/10.1371/journal.pone.0311791.g001

Gene clustering

To facilitate gene clustering, we emphasize gene vectors by setting the original data as $\mathcal{Z} = \{\mathbf{z}^1, \ldots, \mathbf{z}^I\}$, where $\mathbf{z}^i \in \mathbb{R}^{M}$ represents the ith gene vector for the data. CCP partitions the gene vectors into N components with 1 ≤ N << I by a clustering technique, such as k-means or k-medoids. CCP seeks an optimal disjoint partition of the data $\mathcal{Z} = \uplus_{n=1}^{N} \mathcal{Z}^n$, for a given N, where $\mathcal{Z}^n$ is the nth partition (cluster) of the genes. To this end, the correlations among gene vectors $\mathbf{z}^i$ are analyzed according to appropriate correlation measures, such as covariance distance and correlation distance. Note that the clustering is performed on the genes, rather than the cells.

Let S = {1, …, I} be the enumeration of the genes. We can partition S into S¹, …, S^N using the gene clustering results by letting $S^n = \{ i \mid \mathbf{z}^i \in \mathcal{Z}^n \}$. Then, $\mathbf{z}_m^{S^n} \in \mathbb{R}^{|S^n|}$ denotes the Sⁿ genes of the mth cell. Further detail can be found in Section S1.1 of the S1 File.
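The gene clustering step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain k-means (Lloyd iterations) on gene vectors that are centered and scaled to unit norm, so that the squared Euclidean distance between two gene vectors equals twice the correlation distance. The function name cluster_genes and its arguments are hypothetical, and the dense distance computation is only suitable for toy-sized inputs.

```python
import numpy as np

def cluster_genes(Z, n_clusters, n_iter=50, seed=0):
    """Partition the genes (columns of Z) into n_clusters index sets S^1..S^N.

    Z is an (M cells x I genes) matrix. Illustrative assumption: k-means on
    centered, unit-norm gene vectors, for which squared Euclidean distance
    equals 2 * (1 - Pearson correlation).
    """
    rng = np.random.default_rng(seed)
    G = np.asarray(Z, dtype=float).T            # (I, M): one row per gene vector z^i
    G = G - G.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    G = G / np.where(norms > 0, norms, 1.0)     # constant genes stay at the origin
    centers = G[rng.choice(len(G), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every gene to every center, shape (I, n_clusters)
        d2 = ((G[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for n in range(n_clusters):
            members = G[labels == n]
            if len(members):                    # keep the old center if a cluster empties
                centers[n] = members.mean(axis=0)
    # S^n is the set of gene indices assigned to cluster n
    return [np.flatnonzero(labels == n) for n in range(n_clusters)]
```

The returned index sets are disjoint and cover all genes, so the columns Z[:, S_n] give the block of expression values that is later projected into the nth super-gene.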

Gene projection.

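Based on the gene partitioning, we denote $\mathbf{z}_m^{S^n}$ as the nth cluster of $S^n$ genes for the mth cell. CCP projects these genes into a super-gene by using the flexibility rigidity index (FRI). Denote $\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|$ as some metric between cell i and cell j for the nth cluster of $S^n$ genes. The gene-gene correlation between the two cells is defined by $C_{ij}^{S^n} = \Phi(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa)$, where Φ is a correlation kernel, with parameters $\eta^{S^n} > 0$, $\tau > 0$ and κ > 0. One may use the Euclidean, Manhattan, or Wasserstein distance as the metric. Additionally, the FRI correlation kernels satisfy the following conditions

$$ \Phi(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa) \to 0 \quad \text{as } \|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\| \to \infty, \tag{1} $$

$$ \Phi(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa) \to 1 \quad \text{as } \|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\| \to 0. \tag{2} $$

Although various radial basis functions can be used in CCP, we consider the generalized exponential function in the present work

$$ \Phi(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa) = \begin{cases} e^{-\left( \|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\| / (\eta^{S^n} \tau) \right)^{\kappa}}, & \|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\| < r_c^{S^n}, \\ 0, & \text{otherwise}, \end{cases} \tag{3} $$

where $r_c^{S^n}$ is the cutoff distance and $\eta^{S^n}$ is the scale, which are defined from the data automatically. Here, κ is the power and τ is a scale parameter.

The gene-gene correlation matrix $C^{S^n} = \{C_{ij}^{S^n}\}$ represents the cell-cell interactions for genes $S^n$, and it captures all the interactions up to a threshold, which is determined by $r_c^{S^n}$. Here, we take $r_c^{S^n}$ to be the 3-standard deviations of the pairwise distances. Additionally, to automatically evaluate $\eta^{S^n}$, we consider the average minimal distance between the cluster of genes

$$ \eta^{S^n} = \frac{1}{M} \sum_{m=1}^{M} \min_{j \neq m} \|\mathbf{z}_m^{S^n} - \mathbf{z}_j^{S^n}\|. \tag{4} $$

Using the correlation function, CCP projects $S^n$ genes into a super-gene using FRI for the ith sample,

$$ x_i^n = \sum_{m=1}^{M} w_{im} \Phi(\|\mathbf{z}_i^{S^n} - \mathbf{z}_m^{S^n}\|; \eta^{S^n}, \tau, \kappa), \tag{5} $$

where $w_{im}$ are the weights. In this work, we set $w_{im} = 1$.

CCP obtains the lower dimensional super-gene representation $\mathbf{x}_i = (x_i^1, \ldots, x_i^N)$ for the ith sample (cell) by running the projection for all gene clusters $\{S^n\}_{n=1}^{N}$.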
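As an illustration of the projection step (Eqs 3 to 5), the following NumPy sketch computes one super-gene from the expression block of a single gene cluster. It is a sketch under stated assumptions rather than the authors' code: the metric is Euclidean, the weights are all 1, and the cutoff distance is assumed to be the mean plus 3 standard deviations of the pairwise distances, which is one possible reading of "3-standard deviations". The function name fri_super_gene is hypothetical.

```python
import numpy as np

def fri_super_gene(Zs, tau=6.0, kappa=2.0):
    """Project one gene cluster into a super-gene value for every cell.

    Zs is the (M cells x |S^n| genes) block of the expression matrix.
    Returns a length-M vector with x_i = sum_m Phi(d_im), weights w_im = 1.
    """
    Zs = np.asarray(Zs, dtype=float)
    # Pairwise Euclidean distances between cells, shape (M, M)
    diff = Zs[:, None, :] - Zs[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    off_diag = ~np.eye(len(D), dtype=bool)
    # Scale eta: average distance from each cell to its nearest other cell (Eq 4)
    eta = np.where(off_diag, D, np.inf).min(axis=1).mean()
    if eta == 0:
        eta = 1.0  # safeguard added for this sketch (all nearest neighbors identical)
    # ASSUMPTION: cutoff = mean + 3 * std of the off-diagonal pairwise distances
    r_c = D[off_diag].mean() + 3.0 * D[off_diag].std()
    # Generalized exponential kernel with cutoff (Eq 3)
    Phi = np.exp(-(D / (eta * tau)) ** kappa)
    Phi[D >= r_c] = 0.0
    # Sum over all cells m, including m = i where Phi = 1 (Eq 5)
    return Phi.sum(axis=1)
```

Stacking the outputs of this function over all N gene clusters as columns gives an M x N super-gene matrix. Because the kernel is non-negative, every super-gene value is non-negative, and cells that lie close to many other cells within the cluster's gene subspace receive larger values.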

Low variance (LV) genes.

One major challenge of scRNA-seq analysis is dealing with sparsity and low variance genes. We propose the low variance (LV) gene, which collapses the low-variance genes into 1 super-gene to increase their predictive power.

Let v = (v₁, …, v_I) be the variance of the genes, where v_i is the variance of gene $\mathbf{z}^i$, and assume that the variances are sorted in descending order. Then, the low variance set P ⊂ S is defined from the sorted variances using a cutoff ratio $0 \le \nu_c \le 1$. Then, we can obtain the cell-cell correlation using these low variance genes, $C_{ij}^{P} = \Phi(\|\mathbf{z}_i^{P} - \mathbf{z}_j^{P}\|; \eta^{P}, \tau, \kappa)$, where Φ is the generalized exponential function, $r_c^{P}$ is taken as the 3-standard deviation of the pairwise distances, and $\eta^{P}$ is the average minimum distance

$$ \eta^{P} = \frac{1}{M} \sum_{m=1}^{M} \min_{j \neq m} \|\mathbf{z}_m^{P} - \mathbf{z}_j^{P}\|. $$

Using the correlation function, CCP projects the |P| genes into a super-gene using FRI for the ith sample,

$$ x_i^{P} = \sum_{m=1}^{M} w_{im} \Phi(\|\mathbf{z}_i^{P} - \mathbf{z}_m^{P}\|; \eta^{P}, \tau, \kappa), $$

where $w_{im}$ are the weights.

For CCP, we compute the LV-gene first, and use the correlated partition algorithm on the remaining genes.

Results

Data preprocessing

We have tested CCP-assisted UMAP and tSNE visualization on 21 publicly available datasets. Table 1 displays information including the Gene Expression Omnibus (GEO) accession ID [51, 52], the reference, data dimensions, and cell composition for each dataset. Additionally, data from the scziDesk paper [53] was utilized and can be accessed from their S1 File. The Qx and Qs data correspond to 10x Genomics and Smart-seq2 data, respectively, from Quake et al. [54]. Notably, the GSE84133 human dataset encompasses all human patient data from Baron et al. [55]. Detailed statistics for each dataset can be found in S1 Table in the S1 File.


Table 1. Dataset name, reference, dimensions and cell type composition.

https://doi.org/10.1371/journal.pone.0311791.t001

To normalize the data, we first scaled the counts of each cell to the median total count across cells. Let $Z \in \mathbb{R}^{M \times I}$ be the data, with M cells and I genes. Each row (cell) was divided by its row sum, followed by multiplication by the median row sum to obtain a normalized count matrix. Finally, a log-transform was applied using Scanpy’s log1p method.
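The normalization described above can be written in a few lines of NumPy. This is a minimal sketch for a dense count matrix, not the pipeline code used in the study; the function name normalize_counts is hypothetical, and the guard for cells with zero total count is an addition made for this example.

```python
import numpy as np

def normalize_counts(counts):
    """Median total-count normalization followed by log(1 + x).

    counts is an (M cells x I genes) matrix of raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)       # total count per cell (row sum)
    target = np.median(totals)                       # median row sum across cells
    # Divide each row by its row sum, then rescale to the median row sum.
    # Cells with zero total count are left as zeros (guard added for this sketch).
    scaled = counts / np.where(totals > 0, totals, 1.0) * target
    return np.log1p(scaled)
```

In Scanpy, the same two steps are typically expressed with sc.pp.normalize_total, whose default target is the median of the total counts, followed by sc.pp.log1p.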

In our benchmarking process, we employed CCP with parameters τ = 6 and κ = 2 to reduce the dimensions to 300 super-genes. Additionally, we utilized a cutoff ratio of ν_c = 0.8 to generate the LV-gene. Clustering was performed using the Leiden algorithm, and we evaluated the quality of clustering using ARI, NMI and ECM by comparing the obtained clusters with the cell types provided by the original authors. Visualizations were generated using Scanpy’s implementation of UMAP and tSNE. To reduce the computation load for datasets exceeding 2,000 samples, we utilized subsampling. All benchmarking was run on Michigan State University’s high performance computing cluster, which uses AMD EPYC 7H12 processors, with 16 GB of memory and 4 CPU cores.

Visualization

Preprocessing of scRNA-seq data is a key step for visualization. Fig 2 shows an example of CCP-assisted tSNE visualization and the original tSNE visualization of the Baron dataset [55]. The original data has 20,125 genes, and aggressively reducing the original dimension to 2 dimensions by tSNE leads to poor visualization. In CCP-assisted tSNE, CCP was utilized to reduce the original genes into 300 super-genes, which were further reduced to 2 dimensions with tSNE for visualization. It is clear that CCP-assisted tSNE significantly improves the visualization quality in this case. We further showcase CCP-assisted visualization on the datasets described in Table 1. We provide additional comparison with PCA-assisted and NMF-assisted visualization in Section S2.2 of the S1 File.


Fig 2. TSNE visualization of GSE84133 mouse2 data.

The left and right figures show the CCP-assisted and the standard tSNE visualization.

https://doi.org/10.1371/journal.pone.0311791.g002

Fig 3 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Quake dataset. Each row corresponds to one of the 5 datasets, and the columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.


Fig 3. Comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on Quake dataset.

The rows correspond to Qx Bladder, Qx Limb Muscle, Qx Diaphragm, Qs Limb Muscle and Qs Trachea. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the Smart-seq2 platform. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.

https://doi.org/10.1371/journal.pone.0311791.g003

CCP improves the overall visualization of the Quake dataset. In the Qx Bladder data, CCP-assisted UMAP and tSNE show 1 bladder cluster, whereas the standard UMAP and tSNE visualizations show an elongated cluster of bladder cells. The urothelial cells in CCP-assisted UMAP and tSNE are divided into 3 subclusters, whereas the standard UMAP and tSNE visualizations show 1 cluster. In the Qs Diaphragm data, CCP-assisted UMAP and tSNE show 5 distinct clusters corresponding to the 5 cell types. However, the standard UMAP visualization does not differentiate the 5 cell types. The standard tSNE visualization shows poor clustering, where satellite cells, mesenchymal cells and endothelial cells form a supercluster. In the Qs Limb Muscle data, all visualizations show a supercluster of B cells and T cells. CCP-assisted visualizations show a clear distinction between the B-T cell supercluster and macrophages, whereas the standard visualizations show a supercluster of B cells, T cells, macrophages and endothelial cells. In the Qs Trachea data, the standard UMAP and tSNE visualizations show a subpopulation of mesenchymal cells within the epithelial cells, whereas their CCP-assisted counterparts do not.

Fig 4 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the GSE75748 cell, GSE75748 time, GSE67835 and GSE82187 datasets. The columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.


Fig 4. Comparison of CCP-assisted UMAP and CCP-assisted tSNE with the standard UMAP and tSNE visualization on GSE75748 cell, GSE75748 time, GSE67835 and GSE82187.

CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.

https://doi.org/10.1371/journal.pone.0311791.g004

In the GSE75748 cell data, all the visualizations are similar. In [56], Chu et al. obtained snapshots of lineage-specific progenitor cells that differentiated from H1 human embryonic stem (ES) cells and compared the gene profiles with undifferentiated H1 and H9 human ES cells as control. Most notably, H1 and H9 clustered together, which is consistent with our visualization. In the GSE75748 time data, all visualizations are comparable. Chu et al. [56] obtained snapshots of ES cell differentiation from pluripotency to definitive endoderm at 0hr, 12hr, 24hr, 36hr, 72hr and 96hr. They noted that the cells sequenced at 72hr and 96hr show relatively similar expression profiles, suggesting that the differentiation has completed by 72hr. We see from our visualization that the 72hr and 96hr cells form a cluster, the 12hr and 24hr cells form a cluster, and the 0hr cells form their own cluster, indicating that there is a clear distinction between the undifferentiated cells and the cells undergoing differentiation. In GSE67835, the CCP-assisted visualizations and their counterparts have comparable results. Most notably, neurons form a distinct cluster in the CCP-assisted visualizations, whereas they do not in the standard visualizations. In the GSE82187 data, CCP-assisted UMAP and tSNE show a significant improvement over standard UMAP and tSNE visualization. Aside from astrocytes and OPC, each cell type forms its own cluster. Standard UMAP and tSNE fail to show significant clustering of the different cell types.

Fig 5 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Baron human dataset [55]. The rows correspond to the patients, and the columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.


Fig 5. Comparison of CCP-assisted UMAP and CCP-assisted tSNE with the standard UMAP and tSNE visualization on GSE84133 human dataset.

Each row corresponds to 1 of the 4 patients. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.

https://doi.org/10.1371/journal.pone.0311791.g005

Overall, CCP-assisted visualizations show stronger clustering. In standard UMAP and tSNE visualizations across all patients, we noticed superclusters with unclear boundaries. Conversely, CCP-assisted visualizations display well-defined boundaries between cell types. Most notable is the clear differentiation of quiescent stellate (Q-Stellate) cells, alpha cells, and ductal cells across all patients, a distinction that is not as evident in the standard visualizations. Additionally, the standard tSNE visualization of patient 3 shows the instability of the standard tSNE algorithm, where the visualization does not differentiate the cell types.

Fig 6 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Baron mouse dataset [55]. The rows correspond to mouse 1 and 2, and the columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.


Fig 6. Comparison of CCP-assisted UMAP and CCP-assisted tSNE with the standard UMAP and tSNE visualization on GSE84133 mouse dataset.

The rows correspond to mouse 1 and 2. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.

https://doi.org/10.1371/journal.pone.0311791.g006

CCP-assisted visualizations demonstrate significantly stronger clustering for both mouse samples. In the standard visualizations, beta cells are scattered among other cell types. Furthermore, in the data from mouse 2, alpha cells do not form a distinct cluster. Conversely, CCP-assisted visualizations distinctly cluster all cell types. Regarding mouse 1, the CCP-assisted visualization does not form a cluster for gamma cells, potentially due to the limited number of available gamma cells.

Fig 7 shows the visualization of the Muraro, Romanov and Qs Lung data. The columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.


Fig 7. Comparison of CCP-assisted UMAP and CCP-assisted tSNE with the standard UMAP and tSNE visualization on Muraro, Romanov and Qs Lung datasets.

CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.

https://doi.org/10.1371/journal.pone.0311791.g007

In the Muraro dataset, CCP-assisted UMAP exhibits a clear separation of A cells, D cells, B cells, and ductal cells. In contrast, the standard UMAP visualization presents these cells as a supercluster. The standard tSNE visualization indicates the instability of the tSNE algorithm, where the visualization is unclear and dominated by an A cell outlier. Regarding the Romanov dataset, all visualizations are relatively similar. CCP-assisted UMAP reveals a distinct cluster of astrocytes and ependymal cells, whereas both the standard UMAP and tSNE display a supercluster of these two cell types. Additionally, CCP-assisted UMAP and tSNE suggest two subclusters of VSM and endothelial cells, which are not discernible in the standard visualization. In the Qs Lung dataset, CCP-assisted and standard visualizations yield comparable results. While the standard tSNE separates monocytes from classical monocytes, CCP-assisted UMAP and tSNE portray a homogeneous clustering of these two cell types.

Accuracy

To assess CCP’s effectiveness as a primary dimensionality reduction tool for UMAP and tSNE, we conducted clustering using the Leiden algorithm within Scanpy. We employed the adjusted Rand index (ARI) and normalized mutual information (NMI) to gauge accuracy by comparing the clustering results with the labels provided by the dataset’s authors. It is important to note that these metrics do not measure absolute accuracy due to the absence of a gold standard dataset for scRNA-seq. Additionally, we used the Element-Centric measure (ECM) [62] to evaluate cluster stability.
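To make the accuracy metric concrete, the sketch below computes the ARI from the contingency table of two labelings, using the standard pair-counting formula. It is an illustration only and assumes at least two samples; in practice one would normally call scikit-learn's adjusted_rand_score and normalized_mutual_info_score. The function name adjusted_rand_index is hypothetical.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    # Map arbitrary labels (strings, ints) to 0..K-1 codes
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    # Contingency table: table[a, b] = number of samples with true a, predicted b
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)

    def pairs(x):
        # Number of unordered pairs that can be drawn from x items
        return x * (x - 1) / 2.0

    sum_cells = pairs(table).sum()              # pairs grouped together in both labelings
    sum_rows = pairs(table.sum(axis=1)).sum()   # pairs grouped together in labels_true
    sum_cols = pairs(table.sum(axis=0)).sum()   # pairs grouped together in labels_pred
    expected = sum_rows * sum_cols / pairs(len(t))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                   # e.g. both labelings have a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The score is 1 when the two labelings agree up to a renaming of the clusters, is close to 0 for random assignments, and can be negative when agreement is worse than chance, which is why it is preferred over raw pair agreement when the number of clusters differs from the number of annotated cell types.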

Fig 8 shows the average ARI, NMI and ECM of CCP-assisted UMAP, CCP-assisted tSNE, UMAP and tSNE across 18 datasets. For each dataset, we used 10 random seeds to perform dimensionality reduction, utilizing Leiden clustering to generate clustering labels. These labels were then compared to the annotated cell types provided by the original authors.


Fig 8. The average ARI, NMI, ECM of 18 datasets.

Ten random initializations were used to compute CCP, CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE for each dataset. Leiden clustering was used to obtain the clustering results.

https://doi.org/10.1371/journal.pone.0311791.g008

CCP-assisted UMAP demonstrates a 22% improvement in ARI, 14% in NMI, and 15% in ECM over standard UMAP. Similarly, CCP-assisted tSNE improves on standard tSNE by 11% in ARI, 9% in NMI and 8% in ECM. Notably, both CCP-assisted UMAP and tSNE yield higher ECM scores, indicating more stable clustering. Interestingly, standard tSNE outperforms standard UMAP. UMAP’s performance relies heavily on accurately finding nearest neighbors, which can be challenging with noisy, sparse, and high-dimensional data. CCP effectively reduces dimensions, enabling UMAP to find neighbors more effectively and resulting in improved visualization.

For a detailed comparison between CCP-assisted, PCA-assisted, and NMF-assisted visualizations, please refer to Section S2.3 in the S1 File. Additionally, we provide the ARI, NMI and ECM for each dataset in S2–S4 Tables of the S1 File.

Discussion

Large data

While CCP proves to be an efficient dimensionality reduction technique for datasets with a large number of features, such as in the case of scRNA-seq data, it may encounter limitations due to the necessity of computing cell-cell correlations for each super-gene. To address this challenge, for larger datasets, we propose a subsampling approach.

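Let $\mathcal{Z}_{\text{train}} = \{\mathbf{z}_1, \ldots, \mathbf{z}_M\}$ be the training data used to develop a CCP model, and $\hat{\mathcal{Z}} = \{\hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_{\hat{M}}\}$ be a new dataset or additional data. Using the training data, the gene partitions $S^n$, cutoff distances $r_c^{S^n}$ and scales $\eta^{S^n}$ are determined. Then, we embed the new data into the trained model, utilizing the following modification to Eq 5

$$ \hat{x}_i^n = \sum_{m=1}^{M} w_{im} \Phi(\|\hat{\mathbf{z}}_i^{S^n} - \mathbf{z}_m^{S^n}\|; \eta^{S^n}, \tau, \kappa), \tag{6} $$

to obtain appropriate super-genes.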
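The embedding of new cells into a trained model can be sketched as follows. This is an illustrative reading of the procedure, not the released code: distances are computed from each new cell to the training cells only, and the scale and cutoff are passed in as values already fitted on the training subsample. The function name project_new_cells and its argument names are hypothetical, and the Euclidean metric and unit weights are assumptions.

```python
import numpy as np

def project_new_cells(Z_new_s, Z_train_s, eta, r_c, tau=6.0, kappa=2.0):
    """Super-gene values of new cells for one gene cluster of a trained model.

    Z_new_s:   (M_new x |S^n|) block of the new cells for gene cluster S^n.
    Z_train_s: (M_train x |S^n|) block of the training cells for the same genes.
    eta, r_c:  scale and cutoff distance fitted on the training data.
    """
    Z_new_s = np.asarray(Z_new_s, dtype=float)
    Z_train_s = np.asarray(Z_train_s, dtype=float)
    # Distances from every new cell to every training cell, shape (M_new, M_train)
    diff = Z_new_s[:, None, :] - Z_train_s[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # Generalized exponential kernel with the training-set cutoff
    Phi = np.exp(-(D / (eta * tau)) ** kappa)
    Phi[D >= r_c] = 0.0
    # Sum over training cells only, with unit weights
    return Phi.sum(axis=1)
```

Under this reading the cost per gene cluster scales with the number of new cells times the number of training cells, rather than with the square of the full dataset size, which is what makes subsampling attractive for large data.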

We verified the subsampling approach on the GSE84133 human and Qx Spleen data. We combined all four patients’ sequencing data into one superset for this analysis. We randomly subsampled 500, 1000, 1500, 2000, 2500, and 3000 samples as training data, and performed the subsampling under 10 random seeds. We projected the testing data using Eq 6, followed by Leiden clustering. ARI and NMI were computed, and the average scores are reported in Fig 9. Notably, both the GSE84133 human and Qx Spleen datasets exhibited consistent and stable results across the different subsample sizes.


Fig 9. UMAP and tSNE visualization of GSE84133 human and Qx Spleen data under different number of subsampling.

300 super-genes were generated from CCP, and Leiden clustering was used to obtain the clustering results. (a) ARI and NMI under different subsampling values. Left figure shows the ARI and NMI for GSE84133 Human, where the 4 patient data were combined. Right figure shows the ARI and NMI of Qx Spleen data. (b) CCP-assisted UMAP and tSNE of GSE84133 Human data under different number of subsampling. (c) CCP-assisted UMAP and tSNE of Qx Spleen data under different number of subsampling.

https://doi.org/10.1371/journal.pone.0311791.g009

We also show the CCP-assisted UMAP and tSNE visualizations of both datasets under subsampling. Notably, all visualizations were comparable, underscoring the stability of CCP-assisted visualizations even under subsampling. In terms of computation time, the subsampling scheme using 1000 samples took 152 seconds and 160 seconds on the GSE84133 human and Qx Spleen data, respectively. Additional comparison can be found in the S1 File.

Low variance genes

We have utilized the LV-gene to enhance the predictive power of the super-genes with an LV gene cluster. By using a high cutoff ratio, we can reduce the number of genes used in the feature partition, potentially resulting in a lower number of super-genes. To assess the impact of the cutoff ratio on the number of super-genes used for UMAP and tSNE visualizations, we conducted tests using the GSE82187 and GSE75748 cell data. The discussion for the GSE75748 cell data can be found in Section S3.2 of the S1 File.

Fig 10 shows the effect of varying the number of super-genes and the cutoff ratio on the predictive power and visualization of the GSE82187 data. We utilized 10 random seeds to generate CCP super-genes using different numbers of super-genes and cutoff ratios. Subsequently, Leiden clustering was applied to obtain cluster labels, and the ARI was computed utilizing the cell labels provided by the original authors. Notably, across all cutoff ratios, the ARI increases with the number of super-genes, plateauing at a comparable level around 300 super-genes. This indicates the robustness of the LV-gene.


Fig 10. Analysis of varying the cutoff ratio ν_c on clustering and visualization of GSE82187 data.

(a) ARI of Leiden clustering when the number of super-genes and the cutoff ratio are changed. (b) The number of genes in the LV-gene when ν_c is changed. (c) Top and bottom rows show the CCP-assisted UMAP and tSNE visualization, and the columns correspond to ν_c = 0.6, 0.7, 0.8, 0.9. 300 super-genes were used to initialize UMAP and tSNE, and the samples were colored according to the true cell type.

https://doi.org/10.1371/journal.pone.0311791.g010

Fig 10(c) shows the CCP-assisted UMAP and tSNE visualizations at various cutoff ratios. For the visualization, 300 super-genes were utilized, and UMAP and tSNE were applied to the super-genes to reduce the dimension to 2. Samples were then colored according to the cell types provided by the original authors. Note that all the visualizations are comparable, indicating the robustness of LV-gene under different cutoff ratios.

Conclusion

CCP is a nonlinear data-domain dimensionality reduction technique that leverages gene-gene correlations to partition genes and utilizes cell-cell correlations to generate super-genes. Unlike methods that involve matrix diagonalization, CCP can be directly applied as a primary dimensionality reduction tool to complement traditional visualization techniques like UMAP and tSNE. In our experiments with 18 datasets, CCP-assisted UMAP and CCP-assisted tSNE visualizations consistently outperformed the original UMAP and tSNE. On average, CCP-assisted UMAP improves the standard UMAP visualization by 22% in ARI, 14% in NMI and 15% in ECM, and CCP-assisted tSNE improves standard tSNE by 11% in ARI, 9% in NMI and 8% in ECM. The improvement for tSNE is smaller than that for UMAP. However, tSNE is sensitive to outliers and noise, which can render its visualization uninterpretable, whereas CCP-assisted tSNE consistently shows clear visualizations in the 21 datasets we have tested. Additionally, CCP-assisted visualization improves upon PCA-assisted and NMF-assisted visualization in the 21 datasets we have tested.

However, CCP comes with some disadvantages. For data with no clear gene-gene correlation, CCP will most likely not perform well. Additionally, although gene clustering removes the complication of computing distances in high dimensions, the cell-cell correlation computation becomes time consuming when the number of samples is large. We show that subsampling via a training set is an effective approach that enables CCP to handle large data. One possible extension of gene clustering is to incorporate prior information, such as known genes or known gene regulatory pathways, to guide the clustering. Additionally, CCP can be employed in many other single-cell contexts, such as spatial transcriptomics and cell-cell communication, and for initializing deep learning methods.

Supporting information

S1 File. Supporting materials for analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE.

https://doi.org/10.1371/journal.pone.0311791.s001

(PDF)

References

1. Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research. 2016;5. pmid:27909575
- View Article
- PubMed/NCBI
- Google Scholar
2. Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & molecular medicine. 2018;50(8):1–14. pmid:30089861
- View Article
- PubMed/NCBI
- Google Scholar
3. Andrews TS, Kiselev VY, McCarthy D, Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nature protocols. 2021;16(1):1–9. pmid:33288955
- View Article
- PubMed/NCBI
- Google Scholar
4. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular systems biology. 2019;15(6):e8746. pmid:31217225
- View Article
- PubMed/NCBI
- Google Scholar
5. Chen G, Ning B, Shi T. Single-cell RNA-seq technologies and related computational data analysis. Frontiers in genetics. 2019;10:317. pmid:31024627
- View Article
- PubMed/NCBI
- Google Scholar
6. Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Briefings in bioinformatics. 2020;21(4):1209–1223. pmid:31243426
- View Article
- PubMed/NCBI
- Google Scholar
7. Li WV, Li JJ. A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics. 2019;35(14):i41–i50. pmid:31510652
- View Article
- PubMed/NCBI
- Google Scholar
8. Dunteman GH. Principal components analysis. vol. 69. Sage; 1989.
9. Lounici K. Sparse Principal Component Analysis with Missing Observations. In: High Dimensional Probability VI: The Banff Volume. Springer; 2013. p. 327–356.
10. Zou H, Hastie T, Tibshirani R. Sparse Principal Component Analysis. Journal of computational and graphical statistics. 2006;15(2):265–286.
- View Article
- Google Scholar
11. Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome biology. 2019;20:1–16. pmid:31870412
- View Article
- PubMed/NCBI
- Google Scholar
12. Cottrell S, Wang R, Wei GW. PLPCA: persistent laplacian-enhanced PCA for microarray data analysis. Journal of chemical information and modeling. 2023;64(7):2405–2420. pmid:37738663
- View Article
- PubMed/NCBI
- Google Scholar
13. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–3587. pmid:34062119
- View Article
- PubMed/NCBI
- Google Scholar
14. Wang YX, Zhang YJ. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering. 2013;25(6):1336–1353.
- View Article
- Google Scholar
15. Shu Z, Long Q, Zhang L, Yu Z, Wu XJ. Robust Graph Regularized NMF with Dissimilarity and Similarity Constraints for ScRNA-seq Data Clustering. Journal of Chemical Information and Modeling. 2022;62(23):6271–6286. pmid:36459053
- View Article
- PubMed/NCBI
- Google Scholar
16. Wu P, An M, Zou HR, Zhong CY, Wang W, Wu CP. A robust semi-supervised NMF model for single cell RNA-seq data. PeerJ. 2020;8:e10091. pmid:33088619
- View Article
- PubMed/NCBI
- Google Scholar
17. Lan W, Chen J, Chen Q, Liu J, Wang J, Chen YPP. Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv. 2022; p. 2022–05.
- View Article
- Google Scholar
18. Xiao Q, Luo J, Liang C, Cai J, Ding P. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–248. pmid:28968779
- View Article
- PubMed/NCBI
- Google Scholar
19. Yu N, Gao YL, Liu JX, Wang J, Shang J. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human genomics. 2019;13:1–10. pmid:31639067
- View Article
- PubMed/NCBI
- Google Scholar
20. Liu JX, Wang D, Gao YL, Zheng CH, Shang JL, Liu F, et al. A joint-L2,1-norm-constraint-based semi-supervised feature extraction for RNA-Seq data analysis. Neurocomputing. 2017;228:263–269.
- View Article
- Google Scholar
21. Hozumi Y, Wei GW. Analyzing single cell RNA sequencing with topological nonnegative matrix factorization. Journal of Computational and Applied Mathematics. 2024;445:115842. pmid:38464901
- View Article
- PubMed/NCBI
- Google Scholar
22. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature methods. 2018;15(12):1053–1058. pmid:30504886
- View Article
- PubMed/NCBI
- Google Scholar
23. Wu H, Zhou H, Zhou B, Wang M. SCMcluster: a high-precision cell clustering algorithm integrating marker gene set with single-cell RNA sequencing data. Briefings in Functional Genomics. 2023;22(4):329–340. pmid:36848584
- View Article
- PubMed/NCBI
- Google Scholar
24. Xu J, Xu J, Meng Y, Lu C, Cai L, Zeng X, et al. Graph embedding and Gaussian mixture variational autoencoder network for end-to-end analysis of single-cell RNA sequencing data. Cell Reports Methods. 2023;3(1).
- View Article
- Google Scholar
25. Sadria M, Layton A. The Power of Two: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis. BioRxiv. 2023; p. 2023–04.
- View Article
- Google Scholar
26. Palma A, Theis FJ, Lotfollahi M. Predicting cell morphological responses to perturbations using generative modeling. bioRxiv. 2023; p. 2023–07.
- View Article
- Google Scholar
27. Giansanti V, Giannese F, Botrugno OA, Gandolfi G, Balestrieri C, Antoniotti M, et al. Scalable integration of multiomic single-cell data using generative adversarial networks. Bioinformatics. 2024;40(5):btae300. pmid:38696763
- View Article
- PubMed/NCBI
- Google Scholar
28. Kirkegaard JB. Spontaneous breaking of symmetry in overlapping cell instance segmentation using diffusion models. bioRxiv. 2023; p. 2023–07.
- View Article
- Google Scholar
29. Shen H, Liu J, Hu J, Shen X, Zhang C, Wu D, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. Iscience. 2023;26(5). pmid:37187700
- View Article
- PubMed/NCBI
- Google Scholar
30. Connell W, Khan U, Keiser MJ. A single-cell gene expression language model. arXiv preprint arXiv:221014330. 2022.
31. Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence. 2022;4(10):852–866.
- View Article
- Google Scholar
32. Yu X, Ren J, Long H, Zeng R, Zhang G, Bilal A, et al. iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Frontiers in Genetics. 2024;15:1377285. pmid:38689652
- View Article
- PubMed/NCBI
- Google Scholar
33. Chen J, Xu H, Tao W, Chen Z, Zhao Y, Han JDJ. Transformer for one stop interpretable cell type annotation. Nature Communications. 2023;14(1):223. pmid:36641532
- View Article
- PubMed/NCBI
- Google Scholar
34. Xu J, Zhang A, Liu F, Chen L, Zhang X. CIForm as a transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Briefings in Bioinformatics. 2023;24(4):bbad195. pmid:37200157
- View Article
- PubMed/NCBI
- Google Scholar
35. Jiao L, Wang G, Dai H, Li X, Wang S, Song T. scTransSort: Transformers for intelligent annotation of cell types by gene embeddings. Biomolecules. 2023;13(4):611. pmid:37189359
- View Article
- PubMed/NCBI
- Google Scholar
36. Hu H, Feng Z, Lin H, Zhao J, Zhang Y, Xu F, et al. Modeling and analyzing single-cell multimodal data with deep parametric inference. Briefings in Bioinformatics. 2023;24(1):bbad005. pmid:36642414
- View Article
- PubMed/NCBI
- Google Scholar
37. Meng R, Yin S, Sun J, Hu H, Zhao Q. scAAGA: Single cell data analysis framework using asymmetric autoencoder with gene attention. Computers in biology and medicine. 2023;165:107414. pmid:37660567
- View Article
- PubMed/NCBI
- Google Scholar
38. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nature methods. 2017;14(4):414–416. pmid:28263960
- View Article
- PubMed/NCBI
- Google Scholar
39. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nature methods. 2017;14(5):483–486. pmid:28346451
- View Article
- PubMed/NCBI
- Google Scholar
40. Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Research. 2022;50(18):10278–10289. pmid:36161334
- View Article
- PubMed/NCBI
- Google Scholar
41. Zhang P, Wu Y, Zhou H, Zhou B, Zhang H, Wu H. CLNN-loop: a deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types. Bioinformatics. 2022;38(19):4497–4504. pmid:35997565
- View Article
- PubMed/NCBI
- Google Scholar
42. Zhang P, Wu H. Ichrom-deep: an attention-based deep learning model for identifying chromatin interactions. IEEE Journal of Biomedical and Health Informatics. 2023;. pmid:37402191
- View Article
- PubMed/NCBI
- Google Scholar
43. Hu H, Feng Z, Lin H, Cheng J, Lyu J, Zhang Y, et al. Gene function and cell surface protein association analysis based on single-cell multiomics data. Computers in Biology and Medicine. 2023;157:106733. pmid:36924730
- View Article
- PubMed/NCBI
- Google Scholar
44. Feng X, Xiu YH, Long HX, Wang ZT, Bilal A, Yang LM. Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network. Briefings in Bioinformatics. 2024;25(1):bbad481.
- View Article
- Google Scholar
45. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:180203426. 2018.
46. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature communications. 2019;10(1):5416. pmid:31780648
- View Article
- PubMed/NCBI
- Google Scholar
47. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine learning research;9(11):2579–2605.
- View Article
- Google Scholar
48. Hozumi Y, Wang R, Wei GW. CCP: Correlated clustering and projection for dimensionality reduction. arXiv preprint arXiv:220604189. 2022.
49. Hozumi Y, Tanemura KA, Wei GW. Preprocessing of single cell RNA sequencing data using correlated clustering and projection. Journal of chemical information and modeling. 2023;64(7):2829–2838. pmid:37402705
- View Article
- PubMed/NCBI
- Google Scholar
50. Xia K, Opron K, Wei GW. Multiscale multiphysics and multidomain models—Flexibility and rigidity. The Journal of chemical physics. 2013;139(19). pmid:24320318
- View Article
- PubMed/NCBI
- Google Scholar
51. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002;30(1):207–210. pmid:11752295
- View Article
- PubMed/NCBI
- Google Scholar
52. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic acids research. 2012;41(D1):D991–D995. pmid:23193258
- View Article
- PubMed/NCBI
- Google Scholar
53. Chen L, Wang W, Zhai Y, Deng M. Deep soft K-means clustering with self-training for single-cell RNA sequence data. NAR genomics and bioinformatics. 2020;2(2):lqaa039. pmid:33575592
- View Article
- PubMed/NCBI
- Google Scholar
54. Schaum N, Karkanias J, Neff NF, May AP, Quake SR, Wyss-Coray T, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: The Tabula Muris Consortium. Nature. 2018;562(7727):367–372.
- View Article
- Google Scholar
55. Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell systems. 2016;3(4):346–360. pmid:27667365
- View Article
- PubMed/NCBI
- Google Scholar
56. Chu LF, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology. 2016;17:1–20. pmid:27534536
- View Article
- PubMed/NCBI
- Google Scholar
57. Gokce O, Stanley GM, Treutlein B, Neff NF, Camp JG, Malenka RC, et al. Cellular taxonomy of the mouse striatum as revealed by single-cell RNA-seq. Cell reports. 2016;16(4):1126–1137. pmid:27425622
- View Article
- PubMed/NCBI
- Google Scholar
58. Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, et al. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences. 2015;112(23):7285–7290. pmid:26060301
- View Article
- PubMed/NCBI
- Google Scholar
59. Muraro MJ, Dharmadhikari G, Grün D, Groen N, Dielen T, Jansen E, et al. A single-cell transcriptome atlas of the human pancreas. Cell systems. 2016;3(4):385–394. pmid:27693023
- View Article
- PubMed/NCBI
- Google Scholar
60. Romanov RA, Zeisel A, Bakker J, Girach F, Hellysaz A, Tomer R, et al. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nature neuroscience. 2017;20(2):176–188. pmid:27991900
- View Article
- PubMed/NCBI
- Google Scholar
61. Gideon HP, Hughes TK, Tzouanas CN, Wadsworth MH, Tu AA, Gierahn TM, et al. Multimodal profiling of lung granulomas in macaques reveals cellular correlates of tuberculosis control. Immunity. 2022;55(5):827–846. pmid:35483355
- View Article
- PubMed/NCBI
- Google Scholar
62. Gates AJ, Wood IB, Hetrick WP, Ahn YY. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific reports. 2019;9(1):8574. pmid:31189888
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research. 2016;5. pmid:27909575
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & molecular medicine. 2018;50(8):1–14. pmid:30089861
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Andrews TS, Kiselev VY, McCarthy D, Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nature protocols. 2021;16(1):1–9. pmid:33288955
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular systems biology. 2019;15(6):e8746. pmid:31217225
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Chen G, Ning B, Shi T. Single-cell RNA-seq technologies and related computational data analysis. Frontiers in genetics. 2019;10:317. pmid:31024627
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Briefings in bioinformatics. 2020;21(4):1209–1223. pmid:31243426
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Li WV, Li JJ. A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics. 2019;35(14):i41–i50. pmid:31510652
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. Dunteman GH. Principal components analysis. vol. 69. Sage; 1989.

[ref9] 9. Lounici K. Sparse Principal Component Analysis with Missing Observations. In: High Dimensional Probability VI: The Banff Volume. Springer; 2013. p. 327–356.

[ref10] 10. Zou H, Hastie T, Tibshirani R. Sparse Principal Component Analysis. Journal of computational and graphical statistics. 2006;15(2):265–286.
View Article
Google Scholar

[32] View Article

[33] Google Scholar

[ref11] 11. Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome biology. 2019;20:1–16. pmid:31870412
View Article
PubMed/NCBI
Google Scholar

[35] View Article

[36] PubMed/NCBI

[37] Google Scholar

[ref12] 12. Cottrell S, Wang R, Wei GW. PLPCA: persistent laplacian-enhanced PCA for microarray data analysis. Journal of chemical information and modeling. 2023;64(7):2405–2420. pmid:37738663
View Article
PubMed/NCBI
Google Scholar

[39] View Article

[40] PubMed/NCBI

[41] Google Scholar

[ref13] 13. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–3587. pmid:34062119
View Article
PubMed/NCBI
Google Scholar

[43] View Article

[44] PubMed/NCBI

[45] Google Scholar

[ref14] 14. Wang YX, Zhang YJ. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering. 2013;25(6):1336–1353.
View Article
Google Scholar

[47] View Article

[48] Google Scholar

[ref15] 15. Shu Z, Long Q, Zhang L, Yu Z, Wu XJ. Robust Graph Regularized NMF with Dissimilarity and Similarity Constraints for ScRNA-seq Data Clustering. Journal of Chemical Information and Modeling. 2022;62(23):6271–6286. pmid:36459053
View Article
PubMed/NCBI
Google Scholar

[50] View Article

[51] PubMed/NCBI

[52] Google Scholar

[ref16] 16. Wu P, An M, Zou HR, Zhong CY, Wang W, Wu CP. A robust semi-supervised NMF model for single cell RNA-seq data. PeerJ. 2020;8:e10091. pmid:33088619
View Article
PubMed/NCBI
Google Scholar

[54] View Article

[55] PubMed/NCBI

[56] Google Scholar

[ref17] 17. Lan W, Chen J, Chen Q, Liu J, Wang J, Chen YPP. Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv. 2022; p. 2022–05.
View Article
Google Scholar

[58] View Article

[59] Google Scholar

[ref18] 18. Xiao Q, Luo J, Liang C, Cai J, Ding P. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–248. pmid:28968779
View Article
PubMed/NCBI
Google Scholar

[61] View Article

[62] PubMed/NCBI

[63] Google Scholar

[ref19] 19. Yu N, Gao YL, Liu JX, Wang J, Shang J. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human genomics. 2019;13:1–10. pmid:31639067
View Article
PubMed/NCBI
Google Scholar

[65] View Article

[66] PubMed/NCBI

[67] Google Scholar

[ref20] 20. Liu JX, Wang D, Gao YL, Zheng CH, Shang JL, Liu F, et al. A joint-L2,1-norm-constraint-based semi-supervised feature extraction for RNA-Seq data analysis. Neurocomputing. 2017;228:263–269.
View Article
Google Scholar

[69] View Article

[70] Google Scholar

[ref21] 21. Hozumi Y, Wei GW. Analyzing single cell RNA sequencing with topological nonnegative matrix factorization. Journal of Computational and Applied Mathematics. 2024;445:115842. pmid:38464901
View Article
PubMed/NCBI
Google Scholar

[72] View Article

[73] PubMed/NCBI

[74] Google Scholar

[ref22] 22. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature methods. 2018;15(12):1053–1058. pmid:30504886
View Article
PubMed/NCBI
Google Scholar

[76] View Article

[77] PubMed/NCBI

[78] Google Scholar

[ref23] 23. Wu H, Zhou H, Zhou B, Wang M. SCMcluster: a high-precision cell clustering algorithm integrating marker gene set with single-cell RNA sequencing data. Briefings in Functional Genomics. 2023;22(4):329–340. pmid:36848584
View Article
PubMed/NCBI
Google Scholar

[80] View Article

[81] PubMed/NCBI

[82] Google Scholar

[ref24] 24. Xu J, Xu J, Meng Y, Lu C, Cai L, Zeng X, et al. Graph embedding and Gaussian mixture variational autoencoder network for end-to-end analysis of single-cell RNA sequencing data. Cell Reports Methods. 2023;3(1).
View Article
Google Scholar

[84] View Article

[85] Google Scholar

[ref25] 25. Sadria M, Layton A. The Power of Two: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis. BioRxiv. 2023; p. 2023–04.
View Article
Google Scholar

[87] View Article

[88] Google Scholar

[ref26] 26. Palma A, Theis FJ, Lotfollahi M. Predicting cell morphological responses to perturbations using generative modeling. bioRxiv. 2023; p. 2023–07.
View Article
Google Scholar

[90] View Article

[91] Google Scholar

[ref27] 27. Giansanti V, Giannese F, Botrugno OA, Gandolfi G, Balestrieri C, Antoniotti M, et al. Scalable integration of multiomic single-cell data using generative adversarial networks. Bioinformatics. 2024;40(5):btae300. pmid:38696763
View Article
PubMed/NCBI
Google Scholar

[93] View Article

[94] PubMed/NCBI

[95] Google Scholar

[ref28] 28. Kirkegaard JB. Spontaneous breaking of symmetry in overlapping cell instance segmentation using diffusion models. bioRxiv. 2023; p. 2023–07.
View Article
Google Scholar

[97] View Article

[98] Google Scholar

[ref29] 29. Shen H, Liu J, Hu J, Shen X, Zhang C, Wu D, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. Iscience. 2023;26(5). pmid:37187700
View Article
PubMed/NCBI
Google Scholar

[100] View Article

[101] PubMed/NCBI

[102] Google Scholar

[ref30] 30. Connell W, Khan U, Keiser MJ. A single-cell gene expression language model. arXiv preprint arXiv:221014330. 2022.

[ref31] 31. Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence. 2022;4(10):852–866.
View Article
Google Scholar

[105] View Article

[106] Google Scholar

[ref32] 32. Yu X, Ren J, Long H, Zeng R, Zhang G, Bilal A, et al. iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Frontiers in Genetics. 2024;15:1377285. pmid:38689652
View Article
PubMed/NCBI
Google Scholar

[108] View Article

[109] PubMed/NCBI

[110] Google Scholar

[ref33] 33. Chen J, Xu H, Tao W, Chen Z, Zhao Y, Han JDJ. Transformer for one stop interpretable cell type annotation. Nature Communications. 2023;14(1):223. pmid:36641532
View Article
PubMed/NCBI
Google Scholar

[112] View Article

[113] PubMed/NCBI

[114] Google Scholar

[ref34] 34. Xu J, Zhang A, Liu F, Chen L, Zhang X. CIForm as a transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Briefings in Bioinformatics. 2023;24(4):bbad195. pmid:37200157
View Article
PubMed/NCBI
Google Scholar

[116] View Article

[117] PubMed/NCBI

[118] Google Scholar

[ref35] 35. Jiao L, Wang G, Dai H, Li X, Wang S, Song T. scTransSort: Transformers for intelligent annotation of cell types by gene embeddings. Biomolecules. 2023;13(4):611. pmid:37189359
View Article
PubMed/NCBI
Google Scholar

[120] View Article

[121] PubMed/NCBI

[122] Google Scholar

[ref36] 36. Hu H, Feng Z, Lin H, Zhao J, Zhang Y, Xu F, et al. Modeling and analyzing single-cell multimodal data with deep parametric inference. Briefings in Bioinformatics. 2023;24(1):bbad005. pmid:36642414
View Article
PubMed/NCBI
Google Scholar

[124] View Article

[125] PubMed/NCBI

[126] Google Scholar

[ref37] 37. Meng R, Yin S, Sun J, Hu H, Zhao Q. scAAGA: Single cell data analysis framework using asymmetric autoencoder with gene attention. Computers in biology and medicine. 2023;165:107414. pmid:37660567
View Article
PubMed/NCBI
Google Scholar

[128] View Article

[129] PubMed/NCBI

[130] Google Scholar

[ref38] 38. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nature methods. 2017;14(4):414–416. pmid:28263960
View Article
PubMed/NCBI
Google Scholar

[132] View Article

[133] PubMed/NCBI

[134] Google Scholar

[ref39] 39. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nature methods. 2017;14(5):483–486. pmid:28346451
View Article
PubMed/NCBI
Google Scholar

[136] View Article

[137] PubMed/NCBI

[138] Google Scholar

[ref40] 40. Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Research. 2022;50(18):10278–10289. pmid:36161334
View Article
PubMed/NCBI
Google Scholar

[140] View Article

[141] PubMed/NCBI

[142] Google Scholar

[ref41] 41. Zhang P, Wu Y, Zhou H, Zhou B, Zhang H, Wu H. CLNN-loop: a deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types. Bioinformatics. 2022;38(19):4497–4504. pmid:35997565
View Article
PubMed/NCBI
Google Scholar

[144] View Article

[145] PubMed/NCBI

[146] Google Scholar

[ref42] 42. Zhang P, Wu H. Ichrom-deep: an attention-based deep learning model for identifying chromatin interactions. IEEE Journal of Biomedical and Health Informatics. 2023;. pmid:37402191
View Article
PubMed/NCBI
Google Scholar

[148] View Article

[149] PubMed/NCBI

[150] Google Scholar

[ref43] 43. Hu H, Feng Z, Lin H, Cheng J, Lyu J, Zhang Y, et al. Gene function and cell surface protein association analysis based on single-cell multiomics data. Computers in Biology and Medicine. 2023;157:106733. pmid:36924730
View Article
PubMed/NCBI
Google Scholar

[152] View Article

[153] PubMed/NCBI

[154] Google Scholar

[ref44] 44. Feng X, Xiu YH, Long HX, Wang ZT, Bilal A, Yang LM. Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network. Briefings in Bioinformatics. 2024;25(1):bbad481.
View Article
Google Scholar

[156] View Article

[157] Google Scholar

[ref45] 45. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:180203426. 2018.

[ref46] 46. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature communications. 2019;10(1):5416. pmid:31780648
View Article
PubMed/NCBI
Google Scholar

[160] View Article

[161] PubMed/NCBI

[162] Google Scholar

[ref47] 47. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine learning research;9(11):2579–2605.
View Article
Google Scholar

[164] View Article

[165] Google Scholar

[ref48] 48. Hozumi Y, Wang R, Wei GW. CCP: Correlated clustering and projection for dimensionality reduction. arXiv preprint arXiv:220604189. 2022.

[ref49] 49. Hozumi Y, Tanemura KA, Wei GW. Preprocessing of single cell RNA sequencing data using correlated clustering and projection. Journal of chemical information and modeling. 2023;64(7):2829–2838. pmid:37402705
View Article
PubMed/NCBI
Google Scholar

[168] View Article

[169] PubMed/NCBI

[170] Google Scholar

[ref50] 50. Xia K, Opron K, Wei GW. Multiscale multiphysics and multidomain models—Flexibility and rigidity. The Journal of chemical physics. 2013;139(19). pmid:24320318
View Article
PubMed/NCBI
Google Scholar

[172] View Article

[173] PubMed/NCBI

[174] Google Scholar

[ref51] 51. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002;30(1):207–210. pmid:11752295
View Article
PubMed/NCBI
Google Scholar

[176] View Article

[177] PubMed/NCBI

[178] Google Scholar

[ref52] 52. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic acids research. 2012;41(D1):D991–D995. pmid:23193258
View Article
PubMed/NCBI
Google Scholar

[180] View Article

[181] PubMed/NCBI

[182] Google Scholar

[ref53] 53. Chen L, Wang W, Zhai Y, Deng M. Deep soft K-means clustering with self-training for single-cell RNA sequence data. NAR genomics and bioinformatics. 2020;2(2):lqaa039. pmid:33575592
View Article
PubMed/NCBI
Google Scholar

[184] View Article

[185] PubMed/NCBI

[186] Google Scholar

[ref54] 54. Schaum N, Karkanias J, Neff NF, May AP, Quake SR, Wyss-Coray T, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: The Tabula Muris Consortium. Nature. 2018;562(7727):367–372.
View Article
Google Scholar

[188] View Article

[189] Google Scholar

[ref55] 55. Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell systems. 2016;3(4):346–360. pmid:27667365
View Article
PubMed/NCBI
Google Scholar

[191] View Article

[192] PubMed/NCBI

[193] Google Scholar

[ref56] 56. Chu LF, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology. 2016;17:1–20. pmid:27534536
View Article
PubMed/NCBI
Google Scholar

[195] View Article

[196] PubMed/NCBI

[197] Google Scholar

[ref57] 57. Gokce O, Stanley GM, Treutlein B, Neff NF, Camp JG, Malenka RC, et al. Cellular taxonomy of the mouse striatum as revealed by single-cell RNA-seq. Cell reports. 2016;16(4):1126–1137. pmid:27425622

[ref58] 58. Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, et al. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences. 2015;112(23):7285–7290. pmid:26060301

[ref59] 59. Muraro MJ, Dharmadhikari G, Grün D, Groen N, Dielen T, Jansen E, et al. A single-cell transcriptome atlas of the human pancreas. Cell systems. 2016;3(4):385–394. pmid:27693023

[ref60] 60. Romanov RA, Zeisel A, Bakker J, Girach F, Hellysaz A, Tomer R, et al. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nature neuroscience. 2017;20(2):176–188. pmid:27991900

[ref61] 61. Gideon HP, Hughes TK, Tzouanas CN, Wadsworth MH, Tu AA, Gierahn TM, et al. Multimodal profiling of lung granulomas in macaques reveals cellular correlates of tuberculosis control. Immunity. 2022;55(5):827–846. pmid:35483355

[ref62] 62. Gates AJ, Wood IB, Hetrick WP, Ahn YY. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific reports. 2019;9(1):8574. pmid:31189888
