Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Correlated clustering and projection (CCP) was recently introduced as an effective method for preprocessing scRNA-seq data. CCP utilizes gene-gene correlations to partition the genes and, based on the partition, employs cell-cell interactions to obtain super-genes. Because CCP is a data-domain approach that does not require matrix diagonalization, it can be used in many downstream machine learning tasks. In this work, we utilize CCP as an initialization tool for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE). Using 21 publicly available datasets, we have found that CCP significantly improves UMAP and tSNE visualization and dramatically improves their accuracy. More specifically, CCP improves UMAP by 22% in ARI, 14% in NMI and 15% in ECM, and improves tSNE by 11% in ARI, 9% in NMI and 8% in ECM.
Citation: Hozumi Y, Wei G-W (2024) Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE. PLoS ONE 19(12): e0311791. https://doi.org/10.1371/journal.pone.0311791
Editor: Andrea Tangherloni, Bocconi University: Universita Bocconi, ITALY
Received: February 20, 2024; Accepted: September 24, 2024; Published: December 13, 2024
Copyright: © 2024 Hozumi, Wei. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data is publicly available from the Gene Expression Omnibus database with the accession numbers GSE75748, GSE82187, GSE94820, GSE67835, GSE84133, GSE109774, GSE85241, GSE74672, and SCP1749. We have uploaded the data and code to reproduce our work in figshare, under https://doi.org/10.6084/m9.figshare.26501389.v7.
Funding: National Institutes of Health (NIH) grants R01GM126189, R01AI164266, and R35GM148196; National Science Foundation (NSF) grants DMS-2052983, DMS-1761320, and IIS-1900473; National Aeronautics and Space Administration (NASA) grant 80NSSC21M0023; Michigan State University (MSU) Foundation; Bristol-Myers Squibb 65109; Pfizer. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that profiles the transcriptome of individual cells within a tissue or organ, aiming to gain understanding of gene expression, gene regulation, cell-cell interaction, spatial transcriptomics, signal transduction pathways, and more [1]. The typical workflow of scRNA-seq involves cell isolation, RNA extraction, library preparation, sequencing, and data analysis. Through technological advances in the experimental procedures, the read quality has improved, and over 10,000 samples can now be sequenced at once. Despite the improvements, the data still contains nonuniform noise, is often unlabeled, and has high dimensions. In order to analyze such complex data, a standard data analysis pipeline involves data preprocessing, gene expression quantification, normalization and batch correction, dimensionality reduction, cell type identification, differential gene expression analysis, and pathway and functional analysis [2–7]. An effective dimensionality reduction method must be employed in order to have a meaningful downstream analysis.
Two of the most popular dimensionality reduction methods for scRNA-seq are principal components analysis (PCA) and nonnegative matrix factorization (NMF). In PCA, the first component, called the first principal component, is the direction along which the variance of the projected data is maximized. The subsequent ith component is orthogonal to the first i − 1 components, and maximizes the variance of the residual data projected onto the ith component [8]. PCA aims to obtain a lower dimensional representation of the data to identify important gene patterns. Many PCA derivatives have also been used for scRNA-seq data analysis [9–12]. In particular, the popular package Seurat [13] utilizes supervised PCA to find an optimal projection that incorporates the local structure of the reference data for the downstream analysis. However, because PCA requires matrix diagonalization and its projected data contains negative values, it is difficult to interpret. In contrast, NMF has an additional constraint such that the low dimensional representation is nonnegative. Each component, often called a metagene in scRNA-seq analysis, is a linear combination of the original genes [14]. NMF has seen numerous extensions to the original formulation, including robustness to noise and manifold regularization [15–21]. Through the nonnegativity constraint, NMF is highly interpretable, and may be more suitable for downstream analysis.
Deep learning and ensemble methods are another class of approaches that have become popular for single cell RNA-seq analysis. Single-cell variational inference (scVI) utilizes deep neural networks to obtain information from similar cells and genes to approximate the distribution of underlying gene expression values [22]. Single-cell cluster using marker genes (SCMcluster) utilizes known marker genes to guide feature selection and perform ensemble clustering [23]. AutoCell [24] utilizes a variational autoencoding network that combines the Gaussian mixture model and graph embedding to model the high dimensional scRNA-seq data. Diffusion models [25–28], generative adversarial networks (GAN) [27], language models [29–32], transformers [33–37], ensemble methods [38–40] and more [41–44] have also been used for scRNA-seq analysis. Though these methods perform well, they rely on careful curation of data and often require large amounts of data for pretraining.
Visualization of the data is also an important aspect of the scRNA-seq analysis pipeline. After data preprocessing and feature extraction through dimensionality reduction, the visualization of data commonly involves uniform manifold approximation and projection (UMAP) or t-distributed stochastic neighbor embedding (tSNE) [45, 46]. UMAP obtains its visualization by constructing a k-nearest-neighbor weighted graph of the original space and minimizing the edge-wise cross entropy between the weighted graph of the low dimensional embedding and the weighted graph of the original space. Through the utilization of stochastic gradient descent, UMAP demonstrates a notable improvement in speed and scalability compared to other Laplacian eigenmap-based algorithms. tSNE computes data similarity by constructing a conditional probability distribution among pairs of data points. It employs the Student’s t-distribution to derive the probability distribution of the low-dimensional embedded space. Subsequently, it minimizes the Kullback-Leibler (KL) divergence between these two distributions to obtain the visualization [46, 47]. These visualization methods are widely utilized in popular analysis pipelines such as Scanpy for Python and Seurat for R. They serve as crucial tools to ensure proper feature selection and facilitate comprehensive data exploration.
We proposed correlated clustering and projection (CCP) as a general approach for dimensionality reduction [48, 49]. CCP is a data-domain method that completely bypasses matrix diagonalization. It partitions genes into clusters based on their similarities and then projects genes in the same cluster into a super-gene, which is a measure of accumulated gene-gene correlations among cells. The gene partition can be realized by either the standard k-means or the k-medoids using either covariance distance or correlation distance. Flexibility rigidity index (FRI) is used for the nonlinear projection [50]. The resulting super-genes are all non-negative and highly interpretable. Recently, CCP has been applied to the clustering and classification of scRNA-seq datasets [49], where it showed an improvement over PCA across 14 scRNA-seq datasets. This indicates CCP’s promising performance in handling single-cell RNA sequencing data.
The aim of this study is to explore CCP’s potential as the primary dimensionality reduction method for visualization, particularly focusing on its application in initializing UMAP and tSNE, two highly effective visualization tools in scRNA-seq analysis. Additionally, we introduce a novel method for handling low-variance (LV) genes. Instead of discarding low-variance genes like many other methods, we group them together into a single category. This grouping is achieved by projecting them into one descriptor using FRI. One of the drawbacks of dropping low-variance genes is that scRNA-seq data often has an unequal number of cell types. Moreover, there are numerous genes with low expression, and removing too many genes may result in overlooking cell outliers. Therefore, LV-gene addresses this issue by consolidating low-variance genes into one descriptor, thereby increasing its predictive power. Through experimentation on 21 publicly available datasets, we evaluated CCP-assisted UMAP and CCP-assisted tSNE. Our findings showcase that CCP enhances UMAP by 22% in Adjusted Rand Index (ARI), 14% in Normalized Mutual Information (NMI), and 15% in Element-Centric Measure (ECM). Similarly, CCP improves tSNE by 11% in ARI, 9% in NMI, and 8% in ECM.
Methods and algorithms
Consider a scRNA-seq dataset $\mathcal{Z} = \{\mathbf{z}_m\}_{m=1}^{M}$ with $\mathbf{z}_m \in \mathbb{R}^{I}$, where M is the number of cells and I is the number of genes. CCP finds an N-dimensional representation $\mathcal{X} = \{\mathbf{x}_m\}_{m=1}^{M}$ with $\mathbf{x}_m \in \mathbb{R}^{N}$, in which 1 ≤ N << I, by using a data-domain two-step strategy: gene clustering and gene projection. Fig 1 shows the workflow of CCP, and the details of each step are outlined below. First, the genes are clustered according to their similarities. Then, for each gene cluster, the genes are projected into one descriptor called the super-gene. The resulting component $x_m^n$ can be regarded as the nth super-gene for the mth cell. Subsequent analysis, such as 2D visualization using UMAP and tSNE, can then be performed.
From the input gene expression matrix, the genes are partitioned into groups, according to their similarity. Then, each group of genes is projected into 1 descriptor called super-gene. UMAP and tSNE can further be applied to the super-gene to visualize the cells in 2 dimensions.
Gene clustering
To facilitate gene clustering, we emphasize gene vectors by writing the original data as $\mathcal{Z} = \{\mathbf{z}^1, \ldots, \mathbf{z}^i, \ldots, \mathbf{z}^I\}$, where $\mathbf{z}^i \in \mathbb{R}^{M}$ represents the ith gene vector for the data. CCP partitions the gene vectors into N components with 1 ≤ N << I by a clustering technique, such as k-means or k-medoids. CCP seeks an optimal disjoint partition of the data $\mathcal{Z} = \biguplus_{n=1}^{N} \mathcal{Z}^n$, for a given N, where $\mathcal{Z}^n$ is the nth partition (cluster) of the genes. To this end, the correlations among gene vectors $\mathbf{z}^i$ are analyzed according to appropriate correlation measures, such as covariance distance and correlation distance. Note that the clustering is performed on the genes, rather than the cells.
Let S = {1, …, I} be the enumeration of the genes. We can partition S into $S^1, \ldots, S^N$ using the gene clustering results by letting $S^n = \{i \mid \mathbf{z}^i \in \mathcal{Z}^n\}$. Then, $\mathbf{z}_m^{S^n}$ denotes the $S^n$ genes of the mth cell. Further detail can be found in Section S1.1 of the S1 File.
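The gene-clustering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it runs a plain k-means on gene vectors that are centered and scaled to unit norm, so that squared Euclidean distance equals 2(1 − Pearson correlation) and serves as a stand-in for correlation distance. The function name `cluster_genes`, the farthest-point initialization, and the toy data are all hypothetical choices made for the example.

```python
import numpy as np

def cluster_genes(Z, n_clusters, n_iter=50):
    """Partition the genes (columns of the cells x genes matrix Z) with k-means.

    Assumption of this sketch: gene vectors are centered and unit-normalized,
    so squared Euclidean distance equals 2 * (1 - Pearson correlation).
    Returns one integer label in [0, n_clusters) per gene.
    """
    G = np.asarray(Z, dtype=float).T               # one row per gene vector
    G = G - G.mean(axis=1, keepdims=True)          # center each gene across cells
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    G = G / np.where(norms > 0, norms, 1.0)        # constant genes stay at zero

    # Farthest-point initialization (a deterministic choice made for this sketch).
    centers = [G[0]]
    for _ in range(1, n_clusters):
        d2 = np.min([((G - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(G[d2.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        # squared distance from every gene to every center
        d2 = ((G[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):                # keep the old center if a cluster empties
                centers[k] = G[labels == k].mean(axis=0)
    return labels

# Toy data (hypothetical): 40 cells, 6 genes forming two correlated groups.
rng = np.random.default_rng(1)
a, b = rng.normal(size=40), rng.normal(size=40)
Z = np.column_stack([a, a + 0.01 * rng.normal(size=40), a + 0.01 * rng.normal(size=40),
                     b, b + 0.01 * rng.normal(size=40), b + 0.01 * rng.normal(size=40)])
labels = cluster_genes(Z, n_clusters=2)
gene_sets = [np.flatnonzero(labels == n) for n in range(2)]  # gene index set per cluster
```

The resulting index sets play the role of the gene partition: each set selects the columns of the expression matrix that are later collapsed into one super-gene.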
Gene projection.
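Based on the gene partitioning, we denote $\mathbf{z}_m^{S^n} \in \mathbb{R}^{|S^n|}$ as the nth cluster of $S^n$ genes for the mth cell. CCP projects these genes into a super-gene $x_m^n$ by using the flexibility rigidity index (FRI). Denote $\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|$ as some metric between cell i and cell j for the nth cluster of $S^n$ genes. The gene-gene correlation between the two cells is defined by $C_{ij}^{S^n} = \Phi^{S^n}\left(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa\right)$, where Φ is a correlation kernel, with parameters $\eta^{S^n} > 0$, $\tau > 0$, and κ > 0. One may use the Euclidean, Manhattan, and/or Wasserstein distances to measure the correlations. Additionally, the FRI correlation kernels satisfy the following conditions

$$\Phi^{S^n}\left(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa\right) \to 1 \quad \text{as } \|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\| \to 0, \tag{1}$$

$$\Phi^{S^n}\left(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa\right) \to 0 \quad \text{as } \|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\| \to \infty. \tag{2}$$

Although various radial basis functions can be used in CCP, we consider the generalized exponential function in the present work

$$\Phi^{S^n}\left(\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|; \eta^{S^n}, \tau, \kappa\right) = \begin{cases} \exp\left[-\left(\dfrac{\|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\|}{\eta^{S^n}\tau}\right)^{\kappa}\right], & \|\mathbf{z}_i^{S^n} - \mathbf{z}_j^{S^n}\| < r_c^{S^n}, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $r_c^{S^n}$ is the cutoff distance and $\eta^{S^n}$ is the scale, which are defined from the data automatically. Here, κ is the power and τ is a scale parameter.
The gene-gene correlation matrix $C^{S^n} = \{C_{ij}^{S^n}\}$ represents the cell-cell interactions for genes $S^n$, and it captures all the interactions up to a threshold, which is determined by $r_c^{S^n}$. Here, we take $r_c^{S^n}$ to be the 3-standard deviations of the pairwise distances. Additionally, to automatically evaluate $\eta^{S^n}$, we consider the average minimal distance between the cluster of genes

$$\eta^{S^n} = \frac{1}{M} \sum_{m=1}^{M} \min_{j \neq m} \|\mathbf{z}_m^{S^n} - \mathbf{z}_j^{S^n}\|. \tag{4}$$

Using the correlation function, CCP projects $S^n$ genes into a super-gene using FRI for the ith sample,

$$x_i^n = \sum_{m=1}^{M} w_{im}\, \Phi^{S^n}\left(\|\mathbf{z}_i^{S^n} - \mathbf{z}_m^{S^n}\|; \eta^{S^n}, \tau, \kappa\right), \tag{5}$$

where $w_{im}$ are the weights. In this work, we set $w_{im} = 1$.
CCP obtains the lower dimensional super-gene representation for the ith sample (cell) by running the projection for all gene clusters, $\mathbf{x}_i = (x_i^1, \ldots, x_i^N)^T$.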
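The projection step can be illustrated with a short NumPy sketch. It follows the description above (pairwise cell-cell distances within one gene cluster, a generalized exponential kernel truncated at a cutoff, a scale given by the average minimal distance, and unit weights), but it is an illustration rather than the authors' code. Two details are assumptions made here: Euclidean distance is used as the metric, and the "3-standard deviations" cutoff is read as mean plus three standard deviations of the pairwise distances. The function names `fri_super_gene` and `ccp_project` are hypothetical.

```python
import numpy as np

def fri_super_gene(Zc, tau=6.0, kappa=2.0):
    """Collapse one gene cluster (cells x genes-in-cluster) into one super-gene."""
    Zc = np.asarray(Zc, dtype=float)
    diff = Zc[:, None, :] - Zc[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))             # M x M cell-cell distances
    M = D.shape[0]
    off = D[~np.eye(M, dtype=bool)]                  # distances between distinct cells
    r_c = off.mean() + 3.0 * off.std()               # ASSUMED reading of the cutoff
    # scale: average distance from each cell to its nearest other cell
    eta = (D + np.diag(np.full(M, np.inf))).min(axis=1).mean()
    if eta == 0:                                     # guard for fully duplicated cells
        eta = 1.0
    C = np.exp(-(D / (eta * tau)) ** kappa)          # generalized exponential kernel
    C[D >= r_c] = 0.0                                # no correlation beyond the cutoff
    return C.sum(axis=1)                             # unit weights: plain row sums

def ccp_project(Z, gene_sets, tau=6.0, kappa=2.0):
    """Stack one super-gene per gene cluster into a cells x clusters matrix."""
    return np.column_stack([fri_super_gene(Z[:, S], tau, kappa) for S in gene_sets])

# Toy data (hypothetical): 30 cells, 8 genes, 3 pre-computed gene clusters.
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 8))
gene_sets = [[0, 1, 2], [3, 4], [5, 6, 7]]
X = ccp_project(Z, gene_sets)                        # shape (30, 3)
```

Because every kernel value lies between 0 and 1 and each cell correlates perfectly with itself, each super-gene value in this sketch lies between 1 and the number of cells, which is why the representation is non-negative.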
Low variance (LV) genes.
One major challenge of scRNA-seq analysis is dealing with sparsity and low variance genes. We propose using low variance (LV) genes to collapse the low-variance genes into one super-gene to increase their predictive power.
Let $v = (v_1, \ldots, v_I)$ be the variances of the genes, where $v_i$ is the variance of gene $\mathbf{z}^i$, and assume that the variances are sorted in descending order. The low variance set P is then defined from the sorted variances using a cutoff ratio $0 \le v_c \le 1$. We can obtain the cell-cell correlation using these low variance genes,

$$C_{ij}^{P} = \Phi^{P}\left(\|\mathbf{z}_i^{P} - \mathbf{z}_j^{P}\|; \eta^{P}, \tau, \kappa\right),$$

where $\Phi^{P}$ is the generalized exponential function, $r_c^{P}$ is taken as the 3-standard deviation of the pairwise distances, and $\eta^{P}$ is the average minimum distance

$$\eta^{P} = \frac{1}{M} \sum_{m=1}^{M} \min_{j \neq m} \|\mathbf{z}_m^{P} - \mathbf{z}_j^{P}\|.$$

Using the correlation function, CCP projects |P| genes into a super-gene using FRI for the ith sample,

$$x_i^{P} = \sum_{m=1}^{M} w_{im}\, \Phi^{P}\left(\|\mathbf{z}_i^{P} - \mathbf{z}_m^{P}\|; \eta^{P}, \tau, \kappa\right),$$

where $w_{im}$ are the weights.
For CCP, we compute the LV-gene first, and use the correlated partition algorithm on the remaining genes.
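The order of operations (set aside the low-variance genes first, then cluster the rest) can be sketched as follows. The exact rule that maps the cutoff ratio to the low-variance set is not reproduced here, so this example makes a loud assumption: the fraction `cutoff_ratio` of genes with the smallest variance is treated as the low-variance set. The function name `split_low_variance` and the toy data are hypothetical.

```python
import numpy as np

def split_low_variance(Z, cutoff_ratio=0.8):
    """Split gene indices into (low_variance, remaining) by per-gene variance.

    ASSUMPTION for illustration only: the `cutoff_ratio` fraction of genes
    with the smallest variance forms the low-variance set.
    """
    Z = np.asarray(Z, dtype=float)
    order = np.argsort(Z.var(axis=0))[::-1]              # gene indices, variance descending
    n_keep = int(round((1.0 - cutoff_ratio) * Z.shape[1]))
    return order[n_keep:], order[:n_keep]

# Toy data (hypothetical): 200 cells, 10 genes with very different spreads.
rng = np.random.default_rng(0)
scales = np.array([5.0, 4.0, 0.1, 0.2, 3.0, 0.3, 0.05, 0.15, 0.25, 0.12])
Z = rng.normal(size=(200, 10)) * scales
lv, rest = split_low_variance(Z, cutoff_ratio=0.7)
# Z[:, lv] would be collapsed into a single LV super-gene,
# while Z[:, rest] would go through gene clustering and projection.
```

Under this reading, a larger cutoff ratio moves more genes into the single LV descriptor and leaves fewer genes for the partition step.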
Results
Data preprocessing
We have tested CCP-assisted UMAP and tSNE visualization on 21 publicly available datasets. Table 1 displays information including the Gene Expression Omnibus (GEO) accession ID [51, 52], the reference, data dimensions, and cell composition for each dataset. Additionally, data from the scziDesk paper [53] was utilized and can be accessed from their S1 File. The Qx and Qs data correspond to the 10x Genomics and Smart-seq2 data, respectively, from Quake et al. [54]. Notably, the GSE84133 human dataset encompasses all human patient data from Baron et al. [55]. Detailed statistics for each dataset can be found in S1 Table in the S1 File.
To normalize the data, we scaled the counts of each cell to the median total count per cell. Let the data be an M × I count matrix, with M cells and I genes. Each row (cell) was divided by its row sum, followed by multiplication by the median row sum to obtain a normalized count matrix. Finally, a log-transform was applied using Scanpy’s log1p method.
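The normalization described above amounts to a few array operations. The following NumPy sketch is illustrative only; the function name `normalize_counts` and the tiny count matrix are hypothetical, and the guard for empty cells is an addition of this example.

```python
import numpy as np

def normalize_counts(counts):
    """Scale each cell (row) to the median total count, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)        # total count per cell
    target = np.median(totals)                        # median row sum
    # empty cells (row sum 0) are left as zeros instead of dividing by zero
    scaled = counts / np.where(totals > 0, totals, 1.0) * target
    return np.log1p(scaled)

# Toy count matrix (hypothetical): 3 cells x 3 genes, row sums 4, 8 and 10.
counts = np.array([[1, 3, 0],
                   [2, 2, 4],
                   [0, 5, 5]])
X = normalize_counts(counts)
row_totals = np.expm1(X).sum(axis=1)                  # each row sums to the median, 8
```

In Scanpy, `sc.pp.normalize_total` followed by `sc.pp.log1p` would be the analogous calls; whether the authors used exactly those functions for the scaling step is not stated in the text.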
In our benchmarking process, we employed CCP with parameters τ = 6 and κ = 2 to reduce the dimension to 300 super-genes. Additionally, we used a cutoff ratio of ν = 0.8 to generate the LV-gene. Clustering was performed using the Leiden algorithm, and we evaluated the quality of clustering using ARI, NMI and ECM by comparing the obtained clusters with the cell types provided by the original authors. Visualizations were generated using Scanpy’s implementation of UMAP and tSNE. In order to reduce the computational load for datasets exceeding 2,000 samples, we utilized subsampling. All benchmarking was run on Michigan State University’s high performance computing cluster, which is equipped with AMD EPYC 7H12 processors, using 16 GB of memory and 4 CPU cores.
Visualization
Preprocessing of scRNA-seq data is a key step for visualization. Fig 2 shows an example of CCP-assisted tSNE visualization and the original tSNE visualization of the Baron dataset [55]. The original data has 20,125 genes, and aggressively reducing the original dimension to 2 dimensions by tSNE leads to poor visualization. In CCP-assisted tSNE, CCP was utilized to reduce the original genes into 300 super-genes, which were further reduced to 2 dimensions with tSNE for visualization. It is clear that CCP-assisted tSNE significantly improves the visualization quality in this case. We further showcase CCP-assisted visualization on the datasets described in Table 1. We provide additional comparisons with PCA-assisted and NMF-assisted visualization in Section S2.2 of the S1 File.
The left and right figures show the CCP-assisted and the standard tSNE visualization.
Fig 3 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Quake datasets. Each row corresponds to one of the 5 datasets, and the columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.
The rows correspond to Qx Bladder, Qx Limb Muscle, Qx Diaphragm, Qs Limb Muscle and Qs Trachea. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the Smart-seq2 platform. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.
CCP improves the overall visualization of the Quake datasets. In the Qx Bladder data, CCP-assisted UMAP and tSNE show 1 bladder cluster, whereas the standard UMAP and tSNE visualizations show an elongated cluster of bladder cells. The urothelial cells in CCP-assisted UMAP and tSNE are divided into 3 subclusters, whereas the standard UMAP and tSNE visualizations show 1 cluster. In the Qs Diaphragm data, CCP-assisted UMAP and tSNE show 5 distinct clusters, one for each cell type. However, the standard UMAP visualization does not differentiate the 5 cell types. The standard tSNE visualization shows poor clustering, where satellite cells, mesenchymal cells and endothelial cells form a supercluster. In the Qs Limb Muscle data, all visualizations show a supercluster of B cells and T cells. The CCP-assisted visualizations show a clear distinction between the B-T cell supercluster and macrophages, whereas the standard visualizations show a supercluster of B cells, T cells, macrophages and endothelial cells. In the Qs Trachea data, the standard UMAP and tSNE visualizations show a subpopulation of mesenchymal cells within the epithelial cells, whereas the CCP-assisted counterparts do not.
Fig 4 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the GSE75748 cell, GSE75748 time, GSE67835 and GSE82187 datasets. The columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.
CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.
In the GSE75748 cell data, all the visualizations are similar. In [56], Chu et al. obtained snapshots of lineage-specific progenitor cells that differentiated from H1 human embryonic stem (ES) cells and compared the gene profiles with undifferentiated H1 and H9 human ES cells as control. Most notably, H1 and H9 clustered together, which is consistent with our visualization. In GSE75748 time, all visualizations are comparable. Chu et al. [56] obtained snapshots of ES cell differentiation from pluripotency to definitive endoderm at 0hr, 12hr, 24hr, 36hr, 72hr and 96hr. Chu et al. noted that the cells sequenced at 72hr and 96hr show relatively similar expression profiles, suggesting that the differentiation has completed by 72hr. We see from our visualization that the 72hr and 96hr cells form a cluster, 12hr and 24hr cells form a cluster, and 0hr cells form their own cluster, indicating that there is a clear distinction between the undifferentiated cells and the cells undergoing differentiation. In GSE67835, the CCP-assisted visualizations and their counterparts have comparable results. Most notably, neurons form a distinct cluster in the CCP-assisted visualizations, whereas they do not in the standard visualizations. In the GSE82187 data, CCP-assisted UMAP and tSNE show a significant improvement over standard UMAP and tSNE visualization. Aside from astrocytes and OPC, each cell type forms its own cluster. Standard UMAP and tSNE fail to show significant clustering of the different cell types.
Fig 5 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Baron human dataset [55]. The rows correspond to the patients, and the columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.
Each row corresponds to 1 of the 4 patients. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.
Overall, CCP-assisted visualizations show stronger clustering. In standard UMAP and tSNE visualizations across all patients, we noticed superclusters with unclear boundaries. Conversely, CCP-assisted visualizations display well-defined boundaries between cell types. Most notable is the clear differentiation of quiescent stellate (Q-Stellate) cells, alpha cells, and ductal cells across all patients, a distinction that is not as evident in the standard visualizations. Additionally, the standard tSNE visualization of patient 3 shows the instability of the standard tSNE algorithm, where the visualization does not differentiate the cell types.
Fig 6 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Baron mouse dataset [55]. The rows correspond to mouse 1 and 2, and the columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.
The rows correspond to mouse 1 and 2. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.
CCP-assisted visualizations demonstrate significantly stronger clustering for both mouse samples. In the standard visualizations, beta cells are scattered among other cell types. Furthermore, in the data from mouse 2, alpha cells do not form a distinct cluster. Conversely, CCP-assisted visualizations distinctly cluster all cell types. Regarding mouse 1, the CCP-assisted visualization does not form a cluster for gamma cells, potentially due to the limited number of available gamma cells.
Fig 7 shows the visualization of the Muraro, Romanov and Qs Lung data. The columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE visualization. The samples were colored according to the true cell type.
CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.
In the Muraro dataset, CCP-assisted UMAP exhibits a clear separation of A cells, D cells, B cells, and ductal cells. In contrast, the standard UMAP visualization presents these cells as a supercluster. The standard tSNE visualization indicates the instability of the tSNE algorithm, where the visualization is unclear and dominated by an A cell outlier. Regarding the Romanov dataset, all visualizations are relatively similar. CCP-assisted UMAP reveals a distinct cluster of astrocytes and ependymal cells, whereas both the standard UMAP and tSNE display a supercluster of these two cell types. Additionally, CCP-assisted UMAP and tSNE suggest two subclusters of VSM and endothelial cells, which are not discernible in the standard visualization. In the Qs Lung dataset, CCP-assisted and standard visualizations yield comparable results. While the standard tSNE separates monocytes from classical monocytes, CCP-assisted UMAP and tSNE portray a homogeneous clustering of these two cell types.
Accuracy
To assess CCP’s effectiveness as a primary dimensionality reduction tool for UMAP and tSNE, we conducted clustering using the Leiden algorithm within Scanpy. We employed the adjusted Rand index (ARI) and normalized mutual information (NMI) to gauge accuracy by comparing the clustering results with the labels provided by the dataset’s authors. It is important to note that these metrics do not measure absolute accuracy due to the absence of a gold standard dataset for scRNA-seq. Additionally, we used the Element-Centric measure (ECM) [62] to evaluate cluster stability.
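For readers unfamiliar with the ARI, it compares two labelings through pair counts in their contingency table and corrects the raw agreement for chance. The following standard-library sketch shows the computation; the function name `adjusted_rand_index` is hypothetical, and in practice `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0                                     # no pairs to compare
    pairs = Counter(zip(labels_true, labels_pred))     # contingency table cells
    rows = Counter(labels_true)                        # sizes of the true classes
    cols = Counter(labels_pred)                        # sizes of the predicted clusters
    sum_ij = sum(comb(c, 2) for c in pairs.values())   # pairs grouped together in both
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)              # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                          # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Cluster names do not matter, only the grouping does.
same_grouping = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])   # 1.0
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])                 # -0.5
```

A value of 1 means the clustering reproduces the annotated cell types exactly, values near 0 correspond to chance-level agreement, and negative values indicate agreement worse than chance.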
Fig 8 shows the average ARI, NMI and ECM of CCP-assisted UMAP, CCP-assisted tSNE, UMAP and tSNE across 18 datasets. For each dataset, we used 10 random seeds to perform dimensionality reduction and applied Leiden clustering to generate clustering labels. These labels were then compared to the annotated cell types provided by the original authors.
Ten random initializations were used to compute CCP, CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP and standard tSNE for each dataset. Leiden clustering was used to obtain the clustering results.
CCP-assisted UMAP demonstrates a 22% improvement in ARI, 14% in NMI, and 15% in ECM over standard UMAP. Similarly, CCP-assisted tSNE improves on standard tSNE by 11% in ARI, 9% in NMI and 8% in ECM. Notably, both CCP-assisted UMAP and tSNE yield higher ECM scores, indicating more stable clustering. Interestingly, standard tSNE outperforms standard UMAP. UMAP’s performance relies heavily on accurately finding nearest neighbors, which can be challenging with noisy, sparse, and high-dimensional data. CCP effectively reduces dimensions, enabling UMAP to find neighbors more effectively and resulting in improved visualization.
For a detailed comparison between CCP-assisted, PCA-assisted, and NMF-assisted visualizations, please refer to Section S2.3 in the S1 File. Additionally, we provide the ARI, NMI and ECM for each dataset in S2–S4 Tables of the S1 File.
Discussion
Large data
While CCP proves to be an efficient dimensionality reduction technique for datasets with a large number of features, such as in the case of scRNA-seq data, it may encounter limitations due to the necessity of computing cell-cell correlations for each super-gene. To address this challenge, for larger datasets, we propose a subsampling approach.
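Let $\mathcal{Z}_{\mathrm{train}} = \{\mathbf{z}_m\}_{m=1}^{M_1}$ be the training data used to develop a CCP model, and $\mathcal{Z}_{\mathrm{test}} = \{\mathbf{z}_i\}_{i=1}^{M_2}$ be a new dataset or additional data. Using the training data, the gene partitions $S^n$, the cutoff distance $r_c^{S^n}$ and the scale $\eta^{S^n}$ are determined. Then, we embed new data into the trained model, utilizing the following modification to Eq 5

$$x_i^n = \sum_{m=1}^{M_1} w_{im}\, \Phi^{S^n}\left(\|\mathbf{z}_i^{S^n} - \mathbf{z}_m^{S^n}\|; \eta^{S^n}, \tau, \kappa\right), \quad \mathbf{z}_i \in \mathcal{Z}_{\mathrm{test}},\ \mathbf{z}_m \in \mathcal{Z}_{\mathrm{train}}, \tag{6}$$

to obtain appropriate super-genes.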
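The subsampling scheme can be sketched as a fit/embed pair: the cutoff and scale of every gene cluster are estimated once from the training cells, and each new cell is then scored against the training cells only. This is an illustrative sketch, not the authors' code. It assumes Euclidean distance, unit weights, and the same reading of the cutoff as mean plus three standard deviations of the training pairwise distances; the helper names `pairwise_dist`, `fit_scales`, and `embed` are hypothetical.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def fit_scales(Z_train, gene_sets):
    """Estimate (cutoff, scale) for every gene cluster from training cells only."""
    params = []
    for S in gene_sets:
        D = pairwise_dist(Z_train[:, S], Z_train[:, S])
        M = D.shape[0]
        off = D[~np.eye(M, dtype=bool)]
        r_c = off.mean() + 3.0 * off.std()               # ASSUMED reading of the cutoff
        eta = (D + np.diag(np.full(M, np.inf))).min(axis=1).mean()
        params.append((r_c, eta))
    return params

def embed(Z_new, Z_train, gene_sets, params, tau=6.0, kappa=2.0):
    """Super-genes for new cells; the kernel sums run over the training cells."""
    cols = []
    for S, (r_c, eta) in zip(gene_sets, params):
        D = pairwise_dist(Z_new[:, S], Z_train[:, S])    # new cells x training cells
        C = np.exp(-(D / (eta * tau)) ** kappa)
        C[D >= r_c] = 0.0
        cols.append(C.sum(axis=1))
    return np.column_stack(cols)

# Toy data (hypothetical): 50 cells, 6 genes, 20 cells used for training.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 6))
gene_sets = [[0, 1, 2], [3, 4, 5]]
Z_train, Z_new = Z[:20], Z[20:]
params = fit_scales(Z_train, gene_sets)
X_new = embed(Z_new, Z_train, gene_sets, params)         # shape (30, 2)
X_train = embed(Z_train, Z_train, gene_sets, params)     # training cells themselves
```

The cost of embedding a new cell in this sketch grows with the number of training cells rather than with the size of the full dataset, which is the point of the subsampling approach.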
We verified the subsampling approach on the GSE84133 human and Qx Spleen data. For GSE84133 human, we combined all four patients’ sequencing data into one superset for this analysis. We randomly subsampled 500, 1000, 1500, 2000, 2500, and 3000 samples as training data, and repeated the subsampling under 10 random seeds. We projected the testing data using Eq 6, followed by Leiden clustering. ARI and NMI were computed, and the average scores are reported in Fig 9. Notably, both the GSE84133 human and Qx Spleen datasets exhibited consistent and stable results under varying numbers of subsampled cells.
300 super-genes were generated from CCP, and Leiden clustering was used to obtain the clustering results. (a) ARI and NMI under different subsampling values. Left figure shows the ARI and NMI for GSE84133 Human, where the 4 patient data were combined. Right figure shows the ARI and NMI of Qx Spleen data. (b) CCP-assisted UMAP and tSNE of GSE84133 Human data under different numbers of subsampled cells. (c) CCP-assisted UMAP and tSNE of Qx Spleen data under different numbers of subsampled cells.
Additionally, we show CCP-assisted UMAP and tSNE visualizations for both datasets when subsampling was utilized. Notably, all visualizations were comparable, underscoring the stability of CCP-assisted visualizations even under subsampling. In terms of computation time, the subsampling scheme using 1000 samples took 152 seconds and 160 seconds on the GSE84133 human and Qx Spleen data, respectively. Additional comparisons can be found in the S1 File.
Low variance genes
We have utilized LV-genes to enhance the predictive power of super-genes with a LV gene cluster. By using a high cutoff ratio, we can reduce the number of genes used in the feature partition, potentially resulting in a lower number of super-genes. To assess the impact of the cutoff ratio on the number of super-genes used for UMAP and tSNE visualizations, we conducted tests using GSE82187 and GSE75748 cell data. The discussion for GSE75748 cell data can be found in Section S3.2 of the S1 File.
Fig 10 shows the effect of varying the number of super-genes and the cutoff ratio on the predictive power and visualization of the GSE82187 data. We utilized 10 random seeds to generate CCP super-genes using different numbers of super-genes and cutoff ratios. Subsequently, Leiden clustering was applied to obtain cluster labels, and the ARI was computed utilizing the cell labels provided by the original authors. Notably, across all cutoff ratios, the ARI increases with the number of super-genes, plateauing at a comparable level around 300 super-genes. This indicates the robustness of LV-gene.
(a) ARI of Leiden clustering when the number of super-genes and the cutoff ratio are changed. (b) The number of genes in the LV-gene when νc is changed. (c) Top and bottom rows show the CCP-assisted UMAP and tSNE visualizations, and the columns correspond to νc = 0.6, 0.7, 0.8, 0.9. 300 super-genes were used to initialize UMAP and tSNE, and the samples were colored according to the true cell type.
Fig 10(c) shows the visualization of CCP-assisted UMAP and tSNE at various cutoff ratios. For the visualization, 300 super-genes were utilized, and UMAP and tSNE were applied to the super-genes to reduce the dimension to 2. Samples were then colored according to the cell types provided by the original authors. Note that all the visualizations are comparable, indicating the robustness of LV-gene under different cutoff ratios.
Conclusion
CCP is a nonlinear data-domain dimensionality reduction technique that leverages gene-gene correlations to partition genes, and utilizes cell-cell correlation to generate super-genes. Unlike methods that involve matrix diagonalization, CCP can be directly applied as a primary dimensionality reduction tool to complement traditional visualization techniques like UMAP and tSNE. In our experiments with 18 datasets, CCP-assisted UMAP and CCP-assisted tSNE visualizations consistently outperformed the original UMAP and tSNE. On average, CCP-assisted UMAP improves the standard UMAP visualization by 22% in ARI, 14% in NMI and 15% in ECM, and CCP-assisted tSNE improves standard tSNE by 11% in ARI, 9% in NMI and 8% in ECM. Although the improvement for tSNE visualization is less than the improvement for UMAP, tSNE is sensitive to potential outliers and noise, in which case the visualization can become uninterpretable. CCP-assisted tSNE consistently shows clear visualization in the 21 datasets we have tested. Additionally, CCP-assisted visualization improves upon PCA-assisted and NMF-assisted visualization in the 21 datasets we have tested. However, CCP comes with some disadvantages. For data with no clear gene-gene correlation, CCP will most likely not perform well. Additionally, although utilizing gene clustering removes the complication of computing distances in high dimensions, when the number of samples becomes large, the cell-cell correlation computation becomes time consuming. We show that subsampling via a training set is an effective approach to enable CCP to deal with large data. One possible extension for gene clustering is to incorporate prior information, such as known genes or known gene regulatory pathways, to guide the clustering. Additionally, CCP can also be employed in many other single cell contexts, such as spatial transcriptomics and cell-cell communication, and for initializing deep learning methods.
Supporting information
S1 File. Supporting materials for analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE.
https://doi.org/10.1371/journal.pone.0311791.s001
(PDF)
References
- 1. Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research. 2016;5. pmid:27909575
- 2. Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & molecular medicine. 2018;50(8):1–14. pmid:30089861
- 3. Andrews TS, Kiselev VY, McCarthy D, Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nature protocols. 2021;16(1):1–9. pmid:33288955
- 4. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular systems biology. 2019;15(6):e8746. pmid:31217225
- 5. Chen G, Ning B, Shi T. Single-cell RNA-seq technologies and related computational data analysis. Frontiers in genetics. 2019;10:317. pmid:31024627
- 6. Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Briefings in bioinformatics. 2020;21(4):1209–1223. pmid:31243426
- 7. Li WV, Li JJ. A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics. 2019;35(14):i41–i50. pmid:31510652
- 8.
Dunteman GH. Principal components analysis. vol. 69. Sage; 1989.
- 9.
Lounici K. Sparse Principal Component Analysis with Missing Observations. In: High Dimensional Probability VI: The Banff Volume. Springer; 2013. p. 327–356.
- 10. Zou H, Hastie T, Tibshirani R. Sparse Principal Component Analysis. Journal of computational and graphical statistics. 2006;15(2):265–286.
- 11. Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome biology. 2019;20:1–16. pmid:31870412
- 12. Cottrell S, Wang R, Wei GW. PLPCA: persistent laplacian-enhanced PCA for microarray data analysis. Journal of chemical information and modeling. 2023;64(7):2405–2420. pmid:37738663
- 13. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–3587. pmid:34062119
- 14. Wang YX, Zhang YJ. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering. 2013;25(6):1336–1353.
- 15. Shu Z, Long Q, Zhang L, Yu Z, Wu XJ. Robust Graph Regularized NMF with Dissimilarity and Similarity Constraints for ScRNA-seq Data Clustering. Journal of Chemical Information and Modeling. 2022;62(23):6271–6286. pmid:36459053
- 16. Wu P, An M, Zou HR, Zhong CY, Wang W, Wu CP. A robust semi-supervised NMF model for single cell RNA-seq data. PeerJ. 2020;8:e10091. pmid:33088619
- 17. Lan W, Chen J, Chen Q, Liu J, Wang J, Chen YPP. Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv. 2022; p. 2022–05.
- 18. Xiao Q, Luo J, Liang C, Cai J, Ding P. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–248. pmid:28968779
- 19. Yu N, Gao YL, Liu JX, Wang J, Shang J. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human genomics. 2019;13:1–10. pmid:31639067
- 20. Liu JX, Wang D, Gao YL, Zheng CH, Shang JL, Liu F, et al. A joint-L2,1-norm-constraint-based semi-supervised feature extraction for RNA-Seq data analysis. Neurocomputing. 2017;228:263–269.
- 21. Hozumi Y, Wei GW. Analyzing single cell RNA sequencing with topological nonnegative matrix factorization. Journal of Computational and Applied Mathematics. 2024;445:115842. pmid:38464901
- 22. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature methods. 2018;15(12):1053–1058. pmid:30504886
- 23. Wu H, Zhou H, Zhou B, Wang M. SCMcluster: a high-precision cell clustering algorithm integrating marker gene set with single-cell RNA sequencing data. Briefings in Functional Genomics. 2023;22(4):329–340. pmid:36848584
- 24. Xu J, Xu J, Meng Y, Lu C, Cai L, Zeng X, et al. Graph embedding and Gaussian mixture variational autoencoder network for end-to-end analysis of single-cell RNA sequencing data. Cell Reports Methods. 2023;3(1).
- 25. Sadria M, Layton A. The Power of Two: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis. BioRxiv. 2023; p. 2023–04.
- 26. Palma A, Theis FJ, Lotfollahi M. Predicting cell morphological responses to perturbations using generative modeling. bioRxiv. 2023; p. 2023–07.
- 27. Giansanti V, Giannese F, Botrugno OA, Gandolfi G, Balestrieri C, Antoniotti M, et al. Scalable integration of multiomic single-cell data using generative adversarial networks. Bioinformatics. 2024;40(5):btae300. pmid:38696763
- 28. Kirkegaard JB. Spontaneous breaking of symmetry in overlapping cell instance segmentation using diffusion models. bioRxiv. 2023; p. 2023–07.
- 29. Shen H, Liu J, Hu J, Shen X, Zhang C, Wu D, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. Iscience. 2023;26(5). pmid:37187700
- 30.
Connell W, Khan U, Keiser MJ. A single-cell gene expression language model. arXiv preprint arXiv:221014330. 2022.
- 31. Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence. 2022;4(10):852–866.
- 32. Yu X, Ren J, Long H, Zeng R, Zhang G, Bilal A, et al. iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Frontiers in Genetics. 2024;15:1377285. pmid:38689652
- 33. Chen J, Xu H, Tao W, Chen Z, Zhao Y, Han JDJ. Transformer for one stop interpretable cell type annotation. Nature Communications. 2023;14(1):223. pmid:36641532
- 34. Xu J, Zhang A, Liu F, Chen L, Zhang X. CIForm as a transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Briefings in Bioinformatics. 2023;24(4):bbad195. pmid:37200157
- 35. Jiao L, Wang G, Dai H, Li X, Wang S, Song T. scTransSort: Transformers for intelligent annotation of cell types by gene embeddings. Biomolecules. 2023;13(4):611. pmid:37189359
- 36. Hu H, Feng Z, Lin H, Zhao J, Zhang Y, Xu F, et al. Modeling and analyzing single-cell multimodal data with deep parametric inference. Briefings in Bioinformatics. 2023;24(1):bbad005. pmid:36642414
- 37. Meng R, Yin S, Sun J, Hu H, Zhao Q. scAAGA: Single cell data analysis framework using asymmetric autoencoder with gene attention. Computers in biology and medicine. 2023;165:107414. pmid:37660567
- 38. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nature methods. 2017;14(4):414–416. pmid:28263960
- 39. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nature methods. 2017;14(5):483–486. pmid:28346451
- 40. Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Research. 2022;50(18):10278–10289. pmid:36161334
- 41. Zhang P, Wu Y, Zhou H, Zhou B, Zhang H, Wu H. CLNN-loop: a deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types. Bioinformatics. 2022;38(19):4497–4504. pmid:35997565
- 42. Zhang P, Wu H. Ichrom-deep: an attention-based deep learning model for identifying chromatin interactions. IEEE Journal of Biomedical and Health Informatics. 2023;. pmid:37402191
- 43. Hu H, Feng Z, Lin H, Cheng J, Lyu J, Zhang Y, et al. Gene function and cell surface protein association analysis based on single-cell multiomics data. Computers in Biology and Medicine. 2023;157:106733. pmid:36924730
- 44. Feng X, Xiu YH, Long HX, Wang ZT, Bilal A, Yang LM. Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network. Briefings in Bioinformatics. 2024;25(1):bbad481.
- 45.
McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:180203426. 2018.
- 46. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature communications. 2019;10(1):5416. pmid:31780648
- 47. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine learning research;9(11):2579–2605.
- 48.
Hozumi Y, Wang R, Wei GW. CCP: Correlated clustering and projection for dimensionality reduction. arXiv preprint arXiv:220604189. 2022.
- 49. Hozumi Y, Tanemura KA, Wei GW. Preprocessing of single cell RNA sequencing data using correlated clustering and projection. Journal of chemical information and modeling. 2023;64(7):2829–2838. pmid:37402705
- 50. Xia K, Opron K, Wei GW. Multiscale multiphysics and multidomain models—Flexibility and rigidity. The Journal of chemical physics. 2013;139(19). pmid:24320318
- 51. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002;30(1):207–210. pmid:11752295
- 52. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic acids research. 2012;41(D1):D991–D995. pmid:23193258
- 53. Chen L, Wang W, Zhai Y, Deng M. Deep soft K-means clustering with self-training for single-cell RNA sequence data. NAR genomics and bioinformatics. 2020;2(2):lqaa039. pmid:33575592
- 54. Schaum N, Karkanias J, Neff NF, May AP, Quake SR, Wyss-Coray T, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: The Tabula Muris Consortium. Nature. 2018;562(7727):367–372.
- 55. Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell systems. 2016;3(4):346–360. pmid:27667365
- 56. Chu LF, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology. 2016;17:1–20. pmid:27534536
- 57. Gokce O, Stanley GM, Treutlein B, Neff NF, Camp JG, Malenka RC, et al. Cellular taxonomy of the mouse striatum as revealed by single-cell RNA-seq. Cell reports. 2016;16(4):1126–1137. pmid:27425622
- 58. Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, et al. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences. 2015;112(23):7285–7290. pmid:26060301
- 59. Muraro MJ, Dharmadhikari G, Grün D, Groen N, Dielen T, Jansen E, et al. A single-cell transcriptome atlas of the human pancreas. Cell systems. 2016;3(4):385–394. pmid:27693023
- 60. Romanov RA, Zeisel A, Bakker J, Girach F, Hellysaz A, Tomer R, et al. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nature neuroscience. 2017;20(2):176–188. pmid:27991900
- 61. Gideon HP, Hughes TK, Tzouanas CN, Wadsworth MH, Tu AA, Gierahn TM, et al. Multimodal profiling of lung granulomas in macaques reveals cellular correlates of tuberculosis control. Immunity. 2022;55(5):827–846. pmid:35483355
- 62. Gates AJ, Wood IB, Hetrick WP, Ahn YY. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific reports. 2019;9(1):8574. pmid:31189888