Attention-based deep clustering method for scRNA-seq cell type identification

Single-cell RNA sequencing (scRNA-seq) technology provides higher resolution of cellular differences than bulk RNA sequencing and reveals heterogeneity in biological research. The analysis of scRNA-seq datasets is premised on subpopulation assignment. When an appropriate reference, such as specific marker genes or a single-cell reference atlas, is not available, unsupervised clustering approaches become the predominant option. However, the inherent sparsity and high dimensionality of scRNA-seq datasets pose specific analytical challenges to traditional clustering methods. Various deep learning-based methods have therefore been proposed to address these challenges. Because each method improves only part of the pipeline, a comprehensive method is needed. In this article, we propose a novel scRNA-seq data clustering method named AttentionAE-sc (Attention fusion AutoEncoder for single-cell). Two different scRNA-seq clustering strategies are combined through an attention mechanism: zero-inflated negative binomial (ZINB)-based methods, which deal with the impact of dropout events, and graph autoencoder (GAE)-based methods, which rely on information from neighbors to guide dimension reduction. Based on an iterative fusion between denoising and topological embeddings, AttentionAE-sc readily acquires clustering-friendly cell representations in which similar cells lie closer together in the hidden embedding. Compared with several state-of-the-art baseline methods, AttentionAE-sc demonstrated excellent clustering performance on 16 real scRNA-seq datasets without the need to specify the number of groups. Additionally, AttentionAE-sc learned improved cell representations and exhibited enhanced stability and robustness. Furthermore, AttentionAE-sc achieved remarkable cell type identification on a breast cancer single-cell atlas dataset and provided valuable insights into the heterogeneity among different cell subtypes.


Introduction
Single-cell RNA sequencing (scRNA-seq) techniques facilitate exploring the heterogeneity and diversity of cells at the cellular level, providing more detailed mechanistic insight for a variety of biological studies [1,2]. Accurate identification of cell subpopulations is essential for many downstream analyses, and several auxiliary tools have been constructed [3,4]. The basic cell subpopulation assignment capability of these tools is based on unsupervised clustering, which is a feasible solution for scRNA-seq datasets lacking a cell atlas [5]. However, major computational challenges in scRNA-seq analysis include high sparsity, high dimensionality and batch effects [6,7]. As a result, more and more clustering methods have been developed for cell group assignment that jointly deal with the problems mentioned above [8,9,10,11], but an optimal approach has not yet been proposed.
In the last few years, various clustering methods from other fields have been applied to scRNA-seq datasets. For example, RaceID [12] applied K-means to the clustering analysis of scRNA-seq datasets for group identification and introduced outlier detection to improve the ability of K-means to recognize rare cells. However, K-means inevitably converges to local minima because the search for cluster centers is sensitive to initialization. SC3 [13] overcame this problem by integrating multiple results after running K-means repeatedly. Another type of method, supported by various scRNA-seq data analysis tools [4], is based on community detection, such as the Louvain or Leiden algorithm [14,15], which iteratively aggregates the nodes of a cell network into multiple communities and gradually derives the categories of all cells. The major advantage of these community detection-based algorithms over other methods is that they do not need the number of groups, which makes them more suitable for scRNA-seq datasets with scarce prior knowledge.
To efficiently integrate single-cell clustering with dimension reduction, data denoising and batch effect removal, various deep learning-based tools have been proposed. DCA [16] introduced a denoising autoencoder (DAE) to denoise and impute the original scRNA-seq data by minimizing a zero-inflated negative binomial (ZINB) distribution loss. Additionally, scDeepCluster [9] incorporated deep embedding clustering (DEC) [17] to concurrently learn feature representations and optimize clustering performance; DEC is a self-optimizing clustering method that provides soft labels for cells. Moreover, DESC [18] was composed of stacked autoencoders and DEC. Its authors discussed the biologically interpretable soft assignment performed by DEC and further analyzed the effectiveness of soft clustering for batch effect removal. scVI (single-cell variational inference) [19] approximated the ZINB distribution of expression values, and a comprehensive tool called scvi-tools [20] was proposed to integrate multiple methods for various tasks in single-cell dataset analysis. However, these methods do not directly employ the interactions among cells when extracting features for each cell. To explicitly incorporate cell-to-cell information, graph neural networks (GNNs) have been employed in scRNA-seq clustering analysis [10,21] to extract topological embeddings of cells, which capture relationships among cells and can significantly enhance the learning of clustering-friendly representations [11].
Unlike the previous methods, GNN-based methods need to construct the topological information between cells manually in the absence of prior knowledge. Two types of graphs are mainly used: the cell-gene graph and the cell-cell graph. scDeepSort [22] proposed a strategy to construct a cell-gene graph using gene expression values as edges, and a weighted GNN-based model was constructed for cell annotation. graph-sc [21] used a similar graph construction strategy together with a graph autoencoder (GAE) [23] to perform clustering analysis of scRNA-seq data. In contrast, scGNN [10] used a cell-cell graph, representing cells as graph nodes only; the connections among cells were obtained by multi-modal autoencoders in an iterative way. Like graph-sc, scGNN extracted the output features of the graph encoder, i.e., the representation of cells in the low-dimensional hidden space, and obtained the clustering results via separate K-means or community detection algorithms. scGAC [11] also constructed a cell-cell graph and adopted a self-optimizing method to simultaneously learn representations and optimize clustering: the topological embeddings of cells were extracted by a graph attention network-based GAE, and the optimized clustering results were obtained by DEC. Additionally, SCEA utilized a multilayer perceptron (MLP)-based encoder for dimension reduction prior to employing a similar topological representation learning strategy, achieving superior performance [24].
Motivated by the research above, we propose an attention mechanism-based scRNA-seq clustering method named AttentionAE-sc (Attention fusion AutoEncoder for single-cell). As shown in Fig 1, a DAE and a GAE are trained simultaneously to learn cellular embeddings. Direct use of the topological embedding in the clustering phase is easily affected by noisy edges. Therefore, an information fusion block based on a multi-head attention mechanism is proposed to combine the topological and denoising information and reconstruct the relationships among cells (S1 Fig). By simultaneously learning different embeddings from scRNA-seq datasets, the information fusion blocks are iteratively optimized to learn clustering-friendly representations in which similar cells lie closer together in the hidden embedding, enabling better clustering analysis. To obtain a soft subpopulation partition, a self-optimizing clustering process is conducted through DEC. By initializing the cluster centers with the Leiden algorithm, we can obtain the allocation labels of cells without specifying the number of groups. AttentionAE-sc achieved excellent performance on 16 scRNA-seq datasets. First, AttentionAE-sc consistently outperformed other methods based on community detection in terms of clustering performance; notably, it even achieved superior results compared to several K-means-based methods that require the number of groups to be specified in advance. Second, AttentionAE-sc learned a more clustering-friendly representation than other DEC-based methods (or methods without DEC), showcasing its robustness and stability under random perturbations such as down-sampling or manual dropout. Finally, we applied AttentionAE-sc to a large breast cancer single-cell atlas dataset, obtaining promising results in terms of the similarity between the clustering outcomes and the real cell types, as well as the biological significance captured by the predicted labels.

Data preprocessing
The preprocessing was implemented with the scanpy package [4]. Given a raw count matrix $R^{N \times M}$ (N cells and M genes), cells or genes without any counts are filtered out (n cells remain). After per-cell normalization, the data are smoothed by simple scaling. Only the most variable genes (top 2,500) are retained as the gene expression matrix $X^{n \times 2500}$, which is used for cell-cell graph construction; the corresponding original count matrix $R^{0}_{n \times 2500}$ is used for DAE training. Finally, $X^{n \times m}$ is transformed into z-score data $X_{z\text{-}score}$ so that each selected gene has zero mean and unit variance. In addition, the adjacency matrix $A^{n \times n}$, the other input of AttentionAE-sc, is calculated by measuring the distance $d_{ij}$ between cells (the distance between cell i and cell j) with a Gaussian kernel, which is a decreasing function of $d_{ij}$:

$$A_{ij} = \exp\left(-\frac{d_{ij}^{2}}{2\sigma_{i}^{2}}\right) \qquad (1)$$

where $\sigma_{i}$ is used to adjust smoothness and is set to the distance between cell i and its k-th nearest neighbor (as in K-nearest neighbors) in the distance matrix $d_{ij}$. k was set to the default value in the scanpy package (15). $A^{n \times n}$ is used in the message passing process of the GAE.
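For illustration, the steps above can be reproduced with standard scanpy calls. The following is a minimal sketch under the stated settings (top 2,500 genes, k = 15, Gaussian-kernel connectivity); the input file name and the PCA step before graph construction are our assumptions, not necessarily the authors' exact pipeline.

```python
# A minimal preprocessing sketch, assuming standard scanpy defaults.
import scanpy as sc

adata = sc.read_h5ad("counts.h5ad")      # hypothetical input file

# Filter cells/genes without any counts.
sc.pp.filter_cells(adata, min_counts=1)
sc.pp.filter_genes(adata, min_counts=1)

adata.layers["raw"] = adata.X.copy()     # keep raw counts for the ZINB loss

# Per-cell normalization followed by simple (log) scaling.
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Keep the top 2,500 highly variable genes.
sc.pp.highly_variable_genes(adata, n_top_genes=2500, subset=True)

# z-score each gene (zero mean, unit variance).
sc.pp.scale(adata)

# Gaussian-kernel cell-cell graph with adaptive width (k = 15 neighbors);
# PCA before neighbor search is an assumption of this sketch.
sc.pp.pca(adata)
sc.pp.neighbors(adata, n_neighbors=15, method="gauss")
A = adata.obsp["connectivities"]         # adjacency matrix used by the GAE
```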

Graph autoencoder
Topological information between cells can be obtained by a GAE [23]. In a classical GAE, the encoder is composed of GNN layers that produce node embeddings, and the decoder is trained to reconstruct the adjacency matrix by producing a soft reconstruction matrix. In this work, a graph convolutional network (GCN) [25] was used as the encoder unit; the update of the cell embedding $H^{l}$ in the l-th layer follows:

$$H^{l} = \tanh\left(\tilde{A} H^{l-1} W^{l-1}\right) \qquad (2)$$

where $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix and D is the degree matrix of A. Tanh is the hyperbolic tangent activation function, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, and $W^{l-1}$ is the learnable parameter of each encoder layer.
A dot-product reconstruction (3) constitutes the decoder: the reconstructed adjacency matrix $\hat{A}$ is obtained from the encoder outputs $H^{l}$. Finally, the mean squared error (MSE) between the reconstructed graph and the original graph is used as the loss function (4):

$$\hat{A} = \mathrm{sigmoid}\left(H^{l} (H^{l})^{T}\right) \qquad (3)$$

$$L_{GAE} = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(A_{ij} - \hat{A}_{ij}\right)^{2} \qquad (4)$$
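For concreteness, the following is a minimal PyTorch sketch of a two-layer GAE matching Eqs. (2)-(4); layer names and sizes are illustrative, not the authors' implementation.

```python
# A minimal GAE sketch: GCN encoder with tanh activations (Eq. 2),
# dot-product decoder (Eq. 3), and MSE graph reconstruction loss (Eq. 4).
import torch
import torch.nn as nn

class GAE(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x, a_norm):
        # a_norm is the symmetrically normalized adjacency D^{-1/2} A D^{-1/2}.
        h1 = torch.tanh(a_norm @ self.w1(x))    # Eq. (2), layer 1
        h2 = torch.tanh(a_norm @ self.w2(h1))   # Eq. (2), layer 2
        a_rec = torch.sigmoid(h2 @ h2.T)        # Eq. (3), dot-product decoder
        return h1, h2, a_rec

def gae_loss(a, a_rec):
    return nn.functional.mse_loss(a_rec, a)     # Eq. (4)
```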

Denoising autoencoder
The ZINB or NB distribution is a common approximation of the distribution of scRNA-seq data [16]. A ZINB-based autoencoder has been applied to extract denoising embeddings and to handle the discreteness and over-dispersion of scRNA-seq datasets:

$$\mathrm{NB}(x; \mu, \theta) = \frac{\Gamma(x + \theta)}{x!\,\Gamma(\theta)} \left(\frac{\theta}{\theta + \mu}\right)^{\theta} \left(\frac{\mu}{\theta + \mu}\right)^{x}$$

$$\mathrm{ZINB}(x; \pi, \mu, \theta) = \pi\,\delta_{0}(x) + (1 - \pi)\,\mathrm{NB}(x; \mu, \theta)$$

where $R^{0}_{n \times m}$, defined in the data preprocessing step, represents the raw counts corresponding to the input matrix; the mean (μ) and dispersion (θ) are the parameters of the negative binomial distribution; and π is the probability that a zero in the raw count is a dropout event.
The encoder accepts $X_{z\text{-}score}$ as input, and its output layer produces the reduced-dimension latent embedding Z. The decoder subnetwork, conversely, takes Z as input and produces a reconstruction of the original input $X_{z\text{-}score}$. The encoder and decoder are both multilayer perceptron (MLP)-based neural networks. Three different fully connected layers are then used to fit the mean (μ), dispersion (θ) and dropout probability (π), respectively:

$$Z = \mathrm{Encoder}\left(X_{z\text{-}score} + \epsilon\right), \qquad D = \mathrm{Decoder}(Z)$$

$$\pi = \mathrm{sigmoid}\left(W_{\pi} D\right), \quad \mu = \mathrm{Diag}[s_{i}] \times \exp\left(W_{\mu} D\right), \quad \theta = \exp\left(W_{\theta} D\right)$$

where ε represents random Gaussian noise (zero mean, 0.01 variance). The three fully connected layers use different nonlinear activations: the dropout probability (π) must lie in (0, 1), so a sigmoid is used for standardization, while the mean (μ) and dispersion (θ) are non-negative, so the exponential function is chosen. To prevent overfitting, the mean (μ) is rescaled to match the raw count size of each cell (the Diag[·] operation, with $s_{i}$ the size factor of cell i). All W denote learnable parameters. Finally, the loss function of the DAE is the negative log-likelihood of the ZINB distribution:

$$L_{ZINB} = -\log \mathrm{ZINB}\left(R^{0} \mid \pi, \mu, \theta\right)$$
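The following is a minimal PyTorch sketch of the three ZINB output heads and the negative log-likelihood. The module names, the `size` argument (the per-cell library size used by the Diag[·] rescaling) and the numerical-stability constant are our assumptions, not the published code.

```python
# A sketch of the ZINB output heads and loss, assuming `d` is the last
# hidden layer of an MLP decoder and `size` is the per-cell library size.
import torch
import torch.nn as nn

class ZINBHead(nn.Module):
    def __init__(self, hid_dim, n_genes):
        super().__init__()
        self.pi = nn.Linear(hid_dim, n_genes)
        self.mu = nn.Linear(hid_dim, n_genes)
        self.theta = nn.Linear(hid_dim, n_genes)

    def forward(self, d, size):
        pi = torch.sigmoid(self.pi(d))                  # dropout prob in (0, 1)
        mu = size.unsqueeze(1) * torch.exp(self.mu(d))  # non-negative, size-scaled mean
        theta = torch.exp(self.theta(d))                # non-negative dispersion
        return pi, mu, theta

def zinb_nll(x, pi, mu, theta, eps=1e-8):
    # Negative log-likelihood of the ZINB distribution for raw counts x.
    nb_log = (torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
              + theta * torch.log(theta / (theta + mu) + eps)
              + x * torch.log(mu / (theta + mu) + eps))
    zero_case = torch.log(pi + (1 - pi) * torch.pow(theta / (theta + mu), theta) + eps)
    nonzero_case = torch.log(1 - pi + eps) + nb_log
    return -torch.where(x < 1e-8, zero_case, nonzero_case).mean()
```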

Information fusion block
The attention mechanism has shown its effectiveness in natural language processing and computer vision [26,27]. In an attention layer, a global attention score is calculated to guide feature extraction based on the correlations among samples. Therefore, we construct an information fusion block that merges the denoising embedding and the topological embedding through a multi-head attention mechanism:

$$Q_{m}^{l} = H^{l} W_{m}^{Q}, \quad K_{m}^{l} = E^{l} W_{m}^{K}, \quad V_{m}^{l} = E^{l} W_{m}^{V}$$

$$\mathrm{head}_{m} = \mathrm{softmax}\left(\frac{Q_{m}^{l} \left(K_{m}^{l}\right)^{T}}{\sqrt{d}}\right) V_{m}^{l}$$

$$R^{l} = \mathrm{Concat}\left(\mathrm{head}_{1}, \dots, \mathrm{head}_{M}\right) W$$

where the queries $Q_{m}^{l}$ are calculated from the topological embedding $H^{l}$ of the GAE, and the keys $K_{m}^{l}$ and values $V_{m}^{l}$ are calculated from the denoising embedding $E^{l}$ of the DAE. The attention score is computed by the dot-product attention mechanism, and M is the number of heads. All $W$ and $W_{m}$ denote learnable parameters. In AttentionAE-sc, two information fusion blocks are adopted, so two outputs $R^{l}$ are calculated successively from the l-th embeddings of the GAE and DAE. The output $R^{1}$ is used as the input of the next encoder layer (a fully connected layer followed by ReLU activation), and the output $R^{2}$ is used as the feature representation (embedding Z) in the clustering stage.
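The fusion block can be sketched with PyTorch's built-in multi-head attention, with queries drawn from the GAE embedding and keys/values from the DAE embedding. The residual connection mirrors the one discussed in the ablation study below, but the details here are illustrative, not the authors' exact block.

```python
# A minimal sketch of one information fusion block, assuming queries from
# the GAE embedding H and keys/values from the DAE embedding E.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, dim, n_heads=8):
        super().__init__()
        # dim must be divisible by n_heads (e.g., dim=64, n_heads=8).
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, h, e):
        # h: topological embedding (GAE), e: denoising embedding (DAE);
        # both (n_cells, dim), treated as one sequence of cell tokens.
        r, _ = self.attn(h.unsqueeze(0), e.unsqueeze(0), e.unsqueeze(0))
        # Residual connection to the denoising embedding (our assumption,
        # based on the "wo res" ablation described later).
        return r.squeeze(0) + e
```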

Deep embedding clustering
In the clustering stage, DEC [17] is applied to carry out self-optimizing soft clustering. The soft clustering distribution Q is defined as the probability of each cell belonging to each cluster center, measured by the Student's t distribution:

$$q_{ij} = \frac{\left(1 + \|z_{i} - \kappa_{j}\|^{2}\right)^{-1}}{\sum_{j'} \left(1 + \|z_{i} - \kappa_{j'}\|^{2}\right)^{-1}}$$

where $\kappa_{j}$ (j = 1, 2, ..., k) are the cluster centers, obtained by the Leiden algorithm [15] with the pretrained embedding Z as input. The high-confidence target distribution P is constructed as follows:

$$p_{ij} = \frac{q_{ij}^{2} / f_{j}}{\sum_{j'} q_{ij'}^{2} / f_{j'}}, \qquad f_{j} = \sum_{i} q_{ij}$$

The clustering optimization objective is the KL divergence between the soft clustering distribution Q and the target distribution P:

$$L_{C} = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

In summary, the total loss function is:

$$L = L_{ZINB} + r_{1} L_{GAE} + r_{2} L_{C}$$

where $r_{1}$ and $r_{2}$ are factors that balance the multi-objective optimization (default 0.1). The clustering process is only carried out in the fine-tuning stage. AttentionAE-sc was implemented in Python 3.8. The dimensions of all hidden layers were set to 256, 64, 16, 64 and 256. Two multi-head attention blocks were used, each with 8 heads. To reduce the training time, a dimension-reduction linear layer was shared by the DAE and GAE. Adam was used as the optimizer during training and fine-tuning, with a learning rate of 0.001 and gradient clipping with an L2-norm maximum of 3. In the fine-tuning stage, training stops early if fewer than 1/1000 of the labels change between iterations.
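The DEC equations above translate directly into a few lines of PyTorch. This sketch assumes the cluster centers have already been initialized (e.g., by Leiden on the pretrained embedding) and is not the authors' exact training loop.

```python
# A sketch of the DEC-style self-optimizing step: Student's t soft
# assignments Q, sharpened target P, and the KL clustering loss.
import torch

def soft_assign(z, centers, alpha=1.0):
    # q_ij: probability of cell i belonging to cluster j (Student's t kernel).
    dist2 = torch.cdist(z, centers) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # p_ij: high-confidence target that emphasizes confident assignments.
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def dec_loss(q, p):
    return (p * torch.log(p / q)).sum(dim=1).mean()   # KL(P || Q)
```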

Evaluation metric and baseline methods
In this section, we tested AttentionAE-sc on 16 scRNA-seq datasets [28,29,30,31,32,33] from different organs and platforms (containing 870-9,552 individual cells and 4-11 cell groups). Detailed information is presented in Table 1. We used the labels provided by these datasets as the ground truth, and the adjusted Rand index (ARI) [34] and normalized mutual information (NMI) [35] served as metrics for evaluating model performance; they compare the similarity between the predicted labels and the ground truth from different points of view, and higher values indicate more accurate clustering results. When multiple methods employ the same clustering strategy, such as community detection, the quality of their clustering distributions can additionally be evaluated with internal metrics such as the silhouette score [36] and the Davies-Bouldin score [37]. For internal metrics, better clustering corresponds to greater distances between predicted clusters and smaller distances within each cluster, which yields a larger silhouette score and a smaller Davies-Bouldin score.
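All four metrics are available in scikit-learn. Assuming ground-truth labels `labels_true`, predicted labels `labels_pred` and a learned cell embedding `z`, they can be computed as follows.

```python
# Computing the external (ARI, NMI) and internal (silhouette,
# Davies-Bouldin) metrics with scikit-learn.
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             silhouette_score, davies_bouldin_score)

ari = adjusted_rand_score(labels_true, labels_pred)          # external, higher is better
nmi = normalized_mutual_info_score(labels_true, labels_pred) # external, higher is better
sil = silhouette_score(z, labels_pred)                       # internal, higher is better
dbs = davies_bouldin_score(z, labels_pred)                   # internal, lower is better
```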
We selected several state-of-the-art scRNA-seq clustering methods as baselines, including DESC [18], Leiden (scanpy) [4,15], SC3 [13], scGNN [10], scGAC [11], graph-sc [21], scDeepCluster [9], scvi-tools [20], and SCEA [24]. Among these methods, scGAC, scDeepCluster and SCEA need the number of clusters to be specified when using the K-means algorithm to initialize the cluster centers. graph-sc supports both K-means and Leiden, but we adopted the results obtained with the Leiden algorithm, considering that methods that do not require the number of cell groups are more suitable for unsupervised scRNA-seq clustering analysis. For SC3, we employed the Tracy-Widom theory on random matrices to estimate the optimal number of clusters before executing all steps of the SC3 analysis. See S1 Table for additional details of the baseline methods.

AttentionAE-sc achieved excellent clustering results
The experimental results are shown in Figs 2 and S2. On 12 real scRNA-seq datasets, AttentionAE-sc ranked at the top. On the remaining datasets, its performance was still among the best and consistently better than the community detection-based methods (scGNN, graph-sc, DESC, Leiden, scvi-tools). Like these methods, AttentionAE-sc avoids manually setting the number of predicted clusters, yet it achieved excellent results: about 60%-90% higher ARI scores and 15%-36% higher NMI scores. An intriguing observation is that methods relying on a known number of cell groups yielded better external metric scores when there were fewer actual cell groups, particularly in datasets with fewer than 6 cell groups, whereas the other methods tended to partition these datasets into more predicted cell groups. Nonetheless, the overall performance of AttentionAE-sc remained superior to that of the K-means-based methods, among which SCEA performed best. scGAC, which also combines a GAE with self-optimizing DEC, was less satisfactory across most datasets. Compared with scDeepCluster, a classical ZINB-based model, AttentionAE-sc obtained better results because it leverages the interactions among cells during representation learning. By effectively combining the fundamental frameworks of these two approaches, AttentionAE-sc acquired outstanding results. SC3, however, did not perform as well as the deep learning-based methods, often overestimating the cluster count when using random matrices and producing null values on larger datasets (10x spleen). In general, AttentionAE-sc achieved competitive performance without specifying the number of clusters.
To analyze the quality of cell representations for clustering, we visualized the latent embeddings of AttentionAE-sc and used the internal metrics (S2B and S2C Fig) to measure clustering performance. Compared with four other state-of-the-art community detection-based methods (Fig 3A), AttentionAE-sc achieved the best performance. Among the baselines, graph-sc and scGNN, like AttentionAE-sc, employed a GAE for feature extraction, but they obtained suboptimal performance because they split representation learning and clustering into separate tasks. By contrast, DESC and AttentionAE-sc obtained higher silhouette scores because their soft clustering strategies simultaneously optimize feature extraction and clustering allocation; AttentionAE-sc additionally considers the similarity among cells and directly applies the topological embedding to construct the cell representation through a fusion attention layer. The Leiden algorithm (via scanpy) performed worse with raw gene expression matrices as input. The advantages of the embeddings learned by AttentionAE-sc can be perceived intuitively through UMAP visualization (Fig 3B): AttentionAE-sc yielded greater separation between groups and tighter structure within them. S3 Fig shows UMAP visualizations of the other scRNA-seq datasets, reflecting the clustering and representation performance of the five methods, including AttentionAE-sc, more intuitively.

AttentionAE-sc constructed better relationships among cells
AttentionAE-sc constructs relationships among cells through the attention-based information fusion blocks. In each iteration of the clustering stage, cells receive information from other cells according to the attention mechanism, so the cluster centers are adaptively chosen and redundant cluster centers are discarded (Figs 4A and S4), since similar cells lie closer in the hidden space and are more easily assigned to the same cluster. For verification, we used the outputs of the information fusion blocks as dimension-reduced features to calculate the relative distances between cells, and estimated the influence of the information fusion block under the assumption that the output of the attention layer retains the relationships among cells extracted by AttentionAE-sc. Comparing the relative distance visualization of AttentionAE-sc (Fig 4C) with those of the raw expression matrix (Fig 4D) and of features extracted without the information fusion block (Fig 4E), cells in the same group are clearly distributed closer together and are more distinct from cells in different groups under the embeddings learned by AttentionAE-sc. To gain further insight into the interpretability of the attention blocks, we directly visualized the weights of the attention network, which represent the attention scores between each pair of cells (S5 Fig). Similar global attention information was consistently transmitted among cells of the same cell type, reaffirming the results observed in Fig 4C. From another perspective, each attention head appeared to focus on learning the attention scores of specific subsets of cells, indicating that the multi-head attention mechanism can effectively capture the distinctions between cell groups at the transcriptome scale. In summary, AttentionAE-sc learned a more clustering-friendly cell representation and thereby achieved strong clustering results through the iterative clustering process.

AttentionAE-sc exhibited great stability and robustness
We tested the robustness and stability of AttentionAE-sc through manual dropout and down-sampling of the input. For the dropout experiment, 10%-50% of the gene expression values in the real scRNA-seq datasets were randomly set to zero, and the impaired gene expression matrices were used to compare the clustering performance of AttentionAE-sc with that of a plain DAE. For the down-sampling experiment, 20%-80% of the cells in the real scRNA-seq datasets were randomly retained for clustering. Selecting a subset of sampled cells at random achieved performance comparable to using the entire datasets (S6B Fig).
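Both perturbations can be sketched in a few lines of NumPy; the rates and random seed below are illustrative, and `X` is assumed to be a dense cells-by-genes count matrix.

```python
# Sketches of the two perturbations: manual dropout and cell down-sampling.
import numpy as np

rng = np.random.default_rng(0)

def manual_dropout(X, rate=0.3):
    # Randomly zero out a fraction of expression values (10%-50% in the paper).
    mask = rng.random(X.shape) < rate
    return np.where(mask, 0, X)

def down_sample(X, frac=0.5):
    # Randomly keep a fraction of the cells (20%-80% in the paper).
    idx = rng.choice(X.shape[0], size=int(frac * X.shape[0]), replace=False)
    return X[idx]
```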

Model ablation experiment
We conducted a hyperparameter search to establish a baseline model prior to the ablation experiment. To construct the cell-cell graph, cell connectivity was calculated using a Gaussian kernel following the diffusion maps approach [38] with adaptive width [39]; the computation was implemented with the scanpy package [4]. While we explored other methods integrated into the package, such as UMAP [40], the 'gauss' method consistently outperformed the others (Fig 5A). Other hyperparameters that significantly impact the performance of AttentionAE-sc include the number of attention heads in the information fusion blocks, the resolution parameter of the Leiden algorithm, and the number of highly variable genes (Figs 5B-5D and S7). The optimal number of attention heads is 8: too few heads result in inadequate learning of cell interactions, whereas too many lead to redundant model parameters. The resolution of Leiden primarily governs the number of clusters generated by the algorithm, with larger resolution values yielding more clusters. Our results indicate that optimal performance falls within the range 0.1 to 1, so we selected a resolution of 1 for our experiments. Highly variable genes, characterized by their large variance within a dataset, are typically considered candidate genes for differential expression analysis; consequently, before clustering single-cell data, it is advisable to preselect a subset of highly variable genes as features. Our findings suggest that using 2,500 highly variable genes yields the best results, as too few or redundant features can negatively impact the clustering outcomes.
To determine the importance of each component of AttentionAE-sc, we compared the performance of AttentionAE-sc without the attention-based information fusion block (wo attn), without the ZINB loss function (wo zinb), without the residual connection (wo res) and without the GNN (wo gnn). As shown in Figs 5E and S8, the information fusion block significantly improves clustering performance; when the attention layers were substituted with a multilayer perceptron, performance deteriorated. It is also worth noting that AttentionAE-sc achieved results comparable to the ZINB-based DAE on several datasets when an L2 norm-based DAE (wo zinb) was used, so AttentionAE-sc can also perform well without the ZINB distribution assumption. Nevertheless, the ZINB hypothesis enables the model to adapt to a broader range of datasets and enhances its generalization capability. Furthermore, the residual connection in front of the attention layer typically expedites model convergence and mitigates the risk of overfitting; poor performance was observed when the denoising embedding was not connected to the cell representation (wo res). Finally, we tested the model without the GNN, where the keys and values of the attention layer were derived from the embedding obtained from the DAE. The results demonstrated that the absence of GAE-based learning of cellular interactions led to a decline in model performance. We therefore infer that the excellent performance of AttentionAE-sc is attributable to the information fusion block, which iteratively learns the relations among cells to obtain a clustering-friendly representation.

Experiment in BRCA dataset
To further evaluate the performance of AttentionAE-sc, we analyzed a breast cancer (BRCA) single-cell atlas [41], consisting of 35,276 single cells derived from 32 cell lines. The results are presented in Figs 6 and S9. After obtaining the predicted cluster labels and the cell representations from AttentionAE-sc, we generated two-dimensional visualizations with UMAP (Fig 6A). The predicted clusters closely match the ground truth (Fig 6B), and the vast majority of cells from the same cell line were assigned to the same class. Additionally, through differential expression analysis between the predicted clusters, we identified a set of differentially expressed genes (DEGs) that can serve as marker genes; these marker genes effectively assign cells to the corresponding cell line types (Fig 6C). However, some unexpected cluster assignments emerged. For instance, cells from two different cell lines, KPL1 and MCF7, were predicted to belong to the same group (cluster 18). One of the identified DEGs, BCAS3, has been reported as a biomarker for BRCA [42,43] and, interestingly, exhibits similar specific expression in both of these cell lines (Fig 6D). Moreover, several studies have observed that KPL1 is a sub-clonal cell line derived from MCF7 [44,45]. We subsequently performed sub-clustering within cluster 18 and observed a striking separation of cells from these two cell lines (Fig 6E). To investigate the variation in AttentionAE-sc's ability to distinguish cells across datasets of different scales, we analyzed the differences in input features (S9A Fig). Indeed, a major difference between the clustering stage and the sub-clustering stage can be attributed to gene selection, which is influenced by the proportion of cells of interest in the dataset: even with the same feature screening method, different stages may yield different initial input features and thus different results. We argue that the limited number of cells from the HDQP1 cell line in the BRCA dataset explains their scattered predictions during clustering. Additionally, the only non-cancer cell line, MCF12A, was divided into two groups (cluster 6 and cluster 31). One possibility is that these are different cell subtypes, so we further analyzed the expression heterogeneity between the two groups. Based on the Wilcoxon test, we calculated the DEGs for each group (p-value < 0.05, listed in S2 Table) and conducted KEGG enrichment analysis (S9B Fig) [46]. The preliminary findings indicate a significant difference in cell state between the two subtypes: in cluster 31, pathways associated with inflammation and apoptosis are up-regulated, while pathways related to cell proliferation are down-regulated. In conclusion, AttentionAE-sc proves effective in distinguishing cells within multi-scale datasets and provides valuable insights into the heterogeneity among different cell subtypes.
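The differential expression step described here corresponds to standard scanpy functionality. Assuming an AnnData object `adata` and predicted labels `labels_pred`, a minimal sketch is:

```python
# Wilcoxon rank-sum DEG analysis over the predicted clusters with scanpy.
import scanpy as sc

adata.obs["pred"] = labels_pred.astype(str)        # predicted cluster labels
sc.tl.rank_genes_groups(adata, groupby="pred", method="wilcoxon")
sc.pl.rank_genes_groups_dotplot(adata, n_genes=1)  # top-1 DEG per cluster
```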

Discussion
In this work, we propose AttentionAE-sc, a novel deep learning-based scRNA-seq clustering method built on an attention-based information fusion block that combines a denoising autoencoder with a graph autoencoder to learn a clustering-friendly cell representation. It not only deals with the sparsity and dropout events of scRNA-seq datasets through the ZINB-based autoencoder, but also directly considers the relationships among cells to guide dimension reduction. Therefore, AttentionAE-sc can readily extract denoising features from sparse scRNA-seq data and obtain an ideal clustering partition without specifying the number of groups.
AttentionAE-sc acquired excellent performance on 16 scRNA-seq datasets, with ARI scores on average 60% higher and NMI scores 18% higher than 9 baseline methods. Compared with these state-of-the-art methods, AttentionAE-sc achieved exceptional performance, including a high internal silhouette score that reflects the quality of its cell representations. Furthermore, we explored the relationship between the outstanding clustering results and the clustering-friendly cell representation obtained during the self-optimizing clustering stage. We also analyzed the stability and robustness of AttentionAE-sc and examined the contribution of each module component. Moreover, experiments on the BRCA dataset demonstrated the excellent clustering performance of AttentionAE-sc and its capability to unveil the biological significance within scRNA-seq datasets.
However, AttentionAE-sc, like some other methods [10,11], aims to construct a more elaborate model by taking cell interactions into account, at the cost of computation time (S3 Table). In addition, in each iteration of the optimized clustering process, AttentionAE-sc takes all samples as input, which limits its use on larger-scale datasets. Moreover, existing annotated single-cell datasets [47,48] have the potential to support pre-trained models and reduce training difficulty but were not fully utilized here. Concurrently, the incorporation of prior knowledge, which aids in establishing cellular relationships [49], holds promise as valuable information for constructing deep learning models [50]. Proposing a canonical single-cell clustering pipeline remains a challenge; in the future, we will explore more efficient clustering strategies that make full use of the growing body of annotated and unannotated single-cell datasets.

Fig 1. Model architecture. In the training stage, a graph autoencoder (GAE) and a denoising autoencoder (DAE) are constructed simultaneously. The multi-head attention blocks are used to combine the denoising embedding and the topological embedding. In the clustering stage, the prediction labels of each cell are obtained by self-optimizing clustering. https://doi.org/10.1371/journal.pcbi.1011641.g001

Fig 2. ARI scores of AttentionAE-sc (our method) and all baseline methods on 16 scRNA-seq datasets. The arithmetic mean over five runs with different random seeds is reported for each dataset. Methods that need the number of clusters to be specified are marked with an asterisk (*). https://doi.org/10.1371/journal.pcbi.1011641.g002

Fig 3. Evaluations of cell embeddings. a. Clustering performance of AttentionAE-sc and four community detection-based methods, using silhouette scores as the metric. Each box contains the results of 16 datasets (run 5 times with different random seeds); the dotted line represents the median score, and the upper and lower solid lines represent the maximum and minimum scores. b. Comparison of UMAP visualizations from different methods on the Romanov dataset. AttentionAE-sc learned a more clustering-friendly embedding for single-cell clustering. https://doi.org/10.1371/journal.pcbi.1011641.g003

Fig 4. AttentionAE-sc constructed better relationships among cells (Muraro). a. The number of predicted cell groups and the ARI score as a function of training epochs during the clustering stage. Better clustering performance was obtained as cluster centers were adaptively chosen in each iteration and redundant cluster centers were discarded; other datasets are shown in S4 Fig. b. Visualization of cell embeddings. c. Heatmap of the relative cosine distances between cells calculated from the embeddings of the multi-head attention layer. d. Heatmap of the relative cosine distances between cells calculated from the input expression matrix. In the right subfigures, cells are sorted by cell group to give an intuitive visualization of the between-cell distances (c, d, e). e. Heatmap of the relative cosine distances between cells calculated by an ordinary denoising autoencoder (DAE), with the corresponding visualization. https://doi.org/10.1371/journal.pcbi.1011641.g004

Fig 5. Hyperparameter search and the ablation experiment. a. Comparison of different methods for cell connectivity calculation; the 'gauss' method demonstrated superior performance compared to the UMAP method. b. Comparison of different settings for the number of attention heads. c. Comparison of different settings for the resolution parameter of the Leiden algorithm. d. Comparison of different settings for the number of highly variable genes. e. Model ablation experiment of AttentionAE-sc on 8 scRNA-seq datasets. Four conditions were tested: the absence of the information fusion block (wo attn), the absence of the ZINB loss function (wo zinb), the absence of residual connections (wo res), and the absence of the GAE (wo gnn). https://doi.org/10.1371/journal.pcbi.1011641.g005

Fig 6. Experiments on the BRCA dataset. a. UMAP visualization of the clustering results on the BRCA dataset. b. Comparison of the cell distribution between the predicted clusters and the ground truth cell lines. In the heatmap, each row represents a class of cells from the same cell line, and each column represents a class of cells from the same predicted cluster; for ideal clustering, cell lines and predicted clusters correspond one-to-one. c. Expression dot plot of the identified DEGs across cell lines. Based on the predicted cell labels, DEGs of all clusters were calculated by the Wilcoxon test (cluster 15 contained only 30 cells and was excluded). The top-1 DEG of each predicted cluster was selected as a potential marker gene. d. The dot plot visualizes expression in each cell line; most cell lines can be distinguished by the identified DEGs. BCAS3 was specifically expressed in cluster 18, i.e., MCF7 and KPL1. e. UMAP visualization of the sub-clustering results on cluster 18 by AttentionAE-sc; this cluster consists of these two cell lines. https://doi.org/10.1371/journal.pcbi.1011641.g006