Fig 1.
Overview of scTrans. (a) Sparse attention aggregates cellular representations: scTrans uses a sparse attention mechanism to efficiently encode non-zero genes into a cellular representation. Each gene is assigned an embedding based on its gene symbol; only the embeddings of genes with non-zero expression values are used, and these are aggregated through attention weights, which focuses learning on informative genes. (b) Contrastive pre-training strategy: During pre-training, scTrans generates augmented cells through random masking, forming positive pairs with the original cells and negative pairs with other cells in the batch. Features for contrastive learning are extracted via an encoder-projection architecture. Training pulls positive pairs closer together and pushes negative pairs apart in the latent space, enabling unsupervised pre-training. (c) Fine-tuning for cell type classification: In the fine-tuning phase, a classification layer is appended after the latent representation layer, and the model parameters are optimized by supervised learning on labeled data. (d) Applications in downstream tasks: The trained scTrans model can be used for cell type annotation on novel datasets, as well as for downstream tasks such as gene expression analysis, clustering, and cell trajectory inference.
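The aggregation step in panel (a) can be illustrated with a minimal sketch. It assumes a single learned query vector and scaled dot-product scoring; the function name `sparse_attention_pool`, the toy gene embeddings, and the query values are all hypothetical and are not taken from the scTrans implementation, which may differ in its attention design.

```python
import math

def sparse_attention_pool(expression, gene_embeddings, query):
    """Aggregate the embeddings of non-zero genes into one cell vector.

    Illustrative only: a single query vector scores each gene embedding,
    which is an assumption rather than the published architecture.
    """
    # Keep only genes with non-zero expression; all others are skipped entirely.
    active = [gene for gene, value in expression.items() if value != 0]
    if not active:
        raise ValueError("cell has no non-zero genes")

    dim = len(query)
    # Scaled dot-product score between the query and each active gene embedding.
    scores = [
        sum(q * e for q, e in zip(query, gene_embeddings[gene])) / math.sqrt(dim)
        for gene in active
    ]

    # Numerically stable softmax over the active genes only.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]

    # Attention-weighted sum of the active gene embeddings.
    cell_vector = [
        sum(w * gene_embeddings[gene][i] for w, gene in zip(weights, active))
        for i in range(dim)
    ]
    return cell_vector, dict(zip(active, weights))

# Toy example with hypothetical 2-dimensional embeddings.
gene_embeddings = {"INS": [1.0, 0.0], "GCG": [0.0, 1.0], "PECAM1": [1.0, 1.0]}
expression = {"INS": 5.0, "GCG": 0.0, "PECAM1": 2.0}
query = [0.5, 0.5]

cell_vector, attention = sparse_attention_pool(expression, gene_embeddings, query)
```

Because the softmax runs only over genes with non-zero expression, the cost per cell scales with the number of expressed genes rather than with the full gene vocabulary, and the returned attention weights give a per-gene importance score for that cell.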
Fig 2.
Comparison of annotation performance across multiple datasets and scales. (a) The average accuracy and macro-F1 for each tissue in the MCA datasets. (b) Violin plots of accuracy and macro-F1 across the 31 tissues of the MCA datasets, with each point representing the average annotation result for one tissue. (c) Violin plots of accuracy and macro-F1 on the PBMC160k and scBloodNL datasets, using a stratified 10% sample of cells as labeled data, with each experiment repeated five times under different random seeds. (d) The runtime of scTrans and the comparison methods on the MCA, PBMC160k and scBloodNL datasets. The left panel shows the running time on the 31 tissues of MCA, with the number of cells increasing along the x-axis. The right panel shows the running time of four methods on the PBMC160k and scBloodNL datasets. (e) The average accuracy and macro-F1 with 1%, 5%, and 10% labeled cells in the 31 tissues of MCA.
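The two metrics reported throughout this figure can be computed as in the following sketch. It assumes the usual definitions (accuracy as the fraction of correctly annotated cells, macro-F1 as the unweighted mean of per-class F1 scores over all classes seen in either the true or predicted labels); the labels below are made up for illustration and are not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted type matches the true type."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare cell types count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives or false negatives gets 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Hypothetical annotations for six cells.
y_true = ["T", "T", "T", "B", "B", "NK"]
y_pred = ["T", "T", "B", "B", "B", "T"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case the single NK cell is misclassified, which lowers macro-F1 much more than accuracy; this is why the two metrics are reported together when cell types are imbalanced.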
Fig 3.
Cross-batch annotation on the PBMC45k, mouse brain and mouse pancreas datasets. (a) UMAP visualization of the PCA embedding of the PBMC45k dataset, colored by cell type and by sequencing technology, showing batch effects within the dataset. (b) Venn diagrams of the mouse pancreas and mouse brain datasets, illustrating the overlaps between datasets and the differences in their cellular composition. The left side shows the overlap in the number of identical cells between datasets, while the right side shows the overlap in the number of identical cell types. (c) Heatmap of average annotation accuracy on PBMC45k, covering the results of different methods using datasets from different technologies in single-reference and multi-reference annotation tasks. (d) Comparison of annotation accuracy for different methods in single-reference and multi-reference annotation tasks on the mouse pancreas and mouse brain datasets. Error bars show the mean and 95% confidence interval. (e) Sankey diagrams showing the annotation results of scTrans in the multi-reference annotation task, with MCA Pancreas on the left and TMS Pancreas on the right.
Fig 4.
Critical gene analysis on the Baron datasets. (a) Gene expression heatmap of the top 10 critical genes on the Baron datasets. Each row represents a cell, and the colored bars on the left correspond to different cell types. Each column represents one critical gene; the genes shown are the top 10 critical genes for each predicted cell type. (b) Dot plot of gene expression values for the top 10 critical genes of endothelial cells on the Baron datasets. (c) UMAP visualization of the Baron datasets based on the latent representation generated by scTrans, colored by cell type and by the expression of four critical genes. (d) Top 10 KEGG analysis results for the top 100 critical genes of endothelial cells in the Baron datasets.
Fig 5.
Latent representation quality analysis. (a) The ARI, NMI and ASW evaluation metrics calculated for the clustering results of the latent representations generated by scTrans, scSemiGAN and scDeepSort, with each dataset run 5 times. The y-axis represents NMI, the x-axis represents ARI, different shapes represent different datasets, and shape size represents ASW. (b) Six UMAP visualizations of the latent representations of the MCA Pancreas dataset generated by scTrans, scSemiGAN, and scDeepSort, colored by K-Means clustering result and by true cell type. (c) Comparison of the clustering ARI of all methods on the mouse brain and mouse pancreas datasets. Error bars show the mean and 95% confidence interval. (d) UMAP visualizations of the T cell development dataset, showing gene expression variation across developmental stages and the batch information of the three donors in the dataset, together with pseudotime inference results based on the latent representations generated by scTrans, trVAE, and scVI.
Fig 6.
Batch correction and cell subtype identification in PBMC45k. (a) UMAP visualization of the latent representations generated by scTrans, scVI and trVAE, colored by donor, cell type and sequencing technology. (b) The expression of marker genes in six B cell subtypes. (c) The density of B cell subtypes and the expression of marker genes in the latent representation generated by scTrans.