Fig 1.
Overview of the PaaSc approach.
(A) PaaSc uses multiple correspondence analysis (MCA) to reduce the dimensionality of the gene expression matrix, projecting both cells and genes into a shared low-dimensional orthogonal coordinate space. The resulting biplot representation captures spatial relationships between cells, genes, and their associations in the reduced-dimensional space. (B) The relative frequencies of genes in the pathway of interest and in the background gene sets were computed to assess their contributions to the identified dimensions. (C) Ordinary linear regression was applied to identify dimensions significantly associated with the pathway of interest. Dimensions with significant associations (P < 0.05) were retained, and their significance levels were quantified using t-statistics, computed as the ratio of each regression coefficient to its standard error. (D) Raw pathway activity scores were computed as a weighted sum of the embedding matrix, incorporating both t-statistics and the proportion of variation explained as weights. These scores were then z-score normalized for downstream analysis and visualization. (E) Normalized pathway activity scores were used for cell type assignment, cluster association testing, and spatial analysis.
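Steps (C) and (D) can be illustrated with a minimal sketch. It is not the PaaSc implementation and rests on several assumptions: the function name `pathway_scores` and all of its arguments are hypothetical, each dimension is regressed on a 0/1 pathway-membership indicator, the p-value uses a large-sample normal approximation, and the per-dimension weight is taken to be the product of the t-statistic and the proportion of variation explained (the caption says only that both enter the weights).

```python
import math
import numpy as np

def pathway_scores(cell_coords, gene_coords, in_pathway, var_explained, alpha=0.05):
    """Sketch of panels (C)-(D): weight dimensions by regression t-statistics.

    cell_coords   : (n_cells, n_dims) cell coordinates in the shared space
    gene_coords   : (n_genes, n_dims) gene coordinates in the same space
    in_pathway    : (n_genes,) 1 for pathway genes, 0 for background genes
    var_explained : (n_dims,) proportion of variation explained per dimension
    """
    cell_coords = np.asarray(cell_coords, dtype=float)
    gene_coords = np.asarray(gene_coords, dtype=float)
    x = np.asarray(in_pathway, dtype=float)
    var_explained = np.asarray(var_explained, dtype=float)

    n = x.size
    xc = x - x.mean()
    sxx = float(xc @ xc)
    weights = np.zeros(gene_coords.shape[1])

    for d in range(gene_coords.shape[1]):
        yc = gene_coords[:, d] - gene_coords[:, d].mean()
        beta = float(xc @ yc) / sxx                    # OLS slope
        resid = yc - beta * xc
        se = math.sqrt(float(resid @ resid) / (n - 2) / sxx)
        if se == 0.0:
            continue                                   # degenerate fit, skip
        t = beta / se                                  # coefficient / standard error
        # ASSUMPTION: two-sided p-value from a normal approximation to the t distribution
        p = math.erfc(abs(t) / math.sqrt(2.0))
        if p < alpha:
            # ASSUMPTION: weight = t-statistic x proportion of variation explained
            weights[d] = t * var_explained[d]

    raw = cell_coords @ weights                        # weighted sum over dimensions
    sd = raw.std()
    z = (raw - raw.mean()) / sd if sd > 0 else np.zeros_like(raw)
    return z, weights
```

Dimensions that fail the significance filter receive a weight of zero, so they do not contribute to the score.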
Fig 2.
Comprehensive performance evaluation of PaaSc and other gene set scoring methods using REAP-seq data and benchmark datasets.
(A) UMAP visualization of nine distinct cell populations identified in human PBMCs profiled by REAP-seq, including CD4+ T cells, CD8+ T cells, natural killer (NK) cells, plasmacytoid dendritic cells (pDCs), dendritic cells (DCs), CD14+ monocytes, CD16+ monocytes, and megakaryocytes (Mk). (B) Performance comparison of different gene set scoring methods using established cell type-specific markers. Box plots show the distribution of AUC scores across all cell types for each method. The centerline represents the median value, the box limits indicate the first and third quartiles, and the whiskers extend to the minimum and maximum values. (C) Assessment of the robustness of the gene set scoring tools against random noise. The line plot shows the mean AUC scores of different methods when varying proportions (10–80%) of random genes were introduced into cell type-specific marker sets. (D) Cell type annotation performance of different methods evaluated using marker genes from 20 predefined cell types. Box plots show the distributions of recall (upper), precision (middle), and F1 scores (lower) across all cell types. Unassigned cells were excluded from the calculation. (E) Cross-dataset validation using five independent benchmark datasets (Liver, Pancreas, Spleen, Bmcite, and Hcortex). The heatmap shows the mean AUC scores for each method across different datasets.
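The AUC evaluation in (B) and the noise experiment in (C) can be sketched as follows. Both helpers are hypothetical illustrations, not code from the study. `auc_from_scores` computes the ROC AUC from average ranks (the Mann-Whitney formulation). `perturb_marker_set` assumes that "introducing" random genes means replacing the stated proportion of markers with randomly drawn non-marker genes while keeping the set size fixed; the caption does not say whether genes were replaced or appended.

```python
import numpy as np

def auc_from_scores(scores, is_target):
    """ROC AUC for separating target cells from all others, via average ranks."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    n = scores.size
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(n)
    i = 0
    while i < n:
        j = i
        while j + 1 < n and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0   # tied values share the mean rank
        i = j + 1
    n_pos = int(is_target.sum())
    n_neg = n - n_pos
    u = ranks[is_target].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def perturb_marker_set(markers, background, proportion, rng):
    """ASSUMPTION: swap a fraction of markers for random non-marker genes."""
    markers = list(markers)
    marker_set = set(markers)
    pool = [g for g in background if g not in marker_set]
    n_swap = int(round(proportion * len(markers)))
    keep_idx = rng.choice(len(markers), size=len(markers) - n_swap, replace=False)
    noise_idx = rng.choice(len(pool), size=n_swap, replace=False)
    return [markers[i] for i in sorted(keep_idx)] + [pool[i] for i in noise_idx]
```

A robustness curve like the one in (C) would then come from scoring each perturbed marker set with a given method and averaging `auc_from_scores` across cell types at each noise level.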
Fig 3.
Evaluation of the performance of PaaSc on 136 annotated scRNA-seq datasets.
(A) A bar plot illustrating the number of datasets containing each of the five cell types analyzed: B cells, CD8+ T cells, macrophages, Tregs, and NK cells. (B) Box plots comparing the performance of PaaSc and seven competing tools in distinguishing B cells from other cell populations, as measured by AUC scores based on B cell-specific markers. (C) A bar plot showing the number of datasets in which each tool achieved the highest AUC score for B cell identification. (D) Four heatmaps comparing the performance of PaaSc and competing tools in scoring pathway activity for CD8+ T cells, macrophages, Tregs, and NK cells. The datasets were grouped according to which tool achieved the highest accuracy for each cell type. (E) Stacked bar plots summarizing the number of datasets in which each tool achieved the best performance for each cell type.
Fig 4.
Application of PaaSc in analyzing biological aging processes.
(A) Enrichment of the senescence signature in the bulk RNA-seq datasets. A volcano plot showing the enrichment of senescence signatures across 50 bulk RNA-seq datasets. The x-axis represents the normalized enrichment score (NES), and the y-axis represents the -log10-transformed permutation p-value. (B) Assessment of senescence in IMR90 cells treated with 4-hydroxytamoxifen. Cells were projected into two-dimensional space using Monocle2, with labels indicating either time points or calculated PaaSc scores. (C) Assessment of senescence in WI-38 cells across sequential population doublings. Senescence scores were calculated using PaaSc, CelliD, and GSDensity. Statistical significance was assessed by a one-sided Wilcoxon rank sum test, with significance levels denoted as follows: *P < 0.05, **P < 0.01, ***P < 0.001, ns: not significant. (D) Discrimination between senescent and non-senescent single cells using PaaSc, CelliD, and GSDensity. Receiver operating characteristic (ROC) curves demonstrate the ability of the three tools to discriminate between senescent and non-senescent single cells in three datasets (GSE102090, GSE119807, and GSE115301). (E) Identification of pathways associated with cell senescence. An association analysis between the activities of the senescence signature and hallmark pathways from MSigDB revealed pathways positively correlated with senescence across five cell types (CD8+ T cells, CD4+ T cells, B cells, NK cells, and monocytes/macrophages).
Fig 5.
Identification of GWAS trait-associated cell types using PaaSc.
(A) Enrichment of lymphocyte count-associated genes across 19 sorted cell types. A box plot showing the enrichment scores of lymphocyte count-associated genes calculated by PaaSc. Cell types are categorized into blood, brain, or other categories. (B) Distribution of PaaSc scores for lymphocyte count-associated genes. A histogram illustrating the bimodal distribution of PaaSc scores, enabling the classification of cells into positive and negative states on the basis of the enrichment of lymphocyte count-associated genes. (C) Significance of enrichment of lymphocyte count-associated genes across 19 sorted cell types. Positive cells were defined using the cutoff established in (B). A one-sided Fisher’s exact test was performed, and the log10-transformed odds ratio and the negative log10-transformed false discovery rate (–log10 FDR) are shown. (D) Comparison of PaaSc, CelliD, and GSDensity in identifying GWAS trait-associated cell types. Positive cells were defined as the top 5% of cells with the highest scores for each method. Statistical significance was assessed using a one-sided Fisher’s exact test, with significant results marked by an asterisk (*).
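The enrichment test in (C) and (D) can be illustrated with a small sketch that uses only the Python standard library. The function name and the layout of the 2x2 table are assumptions made for illustration: positive versus negative cells (by score cutoff) are cross-tabulated against membership in the cell type of interest, and the one-sided p-value is the hypergeometric upper tail.

```python
import math

def fisher_enrichment(pos_in, pos_out, neg_in, neg_out):
    """One-sided (greater) Fisher's exact test for a 2x2 table.

    pos_in  : positive cells inside the cell type of interest
    pos_out : positive cells in all other cell types
    neg_in  : negative cells inside the cell type of interest
    neg_out : negative cells in all other cell types
    Returns (odds_ratio, p_value).
    """
    a, b, c, d = pos_in, pos_out, neg_in, neg_out
    n_pos = a + b          # all positive cells
    n_in = a + c           # all cells of the type of interest
    n_out = b + d          # all remaining cells
    total = math.comb(n_in + n_out, n_pos)
    upper = min(n_pos, n_in)
    # Probability of observing at least `a` positive cells inside the cell type
    tail = sum(math.comb(n_in, k) * math.comb(n_out, n_pos - k)
               for k in range(a, upper + 1))
    p_value = tail / total
    if b * c == 0:
        odds_ratio = math.inf if a * d > 0 else math.nan
    else:
        odds_ratio = (a * d) / (b * c)
    return odds_ratio, p_value
```

For plotting, the odds ratio would be passed through `math.log10`. The p-values across cell types would also need a multiple-testing correction to obtain the FDR values, which this sketch does not cover.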
Fig 6.
Scoring pathway activities in the presence of batch effects.
The analysis used the GSE96583 dataset, in which peripheral blood mononuclear cells (PBMCs) from systemic lupus erythematosus (SLE) patients under two biological conditions were profiled: control (ctrl) and interferon (IFN) stimulation (stim). (A, B) Dimensional reduction plots showing cell clustering grouped by cell type (A) and biological condition (B). (C) Evaluation of three pathway scoring methods (PaaSc, GSDensity, and CelliD) using the B cell activation pathway from Gene Ontology. Pathway scores were calculated for each method and normalized by relative rank to values between 0 and 1 (where 1 represents the highest score). The analysis focused on scores from B cells, and receiver operating characteristic (ROC) scores were calculated to assess how effectively each method distinguished between biological conditions on the basis of gene set scores. (D) Interferon pathway activity was calculated using the same approach as in (C), with scores compared between control and stimulated conditions across all cell types.
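The relative-rank normalization described in (C) can be sketched as below. The exact formula is not given in the caption, so this version assumes rank divided by the number of cells, with tied scores sharing their average rank; the function name is hypothetical.

```python
import numpy as np

def relative_rank(scores):
    """Map scores to (0, 1] by relative rank; the highest score maps to 1.

    ASSUMPTION: relative rank = average rank / number of cells.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                 # highest rank within each tie group
    avg_rank = last_rank - (counts - 1) / 2.0     # mean rank within each tie group
    return avg_rank[inverse] / scores.size
```

Because the transformation is monotone, it puts the three methods on a common 0 to 1 scale without changing how each method orders the cells.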
Fig 7.
Performance of PaaSc in scoring pathway activity using scATAC-seq data.
(A) UMAP visualization of human PBMC single cells from a publicly available 10x Genomics multiome dataset, in which DNA accessibility and gene expression were simultaneously measured. Cell type annotations were transferred from an existing PBMC reference dataset using Seurat package tools [43], utilizing only gene expression information (see Methods). (B) Comparison of cell type discrimination performance between PaaSc, CelliD, and GSDensity using cell type-specific gene sets. Box plots show ROC AUC values representing each method’s ability to distinguish target cell types from all other cell types. (C) Venn diagram illustrating the overlap of correctly predicted cells among the three methods. For each method, individual cells were assigned to the cell type with the highest score using the corresponding cell type-specific gene set. (D) Differential pathway activity analysis between CD8+ TEM cells and CD8+ naïve cells.
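The assignment rule in (C) amounts to an argmax over the per-cell-type scores, and the Venn regions follow from set operations on the correctly predicted cells. The sketch below uses hypothetical function names and assumes a cells-by-cell-types score matrix per method.

```python
import numpy as np

def assign_cell_types(score_matrix, type_names):
    """Assign each cell (row) to the cell type (column) with the highest score."""
    scores = np.asarray(score_matrix, dtype=float)
    return [type_names[i] for i in scores.argmax(axis=1)]

def correct_cells(predicted, truth):
    """Indices of cells whose predicted label matches the reference annotation."""
    return {i for i, (p, t) in enumerate(zip(predicted, truth)) if p == t}
```

Given one set of correct cells per method, the central region of the Venn diagram would be the intersection of the three sets, and each method-specific region would be that method's set minus the union of the other two.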
Fig 8.
Identification of spatially relevant signatures using PaaSc in human and mouse brain datasets.
(A) Spatial plots of mouse brain sagittal sections showing PaaSc-calculated activity scores for region-specific gene markers. The color intensity represents the pathway activity score, with higher scores indicating stronger activation. Spatial mutual information (SMI) values are indicated for each gene set. (B) Spatial plot of human brain tissue showing anatomically annotated regions, including cortical layers 1–6 and white matter (WM). (C) UMAP embedding of human brain spatial transcriptomics data demonstrating that anatomically proximal cells cluster together in low-dimensional space. (D) Box plot comparing the performance of PaaSc, GSDensity, and CelliD in distinguishing brain regions using domain-specific markers.
Fig 9.
Performance evaluation of PaaSc on spatial transcriptomics data from human lung cancer.
(A) Spatial plot showing the tissue architecture of human lung cancer samples analyzed using CosMx SMI technology. (B) UMAP embedding showing the distribution of annotated cell types, including tumor cells, epithelial cells, endothelial cells, and 12 immune cell populations. (C) Box plot comparing the performance of PaaSc, CelliD, and GSDensity in distinguishing individual cell types using cell type-specific markers. Each point represents the AUC value for a specific cell type, with median values indicated. (D) Heatmap showing the correlation between pathway activity scores (rows) and cell type gene set scores (columns). The color intensity indicates the strength of positive (red) or negative (blue) correlations.