Fig 1.
Overview of batch effect and graph-based domain adaptation in scATAC-seq cell type annotation.
Cells from different tissues are sequenced on different platforms, producing source and target datasets that may differ by batch effects. In the observed embedding space (top right), sequencing-induced biases often separate cells by batch rather than by biological identity. Ideally, cells would be organized by true cell type regardless of platform or batch. To approach this, we construct source and target graphs from intra-dataset similarity and apply a graph domain adaptation strategy to align the two domains. This framework enables batch-aware alignment and improves cross-platform cell type annotation accuracy.
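The graph construction step in this caption can be illustrated with a minimal sketch. The caption says only that graphs are built from intra-dataset similarity, so the cosine similarity measure, the k-nearest-neighbor rule, the value of `k`, and the function name `knn_graph` are all assumptions made for illustration, not the authors' stated procedure.

```python
import numpy as np

def knn_graph(embeddings, k=3):
    """Symmetric 0/1 adjacency linking each cell to its k most cosine-similar cells.

    Assumption: cosine similarity and a kNN rule; the caption does not specify either.
    """
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)      # unit-normalize each cell embedding
    sim = x @ x.T                            # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)           # exclude self-matches
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    n = x.shape[0]
    adj = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbors.ravel()] = 1
    return np.maximum(adj, adj.T)            # symmetrize: keep an edge if either cell chose it

# Toy embeddings standing in for source and target cells (random, illustrative only).
rng = np.random.default_rng(0)
source_cells = rng.normal(size=(6, 4))
target_cells = rng.normal(size=(5, 4))
source_graph = knn_graph(source_cells, k=2)
target_graph = knn_graph(target_cells, k=2)
```

Each graph is built within its own dataset, so no cross-batch edges are introduced at this stage; aligning the two domains is left to the adaptation step.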
Fig 2.
Overview of the proposed scLLMDA framework for cell type annotation from scATAC-seq data.
(A) Feature extraction from genomic sequences: Chromatin-accessible regions are tokenized and encoded using a pretrained DNA language model (DNA-LLM), generating contextualized embeddings that are used to predict accessibility profiles and derive cell embeddings. (B) Cell type annotation via graph domain adaptation: Source and target cell embeddings are used to construct corresponding graphs. A domain adaptation module with attention mechanisms and shared parameters enables effective knowledge transfer and cell type classification across datasets.
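The tokenization step in panel (A) can be sketched as follows. The caption does not state the tokenizer, so this assumes overlapping k-mer tokenization of the kind used by the original DNABERT; the function name `kmer_tokenize`, the choice `k=6`, and the example sequence are hypothetical.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Assumption: k-mer tokenization; the caption does not name the tokenizer.
    """
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy accessible-region sequence (illustrative only).
tokens = kmer_tokenize("acgtacgt", k=6)
print(tokens)  # ['ACGTAC', 'CGTACG', 'GTACGT']
```

In a full pipeline, these tokens would be mapped to vocabulary ids and passed to the pretrained model to obtain contextualized embeddings.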
Table 1.
Performance comparison (Accuracy and F1-score) of intra-platform cell type annotation.
Table 2.
Cell type annotation between snATAC-seq and sciATAC-seq platforms.
Table 3.
Cell type annotation between MouseBrain (10x) and sciATAC-seq platforms.
Fig 3.
UMAP visualization comparison of cell type annotation results under the MosA1 → WholeBrainA transfer task.
Predicted labels (top row) and ground-truth labels (bottom row) are shown for each method, including annATAC, AtacAnnoR, Cellcano, MINGLE, SANGO, scJoint, scNym, and scLLMDA.
Table 4.
Computational efficiency benchmarks (time and memory) on two cross-dataset annotation tasks. For CPU-only methods, we report “CPU” in the memory column.
Fig 4.
Effect of the balancing hyperparameter on cross-platform annotation performance of scLLMDA.
(a) Accuracy; (b) F1-score.
Fig 5.
Ablation study comparing DNABERT+GDA and DNABERT+KNN across 10 cross-dataset annotation tasks.
The boxplots display distributions of (a) Accuracy and (b) Macro-F1 score.