Cell type annotation for scATAC-seq via DNA large language model and graph domain adaptation

doi:10.1371/journal.pcbi.1014226

Fig 1.

Overview of batch effect and graph-based domain adaptation in scATAC-seq cell type annotation.

Cells from different tissues are sequenced on different platforms, producing source and target datasets with potential batch effects. In the observed embedding space (top right), cells often separate by batch rather than by biological identity because of sequencing-induced biases. Ideally, cells would be organized by true cell type regardless of platform or batch. To approach this, we construct source and target graphs from intra-dataset similarity and apply a graph domain adaptation strategy to align the two domains. This framework enables batch-aware alignment and improves cross-platform cell type annotation accuracy.
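
As an illustration of building graphs from intra-dataset similarity, the following minimal sketch constructs one k-nearest-neighbor graph per dataset. The caption does not specify the similarity measure or neighborhood size, so cosine similarity, the value of `k`, the function name `knn_graph`, and the random embeddings are all assumptions made for this example.

```python
import numpy as np

def knn_graph(embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    """Symmetric 0/1 adjacency linking each cell to its k most similar cells.

    Cosine similarity and k are illustrative assumptions, not the paper's settings.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # row-normalize
    sim = unit @ unit.T                               # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                    # a cell is not its own neighbor
    neighbours = np.argsort(-sim, axis=1)[:, :k]      # top-k most similar cells per row
    n = embeddings.shape[0]
    adj = np.zeros((n, n), dtype=np.int8)
    adj[np.repeat(np.arange(n), k), neighbours.ravel()] = 1
    return np.maximum(adj, adj.T)                     # make the graph undirected

# Hypothetical embeddings: 6 source cells and 5 target cells, 4 dimensions each.
rng = np.random.default_rng(0)
source_adj = knn_graph(rng.normal(size=(6, 4)), k=2)
target_adj = knn_graph(rng.normal(size=(5, 4)), k=2)
```

Each graph is built only from cells within its own dataset, so the batch effect between datasets is left for the alignment step to handle.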

Fig 2.

Overview of the proposed scLLMDA framework for cell type annotation from scATAC-seq data.

(A) Feature extraction from genomic sequences: Chromatin-accessible regions are tokenized and encoded using a pretrained DNA language model (DNA-LLM), generating contextualized embeddings that are used to predict accessibility profiles and derive cell embeddings. (B) Cell type annotation via graph domain adaptation: Source and target cell embeddings are used to construct corresponding graphs. A domain adaptation module with attention mechanisms and shared parameters enables effective knowledge transfer and cell type classification across datasets.
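
The tokenization step in panel (A) can be pictured with a small sketch. The caption does not state the tokenizer; overlapping k-mers, as used by the original DNABERT, are assumed here, and the function name `kmer_tokens` and the value of `k` are hypothetical.

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Overlapping k-mer tokenization is an assumption for illustration only.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A short, made-up accessible region tokenized into 3-mers.
tokens = kmer_tokens("acgtac", k=3)
print(tokens)  # ['ACG', 'CGT', 'GTA', 'TAC']
```

The resulting token list is what a pretrained DNA language model would consume to produce contextualized embeddings.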

Table 1.

Performance comparison (Accuracy and F1-score) of intra-platform cell type annotation.

Table 2.

Cell type annotation between snATAC-seq and sciATAC-seq platforms.

Table 3.

Cell type annotation between MouseBrain(10x) and sciATAC-seq platforms.

Fig 3.

UMAP visualization comparison of cell type annotation results under the MosA1 → WholeBrainA transfer task.

Predicted labels (top row) and ground-truth labels (bottom row) are shown for each method, including annATAC, AtacAnnoR, Cellcano, MINGLE, SANGO, scJoint, scNym, and scLLMDA.

Table 4.

Computational efficiency benchmarks (time and memory) on two cross-dataset annotation tasks. For CPU-only methods, we report “CPU” in the memory column.

Fig 4.

Effect of the balancing hyperparameter on cross-platform annotation performance of scLLMDA.

(a) Accuracy; (b) F1-score.

Fig 5.

Ablation study comparing DNABERT+GDA and DNABERT+KNN across 10 cross-dataset annotation tasks.

The boxplots display distributions of (a) Accuracy and (b) Macro-F1 score.
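
For reference, the two reported metrics can be computed as in the sketch below. It uses the standard definitions (macro-F1 is the unweighted mean of per-class F1 scores); the cell type labels and predictions are made-up example data, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for five cells.
y_true = ["Astro", "Astro", "Neuron", "Neuron", "Microglia"]
y_pred = ["Astro", "Neuron", "Neuron", "Neuron", "Microglia"]
acc = accuracy(y_true, y_pred)   # 4 of 5 correct
f1 = macro_f1(y_true, y_pred)    # mean of per-class F1 (2/3, 1.0, 0.8)
```

Because macro-F1 weights every cell type equally, it penalizes errors on rare cell types more heavily than accuracy does.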
