Skip to main content
Advertisement
  • Loading metrics

Branch-specific gene discovery in cell differentiation using multi-omics graph attention

  • Yihao Yin,

    Roles Data curation, Investigation, Methodology, Writing – original draft

    Affiliations School of Mathematics and Statistics, Wuhan Textile University, Wuhan, Hubei, China, Research Center for Applied Mathematics and Interdisciplinary Studies, Wuhan Textile University, Wuhan, Hubei, China

  • Linzhi Zhuang,

    Roles Investigation, Methodology, Software

    Affiliations School of Mathematics and Statistics, Wuhan Textile University, Wuhan, Hubei, China, Research Center for Applied Mathematics and Interdisciplinary Studies, Wuhan Textile University, Wuhan, Hubei, China

  • Yulei Wang,

    Roles Formal analysis, Investigation, Software

    Affiliations School of Mathematics and Statistics, Wuhan Textile University, Wuhan, Hubei, China, Research Center for Applied Mathematics and Interdisciplinary Studies, Wuhan Textile University, Wuhan, Hubei, China

  • Yazhou Shi,

    Roles Methodology, Visualization, Writing – review & editing

    Affiliations School of Mathematics and Statistics, Wuhan Textile University, Wuhan, Hubei, China, Research Center for Applied Mathematics and Interdisciplinary Studies, Wuhan Textile University, Wuhan, Hubei, China

  • Bengong Zhang

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    bgzhang@wtu.edu.cn

    Affiliations School of Mathematics and Statistics, Wuhan Textile University, Wuhan, Hubei, China, Research Center for Applied Mathematics and Interdisciplinary Studies, Wuhan Textile University, Wuhan, Hubei, China

Abstract

Understanding gene regulation during cell differentiation requires effective integration of multi-omics single-cell data. In this study, we propose BranchKGN, a heterogeneous graph transformer-based framework for identifying branch-specific key genes along cell differentiation trajectories. By integrating scRNA-seq and scATAC-seq data into a unified gene representation, we infer differentiation trajectories using Slingshot and construct a heterogeneous graph capturing gene–cell relationships. Through attention-based graph learning, BranchKGN assigns gene importance scores within each cell, enabling the identification of genes consistently informative across branch point cells and their descendant lineages. These genes are then used to reconstruct gene regulatory networks and differentiation trajectories. Validation on three independent datasets demonstrates that the identified gene sets not only capture key regulators of cell fate bifurcation but also support accurate reconstruction of differentiation trajectories. Our results highlight the effectiveness of BranchKGN in dissecting gene regulation dynamics during cellular transitions and provide a valuable tool for multi-omics single-cell analysis.

Author summary

Cell fate is an important biology process during biological development, tissue regeneration, and disease progression. However, the modes of cell differentiation and the patterns of gene expression changes during this process are still not so clear. To investigate these problems, we introduce BranchKGN, a framework based on graph attention mechanisms designed to identify branch-specific key genes along cell differentiation trajectories. By integrating scRNA-seq and scATAC-seq data and inferring differentiation trajectories using Slingshot, BranchKGN employs multi-head attention learning to score the importance of genes within each cell. And then branch-specific key genes at bifurcation points are identified. These genes are used to reconstruct gene regulatory networks and differentiation trajectories. We test BranchKGN on three independent datasets, can identify specific key gene sets at cell fate bifurcations. These key genes can accurately reconstruct the cell differentiation trajectories in turn. Our results highlight that BranchKGN is a powerful tool to dissect gene regulatory dynamics during cell transitions and analyze the multi-omics single-cell data.

Introduction

Understanding the transitions between cellular states and fates is fundamental for deciphering the mechanisms of biological development, tissue regeneration, and disease progression [13]. During these processes, cells undergo dynamic changes in gene expression and chromatin accessibility, ultimately committing to distinct lineages and acquiring specialized functions[46]. Recent advances in single-cell technologies, particularly single-cell RNA sequencing (scRNA-seq) and single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), provide high-resolution, multi-modal snapshots of individual cells [7,8]. These technologies offer unique opportunities to capture cell-to-cell heterogeneity and unravel the regulatory programs that drive fate specification.

Despite these advances, a key challenge remains: how to identify branch-specific regulatory genes that drive fate bifurcations in complex differentiation systems [9]. While traditional differential expression analyses can highlight global markers, they often fail to capture context-dependent regulators that selectively influence specific lineage choices at critical branch points [10,11].

Trajectory inference algorithms, such as Monocle [12,13], TIGON [14], and PAGA [15] have been widely adopted to reconstruct cellular developmental paths by projecting single-cell data onto low-dimensional manifolds and ordering cells along pseudotime or tree-like structures [16,17]. These methods effectively capture global lineage relationships and branch points but typically fall short in pinpointing the molecular regulators driving fate bifurcations. This is largely due to their focus on modeling cell-level dynamics, rather than explicitly quantifying gene-level contributions to fate decisions [18]. For instance, MuTrans [19] applies manifold learning to identify transitional states with maximum transcriptomic uncertainty, and SpliceJAC [20] estimates Jacobian matrices derived from RNA velocity to detect bifurcating states. While informative, such approaches primarily emphasize cell-level transitions, providing limited insight into gene-level drivers of differentiation.

Efforts have also been made to characterize gene regulatory mechanisms along developmental trajectories [21]. Notably, Topographer reconstructs a quantitative Waddington landscape from single-cell transcriptomic data, integrating macro-scale lineage topology, meso-scale gene regulatory networks, and micro-scale transcriptional burst kinetics [22]. It estimates transition probabilities and reveals increased transcriptional noise and gene–gene correlation heterogeneity near bifurcation points, offering valuable insights into the stochastic nature of fate decisions. However, Topographer relies solely on gene expression data and does not incorporate chromatin accessibility or multi-modal regulatory features, limiting its capacity to pinpoint causal, branch-specific regulators.

In parallel, deep learning has emerged as a powerful tool for modeling high-dimensional single-cell multi-omics data. Methods such as DeepMAPS [23], MarsGT [24] and GRNFormer [25] employ graph neural networks or transformer-based architectures to integrate gene expression and chromatin accessibility, jointly learning embeddings for genes and cells. These approaches enable the inference of gene regulatory networks (GRNs) [26] and can reveal cell-type-specific gene–gene interactions. However, most focus on global or pairwise regulatory patterns, and are not specifically designed to uncover the branch-specific regulators that determine bifurcation outcomes [27]. Moreover, identifying causal regulators in rare or transitional cell states remains particularly challenging due to data sparsity, noise, and the complexity of underlying regulatory programs.

To address these limitations, we introduce BranchKGN, a deep learning framework tailored to identify branch-specific key regulators by integrating scRNA-seq and scATAC-seq data. BranchKGN models gene regulatory landscapes as attributed graphs, where nodes represent genes with multi-omic features and edges encode potential regulatory associations. By leveraging a multi-head self-attention mechanism [28], the model computes gene-level scores that reflect their importance in distinguishing cell fates across divergent branches. BranchKGN then aligns cell clusters along inferred pseudotime trajectories and quantifies the dynamic rewiring of GRNs during differentiation. This framework enables the discovery of bifurcation-associated genes, offering mechanistic insights into lineage commitment and cell fate decisions. We demonstrate the effectiveness of BranchKGN on multiple real datasets, showing its ability to reveal interpretable, biologically relevant regulators that govern cell-type-specific transitions.

Materials and methods

BranchKGN framework

As shown in Fig 1, BranchKGN is a computational framework designed to identify branch-specific regulatory genes by integrating single-cell transcriptomic (scRNA-seq) and chromatin accessibility (scATAC-seq) data. Seurat is first applied to normalize scRNA-seq data, while MAESTRO computes Regulatory Potential (RP) scores from scATAC-seq [29,30]. The two modalities are integrated into a Gene Integration Matrix (GIM) using Seurat [31], capturing both gene expression and regulatory potential. The GIM is then reduced via PCA [32,33] and used for trajectory inference with Slingshot based on Gaussian Mixture Models (GMM) [34], which identifies branching points and partitions differentiation into pre-branching, branching, and post-branching phases.

thumbnail
Fig 1. Overview of the BranchKGN framework.

A: The framework consists of three main components: multi-omic data integration, trajectory inference, and gene-to-cell importance learning via attention-based graph modeling. B: The selected key genes are used to reconstruct differentiation trajectories and infer dynamic gene regulatory networks across pseudotime.

https://doi.org/10.1371/journal.pcbi.1013664.g001

In parallel, the GIM is transformed into a heterogeneous gene–cell graph, where nodes represent genes and cells, and edges encode expression, accessibility, or regulatory interactions [35]. A Heterogeneous Graph Transformer (HGT) is employed to learn embeddings and compute Gene Attention Scores (GAS) for each gene–cell pair, quantifying gene contributions to cell fate [36]. Branch cells identified from the trajectory are analyzed by ranking genes according to GAS, followed by filtering with a score threshold and expression probability (50% of branch cells). Genes passing these criteria are designated as branch-specific regulatory genes, which are further used to reconstruct differentiation trajectories and infer branch-specific gene regulatory networks (GRNs), see Fig 1B for details.

Integration of scRNA-seq and scATAC-seq data

To capture transcriptional and chromatin accessibility signals at single-cell resolution, we integrated scRNA-seq and scATAC-seq data using the Seurat framework. Gene expression matrices from scRNA-seq were normalized with SCTransform [37], while scATAC-seq profiles were processed with TF-IDF [38] and converted into gene activity scores based on promoter and gene-body accessibility, reflecting regulatory potential [30] (detailed in S1 Text). The two modalities were aligned through canonical correlation analysis (CCA) to obtain a shared low-dimensional representation. This process yielded a harmonized Gene Integration Matrix that jointly encodes expression and accessibility features for matched cells, serving as the foundation for trajectory inference and graph-based modeling.

Trajectory inference and branching phase definition

Based on the GIM, we first applied PCA to reduce the gene-level dimensionality, followed by cell clustering to identify major cell populations. Differentiation trajectories were reconstructed using Slingshot, which fits smooth lineages through clusters with Gaussian Mixture Models (GMMs) and detects bifurcation points [39,40]. Since this study focuses on bifurcating trajectories, only clusters located along bifurcations were considered for subsequent analysis. To assign clusters into pre-branching, branching, and post-branching phases, we applied the following criteria. For each cluster, we examined (i) the proportion of cells assigned to different lineages, (ii) the median pseudotime values across lineages, and (iii) the differences between lineage-specific pseudotimes. A cluster was classified as pre-branching if a substantial proportion of its cells were associated with multiple lineages () and if the median pseudotimes across lineages were small and highly similar (maximum difference ). Conversely, a cluster was classified as post-branching if the majority of its cells belonged to a single lineage () and if its median pseudotime was relatively large. Remaining clusters, which typically displayed intermediate pseudotime differences or heterogeneous lineage assignments without a dominant lineage, were classified as branching. In this study, we set the parameters ρ=2 and =0.3. Given the varying number of cells and genes across different datasets, the resulting time spans also differ. Therefore, we recommend adjusting these parameters according to the specific characteristics of each dataset in practical computations.

Construction of the gene–cell bipartite graph

From the integrated GIM, we constructed a heterogeneous bipartite graph with two node types: genes and cells. An undirected edge was added between a gene node and a cell node if the gene was expressed in that cell, with edge weights initially binarized (0 or 1) to indicate the presence or absence of an expression relationship, thereby encoding cell-specific transcriptional activity. Node representations were initialized using two independent linear neural networks that projected raw features of cells and genes into 256-dimensional embeddings. These embeddings were then used as inputs to the HGT for modeling gene–cell interactions.

Heterogeneous graph transformer for gene–cell embedding

To capture the complex relationships between genes and cells, the gene–cell bipartite graph was processed by a multi-layer HGT in an unsupervised manner (see detail description in S1 Text) [23,41]. Each layer of the HGT consists of five components : (i) multi-head self-attention, where query, key, and value vectors are generated for each target–source pair according to node and relation types; (ii) message passing, propagating information between connected nodes; (iii) mutual attention, which quantifies the importance of gene–cell interactions; (iv) target aggregation, where weighted messages are integrated for each node; and (v) update and normalization, which refine the embeddings, as detailed in S1 Fig.

Specifically, each cell node j aggregates information from its connected gene nodes via a multi-head attention mechanism. For each target cell j – source gene i pair, linear transformations generate query, key, and value vectors:

(1)

where hj and hi are the embeddings from the previous layer, and WQ, WK, are learnable weight matrices specific to the node types and relation types [42,43]. The attention weight of gene node i towards cell node j for head h is computed as:

(2)

where dh is the dimension of the key and query vectors. Attention weights of all connected gene nodes are normalized to assess their relative importance to the target cell. The updated embedding of cell j is obtained by aggregating the weighted value vectors and integrating with its previous embedding.

Through multiple stacked layers, the model outputs cell embeddings, gene embeddings, and attention weights between gene–cell pairs. The embeddings of genes and cells are subsequently combined by inner product to reconstruct the original gene–cell matrix. Trained in an end-to-end, unsupervised manner by minimizing Kullback–Leibler (KL) divergence between the reconstructed gene–cell matrix and the original Gene Integration Matrix X:

(3)

Minimizing KL encourages the model to learn embeddings that accurately capture the original gene–cell relationships while preserving the underlying transcriptional and regulatory structure. The HGT model was implemented with a hidden layer dimension of 128, number of attention heads H = 16, number of graph neural network layers of 2 and trained for E = 100 epochs with the learning rate is initially set to 0.01 and then automatically adjusted during training and the optimizer is Adam. Additional hyperparameters and implementation details are provided in Table A in S1 Text.

After training, the attention weights from all heads and layers are combined and normalized to generate a GAS matrix; This matrix provides an interpretable measure of each gene’s contribution to specific cell populations and is subsequently used for identifying key genes specific to each branch. Seeing S1 Text for details.

(4)

Identification of branch-specific regulatory genes

To identify branch-specific regulatory genes, we focused on cells corresponding to each pseudotemporal phase, with particular emphasis on branching-point cells. For each cell, genes were ranked according to their GAS derived from the trained HGT model. Genes with GAS above a predefined threshold [4446] were selected as candidate regulators.

Across all cells within a given phase, we retained as branch-specific key genes those expressed in a majority of cells (50%). While this threshold may exclude rare transitional regulators with low expression, it offers a practical balance between reducing noise and capturing reproducible signals. These high-confidence regulators were subsequently used to reconstruct reduced differentiation trajectories that preserve the original branching topology and to infer branch-specific GRNs, thereby capturing dynamic transcriptional programs and lineage-restricted regulatory shifts during fate bifurcation.

Trajectory reconstruction and gene regulatory network construction

To validate the identified branch-specific key genes, cellular differentiation trajectories were reconstructed using only these genes, following the same procedure as the full-gene analysis. The resulting trajectories exhibited high topological consistency with the original ones, confirming that the selected genes capture essential branching dynamics.

To investigate dynamic gene regulation during differentiation, Log2 Fold Change (Log2FC) was calculated for each gene between experimental and control groups, defined as , where x and y denote the average expression levels of the gene in the experimental and control clusters, respectively. This metric quantifies differential expression and highlights transcriptional changes across pseudotime. Based on these expression dynamics and pseudotime information, branch-specific gene regulatory networks were inferred for each cluster using GENIE3 [47] or STREAM [48], providing insight into lineage-restricted regulatory programs during cell fate commitment.

Dataset description

We used four single-cell multi-omics datasets to evaluate BranchKGN, show in Table 1. First, oRG-111 dataset profiles outer radial glia (oRG) cells in the developing human cerebral cortex using single-cell RNA sequencing [49]. oRG cells are a specialized neural stem cell population in the outer subventricular zone, exhibiting transcriptional programs distinct from ventricular radial glia (vRG) cells and playing essential roles in cortical expansion and gyrification. The dataset contains 393 cells and 25137 genes. From these, 111 oRG-associated genes, including TNC, PTPRZ1, FAM107A, HOPX, and LIFR, were curated as gold-standard labels. This dataset was used to benchmark BranchKGN against existing methods for key gene identification.

PBMC datasets (pbmc_granulocyte_sorted_3k, pbmc_unsorted_3k, pbmc_unsorted_10k) [50] were generated by 10x Genomics from peripheral blood mononuclear cells (PBMCs) of healthy donors, using scRNA-seq and paired scATAC-seq (https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets). PBMCs originate from hematopoietic stem cells and differentiate into two major lineages: myeloid and lymphoid [5153]. They mainly consist of lymphocytes (T cells, B cells, NK cells), monocytes, and dendritic cells, making them a widely used system for studying hematopoietic differentiation. The pbmc_granulocyte_sorted_3k dataset (∼3,000 cells) was sorted to exclude granulocytes, enabling focused analysis of non-granulocyte immune populations. In contrast, the pbmc_unsorted_3k and pbmc_unsorted_10k datasets (∼3,000 and ∼10,000 cells, respectively) retained all PBMC subtypes, thus offering broader cellular diversity and increased data complexity. Although lacking predefined gold-standard labels, the PBMC datasets enable evaluation of BranchKGN by assessing whether branch-specific key genes can reconstruct differentiation trajectories and gene regulatory networks that remain consistent with full-gene analyses and biologically plausible.

Results

Benchmarking on the oRG-111 labeled dataset

BranchKGN was primarily developed to identify branch-specific regulators driving lineage bifurcations. However, due to the lack of benchmark datasets with explicit branch-point labels, direct evaluation of this ability is currently not feasible. As a proxy, we evaluated the general capability of BranchKGN in detecting functionally relevant genes using the oRG-111 dataset, which provides experimentally validated oRG-specific gene labels. This setting allowed us to benchmark BranchKGN against existing key gene identification approaches, even though the dataset does not directly capture branch-point regulation.

We benchmarked BranchKGN against four representative approaches: three statistical methods summarized by Li et al. [54], DISP, MVP, and VST [8,55], as well as the widely used toolbox scGEAToolbox [56]. For each method, we selected genes under four ranking thresholds (Top 500, 1000, 2000, and 3000). In BranchKGN, the selection probability was fixed at 0.5, and only the ranking of genes was varied; the actual number of selected genes was substantially smaller than the given thresholds.

The comparative results (Fig 2) demonstrate that BranchKGN consistently identifies oRG-specific genes with higher accuracy than the baseline methods, particularly at moderate ranking thresholds (Top 1000–2000). Notably, BranchKGN effectively prioritizes known oRG marker genes such as TNC, HOPX, and FAM107A, highlighting its strength in capturing biologically meaningful branch-specific signatures. However, we also observed that when the number of candidate genes was restricted to fewer than 1000, the accuracy of key gene identification decreased (<10%), indicating a limitation in sensitivity under stringent selection cutoffs. Although this dataset is not a strict benchmark for branch-point testing, these results demonstrate that BranchKGN exhibits a clear advantage in identifying functionally relevant key genes, laying the foundation for its application in branch-point scenarios.

thumbnail
Fig 2. Accuracy of BranchKGN versus existing methods on the oRG-111 dataset.

Bar plots show the percentage of correctly identified oRG-specific genes at four ranking thresholds (Top 500, 1,000, 2,000, 5,000) for DISP, MVP, VST, scGEAToolbox, and BranchKGN. A pie chart inset indicates the proportion of known oRG-associated genes in the dataset (111 genes, 0.44%), serving as the gold-standard reference.

https://doi.org/10.1371/journal.pcbi.1013664.g002

Identify branch-specific genes and differentiation trajectory inference

In order to find the branch-specific genes and infer the differentiation trajectory by BranchKGN, the “pbmc_unsorted_3k dataset” is used as a case study to illustrate branch-specific gene expression patterns (the results of other two datasets can be found in S4 Fig, S5 Fig and S6 Fig). Detailed methodological descriptions can be found in S1 Text.

By using MAESTRO, we integrated the gene expression matrix and the chromatin-derived gene activity scores into a shared low-dimensional space. PCA was then applied for dimensionality reduction, followed by GMM clustering, which identified six cell populations consistent with known biological annotations (Fig 3). Based on these clusters, Slingshot inferred cellular trajectories and revealed a clear bifurcation structure, allowing segmentation of the differentiation process into three phases: pre-branching (cluster 6), mid-branching (cluster 5), and post-branching (clusters 1/3 vs. clusters 2/4) (Fig 3A and 3B). To validate this classification, we further applied TFvelo [57], an RNA velocity-based trajectory inference method. The results (S3 Fig) confirmed that the overall divergence trajectories and key branching points were consistent with our initial reconstruction.

thumbnail
Fig 3. Pseudotemporal trajectory analysis and gene expression profiling for the pbmc_unsorted_3k dataset.

A: Slingshot-inferred trajectory overlaid on cluster assignments, with six original clusters consolidated into four representative groups. B: Pseudotemporal ordering of all genes. C: Expression profile of the gene ARHGAP26 across clusters along the differentiation trajectory. D: Reconstructed trajectory using only the top 312 key genes.

https://doi.org/10.1371/journal.pcbi.1013664.g003

To illustrate branch-specific gene expression dynamics, we examined ARHGAP26 across the identified clusters. As shown in Fig 3C, ARHGAP26 is highly expressed in clusters 2 and 4, while remaining low in clusters 1, 3, 5, and 6. Its temporal regulation along the pseudotime trajectory is evident: ARHGAP26 is upregulated during the mid-branching phase (cluster 5) and maintained in one post-branch lineage (cluster 6), but downregulated in the alternative branch. This highlights its branch-specific expression during the bifurcation process. To further demonstrate the generality of our method, S2 Fig presents expression patterns of four additional top-ranked genes, randomly selected from the highest-scoring candidates. These genes consistently show branch-specific patterns across different lineages, reinforcing the robustness and biological relevance of the key gene identification by BranchKGN.

Fig 3D shows the reconstructed trajectory using the top 312 key genes from the pbmc_unsorted_3k dataset. To assess reconstruction accuracy, we compared the pseudotime of all cells with that obtained using only branch-cluster reconstruction and calculated the Pearson correlation coefficient, which was 0.715. This significant positive correlation indicates that the reconstructed trajectory largely preserves the ordering of cells along the differentiation pathway.

Verification of branch-specific genes and construction of gene regulatory networks

Using HGT method in Sect 3.1, we identified 312 genes with relatively high expression levels in mid-differentiation cells from a total of 12,837 genes. To evaluate whether these genes represent key regulators, we analyzed their expression dynamics across four distinct cell clusters and calculated average expression scores for each cluster.

As shown in Fig 3A, the trajectory can be divided into four sequential stages: pre-branching (cluster 1), mid-differentiation (cluster 2), and two post-branching lineages (clusters 3 and 4). To capture biologically meaningful transcriptional transitions, differential expression analysis was conducted only between adjacent stages along the trajectory, using cluster 2 as the reference cluster for comparisons with the other three clusters to assess statistically significant differences in gene expression, as shown in Fig 4A. Specifically, we compared (i) cluster 1 vs. cluster 2, (ii) cluster 2 vs. cluster 3, and (iii) cluster 2 vs. cluster 4.

thumbnail
Fig 4. The expression variation of transfer genes.

A: Log2FC variation of gene sets at different branches. B: Heatmap of the expression of selected genes in differentiated clusters.

https://doi.org/10.1371/journal.pcbi.1013664.g004

For each pairwise comparison, average expression levels of the 312 candidate genes were computed, and two-sample t-tests were performed to assess statistical differences. P-values were adjusted using the Benjamini–Hochberg procedure to control the false discovery rate. As summarized in Table 2, a modest but non-significant increase was observed from cluster 1 to cluster 2 (Log2FC = 0.50, FDR = 0.114), whereas cluster 2 showed significant downregulation relative to clusters 3 and 4 (Log2FC = –0.99 and –1.32; FDR = 0.010 and 0.023, respectively). Non-adjacent comparisons (e.g., cluster 1 vs. clusters 3 or 4) were excluded, as they do not reflect direct sequential differentiation and may be biologically misleading.

thumbnail
Table 2. P-Value and FDR-adjusted p-value are used to judge the significance of Log2FC.

https://doi.org/10.1371/journal.pcbi.1013664.t002

The pronounced reduction in gene expression in cluster 2 suggests transcriptional repression during mid-differentiation, with gradual reactivation in the later stages. Fig 4B presents a heatmap illustrating the expression profiles of the selected genes across the four clusters.

To construct the gene regulatory networks, eight genes are randomly selected from the scoring table: ANAPC16, CHD3, FZD3, LYST, MAML2, SIPA1L1, UBE2D2, and USP3. First, the average expression of these eight genes in different branches is drawn in Fig 5, and it could be seen that the expression of LYST, MAML2, SIPA1L1, UBE2D2, USP3 was significantly different. The gene regulatory network is then constructed for different branches, as shown in Fig 6.

thumbnail
Fig 5. Regulation map of partial gene expression in different branches.

This graph shows the changes in the average expression level and percentage of multiple genes in different cell branches (C1, C2, C3, C4). The dot color in the figure represents the average expression level of the gene, the darker the color, the higher the expression level, while the dot size indicates the percentage of the gene expressed in the cell, that is, how many cells express the gene.

https://doi.org/10.1371/journal.pcbi.1013664.g005

thumbnail
Fig 6. GRNS at different branch points.

A: Pre-branching network centered on MAML2, emphasizing its regulatory interactions with various genes. B: Branching point network focused on USP1. C: Post-branching phase 1 network with MAML2. D: Post-branching phase 2 network featuring MAML2, USP1, and SIPA1L1. In each panel, nodes represent genes, where key genes are highlighted at a larger size, and the thickness of edges reflects the strength of interactions. These networks reveal the dynamic rewiring of regulatory interactions at different stages of cell differentiation.

https://doi.org/10.1371/journal.pcbi.1013664.g006

Using Genie3 to filter the “importance” value in each branch, it can be seen that MAML2 is expressed in all four branches and dominates before differentiation in Fig 6. During differentiation, LYST, SIPA1L1 and USP3 began to express. In addition, MAML2 and SIPA1L1, LYST and USP3 were expressed in post-branch 1 and post-branch 2, respectively, it shows differences in gene expression. As can be seen from the Fig 5, the heterogeneity of gene expression in different cell branches, for example, the expression levels and percentages of genes LYST, MAML2, SIPAIL1 and USP3 in C2 branch are significantly higher than in other branches.This may indicate the functional importance or regulatory differences of these genes in specific cell branches.

Biological function analysis of branch-specific genes

In this section, the biological function analysis will be made for the branch-specific genes. It resorts to enrichment analysis which is a technique widely used in the field of bioinformatics to identify key genes and related pathways from large amounts of genetic data. It consists of several systematic steps as following:

  1. Preliminary collection and collation of gene data in the selected library to ensure the completeness and accuracy of the data.
  2. ClusterProfiler R package was used to realize KEGG [58] and GO [59] pathway analysis.
  3. Evaluate the number of genes and their corresponding p-values for each biological process.

To investigate the biological function of the branch-specific genes, we leveraged DisGeNET [60], a comprehensive database integrating gene–disease associations. Enriched biological processes were ranked based on their p-values, and the top 20 terms were visualized using bar plots (Fig 7). Statistically significant terms (p < 0.05) were highlighted with color-coded bars to facilitate interpretation. These results offer key insights and potential entry points for further investigation into the molecular mechanisms underlying the identified gene sets.

thumbnail
Fig 7. Enrichment analysis by KEGG,GO and DisGeNET.

A: KEGG enrichment analysis showing significant pathways. B: KEGG pathway enriches the sankey plot. C: List of diseases associated with the analyzed gene set, ranked by their prevalence or significance.

https://doi.org/10.1371/journal.pcbi.1013664.g007

In addition to the pbmc_unsorted_3k dataset, we further applied BranchKGN to two additional PBMC datasets (pbmc_granulocyte_sorted_3k and pbmc_unsorted_10k). The results, shown in S4 FigS6 Fig, reveal consistent findings: trajectories reconstructed by BranchKGN preserve the branching topology, while the identified branch-specific regulators align with biologically relevant pathways. Notably, even for the larger 10k dataset, BranchKGN robustly recovered coherent differentiation trajectories and stage-specific GRNs, underscoring the scalability and reliability of our method.

Discussion and conclusion

To deeply characterize gene expression dynamics during cell differentiation, we propose a graph attention network-based analysis framework, BranchKGN, designed to identify branch-specific key genes. By integrating scRNA-seq and scATAC-seq data across three real datasets, BranchKGN demonstrates robust performance in capturing cell fate transitions. Using HGT model, Slingshot algorithm and GMM clustering, we delineate the differentiation trajectory into four key stages. This staging not only spans the full temporal course of differentiation but also accurately captures critical transitions near branch points. Gene expression patterns specific to each stage are clearly resolved, highlighting the method’s ability to reveal regulatory dynamics at single-cell resolution. BranchKGN thus provides a valuable tool for dissecting cell state transitions and regulatory mechanisms during differentiation.

However, there is also potential limitation in the current study. BranchKGN currently incorporates only scRNA-seq and scATAC-seq data, without fully addressing inter-tissue heterogeneity, which may limit its generalizability across diverse biological contexts. Additionally, the trajectory inference approach is relatively simple, modeling only bifurcating structures and lacking resolution for more complex differentiation paths. Future work should focus on integrating spatial transcriptomics and broader multi-omics data, as well as enhancing trajectory modeling, to better capture the intricate regulatory networks underlying cell differentiation.

Supporting information

S1 Text. Details of scRNA-seq and scATAC-seq data integration, the principles and implementation of the Heterogeneous Graph Transformer, and the parameter settings (Table A).

https://doi.org/10.1371/journal.pcbi.1013664.s001

(DOCX)

S1 Fig. Schematic overview of the Heterogeneous Graph Transformer (HGT).

The framework takes as input a gene–cell bipartite graph with initial embeddings. Its core principle is to enable information exchange between nodes through graph edges, while employing an attention mechanism to dynamically weight the contributions of different neighbors. The model outputs low-dimensional embeddings for both cells and genes, together with attention scores that quantify the importance of each gene to each cell.

https://doi.org/10.1371/journal.pcbi.1013664.s002

(TIFF)

S2 Fig. Expression patterns of four representative top-ranked genes (LYST, DOCK8, MAML2 and UBE2D2) identified by BranchKGN.

LYST and DOCK8 exhibit strong branch-specific expression in Branch 1, with little or no expression in Branch 2. MAML2 remains expressed both before and after branching and functions as a transcriptional co-activator in the Notch signaling pathway. UBE2D2 is expressed across both branches and encodes the ubiquitin-conjugating enzyme E2D2, a key regulator of protein degradation, cell cycle progression, and signal transduction. These results exemplify the branch-specific and biologically relevant expression patterns consistently captured by our method.

https://doi.org/10.1371/journal.pcbi.1013664.s003

(TIFF)

S3 Fig. The velocity flow graph obtained using Tfvelo.

Cells in cluster 1 exhibit a coherent outward velocity toward cluster 3. Subsequently, cells in cluster 3 diverge into two distinct trajectories: one flowing toward clusters 5 and 6, and the other toward clusters 2 and 4. This dynamic pattern confirms the bifurcating differentiation process identified in our trajectory analysis.

https://doi.org/10.1371/journal.pcbi.1013664.s004

(TIFF)

S4 Fig. Results of the Pbmc_granulocyte_sorted_3k datasets.

(A) Slingshot trajectory analysis with GMM clusters, illustrating the pseudotime progression of cells through different developmental stages. (B) Distribution of cell pseudotime in PCA space, color-coded by pseudotime (red = early stages, yellow = later stages). (C) Trajectory mapping using selected gene expression data, showing the transition of cells across principal components. (D) Pseudotime representation in PCA space, with a color gradient from light (early) to dark (late). (E) Bar plot of GO and KEGG enrichment analysis, where bar length represents gene count and color intensity reflects statistical significance (adjusted p-value). (F) GO analysis of differentially expressed genes, highlighting enriched biological processes, molecular functions, and cellular components.

https://doi.org/10.1371/journal.pcbi.1013664.s005

(TIFF)

S5 Fig. The result of analysis for Pbmc_unsorted_10k datasets.

(A) Slingshot trajectory analysis with GMM clusters, depicting lineage bifurcation. (B) Distribution of cell pseudotime in PCA space. (C) Trajectory mapping using selected gene expression data, showing smooth transitions across principal components. (D) Pseudotime representation in PCA space, with a color gradient from light (early) to dark (late). (E, F) KEGG and GO enrichment analysis of branch-specific genes, presented as bar plots where bar length indicates gene count and color intensity reflects adjusted p-value significance.

https://doi.org/10.1371/journal.pcbi.1013664.s006

(TIFF)

S6 Fig. Gene regulatory networks of the Pbmc_granulocyte_sorted_3k and Pbmc_unsorted_10k datasets.

(A) GRN visualization for the Pbmc_granulocyte_sorted_3k dataset. Core genes are highlighted in brown. A1: Pre-branching network dominated by ZFP69, ZNF624, ZBTB16, ZNF80A, and PRDM1. A2: Network at the branching point, with ZBTB16, TBX21, IKZF3, and PRDM1 as key regulators. A3: Branch 1 network primarily regulated by ZBTB16, with sub-networks involving ZBF707, ZNF30, and CASZ2. A4: Branch 2 network featuring a dual-layer structure driven by ZBTB16, PRDM1, IKZF3, NFATC2, and TBX2. (B) GRN visualization for Pbmc_unsorted_10k dataset (core genes in green). with key genes including BCL11B, FZD3, ITK, CASZ1, NFE2, RBM44, RSPH3, ZFP69, and ZNF707.

https://doi.org/10.1371/journal.pcbi.1013664.s007

(TIFF)

Acknowledgments

The authors would like to give their grateful thanks to Professor Zhou Tianshou and Dr. Cao Wenjie from Sun Yat-sen University for their valuable suggestions.

References

  1. 1. McKinley KL, Castillo-Azofeifa D, Klein OD. Tools and concepts for interrogating and defining cellular identity. Cell Stem Cell. 2020;26(5):632–56. pmid:32386555
  2. 2. Mincarelli L, Lister A, Lipscombe J, Macaulay IC. Defining cell identity with single-cell omics. Proteomics. 2018;18(18):e1700312. pmid:29644800
  3. 3. Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with single-cell genomics. Nat Biotechnol. 2016;34(11):1145–60. pmid:27824854
  4. 4. Badia-I-Mompel P, Wessels L, Müller-Dott S, Trimbour R, Ramirez Flores RO, Argelaguet R, et al. Gene regulatory network inference in the era of single-cell multi-omics. Nat Rev Genet. 2023;24(11):739–54. pmid:37365273
  5. 5. Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R. Gene regulatory network inference: data integration in dynamic models-a review. Biosystems. 2009;96(1):86–103. pmid:19150482
  6. 6. Brackston RD, Lakatos E, Stumpf MPH. Transition state characteristics during cell differentiation. PLoS Comput Biol. 2018;14(9):e1006405. pmid:30235202
  7. 7. Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell. 2018;174(5):1309-1324.e18.
  8. 8. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888-1902.e21. pmid:31178118
  9. 9. Sun R, Cao W, Li S, Jiang J, Shi Y, Zhang B. scGRN-entropy: inferring cell differentiation trajectories using single-cell data and gene regulation network-based transfer entropy. PLoS Comput Biol. 2024;20(11):e1012638. pmid:39585902
  10. 10. Hoppe PS, Schwarzfischer M, Loeffler D, Kokkaliaris KD, Hilsenbeck O, Moritz N, et al. Early myeloid lineage choice is not initiated by random PU.1 to GATA1 protein ratios. Nature. 2016;535(7611):299–302. pmid:27411635
  11. 11. Olsson A, Venkatasubramanian M, Chaudhri VK, Aronow BJ, Salomonis N, Singh H, et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature. 2016;537(7622):698–702. pmid:27580035
  12. 12. Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, et al. Reversed graph embedding resolves complex single-cell trajectories. Nat Methods. 2017;14(10):979–82. pmid:28825705
  13. 13. Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566(7745):496–502. pmid:30787437
  14. 14. Sha Y, Qiu Y, Zhou P, Nie Q. Reconstructing growth and dynamic trajectories from single-cell transcriptomics data. Nat Mach Intell. 2024;6(1):25–39. pmid:38274364
  15. 15. Wolf FA, Hamey FK, Plass M, Solana J, Dahlin JS, Göttgens B, et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 2019;20(1):59. pmid:30890159
  16. 16. McInnes L, Healy J, Melville J. Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint 2018. https://doi.org/10.48550/arXiv.1802.03426
  17. 17. Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14(11):1083–6. pmid:28991892
  18. 18. Han C, Cao W, Li C, Guo Y, Wang Y, Shi YZ. MLBKFD: probabilistic model methods to infer pseudo trajectories from single-cell data. Journal of Computational Biophysics and Chemistry. 2024;23(06):723–39.
  19. 19. Zhou P, Wang S, Li T, Nie Q. Dissecting transition cells from single-cell transcriptome data through multiscale stochastic dynamics. Nat Commun. 2021;12(1):5609. pmid:34556644
  20. 20. Bocci F, Zhou P, Nie Q. spliceJAC: transition genes and state-specific gene regulation from single-cell transcriptome data. Mol Syst Biol. 2022;18(11):e11176. pmid:36321549
  21. 21. Marku M, Pancaldi V. From time-series transcriptomics to gene regulatory networks: a review on inference methods. PLoS Comput Biol. 2023;19(8):e1011254. pmid:37561790
  22. 22. Zhang J, Nie Q, Zhou T. Revealing dynamic mechanisms of cell fate decisions from single-cell transcriptomic data. Frontiers in Genetics. 2019;10.
  23. 23. Ma A, Wang X, Li J, Wang C, Xiao T, Liu Y, et al. Single-cell biological network inference using a heterogeneous graph transformer. Nat Commun. 2023;14(1):964. pmid:36810839
  24. 24. Wang X, Duan M, Li J, Ma A, Xin G, Xu D, et al. MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer. Nat Commun. 2024;15(1):338. pmid:38184630
  25. 25. Hegde A, Cheng J. GRNFomer: accurate gene regulatory network inference using graph transformer. bioRxiv. 2025.
  26. 26. Xu J, Lu C, Jin S, Meng Y, Fu X, Zeng X, et al. Deep learning-based cell-specific gene regulatory networks inferred from single-cell multiome data. Nucleic Acids Res. 2025;53(5):gkaf138. pmid:40037709
  27. 27. Hegde A, Nguyen T, Cheng J. Machine learning methods for gene regulatory network inference. arXiv preprint 2025. https://arxiv.org/abs/250412610
  28. 28. Xu F, Zhang Y, Li Y. Graph neural network algorithm based on graph convolution and attention mechanism. In: 2025 IEEE 5th International Conference on Power, Electronics and Computer Applications (ICPECA). 2025. p. 492–7.
  29. 29. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20. pmid:29608179
  30. 30. Wang C, Sun D, Huang X, Wan C, Li Z, Han Y, et al. Integrative analyses of single-cell transcriptome and regulome using MAESTRO. Genome Biol. 2020;21(1):198. pmid:32767996
  31. 31. Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573-3587.e29. pmid:34062119
  32. 32. Lever J, Krzywinski M, Altman N. Principal component analysis. Nat Methods. 2017;14(7):641–2.
  33. 33. Xiang R, Wang W, Yang L, Wang S, Xu C, Chen X. A comparison for dimensionality reduction methods of single-cell RNA-seq data. Frontiers in Genetics. 2021;12.
  34. 34. Reynolds D. Gaussian Mixture Models. Springer US; 2009. p. 659–63.
  35. 35. Hu Z, Dong Y, Wang K, Sun Y. Heterogeneous graph transformer. In: Proceedings of the web conference 2020. 2020. p. 2704–10.
  36. 36. Wang X, Bo D, Shi C, Fan S, Ye Y, Yu PS. A survey on heterogeneous graph embedding: methods, techniques, applications and sources. IEEE Trans Big Data. 2023;9(2):415–36.
  37. 37. Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20(1):296. pmid:31870423
  38. 38. Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with signac. Nat Methods. 2021;18(11):1333–41. pmid:34725479
  39. 39. Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 2018;19(1):477. pmid:29914354
  40. 40. Boukouvalas A, Hensman J, Rattray M. BGP: identifying gene-specific branching dynamics from single-cell data with a branching Gaussian process. Genome Biol. 2018;19(1):65. pmid:29843817
  41. 41. Xuan P, Li P, Cui H, Wang M, Nakaguchi T, Zhang T. Learning multi-types of neighbor node attributes and semantics by heterogeneous graph transformer and multi-view attention for drug-related side-effect prediction. Molecules. 2023;28(18):6544. pmid:37764319
  42. 42. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in neural information processing systems. 2017;30.
  43. 43. Su G, Wang H, Zhang Y, Zhang W, Lin X. Simple and deep graph attention networks. Knowledge-Based Systems. 2024;293:111649.
  44. 44. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1):15. pmid:29409532
  45. 45. Munari E, Zamboni G, Lunardi G, Marchionni L, Marconi M, Sommaggio M, et al. PD-L1 expression heterogeneity in non-small cell lung cancer: defining criteria for harmonization between biopsy specimens and whole sections. J Thorac Oncol. 2018;13(8):1113–20. pmid:29704674
  46. 46. Al Taweraqi N, King RD. Improved prediction of gene expression through integrating cell signalling models with machine learning. BMC Bioinformatics. 2022;23(1):323. pmid:35933367
  47. 47. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010;5(9):e12776. pmid:20927193
  48. 48. Li Y, Ma A, Wang Y, Guo Q, Wang C, Fu H, et al. Enhancer-driven gene regulatory networks inference from single-cell RNA-seq and ATAC-seq data. Brief Bioinform. 2024;25(5):bbae369. pmid:39082647
  49. 49. Pollen AA, Nowakowski TJ, Chen J, Retallack H, Sandoval-Espinosa C, Nicholas CR, et al. Molecular identity of human outer radial glia during cortical development. Cell. 2015;163(1):55–67. pmid:26406371
  50. 50. Hao Y, Stuart T, Kowalski MH, Choudhary S, Hoffman P, Hartman A, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol. 2024;42(2):293–304. pmid:37231261
  51. 51. Zhang M, Huang B. The multi-differentiation potential of peripheral blood mononuclear cells. Stem Cell Res Ther. 2012;3(6):48. pmid:23200034
  52. 52. Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049. pmid:28091601
  53. 53. Luo J, Lu J, Zeng J, Ma Y, Gong Q, Wang Z, et al. Single-cell RNA analysis of chemokine expression in heterogeneous CD14+ monocytes with lipopolysaccharide-induced bone resorption. Exp Cell Res. 2022;420(2):113343. pmid:36088998
  54. 54. Sheng J, Li WV. Selecting gene features for unsupervised analysis of single-cell gene expression data. Brief Bioinform. 2021;22(6):bbab295. pmid:34351383
  55. 55. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015;33(5):495–502. pmid:25867923
  56. 56. Cai JJ. scGEAToolbox: a Matlab toolbox for single-cell RNA sequencing data analysis. Bioinformatics. 2019:btz830. pmid:31697351
  57. 57. Li J, Pan X, Yuan Y, Shen H-B. TFvelo: gene regulation inspired RNA velocity estimation. Nat Commun. 2024;15(1):1387. pmid:38360714
  58. 58. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. pmid:10592173
  59. 59. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000;25(1):25–9. pmid:10802651
  60. 60. Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 2017;45(D1):D833–9. pmid:27924018