Abstract
Spatial transcriptomics has revolutionized the analysis of gene expression while preserving tissue spatial information, which provides novel insights into the cellular composition and function of complex biological tissues. However, current technologies are constrained by limited resolution and data sparsity, compromising the accuracy of downstream analyses. To address these challenges, we developed SpaVGN, a deep learning framework integrating convolutional neural networks, vision transformer, and graph neural networks for high-fidelity gene expression imputation and spatial domain identification. By combining local feature extraction, global attention mechanisms, and spatial graph-based modeling, SpaVGN effectively reconstructs missing transcriptomic data while preserving spatial tissue architecture. Evaluated on melanoma and sagittal posterior mouse brain datasets, SpaVGN outperformed existing methods in gene expression prediction, achieving Pearson correlation coefficients of 0.609 (melanoma) and 0.682 (mouse brain). It clearly delineated tumor regions and lymphoid niches in melanoma tissue, and achieved fine-grained resolution of hippocampal subfields in the mouse brain, including the Cornu Ammonis and Dentate Gyrus, with a Silhouette Score of 0.43 and a Davies-Bouldin Index of 0.86. Validation through UMAP dimensionality reduction and PAGA network analysis demonstrated that SpaVGN significantly mitigates the negative impact of data sparsity in spatial transcriptomics, improving data completeness and spatial continuity. This study presents an innovative solution that enhances the resolution of spatial transcriptomics data, offering cross-tissue applicability and providing a valuable tool for research in biological development, disease, and tumor heterogeneity.
Citation: Wang H, Zhang Y, Zhang Y, Zhao X, Bai Z, Ma X, et al. (2025) SpaVGN: A hybrid deep learning framework for high-resolution spatial transcriptomics data reconstruction and spatial domain identification. PLoS One 20(8): e0329122. https://doi.org/10.1371/journal.pone.0329122
Editor: Xiaohui Zhang, Bayer CropScience LP, UNITED STATES OF AMERICA
Received: April 22, 2025; Accepted: July 11, 2025; Published: August 14, 2025
Copyright: © 2025 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The minimal dataset underlying the findings of this study has been uploaded to Figshare and is available at the following DOI: https://doi.org/10.6084/m9.figshare.29374538.
Funding: This research was supported by the Innovative Ability Training Program for Graduate Students in Hebei Province (CXZZSS2025079). There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Spatial transcriptomic technologies enable the analysis of gene expression while preserving the spatial architecture of tissues—an essential feature for decoding the organization and function of complex cellular environments [1–3]. These technologies fall broadly into two main categories based on data acquisition: imaging-based technologies and sequencing-based technologies. The former includes methods such as MERFISH [4], seqFISH [5], and osmFISH [6], while sequencing-based methods include 10x Visium [7], Slide-seqV2 [8], Stereo-seq [9], and DBiT-seq [10]. Imaging-based methods detect transcripts through in situ sequencing or hybridization-based capture probes, offering subcellular resolution but limited throughput and transcriptome coverage [11]. In contrast, sequencing-based approaches rely on next-generation sequencing (NGS) technologies that associate transcripts with encoded spatial coordinates prior to sequencing [12,13], enabling high-throughput and unbiased coverage for whole-transcriptome-level gene expression measurement. However, their spatial resolution is constrained by the area and sparsity of capture domains [14]. For instance, 10x Visium uses capture spots spaced 100 μm apart [15], while traditional ST exhibits 200 μm spacing, leaving an estimated 54–80% of spatial gene expression unmeasured [16]. This substantial data sparsity reduces transcript-level spatial resolution and significantly impairs downstream analytical accuracy. Therefore, accurately inferring transcriptomic features in unmeasured spatial regions is essential for overcoming current technical limitations and fully exploiting the potential of spatial transcriptomic data.
Current approaches for predicting spatial transcriptomic gene expression can be broadly categorized into three types. Methods such as XFuse [17], Istar [18], and soScope [19] primarily employ advanced computational methods to integrate tissue information from multiple modalities, enabling the inference of super-resolution tissue structures. TCGN [20], THItoGene [21], and mclSTExp [22] focus on predicting spatial gene expression from H&E-stained histological images. However, these two types of methods perform poorly in downstream analysis tasks, with clustering metrics such as ARI on HER2+ and CSCC datasets ranging only between 0.1 and 0.4. The third type, represented by DIST [23] and stEnTrans [24], trains deep neural networks to learn spatial dependencies in gene expression patterns across different locations, enabling interpolation and prediction at unmeasured data points. Although these methods have made notable progress in enhancing the resolution of spatial gene expression profiles, they still face challenges in accuracy and generalization for downstream tasks.
Spatial domain identification methods primarily comprise non-spatial clustering methods and spatial clustering methods. Traditional non-spatial clustering methods, such as K-means [25], Louvain [26], and Scanpy [27], rely solely on gene expression data as input for cluster analysis, neglecting spatial location information [28,29]. To address this limitation, researchers have proposed various spatial clustering methods that integrate gene expression, spatial location, and morphological information to identify spatial domains accurately. BayesSpace [30] models low-dimensional gene expression matrices and incorporates a spatial prior to cluster truly neighboring locations. Increasingly, deep learning techniques are being adopted for spatial domain identification, as exemplified by SEDR [31], STAGATE [32], PAST [33], DeepST [34], MuCoST [35] and stHGC [36]. These approaches typically combine graph neural networks, autoencoders, and attention mechanisms to integrate spatial information and gene expression, thereby more effectively capturing spatial dependencies and complex tissue structures. Notably, Prost [37] applies a probabilistic framework to model spatial transcriptomic data, enabling the capture of uncertainty and variability inherent in biological signals; this improves robustness in the presence of noise and data heterogeneity. Meanwhile, with the rapid development of spatial clustering techniques, an increasing number of studies focus on integrating static spatial domain identification results with dynamic cell state transitions to construct spatiotemporally continuous models of biological processes [38–40]. Therefore, after identifying spatial domains in this study, we further explore the dynamic changes in cell states across different spatial domains.
To address the resolution limitations of spatial transcriptomic sequencing, we developed SpaVGN, a computational framework that integrates CNN, ViT, and GNN to predict gene expression in unsequenced tissue regions. By jointly analyzing gene expression patterns and spatial coordinates, SpaVGN enables high-fidelity transcriptomic reconstruction, simultaneously predicting unsequenced spots and integrating them with sequenced data for spatial domain identification and trajectory analysis.
2. Materials and methods
2.1. Datasets and data preprocessing
We used the Biopsy 1, Replicate 2 sample from the human melanoma dataset generated with the Spatial Transcriptomics (ST) platform, as it covers a relatively complete spatial region of the tissue section. The tissue section is captured at a spatial resolution of 100 μm and contains 293 spatial spots and 16,148 genes. To reduce technical noise and sparsity, we applied standard filtering procedures using Scanpy, excluding genes expressed in fewer than 20 spots and removing spots with fewer than 10 detected genes. After filtering, 292 spatial spots and 11,007 genes remained.
The mouse brain dataset corresponds to the sagittal posterior section of an adult mouse brain and is generated using the 10x Genomics Visium platform. The tissue section is captured at a spatial resolution of 55 μm and contains 3,353 spatial spots and 31,053 genes in the raw count matrix. After filtering out genes detected in fewer than 10 spots and removing low-quality spots with fewer than 200 expressed genes, the dataset contains 3,339 spatial spots and 16,609 genes.
Spatial coordinates were discretized into a 2D regular grid, and gene expression values were assembled into a 3D tensor $X \in \mathbb{R}^{G \times H \times W}$, where $G$ is the number of genes and $H \times W$ denotes the spatial layout. For model input, the tensor was reshaped into a 4D format $(B, C, H, W)$ with $C = 1$.
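To make the preprocessing concrete, the grid assembly described above can be sketched as follows. This is a minimal NumPy illustration with a hypothetical helper name, assuming integer grid coordinates; it is not the authors' implementation.

```python
import numpy as np

def spots_to_tensor(coords, expr):
    """Arrange per-spot expression (n_spots x n_genes) into a 3D
    (n_genes, H, W) tensor on a regular grid; unmeasured grid
    positions stay at zero. `coords` are integer (row, col) indices."""
    coords = np.asarray(coords)
    rows = coords[:, 0] - coords[:, 0].min()   # normalize to start at 0
    cols = coords[:, 1] - coords[:, 1].min()
    H, W = rows.max() + 1, cols.max() + 1
    n_genes = expr.shape[1]
    tensor = np.zeros((n_genes, H, W), dtype=float)
    tensor[:, rows, cols] = expr.T             # scatter spots onto the grid
    return tensor

# Tiny example: 3 spots, 2 genes
coords = [(1, 1), (1, 2), (2, 1)]
expr = np.array([[5.0, 1.0], [3.0, 0.0], [2.0, 4.0]])
X = spots_to_tensor(coords, expr)    # shape (2, 2, 2)
X4d = X[:, None, :, :]               # 4D model input with C = 1
```

Here each gene map occupies one entry of the leading dimension, and the singleton channel axis is added afterwards for model input.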
2.2. Overview of SpaVGN
In this study, we propose a hybrid model SpaVGN for high-quality reconstruction and completion of spatial transcriptomics data. The design of SpaVGN is motivated by the unique characteristics of spatial transcriptomics data and the limitations of existing unimodal learning approaches. ST datasets are inherently high-dimensional, spatially structured, and sparse, presenting challenges in both local signal reconstruction and global spatial understanding.
To address these challenges, SpaVGN integrates three complementary deep learning architectures: convolutional neural network (CNN), vision transformer (ViT), and graph neural network (GNN), each contributing unique capabilities. The CNN is employed to capture local spatial features and microenvironmental textures in gene expression maps. In spatial transcriptomics data, functionally related genes frequently exhibit localized co-expression patterns that the CNN can effectively model due to its localized receptive fields and translation-invariant operations. The vision transformer component is introduced to capture global contextual relationships across distant tissue regions. GNN is incorporated to explicitly encode spatial adjacency relationships between tissue spots or patches. Unlike the transformer that relies solely on learned attention weights without inherent geometric constraints, the GNN leverages biologically meaningful neighborhood graphs constructed based on either Euclidean distances or spot topology. This architectural choice ensures the model maintains both biological plausibility and spatial continuity (Fig 1).
2.3. Convolutional neural network module
The convolutional neural network module in SpaVGN is designed to extract localized spatial features from low-resolution gene expression maps. Spatial transcriptomics data, when arranged as two-dimensional grids, can be naturally interpreted as grayscale images with a single channel, where each pixel encodes gene expression intensity at a given spatial location. The CNN module leverages this structure to capture local gene expression patterns that are characteristic of microenvironmental variation across tissue sections.
The input to the CNN module is a single-channel tensor $X \in \mathbb{R}^{B \times 1 \times H \times W}$, where $B$ denotes the batch size, and $H$, $W$ are the spatial dimensions of the gene expression map. The CNN module consists of two stacked convolutional layers with ReLU nonlinearity applied after each convolution.

The input data is received and convolved with kernel $K_1$:

$$F_1 = \mathrm{ReLU}(X * K_1),$$

where $F_1$ is the output feature map of the first convolutional layer. The second convolutional layer further processes the output of the first layer by applying convolution with kernel $K_2$:

$$F_2 = \mathrm{ReLU}(F_1 * K_2).$$

The CNN module functions as a front-end encoder that captures biologically relevant spatial variations in gene expression at a local level. These features are then passed to the input of the patch embedding for global context modeling.
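The two-layer convolutional encoder can be illustrated with a minimal NumPy sketch. The kernel values and helper names here are hypothetical; a real implementation would use a deep learning framework with learnable multi-channel kernels.

```python
import numpy as np

def conv2d(x, k):
    """'Same' 2D convolution (cross-correlation) with zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def cnn_front_end(x, k1, k2):
    """Two stacked convolutions, each followed by ReLU."""
    f1 = np.maximum(conv2d(x, k1), 0.0)
    f2 = np.maximum(conv2d(f1, k2), 0.0)
    return f2

x = np.array([[0., 1., 0.],
              [1., 4., 1.],
              [0., 1., 0.]])
k = np.ones((3, 3)) / 9.0        # toy smoothing kernel for illustration
f = cnn_front_end(x, k, k)       # local features on the same grid
```

The localized receptive field of each output pixel is what lets the module capture microenvironmental co-expression textures.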
2.4. Vision transformer module
To capture long-range spatial dependencies across tissue sections, we incorporate a ViT module into the SpaVGN architecture. Unlike CNN, which is limited by local receptive fields, ViT enables all-to-all patch-wise interactions through self-attention mechanisms, allowing the model to reason over global spatial context and improve expression inference in sparsely measured or biologically distant regions.
2.4.1. Patch embedding and position encoding.
After CNN-based local processing, the grayscale image of gene expression is first divided into non-overlapping patches using the Patch Embedding module of the ViT. Specifically, a convolution with kernel size $P \times P$ and stride $P$ is applied to the input feature map, transforming the original spatial structure into a series of patch vectors. After convolution, the resulting tensor has shape $(B, D, H/P, W/P)$, where $D$ is the embedding dimension. Subsequently, flatten and transpose operations are applied to arrange it as:

$$Z \in \mathbb{R}^{B \times N \times D},$$

where $N = \frac{H}{P} \times \frac{W}{P}$ is the number of patches.
To preserve the spatial arrangement of the non-overlapping patches extracted from the 2D gene expression maps, we incorporate absolute positional encoding into the patch embeddings. Since the self-attention mechanism in Transformers is permutation-invariant, the model must be explicitly informed of the relative and absolute positions of each patch to ensure spatial awareness.
The position encoding is defined using sine and cosine functions, with the formulas as follows:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/D}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/D}}\right),$$

where $pos$ is the flattened patch index, $i$ is the embedding dimension index, and $D$ is the total embedding dimension. The resulting positional encoding matrix $PE \in \mathbb{R}^{N \times D}$ is added to the patch embedding sequence prior to self-attention computation:

$$Z' = Z + PE.$$

This step allows the model to capture the spatial layout of the tissue section during global context modeling.
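The sinusoidal positional encoding defined above can be sketched in NumPy as follows; the sequence length and embedding dimension are illustrative.

```python
import numpy as np

def sinusoidal_pe(n_patches, d):
    """Sine/cosine absolute positional encoding, shape (N, D)."""
    pos = np.arange(n_patches)[:, None]           # flattened patch index
    i = np.arange(d // 2)[None, :]                # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((n_patches, d))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

pe = sinusoidal_pe(n_patches=16, d=32)
z = np.random.randn(1, 16, 32)                    # patch embedding sequence
z_pos = z + pe[None]        # added to embeddings before self-attention
```

Because the encoding is deterministic, patches at the same grid position always receive the same offset, giving the permutation-invariant attention layers spatial awareness.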
2.4.2. Multi-Head Self-Attention.
In the context of spatial transcriptomics, MHSA enables the model to dynamically integrate expression information across distant tissue regions, capturing non-local co-expression patterns that may reflect shared functional programs or morphogenetic gradients. Unlike CNN or purely local models, the Transformer can identify context-aware dependencies even when signals are weak or noisy, a common challenge in spatial transcriptomics datasets.
In the subsequent global feature modeling, a standard Transformer module is employed. Each layer first utilizes the Multi-Head Self-Attention mechanism to model global dependencies within the input sequence, incorporating position encoding. The MHSA mechanism projects the input into queries, keys, and values:

$$Q = Z W_Q, \qquad K = Z W_K, \qquad V = Z W_V,$$

where $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$ are learnable weight matrices, then reshaped into $h$ attention heads, each of dimension $d_k = D/h$, producing $Q_i, K_i, V_i$. Each head computes scaled dot-product attention:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i.$$

Outputs from all heads are concatenated and linearly projected:

$$\mathrm{MHSA}(Z) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O,$$

where $W_O \in \mathbb{R}^{D \times D}$ is the output projection.
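The MHSA computation can be illustrated with a small NumPy sketch; the weight matrices are random for illustration only, whereas a real model learns them.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(z, wq, wk, wv, wo, n_heads):
    """Multi-head self-attention over a patch sequence z of shape (N, D)."""
    n, d = z.shape
    dk = d // n_heads
    q, k, v = z @ wq, z @ wk, z @ wv
    # split into heads: (n_heads, N, dk)
    q, k, v = (m.reshape(n, n_heads, dk).transpose(1, 0, 2) for m in (q, k, v))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dk))   # (h, N, N)
    heads = attn @ v                                         # (h, N, dk)
    out = heads.transpose(1, 0, 2).reshape(n, d)             # concat heads
    return out @ wo                                          # output projection

rng = np.random.default_rng(0)
d, n = 8, 5
z = rng.normal(size=(n, d))
wq, wk, wv, wo = (rng.normal(size=(d, d)) for _ in range(4))
y = mhsa(z, wq, wk, wv, wo, n_heads=2)    # (5, 8), all-to-all mixing
```

Every patch attends to every other patch, which is exactly what allows distant tissue regions to exchange expression context.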
2.5. Graph neural network module
The self-attention mechanism in the ViT module allows each patch to attend to all others in the sequence, enabling the model to capture global contextual dependencies and long-range co-expression patterns. However, this mechanism does not inherently account for the geometric proximity or topology of the underlying tissue structure. To address this limitation, we embed a GNN module into each transformer block. The GNN operates on a spatial graph constructed from the 2D coordinates of the patches, using a Gaussian kernel to define edge weights based on Euclidean distance. Through a series of localized message-passing operations, the GNN explicitly encodes spatial adjacency and reinforces topological continuity in the learned representations.
Given a sequence of spatial patches, we assign each patch a 2D coordinate $p_i = (r_i, c_i)$, corresponding to its row and column indices on the tissue grid. We compute a pairwise Euclidean distance matrix:

$$d_{ij} = \lVert p_i - p_j \rVert_2.$$

To quantify spatial similarity, we apply a Gaussian kernel with bandwidth parameter $\sigma$:

$$A_{ij} = \exp\!\left(-\frac{d_{ij}^2}{2\sigma^2}\right).$$

To reduce noise interference, only the $k$ nearest neighbors of each patch are retained. A mask matrix $M$ is constructed where $M_{ij} = 1$ if $j$ is among the $k$ nearest neighbors of $i$, and $M_{ij} = 0$ otherwise, followed by row normalization:

$$\tilde{A}_{ij} = \frac{M_{ij} A_{ij}}{\sum_{l} M_{il} A_{il}}.$$

For the input features of all patches in a batch, the GNN layer aggregates the information of neighboring nodes through $\tilde{A}$ to update each node embedding.
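The graph construction described above, Gaussian kernel followed by k-nearest-neighbor masking and row normalization, can be sketched as follows; the helper name and neighborhood size are illustrative, not the study's settings.

```python
import numpy as np

def spatial_adjacency(coords, sigma=1.0, k=4):
    """Row-normalized Gaussian adjacency restricted to k nearest neighbors."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))               # pairwise Euclidean distances
    a = np.exp(-d ** 2 / (2 * sigma ** 2))         # Gaussian kernel weights
    n = len(coords)
    mask = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]           # k nearest, excluding self
        mask[i, nbrs] = 1.0
    aw = mask * a
    return aw / aw.sum(axis=1, keepdims=True)      # row normalization

coords = [(r, c) for r in range(3) for c in range(3)]   # 3x3 patch grid
A = spatial_adjacency(coords, sigma=1.0, k=4)
msg = A @ np.random.randn(9, 5)   # one message-passing aggregation step
```

The final matrix multiply is a single aggregation step: each node's embedding becomes a distance-weighted average of its retained neighbors.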
2.6. Resolution reconstruction
Following the transformer blocks, SpaVGN generates a four-channel output tensor representing four spatially offset sub-pixel predictions for each patch. This design is inspired by the sub-pixel convolution technique commonly used in image super-resolution tasks. Instead of directly regressing high-resolution results, the model predicts four interleaved subregions corresponding to a pixel neighborhood at each low-resolution grid location.
Specifically, for an input feature map of shape $(B, 4, H, W)$, we use $c \in \{0, 1, 2, 3\}$ to denote the index of each of the four output channels, each of which is assigned to one of the four spatial positions in a $2 \times 2$ high-resolution patch. When $c = 0$, the channel represents even-row, even-column positions $(2i, 2j)$; $c = 1$ indicates even-row, odd-column positions $(2i, 2j+1)$; $c = 2$ denotes odd-row, even-column positions $(2i+1, 2j)$; and $c = 3$ implies odd-row, odd-column positions $(2i+1, 2j+1)$, where $i$ and $j$ traverse the low-resolution grid range $0 \le i < H$, $0 \le j < W$. The final high-resolution output $Y \in \mathbb{R}^{B \times 2H \times 2W}$ is constructed by interleaving these sub-pixel components to form a complete $2H \times 2W$ spatial map.
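The interleaving rule can be sketched as a minimal NumPy illustration of the sub-pixel rearrangement (a single-sample version of pixel shuffle):

```python
import numpy as np

def interleave_subpixels(y4):
    """Interleave a (4, H, W) prediction into a (2H, 2W) map.
    Channel 0 -> (even, even), 1 -> (even, odd),
    2 -> (odd, even),  3 -> (odd, odd)."""
    _, H, W = y4.shape
    out = np.empty((2 * H, 2 * W), dtype=y4.dtype)
    out[0::2, 0::2] = y4[0]
    out[0::2, 1::2] = y4[1]
    out[1::2, 0::2] = y4[2]
    out[1::2, 1::2] = y4[3]
    return out

y4 = np.arange(4 * 2 * 2).reshape(4, 2, 2)   # toy 4-channel prediction
hr = interleave_subpixels(y4)                # (4, 4) high-resolution map
```

Predicting four interleaved sub-grids instead of regressing the full map directly keeps each output channel on the same grid as the low-resolution input, which is the design choice borrowed from sub-pixel super-resolution.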
2.7. Loss function
The loss function in this study is based on a multi-channel mean squared error with spatial masking. Through decomposed sub-pixel modeling and a biological tissue masking mechanism, it maintains spatial topological consistency while effectively excluding interference from non-tissue regions. Let the input low-resolution image patch be $X$, the corresponding high-resolution target be $Y$, and the tissue mask matrix be $M \in \{0,1\}^{2H \times 2W}$ (where 1 indicates valid tissue regions). The model's predicted four-channel high-resolution output is denoted as $\hat{Y}_c$, $c \in \{0,1,2,3\}$, and the loss function is constructed as follows:

$$\mathcal{L} = \frac{1}{N_{\mathrm{valid}}} \sum_{c=0}^{3} \sum_{i,j} M_c(i,j)\,\bigl(\hat{Y}_c(i,j) - Y_c(i,j)\bigr)^2.$$

In this expression, the channel index $c$ corresponds to the four sub-pixel position patterns in high-resolution space, and $Y_c$ denotes the corresponding sub-pixel decomposition of the target $Y$. The mask component $M_c$ is obtained by down-sampling the original tissue mask matrix $M$, defined as $M_0(i,j) = M(2i, 2j)$, $M_1(i,j) = M(2i, 2j+1)$, $M_2(i,j) = M(2i+1, 2j)$, and $M_3(i,j) = M(2i+1, 2j+1)$, which extracts the binary identifiers corresponding to sub-pixel positions from the high-resolution mask. The normalization factor $N_{\mathrm{valid}} = \sum_{c} \sum_{i,j} M_c(i,j)$ dynamically counts the total number of valid tissue pixels across all channels, ensuring loss comparability between samples with different tissue morphologies.
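A minimal NumPy sketch of this masked loss, assuming the per-channel sub-pixel decomposition defined above (the helper name is illustrative), is:

```python
import numpy as np

def masked_mse(pred4, target_hr, mask_hr):
    """Masked multi-channel MSE. pred4: (4, H, W) sub-pixel predictions;
    target_hr, mask_hr: (2H, 2W) high-resolution target and tissue mask."""
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (row, col) sub-pixel offsets
    sq_err, n_valid = 0.0, 0.0
    for c, (dr, dc) in enumerate(offsets):
        t = target_hr[dr::2, dc::2]              # per-channel target Y_c
        m = mask_hr[dr::2, dc::2]                # down-sampled mask M_c
        sq_err += np.sum(m * (pred4[c] - t) ** 2)
        n_valid += m.sum()                       # count valid tissue pixels
    return sq_err / n_valid

pred = np.zeros((4, 2, 2))
target = np.ones((4, 4))
mask = np.ones((4, 4))
loss = masked_mse(pred, target, mask)   # all-ones target vs zero prediction
```

Pixels where the mask is zero contribute neither to the error sum nor to the normalizer, so empty regions outside the tissue boundary cannot dilute or inflate the loss.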
Overall, this study proposes a hybrid CNN, ViT, and GNN model for spatial transcriptomics data reconstruction, combining CNN local feature extraction, ViT global attention modeling, and GNN spatial relationship capture. The architecture processes gene expression data through convolutional layers, patch embedding with positional encoding, multi-head self-attention, and graph-based neighborhood aggregation, followed by inverse patch reconstruction. A masked MSE loss function preserves spatial topology while excluding non-tissue regions, enabling high-resolution prediction of unmeasured gene expression patterns.
2.8. Performance evaluation
To assess the performance of SpaVGN in reconstructing high-resolution spatial gene expression, we designed a standardized evaluation protocol based on spatial down-sampling and reconstruction. Specifically, we simulated low-resolution inputs from full-resolution datasets and evaluated reconstruction accuracy by comparing the imputed results with the original ground truth.
2.8.1. Data down-sampling and masking.
For each dataset, we first normalized the spatial coordinates to start from $(0, 0)$, and then applied grid-based down-sampling to simulate low-resolution spatial transcriptomic measurements. In the melanoma dataset, we retained spots at every second row and column position with a step size of 2, producing a uniformly down-sampled grid. For the 10x Visium dataset, which uses a honeycomb-like staggered layout, we adopted an alternating even-odd strategy to ensure accurate spatial sampling while preserving geometric patterns. In both cases, the remaining positions, which were not selected during down-sampling, were designated as masked locations, and their gene expression values were removed from the input. These masked spots served as the target locations for imputation. All imputation methods were applied to the low-resolution data to predict gene expression at these masked coordinates.
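For the uniform (non-staggered) case, the down-sampling and masking step can be sketched as follows; the helper is hypothetical and the Visium even-odd variant is omitted.

```python
import numpy as np

def downsample_grid(tensor, step=2):
    """Keep spots at every `step`-th row/column; mask the rest.
    Returns the low-resolution input and a boolean mask of held-out
    positions that serve as imputation targets."""
    G, H, W = tensor.shape
    keep = np.zeros((H, W), dtype=bool)
    keep[::step, ::step] = True
    lr_input = np.where(keep, tensor, 0.0)   # masked expression removed
    held_out = ~keep                         # target locations for imputation
    return lr_input, held_out

X = np.random.rand(3, 4, 4)                  # 3 genes on a 4x4 grid
lr, masked = downsample_grid(X)              # 4 spots kept, 12 held out
```

The held-out positions are exactly the locations at which every benchmarked method is asked to predict, so all methods are scored on the same targets.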
2.8.2. Imputation and correlation analysis.
After imputation, the reconstructed gene expression matrices were reassembled into a full-resolution spatial grid. For SpaVGN, the predicted sub-pixel patches were rearranged into a 2× super-resolved output and aligned with the original coordinate space. To ensure a fair comparison, only those positions within the original tissue boundary were considered in the evaluation.
We then computed the gene-wise Pearson correlation coefficient between the predicted and true expression vectors across all valid spatial locations:

$$r_g = \frac{\sum_{s=1}^{n} (x_{g,s} - \bar{x}_g)(\hat{x}_{g,s} - \bar{\hat{x}}_g)}{\sqrt{\sum_{s=1}^{n} (x_{g,s} - \bar{x}_g)^2}\,\sqrt{\sum_{s=1}^{n} (\hat{x}_{g,s} - \bar{\hat{x}}_g)^2}},$$

where $n$ denotes the number of evaluated spots, $x_{g,s}$ and $\hat{x}_{g,s}$ are the true and predicted expression values of gene $g$ at location $s$, and $\bar{x}_g$, $\bar{\hat{x}}_g$ are the corresponding means.
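The gene-wise PCC computation can be sketched in NumPy as follows (the arrays are illustrative toy data):

```python
import numpy as np

def genewise_pcc(true_expr, pred_expr):
    """Pearson correlation per gene across spots.
    Inputs: (n_spots, n_genes) arrays; returns a (n_genes,) vector of PCCs."""
    xt = true_expr - true_expr.mean(axis=0)      # center each gene
    xp = pred_expr - pred_expr.mean(axis=0)
    num = (xt * xp).sum(axis=0)
    den = np.sqrt((xt ** 2).sum(axis=0) * (xp ** 2).sum(axis=0))
    return num / den

true = np.array([[1., 2.], [2., 1.], [3., 3.]])
pred = np.array([[1.1, 2.2], [2.0, 0.9], [2.9, 3.1]])
pcc = genewise_pcc(true, pred)       # one correlation per gene
median_pcc = np.median(pcc)          # the summary statistic reported here
```

Reporting the median over all genes, as done in this study, makes the summary robust to a small number of poorly predicted genes.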
2.8.3. Spatial clustering evaluation.
To quantitatively assess the quality of spatial domain identification produced by SpaVGN, we employed two standard unsupervised clustering metrics: the Silhouette Coefficient (SC) and the Davies–Bouldin Index (DB). The SC measures how similar each point is to its assigned cluster compared to other clusters. For a given spatial spot $i$ with a cluster assignment label, the silhouette score is computed as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where $a(i)$ is the average distance between spot $i$ and all other spots in the same predicted cluster, and $b(i)$ is the minimum average distance between spot $i$ and all spots in any other cluster. The overall SC is the average of $s(i)$ over all spots. An SC close to 1 suggests accurate spatial domain delineation, while a value close to −1 implies possible misclassification of spots. The DB measures the average similarity between each predicted cluster and its most similar cluster, taking into account both intra-cluster compactness and inter-cluster separation. It is defined as:

$$DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\sigma_k + \sigma_{k'}}{d(\mu_k, \mu_{k'})},$$

where $K$ is the number of predicted spatial domains, $\sigma_k$ is the average distance of spots in cluster $k$ to its centroid $\mu_k$, and $d(\mu_k, \mu_{k'})$ is the distance between the centroids of clusters $k$ and $k'$.
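Both metrics can be computed from scratch as a sketch; equivalent functions exist in scikit-learn (`silhouette_score`, `davies_bouldin_score`), and this illustration assumes Euclidean distances.

```python
import numpy as np

def silhouette_and_db(x, labels):
    """Silhouette Coefficient and Davies-Bouldin Index from scratch."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
    # Silhouette: a(i) within-cluster, b(i) nearest other cluster
    s = []
    for i in range(len(x)):
        same = (labels == labels[i])
        same[i] = False                       # exclude the spot itself
        a = d[i, same].mean() if same.any() else 0.0
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        s.append((b - a) / max(a, b))
    # Davies-Bouldin: intra-cluster scatter vs centroid separation
    mu = np.array([x[labels == c].mean(0) for c in clusters])
    sig = np.array([np.sqrt(((x[labels == c] - mu[j]) ** 2).sum(1)).mean()
                    for j, c in enumerate(clusters)])
    K = len(clusters)
    db = np.mean([max((sig[j] + sig[l]) / np.linalg.norm(mu[j] - mu[l])
                      for l in range(K) if l != j) for j in range(K)])
    return float(np.mean(s)), float(db)

x = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
sc, db = silhouette_and_db(x, [0, 0, 1, 1])   # well-separated toy clusters
```

On this well-separated toy example the SC approaches 1 and the DB is near 0, matching the interpretation that higher SC and lower DB indicate better clustering.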
We evaluated the training and inference of SpaVGN on two benchmark datasets using a system equipped with an NVIDIA GeForce RTX 3090 GPU. On the melanoma dataset, 500 epochs plus inference required approximately 10 minutes. On the sagittal posterior mouse brain dataset, the same process took about 27 minutes.
3. Results
3.1. Algorithm evaluation of SpaVGN
To evaluate the performance of SpaVGN compared to existing methods (stEnTrans, DIST, Linear, Cubic, Nearest Neighbor and NEDI), we conducted a comprehensive analysis using melanoma and mouse brain tissue datasets. The comparison was based on Pearson correlation coefficients between predicted and true gene expression patterns. For three representative genes (RPS25, TPT1, and MS4A1), SpaVGN demonstrated higher prediction accuracy, with PCCs of 0.9688, 0.9639 and 0.9419, respectively, compared to all other methods (Fig 2a). These high-performance genes were selected from the top 3 based on median PCC rankings across all models.
(a). Pearson correlation coefficients (PCC) between predicted and true expression patterns for three representative genes (RPS25, TPT1, and MS4A1) across methods. Genes were selected from the top 3 genes with highest median PCC shared by all methods. (b). Violin plots showing gene-wise PCC distributions for all predicted genes in melanoma (left) and mouse brain (right) datasets.
Gene-wise PCC analysis further confirmed that SpaVGN maintained robust performance across the entire transcriptome (Fig 2b). In the melanoma tissue dataset, SpaVGN achieved a median PCC of 0.6090, significantly outperforming stEnTrans (0.5778), DIST (0.4968) and traditional interpolation methods Linear (0.4210), Cubic (0.3917), Nearest Neighbor (0.3187), and NEDI (0.3011). In the sagittal posterior mouse brain dataset, although all methods exhibited improved performance, SpaVGN maintained its leading position with a median PCC of 0.6816. The performance hierarchy remained consistent, with stEnTrans (0.5947), DIST (0.6293) and interpolation methods Linear (0.6674) and Cubic (0.6487) following closely, while Nearest Neighbor (0.5938) and NEDI (0.5077) again demonstrated relatively lower accuracy. These results collectively demonstrate that SpaVGN provides more accurate spatial gene expression predictions across diverse tissue types and gene categories compared to existing methods.
3.2. SpaVGN performance in melanoma ST dataset
The histological image of melanoma tissue sections (Fig 3a) reveals three main components: melanoma, stroma, and lymphoid tissue [41]. These regions exhibit distinct morphological differences, providing an anatomical reference for subsequent analysis. To evaluate the effectiveness of gene expression imputation, we compared the spatial distribution of two representative genes, CD37 and DLL3, before and after imputation (Figs 3b–c). Prior to imputation, the expression patterns of these genes exhibited sparsity and localized absence. After imputation, the spatial continuity was significantly enhanced, and gene expression signals were noticeably restored, indicating that the imputation strategy effectively fills in missing regions and improves data completeness.
(a). Microscopic image of melanoma tissue sections showing three main tissue types: melanoma (black arrows), stroma (red arrows), and lymphoid tissue (green arrows). Scale bar: 100 μm. (b-c). Spatial expression patterns of representative genes (CD37 and DLL3) before and after imputation. Color scale indicates normalized expression levels. (d). Performance comparison of five computational methods in tissue region segmentation. Color-coded regions correspond to different tissue domains. (e). Uniform Manifold Approximation and Projection (UMAP) visualization of cell distributions generated by five different algorithms. (f). Spatial trajectory analysis results from five methods, showing inferred developmental paths between tissue regions. Arrows indicate trajectory directions.
Further, we evaluated the performance of six computational methods for spatial domain identification: SpaVGN, SEDR, SCANPY, PAST, STAGATE and MuCoST (Fig 3d). These methods exhibited varying performances in tissue region partitioning, with SpaVGN showing the highest agreement with histological annotations, indicating its ability to better preserve the spatial structural information of the tissue. To further validate the clustering consistency, we conducted UMAP-based dimensionality reduction analysis (Fig 3e). The results indicated that SpaVGN achieved clearer spatial clustering and better separation of tissue regions while maintaining biological relevance, whereas the other methods exhibited some degree of mixing between different regions. Finally, spatial trajectory inference analysis was conducted to explore the spatial relationships among different tissue regions (Fig 3f). The results demonstrated that SpaVGN and STAGATE more accurately reconstructed spatial trajectories, revealing underlying biological hierarchies, while the inferred trajectories from other methods were more scattered or lacked clear hierarchical structure.
3.3. SpaVGN performance in sagittal posterior mouse brain 10x Visium dataset
This study conducted a comprehensive analysis of spatial transcriptomics data from mouse brain tissue to evaluate the performance of various algorithms in identifying spatial domains. Based on the tissue structure observed under microscopy, regions such as the Isocortex, Olfactory Bulb (OLF), Basal Ganglia (BS), Hippocampal Formation (HPF), Cerebellum (CB), and Ventral Striatum (VS) were annotated [42], providing a reference framework for subsequent analysis (Fig 4a). Due to the lack of spot-level annotations, the Silhouette Score and Davies-Bouldin Score were used to evaluate clustering performance. The results showed that SpaVGN outperformed other methods in both Silhouette Score (0.43) and Davies-Bouldin Score (0.86), compared to SEDR (SC: 0.23, DB: 2.27), PAST (SC: 0.29, DB: 1.58), STAGATE (SC: 0.21, DB: 1.36), and MuCoST (SC: 0.12, DB: 2.20), indicating that SpaVGN produces more compact and well-separated clusters (Fig 4b). Due to its lack of latent variable output, Scanpy could not be evaluated using these metrics.
(a). Microscopic image of the tissue section and its annotated regions, showing the distribution of different anatomical areas. (b). Performance comparison of different methods in tissue region segmentation. The left panel shows the Silhouette Score, and the right panel shows the Davies-Bouldin Score. (c). Visualization of spatial domain identification results by different methods. The clustering results of each method are represented by different colors. (d). UMAP analysis results of different methods. (e). Partition-based Graph Abstraction (PAGA) analysis results of different methods.
Furthermore, SpaVGN showed high precision in identifying subregions of the Hippocampal Formation, such as the Cornu Ammonis and Dentate Gyrus [43], demonstrating its fine-grained capability in spatial domain partitioning (Fig 4c). UMAP analysis of the low-dimensional embeddings generated by each method showed that SpaVGN better preserves the structure of spatial domains, more effectively separating distinct tissue regions into tight, compact clusters with clear domain boundaries, indicating its advantage in maintaining spatial structural integrity (Fig 4d). The PAGA network analysis supported this observation: SpaVGN constructed a clearer and sparser spatial domain connectivity graph that accurately reflected biologically plausible relationships between spatial domains, with node distributions closely matching the annotated tissue regions (Fig 4e). In contrast, other methods exhibited certain limitations. SEDR and Scanpy produced noisier connectivity graphs, while PAST, STAGATE and MuCoST generated overly sparse structures that failed to fully represent the complex spatial relationships among cell types.
4. Discussion
In this study, we present SpaVGN, a hybrid deep learning framework that integrates convolutional neural networks, vision transformer, and graph neural networks to address two critical limitations of spatial transcriptomics technologies: low resolution and pervasive data sparsity. Extensive validation on melanoma and mouse brain datasets demonstrates that SpaVGN achieves superior performance in both gene expression imputation and spatial domain identification, with notably high accuracy in transcriptomic reconstruction (melanoma median PCC up to 0.6090; mouse brain median PCC up to 0.6816).
In comparison to state-of-the-art methods such as Scanpy, SEDR, PAST, STAGATE and MuCoST, SpaVGN markedly enhances the preservation of tissue morphology and spatial continuity. Specifically, in the melanoma tissue dataset, it effectively restored missing expression signals and accurately delineated spatial domains that aligned well with histopathological annotations. In the mouse brain dataset, SpaVGN captured fine-grained structures including Cornu Ammonis and Dentate Gyrus, surpassing existing methods in terms of cluster compactness and inter-cluster separation (SC score: 0.43; DB score: 0.86). These results highlight SpaVGN’s ability to resolve subtle spatial patterns often obscured by data sparsity or noise in traditional approaches.
Furthermore, in UMAP visualizations, SpaVGN separated distinct tissue regions more clearly than competing methods, producing compact, biologically meaningful low-dimensional embeddings with well-defined tissue domain boundaries, underscoring its advantage in maintaining spatial structural integrity. Similarly, in PAGA network analysis, SpaVGN constructed a sparser and more interpretable spatial domain connectivity graph that accurately reflected biologically plausible relationships among spatial domains and showed high consistency with tissue-annotated regions. By contrast, SEDR and Scanpy produced networks with noisier connections, whereas networks generated by PAST, STAGATE and MuCoST were excessively sparse, limiting their ability to represent the complex spatial relationships between cell types.
The success of SpaVGN can be attributed to its architectural design, which effectively captures both local microenvironmental signals and global tissue-level topological structures. The CNN module excels at extracting localized expression patterns, while the ViT component introduces global attention mechanisms to enhance contextual awareness. Furthermore, the GNN module enables SpaVGN to respect the spatial adjacency of tissue spots, a critical factor for maintaining biological plausibility in downstream tasks such as spatial domain segmentation, UMAP embedding, and trajectory inference. The synergistic integration of these complementary components allows SpaVGN to model complex spatial dependencies and biological heterogeneity more comprehensively than unimodal approaches. Nevertheless, SpaVGN still has limitations requiring further investigation. For instance, while the current framework utilizes 2D spatial relationships, future work could extend SpaVGN to 3D or temporal ST datasets to model higher-dimensional tissue dynamics. Moreover, integrating H&E-stained sections or single-cell RNA-seq data may further improve resolution, interpretability, and cross-modal alignment.
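The role of the graph component can be illustrated with a minimal sketch of the propagation step that underlies graph-convolutional models on spot grids: build a k-nearest-neighbour adjacency from spot coordinates, symmetrically normalize it, and repeatedly average each spot with its neighbours. This is a didactic NumPy toy under assumed parameters (k = 4, two layers), not the SpaVGN implementation, which additionally learns weights and fuses CNN/ViT features.

```python
import numpy as np

def normalized_adjacency(coords, k=4):
    """Symmetric-normalized adjacency (with self-loops) from a
    k-nearest-neighbour graph over spot coordinates."""
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    A = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:  # skip self (distance 0)
            A[i, j] = A[j, i] = 1.0
    A += np.eye(n)                           # self-loops
    D = np.diag(A.sum(1) ** -0.5)
    return D @ A @ D

def graph_smooth(expr, A_hat, layers=2):
    """Repeatedly average each spot with its spatial neighbours,
    the core propagation step of graph-based imputation."""
    for _ in range(layers):
        expr = A_hat @ expr
    return expr

# 5x5 grid of spots with Poisson counts and simulated dropout
coords = np.array([[i, j] for i in range(5) for j in range(5)], float)
rng = np.random.default_rng(0)
expr = rng.poisson(3, size=(25, 10)).astype(float)
expr[rng.random(expr.shape) < 0.3] = 0.0   # zero out ~30% of entries
smoothed = graph_smooth(expr, normalized_adjacency(coords))
print(smoothed.shape)
```

Because each propagation step borrows signal from spatially adjacent spots, dropout zeros are filled with neighbourhood-consistent values, which is why respecting spot adjacency preserves biological plausibility downstream.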
In summary, SpaVGN represents a promising solution for overcoming the limitations of current spatial transcriptomics technologies. Its innovative approach and superior performance underscore its potential as a powerful tool for advancing our understanding of the intricate spatial organization and dynamics of gene expression in biological tissues.
5. Code availability
The code for SpaVGN can be obtained at https://github.com/BIOQM/SpaVGN/tree/master.
Supporting information
S1 File. Datasets for SpaVGN.
The datasets underlying the findings of this study are available from the Figshare repository at: https://doi.org/10.6084/m9.figshare.29374538.
https://doi.org/10.1371/journal.pone.0329122.s001
(PDF)
References
- 1. Rao A, Barkley D, França GS, Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596(7871):211–20. pmid:34381231
- 2. Vickovic S, Eraslan G, Salmén F, Klughammer J, Stenbeck L, Schapiro D, et al. High-definition spatial transcriptomics for in situ tissue profiling. Nat Methods. 2019;16(10):987–90. pmid:31501547
- 3. Miller BF, Huang FY, Atta L, Sahoo A, Fan J. Reference-free cell type deconvolution of multi-cellular pixel-resolution spatially resolved transcriptomics data. Nat Commun. 2022;13(1). doi:10.1038/s41467-022-30033-z pmid:35487922
- 4. Zhang M, Eichhorn SW, Zingg B, Yao Z, Cotter K, Zeng H, et al. Spatially resolved cell atlas of the mouse primary motor cortex by MERFISH. Nature. 2021;598(7879):137–43. pmid:34616063
- 5. Shah S, Takei Y, Zhou W, Lubeck E, Yun J, Eng C-HL, et al. Dynamics and Spatial Genomics of the Nascent Transcriptome by Intron seqFISH. Cell. 2018;174(2):363-376.e16. pmid:29887381
- 6. Codeluppi S, Borm LE, Zeisel A, La Manno G, van Lunteren JA, Svensson CI, et al. Spatial organization of the somatosensory cortex revealed by osmFISH. Nat Methods. 2018;15(11):932–5. pmid:30377364
- 7. Wang X, He Y, Zhang Q, Ren X, Zhang Z. Direct Comparative Analyses of 10X Genomics Chromium and Smart-seq2. Genomics Proteomics Bioinformatics. 2021;19(2):253–66. pmid:33662621
- 8. Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat Biotechnol. 2021;39(3):313–9. pmid:33288904
- 9. Chen A, Liao S, Cheng M, Ma K, Wu L, Lai Y, et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell. 2022;185(10):1777-1792.e21. pmid:35512705
- 10. Liu Y, Yang M, Deng Y, Su G, Enninful A, Guo CC, et al. High-Spatial-Resolution Multi-Omics Sequencing via Deterministic Barcoding in Tissue. Cell. 2020;183(6):1665-1681.e18. pmid:33188776
- 11. Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348(6233):aaa6090. pmid:25858977
- 12. Longo SK, Guo MG, Ji AL, Khavari PA. Integrating single-cell and spatial transcriptomics to elucidate intercellular tissue dynamics. Nat Rev Genet. 2021;22(10):627–44. pmid:34145435
- 13. Wang S. Resolving the bone - optimizing decalcification in spatial transcriptomics and molecular pathology. J Histotechnol. 2025;48(1):68–77. pmid:39723974
- 14. Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019;20(11):631–56. pmid:31341269
- 15. Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82. pmid:27365449
- 16. Hu J, Coleman K, Zhang D, Lee EB, Kadara H, Wang L, et al. Deciphering tumor ecosystems at super resolution from spatial transcriptomics with TESLA. Cell Syst. 2023;14(5):404-417.e4. pmid:37164011
- 17. Bergenstråhle L, He B, Bergenstråhle J, Abalo X, Mirzazadeh R, Thrane K, et al. Super-resolved spatial transcriptomics by deep data fusion. Nat Biotechnol. 2022;40(4):476–9. pmid:34845373
- 18. Zhang D, Schroeder A, Yan H, Yang H, Hu J, Lee MYY, et al. Inferring super-resolution tissue architecture by integrating spatial transcriptomics with histology. Nat Biotechnol. 2024;42(9):1372–7. pmid:38168986
- 19. Li B, Bao F, Hou Y, Li F, Li H, Deng Y, et al. Tissue characterization at an enhanced resolution across spatial omics platforms with deep generative model. Nat Commun. 2024;15(1):6541. pmid:39095360
- 20. Xiao X, Kong Y, Li R, Wang Z, Lu H. Transformer with convolution and graph-node co-embedding: An accurate and interpretable vision backbone for predicting gene expressions from local histopathological image. Med Image Anal. 2024;91:103040. pmid:38007979
- 21. Jia Y, Liu J, Chen L, Zhao T, Wang Y. THItoGene: a deep learning method for predicting spatial transcriptomics from histological images. Brief Bioinform. 2023;25(1):bbad464. pmid:38145948
- 22. Min W, Shi Z, Zhang J, Wan J, Wang C. Multimodal contrastive learning for spatial gene expression prediction using histology images. Brief Bioinform. 2024;25(6):bbae551. pmid:39471412
- 23. Zhao Y, Wang K, Hu G. DIST: spatial transcriptomics enhancement using deep learning. Brief Bioinform. 2023;24(2):bbad013. pmid:36653906
- 24. Xue S, Zhu F, Wang C, Min W, editors. stEnTrans: Transformer-Based Deep Learning for Spatial Transcriptomics Enhancement. Bioinformatics Research and Applications. 2024;63–75. doi:10.1007/978-981-97-5128-0_6
- 25. Sinaga KP, Yang M-S. Unsupervised K-Means Clustering Algorithm. IEEE Access. 2020;8:80716–27.
- 26. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233. pmid:30914743
- 27. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1):15. pmid:29409532
- 28. Hu J, Li X, Coleman K, Schroeder A, Ma N, Irwin DJ, et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods. 2021;18(11):1342–51. pmid:34711970
- 29. Lei L, Han K, Wang Z, Shi C, Wang Z, Dai R, et al. Attention-guided variational graph autoencoders reveal heterogeneity in spatial transcriptomics. Brief Bioinform. 2024;25(3):bbae173. pmid:38627939
- 30. Zhao E, Stone MR, Ren X, Guenthoer J, Smythe KS, Pulliam T, et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat Biotechnol. 2021;39(11):1375–84. pmid:34083791
- 31. Xu H, Fu H, Long Y, Ang KS, Sethi R, Chong K, et al. Unsupervised spatially embedded deep representation of spatial transcriptomics. Genome Med. 2024;16(1):12. pmid:38217035
- 32. Shan Y, Zhang Q, Guo W, Wu Y, Miao Y, Xin H, et al. TIST: Transcriptome and Histopathological Image Integrative Analysis for Spatial Transcriptomics. Genomics Proteomics Bioinformatics. 2022;20(5):974–88. pmid:36549467
- 33. Li Z, Chen X, Zhang X, Jiang R, Chen S. Latent feature extraction with a prior-based self-attention framework for spatial transcriptomics. Genome Res. 2023;33(10):1757–73. pmid:37903634
- 34. Xu C, Jin X, Wei S, Wang P, Luo M, Xu Z, et al. DeepST: identifying spatial domains in spatial transcriptomics by deep learning. Nucleic Acids Res. 2022;50(22):e131. pmid:36250636
- 35. Zhang L, Liang S, Wan L. A multi-view graph contrastive learning framework for deciphering spatially resolved transcriptomics data. Brief Bioinform. 2024;25(4):bbae255. pmid:38801701
- 36. Wang R, Dai Q, Duan X, Zou Q. stHGC: a self-supervised graph representation learning for spatial domain recognition with hybrid graph and spatial regularization. Brief Bioinform. 2024;26(1):bbae666. pmid:39710435
- 37. Liang Y, Shi G, Cai R, Yuan Y, Xie Z, Yu L, et al. PROST: quantitative identification of spatially variable genes and domain detection in spatial transcriptomics. Nat Commun. 2024;15(1):600. pmid:38238417
- 38. Shi X, Zhu J, Long Y, Liang C. Identifying spatial domains of spatially resolved transcriptomics via multi-view graph convolutional networks. Brief Bioinform. 2023;24(5):bbad278. pmid:37544658
- 39. Wang B, Luo J, Liu Y, Shi W, Xiong Z, Shen C, et al. Spatial-MGCN: a novel multi-view graph convolutional network for identifying spatial domains with attention mechanism. Brief Bioinform. 2023;24(5):bbad262. pmid:37466210
- 40. Zhu P, Shu H, Wang Y, Wang X, Zhao Y, Hu J, et al. MAEST: accurately spatial domain detection in spatial transcriptomics with graph masked autoencoder. Brief Bioinform. 2025;26(2):bbaf086. pmid:40052440
- 41. Wang L, Maletic-Savatic M, Liu Z. Region-specific denoising identifies spatial co-expression patterns and intra-tissue heterogeneity in spatially resolved transcriptomics data. Nat Commun. 2022;13(1):6912. pmid:36376296
- 42. Wang Y, Liu Z, Ma X. MuCST: restoring and integrating heterogeneous morphology images and spatial transcriptomics data with contrastive learning. Genome Med. 2025;17(1):21. pmid:40082941
- 43. Wang Y, Liu Z, Ma X. MNMST: topology of cell networks leverages identification of spatial domains from spatial transcriptomics data. Genome Biol. 2024;25(1):133. pmid:38783355