
SMASH: Scalable Method for Analyzing Spatial Heterogeneity of genes in spatial transcriptomics data

  • Souvik Seal,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    sealso@musc.edu

    Affiliation Department of Public Health Sciences, School of Medicine, Medical University of South Carolina, Charleston, South Carolina, United States of America

  • Benjamin G. Bitler,

    Roles Conceptualization, Resources, Writing – review & editing

    Affiliation Department of Obstetrics and Gynecology, School of Medicine, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado, United States of America

  • Debashis Ghosh

    Roles Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado, United States of America

Abstract

In high-throughput spatial transcriptomics (ST) studies, it is of great interest to identify the genes whose level of expression in a tissue covaries with the spatial location of cells/spots. Such genes, also known as spatially variable genes (SVGs), can be crucial to the biological understanding of both structural and functional characteristics of complex tissues. Existing methods for detecting SVGs either suffer from huge computational demand or significantly lack statistical power. We propose a non-parametric method termed SMASH that achieves a balance between the above two problems. We compare SMASH with other existing methods in varying simulation scenarios demonstrating its superior statistical power and robustness. We apply the method to four ST datasets from different platforms uncovering interesting biological insights.

Author summary

In recent years, spatial transcriptomics (ST) has become increasingly popular to study the expression profile of genes across different spatial locations of a tissue. Many of the genes exhibit spatially varying expression patterns making them immensely valuable for understanding the structural and functional properties of the tissue. The proposed method termed SMASH enables powerful and scalable detection of such genes in high-dimensional ST datasets.

Introduction

Spatial transcriptomics (ST) performs high-throughput measurement of transcriptomes in complex biological tissues at single-cell or subcellular resolution, preserving spatial information [1–9]. In the past decade, the rapid development of ST technologies has facilitated exciting discoveries in different domains, including neuroscience [10–12] and cancer research [13–15]. The popular ST technologies and corresponding platforms differ in terms of the procedure used to record spatial profiles, such as region of interest (ROI) selection [16, 17], next-generation sequencing (NGS) with spatial barcoding [18–20], and single-molecule fluorescence in situ hybridization (smFISH) [21–23]. Two crucial aspects that a researcher considers before choosing a suitable platform are a) the capability of transcriptome-wide profiling, and b) the granularity of spatial resolution. For example, the majority of the smFISH-based technologies excel at capturing single-cell level resolution but lack the capability of transcriptome-wide profiling. On the other hand, ROI or NGS-based technologies can be used for transcriptome-wide profiling but at a significantly lower spatial resolution, such as 55 μm for the most popular and commercialized ST platform Visium (10X Genomics). We refer to Moses et al. (2022) [24] for a detailed discussion on these technologies. Deriving biological insights from datasets obtained using such platforms with either huge spatial or genomic profiles, or both, not only poses numerous statistical challenges but also requires maximum computational efficiency [25].

A critical step in the analysis of ST datasets is to identify the genes whose level of expression co-varies with the spatial locations across the tissue. These genes, often referred to as spatially variable genes (SVGs), can be used in downstream analyses, such as identifying potential markers for biological processes and defining areas in the tissue that dictate cellular differentiation and function [26–29]. For example, Wang et al. (2020) [30] analyzed an ST dataset on the tumor microenvironment (TME) of three tissue sections from a prostate cancer subject [31]. In every tissue section, a unique set of spatially variable metabolic genes was identified, which could arguably be used to guide targeted tissue-specific therapy. A simplistic approach for detecting SVGs could be to identify spatially located layers or cell types (if any) based on either a priori biological knowledge or using popular software, such as RCTD [32] and Seurat [33], with the transcriptional profiles, and then checking which genes exhibit highly enriched expression in a particular spatial layer or cell type. However, such an approach would achieve satisfactory performance only if the layers or cell types are spatially well-separated, and would always be sensitive to the quality of the layer or cell type-identification step [34]. In recent years, more sophisticated methods have been developed to identify SVGs, a systematic overview of some of which can be found in Li et al. (2021) [35]. The methods can be broadly classified into three types: a) based on statistical modeling, b) based on machine learning or neural networks, and c) based on graphical networks or spatial grids. Some of the notable methods of each type are, type (a): Trendsceek [36], SpatialDE [34], SPARK [37], SPARK-X [38], Boost-GP [39], and nnSVG [40], type (b): SPADE [41], SOMDE [33], and SpaGCN [42], and type (c): HMRF [43], MERINGUE [44], Binspect-Giotto [45], Boost-MI [46], ScGCO [47], and SpaGene [48]. 
We focus on methods of types (a) and (c) in this manuscript.

The statistical power of the methods greatly varies based on gene expression patterns and the spatial structure of ST datasets. The methods encounter different levels of computational complexity based on two quantities, N and K, denoting the numbers of cells/spots and genes, respectively. SpatialDE [34] is one of the earliest methods of type (a). It employs a Gaussian process (GP) regression model [49] with kernel-based covariance matrices [50] of multiple types, such as linear, Gaussian, and cosine, computed using the distance between the spatial coordinates of the cells. The model decomposes the total variability of a gene's expression into two components, spatial and error variance. A significantly large value of the spatial variance would imply that the gene is spatially variable. Borrowing an efficient estimation algorithm from the statistical genetics literature [51], SpatialDE manages to estimate the variance components with a reasonable degree of computational efficiency, requiring O(N^3 + N^2 K) floating point operations (FLOPS). A newer method named SPARK [37] extends the framework of SpatialDE by considering a generalized linear spatial model (GLSM) [52] with a Poisson distribution, which is argued to be better suited for modeling the raw count data from the ST platforms directly. However, the penalized quasi-likelihood (PQL) approach [53] used for parameter estimation in SPARK is extremely computationally demanding with a complexity of O(N^3 K), making it unusable for a transcriptome-wide analysis when N is moderately large (N > 3,000). To this end, a non-parametric highly scalable method named SPARK-X [38] has been recently developed requiring just linear complexity w.r.t. N. It is based on the robust covariance testing framework [54] that compares the linear kernel-based covariance matrices of the gene expression and the spatial coordinates. 
However, using a linear kernel makes SPARK-X equivalent to fitting a multiple linear regression model [55] with the gene expression as the dependent variable and the spatial coordinates (or some transformation of these) as the predictors, and testing if the fixed effect coefficients differ from zero. Thus, it is only capable of detecting spatial dependencies or patterns that manifest linearly in the mean or expected value of the gene expression, also known as first-order dependencies, and drastically loses power in complex scenarios, as shown later. Zhu et al. (2021) [38] have partially acknowledged this issue, with their primary focus being computational scalability.
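To illustrate why a linear kernel captures only first-order dependence, the following minimal Python sketch computes a linear-kernel covariance statistic for a gene with a linear spatial trend and for a gene with a periodic pattern. The function name `linear_kernel_stat`, the simulated coordinates, and the noise level are illustrative assumptions; this is not the SPARK-X implementation, which also considers transformations of the coordinates.

```python
import numpy as np

def linear_kernel_stat(y, S):
    """tr(K_y K_S) / N^2 with centred linear kernels K_y = y_c y_c^T, K_S = S_c S_c^T.

    The trace collapses to ||S_c^T y_c||^2, so the cost is O(N d), linear in N.
    """
    yc = y - y.mean()
    Sc = S - S.mean(axis=0)
    return float(np.sum((Sc.T @ yc) ** 2)) / len(y) ** 2

rng = np.random.default_rng(0)
N = 2000
S = rng.uniform(0.0, 1.0, size=(N, 2))   # hypothetical spot coordinates
noise = 0.1 * rng.standard_normal(N)

# Gene whose mean changes linearly with the x-coordinate (first-order pattern).
y_trend = S[:, 0] + noise
# Gene with a periodic pattern (period 0.25): structured, but uncorrelated with x.
y_periodic = np.cos(2 * np.pi * S[:, 0] / 0.25) + noise

stat_trend = linear_kernel_stat(y_trend, S)
stat_periodic = linear_kernel_stat(y_periodic, S)
print(f"linear trend: {stat_trend:.2e}, periodic: {stat_periodic:.2e}")
```

The periodic gene is clearly spatially structured, yet its statistic is far smaller than that of the linear trend, because its covariance with the raw coordinates is close to zero.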

On the other hand, a popular method of type (c), MERINGUE [44], considers spatial autocorrelation and cross-correlation based on spatial neighborhood graphs to identify SVGs. Improving hugely on the complexity of MERINGUE, another model-free method named SpaGene [48] has been recently developed. It constructs a spatial network between cells/spots using the k-nearest neighbors approach, and then for each gene, extracts the subnetwork whose nodes have high gene expression. Then, it compares the observed degree distribution of the subnetwork to a distribution from a fully connected network using the earth mover’s distance [56]. It considers a permutation test [57] to obtain the p-value for every gene. SpaGene is highly comparable to SPARK-X w.r.t. computational complexity and thus applicable to ST datasets with large N. However, the method is harder to interpret than the methods of type (a), cannot readily accommodate additional covariates, and also lacks power in various scenarios (see Simulations section).
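The steps described above (kNN graph, high-expression subnetwork, degree distribution, earth mover's distance, permutation test) can be sketched roughly as follows. This is a simplified illustration and not SpaGene's actual implementation: the brute-force neighbor search, the top-20% expression cutoff, the choice of reference distribution (every node having all k neighbors inside the subnetwork), and all function names are assumptions made for this example.

```python
import numpy as np

def knn_indices(S, k):
    """Indices of the k nearest neighbours of every cell (brute force, small N only)."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)          # a cell is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def emd_1d(p, q):
    """Earth mover's distance between histograms on the same unit-spaced support."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def degree_emd(high, nbrs):
    """EMD between the subnetwork degree distribution and a point mass at k."""
    k = nbrs.shape[1]
    deg = high[nbrs[high]].sum(axis=1)    # neighbours that are also high-expression
    observed = np.bincount(deg, minlength=k + 1) / deg.size
    reference = np.zeros(k + 1)
    reference[k] = 1.0                    # assumed "fully connected" reference
    return emd_1d(observed, reference)

def spagene_like_pvalue(y, S, k=8, top_frac=0.2, n_perm=200, seed=0):
    """Permutation p-value: a small EMD means high-expression cells cluster in space."""
    rng = np.random.default_rng(seed)
    nbrs = knn_indices(S, k)
    high = y >= np.quantile(y, 1.0 - top_frac)
    observed = degree_emd(high, nbrs)
    null = np.array([degree_emd(rng.permutation(high), nbrs) for _ in range(n_perm)])
    return (1 + np.sum(null <= observed)) / (1 + n_perm)

rng = np.random.default_rng(2)
N = 300
S = rng.uniform(0.0, 1.0, size=(N, 2))                  # hypothetical coordinates
dist_to_centre = np.linalg.norm(S - 0.5, axis=1)
y_hotspot = (dist_to_centre < 0.25).astype(float) + 0.1 * rng.standard_normal(N)
y_random = rng.standard_normal(N)                       # no spatial structure

p_hotspot = spagene_like_pvalue(y_hotspot, S)
p_random = spagene_like_pvalue(y_random, S)
print(f"hotspot gene: p = {p_hotspot:.4f}, random gene: p = {p_random:.4f}")
```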

We propose a non-parametric method, named SMASH, which achieves higher statistical power than both SPARK-X and SpaGene, while remaining computationally tractable. It augments the idea of SPARK-X in its use of the Hilbert-Schmidt independence criterion (HSIC) or robust covariance testing framework [54, 58] coupled with more general kernel-based spatial covariance matrices. With a computational complexity quadratic in N, SMASH sacrifices some degree of computational efficiency in favor of significantly higher detection power than both SPARK-X and SpaGene. However, it is worth highlighting that SMASH is notably faster than other type (a) methods, such as SpatialDE and SPARK, and can thus be thought of as a balanced alternative, fusing high detection power with a moderate degree of scalability. In varying simulation scenarios, we demonstrate that SMASH achieves highly consistent and superior performance as compared to the methods SPARK-X and SpaGene. Finally, our analysis of four large ST datasets from platforms like Slide-seq V2, Visium, and MERFISH using these three methods not only reveals exciting biological insights but also demonstrates SMASH’s capability of detecting SVGs that would otherwise be missed by either of the other two methods. A Python-based software implementation of SMASH is available at https://github.com/sealx017/SMASH-package, which returns the lists of SVGs detected by both SMASH and SPARK-X, allowing users to investigate the overlap between them.
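As a rough illustration of the general idea, an HSIC-type statistic paired with a non-linear spatial kernel, the sketch below uses a Gaussian kernel on the coordinates and a permutation p-value. It is not the SMASH package's code: the kernel choice, the lengthscale, the permutation-based p-value, and the function names are all assumptions made for this example. Building the N x N kernel matrix is what makes the cost quadratic in N.

```python
import numpy as np

def gaussian_kernel(S, lengthscale):
    """N x N Gaussian kernel on pairwise squared distances (O(N^2) time and memory)."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def hsic_stat(y, K_s):
    """HSIC-type statistic tr(K_y H K_s H) / N^2 with K_y = y y^T.

    With H the centring matrix, this equals y_c^T K_s y_c / N^2.
    """
    yc = y - y.mean()
    return float(yc @ K_s @ yc) / len(y) ** 2

def permutation_pvalue(y, K_s, n_perm=200, seed=0):
    """Shuffling y across cells breaks any dependence on location."""
    rng = np.random.default_rng(seed)
    observed = hsic_stat(y, K_s)
    null = np.array([hsic_stat(rng.permutation(y), K_s) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
N = 400
S = rng.uniform(0.0, 1.0, size=(N, 2))   # hypothetical spot coordinates
# Periodic gene: a pattern with no linear relationship to the coordinates.
y_periodic = np.cos(2 * np.pi * S[:, 0] / 0.25) + 0.1 * rng.standard_normal(N)

K_s = gaussian_kernel(S, lengthscale=0.05)
p_value = permutation_pvalue(y_periodic, K_s)
print(f"permutation p-value: {p_value:.4f}")
```

The design point is that the Gaussian kernel rewards similarity between nearby cells regardless of the direction of any trend, so patterns that a linear kernel cannot see still produce a large statistic.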

Results

Simulations

We evaluated the performance of SMASH, SPARK-X, and SpaGene in three different simulation studies. We omitted SpatialDE and SPARK from the power comparison for two reasons: a) their high computational requirements and b) the fact that these two methods have already been thoroughly studied in previous works [38, 48]. In simulation setup (1), we followed the procedure described in the SPARK-X manuscript [38]. In setups (2) and (3), we considered the Gaussian process (GP)-based spatial regression model from the SpatialDE manuscript [34], respectively with the Gaussian and cosine kernel-based covariance functions (see Eq (1)). In all the setups, three values of the number of cells (N) were considered, N = 1000, 5000, and 10,000. The spatial coordinates of the cells were simulated first, followed by the expression levels of K (500 or 1000) genes with varying levels of dependence. In setup (1), the expression levels were simulated using a negative binomial distribution, while in setups (2) and (3), the expression levels were simulated using a multivariate normal distribution. In all the setups, distinct spatial patterns were ensured to be present in the expression levels. Further details regarding the simulation setups are provided at the end of the Methods section. Figs 1, 2 and 3 respectively correspond to the three simulation setups, in which we display the simulated spatial patterns and the statistical power of the three methods for different parameter combinations.
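A minimal sketch of how expression could be drawn from a GP with Gaussian or cosine covariance, in the spirit of setups (2) and (3), is given below. The exact parameterization used in the paper (Eq (1) and the effect size h) is not reproduced here; the variance split `spatial_var`/`noise_var`, the kernel formulas, and the function names are assumptions for illustration only.

```python
import numpy as np

def pairwise_dist(S):
    """Euclidean distance between every pair of cells."""
    diff = S[:, None, :] - S[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def gaussian_cov(S, lengthscale):
    """Gaussian (squared-exponential) covariance with lengthscale l."""
    return np.exp(-pairwise_dist(S) ** 2 / (2.0 * lengthscale ** 2))

def cosine_cov(S, period):
    """Cosine covariance with period p."""
    return np.cos(2.0 * np.pi * pairwise_dist(S) / period)

def sample_gene(K_spatial, spatial_var, noise_var, rng):
    """Draw y ~ N(0, spatial_var * K_spatial + noise_var * I).

    Negative eigenvalues are clipped to zero because the cosine matrix on
    Euclidean distances is not guaranteed to be positive semi-definite.
    """
    N = K_spatial.shape[0]
    cov = spatial_var * K_spatial + noise_var * np.eye(N)
    w, V = np.linalg.eigh(cov)
    w = np.clip(w, 0.0, None)
    return V @ (np.sqrt(w) * rng.standard_normal(N))

rng = np.random.default_rng(0)
N = 300
S = rng.uniform(0.0, 1.0, size=(N, 2))   # simulated cell coordinates

y_gauss = sample_gene(gaussian_cov(S, lengthscale=0.2), 0.8, 0.2, rng)
y_cos = sample_gene(cosine_cov(S, period=0.5), 0.8, 0.2, rng)
print(y_gauss[:3], y_cos[:3])
```

In both cases the mean is constant across the tissue, so the spatial signal lives entirely in the covariance, which is the "higher-order" dependence discussed in the results that follow.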

Fig 1. Simulation following the SPARK-X manuscript.

A) Four spatial expression patterns that the genes were assumed to follow. B) Statistical power plots of the three methods, SMASH, SPARK-X, and SpaGene, under varying values of N and fold-size, for K = 500 genes at a level of α = 0.05. The results were averaged over five replications.

https://doi.org/10.1371/journal.pgen.1010983.g001

Fig 2. Simulation using Gaussian process-based regression model with the Gaussian covariance.

A) Four spatial expression patterns that were generated using Gaussian covariance matrices with four different values of the lengthscale l. B) Statistical power plots of the three methods under varying values of N and effect-size (h) for K = 1000 genes at a level of α = 0.05. The results were averaged over five replications.

https://doi.org/10.1371/journal.pgen.1010983.g002

Fig 3. Simulation using Gaussian process-based regression model with the cosine covariance.

A) Four spatial expression patterns that were generated using cosine covariance matrices with four different values of the period p. B) Statistical power plots of the three methods under varying values of N and effect-size (h) for K = 1000 genes at a level of α = 0.05. The results were averaged over five replications.

https://doi.org/10.1371/journal.pgen.1010983.g003

In simulation setup (1), SMASH and SPARK-X performed much better than SpaGene for all four spatial patterns, namely streak, reverse streak, hotspot, and reverse hotspot (Fig 1). SpaGene performed particularly poorly for the streak and hotspot patterns. The power of SMASH and SPARK-X steadily increased as N and the fold-change parameter increased. Note that a fold value of 1 implied no spatial association while a larger value indicated higher spatial association. This particular simulation setup favored SPARK-X in the sense that the spatial variability of the expression was of the first order, manifesting entirely through the mean or expectation. Even in this scenario, SMASH managed to achieve similar power.

In simulation setups (2) and (3), the spatial variability of the expression was of higher order, manifesting through the covariance. In setup (2), which involved the Gaussian covariance function, SMASH performed the best followed by SPARK-X and then SpaGene in most cases. SMASH performed the best in setup (3) as well. However, SpaGene achieved better power than SPARK-X here. SPARK-X had almost zero power in many of the cases, especially when the period p was small (p = 0.5, 1), demonstrating its lack of robustness under complicated spatial dependency structures.

We compared the run-time of the methods in simulation setup (2) for varying numbers of cells, N = 1000, 5000, and 10,000 (Table 1). Since the computational complexity of the algorithms mainly differs w.r.t. N and not the number of genes K, we kept K = 1000. As expected, the run-time of SMASH increased almost quadratically w.r.t. N. SPARK-X and SpaGene were both extremely fast owing to their linear complexity w.r.t. N. We also added SpatialDE to this comparison to show how computationally intensive it can be to fit a fully parametric model in such a context. We omitted SPARK entirely as it is much slower than even SpatialDE, with a computational complexity of O(N^3 K).
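As a rough worked example of what these complexity classes imply when N grows tenfold from 1000 to 10,000 (constants and lower-order terms are ignored, so this is an idealization and not a prediction of the measured run-times in Table 1):

```latex
\[
\frac{T_{\text{linear}}(10N)}{T_{\text{linear}}(N)} = 10,
\qquad
\frac{T_{\text{quadratic}}(10N)}{T_{\text{quadratic}}(N)} = 10^{2} = 100,
\qquad
\frac{T_{\text{cubic}}(10N)}{T_{\text{cubic}}(N)} = 10^{3} = 1000 .
\]
```

Here the linear case corresponds to SPARK-X and SpaGene, the quadratic case to SMASH, and the cubic case to the N^3 term in the complexity of SpatialDE.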

Table 1. Computational complexity and run-time comparison.

The table lists the theoretical complexity and run-time (in seconds) of the four methods, SMASH, SPARK-X, SpaGene, and SpatialDE, in a simulation setup with K = 1000 genes and varying number of cells N. The number of spatial coordinates d was equal to 2. *SpaGene constructs multiple kNN graphs and performs permutation tests. We are only listing the complexity of the kNN algorithm.

https://doi.org/10.1371/journal.pgen.1010983.t001

Application to real data

We applied the methods SMASH, SPARK-X, and SpaGene to four datasets: 1) mouse cerebellum data collected using Slide-seq V2 [19, 59], 2) human dorsolateral prefrontal cortex (DLPFC) data collected using Visium [11], 3) small cell carcinoma of the ovary hypercalcemic type (SCCOHT) data collected using Visium [63], and 4) mouse hypothalamus data collected using MERFISH [60, 61]. The datasets have varying numbers of genes and spots/cells.

Mouse cerebellum by Slide-seqV2.

The mouse cerebellum data [19] has 20,117 genes and 11,626 spots. We restricted our focus to the 7,653 genes that are expressed in more than 1% of the spots. The mouse cerebellum is made of four spatial layers: white matter layer (WML), granule layer (GL), Purkinje layer (PL), and molecular layer (ML) [62]. These layers consist of different types of cells. For example, WML contains oligodendrocytes, GL contains granule cells, PL contains Purkinje neurons and Bergmann glia, and ML contains molecular layer interneurons (MLI). These cell types can be inferred based on just the transcriptional profiles using cell clustering software like RCTD [32]. We display the inferred cell types overlaid on the spatial locations in Fig 4. Out of the 7,653 genes, SMASH identified 1173 genes to be spatially variable (adjusted p-value: padjust < 0.05). SPARK-X and SpaGene respectively detected 608 and 518 genes, and the overlaps between the SVGs detected by the three methods are displayed in a Venn diagram (Fig 4). We noted that many of the SVGs detected by SPARK-X and SpaGene were not shared between the two. SMASH, on the other hand, could identify almost all the genes detected by those two methods, especially SPARK-X, while detecting an additional 363 SVGs.
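The two pre- and post-processing steps mentioned above, keeping genes expressed in more than 1% of spots and thresholding adjusted p-values at 0.05, can be sketched as follows. The text does not state which multiple-testing adjustment was used, so the Benjamini-Hochberg procedure below is a stand-in assumption; the toy count matrix and the function names are likewise hypothetical.

```python
import numpy as np

def expressed_in_enough_spots(counts, min_frac=0.01):
    """Boolean mask over genes: True if the gene is non-zero in > min_frac of spots.

    `counts` is a spots x genes matrix.
    """
    frac_expressed = (counts > 0).mean(axis=0)
    return frac_expressed > min_frac

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment, see lead-in)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# Toy count matrix: 200 spots x 3 genes.
counts = np.zeros((200, 3), dtype=int)
counts[:1, 0] = 5     # gene 0 expressed in 1 spot   (0.5%): dropped
counts[:3, 1] = 2     # gene 1 expressed in 3 spots  (1.5%): kept
counts[:50, 2] = 1    # gene 2 expressed in 50 spots (25%):  kept
keep = expressed_in_enough_spots(counts)

adjusted = bh_adjust([0.01, 0.04, 0.03, 0.5])
is_svg = adjusted < 0.05
print(keep, adjusted, is_svg)
```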

Fig 4. Analysis of mouse cerebellum data.

A) Location of the major cell types corresponding to the four spatial layers of the mouse cerebellum. B) Overlap between the detected SVGs by the three methods. C) Enrichment scores of the methods in the four spatial layers.

https://doi.org/10.1371/journal.pgen.1010983.g004

Next, we performed two types of enrichment analysis. First, we compared the performance of the methods in different layers by computing their enrichment scores (ES) following Liu et al. (2022) [48]. It is based on the expectation that the genes which are abundantly expressed in the four spatial layers should be identified and ranked top by the methods. In that regard, we noticed that SPARK-X performed poorly in the PL, whereas SpaGene performed poorly in the WML. SMASH, on the other hand, consistently achieved similar or better performance compared to the other two methods in all four layers. Secondly, we performed functional enrichment analysis of the following four sets of SVGs: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH. The expression patterns of three representative genes of the enriched pathways for each of these four sets of genes are shown in Fig 5. For set (a), top enriched Gene Ontology (GO) terms, such as GO: 0098916 (anterograde trans-synaptic signaling), GO: 0007268 (chemical synaptic transmission), and GO: 0099536 (synaptic signaling), were broadly associated with synaptic regulation. The protein-coding genes Fam107a, Ppp3ca, and Calm1 appeared in these top pathways. Fam107a seems to be expressed in the PL, whereas the other two are expressed in the GL (Fig 5). For set (b), the top GO terms including GO: 0006873 (intracellular monoatomic ion homeostasis), GO: 0030003 (intracellular monoatomic cation homeostasis), and GO: 0098771 (inorganic ion homeostasis) were associated with ion homeostasis. The representative genes Atp1a3 and Thy1 are expressed in the PL while Calm3 is expressed in the GL. 
For set (c), the top pathways including GO: 0006811 (monoatomic ion transport), GO: 0006812 (monoatomic cation transport), and GO: 0098655 (monoatomic cation transmembrane transport) were associated with ion transport. The representative genes Pllp and Efnb3 are expressed in the WML, whereas Cox7a2 is expressed roughly in the GL. For set (d), the top enriched GO terms, such as GO: 0044057 (regulation of system process) and GO: 0050877 (nervous system process), were associated with regulating different types of system processes. The representative genes Gls, Tmem36a, and Coro2b are roughly expressed in the GL.
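The four gene sets (a) to (d) used in the functional enrichment analysis are simple set operations on the three lists of detected SVGs. A small sketch with made-up gene identifiers (`g1`, `g2`, ... are placeholders, not genes from the study; the function and key names are hypothetical):

```python
def partition_svgs(smash, sparkx, spagene):
    """Split SMASH detections by which of the other two methods also found them."""
    smash, sparkx, spagene = set(smash), set(sparkx), set(spagene)
    return {
        "a_all_three": smash & sparkx & spagene,
        "b_smash_and_spagene_not_sparkx": (smash & spagene) - sparkx,
        "c_smash_and_sparkx_not_spagene": (smash & sparkx) - spagene,
        "d_smash_only": smash - sparkx - spagene,
    }

groups = partition_svgs(
    smash=["g1", "g2", "g3", "g4", "g5"],
    sparkx=["g1", "g3", "g6"],
    spagene=["g1", "g2", "g7"],
)
for name, genes in groups.items():
    print(name, sorted(genes))
```

The four sets are disjoint and together cover every gene detected by SMASH; genes found only by SPARK-X or SpaGene (here `g6` and `g7`) fall outside all of them.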

Fig 5. Expression patterns in mouse cerebellum data.

Three representative genes from the detected pathways for the four sets of genes: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH.

https://doi.org/10.1371/journal.pgen.1010983.g005

Human DLPFC by Visium.

The human dorsolateral prefrontal cortex (DLPFC) data [11] has 33,538 genes and 3,639 spots. We focused on the 13,783 genes which are expressed in more than 1% of the spots. Every spot belongs to one of the six manually labeled cortical layers or the white matter layer (WML) (Fig 6). SMASH and SPARK-X identified 10,871 and 10,416 SVGs respectively (padjust < 0.05), whereas SpaGene identified only 2379. The overlaps between the SVGs detected by the three methods are displayed in a Venn diagram (Fig 6). We noted that almost all the genes detected by SpaGene were also detected by both SMASH and SPARK-X. SMASH and SPARK-X detected many additional SVGs. We performed functional enrichment analysis of the two sets of detected genes: a) the common genes identified by all three methods and b) the genes identified by SMASH and SPARK-X but not by SpaGene. For set (a), top enriched GO terms, such as GO: 0099537 (trans-synaptic signaling) and GO: 0099177 (regulation of trans-synaptic signaling), were associated with synaptic signaling. For set (b), top enriched GO terms like GO: 0006397 (mRNA processing) and GO: 0000375 (RNA splicing, via transesterification reactions) were associated with RNA processing. The expression of three representative genes from set (b) is displayed in Fig 6. There seemed to be a gradient spatial pattern of expression for all three genes, which SpaGene failed to detect. Similar to the previous section, we computed the enrichment score (ES) of every method in the seven manually labeled spatial layers. From Fig 6, we noticed that SpaGene performed poorly in terms of ES, especially in Layers 1 and 6. We also performed an additional check as follows. There are three cortical-layer associated SVGs, MOBP, SNAP25, and PCP4, and three blood and immune-related SVGs, HBB, IGKC, and NPY, known to be spatially variable from previous studies [11]. 
We checked how many of these genes appeared in the lists of the top thousand SVGs (in terms of padjust) by the three methods. SMASH and SpaGene respectively ranked five and six of these SVGs, whereas SPARK-X ranked only two cortical-layer associated genes.

Fig 6. Analysis of human DLPFC data.

A) Manually labeled cortical layers (layers 1–6) and white matter layer (WML). B) Overlap between the detected SVGs by the three methods. C) Expression of three representative genes identified only by SMASH and SPARK-X. D) Enrichment scores of the methods in different layers.

https://doi.org/10.1371/journal.pgen.1010983.g006

SCCOHT by Visium.

The small cell carcinoma of the ovary hypercalcemic type (SCCOHT) data [63] has 15,229 genes and 2071 cells. We restricted our focus to the 12,001 genes that are expressed in more than 5% of the cells. Sanders et al. (2022) [63] grouped the cells into twelve clusters based on the expression profile of a selected few genes, using Seurat [33], which we display in Fig 7. SMASH, SPARK-X, and SpaGene respectively detected 9361, 6564, and 6899 SVGs (padjust < 0.05). The overlaps between the SVGs detected by the three methods are displayed in a Venn diagram (Fig 7). SMASH could detect most of the SVGs identified by at least one of the other two methods and an additional 1634 genes. Similar to the analysis of the mouse cerebellum data, we checked if the methods could identify the top genes that show enriched expression in the twelve spatially well-separated clusters found by Sanders et al. (2022). We computed the enrichment scores (ES) of the methods for each of the clusters (Fig 7). SMASH achieved consistently higher ES for all the clusters while SpaGene was the second best in most cases. Additionally, in Fig 8, we show the expression of three chosen genes from each of the following four sets of SVGs: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH. We also checked the clinical relevance of these genes in the existing literature. For example, CITED4, which was detected to be an SVG by all three methods, has been found to be associated with lung adenocarcinoma [64]. From set (b), ELF4A1 has been found to be associated with gastric cancer [65]. EZH2, from set (c), is a well-known marker associated with the development and progression of different types of cancer [66, 67]. Sanders et al. (2022) [63] also found the expression of EZH2 to be highly variable across their identified spatial clusters. Finally, from set (d), SEMA4F has been found to be associated with endometrial cancer [68].

Fig 7. Analysis of SCCOHT data.

A) Pre-identified clusters of cells using Seurat. B) Overlap between the detected SVGs by the three methods. C) Enrichment scores of the methods in different clusters.

https://doi.org/10.1371/journal.pgen.1010983.g007

Fig 8. Expression patterns in SCCOHT data.

Three representative genes from the four sets of SVGs: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH.

https://doi.org/10.1371/journal.pgen.1010983.g008

Mouse hypothalamus by MERFISH.

The mouse hypothalamus data [60] has 161 genes and 5665 cells. Of these, 156 genes are pre-selected markers for different cell types and can thus be expected to be highly variable, whereas the other five are control genes. The cell types, such as endothelial, ependymal, and inhibitory, can be identified based on the transcriptional profiles of the markers. The spatial organizations of a few major cell types are shown in Fig 9. SMASH was able to detect 139 genes, whereas SPARK-X and SpaGene detected 127 and 124 genes, respectively (padjust < 0.01). The overlaps between the SVGs detected by the three methods are shown in Fig 9. SMASH identified all the SVGs SPARK-X could detect, while SpaGene identified one additional SVG. It should be highlighted that all the methods classified the five control genes as not spatially variable. We display the expression of two representative genes from three sets of genes: a) the genes identified only by SMASH and SpaGene, b) the genes identified only by SMASH and SPARK-X, and c) the genes identified only by SMASH. We did not focus on the common genes because they have been extensively studied in earlier literature, such as the work of Liu et al. (2022) [48]. The genes Npy1r and Cplx3 belonged to set (a), and are known to be enriched in inhibitory and excitatory neurons [69, 70]. Rxfp1 and Ntsr1 belonged to set (b). Even though both genes are known to be expressed in inhibitory and excitatory neurons, Rxfp1 seems to be expressed in ependymal cells as well. Galr2 and Crhr1 are two genes from set (c) which are expressed in multiple cell types including inhibitory cells and astrocytes.

Fig 9. Analysis of mouse hypothalamus data.

A) Overlap between the detected SVGs by the three methods. B) Spatial organization of a few major cell types. C) Expression of two representative genes from each of the three sets, a) the genes identified only by SMASH and SpaGene, b) the genes identified only by SMASH and SPARK-X, and c) the genes identified only by SMASH.

https://doi.org/10.1371/journal.pgen.1010983.g009

Discussion

We have proposed a novel non-parametric method SMASH for detecting spatially variable genes (SVGs) in the context of large-scale spatial transcriptomics (ST) datasets. In comparison to existing scalable approaches, SMASH achieves superior power in both complex simulation scenarios and real data analyses while remaining computationally tractable.

Recently developed spatial transcriptomics platforms produce high-dimensional datasets [18–20] in terms of the number of cells and the number of genes. In such large datasets, fully parametric approaches for detecting SVGs, such as SpatialDE [34] and SPARK [37], albeit statistically powerful, become intractable because of their high computational demand. Computationally efficient alternative non-parametric approaches, such as SPARK-X [38] and SpaGene [48], on the other hand, can often turn out to be significantly less powerful. In our method SMASH, we strive to find a balance between these two issues, achieving higher statistical power while attaining a moderate degree of scalability. We augment the kernel-based covariance testing framework [54], used before in SPARK-X, by accounting for more complex spatial dependencies.

In three different simulation setups, one following the SPARK-X manuscript [38] and the other two following the framework of SpatialDE [34], we evaluated the performance of SMASH, along with two other methods, SPARK-X and SpaGene, in terms of type 1 error and power. SMASH achieved consistently similar or better power than the other two methods in all the simulation setups for all combinations of the varying parameters. In contrast, both SPARK-X and SpaGene behaved unpredictably, achieving almost zero detection power in many of the cases. This demonstrated their lack of robustness and failure to capture complicated structures of spatial dependency in the gene expression. In the run-time comparison of the methods, we showed that SMASH, although slower than SPARK-X and SpaGene, remained fairly tractable and was almost ten times faster than a fully parametric approach like SpatialDE. SMASH, SPARK-X, and SpaGene were then applied to four real datasets: 1) mouse cerebellum data collected using Slide-seq V2 [19], 2) human dorsolateral prefrontal cortex data collected using Visium [11], 3) small cell carcinoma of the ovary hypercalcemic type data collected using Visium [63], and 4) mouse hypothalamus data collected using MERFISH [60]. We compared the methods via a number of avenues: a) checking the overlap between the SVGs detected by the three methods, b) computing enrichment scores (ES) of the methods in different spatial layers or cell types identified based on the transcriptional profiles using popular software, such as RCTD [32] and Seurat [33], and c) investigating the functional enrichment of the genes that were detected by SMASH but remained undetected by at least one of the other two methods. For all the datasets, SMASH detected more SVGs than the other two methods, which included nearly all of the SVGs detected by SPARK-X. SMASH could also detect most of the SVGs that were identified by SpaGene but not by SPARK-X. 
For example, in data (1), from the 7,653 genes after quality control, SMASH identified 1173 SVGs which included 607 out of the 608 SVGs SPARK-X could detect. Out of the 518 SVGs detected by SpaGene, only 248 were also detected by SPARK-X, while SMASH detected 451 of them. It is important to highlight that SMASH produced calibrated p-values in the null simulations from all of these datasets, lending credibility to these higher numbers of detected SVGs. In the same dataset, SMASH achieved a higher enrichment score (ES) than the other two methods in different pre-identified spatially separated layers or cell types of the mouse cerebellum. A higher ES implied better capability to identify the genes that showed highly variable expression in a particular spatial layer compared to the rest. In the other datasets as well, SMASH consistently achieved better ES in different spatially localized cell types. We also studied the functional properties and clinical significance of the identified SVGs. For example, in data (3), the gene EZH2 was detected to be spatially variable by SMASH and SPARK-X. EZH2 is a known marker for the progression of different types of cancers [66, 67].

In all the methods we have discussed, including SMASH, the biology of a single tissue section from a single subject is explored at a time. This means that if we have multiple tissue sections, either from the same subject or from multiple subjects, the methods have to identify SVGs in each section individually, disregarding the information shared within and across the subjects. Thus, we would like to extend SMASH in a hierarchical fashion for jointly analyzing more than one tissue section or subject in the future. Another important functionality that we would like to incorporate is the ability to classify the genes based on the similarity of their spatial expression patterns. For example, SpatialDE [34] considers a hierarchical Bayesian mixture model approach that suffers from extremely high computational demand. SpaGene [48] considers a non-negative matrix factorization [71] of the expression data to identify similarly expressed genes. This approach, although computationally feasible, does not take into account the spatial locations directly and can thus be suboptimal in capturing truly spatial patterns. In the future, we would like to study this problem with a deeper focus and pursue methodological development in this area. Finally, we would like to explore the possibility of using SMASH in the context of multiplex immunohistochemistry (mIHC) datasets [72, 73], where the goal is to identify spatially variable cell types and their interaction.

Materials and methods

We briefly discuss some of the existing methods such as SpatialDE [34], SPARK [37], SPARK-X [38], and SpaGene [48], and then present the proposed method SMASH. Note that we did not compare SMASH to either SpatialDE or SPARK in our Results section except for the time comparison, primarily due to their high computational demand and the fact that these have already been studied in great detail in earlier works. However, we still discuss their modeling frameworks to facilitate comparisons. Let us introduce some relevant notation. Suppose there is a single subject (image) with $N$ cells/spots and the expression profile of $K$ genes is observed in the cells. For the $i$-th cell, let $s_i$ denote its location, i.e., a vector of spatial (two- or three-dimensional) coordinates, and $y_{ki}$ denote the expression of the $k$-th gene in the cell. Let us also define $y_k = (y_{k1}, \ldots, y_{kN})^T$ and $S = (s_1, \ldots, s_N)^T$. For the sake of simplicity, we assume that there are no additional covariates, but in all the methods except SpaGene, covariates can be readily incorporated.

A brief overview of existing methods

SpatialDE.

SpatialDE uses a Gaussian process (GP)-based spatial regression model [49, 74], which has the following form in a finite sample,

$$y_k \sim \mathrm{MVN}\left(\mu_k \mathbf{1},\ \sigma_k^2 \Sigma + \tau_k^2 I\right), \qquad \Sigma = [[\Sigma_{ij}]]_{N \times N}, \qquad \Sigma_{ij} = \exp\left(-\frac{\lVert s_i - s_j \rVert^2}{2 l^2}\right), \tag{1}$$

where $\mathbf{1}$ denotes the $N$-length vector of all 1’s, $I$ denotes the $N$-dimensional identity matrix, and $\Sigma$ denotes a Gaussian covariance matrix. $\lVert \cdot \rVert$ denotes the Euclidean norm, and the hyperparameter $l$, known as the characteristic lengthscale [75, 76], controls how rapidly the covariance decays as a function of the spatial distance. The fixed effect $\mu_k$ accounts for the mean expression level, $\sigma_k^2$ accounts for the expression variance attributable to spatial effects, and $\tau_k^2$ accounts for the remaining non-spatial variance. A large value of $\sigma_k^2$ should imply that the gene shows differential spatial expression. To formally test the hypothesis $H_0: \sigma_k^2 = 0$ against $H_1: \sigma_k^2 > 0$, SpatialDE considers the likelihood ratio test (LRT) [77]. To estimate the model parameters under the full model, the log-likelihood corresponding to Eq (1) is optimized w.r.t. $(\mu_k, \sigma_k^2, \tau_k^2)$ using an efficient algorithm by Lippert et al. (2011) [51]. Ideally, it is desirable to optimize over the hyperparameter $l$ as well, but for the sake of computational feasibility, $l$ is kept fixed at a few carefully chosen values. For every choice of $\Sigma$, to analyze all $K$ genes, the efficient algorithm requires just one computationally demanding step with a complexity of $O(N^3)$, instead of $O(N^3 K)$ as incurred in naive algorithms. Along with the Gaussian covariance function, SpatialDE also considers linear and cosine covariance functions to construct $\Sigma$, and finally combines all the LRT values corresponding to different choices of $\Sigma$ for the inference. For a particular $\Sigma$, the computational complexity of SpatialDE is $O(N^3 + N^2 K)$.

SPARK and SPARK-X.

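SPARK [37] extends Eq (1) by considering a generalized linear spatial model (GLSM) [52] with Poisson distribution as

$$y_{ki} \mid \lambda_k(s_i) \sim \mathrm{Poisson}\left(\lambda_k(s_i)\right), \qquad \left(\log \lambda_k(s_1), \ldots, \log \lambda_k(s_N)\right)^T \sim \mathrm{MVN}\left(\mu_k \mathbf{1},\ \sigma_k^2 \Sigma + \tau_k^2 I\right). \tag{2}$$

For cell $i$, $\lambda_k(s_i)$ is an unknown Poisson rate parameter that represents the underlying gene expression. The variance parameters, $\sigma_k^2$ (spatial) and $\tau_k^2$ (non-spatial), have similar interpretations as earlier. To test $H_0: \sigma_k^2 = 0$, SPARK uses the score test [78]. Parameter estimation and inference are computationally very hard in GLSM, which is why SPARK uses an approximate algorithm based on the penalized quasi-likelihood (PQL) approach [53, 79]. The approach has a computational complexity of $O(N^3)$ for every gene, or $O(N^3 K)$ in total. Thus, it scales poorly to large datasets.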

Improving upon SPARK’s scalability, a recent non-parametric method named SPARK-X [38] has been proposed. The method is built on a simple intuition: if $y_k$ is independent of $S$, the spatial distance between two locations $i$ and $j$ should be independent of the difference in gene expression between the two locations. It computes the expression covariance matrix, $E_k = y_k (y_k^T y_k)^{-1} y_k^T$, and the distance covariance matrix, $D = S(S^T S)^{-1} S^T$, and constructs the test statistic as $T_k^{\text{SPARK-X}} = \mathrm{tr}(E_k D)/N$, where $\mathrm{tr}(\cdot)$ denotes the trace operator. Assume $y_k$ to be mean-standardized for the sake of simplicity. Under the null hypothesis of no association, $T_k^{\text{SPARK-X}}$ asymptotically follows a weighted mixture of independent $\chi^2_1$ distributions. The weights are the products of the ordered eigenvalues of the matrices $E_k$ and $D$. SPARK-X requires a computational complexity of just $O(N d^2)$ for every gene, or $O(N K d^2)$ in total, where $d$ is the dimension of the location-space $\mathcal{S}$, e.g., $d = 3$ if $\mathcal{S} = \mathbb{R}^3$. A linear complexity w.r.t. $N$ makes SPARK-X easily applicable to large-scale ST datasets. SPARK-X also considers several element-wise non-linear transformations of $S$ as $g(S)$, where $g$ is a Gaussian or cosine transformation (not to be confused with Gaussian or cosine kernels), and repeats the above testing procedure replacing $S$ with $g(S)$. The p-values are combined using a Cauchy p-value combination rule [80].

However, the form of $D$ corresponds to a linear covariance function [75]. It makes SPARK-X equivalent to performing a multiple linear regression of $y_k$ on $S$ or $g(S)$ and testing if the fixed effect parameters differ from zero. Thus, SPARK-X is only capable of detecting first-order spatial dependencies and, as shown in the Results section, severely lacks power for higher-order dependencies.
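The regression interpretation can be checked numerically. The following is a minimal sketch, not the SPARK-X implementation. It assumes that the expression covariance matrix has the same projection form as the distance covariance matrix, i.e., $E_k = y_k (y_k^T y_k)^{-1} y_k^T$, that both the expression and the coordinates are mean-centered, and it omits any constant scaling of the statistic. The function name `trace_ED` and the toy genes are hypothetical. Under these assumptions, $\mathrm{tr}(E_k D)$ equals the $R^2$ of a linear regression of $y_k$ on $S$ and can be computed without forming any $N \times N$ matrix.

```python
import numpy as np

def trace_ED(y, S):
    """Return tr(E D) with E = y (y^T y)^{-1} y^T and D = S (S^T S)^{-1} S^T.

    Only d-dimensional quantities are formed, so the cost is O(N d^2).
    """
    y = np.asarray(y, dtype=float)
    S = np.asarray(S, dtype=float)
    y = y - y.mean()            # assumption: expression is mean-centered
    S = S - S.mean(axis=0)      # assumption: coordinates are mean-centered
    Sty = S.T @ y               # d-vector, O(N d)
    # tr(E D) = y^T D y / (y^T y) = (S^T y)^T (S^T S)^{-1} (S^T y) / (y^T y)
    quad = Sty @ np.linalg.solve(S.T @ S, Sty)
    return float(quad / (y @ y))

rng = np.random.default_rng(0)
S = rng.uniform(size=(500, 2))
linear_gene = 2.0 * S[:, 0] - S[:, 1]                              # linear in location
ring_gene = np.cos(8 * np.pi * np.linalg.norm(S - 0.5, axis=1))    # non-linear pattern

print(trace_ED(linear_gene, S))   # fully explained by a linear fit
print(trace_ED(ring_gene, S))     # a linear fit explains very little
```

The second toy gene has a strong but non-linear spatial pattern, which is the kind of dependency a purely linear covariance function is expected to miss.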

SpaGene.

A very recently developed method, SpaGene [48], is different from the rest of the methods discussed so far in the sense of being model-free and based on graphs. The intuition behind the method is that the cells/spots with high gene expression are more likely to be spatially connected than expected at random. It constructs the k-nearest neighbor (kNN) graph based on spatial locations. Then, for each gene, it extracts a subnetwork comprising only cells/spots with high expression from the kNN graph. SpaGene quantifies the connectivity of the subnetwork using the earth mover’s distance (EMD) [56] between degree distributions of the subnetwork and a fully connected one. To generate the null distribution of the EMD for inference, a permutation test is considered. For further details, we refer the readers to the original manuscript [48].

Proposed method: SMASH

Setup.

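We test the null hypothesis of $y_k$ and $S$ being independent, i.e., $H_0: y_k \perp\!\!\!\perp S$, using a non-parametric kernel-based framework [58, 81–83]. Let $y_k$ and $S$ have domains $\mathcal{Y}$ and $\mathcal{S}$, respectively. Denote $k_{\mathcal{Y}}$ and $k_{\mathcal{S}}$ to be two measurable positive definite (PD) kernels with the corresponding reproducing kernel Hilbert spaces (RKHSs) denoted by $\mathcal{H}_{\mathcal{Y}}$ and $\mathcal{H}_{\mathcal{S}}$ on $\mathcal{Y}$ and $\mathcal{S}$, respectively. Then, the cross-covariance operator $C_{yS}$ from $\mathcal{H}_{\mathcal{S}}$ to $\mathcal{H}_{\mathcal{Y}}$ can be defined by the relation $\langle f_2, C_{yS} f_1 \rangle = \mathrm{Cov}\left(f_1(S), f_2(y_k)\right)$ for all $f_1 \in \mathcal{H}_{\mathcal{S}}$, $f_2 \in \mathcal{H}_{\mathcal{Y}}$, where $\langle \cdot, \cdot \rangle$ denotes an inner product. $C_{yS}$ can be interpreted as a more general version of the covariance matrix on Euclidean spaces, representing higher-order correlations of $y_k$ and $S$ through $f_2(y_k)$ and $f_1(S)$. Under additional regularity assumptions on the RKHSs $\mathcal{H}_{\mathcal{Y}}$ and $\mathcal{H}_{\mathcal{S}}$ [58], it can be shown that testing $H_0: y_k \perp\!\!\!\perp S$ is equivalent to testing $H_0: C_{yS} = 0$. This testing can be performed using a test statistic of the form $\mathrm{tr}(K_Y K_S)/N$, where $K_Y$ and $K_S$ are the kernel covariance or Gram matrices, obtained using the PD kernels $k_{\mathcal{Y}}$ and $k_{\mathcal{S}}$ [75, 76]. In the context of real datasets, exact choices for $k_{\mathcal{Y}}$ and $k_{\mathcal{S}}$ are never known. Therefore, we consider different kernel choices and aggregate the results. Our test statistic has the form $T_k^{\text{SMASH}} = \mathrm{tr}(E_k K_S)/N$, where $E_k = y_k (y_k^T y_k)^{-1} y_k^T$ is defined as earlier, i.e., $k_{\mathcal{Y}}$ is fixed to be a linear kernel, while $k_{\mathcal{S}}$ and, consequently, $K_S$ are varied to have different forms as described next.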

Kernels and hyperparameters.

In this work, we consider $K_S$, the kernel covariance matrix on the locations, to have three forms: a) the Gaussian kernel covariance matrix, $\Sigma$ defined in Eq (1), b) a cosine or periodic kernel covariance matrix of the form $[[\cos(2\pi \lVert s_i - s_j \rVert / p)]]_{N \times N}$, where the parameter $p$ is known as the period, and c) the linear kernel-based covariance matrix $D$ considered in SPARK-X (with Gaussian and cosine transformations of $S$ as well). For the Gaussian and cosine covariance matrices, we consider ten data-driven fixed values of the lengthscale $l$ and period $p$, respectively (see S1 Text). Refer to Fig A in S1 Text for a visualization of the spatial patterns corresponding to the different kernel covariance matrices. In Table 2, we list the kernel covariance matrices used in different methods. Note that the SPARK-X test statistic, $T_k^{\text{SPARK-X}}$, can be interpreted as a special case of the SMASH test statistic, $T_k^{\text{SMASH}}$, as the former only considers linear kernel covariance matrices.
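The three families of kernel covariance matrices can be sketched as follows. This is an illustrative sketch only: the cosine form $\cos(2\pi \lVert s_i - s_j \rVert / p)$ is an assumption, the helper names are hypothetical, and the grid of ten lengthscales and periods is taken from quantiles of the pairwise distances purely for illustration (the data-driven rule actually used is described in S1 Text and is not reproduced here).

```python
import numpy as np

def pairwise_dist(S):
    """Euclidean distance matrix between the rows of S (N x d)."""
    S = np.asarray(S, dtype=float)
    diff = S[:, None, :] - S[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gaussian_kernel(S, lengthscale):
    """Gaussian kernel matrix with entries exp(-||s_i - s_j||^2 / (2 l^2))."""
    return np.exp(-pairwise_dist(S) ** 2 / (2.0 * lengthscale ** 2))

def cosine_kernel(S, period):
    """Cosine (periodic) kernel matrix with entries cos(2 pi ||s_i - s_j|| / p)."""
    return np.cos(2.0 * np.pi * pairwise_dist(S) / period)

def linear_projection_kernel(S):
    """Linear-kernel-based projection matrix D = S (S^T S)^{-1} S^T."""
    S = np.asarray(S, dtype=float)
    return S @ np.linalg.solve(S.T @ S, S.T)

rng = np.random.default_rng(0)
S = rng.uniform(size=(100, 2))
dists = pairwise_dist(S)[np.triu_indices(100, k=1)]
# Hypothetical grid: ten values spread over the lower half of the distance distribution.
grid = np.quantile(dists, np.linspace(0.05, 0.5, 10))

kernels = [gaussian_kernel(S, l) for l in grid] + [cosine_kernel(S, p) for p in grid]
D = linear_projection_kernel(S)
print(len(kernels), D.shape)
```

Building dense $N \times N$ matrices this way costs $O(N^2 d)$ time and $O(N^2)$ memory per kernel, which is the price paid for going beyond the projection-type matrix $D$.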

Table 2. Kernel choices in different methods.

The table shows (yes/no) if a particular kernel covariance or Gram matrix is considered in different methods.

https://doi.org/10.1371/journal.pgen.1010983.t002

Distribution and computational complexity.

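For a particular choice of $K_S$, the kernel covariance matrix on the locations, the asymptotic null distribution of the SMASH test statistic $T_k^{\text{SMASH}}$ is a weighted mixture of independent $\chi^2_1$ distributions, where the weights are the products of the ordered eigenvalues of the matrices $E_k$ and $K_S$ [54, 58]. However, unlike the kernel choices of $T_k^{\text{SPARK-X}}$, $K_S$ does not always have a projection matrix-like structure as $D$, and thus, its eigenvalues cannot be computed with the complexity of $O(N d^2)$. Instead, it requires the complexity of $O(N^3)$, rendering it intractable as $N$ increases. Therefore, we consider a variation of the Welch-Satterthwaite approximation [84, 85] to approximate the asymptotic null distribution of $T_k^{\text{SMASH}}$ with a gamma distribution [58] as below,

$$T_k^{\text{SMASH}} \sim \mathrm{Gamma}(\alpha_k, \beta_k), \qquad \alpha_k = \frac{\mathrm{E}\left(T_k^{\text{SMASH}}\right)^2}{\mathrm{Var}\left(T_k^{\text{SMASH}}\right)}, \qquad \beta_k = \frac{\mathrm{Var}\left(T_k^{\text{SMASH}}\right)}{\mathrm{E}\left(T_k^{\text{SMASH}}\right)},$$

where $\alpha_k$ and $\beta_k$ are the shape and scale parameters, and $\mathrm{E}(\cdot)$ and $\mathrm{Var}(\cdot)$ denote the expectation and variance, respectively. It is easy to verify that $\alpha_k \beta_k = \mathrm{E}(T_k^{\text{SMASH}})$ and $\alpha_k \beta_k^2 = \mathrm{Var}(T_k^{\text{SMASH}})$. Notice that we can now avoid any operation of complexity $O(N^3)$. Computation of the two moments just requires the complexity of $O(N^2)$ using the property that $\mathrm{tr}(A^T B) = \sum_{i=1}^N \sum_{j=1}^N a_{ij} b_{ij}$ for two matrices, $A = [[a_{ij}]]_{N \times N}$ and $B = [[b_{ij}]]_{N \times N}$ [86]. Thus, for a particular choice of $K_S$, to analyze all $K$ genes, SMASH requires the complexity of $O(N^2 K)$. This computational complexity is higher than that of SPARK-X. But we are making that sacrifice to gain significantly more power, as shown in both simulation studies and real data analyses, while still achieving a moderate degree of scalability. It is worth pointing out that even though SMASH is non-parametric and does not make any distributional assumptions, $T_k^{\text{SMASH}}$ shares a close similarity with the SpatialDE model under some additional assumptions (see S1 Text).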
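The moment-matching step can be illustrated with a short sketch. It is not the SMASH implementation: it assumes that the null limit is $\sum_{i,j} a_i b_j \chi^2_{1,ij}$, where $a_i$ and $b_j$ are the eigenvalues of two symmetric matrices $A$ and $B$, and it ignores any constant scaling of the statistic or of the matrices. The function name `gamma_moment_match` and the toy data are hypothetical. Under these assumptions, the mean is $\mathrm{tr}(A)\,\mathrm{tr}(B)$ and the variance is $2\,\mathrm{tr}(A^2)\,\mathrm{tr}(B^2)$, and both follow from element-wise sums in $O(N^2)$.

```python
import numpy as np

def gamma_moment_match(A, B):
    """Shape and scale of a gamma matched to sum_ij a_i b_j chi2_1,
    where a_i and b_j are the eigenvalues of the symmetric matrices A and B."""
    mean = np.trace(A) * np.trace(B)
    # For symmetric A, tr(A^2) = sum_ij a_ij^2: an O(N^2) sum, no eigendecomposition.
    var = 2.0 * np.sum(A * A) * np.sum(B * B)
    return mean ** 2 / var, var / mean          # (shape, scale)

rng = np.random.default_rng(1)
N = 40
y = rng.normal(size=N)
y -= y.mean()
E = np.outer(y, y) / (y @ y)                    # projection-type expression matrix
S = rng.uniform(size=(N, 2))
d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / (2 * 0.3 ** 2))                # Gaussian kernel, lengthscale 0.3

shape, scale = gamma_moment_match(E, K)

# Brute-force O(N^3) check through the eigenvalues, feasible only for small N.
w = np.outer(np.linalg.eigvalsh(E), np.linalg.eigvalsh(K)).ravel()
print(np.isclose(shape * scale, w.sum()), np.isclose(shape * scale ** 2, 2 * (w ** 2).sum()))
```

Given the shape and scale, an approximate p-value for an observed statistic could then be read off the upper tail of the gamma distribution, for instance with `scipy.stats.gamma.sf`.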

Aggregation and covariates.

As mentioned earlier, we consider multiple (say, $R$) choices for $K_S$, the kernel covariance matrix on the locations, to construct multiple test statistics: $T_{k,1}^{\text{SMASH}}, \ldots, T_{k,R}^{\text{SMASH}}$. Finally, we combine the p-values corresponding to these test statistics using the minimum p-value combination rule [80] (see S1 Text for more details). Note that we have assumed that $y_k$ is mean-standardized and there are no additional covariates to be taken into account. In the presence of covariates, we would regress the covariates out from the gene expression vector $y_k$, prior to performing the test, using a multiple linear regression model. To further elaborate, letting $X$ be the corresponding matrix of covariates, we would compute the projection matrix $P_X = X(X^T X)^{-1} X^T$ and substitute the vector $y_k$ with $(I - P_X) y_k$ in our proposed test statistic.
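Both steps are simple to sketch. The residualization below follows the projection described above. The combination rule is an assumption: the exact minimum p-value rule is given in S1 Text, and the Bonferroni-type version $\min(1, R \cdot \min_r p_r)$ used here is only one common conservative variant. The function names and toy data are hypothetical.

```python
import numpy as np

def regress_out(y, X):
    """Return (I - P_X) y, the residuals of a least-squares fit of y on X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # avoids forming the N x N matrix P_X
    return y - X @ beta

def min_p_combination(pvals):
    """Assumed Bonferroni-type minimum p-value rule: min(1, R * min_r p_r)."""
    pvals = np.asarray(pvals, dtype=float)
    return float(min(1.0, pvals.size * pvals.min()))

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept plus one covariate
y = 1.5 + 0.8 * X[:, 1] + rng.normal(size=N)

residual = regress_out(y, X)
print(np.abs(X.T @ residual).max())          # numerically zero: residuals are orthogonal to X
print(min_p_combination([0.01, 0.2, 0.5]))
```

Including a column of ones in `X` also mean-centers the expression vector, so the residuals can be passed directly to the test.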

FDR control and non-PD kernels

In the real data analysis, we used the Benjamini-Yekutieli procedure [87] to control the false discovery rate (FDR) at 0.05 (or 0.01) for all the methods. In the Results section, $p_{\text{adjust}}$ refers to the adjusted p-values. It was shown by Zhu et al. (2021) [38] that parametric methods like SpatialDE and SPARK often produce highly inflated p-values for most ST datasets, and hence need additional testing correction. To check if our p-values were inflated in the four real datasets, we randomly permuted the spatial locations of the cells/spots five times and then performed the tests using the three methods. Thus, we obtained the empirical null distribution of the p-values for each method, which we displayed as quantile-quantile plots (see Fig B in S1 Text). In all four cases, SMASH showed no sign of inflation, with slightly conservative p-values. This is expected since the minimum p-value combination rule, used for combining the p-values in our method, is known to be conservative [88].
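For reference, the Benjamini-Yekutieli adjustment can be written in a few lines. This is a generic sketch of the standard step-up procedure, not code from the SMASH package, and the function name is hypothetical. With $m$ tests and $c(m) = \sum_{j=1}^{m} 1/j$, the adjusted value for the $i$-th smallest p-value is the running minimum of $p_{(j)} \, m \, c(m) / j$ over $j \ge i$, capped at 1.

```python
def benjamini_yekutieli(pvals):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))         # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvals[i])    # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                        # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

p_adjust = benjamini_yekutieli([0.5, 0.001])
print(p_adjust)
discoveries = [p <= 0.05 for p in p_adjust]
print(discoveries)
```

Genes whose adjusted value falls at or below the chosen level (0.05 or 0.01) would be reported as SVGs.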

The cosine or periodic kernel covariance matrix is not positive definite (PD). Our testing framework and the distributional derivations hold only for PD kernel covariance matrices. One solution could be to truncate the negative eigenvalues of the kernel matrix $K_S$, i.e., adjusting $K_S$ as $K_S^{+} = \sum_{i=1}^{N} \max(\lambda_i, 0)\, U_i U_i^T$, where $\lambda_i$ and $U_i$ denote the $i$-th eigenvalue and eigenvector of $K_S$, respectively. However, computing eigenvalues can become computationally challenging as it requires a complexity of $O(N^3)$. In our simulations, we have noticed that using unadjusted versions of the kernel matrices yielded conservative test results, with no sign of p-value inflation. We refer to S1 Text for further details and plots.
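A three-point example makes the issue concrete. The sketch below assumes the cosine kernel has entries $\cos(2\pi \lVert s_i - s_j \rVert / p)$. For three locations at the vertices of an equilateral triangle with side $p/2$, every off-diagonal entry is $\cos(\pi) = -1$, and the resulting matrix has a negative eigenvalue. The function name `truncate_to_psd` is hypothetical.

```python
import numpy as np

def truncate_to_psd(K):
    """Set the negative eigenvalues of a symmetric matrix to zero (O(N^3))."""
    lam, U = np.linalg.eigh(K)           # eigenvalues in ascending order
    lam = np.clip(lam, 0.0, None)        # max(lambda_i, 0)
    return (U * lam) @ U.T               # sum_i max(lambda_i, 0) U_i U_i^T

# Cosine kernel matrix for an equilateral triangle with side length p / 2.
K = np.array([[ 1.0, -1.0, -1.0],
              [-1.0,  1.0, -1.0],
              [-1.0, -1.0,  1.0]])

print(np.linalg.eigvalsh(K))                      # one eigenvalue is negative
K_plus = truncate_to_psd(K)
print(np.linalg.eigvalsh(K_plus).min() >= -1e-12)
```

The adjustment needs a full eigendecomposition, which is exactly the $O(N^3)$ cost that the gamma approximation is meant to avoid.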

Enrichment scores

In the real data analysis, we computed the enrichment scores (ES) of the three methods following the procedure outlined in Liu et al. (2022) [48]. Cell clustering based on biological knowledge or using popular software, such as RCTD [32] and Seurat [33], with the transcriptional profiles, can often identify spatially localized layers or cell types. Therefore, marker genes in those spatially restricted cell types should ideally be identified as SVGs. Suppose there are $M$ cell types. For every cell type $m$, the gene set $G_m$ is built from the top 50 markers based on the fold change between the expression in cell type $m$ and the expression in the others. The SVGs detected by each of the three methods are ranked from the most to the least significant. Finally, unweighted gene set enrichment analysis [89] is implemented to evaluate the enrichment of the gene sets $G_m$, $m = 1, \ldots, M$, near the top of the ranked SVG list of each method.
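The unweighted enrichment score can be sketched as a running-sum statistic. This is a generic illustration of unweighted gene set enrichment analysis, not the exact pipeline of Liu et al. (2022): walking down the ranked list, the running sum increases by $1/n_{\text{hit}}$ at each gene in the set and decreases by $1/n_{\text{miss}}$ otherwise, and the ES is the largest deviation from zero. The function name and the toy gene labels are hypothetical, and set members absent from the ranked list are simply ignored.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set in ranked_genes."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0                       # score undefined; report no enrichment
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):     # keep the largest deviation from zero
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4"]                    # most to least significant
print(enrichment_score(ranked, {"g1", "g2"}))        # markers at the top of the list
print(enrichment_score(ranked, {"g3", "g4"}))        # markers at the bottom of the list
```

A score near 1 indicates that the markers of a cell type concentrate among the most significant SVGs, while a negative score indicates that they concentrate near the bottom.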

Software used

To fit SPARK-X, SpaGene, and SpatialDE, we used the existing packages which are available at,

Gene-set functional enrichment analyses were performed using ShinyGO Version 0.77 [90] available at, http://bioinformatics.sdstate.edu/go/.

Simulation description

In simulation setup (1), we generated the spatial coordinates for varying numbers of cells, N = 1,000, 5,000, and 10,000, using a random point-pattern Poisson process [91]. The expression values of K = 500 genes in these cells were simulated based on a negative binomial distribution displaying one of four spatial patterns: streak, reverse streak, hotspot, and reverse hotspot, as shown in Fig 1. For each of the patterns, 80% of the spatial locations were assumed to be background locations, while the remaining 20% were assumed to be part of the pattern. The difference between the mean expression of a gene on a background location and on a patterned location was captured through a fold-change parameter. Several values of fold-change were considered, where a value of 1 implied a null scenario, i.e., no spatial pattern, and a high value implied a prominent spatial pattern. We refer to Zhu et al. (2021) [38] for more details.
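A simplified version of the hotspot scenario can be sketched as follows. This is not the simulation code of Zhu et al. (2021): the unit-square domain, the circular hotspot, the baseline mean, and the dispersion value are all hypothetical choices made for illustration, and only the 80%/20% split and the fold-change idea come from the description above.

```python
import numpy as np

def simulate_hotspot_gene(N=2000, fold_change=3.0, base_mean=1.0, dispersion=2.0, seed=0):
    """Simulate locations and negative binomial counts with a central hotspot."""
    rng = np.random.default_rng(seed)
    # Conditional on N, points of a homogeneous Poisson process are uniform on the domain.
    S = rng.uniform(size=(N, 2))
    dist = np.linalg.norm(S - 0.5, axis=1)
    in_pattern = dist <= np.quantile(dist, 0.2)        # the 20% of cells closest to the centre
    mean = np.where(in_pattern, base_mean * fold_change, base_mean)
    # NumPy parameterization (n, p) has mean n (1 - p) / p, so p = n / (n + mean).
    p = dispersion / (dispersion + mean)
    y = rng.negative_binomial(dispersion, p)
    return S, y, in_pattern

S, y, in_pattern = simulate_hotspot_gene()
print(in_pattern.mean())                               # fraction of patterned cells
print(y[in_pattern].mean(), y[~in_pattern].mean())     # hotspot versus background mean
```

Setting `fold_change=1.0` gives the null scenario, in which patterned and background cells share the same mean.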

For simulation setup (2), we considered the Gaussian process (GP)-based spatial regression model from SpatialDE [34]. The locations were simulated from a uniform distribution and were then used to construct Gaussian covariance matrices with varying lengthscale ($l$) parameters as in Eq (1). The expression levels of genes were independently and identically simulated from the multivariate normal distribution described in Eq (1) for different values of the spatial variance parameter $\sigma_k^2$ and the non-spatial variance parameter $\tau_k^2$. We fixed the total variance, $\sigma_k^2 + \tau_k^2$, and varied the individual values of $\sigma_k^2$ and $\tau_k^2$ through an “effect-size” $h$, which ranged from zero to larger values, implying null to an increasingly stronger spatial pattern. In simulation setup (3), we followed setup (2), replacing the Gaussian covariance with the cosine covariance for varying values of the period parameter $p$. In all three setups, we compared SMASH, SPARK-X, and SpaGene in terms of type 1 error and power.

Supporting information

S1 Text.

Section 1 discusses how to choose suitable kernel covariance matrices and combine the p-values corresponding to different kernel covariance matrices. Section 2 shows SPARK-X’s equivalence with the multiple linear regression model. Section 3 analyzes the null QQ plots of different methods in the real datasets. Section 4 discusses the severity of using non-positive definite (non-PD) kernel covariance matrices. We list and briefly describe the figures from S1 Text below.

  • Fig A. Visualization of patterns of different kernel covariance matrices.
  • Fig B. QQ-plots of different methods under null simulations in the real datasets.
  • Fig C. QQ-plots with the observed and theoretical distributions of the SMASH test statistic with an unadjusted cosine kernel matrix.
  • Fig D. QQ-plots with the observed and theoretical distributions of the SMASH test statistic with an adjusted cosine kernel matrix.
  • Fig E. QQ-plots with the observed and theoretical distributions of the -log10(p)-values obtained using SMASH with all the kernel matrices.

https://doi.org/10.1371/journal.pgen.1010983.s001

(PDF)

Acknowledgments

We would like to thank Dr. Kristen Wells-Wrasman for her help with processing the SCCOHT dataset.

References

  1. Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82. pmid:27365449
  2. Shah S, Lubeck E, Zhou W, Cai L. In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron. 2016;92(2):342–357. pmid:27764670
  3. Shah S, Lubeck E, Zhou W, Cai L. seqFISH accurately detects transcripts in single cells and reveals robust spatial organization in the hippocampus. Neuron. 2017;94(4):752–758. pmid:28521130
  4. Wang G, Moffitt JR, Zhuang X. Multiplexed imaging of high-density libraries of RNAs with MERFISH and expansion microscopy. Scientific reports. 2018;8(1):1–13.
  5. Xia C, Fan J, Emanuel G, Hao J, Zhuang X. Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression. Proceedings of the National Academy of Sciences. 2019;116(39):19490–19499. pmid:31501331
  6. Eng CHL, Lawson M, Zhu Q, Dries R, Koulena N, Takei Y, et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature. 2019;568(7751):235–239. pmid:30911168
  7. Asp M, Bergenstråhle J, Lundeberg J. Spatially resolved transcriptomes - next generation tools for tissue exploration. BioEssays. 2020;42(10):1900221. pmid:32363691
  8. Guilliams M, Bonnardel J, Haest B, Vanderborght B, Wagner C, Remmerie A, et al. Spatial proteogenomics reveals distinct and evolutionarily conserved hepatic macrophage niches. Cell. 2022;185(2):379–396. pmid:35021063
  9. Dhainaut M, Rose SA, Akturk G, Wroblewska A, Nielsen SR, Park ES, et al. Spatial CRISPR genomics identifies regulators of the tumor microenvironment. Cell. 2022;185(7):1223–1239. pmid:35290801
  10. Chen WT, Lu A, Craessaerts K, Pavie B, Frigerio CS, Corthout N, et al. Spatial transcriptomics and in situ sequencing to study Alzheimer’s disease. Cell. 2020;182(4):976–991. pmid:32702314
  11. Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK, Williams SR, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature neuroscience. 2021;24(3):425–436. pmid:33558695
  12. Ortiz C, Carlén M, Meletis K. Spatial transcriptomics: molecular maps of the mammalian brain. Annual review of neuroscience. 2021;44:547–562. pmid:33914592
  13. Levy-Jurgenson A, Tekpli X, Kristensen VN, Yakhini Z. Spatial transcriptomics inferred from pathology whole-slide images links tumor heterogeneity to survival in breast and lung cancer. Scientific reports. 2020;10(1):1–11. pmid:33139755
  14. Yoosuf N, Navarro JF, Salmén F, Ståhl PL, Daub CO. Identification and transfer of spatial transcriptomics signatures for cancer diagnosis. Breast Cancer Research. 2020;22(1):1–10. pmid:31931856
  15. Hunter MV, Moncada R, Weiss JM, Yanai I, White RM. Spatially resolved transcriptomics reveals the architecture of the tumor-microenvironment interface. Nature communications. 2021;12(1):1–16. pmid:34725363
  16. Zollinger DR, Lingle SE, Sorg K, Beechem JM, Merritt CR. GeoMx RNA assay: high multiplex, digital, spatial analysis of RNA in FFPE tissue. In Situ Hybridization Protocols. 2020; p. 331–345. pmid:32394392
  17. Merritt CR, Ong GT, Church SE, Barker K, Danaher P, Geiss G, et al. Multiplex digital spatial profiling of proteins and RNA in fixed tissue. Nature biotechnology. 2020;38(5):586–599. pmid:32393914
  18. Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science. 2019;363(6434):1463–1467. pmid:30923225
  19. Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nature biotechnology. 2021;39(3):313–319. pmid:33288904
  20. Vickovic S, Eraslan G, Salmén F, Klughammer J, Stenbeck L, Schapiro D, et al. High-definition spatial transcriptomics for in situ tissue profiling. Nature methods. 2019;16(10):987–990. pmid:31501547
  21. Kwon S. Single-molecule fluorescence in situ hybridization: quantitative imaging of single RNA molecules. BMB reports. 2013;46(2):65. pmid:23433107
  22. Lubeck E, Coskun AF, Zhiyentayev T, Ahmad M, Cai L. Single-cell in situ RNA profiling by sequential hybridization. Nature methods. 2014;11(4):360–361. pmid:24681720
  23. Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348(6233):aaa6090. pmid:25858977
  24. Moses L, Pachter L. Museum of spatial transcriptomics. Nature Methods. 2022;19(5):534–546. pmid:35273392
  25. Atta L, Fan J. Computational challenges and opportunities in spatially resolved transcriptomic data analysis. Nature Communications. 2021;12(1):1–5. pmid:34489425
  26. Thrane K, Eriksson H, Maaskola J, Hansson J, Lundeberg J. Spatially resolved transcriptomics enables dissection of genetic heterogeneity in stage III cutaneous malignant melanoma. Cancer research. 2018;78(20):5970–5979. pmid:30154148
  27. Navarro JF, Croteau DL, Jurek A, Andrusivova Z, Yang B, Wang Y, et al. Spatial transcriptomics reveals genes associated with dysregulated mitochondrial functions and stress signaling in Alzheimer disease. Iscience. 2020;23(10):101556. pmid:33083725
  28. Kats I, Vento-Tormo R, Stegle O. SpatialDE2: Fast and localized variance component analysis of spatial transcriptomics. bioRxiv. 2021;.
  29. Rao A, Barkley D, França GS, Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596(7871):211–220. pmid:34381231
  30. Wang Y, Ma S, Ruzzo WL. Spatial modeling of prostate cancer metabolic gene expression reveals extensive heterogeneity and selective vulnerabilities. Scientific reports. 2020;10(1):1–14. pmid:32103057
  31. Berglund E, Maaskola J, Schultz N, Friedrich S, Marklund M, Bergenstråhle J, et al. Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity. Nature communications. 2018;9(1):2419. pmid:29925878
  32. Cable DM, Murray E, Zou LS, Goeva A, Macosko EZ, Chen F, et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nature Biotechnology. 2022;40(4):517–526. pmid:33603203
  33. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–3587. pmid:34062119
  34. Svensson V, Teichmann SA, Stegle O. SpatialDE: identification of spatially variable genes. Nature methods. 2018;15(5):343–346. pmid:29553579
  35. Li K, Yan C, Li C, Chen L, Zhao J, Zhang Z, et al. Computational elucidation of spatial gene expression variation from spatially resolved transcriptomics data. Molecular Therapy-Nucleic Acids. 2021;. pmid:35036053
  36. Edsgärd D, Johnsson P, Sandberg R. Identification of spatial expression trends in single-cell gene expression data. Nature methods. 2018;15(5):339–342. pmid:29553578
  37. Sun S, Zhu J, Zhou X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature methods. 2020;17(2):193–200. pmid:31988518
  38. Zhu J, Sun S, Zhou X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biology. 2021;22(1):1–25. pmid:34154649
  39. Li Q, Zhang M, Xie Y, Xiao G. Bayesian modeling of spatial molecular profiling data via Gaussian process. Bioinformatics. 2021;37(22):4129–4136. pmid:34146105
  40. Weber LM, Saha A, Datta A, Hansen KD, Hicks SC. nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes. Nature Communications. 2023;14(1):4059. pmid:37429865
  41. Bae S, Choi H, Lee DS. Discovery of molecular features underlying the morphological landscape by integrating spatial transcriptomic data with deep features of tissue images. Nucleic acids research. 2021;49(10):e55–e55. pmid:33619564
  42. Hu J, Li X, Coleman K, Schroeder A, Ma N, Irwin DJ, et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature methods. 2021;18(11):1342–1351. pmid:34711970
  43. Zhu Q, Shah S, Dries R, Cai L, Yuan GC. Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nature biotechnology. 2018;36(12):1183–1190. pmid:30371680
  44. Miller BF, Bambah-Mukku D, Dulac C, Zhuang X, Fan J. Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities. Genome research. 2021;31(10):1843–1855. pmid:34035045
  45. Dries R, Zhu Q, Dong R, Eng CHL, Li H, Liu K, et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome biology. 2021;22(1):1–31.
  46. Jiang X, Xiao G, Li Q. A Bayesian modified Ising model for identifying spatially variable genes from spatial transcriptomics data. Statistics in Medicine. 2022;41(23):4647–4665. pmid:35871762
  47. Zhang K, Feng W, Wang P. Identification of spatially variable genes with graph cuts. Nature Communications. 2022;13(1):5488. pmid:36123336
  48. Liu Q, Hsu CY, Shyr Y. Scalable and model-free detection of spatial patterns and colocalization. Genome research. 2022;32(9):1736–1745. pmid:36223499
  49. Banerjee S, Gelfand AE, Finley AO, Sang H. Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(4):825–848. pmid:19750209
  50. Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–1088. pmid:18078480
  51. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genome-wide association studies. Nature methods. 2011;8(10):833–835. pmid:21892150
  52. Christensen OF, Waagepetersen R. Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics. 2002;58(2):280–286. pmid:12071400
  53. Dean CB, Ugarte MD, Militino AF. Penalized quasi-likelihood with spatially correlated data. Computational statistics & data analysis. 2004;45(2):235–248.
  54. Zhang K, Peters J, Janzing D, Schölkopf B. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:12023775. 2012;.
  55. Rencher AC, Christensen WF. Methods of Multivariate Analysis. Wiley; 2012.
  56. Rubner Y, Tomasi C, Guibas LJ. The earth mover’s distance as a metric for image retrieval. International journal of computer vision. 2000;40(2):99–121.
  57. Odén A, Wedel H. Arguments for Fisher’s permutation test. The Annals of Statistics. 1975; p. 518–520.
  58. Gretton A, Fukumizu K, Teo C, Song L, Schölkopf B, Smola A. A kernel statistical test of independence. Advances in neural information processing systems. 2007;20.
  59. Righelli D, Weber LM, Crowell HL, Pardo B, Collado-Torres L, Ghazanfar S, et al. SpatialExperiment: infrastructure for spatially-resolved transcriptomics data in R using Bioconductor. Bioinformatics. 2022;38(11):3128–3131. pmid:35482478
  60. Moffitt JR, Bambah-Mukku D, Eichhorn SW, Vaughn E, Shekhar K, Perez JD, et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science. 2018;362(6416):eaau5324. pmid:30385464
  61. Moffitt JR, Bambah-Mukku D, Eichhorn SW, Vaughn E, Shekhar K, Perez JD, et al. Data from: Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Dryad. https://datadryad.org/stash/dataset/.
  62. Kirsch L, Liscovitch N, Chechik G. Localizing genes to cerebellar layers by classifying ISH images. PLOS computational biology. 2012;8(12):e1002790. pmid:23284274
  63. Sanders BE, Wolsky R, Doughty ES, Wells KL, Ghosh D, Ku L, et al. Small cell carcinoma of the ovary hypercalcemic type (SCCOHT): A review and novel case with dual germline SMARCA4 and BRCA2 mutations. Gynecologic Oncology Reports. 2022; p. 101077. pmid:36249907
  64. Zhang L, Wang Y, Sha Y, Zhang B, Zhang R, Zhang H, et al. CITED4 enhances the metastatic potential of lung adenocarcinoma. Thoracic Cancer. 2021;12(9):1291–1302. pmid:33759374
  65. Gao C, Guo X, Xue A, Ruan Y, Wang H, Gao X. High intratumoral expression of eIF4A1 promotes epithelial-to-mesenchymal transition and predicts unfavorable prognosis in gastric cancer. Acta Biochimica et Biophysica Sinica. 2020;52(3):310–319. pmid:32147684
  66. Gan L, Yang Y, Li Q, Feng Y, Liu T, Guo W. Epigenetic regulation of cancer progression by EZH2: from biological insights to therapeutic potential. Biomarker research. 2018;6(1):1–10. pmid:29556394
  67. Duan R, Du W, Guo W. EZH2: a novel target for cancer treatment. Journal of hematology & oncology. 2020;13(1):1–12. pmid:32723346
  68. Chen F, Qin T, Zhang Y, Wei L, Dang Y, Liu P, et al. Reclassification of endometrial cancer and identification of key genes based on neural-related genes. Frontiers in Oncology. 2022;12. pmid:36212450
  69. Nelson TS, Taylor BK. Targeting spinal neuropeptide Y1 receptor-expressing interneurons to alleviate chronic pain and itch. Progress in neurobiology. 2021;196:101894. pmid:32777329
  70. Viswanathan S, Bandyopadhyay S, Kao JP, Kanold PO. Changing microcircuits in the subplate of the developing cortex. Journal of Neuroscience. 2012;32(5):1589–1601. pmid:22302801
  71. Wang YX, Zhang YJ. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering. 2012;25(6):1336–1353.
  72. Seal S, Wrobel J, Johnson AM, Nemenoff RA, Schenk EL, Bitler BG, et al. On Clustering for Cell Phenotyping in Multiplex Immunohistochemistry (mIHC) and Multiplexed Ion Beam Imaging (MIBI) Data. BMC Research Notes. 2022;15(1):215. pmid:35725622
  73. Seal S, Ghosh D. MIAMI: mutual information-based analysis of multiplex imaging data. Bioinformatics. 2022;38(15):3818–3826. pmid:35748713
  74. Seal S, Datta A, Basu S. Efficient estimation of SNP heritability using Gaussian predictive process in large scale cohort studies. PLoS genetics. 2022;18(4):e1010151. pmid:35442943
  75. Rasmussen CE, Williams CK. Gaussian processes for machine learning. International Journal of Neural Systems. 2006;14.
  76. Cressie N. Statistics for spatial data. John Wiley & Sons; 2015.
  77. Gourieroux C, Holly A, Monfort A. Likelihood ratio test, Wald test, and Kuhn-Tucker test in linear models with inequality constraints on the regression parameters. Econometrica: journal of the Econometric Society. 1982; p. 63–80.
  78. Boos DD, Stefanski LA, et al. Essential statistical inference. Springer; 2013.
  79. Sun S, Zhu J, Mozaffari S, Ober C, Chen M, Zhou X. Heritability estimation and differential analysis of count data with generalized linear mixed models in genomic sequencing studies. Bioinformatics. 2019;35(3):487–496. pmid:30020412
  80. Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, Lin X. ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics. 2019;104(3):410–421. pmid:30849328
  81. Fukumizu K, Bach FR, Jordan MI. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research. 2004;5(Jan):73–99.
  82. Gretton A, Bousquet O, Smola A, Schölkopf B. Measuring statistical dependence with Hilbert-Schmidt norms. In: International conference on algorithmic learning theory. Springer; 2005. p. 63–77.
  83. 83. Fukumizu K, Gretton A, Sun X, Schölkopf B. Kernel measures of conditional dependence. Advances in neural information processing systems. 2007;20.
  84. Welch BL. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika. 1947;34(1-2):28–35. pmid:20287819
  85. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometrics bulletin. 1946;2(6):110–114. pmid:20287815
  86. Skiena SS. The algorithm design manual. vol. 2. Springer; 1998.
  87. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of statistics. 2001; p. 1165–1188.
  88. Narum SR. Beyond Bonferroni: less conservative analyses for conservation genetics. Conservation genetics. 2006;7:783–787.
  89. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. pmid:16199517
  90. Ge SX, Jung D, Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36(8):2628–2629. pmid:31882993
  91. Baddeley A, Bárány I, Schneider R. Spatial point processes and their applications. Stochastic Geometry: Lectures Given at the CIME Summer School Held in Martina Franca, Italy, September 13–18, 2004. 2007; p. 1–75.