Abstract
In high-throughput spatial transcriptomics (ST) studies, it is of great interest to identify the genes whose level of expression in a tissue covaries with the spatial location of cells/spots. Such genes, also known as spatially variable genes (SVGs), can be crucial to the biological understanding of both structural and functional characteristics of complex tissues. Existing methods for detecting SVGs either suffer from huge computational demand or significantly lack statistical power. We propose a non-parametric method termed SMASH that achieves a balance between the above two problems. We compare SMASH with other existing methods in varying simulation scenarios demonstrating its superior statistical power and robustness. We apply the method to four ST datasets from different platforms uncovering interesting biological insights.
Author summary
In recent years, spatial transcriptomics (ST) has become increasingly popular to study the expression profile of genes across different spatial locations of a tissue. Many of the genes exhibit spatially varying expression patterns making them immensely valuable for understanding the structural and functional properties of the tissue. The proposed method termed SMASH enables powerful and scalable detection of such genes in high-dimensional ST datasets.
Citation: Seal S, Bitler BG, Ghosh D (2023) SMASH: Scalable Method for Analyzing Spatial Heterogeneity of genes in spatial transcriptomics data. PLoS Genet 19(10): e1010983. https://doi.org/10.1371/journal.pgen.1010983
Editor: Mingyao Li, University of Pennsylvania, UNITED STATES
Received: April 6, 2023; Accepted: September 19, 2023; Published: October 20, 2023
Copyright: © 2023 Seal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: A Python-based software implementation of SMASH is available at https://github.com/sealx017/SMASH-package. The package provides two detailed notebooks to perform the analysis on the mouse hypothalamus data by MERFISH and the human DLPFC data by 10X Visium (along with the datasets as compressed Python objects). Both the mouse cerebellum data by Slide-seqV2 and the human DLPFC data by 10X Visium are available in the R Bioconductor package STexampleData, available at https://bioconductor.org/packages/release/data/experiment/html/STexampleData.html. The full mouse hypothalamus data by MERFISH is available at the link provided in the corresponding manuscript, from which we focused on only “Replicate 6”, as it had the largest number of cells. The SCCOHT dataset by 10X Visium was collected at the University of Colorado Denver Anschutz Medical Campus, and is provided in the GitHub repository.
Funding: S.S. was supported in part by the Biostatistics Shared Resource, Hollings Cancer Center, Medical University of South Carolina (P30 CA138313). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Spatial transcriptomics (ST) performs high-throughput measurement of transcriptomes in complex biological tissues at single-cell or subcellular resolution, preserving spatial information [1–9]. In the past decade, the rapid development of ST technologies has facilitated exciting discoveries in different domains, including neuroscience [10–12] and cancer research [13–15]. The popular ST technologies and corresponding platforms differ in terms of the procedure used to record spatial profiles, such as region of interest (ROI) selection [16, 17], next-generation sequencing (NGS) with spatial barcoding [18–20], and single-molecule fluorescence in situ hybridization (smFISH) [21–23]. Two crucial aspects that a researcher considers before choosing a suitable platform, are a) the capability of transcriptome-wide profiling, and b) the granularity of spatial resolution. For example, the majority of the smFISH-based technologies excel at capturing single-cell level resolution but lack the capability of transcriptome-wide profiling. On the other hand, ROI or NGS-based technologies can be used for transcriptome-wide profiling but on a significantly lower spatial resolution, such as 55 μm for the most popular and commercialized ST platform Visium (10X Genomics). We refer to Moses et al. (2022) [24] for a detailed discussion on these technologies. Deriving biological insights from datasets obtained using such platforms with either huge spatial or genomic profiles, or both, not only poses numerous statistical challenges but also requires maximum computational efficiency [25].
A critical step in the analysis of ST datasets is to identify the genes whose level of expression co-varies with the spatial locations across the tissue. These genes, often referred to as spatially variable genes (SVGs), can be used in downstream analyses, such as identifying potential markers for biological processes and defining areas in the tissue that dictate cellular differentiation and function [26–29]. For example, Wang et al. (2020) [30] analyzed an ST dataset on the tumor microenvironment (TME) of three tissue sections from a prostate cancer subject [31]. In every tissue section, a unique set of spatially variable metabolic genes were identified, which could arguably be used to guide targeted tissue-specific therapy. A simplistic approach for detecting SVGs could be to identify spatially located layers or cell types (if any) based on either a priori biological knowledge or using popular software, such as RCTD [32] and Seurat [33], with the transcriptional profiles, and then checking which genes exhibit highly enriched expression in a particular spatial layer or cell type. However, such an approach would achieve satisfactory performance only if the layers or cell types are spatially well-separated, and always be sensitive to the quality of the layer or cell type-identification step [34]. In recent years, more sophisticated methods have been developed to identify SVGs, a systematic overview of some of which can be found in Li et al. (2021) [35]. The methods can be broadly classified into three types: a) based on statistical modeling, b) based on machine learning or neural network, and c) based on graphical networks or spatial grids. Some of the notable methods of each type are, type (a): Trendsceek [36], SpatialDE [34], SPARK [37], SPARK-X [38], Boost-GP [39], and nnSVG [40], type (b): SPADE [41], SOMDE [33], and SpaGCN [42], and type (c): HMRF [43], MERINGUE [44], Binspect-Giotto [45], Boost-MI [46], ScGCO [47], and SpaGene [48]. 
We focus on methods of types (a) and (c) in this manuscript.
The statistical power of the methods varies greatly with the gene expression patterns and the spatial structure of ST datasets. The methods encounter different levels of computational complexity based on two quantities, N and K, denoting the numbers of cells/spots and genes, respectively. SpatialDE [34] is one of the earliest methods of type (a). It employs a Gaussian process (GP) regression model [49] with kernel-based covariance matrices [50] of multiple types, such as linear, Gaussian, and cosine, computed using the distance between the spatial coordinates of the cells. The model decomposes the total variability of a gene's expression into two components, spatial and error variance. A significantly large value of the spatial variance would imply that the gene is spatially variable. Borrowing an efficient estimation algorithm from the statistical genetics literature [51], SpatialDE manages to estimate the variance components with a reasonable degree of computational efficiency, requiring $O(N^3 + N^2 K)$ floating-point operations (FLOPs). A newer method named SPARK [37] extends the framework of SpatialDE by considering a generalized linear spatial model (GLSM) [52] with a Poisson distribution, which is argued to be better suited for modeling the raw count data from the ST platforms directly. However, the penalized quasi-likelihood (PQL) approach [53] used for parameter estimation in SPARK is extremely computationally demanding with a complexity of $O(N^3 K)$, making it unusable for a transcriptome-wide analysis when N is moderately large (N > 3,000). To this end, a non-parametric, highly scalable method named SPARK-X [38] has been recently developed, requiring just linear complexity w.r.t. N. It is based on the robust covariance testing framework [54] that compares the linear kernel-based covariance matrices of the gene expression and the spatial coordinates.
However, using a linear kernel makes SPARK-X equivalent to fitting a multiple linear regression model [55] with the gene expression as the dependent variable and the spatial coordinates (or some transformation of these) as the predictors, and testing whether the fixed-effect coefficients differ from zero. Thus, it is only capable of detecting spatial dependencies or patterns that manifest linearly in the mean or expected value of the gene expression, also known as first-order dependencies, and drastically loses power in complex scenarios, as shown later. Zhu et al. (2021) [38] have partially acknowledged this issue, with their primary focus being computational scalability.
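The first-order limitation described above can be illustrated with a minimal sketch. It is not the SPARK-X implementation: the function name `linear_kernel_stat`, the toy coordinates, and the two toy genes are assumptions made purely for illustration. With linear kernels on both the expression and the coordinates, the centered covariance statistic reduces to the sum of squared sample covariances between the expression and each coordinate, so a pattern that is symmetric around the mean (here a cosine wave) contributes almost nothing.

```python
import numpy as np


def linear_kernel_stat(y, S):
    """Centered linear-kernel covariance statistic, trace(K_y K_s) / N^2.

    With K_y = y_c y_c^T and K_s = S_c S_c^T this equals ||S_c^T y_c||^2 / N^2,
    i.e. the sum of squared sample covariances between y and each coordinate.
    """
    y_c = y - y.mean()
    S_c = S - S.mean(axis=0)
    cross = S_c.T @ y_c
    return float(cross @ cross) / len(y) ** 2


rng = np.random.default_rng(0)
N = 2000
S = rng.uniform(-1.0, 1.0, size=(N, 2))       # hypothetical 2-D coordinates
noise = rng.normal(scale=0.1, size=N)

y_linear = S[:, 0] + noise                             # first-order (mean) trend
y_periodic = np.cos(2 * np.pi * S[:, 0] / 0.5) + noise  # periodic pattern

stat_linear = linear_kernel_stat(y_linear, S)
stat_periodic = linear_kernel_stat(y_periodic, S)
print(f"linear trend: {stat_linear:.4f}, periodic pattern: {stat_periodic:.6f}")
```

Both toy genes are clearly spatially structured, yet only the linear trend produces a large statistic, which is the behaviour the regression interpretation predicts.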
On the other hand, a popular method of type (c), MERINGUE [44], considers spatial autocorrelation and cross-correlation based on spatial neighborhood graphs to identify SVGs. Improving hugely on the complexity of MERINGUE, another model-free method named SpaGene [48] has been recently developed. It constructs a spatial network between cells/spots using the k-nearest neighbors approach, and then for each gene, extracts the subnetwork whose nodes have high gene expression. Then, it compares the observed degree distribution of the subnetwork to a distribution from a fully connected network using the earth mover’s distance [56]. It considers a permutation test [57] to obtain the p-value for every gene. SpaGene is highly comparable to SPARK-X w.r.t. computational complexity and is thus applicable to ST datasets with large N. However, the method is harder to interpret than the methods of type (a), cannot readily accommodate additional covariates, and also lacks power in various scenarios (see Simulations section).
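The graph-based idea above can be sketched roughly as follows. This is only loosely inspired by the description of SpaGene and is not its algorithm: the reference distribution here is pooled from permuted expression vectors rather than derived from a fully connected network, and the function names, the choice of `k = 8`, the top 20% threshold, and the toy hotspot gene are all assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import wasserstein_distance


def subnetwork_degrees(neighbors, expr, top_frac=0.2):
    """For each high-expression cell, count its kNN neighbours that are also high."""
    n_top = max(1, int(top_frac * len(expr)))
    is_high = np.zeros(len(expr), dtype=bool)
    is_high[np.argsort(expr)[-n_top:]] = True
    return is_high[neighbors[is_high]].sum(axis=1)


def knn_degree_pvalue(coords, expr, k=8, n_perm=200, seed=0):
    """Permutation p-value for the earth mover's distance between the observed
    degree distribution and a pooled reference from permuted expression."""
    rng = np.random.default_rng(seed)
    # Column 0 of the query result is the cell itself, so it is dropped.
    _, idx = cKDTree(coords).query(coords, k=k + 1)
    neighbors = idx[:, 1:]
    null_degrees = [
        subnetwork_degrees(neighbors, rng.permutation(expr)) for _ in range(n_perm)
    ]
    reference = np.concatenate(null_degrees)
    observed = wasserstein_distance(subnetwork_degrees(neighbors, expr), reference)
    null = np.array([wasserstein_distance(d, reference) for d in null_degrees])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)


rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 1.0, size=(500, 2))            # hypothetical cells
in_hotspot = np.linalg.norm(coords - 0.5, axis=1) < 0.25
hotspot_gene = in_hotspot + rng.normal(scale=0.1, size=500)
p_hotspot = knn_degree_pvalue(coords, hotspot_gene)
print(f"hotspot gene permutation p-value: {p_hotspot:.4f}")
```

The kNN graph is built once and reused across permutations, which is where the near-linear cost in N comes from in this kind of approach.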
We propose a non-parametric method, named SMASH, which achieves higher statistical power than both SPARK-X and SpaGene, while remaining computationally tractable. It augments the idea of SPARK-X in its use of the Hilbert-Schmidt independence criterion (HSIC) or robust covariance testing framework [54, 58] coupled with more general kernel-based spatial covariance matrices. With a computational complexity quadratic in N, SMASH sacrifices some degree of computational efficiency in favor of significantly higher detection power than both SPARK-X and SpaGene. However, it is worth highlighting that SMASH is notably faster than other type (a) methods, such as SpatialDE and SPARK, and can thus be thought of as a balanced alternative, fusing high detection power with a moderate degree of scalability. In varying simulation scenarios, we demonstrate that SMASH achieves highly consistent and superior performance as compared to the methods SPARK-X and SpaGene. Finally, our analysis of four large ST datasets from platforms like Slide-seq V2, Visium, and MERFISH using these three methods not only reveals exciting biological insights but also demonstrates SMASH’s capability of detecting SVGs that would otherwise be missed by either of the other two methods. A Python-based software implementation of SMASH is available at https://github.com/sealx017/SMASH-package, which returns the lists of SVGs detected by both SMASH and SPARK-X, allowing users to investigate the overlap between them.
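To make the kernel-based covariance test more concrete, the following is a minimal sketch of an HSIC-type statistic with a Gaussian spatial kernel. It is not the SMASH package's code: the permutation-based p-value, the single fixed lengthscale of 0.1, the function names, and the toy periodic gene are assumptions chosen for simplicity. Forming the N x N spatial kernel is what makes this kind of test quadratic in N.

```python
import numpy as np


def gaussian_kernel(S, lengthscale):
    """Gaussian (squared-exponential) kernel matrix on spatial coordinates."""
    sq_dist = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * lengthscale ** 2))


def hsic_permutation_test(y, S, lengthscale=0.1, n_perm=199, seed=0):
    """HSIC-type statistic trace(K_y K_s) / N^2 with a linear kernel on the
    expression and a Gaussian kernel on the coordinates, plus a permutation
    p-value (an assumption; analytic approximations are also possible)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    H = np.eye(n) - np.full((n, n), 1.0 / n)        # centering matrix
    K_s = H @ gaussian_kernel(S, lengthscale) @ H
    y_c = y - y.mean()

    def stat(v):
        # With K_y = v v^T the trace reduces to a quadratic form in v.
        return float(v @ K_s @ v) / n ** 2

    observed = stat(y_c)
    null = np.array([stat(rng.permutation(y_c)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value


rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(300, 2))            # hypothetical coordinates
y_periodic = np.cos(2 * np.pi * S[:, 0] / 0.5) + rng.normal(scale=0.1, size=300)
stat_periodic, p_periodic = hsic_permutation_test(y_periodic, S)
print(f"statistic: {stat_periodic:.4f}, permutation p-value: {p_periodic:.4f}")
```

The toy gene is a periodic pattern with no linear trend in the mean, which a non-linear spatial kernel can still pick up.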
Results
Simulations
We evaluated the performance of SMASH, SPARK-X, and SpaGene in three different simulation studies. We omitted SpatialDE and SPARK from the power comparison for two reasons: a) high computational requirements and b) these two methods have already been thoroughly studied in previous works [38, 48]. In simulation setup (1), we followed the procedure described in the SPARK-X manuscript [38]. In setups (2) and (3), we considered the Gaussian process (GP)-based spatial regression model from the SpatialDE manuscript [34], respectively with the Gaussian and cosine kernel-based covariance functions (see Eq (1)). In all the setups, three values of the number of cells (N) were considered, N = 1000, 5000, and 10,000. The spatial coordinates of the cells were simulated first, followed by the expression levels of K (500 or 1000) genes with varying levels of dependence. In setup (1), the expression levels were simulated using a negative binomial distribution, while in setups (2) and (3), the expression levels were simulated using a multivariate normal distribution. In all the setups, distinct spatial patterns were ensured to be present in the expression levels. Further details regarding the simulation setups are provided at the end of the Methods section. Figs 1, 2 and 3 respectively correspond to the three simulation setups, in which we display the simulated spatial patterns and the statistical power of the three methods for different parameter combinations.
A) Four spatial expression patterns that the genes were assumed to follow. B) Statistical power plots of the three methods, SMASH, SPARK-X, and SpaGene, under varying values of N and fold-size, for K = 500 genes at a level of α = 0.05. The results were averaged over five replications.
A) Four spatial expression patterns that were generated using Gaussian covariance matrices with four different values of the lengthscale l. B) Statistical power plots of the three methods under varying values of N and effect-size (h) for K = 1000 genes at a level of α = 0.05. The results were averaged over five replications.
A) Four spatial expression patterns that were generated using cosine covariance matrices with four different values of the period p. B) Statistical power plots of the three methods under varying values of N and effect-size (h) for K = 1000 genes at a level of α = 0.05. The results were averaged over five replications.
In simulation setup (1), SMASH and SPARK-X performed much better than SpaGene for all four spatial patterns, namely streak, reverse streak, hotspot, and reverse hotspot (Fig 1). SpaGene performed particularly poorly for the streak and hotspot patterns. The power of SMASH and SPARK-X steadily increased as N and the fold-change parameter increased. Note that a fold value of 1 implied no spatial association while a larger value indicated higher spatial association. This particular simulation setup favored SPARK-X in the sense that the spatial variability of the expression was of the first order, manifesting entirely through the mean or expectation. Even in this scenario, SMASH managed to achieve similar power.
In simulation setups (2) and (3), the spatial variability of the expression was of higher order, manifesting through the covariance. In setup (2), which involved the Gaussian covariance function, SMASH performed the best followed by SPARK-X and then SpaGene in most cases. SMASH performed the best in setup (3) as well. However, SpaGene achieved better power than SPARK-X here. SPARK-X had almost zero power in many of the cases, especially when the period p was small (p = 0.5, 1), demonstrating its lack of robustness under complicated spatial dependency structures.
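A rough sketch of how expression with second-order (covariance-based) spatial dependence can be generated is given below for the Gaussian kernel of setup (2). The exact model is Eq (1) of the Methods and is not reproduced here; the way the effect size h mixes the spatial kernel with independent noise, the jitter term, and the function names are assumptions for illustration only.

```python
import numpy as np


def gaussian_cov(S, lengthscale):
    """Gaussian kernel covariance matrix with lengthscale l."""
    sq_dist = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * lengthscale ** 2))


def simulate_gp_genes(S, lengthscale, h, n_genes, rng):
    """Draw n_genes zero-mean multivariate normal expression vectors with
    covariance h * K + (1 - h) * I (assumed parameterisation, 0 <= h < 1)."""
    n = S.shape[0]
    cov = h * gaussian_cov(S, lengthscale) + (1.0 - h) * np.eye(n)
    chol = np.linalg.cholesky(cov + 1e-8 * np.eye(n))   # small jitter for stability
    return (chol @ rng.standard_normal((n, n_genes))).T  # shape (n_genes, N)


rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=(400, 2))                 # hypothetical coordinates
Y_spatial = simulate_gp_genes(S, lengthscale=0.2, h=0.9, n_genes=5, rng=rng)
Y_null = simulate_gp_genes(S, lengthscale=0.2, h=0.0, n_genes=5, rng=rng)
print(Y_spatial.shape, Y_null.shape)
```

Every simulated gene has mean zero at every location, so the spatial signal lives entirely in the covariance, which is why a mean-based (first-order) test has little to work with in such setups.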
We compared the run-time of the methods in the simulation setup (2) for varying numbers of cells, N = 1000, 5000, and 10,000 (Table 1). Since the computational complexity of the algorithms mainly differs w.r.t. N and not the number of genes K, we kept K = 1000. As expected, the run-time of SMASH increased approximately quadratically w.r.t. N. SPARK-X and SpaGene were both extremely fast owing to their linear complexity w.r.t. N. We also added SpatialDE to this comparison to show how computationally intensive it can be to fit a fully parametric model in such a context. We omitted SPARK entirely as it is much slower than even SpatialDE, with a computational complexity of $O(N^3 K)$.
The table lists the theoretical complexity and run-time (in seconds) of the four methods, SMASH, SPARK-X, SpaGene, and SpatialDE, in a simulation setup with K = 1000 genes and varying numbers of cells N. The number of spatial coordinates d was equal to 2. *SpaGene constructs multiple kNN graphs and performs permutation tests. We are only listing the complexity of the kNN algorithm.
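As a back-of-the-envelope illustration of the theoretical complexities discussed above (leading-order operation counts only, ignoring constants and lower-order terms, and therefore not a prediction of the measured run-times), going from N = 1000 to N = 10,000 cells with K = 1000 genes scales the work as follows. The linear row corresponds to SPARK-X and SpaGene, the quadratic row to SMASH, and the last row to SpatialDE.

```latex
\begin{align*}
\text{linear in } N:\quad & \frac{10^{4}}{10^{3}} = 10,\\
\text{quadratic in } N:\quad & \left(\frac{10^{4}}{10^{3}}\right)^{2} = 100,\\
O(N^3 + N^2 K),\ K = 10^{3}:\quad &
\frac{(10^{4})^{3} + (10^{4})^{2}\cdot 10^{3}}{(10^{3})^{3} + (10^{3})^{2}\cdot 10^{3}}
= \frac{1.1\times 10^{12}}{2\times 10^{9}} = 550.
\end{align*}
```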
Application to real data
We applied the methods, SMASH, SPARK-X, and SpaGene, to four datasets: 1) mouse cerebellum data collected using Slide-seq V2 [19, 59], 2) human dorsolateral prefrontal cortex (DLPFC) data collected using Visium [11], 3) small cell carcinoma of the ovary hypercalcemic type (SCCOHT) data collected using Visium [11], and 4) mouse hypothalamus data collected using MERFISH [60, 61]. The datasets have varying numbers of genes and spots/cells.
Mouse cerebellum by Slide-seqV2.
The mouse cerebellum data [19] has 20,117 genes and 11,626 spots. We restricted our focus to the 7,653 genes that express in more than 1% of the spots. The mouse cerebellum is made of four spatial layers: white matter layer (WML), granule layer (GL), Purkinje layer (PL), and molecular layer (ML) [62]. These layers consist of different types of cells. For example, WML contains oligodendrocytes, GL contains granule cells, PL contains Purkinje neurons and Bergmann glia, and ML contains molecular layer interneurons (MLI). These cell types can be inferred based on just the transcriptional profiles using cell clustering software like RCTD [32]. We display the inferred cell types overlaid on the spatial locations in Fig 4. Out of the 7,653 genes, SMASH identified 1173 genes to be spatially variable (adjusted p-value: $p_{\text{adjust}}$ < 0.05). SPARK-X and SpaGene respectively detected 608 and 518 genes, and the overlaps between the SVGs detected by the three methods are displayed in a Venn diagram (Fig 4). We noted that many of the SVGs detected by SPARK-X and SpaGene were not shared between the two methods. SMASH, on the other hand, could identify almost all the genes detected by those two methods, especially SPARK-X, while detecting an additional 363 SVGs.
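The gene-filtering step mentioned above (keeping genes that express in more than 1% of the spots) can be sketched as follows. The definition of "expressed" as a strictly positive count, the genes-by-spots array layout, and the function name are assumptions for illustration; the toy matrix is made up.

```python
import numpy as np


def filter_expressed_genes(counts, min_frac=0.01):
    """Keep genes with a positive count in MORE than min_frac of the spots.

    counts: array of shape (n_genes, n_spots). Returns the filtered matrix
    and the boolean mask of retained genes.
    """
    frac_expressed = (counts > 0).mean(axis=1)
    keep = frac_expressed > min_frac
    return counts[keep], keep


# Toy example with 3 genes and 200 spots (values are made up).
counts = np.zeros((3, 200), dtype=int)
counts[0, :1] = 5     # expressed in 0.5% of spots, filtered out
counts[1, :3] = 2     # expressed in 1.5% of spots, kept
counts[2, :50] = 1    # expressed in 25% of spots, kept
filtered, keep = filter_expressed_genes(counts)
print(keep, filtered.shape)
```

The same helper would cover the 5% threshold used for the SCCOHT data by passing `min_frac=0.05`.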
A) Location of the major cell types corresponding to the four spatial layers of the mouse cerebellum. B) Overlap between the detected SVGs by the three methods. C) Enrichment scores of the methods in the four spatial layers.
Next, we performed two types of enrichment analysis. First, we compared the performance of the methods in different layers by computing their enrichment scores (ES) following Liu et al. (2022) [48]. It is based on the expectation that the genes which abundantly express themselves in the four spatial layers should be identified and ranked at the top by the methods. In that regard, we noticed that SPARK-X performed poorly in the PL, whereas SpaGene performed poorly in the WML. SMASH, on the other hand, consistently achieved similar or better performance compared to the other two methods in all four layers. Second, we performed functional enrichment analysis of the following four sets of SVGs: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH. The expression patterns of three representative genes of the enriched pathways for each of these four sets of genes are shown in Fig 5. For set (a), top enriched Gene Ontology (GO) terms, such as GO: 0098916 (anterograde trans-synaptic signaling), GO: 0007268 (chemical synaptic transmission), and GO: 0099536 (synaptic signaling), were broadly associated with synaptic regulation. The protein-coding genes Fam107a, Ppp3ca, and Calm1 appeared in these top pathways. Fam107a seems to express in the PL, whereas the other two express in the GL (Fig 5). For set (b), the top GO terms including GO: 0006873 (intracellular monoatomic ion homeostasis), GO: 0030003 (intracellular monoatomic cation homeostasis), and GO: 0098771 (inorganic ion homeostasis) were associated with ion homeostasis. The representative genes Atp1a3 and Thy1 express in the PL while Calm3 expresses in the GL.
For set (c), the top pathways including GO: 0006811 (monoatomic ion transport), GO: 0006812 (monoatomic cation transport), and GO: 0098655 (monoatomic cation transmembrane transport) were associated with ion transportation. The representative genes Pllp and Efnb3 express in the WML, whereas Cox7a2 expresses roughly in the GL. For set (d), the top enriched GO terms, such as GO: 0044057 (regulation of system process) and GO: 0050877 (nervous system process), were associated with regulating different types of system processes. The representative genes Gls, Tmem36a, and Coro2b roughly express in the GL.
Three representative genes from the detected pathways for the four sets of genes: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH.
Human DLPFC by Visium.
The human dorsolateral prefrontal cortex (DLPFC) data [11] has 33,538 genes and 3,639 spots. We focused on the 13,783 genes which express in more than 1% of the spots. Every spot belongs to one of the six manually labeled cortical layers or the white matter layer (WML) (Fig 6). SMASH and SPARK-X identified 10,871 and 10,416 SVGs respectively ($p_{\text{adjust}}$ < 0.05), whereas SpaGene identified only 2379. The overlaps between the SVGs detected by the three methods are displayed in a Venn diagram (Fig 6). We noted that almost all the genes detected by SpaGene were also detected by both SMASH and SPARK-X. SMASH and SPARK-X detected many additional SVGs. We performed functional enrichment analysis of the two sets of detected genes: a) the common genes identified by all three methods and b) the genes identified by SMASH and SPARK-X but not by SpaGene. For set (a), top enriched GO terms, such as GO: 0099537 (trans-synaptic signaling) and GO: 0099177 (regulation of trans-synaptic signaling), were associated with synaptic signaling. For set (b), top enriched GO terms like GO: 0006397 (mRNA processing) and GO: 0000375 (RNA splicing, via transesterification reactions) were associated with RNA processing. The expression of three representative genes from set (b) is displayed in Fig 6. There seemed to be a gradient spatial pattern of expression for all three genes which SpaGene failed to detect. Similar to the previous section, we computed the enrichment score (ES) of every method in the seven manually labeled spatial layers. From Fig 6, we noticed that SpaGene performed poorly in terms of ES, especially in Layers 1 and 6. We also performed an additional check as follows. There are three cortical-layer associated SVGs, MOBP, SNAP25, and PCP4, and three blood and immune-related SVGs, HBB, IGKC, and NPY, known to be spatially variable from previous studies [11].
We checked how many of these genes appeared in the lists of the top thousand SVGs (in terms of $p_{\text{adjust}}$) by the three methods. SMASH and SpaGene respectively ranked five and six of these SVGs, whereas SPARK-X ranked only two cortical-layer associated genes.
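The check described above amounts to ranking genes by adjusted p-value and intersecting the top of the list with the known SVGs. A minimal sketch follows; the function name is hypothetical, and the toy gene list and p-values are made up purely to show the mechanics (they are not results from any of the three methods).

```python
import numpy as np


def known_genes_in_top(gene_names, p_adjusted, known_genes, top_n=1000):
    """Return the known genes that fall among the top_n smallest adjusted p-values."""
    order = np.argsort(p_adjusted, kind="stable")[:top_n]
    top = {gene_names[i] for i in order}
    return [g for g in known_genes if g in top]


known = ["MOBP", "SNAP25", "PCP4", "HBB", "IGKC", "NPY"]

# Toy input: the six known genes plus four placeholder genes, with made-up p-values.
toy_genes = known + [f"GENE{i}" for i in range(4)]
toy_padj = np.array([1e-9, 1e-8, 0.2, 1e-7, 0.5, 0.9, 1e-3, 1e-4, 0.3, 0.7])

found = known_genes_in_top(toy_genes, toy_padj, known, top_n=5)
print(found)  # ['MOBP', 'SNAP25', 'HBB']
```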
A) Manually labeled cortical layers (layers 1–6) and white matter layer (WML). B) Overlap between the detected SVGs by the three methods. C) Expression of three representative genes identified only by SMASH and SPARK-X. D) Enrichment scores of the methods in different layers.
SCCOHT by Visium.
The small cell carcinoma of the ovary hypercalcemic type (SCCOHT) data [63] has 15,229 genes and 2071 cells. We restricted our focus to the 12,001 genes that express in more than 5% of the cells. Sanders et al. (2022) [63] grouped the cells into twelve clusters based on the expression profile of a selected few genes, using Seurat [33], which we display in Fig 7. SMASH, SPARK-X, and SpaGene respectively detected 9361, 6564, and 6899 SVGs ($p_{\text{adjust}}$ < 0.05). The overlaps between the SVGs detected by the three methods are displayed in a Venn diagram (Fig 7). SMASH could detect most of the SVGs identified by at least one of the other two methods and an additional 1634 genes. Similar to the analysis of the mouse cerebellum data, we checked if the methods could identify the top genes that show enriched expression in the twelve spatially well-separated clusters found by Sanders et al. (2022). We computed the enrichment scores (ES) of the methods for each of the clusters (Fig 7). SMASH achieved consistently higher ES for all the clusters while SpaGene was the second best in most cases. Additionally, in Fig 8, we show the expression of three chosen genes from each of the following four sets of SVGs: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH. We also checked the clinical relevance of these genes in the existing literature. For example, CITED4, which was detected to be an SVG by all three methods, has been found to be associated with lung adenocarcinoma [64]. From set (b), ELF4A1 has been found to be associated with gastric cancer [65]. EZH2, from set (c), is a well-known marker associated with the development and progression of different types of cancer [66, 67]. Sanders et al. (2022) [63] also found the expression of EZH2 to be highly variable across their identified spatial clusters. Finally, from set (d), SEMA4F has been found to be associated with endometrial cancer [68].
A) Pre-identified clusters of cells using Seurat. B) Overlap between the detected SVGs by the three methods. C) Enrichment scores of the methods in different clusters.
Three representative genes from the four sets of SVGs: a) the common genes identified by all three methods, b) the genes identified by SMASH and SpaGene but not by SPARK-X, c) the genes identified by SMASH and SPARK-X but not by SpaGene, and d) the genes identified only by SMASH.
Mouse hypothalamus by MERFISH.
The mouse hypothalamus data [60] has 161 genes and 5665 cells. Of these, 156 genes are pre-selected markers for different cell types and can thus be expected to be highly variable, whereas the other five are control genes. The cell types, such as endothelial, ependymal, and inhibitory, can be identified based on the transcriptional profiles of the markers. The spatial organizations of a few major cell types are shown in Fig 9. SMASH was able to detect 139 genes, whereas SPARK-X and SpaGene detected 127 and 124 genes, respectively ($p_{\text{adjust}}$ < 0.01). The overlaps between the SVGs detected by the three methods are shown in Fig 9. SMASH identified all the SVGs SPARK-X could detect, while SpaGene identified one additional SVG. It should be highlighted that all the methods classified the five control genes as not spatially variable. We display the expression of two representative genes from three sets of genes: a) the genes identified only by SMASH and SpaGene, b) the genes identified only by SMASH and SPARK-X, and c) the genes identified only by SMASH. We did not focus on the common genes because they have been extensively studied in earlier literature, such as the work of Liu et al. (2022) [48]. The genes Npy1r and Cplx3 belonged to set (a), and are known to be enriched in inhibitory and excitatory neurons [69, 70]. Rxfp1 and Ntsr1 belonged to set (b). Even though both genes are known to express in inhibitory and excitatory neurons, Rxfp1 seems to express in ependymal cells as well. Galr2 and Crhr1 are two genes from set (c) which express in multiple cell types including inhibitory cells and astrocytes.
A) Overlap between the detected SVGs by the three methods. B) Spatial organization of a few major cell types. C) Expression of two representative genes from each of the three sets, a) the genes identified only by SMASH and SpaGene, b) the genes identified only by SMASH and SPARK-X, and c) the genes identified only by SMASH.
Discussion
We have proposed a novel non-parametric method SMASH for detecting spatially variable genes (SVGs) in the context of large-scale spatial transcriptomics (ST) datasets. In comparison to existing scalable approaches, SMASH achieves superior power in both complex simulation scenarios and real data analyses while remaining computationally tractable.
Recently developed spatial transcriptomics platforms produce high-dimensional datasets [18–20] in terms of the number of cells and the number of genes. In such large datasets, fully parametric approaches for detecting SVGs, such as SpatialDE [34] and SPARK [37], albeit statistically powerful, become intractable owing to their high computational demand. Computationally efficient non-parametric alternatives, such as SPARK-X [38] and SpaGene [48], on the other hand, can often turn out to be significantly less powerful. In our method SMASH, we strive to find a balance between these two issues, achieving higher statistical power while attaining a moderate degree of scalability. We augment the kernel-based covariance testing framework [54], used before in SPARK-X, by accounting for more complex spatial dependencies.
In three different simulation setups, one following the SPARK-X manuscript [38] and the other two following the framework of SpatialDE [34], we evaluated the performance of SMASH, along with two other methods, SPARK-X and SpaGene, in terms of type 1 error and power. SMASH achieved consistently similar or better power than the other two methods in all the simulation setups for all combinations of the varying parameters. In contrast, both SPARK-X and SpaGene behaved unpredictably, achieving almost zero detection power in many of the cases. This demonstrated their lack of robustness and failure to capture complicated structures of spatial dependency in the gene expression. In the run-time comparison of the methods, we showed that SMASH, although slower than SPARK-X and SpaGene, remained fairly tractable and was almost ten times faster than a fully parametric approach like SpatialDE. SMASH, SPARK-X, and SpaGene were then applied to four real datasets: 1) mouse cerebellum data collected using Slide-seq V2 [19], 2) human dorsolateral prefrontal cortex data collected using Visium [11], 3) small cell carcinoma of the ovary hypercalcemic type data collected using Visium [11], and 4) mouse hypothalamus data collected using MERFISH [60]. We compared the methods via a number of avenues: a) checking the overlap between the SVGs detected by the three methods, b) computing enrichment scores (ES) of the methods in different spatial layers or cell types identified based on the transcriptional profiles using popular software, such as RCTD [32] and Seurat [33], and c) investigating the functional enrichment of the genes that were detected by SMASH but remained undetected by at least one of the other two methods. For all the datasets, SMASH detected more SVGs than the other two methods, which included nearly all of the SVGs detected by SPARK-X. SMASH could also detect most of the SVGs that were identified by SpaGene but not by SPARK-X.
For example, in data (1), from the 7,653 genes after quality control, SMASH identified 1173 SVGs which included 607 out of the 608 SVGs SPARK-X could detect. Out of the 518 SVGs detected by SpaGene, only 248 were also detected by SPARK-X, while SMASH detected 451 of them. It is important to highlight that SMASH produced calibrated p-values in the null simulations from all of these datasets, lending credibility to these higher numbers of detected SVGs. In the same dataset, SMASH achieved a higher enrichment score (ES) than the other two methods in different pre-identified spatially separated layers or cell types of the mouse cerebellum. A higher ES implied better capability to identify the genes that showed highly variable expression in a particular spatial layer compared to the rest. In the other datasets as well, SMASH consistently achieved better ES in different spatially localized cell types. We also studied the functional properties and clinical significance of the identified SVGs. For example, in data (3), the gene EZH2 was detected to be spatially variable by SMASH and SPARK-X. EZH2 is a known marker for the progression of different types of cancers [66, 67].
In all the methods we have discussed, including SMASH, the biology of a single tissue section from a single subject is explored at a time. This means that if we have multiple tissue sections, either from the same subject or from multiple subjects, the methods have to identify SVGs in each section individually, disregarding the information shared within and across subjects. Thus, in the future, we would like to extend SMASH in a hierarchical fashion to jointly analyze more than one tissue section or subject. Another important functionality that we would like to incorporate is the ability to classify genes based on the similarity of their spatial expression patterns. For example, SpatialDE [34] addresses this with a hierarchical Bayesian mixture model approach that suffers from extremely high computational demand. SpaGene [48] considers a non-negative matrix factorization [71] of the expression data to identify similarly expressed genes. This approach, although computationally feasible, does not take the spatial locations into account directly and can thus be suboptimal in capturing truly spatial patterns. In the future, we would like to study this problem in greater depth and pursue methodological development in this area. Finally, we would like to explore the possibility of using SMASH in the context of multiplex immunohistochemistry (mIHC) datasets [72, 73], where the goal is to identify spatially variable cell types and their interactions.
Materials and methods
We briefly discuss some of the existing methods, namely SpatialDE [34], SPARK [37], SPARK-X [38], and SpaGene [48], and then present the proposed method SMASH. Note that we did not compare SMASH to either SpatialDE or SPARK in our Results section, except for the time comparison, primarily due to their high computational demand and the fact that they have already been studied in great detail in earlier works. However, we still discuss their modeling frameworks to facilitate comparisons. Let us introduce some notation. Suppose there is a single subject (image) with N cells/spots and the expression profile of K genes is observed in the cells. For the i-th cell, let $s_i$ denote its location, i.e., a vector of spatial (two- or three-dimensional) coordinates, and $y_{ki}$ denote the expression of the k-th gene in the cell. Let us also define $y_k = (y_{k1}, \ldots, y_{kN})^T$ and $S = (s_1, \ldots, s_N)^T$. For the sake of simplicity, we assume that there are no additional covariates, but in all the methods except SpaGene, covariates can be readily incorporated.
A brief overview of existing methods
SpatialDE.
SpatialDE uses a Gaussian process (GP)-based spatial regression model [49, 74], which has the following form in a finite sample,

$$y_k \sim \mathcal{N}_N\left(\mu_k \mathbf{1},\ \sigma_k^2 \Sigma + \tau_k^2 I\right), \qquad \Sigma = [[\Sigma_{ij}]]_{N \times N}, \qquad \Sigma_{ij} = \exp\left(-\frac{\|s_i - s_j\|^2}{2 l^2}\right), \tag{1}$$

where $\mathbf{1}$ denotes the N-length vector of all 1’s, I denotes the N-dimensional identity matrix, and Σ denotes a Gaussian covariance matrix. $\|\cdot\|$ denotes the Euclidean norm, and the hyperparameter l, known as the characteristic lengthscale [75, 76], controls how rapidly the covariance decays as a function of the spatial distance. The fixed effect $\mu_k$ accounts for the mean expression level, $\sigma_k^2$ accounts for the expression variance attributable to spatial effects, and $\tau_k^2$ accounts for the remaining non-spatial variance. A large value of $\sigma_k^2$ should imply that the gene shows differential spatial expression. To formally test the hypothesis $H_0: \sigma_k^2 = 0$ against $H_1: \sigma_k^2 > 0$, SpatialDE considers the likelihood ratio test (LRT) [77]. To estimate the model parameters under the full model, the log-likelihood corresponding to Eq (1) is optimized w.r.t. ($\mu_k, \sigma_k^2, \tau_k^2$) using an efficient algorithm by Lippert et al. (2011) [51]. Ideally, it is desirable to optimize over the hyperparameter l as well, but for the sake of computational feasibility, l is kept fixed at a few carefully chosen values. For every choice of Σ, to analyze all K genes, the efficient algorithm requires just one computationally demanding step with a complexity of $O(N^3)$, instead of the $O(N^3 K)$ incurred in naive algorithms. Along with the Gaussian covariance function, SpatialDE also considers linear and cosine covariance functions to construct Σ, and finally combines all the LRT values corresponding to different choices of Σ for the inference. For a particular Σ, the computational complexity of SpatialDE is $O(N^3 + N^2 K)$.
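To make the role of the lengthscale concrete, the following NumPy sketch builds a Gaussian covariance matrix from a set of locations. It assumes the usual squared-exponential form, with entries exp(-||s_i - s_j||^2 / (2 l^2)); the function name `gaussian_covariance` and the toy coordinates are illustrative and are not taken from the SpatialDE or SMASH packages.

```python
import numpy as np

def gaussian_covariance(S, lengthscale):
    """Return the N x N matrix with entries exp(-||s_i - s_j||^2 / (2 l^2))."""
    S = np.asarray(S, dtype=float)
    sq_norms = np.sum(S**2, axis=1)
    # ||s_i - s_j||^2 = ||s_i||^2 + ||s_j||^2 - 2 s_i . s_j
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (S @ S.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * lengthscale**2))

# Three toy locations in two dimensions (hypothetical values).
S_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Sigma_demo = gaussian_covariance(S_demo, lengthscale=1.0)

# A larger lengthscale makes the covariance decay more slowly with distance.
Sigma_short = gaussian_covariance(S_demo, lengthscale=0.5)
Sigma_long = gaussian_covariance(S_demo, lengthscale=2.0)
```

Comparing `Sigma_short` with `Sigma_long` shows that the same pair of locations is treated as more strongly correlated when the lengthscale is larger.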
SPARK and SPARK-X.
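SPARK [37] extends Eq (1) by considering a generalized linear spatial model (GLSM) [52] with a Poisson distribution as

$$y_{ki} \mid \lambda_k(s_i) \sim \text{Poisson}\left(\lambda_k(s_i)\right), \quad i = 1, \ldots, N, \qquad \log \lambda_k \sim \mathcal{N}_N\left(\mu_k \mathbf{1},\ \sigma_k^2 \Sigma + \tau_k^2 I\right), \tag{2}$$

where $\lambda_k = (\lambda_k(s_1), \ldots, \lambda_k(s_N))^T$ and the logarithm is applied element-wise. For cell i, $\lambda_k(s_i)$ is an unknown Poisson rate parameter that represents the underlying gene expression. The variance parameters, $\sigma_k^2$ (spatial) and $\tau_k^2$ (non-spatial), have similar interpretations as earlier. To test $H_0: \sigma_k^2 = 0$, SPARK uses the score test [78]. Parameter estimation and inference are very hard in a GLSM, which is why SPARK uses an approximate algorithm based on the penalized quasi-likelihood (PQL) approach [53, 79]. The approach has a computational complexity of $O(N^3)$ for every gene, or $O(N^3 K)$ in total. Thus, it scales poorly to large datasets.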
Improving upon SPARK’s scalability, a non-parametric method named SPARK-X [38] has recently been proposed. The method is built on a simple intuition: if $y_k$ is independent of S, the spatial distance between two locations i and j should be independent of the difference in gene expression between the two locations. It computes the expression covariance matrix, $E_k = y_k (y_k^T y_k)^{-1} y_k^T$, and the distance covariance matrix, $D = S(S^T S)^{-1} S^T$, and constructs the test statistic as

$$T_k^{\text{SPARK-X}} = \frac{\mathrm{tr}(E_k D)}{N},$$

where tr() denotes the trace operator. Assume $y_k$ to be mean-standardized for the sake of simplicity. Under the null hypothesis of no association, $T_k^{\text{SPARK-X}}$ asymptotically follows a weighted mixture of independent $\chi_1^2$ distributions. The weights are the products of the ordered eigenvalues of the matrices $E_k$ and D. SPARK-X requires a computational complexity of just $O(N d^2)$ for every gene, or $O(N K d^2)$ in total, where d is the dimension of the location space $\mathbb{R}^d$, e.g., d = 3 if $s_i \in \mathbb{R}^3$. A complexity that is linear in N makes SPARK-X easily applicable to large-scale ST datasets. SPARK-X also considers several element-wise non-linear transformations of S as g(S), where g is a Gaussian or cosine transformation (not to be confused with Gaussian or cosine kernels), and repeats the above testing procedure replacing S with g(S). The p-values are combined using a Cauchy p-value combination rule [80].
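As an illustration of the combination step, a Cauchy combination of p-values in the style of ACAT [80] can be sketched with the standard library alone. The equal default weights are an assumption of this sketch, the function name `cauchy_combine` is hypothetical, and the code does not reproduce the SPARK-X implementation.

```python
import math

def cauchy_combine(pvalues, weights=None):
    """Combine p-values through a weighted sum of Cauchy-transformed values."""
    pvalues = list(pvalues)
    if weights is None:
        # Assumption of this sketch: equal weights that sum to one.
        weights = [1.0 / len(pvalues)] * len(pvalues)
    # Each p-value is mapped to a standard Cauchy variate under the null.
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    # Upper-tail probability of the standard Cauchy distribution at `stat`.
    return 0.5 - math.atan(stat) / math.pi

p_single = cauchy_combine([0.3])
p_neutral = cauchy_combine([0.5, 0.5])
p_mixed = cauchy_combine([0.01, 0.02, 0.9])
```

A single p-value is returned unchanged, and one very small p-value among several tends to dominate the combined result, which is the behaviour that makes this rule attractive when only some transformations of S capture the pattern.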
However, the form of D corresponds to a linear covariance function [75]. This makes SPARK-X equivalent to performing a multiple linear regression of $y_k$ on S or g(S) and testing whether the fixed effect parameters differ from zero. Thus, SPARK-X is only capable of detecting first-order spatial dependencies and, as shown in the Results section, severely lacks power for higher-order dependencies.
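The link to linear regression can be checked numerically. The sketch below evaluates a statistic of the form tr(E D)/N, with E = y (y^T y)^{-1} y^T and D = S (S^T S)^{-1} S^T, in O(N d^2) time. It assumes that y and the columns of S are mean-centred, in which case N times the statistic equals the R^2 of regressing y on S. All names and the toy data are illustrative and are not part of the SPARK-X software.

```python
import numpy as np

def linear_kernel_statistic(y, S):
    """tr(E D) / N, using only d x d linear algebra (cost O(N d^2))."""
    y = np.asarray(y, dtype=float)
    S = np.asarray(S, dtype=float)
    n = y.shape[0]
    yc = y - y.mean()            # centre the expression vector
    Sc = S - S.mean(axis=0)      # centre each coordinate (assumption of this sketch)
    Sty = Sc.T @ yc              # d-vector
    StS = Sc.T @ Sc              # d x d matrix
    quad = Sty @ np.linalg.solve(StS, Sty)   # equals y^T D y
    return quad / (n * (yc @ yc))

rng = np.random.default_rng(1)
S_demo = rng.uniform(size=(50, 2))
y_linear = 3.0 + 2.0 * S_demo[:, 0] - S_demo[:, 1]                  # exactly linear in the coordinates
y_ring = np.cos(4 * np.pi * np.linalg.norm(S_demo - 0.5, axis=1))   # non-linear, ring-like pattern

# N * T is the R^2 of the linear regression of y on S.
r2_linear = 50 * linear_kernel_statistic(y_linear, S_demo)
r2_ring = 50 * linear_kernel_statistic(y_ring, S_demo)
```

For the exactly linear pattern, `r2_linear` equals 1 up to round-off, whereas the ring-like pattern is only picked up to the extent that a plane through the coordinates happens to explain it.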
SpaGene.
A recently developed method, SpaGene [48], differs from the methods discussed so far in being model-free and graph-based. The intuition behind the method is that cells/spots with high gene expression are more likely to be spatially connected than expected at random. It constructs the k-nearest neighbor (kNN) graph based on spatial locations. Then, for each gene, it extracts a subnetwork comprising only cells/spots with high expression from the kNN graph. SpaGene quantifies the connectivity of the subnetwork using the earth mover’s distance (EMD) [56] between the degree distributions of the subnetwork and a fully connected one. To generate the null distribution of the EMD for inference, a permutation test is considered. For further details, we refer the readers to the original manuscript [48].
Proposed method: SMASH
Setup.
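We test the null hypothesis of $y_k$ and S being independent, i.e., $H_0: y_k \perp\!\!\!\perp S$, using a non-parametric kernel-based framework [58, 81–83]. Let $y_k$ and S have domains $\mathcal{Y}$ and $\mathcal{S}$, respectively. Denote $k_y: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ and $k_S: \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ to be two measurable positive definite (PD) kernels with the corresponding reproducing kernel Hilbert spaces (RKHSs) denoted by $\mathcal{H}_y$ and $\mathcal{H}_S$ on $\mathcal{Y}$ and $\mathcal{S}$, respectively. Then, the cross-covariance operator $\mathcal{C}_{yS}$ from $\mathcal{H}_S$ to $\mathcal{H}_y$ can be defined by the relation

$$\langle f_2, \mathcal{C}_{yS} f_1 \rangle = \mathrm{Cov}\left(f_2(y_k), f_1(S)\right) \quad \text{for all } f_1 \in \mathcal{H}_S,\ f_2 \in \mathcal{H}_y,$$

where $\langle \cdot, \cdot \rangle$ denotes an inner product. $\mathcal{C}_{yS}$ can be interpreted as a more general version of the covariance matrix on Euclidean spaces, representing higher-order correlations of $y_k$ and S through $f_2(y_k)$ and $f_1(S)$. Under additional regularity assumptions on the RKHSs $\mathcal{H}_y$ and $\mathcal{H}_S$ [58], it can be shown that testing $H_0: y_k \perp\!\!\!\perp S$ is equivalent to testing $H_0: \mathcal{C}_{yS} = 0$. This testing can be performed using a test statistic of the form $\mathrm{tr}(G_y G_S)/N$, where $G_y$ and $G_S$ are the N × N kernel covariance or Gram matrices obtained using the PD kernels $k_y$ and $k_S$ [75, 76]. In the context of real datasets, the exact choices for $k_y$ and $k_S$ are never known. Therefore, we consider different kernel choices and aggregate the results. Our test statistic has the form

$$T_k^{\text{SMASH}} = \frac{\mathrm{tr}(E_k G_S)}{N},$$

where $E_k = y_k (y_k^T y_k)^{-1} y_k^T$ is defined as earlier, i.e., $k_y$ is fixed to be a linear kernel, while $k_S$, and consequently $G_S$, is varied to have different forms as described next.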
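A statistic of the form tr(E G)/N, with E = y (y^T y)^{-1} y^T, never requires forming the N × N matrix E, because it reduces to the quadratic form y^T G y / (N y^T y). The sketch below illustrates this under the assumption that y is mean-centred and that G is any symmetric N × N spatial Gram matrix. The helper name `kernel_statistic`, the Gaussian Gram matrix, and its lengthscale of 0.3 are arbitrary choices for illustration, not the SMASH package's API.

```python
import numpy as np

def kernel_statistic(y, G):
    """tr(E G) / N with E = y y^T / (y^T y), evaluated as a quadratic form in O(N^2)."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()                 # mean-centre the expression vector
    n = yc.shape[0]
    return float(yc @ G @ yc) / (n * float(yc @ yc))

rng = np.random.default_rng(0)
S_demo = rng.uniform(size=(40, 2))    # toy locations
y_demo = rng.normal(size=40)          # toy expression vector

# Gaussian Gram matrix on the locations (lengthscale chosen arbitrarily).
sq_dist = ((S_demo[:, None, :] - S_demo[None, :, :]) ** 2).sum(axis=-1)
G_demo = np.exp(-sq_dist / (2 * 0.3**2))

T_quadratic = kernel_statistic(y_demo, G_demo)

# The same quantity through the explicit trace, for comparison only.
yc_demo = y_demo - y_demo.mean()
E_demo = np.outer(yc_demo, yc_demo) / (yc_demo @ yc_demo)
T_trace = np.trace(E_demo @ G_demo) / len(y_demo)
```

Both routes give the same number; the quadratic form is the one that scales, since it costs a single matrix-vector product per gene once G has been built.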
Kernels and hyperparameters.
In this work, we consider the spatial kernel covariance matrix $G_S$ to have three forms: a) the Gaussian kernel covariance matrix Σ defined in Eq (1), b) a cosine or periodic kernel covariance matrix of the form $\left[\left[\cos\left(2\pi \|s_i - s_j\| / p\right)\right]\right]_{N \times N}$, where the parameter p is known as the period, and c) the linear kernel-based covariance matrix D considered in SPARK-X (with Gaussian and cosine transformations of S as well). For the Gaussian and cosine covariance matrices, we consider ten data-driven fixed values of the lengthscale l and period p, respectively (see S1 Text). Refer to Fig A in S1 Text for a visualization of the spatial patterns corresponding to the different kernel covariance matrices. In Table 2, we list the kernel covariance matrices used in different methods. Note that the SPARK-X test statistic $T_k^{\text{SPARK-X}}$ can be interpreted as a special case of the SMASH test statistic $T_k^{\text{SMASH}}$, as the former only considers linear kernel covariance matrices.
The table shows (yes/no) if a particular kernel covariance or Gram matrix is considered in different methods.
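For readers who want to experiment with the periodic kernel, the sketch below builds a cosine covariance matrix under the assumption that its entries are cos(2π ||s_i - s_j|| / p), the periodic form also used by SpatialDE. The function name and the toy coordinates are hypothetical, and the data-driven grid of periods described in S1 Text is not reproduced here.

```python
import numpy as np

def cosine_covariance(S, period):
    """N x N matrix with entries cos(2 * pi * ||s_i - s_j|| / period)."""
    S = np.asarray(S, dtype=float)
    diff = S[:, None, :] - S[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))      # pairwise Euclidean distances
    return np.cos(2.0 * np.pi * dist / period)

# Three collinear toy locations: distances are 1, 0.5 and 0.5.
S_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
C_demo = cosine_covariance(S_demo, period=1.0)
```

Locations separated by a full period are perfectly positively correlated, while locations half a period apart are perfectly negatively correlated, which is what lets this kernel capture repeating patterns.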
Distribution and computational complexity.
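For a particular choice of the spatial kernel covariance matrix $G_S$, the asymptotic null distribution of the SMASH test statistic $T_k^{\text{SMASH}} = \mathrm{tr}(E_k G_S)/N$ is a weighted mixture of independent $\chi_1^2$ distributions, where the weights are the products of the ordered eigenvalues of the matrices $E_k$ and $G_S$ [54, 58]. However, unlike the kernel choices of SPARK-X, $G_S$ does not always have a projection matrix-like structure as D, and thus, its eigenvalues cannot be computed with a complexity of $O(N d^2)$. Instead, computing them requires a complexity of $O(N^3)$, which becomes intractable as N increases. Therefore, we consider a variation of the Welch-Satterthwaite approximation [84, 85] to approximate the asymptotic null distribution of $T_k^{\text{SMASH}}$ with a gamma distribution [58] as below,

$$T_k^{\text{SMASH}} \sim \text{Gamma}(\kappa, \theta), \qquad \kappa = \frac{\mathbb{E}\left(T_k^{\text{SMASH}}\right)^2}{\mathrm{Var}\left(T_k^{\text{SMASH}}\right)}, \qquad \theta = \frac{\mathrm{Var}\left(T_k^{\text{SMASH}}\right)}{\mathbb{E}\left(T_k^{\text{SMASH}}\right)},$$

$$\mathbb{E}\left(T_k^{\text{SMASH}}\right) = \frac{\mathrm{tr}(E_k)\,\mathrm{tr}(G_S)}{N^2}, \qquad \mathrm{Var}\left(T_k^{\text{SMASH}}\right) = \frac{2\,\mathrm{tr}(E_k^2)\,\mathrm{tr}(G_S^2)}{N^4},$$

where κ and θ are the shape and scale parameters, and $\mathbb{E}$ and $\mathrm{Var}$ denote the expectation and variance, respectively. It is easy to verify that $\mathrm{tr}(E_k) = \mathrm{tr}(E_k^2) = 1$. Notice that we can now avoid any operation of complexity $O(N^3)$. Computation of $\mathrm{tr}(G_S^2)$ just requires a complexity of $O(N^2)$ using the property that $\mathrm{tr}(AB^T) = \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij} b_{ij}$ for two matrices, $A = [[a_{ij}]]_{N \times N}$ and $B = [[b_{ij}]]_{N \times N}$ [86]. Thus, for a particular choice of $G_S$, to analyze all K genes, SMASH requires a complexity of $O(N^2 K)$. This computational complexity is higher than that of SPARK-X, but we make that sacrifice to gain significantly more power, as shown in both simulation studies and real data analyses, while still achieving a moderate degree of scalability. It is worth pointing out that even though SMASH is non-parametric and does not make any distributional assumptions, $T_k^{\text{SMASH}}$ shares a close similarity with the SpatialDE model under some additional assumptions (see S1 Text).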
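The moment-matching idea can be sketched as follows: match a gamma distribution to an approximate null mean and variance of a statistic of the form tr(E G)/N, using only traces that cost O(N^2). The sketch double-centres the kernel matrix before taking traces, as is customary in the kernel independence tests of [54, 58]; whether the SMASH package does so internally is not stated here, so this is an assumption of the example, as are the function names, the toy data, and the use of `scipy.stats.gamma`.

```python
import numpy as np
from scipy.stats import gamma

def double_center(G):
    """Subtract row and column means in O(N^2) (equivalent to C G C with C = I - 11^T/N)."""
    G = np.asarray(G, dtype=float)
    return G - G.mean(axis=0, keepdims=True) - G.mean(axis=1, keepdims=True) + G.mean()

def gamma_approx_pvalue(y, G):
    """Statistic tr(E G)/N and its p-value from a moment-matched gamma distribution."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    yc = y - y.mean()
    Gc = double_center(G)                      # leaves the statistic unchanged since yc sums to zero
    stat = float(yc @ Gc @ yc) / (n * float(yc @ yc))
    mean = np.trace(Gc) / n**2                 # uses tr(E) = 1
    var = 2.0 * np.sum(Gc * Gc) / n**4         # tr(Gc^2) = sum of squared entries; uses tr(E^2) = 1
    shape, scale = mean**2 / var, var / mean   # gamma with the same mean and variance
    pval = float(gamma.sf(stat, a=shape, scale=scale))
    return stat, pval, shape, scale

rng = np.random.default_rng(2)
n_demo = 200
S_demo = rng.uniform(size=(n_demo, 2))
sq_dist = ((S_demo[:, None, :] - S_demo[None, :, :]) ** 2).sum(axis=-1)
G_demo = np.exp(-sq_dist / (2 * 0.3**2))       # Gaussian kernel, arbitrary lengthscale

y_spatial = S_demo[:, 0]                       # smooth gradient along the first coordinate
y_null = rng.normal(size=n_demo)               # no spatial structure

stat_s, p_spatial, shape_s, scale_s = gamma_approx_pvalue(y_spatial, G_demo)
stat_0, p_null, shape_0, scale_0 = gamma_approx_pvalue(y_null, G_demo)
```

The shape and scale depend only on the kernel matrix, so in practice they can be computed once per kernel and reused across all K genes, leaving one matrix-vector product per gene.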
Aggregation and covariates.
As mentioned earlier, we consider multiple (say, R) choices for the spatial kernel covariance matrix $G_S$ to construct multiple test statistics: $T_{k,1}^{\text{SMASH}}, \ldots, T_{k,R}^{\text{SMASH}}$. Finally, we combine the p-values corresponding to these test statistics using the minimum p-value combination rule [80] (see S1 Text for more details). Note that we have assumed that $y_k$ is mean-standardized and that there are no additional covariates to be taken into account. In the presence of covariates, we would regress the covariates out from the gene expression vector $y_k$ prior to performing the test, using a multiple linear regression model. To further elaborate, letting X be the corresponding matrix of covariates, we would compute the projection matrix $P_X = X(X^T X)^{-1} X^T$ and substitute the vector $y_k$ with $(I - P_X) y_k$ in our proposed test statistic.
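Regressing covariates out amounts to replacing y_k with the least-squares residual (I - P_X) y_k, where P_X = X (X^T X)^{-1} X^T. The sketch below computes it without forming the N × N projection matrix; the covariate matrix, with an intercept and one simulated covariate, is hypothetical, and the function name is illustrative.

```python
import numpy as np

def regress_out(y, X):
    """Return (I - P_X) y, the residual of the least-squares fit of y on X."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(3)
n = 30
X_demo = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one hypothetical covariate
y_demo = 1.0 + 0.5 * X_demo[:, 1] + rng.normal(size=n)
resid = regress_out(y_demo, X_demo)

# Explicit projection matrix, built here only to check the result on a small example.
P_demo = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
```

Including an intercept column in X also mean-centres the residual, so the adjusted vector can be passed to the test statistic directly.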
FDR control and non-PD kernels
In the real data analysis, we used the Benjamini-Yekutieli procedure [87] to control the false discovery rate (FDR) at 0.05 (or 0.01) for all the methods. In the Results section, padjust refers to the adjusted p-values. It was shown by Zhu et al. (2021) [38] that parametric methods like SpatialDE and SPARK often produce highly inflated p-values for most ST datasets, and hence need additional testing correction. To check whether our p-values were inflated in the four real datasets, we randomly permuted the spatial locations of the cells/spots five times and then performed the tests using the three methods. Thus, we obtained the empirical null distribution of the p-values for each method, which we displayed as quantile-quantile plots (see Fig B in S1 Text). In all four cases, SMASH showed no sign of inflation and instead produced slightly conservative p-values, which is expected since the minimum p-value combination rule used for combining the p-values in our method is known to be conservative [88].
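The Benjamini-Yekutieli adjustment itself is short enough to write out. The sketch below computes adjusted p-values as the step-up quantity p_(i) · m · c(m) / i with c(m) = 1 + 1/2 + … + 1/m, made monotone and capped at 1; the function name and the example p-values are illustrative, and established statistical packages provide equivalent routines.

```python
import numpy as np

def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    c_m = np.sum(1.0 / ranks)                         # harmonic correction factor
    scaled = p[order] * m * c_m / ranks
    # Enforce monotonicity: each value is the minimum over itself and all larger ranks.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(monotone, 1.0)
    return adjusted

p_demo = [0.03, 0.001, 0.5, 0.02]                     # hypothetical raw p-values
p_adjusted = benjamini_yekutieli(p_demo)
is_svg = p_adjusted <= 0.05                           # FDR threshold of 0.05
```

In this toy example only the second gene survives the 0.05 threshold, which illustrates how much more stringent this adjustment is than a raw p-value cut-off.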
The cosine or periodic kernel covariance matrix is not positive definite (PD). Our testing framework and the distributional derivations hold only for PD kernel covariance matrices. One solution could be to truncate the negative eigenvalues of the kernel matrix, i.e., adjusting it as $\sum_{i=1}^{N} \max(\lambda_i, 0)\, U_i U_i^T$, where $\lambda_i$ and $U_i$ denote the i-th eigenvalue and eigenvector of the kernel matrix, respectively. However, computing the eigenvalues can become computationally challenging as it requires a complexity of $O(N^3)$. In our simulations, we noticed that using unadjusted versions of the kernel matrices yielded conservative test results, with no sign of p-value inflation. We refer to S1 Text for further details and plots.
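The eigenvalue truncation can be sketched with a symmetric eigendecomposition. The toy example assumes a cosine kernel with entries cos(2π ||s_i - s_j|| / p) and uses three locations on an equilateral triangle with side p/2, for which the kernel matrix has a negative eigenvalue; the function name and the coordinates are illustrative.

```python
import numpy as np

def truncate_negative_eigenvalues(G):
    """Replace negative eigenvalues of a symmetric matrix by zero (cost O(N^3))."""
    G = np.asarray(G, dtype=float)
    G = (G + G.T) / 2.0                              # symmetrise against round-off
    eigvals, eigvecs = np.linalg.eigh(G)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T           # sum_i max(lambda_i, 0) U_i U_i^T

# Equilateral triangle with side 0.5 and period 1: all off-diagonal entries are cos(pi) = -1.
S_demo = np.array([[0.0, 0.0], [0.5, 0.0], [0.25, 0.5 * np.sqrt(3) / 2]])
dist = np.linalg.norm(S_demo[:, None, :] - S_demo[None, :, :], axis=-1)
G_cos = np.cos(2.0 * np.pi * dist / 1.0)
G_psd = truncate_negative_eigenvalues(G_cos)
```

The decomposition is the O(N^3) step referred to above, which is why the adjustment is unattractive for large N.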
Enrichment scores
In the real data analysis, we computed the enrichment scores (ES) of the three methods following the procedure outlined in Liu et al. (2022) [48]. Cell clustering, based on biological knowledge or performed with popular software such as RCTD [32] and Seurat [33] on the transcriptional profiles, can often identify spatially localized layers or cell types. Therefore, marker genes of those spatially restricted cell types should ideally be identified as SVGs. Suppose there are M cell types. For every cell type m, the gene set $G_m$ is built from the top 50 markers based on the fold change between the expression in cell type m and that in the others. The SVGs detected by the three methods are ranked from the most to the least significant. Finally, unweighted gene set enrichment analysis [89] is implemented to evaluate the enrichment of the gene sets $G_m$, m = 1, …, M, near the top of the ranked SVG lists of the methods.
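An unweighted enrichment score can be sketched as a running sum over the ranked list, which rises by 1/(number of hits) at each marker gene and falls by 1/(number of misses) otherwise, with the ES taken as the largest deviation from zero. This follows the classical unweighted gene set enrichment analysis; the exact conventions used in [48] may differ, and the gene names below are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted (Kolmogorov-Smirnov-like) enrichment score of gene_set in ranked_genes."""
    gene_set = set(gene_set)
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("the ranked list must contain both marker and non-marker genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):       # keep the largest deviation from zero
            best = running
    return best

ranked = ["A", "B", "C", "D", "E", "F"]    # most to least significant (placeholder names)
es_top = enrichment_score(ranked, {"A", "B"})
es_bottom = enrichment_score(ranked, {"E", "F"})
es_spread = enrichment_score(ranked, {"A", "D"})
```

Marker sets concentrated at the top of the ranking give scores near 1, sets concentrated at the bottom give scores near -1, and sets spread through the list give intermediate values.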
Software used
To fit SPARK-X, SpaGene, and SpatialDE, we used the existing packages which are available at,
- SPARK-X: https://github.com/xzhoulab/SPARK,
- SpaGene: https://github.com/liuqivandy/SpaGene, and
- SpatialDE: https://github.com/Teichlab/SpatialDE.
Gene-set functional enrichment analyses were performed using ShinyGO Version 0.77 [90] available at, http://bioinformatics.sdstate.edu/go/.
Simulation description
In simulation setup (1), we generated the spatial coordinates for varying numbers of cells, N = 1,000, 5,000, and 10,000, using a random point-pattern Poisson process [91]. The expression values of K = 500 genes in these cells were simulated from a negative binomial distribution displaying one of four spatial patterns: streak, reverse streak, hotspot, and reverse hotspot, as shown in Fig 1. For each of the patterns, 80% of the spatial locations were assumed to be background locations, while the remaining 20% were assumed to be part of the pattern. The difference between the mean expression of a gene at a background location and at a patterned location was captured through a fold-change parameter. Several values of the fold-change were considered, where a value of 1 implied a null scenario, i.e., no spatial pattern, and a high value implied a prominent spatial pattern. We refer to Zhu et al. (2021) [38] for more details.
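A simplified version of the hotspot scenario can be sketched as follows. Locations are drawn uniformly on the unit square, the 20% of locations closest to the centre form the hotspot, and counts follow a negative binomial distribution whose mean is multiplied by the fold-change inside the hotspot. The baseline mean, the dispersion, and the way the hotspot is defined are assumptions of this sketch, not the settings of [38].

```python
import numpy as np

def simulate_hotspot_gene(S, fold_change, base_mean=1.0, dispersion=2.0,
                          pattern_fraction=0.2, rng=None):
    """Negative binomial counts with an elevated mean in a central hotspot."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    dist_to_centre = np.linalg.norm(S - S.mean(axis=0), axis=1)
    n_pattern = int(round(pattern_fraction * n))
    in_pattern = np.zeros(n, dtype=bool)
    in_pattern[np.argsort(dist_to_centre)[:n_pattern]] = True   # closest 20% form the hotspot
    mean = np.where(in_pattern, base_mean * fold_change, base_mean)
    # NumPy parameterisation: mean = dispersion * (1 - prob) / prob.
    prob = dispersion / (dispersion + mean)
    counts = rng.negative_binomial(dispersion, prob)
    return counts, in_pattern

rng = np.random.default_rng(4)
S_demo = rng.uniform(size=(1000, 2))       # uniform locations, as in a homogeneous point pattern
counts_demo, hotspot_demo = simulate_hotspot_gene(S_demo, fold_change=3.0, rng=rng)
```

Setting `fold_change=1.0` gives a null gene with no spatial pattern, which is how type 1 error would be assessed in such a setup.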
For simulation setup (2), we considered the Gaussian process (GP)-based spatial regression model from SpatialDE [34]. The locations were simulated from a uniform distribution and were then used to construct Gaussian covariance matrices with varying lengthscale (l) parameters as in Eq (1). The expression levels of genes were independently and identically simulated from the multivariate normal distribution described in Eq (1) for different values of the spatial and non-spatial variance parameters, $\sigma_k^2$ and $\tau_k^2$. We fixed the total variance, $\sigma_k^2 + \tau_k^2$, and varied the two individual values through an “effect-size” h, which ranged from zero to larger values, implying a null to an increasingly stronger spatial pattern. In simulation setup (3), we followed setup (2), replacing the Gaussian covariance with the cosine covariance for varying values of the period parameter p. In all three setups, we compared SMASH, SPARK-X, and SpaGene in terms of type 1 error and power.
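Drawing expression vectors from a GP model of this kind can be sketched through a Cholesky factor of the covariance. The sketch assumes a covariance of the form σ² Σ + τ² I with a Gaussian Σ, so that the diagonal equals the total variance σ² + τ²; the specific values of the lengthscale, σ², and τ² below are hypothetical and are not the settings used in the paper.

```python
import numpy as np

def gp_covariance(S, lengthscale, sigma2, tau2):
    """sigma2 * Sigma + tau2 * I, with Sigma the Gaussian covariance of the locations."""
    S = np.asarray(S, dtype=float)
    sq_dist = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    Sigma = np.exp(-sq_dist / (2.0 * lengthscale**2))
    return sigma2 * Sigma + tau2 * np.eye(S.shape[0])

def simulate_expression(S, lengthscale, sigma2, tau2, n_genes, mu=0.0, rng=None):
    """Draw n_genes independent expression vectors; column k is one draw of y_k."""
    rng = np.random.default_rng() if rng is None else rng
    cov = gp_covariance(S, lengthscale, sigma2, tau2)
    chol = np.linalg.cholesky(cov)           # needs a positive definite matrix, so keep tau2 > 0
    z = rng.standard_normal((cov.shape[0], n_genes))
    return mu + chol @ z

rng = np.random.default_rng(5)
S_demo = rng.uniform(size=(100, 2))
cov_demo = gp_covariance(S_demo, lengthscale=0.2, sigma2=0.6, tau2=0.4)
Y_demo = simulate_expression(S_demo, lengthscale=0.2, sigma2=0.6, tau2=0.4, n_genes=5, rng=rng)
```

Moving variance from τ² to σ² while keeping their sum fixed strengthens the spatial pattern without changing the marginal variance of each observation, and σ² = 0 corresponds to the null scenario.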
Supporting information
S1 Text.
Section 1 discusses how to choose suitable kernel covariance matrices and combine the p-values corresponding to different kernel covariance matrices. Section 2 shows SPARK-X’s equivalence with the multiple linear regression model. Section 3 analyzes the null QQ plots of different methods in the real datasets. Section 4 discusses the severity of using non-positive definite (non-PD) kernel covariance matrices. We list and briefly describe the figures from S1 Text below.
- Fig A. Visualization of patterns of different kernel covariance matrices.
- Fig B. QQ-plots of different methods under null simulations in the real datasets.
- Fig C. QQ-plots with the observed and theoretical distributions of the SMASH test statistic with an unadjusted cosine kernel matrix.
- Fig D. QQ-plots with the observed and theoretical distributions of the SMASH test statistic with an adjusted cosine kernel matrix.
- Fig E. QQ-plots with the observed and theoretical distributions of the $-\log_{10}(p)$-values obtained using SMASH with all the kernel matrices.
https://doi.org/10.1371/journal.pgen.1010983.s001
(PDF)
Acknowledgments
We would like to thank Dr. Kristen Wells-Wrasman for her help with processing the SCCOHT dataset.
References
- 1. Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82. pmid:27365449
- 2. Shah S, Lubeck E, Zhou W, Cai L. In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron. 2016;92(2):342–357. pmid:27764670
- 3. Shah S, Lubeck E, Zhou W, Cai L. seqFISH accurately detects transcripts in single cells and reveals robust spatial organization in the hippocampus. Neuron. 2017;94(4):752–758. pmid:28521130
- 4. Wang G, Moffitt JR, Zhuang X. Multiplexed imaging of high-density libraries of RNAs with MERFISH and expansion microscopy. Scientific reports. 2018;8(1):1–13.
- 5. Xia C, Fan J, Emanuel G, Hao J, Zhuang X. Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression. Proceedings of the National Academy of Sciences. 2019;116(39):19490–19499. pmid:31501331
- 6. Eng CHL, Lawson M, Zhu Q, Dries R, Koulena N, Takei Y, et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature. 2019;568(7751):235–239. pmid:30911168
- 7. Asp M, Bergenstråhle J, Lundeberg J. Spatially resolved transcriptomes—next generation tools for tissue exploration. BioEssays. 2020;42(10):1900221. pmid:32363691
- 8. Guilliams M, Bonnardel J, Haest B, Vanderborght B, Wagner C, Remmerie A, et al. Spatial proteogenomics reveals distinct and evolutionarily conserved hepatic macrophage niches. Cell. 2022;185(2):379–396. pmid:35021063
- 9. Dhainaut M, Rose SA, Akturk G, Wroblewska A, Nielsen SR, Park ES, et al. Spatial CRISPR genomics identifies regulators of the tumor microenvironment. Cell. 2022;185(7):1223–1239. pmid:35290801
- 10. Chen WT, Lu A, Craessaerts K, Pavie B, Frigerio CS, Corthout N, et al. Spatial transcriptomics and in situ sequencing to study Alzheimer’s disease. Cell. 2020;182(4):976–991. pmid:32702314
- 11. Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK, Williams SR, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature neuroscience. 2021;24(3):425–436. pmid:33558695
- 12. Ortiz C, Carlén M, Meletis K. Spatial transcriptomics: molecular maps of the mammalian brain. Annual review of neuroscience. 2021;44:547–562. pmid:33914592
- 13. Levy-Jurgenson A, Tekpli X, Kristensen VN, Yakhini Z. Spatial transcriptomics inferred from pathology whole-slide images links tumor heterogeneity to survival in breast and lung cancer. Scientific reports. 2020;10(1):1–11. pmid:33139755
- 14. Yoosuf N, Navarro JF, Salmén F, Ståhl PL, Daub CO. Identification and transfer of spatial transcriptomics signatures for cancer diagnosis. Breast Cancer Research. 2020;22(1):1–10. pmid:31931856
- 15. Hunter MV, Moncada R, Weiss JM, Yanai I, White RM. Spatially resolved transcriptomics reveals the architecture of the tumor-microenvironment interface. Nature communications. 2021;12(1):1–16. pmid:34725363
- 16. Zollinger DR, Lingle SE, Sorg K, Beechem JM, Merritt CR. GeoMx RNA assay: high multiplex, digital, spatial analysis of RNA in FFPE tissue. In Situ Hybridization Protocols. 2020; p. 331–345. pmid:32394392
- 17. Merritt CR, Ong GT, Church SE, Barker K, Danaher P, Geiss G, et al. Multiplex digital spatial profiling of proteins and RNA in fixed tissue. Nature biotechnology. 2020;38(5):586–599. pmid:32393914
- 18. Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science. 2019;363(6434):1463–1467. pmid:30923225
- 19. Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nature biotechnology. 2021;39(3):313–319. pmid:33288904
- 20. Vickovic S, Eraslan G, Salmén F, Klughammer J, Stenbeck L, Schapiro D, et al. High-definition spatial transcriptomics for in situ tissue profiling. Nature methods. 2019;16(10):987–990. pmid:31501547
- 21. Kwon S. Single-molecule fluorescence in situ hybridization: quantitative imaging of single RNA molecules. BMB reports. 2013;46(2):65. pmid:23433107
- 22. Lubeck E, Coskun AF, Zhiyentayev T, Ahmad M, Cai L. Single-cell in situ RNA profiling by sequential hybridization. Nature methods. 2014;11(4):360–361. pmid:24681720
- 23. Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348(6233):aaa6090. pmid:25858977
- 24. Moses L, Pachter L. Museum of spatial transcriptomics. Nature Methods. 2022;19(5):534–546. pmid:35273392
- 25. Atta L, Fan J. Computational challenges and opportunities in spatially resolved transcriptomic data analysis. Nature Communications. 2021;12(1):1–5. pmid:34489425
- 26. Thrane K, Eriksson H, Maaskola J, Hansson J, Lundeberg J. Spatially resolved transcriptomics enables dissection of genetic heterogeneity in stage III cutaneous malignant melanoma. Cancer research. 2018;78(20):5970–5979. pmid:30154148
- 27. Navarro JF, Croteau DL, Jurek A, Andrusivova Z, Yang B, Wang Y, et al. Spatial transcriptomics reveals genes associated with dysregulated mitochondrial functions and stress signaling in Alzheimer disease. Iscience. 2020;23(10):101556. pmid:33083725
- 28. Kats I, Vento-Tormo R, Stegle O. SpatialDE2: Fast and localized variance component analysis of spatial transcriptomics. bioRxiv. 2021;.
- 29. Rao A, Barkley D, França GS, Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596(7871):211–220. pmid:34381231
- 30. Wang Y, Ma S, Ruzzo WL. Spatial modeling of prostate cancer metabolic gene expression reveals extensive heterogeneity and selective vulnerabilities. Scientific reports. 2020;10(1):1–14. pmid:32103057
- 31. Berglund E, Maaskola J, Schultz N, Friedrich S, Marklund M, Bergenstråhle J, et al. Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity. Nature communications. 2018;9(1):2419. pmid:29925878
- 32. Cable DM, Murray E, Zou LS, Goeva A, Macosko EZ, Chen F, et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nature Biotechnology. 2022;40(4):517–526. pmid:33603203
- 33. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–3587. pmid:34062119
- 34. Svensson V, Teichmann SA, Stegle O. SpatialDE: identification of spatially variable genes. Nature methods. 2018;15(5):343–346. pmid:29553579
- 35. Li K, Yan C, Li C, Chen L, Zhao J, Zhang Z, et al. Computational elucidation of spatial gene expression variation from spatially resolved transcriptomics data. Molecular Therapy-Nucleic Acids. 2021;. pmid:35036053
- 36. Edsgärd D, Johnsson P, Sandberg R. Identification of spatial expression trends in single-cell gene expression data. Nature methods. 2018;15(5):339–342. pmid:29553578
- 37. Sun S, Zhu J, Zhou X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature methods. 2020;17(2):193–200. pmid:31988518
- 38. Zhu J, Sun S, Zhou X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biology. 2021;22(1):1–25. pmid:34154649
- 39. Li Q, Zhang M, Xie Y, Xiao G. Bayesian modeling of spatial molecular profiling data via Gaussian process. Bioinformatics. 2021;37(22):4129–4136. pmid:34146105
- 40. Weber LM, Saha A, Datta A, Hansen KD, Hicks SC. nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes. Nature Communications. 2023;14(1):4059. pmid:37429865
- 41. Bae S, Choi H, Lee DS. Discovery of molecular features underlying the morphological landscape by integrating spatial transcriptomic data with deep features of tissue images. Nucleic acids research. 2021;49(10):e55–e55. pmid:33619564
- 42. Hu J, Li X, Coleman K, Schroeder A, Ma N, Irwin DJ, et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature methods. 2021;18(11):1342–1351. pmid:34711970
- 43. Zhu Q, Shah S, Dries R, Cai L, Yuan GC. Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nature biotechnology. 2018;36(12):1183–1190. pmid:30371680
- 44. Miller BF, Bambah-Mukku D, Dulac C, Zhuang X, Fan J. Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities. Genome research. 2021;31(10):1843–1855. pmid:34035045
- 45. Dries R, Zhu Q, Dong R, Eng CHL, Li H, Liu K, et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome biology. 2021;22(1):1–31.
- 46. Jiang X, Xiao G, Li Q. A Bayesian modified Ising model for identifying spatially variable genes from spatial transcriptomics data. Statistics in Medicine. 2022;41(23):4647–4665. pmid:35871762
- 47. Zhang K, Feng W, Wang P. Identification of spatially variable genes with graph cuts. Nature Communications. 2022;13(1):5488. pmid:36123336
- 48. Liu Q, Hsu CY, Shyr Y. Scalable and model-free detection of spatial patterns and colocalization. Genome research. 2022;32(9):1736–1745. pmid:36223499
- 49. Banerjee S, Gelfand AE, Finley AO, Sang H. Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(4):825–848. pmid:19750209
- 50. Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–1088. pmid:18078480
- 51. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genome-wide association studies. Nature methods. 2011;8(10):833–835. pmid:21892150
- 52. Christensen OF, Waagepetersen R. Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics. 2002;58(2):280–286. pmid:12071400
- 53. Dean CB, Ugarte MD, Militino AF. Penalized quasi-likelihood with spatially correlated data. Computational statistics & data analysis. 2004;45(2):235–248.
- 54. Zhang K, Peters J, Janzing D, Schölkopf B. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775. 2012;.
- 55. Rencher AC, Christensen WF. Methods of Multivariate Analysis. Wiley; 2012.
- 56. Rubner Y, Tomasi C, Guibas LJ. The earth mover’s distance as a metric for image retrieval. International journal of computer vision. 2000;40(2):99–121.
- 57. Odén A, Wedel H. Arguments for Fisher’s permutation test. The Annals of Statistics. 1975; p. 518–520.
- 58. Gretton A, Fukumizu K, Teo C, Song L, Schölkopf B, Smola A. A kernel statistical test of independence. Advances in neural information processing systems. 2007;20.
- 59. Righelli D, Weber LM, Crowell HL, Pardo B, Collado-Torres L, Ghazanfar S, et al. SpatialExperiment: infrastructure for spatially-resolved transcriptomics data in R using Bioconductor. Bioinformatics. 2022;38(11):3128–3131. pmid:35482478
- 60. Moffitt JR, Bambah-Mukku D, Eichhorn SW, Vaughn E, Shekhar K, Perez JD, et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science. 2018;362(6416):eaau5324. pmid:30385464
- 61. Moffitt JR, Bambah-Mukku D, Eichhorn SW, Vaughn E, Shekhar K, Perez JD, et al. Data from: Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Dryad. https://datadryad.org/stash/dataset/.
- 62. Kirsch L, Liscovitch N, Chechik G. Localizing genes to cerebellar layers by classifying ISH images. PLOS computational biology. 2012;8(12):e1002790. pmid:23284274
- 63. Sanders BE, Wolsky R, Doughty ES, Wells KL, Ghosh D, Ku L, et al. Small cell carcinoma of the ovary hypercalcemic type (SCCOHT): A review and novel case with dual germline SMARCA4 and BRCA2 mutations. Gynecologic Oncology Reports. 2022; p. 101077. pmid:36249907
- 64. Zhang L, Wang Y, Sha Y, Zhang B, Zhang R, Zhang H, et al. CITED4 enhances the metastatic potential of lung adenocarcinoma. Thoracic Cancer. 2021;12(9):1291–1302. pmid:33759374
- 65. Gao C, Guo X, Xue A, Ruan Y, Wang H, Gao X. High intratumoral expression of eIF4A1 promotes epithelial-to-mesenchymal transition and predicts unfavorable prognosis in gastric cancer. Acta Biochimica et Biophysica Sinica. 2020;52(3):310–319. pmid:32147684
- 66. Gan L, Yang Y, Li Q, Feng Y, Liu T, Guo W. Epigenetic regulation of cancer progression by EZH2: from biological insights to therapeutic potential. Biomarker research. 2018;6(1):1–10. pmid:29556394
- 67. Duan R, Du W, Guo W. EZH2: a novel target for cancer treatment. Journal of hematology & oncology. 2020;13(1):1–12. pmid:32723346
- 68. Chen F, Qin T, Zhang Y, Wei L, Dang Y, Liu P, et al. Reclassification of endometrial cancer and identification of key genes based on neural-related genes. Frontiers in Oncology. 2022;12. pmid:36212450
- 69. Nelson TS, Taylor BK. Targeting spinal neuropeptide Y1 receptor-expressing interneurons to alleviate chronic pain and itch. Progress in neurobiology. 2021;196:101894. pmid:32777329
- 70. Viswanathan S, Bandyopadhyay S, Kao JP, Kanold PO. Changing microcircuits in the subplate of the developing cortex. Journal of Neuroscience. 2012;32(5):1589–1601. pmid:22302801
- 71. Wang YX, Zhang YJ. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering. 2012;25(6):1336–1353.
- 72. Seal S, Wrobel J, Johnson AM, Nemenoff RA, Schenk EL, Bitler BG, et al. On Clustering for Cell Phenotyping in Multiplex Immunohistochemistry (mIHC) and Multiplexed Ion Beam Imaging (MIBI) Data. BMC Research Notes. 2022;15(1):215. pmid:35725622
- 73. Seal S, Ghosh D. MIAMI: mutual information-based analysis of multiplex imaging data. Bioinformatics. 2022;38(15):3818–3826. pmid:35748713
- 74. Seal S, Datta A, Basu S. Efficient estimation of SNP heritability using Gaussian predictive process in large scale cohort studies. PLoS genetics. 2022;18(4):e1010151. pmid:35442943
- 75. Rasmussen CE, Williams CK. Gaussian processes for machine learning. International Journal of Neural Systems. 2006;14.
- 76. Cressie N. Statistics for spatial data. John Wiley & Sons; 2015.
- 77. Gourieroux C, Holly A, Monfort A. Likelihood ratio test, Wald test, and Kuhn-Tucker test in linear models with inequality constraints on the regression parameters. Econometrica: journal of the Econometric Society. 1982; p. 63–80.
- 78. Boos DD, Stefanski LA, et al. Essential statistical inference. Springer; 2013.
- 79. Sun S, Zhu J, Mozaffari S, Ober C, Chen M, Zhou X. Heritability estimation and differential analysis of count data with generalized linear mixed models in genomic sequencing studies. Bioinformatics. 2019;35(3):487–496. pmid:30020412
- 80. Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, Lin X. ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics. 2019;104(3):410–421. pmid:30849328
- 81. Fukumizu K, Bach FR, Jordan MI. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research. 2004;5(Jan):73–99.
- 82. Gretton A, Bousquet O, Smola A, Schölkopf B. Measuring statistical dependence with Hilbert-Schmidt norms. In: International conference on algorithmic learning theory. Springer; 2005. p. 63–77.
- 83. Fukumizu K, Gretton A, Sun X, Schölkopf B. Kernel measures of conditional dependence. Advances in neural information processing systems. 2007;20.
- 84. Welch BL. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika. 1947;34(1-2):28–35. pmid:20287819
- 85. Satterthwaite FE. An approximate distribution of estimates of variance components. Biometrics bulletin. 1946;2(6):110–114. pmid:20287815
- 86. Skiena SS. The algorithm design manual. vol. 2. Springer; 1998.
- 87. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of statistics. 2001; p. 1165–1188.
- 88. Narum SR. Beyond Bonferroni: less conservative analyses for conservation genetics. Conservation genetics. 2006;7:783–787.
- 89. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. pmid:16199517
- 90. Ge SX, Jung D, Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36(8):2628–2629. pmid:31882993
- 91. Baddeley A, Bárány I, Schneider R. Spatial point processes and their applications. Stochastic Geometry: Lectures Given at the CIME Summer School Held in Martina Franca, Italy, September 13–18, 2004. 2007; p. 1–75.