Genomics data analysis via spectral shape and topology

  • Erik J. Amézquita,

    Roles Formal analysis, Investigation, Visualization

    Affiliation Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, United States of America

  • Farzana Nasrin,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Mathematics, University of Hawaii at Manoa, Honolulu, HI, United States of America

  • Kathleen M. Storey,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Mathematics, Lafayette College, Easton, PA, United States of America

  • Masato Yoshizawa

    Roles Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – review & editing

    Affiliation School of Life Sciences, University of Hawaii at Manoa, Honolulu, HI, United States of America


Mapper, a topological algorithm, is frequently used as an exploratory tool to build a graphical representation of data. This representation can help to gain a better understanding of the intrinsic shape of high-dimensional genomic data and to retain information that may be lost using standard dimension-reduction algorithms. We propose a novel workflow to process and analyze RNA-seq data from tumor and healthy subjects integrating Mapper, differential gene expression, and spectral shape analysis. Precisely, we show that a Gaussian mixture approximation method can be used to produce graphical structures that successfully separate tumor and healthy subjects, and produce two subgroups of tumor subjects. A further analysis using DESeq2, a popular tool for the detection of differentially expressed genes, shows that these two subgroups of tumor cells bear two distinct gene regulations, suggesting two discrete paths for forming lung cancer, which could not be highlighted by other popular clustering methods, including t-distributed stochastic neighbor embedding (t-SNE). Although Mapper shows promise in analyzing high-dimensional data, tools to statistically analyze Mapper graphical structures are limited in the existing literature. In this paper, we develop a scoring method using heat kernel signatures that provides an empirical setting for statistical inferences such as hypothesis testing, sensitivity analysis, and correlation analysis.

1 Introduction

Topological data analysis (TDA) is a mathematical approach that yields promising results in unraveling the underlying structure of diverse data sets in biology. At the molecular level, TDA has been used to understand the structure of proteins [1], protein-ligand binding affinities [2, 3], and viral reassortment [4]. M. Nicolau and coauthors [5] used TDA to identify a previously unreported group of breast cancer tumors with a unique molecular profile and excellent prognosis. They employed a topological algorithm, known as Mapper, to build a graphical representation of the data that reduces the dimensionality of the data while still preserving its local structure [6]. J. Arsuaga et al. developed a homology-based classification method for genomic hybridization arrays and gene expression in [7, 8], and showed that it could distinguish most breast cancer subtypes. They extended this topological method and discovered new DNA copy number aberrations within specific subtypes of breast cancer [9]. A recent study by R. Jeitziner et al. [10] presented a new two-tiered version of the Mapper algorithm, which is particularly useful for small genomic sample sizes. Topological approaches to RNA-sequencing (RNA-seq) data have also been applied to study the in vitro differentiation of murine embryonic stem cells into neurons in [11], suggesting the translatability of Mapper-based topological analysis to many biological contexts.

There remains a massive quantity of high-throughput data unexplored by TDA, so we have begun our investigation by applying a Mapper-based classification approach to genomic data sequenced from lung adenocarcinoma samples. We choose to focus on lung carcinoma, since it is frequently accompanied by a large number of genetic mutations. The set of RNA-seq cancer data is provided in the Cancer Genome Atlas (TCGA), which is an effort run by the National Cancer Institute and the National Human Genome Research Institute to provide a vast amount of open access data from over 11,000 cancer patients [12]. The lung adenocarcinoma data was initially collected in two studies from 2014 and 2016 [13, 14]. We compare the data sequenced from lung tumor tissue with data sequenced from healthy lung tissue. The healthy tissue data is provided in the Genotype Tissue Expression (GTEx) project, which has cataloged over 9000 healthy tissue samples [15, 16].

We propose a novel workflow, outlined in Fig 1, to describe and analyze bulk tumor cell RNA-seq data and to detect robust patterns in such data. This workflow involves first preprocessing the data using a Gaussian mixture approximation and its corresponding scores, as illustrated inside the purple box in Fig 1. The Gaussian fitting-based analysis of RNA-seq data was proposed in [17]. However, the method relies on another form of sequencing data known as ChIP-seq data, which is not available for our analysis. In the second step of the workflow, we use Mapper to construct a graphical representation of the Gaussian mixture scores (yellow box in Fig 1). We apply this workflow to the lung data set described above, and we identify two distinct groups of tumor subjects that are consistently separated graphically by healthy subjects. In order to determine which subjects are placed consistently in the graph as Mapper parameters vary, we assign an index to each subject, which we call the position index (PI).

Fig 1. The genomics data analysis workflow proposed in this paper.

The purple box describes the data preprocessing step via Gaussian mixture scores (Section 3.1), the yellow box illustrates the Mapper representation step (Section 3.2), the green box highlights the gene expression analysis (Section 3.4), and the orange box snapshots the sensitivity analysis performed using graphical subject scores (Section 3.5).

The next step of the workflow involves further analysis of the Mapper graphical representation obtained from the RNA-seq data, guided by a pipeline of gene expression comparisons using DESeq2, followed by gene pathway prediction using Gene Ontology (GO) analyses with Enrichr (green box in Fig 1). DESeq2 is built on negative binomial generalized linear models and is useful for the detection of differentially expressed genes [18]. We use DESeq2 to perform analysis on PI-specific RNA-seq count data in a pairwise manner between each of the three subgroups generated by Mapper: one mostly consisting of healthy subjects vs. two others composed primarily of tumor subjects. The analysis shows that the majority of the p-values are less than the significance level of 0.01, thus indicating that the two tumor subgroups have significantly large numbers of dysregulated genes. We then use Enrichr to identify the enriched GO terms and molecular pathways among these differentially expressed genes [19–21]. This analysis predicts the biological processes that these dysregulated genes belong to, such as inflammatory reactions and muscle functions, pointing to possible impaired biological processes in cancer patients, thereby informing targeted therapeutic approaches.

The orange box in Fig 1 shows the final step of the workflow, where we develop a scoring method to provide statistical inferences, in particular, sensitivity analysis under different parameter choices. We denote these scores as graphical subject scores (GSS). The scoring method is built on heat kernel signatures (HKS), which can be thought of as spectral node signatures of graphs [22]. HKS belongs to a family of spectral node signatures called Laplacian family signatures (LFS). LFS is parameterized by a construction filter that relies on the graph Laplacian [23]. LFS has drawn recent attention for graph matching and graph classification problems [24–27]. In particular, the integration of spectral shape analysis and topological tools has been introduced in recent work [25, 26]. The combination of topological structures and HKS proves to be useful, as the former can capture the global topological properties, whereas the latter relies on the values generated by the spectral signatures to characterize the graphical structure, but HKS cannot be used to extract the topological information on its own. We consider the Mapper graphical structure as a doubly weighted graph, and the weighted graph Laplacian is based on the definition in [28]. The HKS assigns scores to each node and consequently to each subject in the Mapper graph. Although we focus on sensitivity analysis in this paper, the GSS method is general enough to be employed for other statistical inferences such as correlation analysis and hypothesis testing.

The main contributions of this work are:

  1. A novel approach to integrate the Mapper graphical structures and traditional differential gene expression analysis to provide insight regarding distinctive genetic markers.
  2. A scoring method based on spectral shape analysis to develop an empirical setting for statistical inferences of Mapper graphical structures.

This paper is organized as follows. Section 2 provides the necessary background and definitions for building the methodology of the paper. In Section 3, we describe the key methodologies. Detailed demonstrations of the data set and results are presented in Section 4. Finally, we end with a discussion in Section 5. Additional details describing the genetic analysis results are provided in the supporting information files. Furthermore, the code and data used in this paper are freely available in the GitHub repository genomicsTDA.

2 Background materials

We begin by discussing the necessary background for generating Mapper graphical structures and GSSs. In Subsection 2.1, we review the Gaussian mixture scores which are used for constructing Mapper graphs. Pertinent definitions, a lemma, and some basic facts about HKS are discussed in Subsection 2.2.

2.1 Gaussian mixture scores

Let x be a one-dimensional vector generated from a Gaussian mixture model (GMM) of N components obtained by fitting the distribution of FPKM (Fragments Per Kilobase Million) values. Then the probability density function for the GMM is given by $f(x) = \sum_{j=1}^{N} c_j \phi_j(x)$ (1) where N is the number of mixture components, and cj, μj, and σj are the weights, means, and standard deviations, respectively. In order to standardize the Gaussian mixture, we implement the notion of membership weight in Bayesian probability [29]. Precisely, a latent variable sj is defined by $s_j = \mathbb{1}(\pi = j)$, where π ∈ {1, ⋯, N} and $\mathbb{1}(\cdot)$ is the indicator function. Then for s ∼ Multinomial(1; c1, ⋯, cN) we have $x \mid s_j = 1 \sim \mathcal{N}(\mu_j, \sigma_j^2)$. The membership weight is then defined by $w_j(x) = c_j \phi_j(x) / \sum_{k=1}^{N} c_k \phi_k(x)$ (2) where $\phi_j(\cdot) = \phi(\cdot \mid \mu_j, \sigma_j)$ is the Gaussian density of the j-th component. The latent variables, sj, are not known, so we require estimates of sj. In order to estimate the latent variable, the authors of [29] discuss two types of assignments: hard and soft. The soft assignment takes the estimates, $\hat{s}_j$, to be the membership weights. On the other hand, the hard assignment defines the estimates as an indicator function based on the maximization of the membership weights [29]. These assignments are used to define four standardized scores, T0, T1, T2, and T3, in Eqs (3)–(6), following the notation in [29].

The transformations described in Eqs (3)–(6) construct approximate multivariate standard normal distributions [29]. The T0 score is defined using the hard assignment and involves indicator functions. The standardization of T1 in Eq (4) is achieved by inverting the variances first and then computing the combination with the membership weights. In contrast, for the standardization of T2 in Eq (5), we compute the combination of variances with the membership weights and then invert the term. Finally, the T3 score in Eq (6) is derived by using the marginal covariance-based standardization of the multivariate mixture model proposed in [30]. Per a reviewer’s request, we present the derivation of the transformation in Eq (6) in S1 File. The convergence properties corresponding to these scores are investigated in [31].
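The membership weights in Eq (2) and the two assignment types can be illustrated with a short sketch. The function names are hypothetical, the equal mixing weights are an assumption (the means and standard deviations are borrowed from the two-component example reported in Section 3.1.1), and the last function is only a hard-assignment z-score that follows the verbal description of T0; it is not a transcription of Eq (3) in [29].

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def membership_weights(x, c, mu, sigma):
    """Soft assignment: w_j(x) = c_j phi_j(x) / sum_k c_k phi_k(x)."""
    dens = [cj * normal_pdf(x, mj, sj) for cj, mj, sj in zip(c, mu, sigma)]
    total = sum(dens)
    return [d / total for d in dens]


def hard_assignment(weights):
    """Hard assignment: indicator of the component with the largest weight."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return [1 if j == best else 0 for j in range(len(weights))]


def hard_zscore(x, c, mu, sigma):
    """Standardize x with the parameters of its most likely component
    (an assumed reading of a hard-assignment score, not Eq (3) itself)."""
    s_hat = hard_assignment(membership_weights(x, c, mu, sigma))
    j = s_hat.index(1)
    return (x - mu[j]) / sigma[j]


# Assumed mixing weights; means and SDs taken from the example in Section 3.1.1.
c, mu, sigma = [0.5, 0.5], [3.12, 8.87], [2.61, 2.11]
w = membership_weights(8.87, c, mu, sigma)
```

A value sitting at the mean of the second component receives most of its membership weight from that component, so the hard assignment selects it and the z-score is zero.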

2.2 Heat kernel signatures (HKS)

Definition 1. [28] (Doubly-weighted graph). A connected undirected graph G = (V, M, W) is a doubly weighted graph, where V = {1, ⋯, n} is the vertex set, M = diag{m1, ⋯, mn} is the diagonal matrix for weights of vertices, and W = {Wij} is the matrix for weights of edges. If D = {d1, ⋯, dn} is the degree matrix of the graph G, then di = ∑jWij.

Definition 2. [28] (Weighted graph Laplacian). Suppose G = (V, M, W) is a doubly weighted graph. Let $F_V$ be the linear space of all functions $f: V \to \mathbb{R}$. The gradient of f at a vertex x is defined as a vector with one component for each y ∈ V, and the weighted Laplacian Δ is an operator in $F_V$ defined as $\Delta f(x) = \frac{1}{m_x} \sum_{y \in V} W_{xy} (f(x) - f(y))$. The integral of f is defined as ∫f ≔ ∑x∈V f(x)mx, and $F_V$ is equipped with the inner product ⟨f, g⟩ = ∫fg for all $f, g \in F_V$.

Lemma 1. [28] Δ is equivalent to the weighted Laplacian matrix $L_{M,W} = M^{-1}(D - W)$. Notice that when M = I or M = D, the weighted Laplacian becomes the unnormalized Laplacian or the normalized Laplacian, respectively.

When M = I, the Laplacian matrix is symmetric and positive semi-definite. The matrix $L_{M,W}$ need not be symmetric for a general M; however, in [28] the authors provide the weighted spectral algorithm and show how to compute the eigendecomposition for $L_{M,W}$. They denote the eigenpairs by $\{(\lambda_k, \phi_k)\}_{k=1}^{n}$.

Definition 3. [23] (Laplacian Family Signature (LFS)). Suppose $h(t; \lambda_k)$ is the construction filter function for a family of signatures. Then the LFS of a node i ∈ V is a one-parameter family of structural node descriptors, $\nu_i(t) = \sum_{k=1}^{n} h(t; \lambda_k)\, \phi_k(i)^2$ (7)

As the signature of a given node i ∈ V is a function of the parameter t, analogous to time in the well-known heat diffusion process, two nodes i and j can be compared using any kind of distance or norm between the functions νi(⋅) and νj(⋅). A physical interpretation of the node signature is that it captures the amount of heat left at the node at various times t, assuming a unit of heat is placed at that node initially, at t = 0. To obtain HKS, we select h(t; λk) = exp(−tλk).
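A minimal numerical sketch of the HKS on a doubly weighted graph is given below. It assumes NumPy is available, uses the matrix form L = M^{-1}(D − W) from Lemma 1, and assumes the standard LFS form in which the signature is the sum over eigenpairs of exp(−tλk) times the squared eigenvector entry at the node. The eigenvectors are normalized with respect to the M-weighted inner product; this normalization and the function name are assumptions, not details taken from [23] or [28].

```python
import numpy as np


def heat_kernel_signature(W, m, times):
    """Return an (n_nodes, n_times) array of HKS values.

    W     : symmetric (n, n) array of edge weights
    m     : length-n array of positive vertex weights (diagonal of M)
    times : sequence of diffusion times t >= 0
    """
    W = np.asarray(W, dtype=float)
    m = np.asarray(m, dtype=float)
    times = np.asarray(times, dtype=float)

    D = np.diag(W.sum(axis=1))
    inv_sqrt_m = 1.0 / np.sqrt(m)

    # M^{-1/2} (D - W) M^{-1/2} is symmetric and has the same eigenvalues
    # as M^{-1} (D - W), so a symmetric eigensolver can be used.
    L_sym = inv_sqrt_m[:, None] * (D - W) * inv_sqrt_m[None, :]
    lam, U = np.linalg.eigh(L_sym)

    # Map back to eigenvectors of M^{-1} (D - W); columns are M-orthonormal.
    Phi = inv_sqrt_m[:, None] * U

    # hks[i, s] = sum_k exp(-times[s] * lam[k]) * Phi[i, k] ** 2
    return (Phi ** 2) @ np.exp(-np.outer(lam, times))


# Path graph on three nodes with unit vertex weights.
W_path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
hks = heat_kernel_signature(W_path, [1, 1, 1], [0.0, 0.5, 50.0])
```

With unit vertex weights every node starts at 1 when t = 0 and, on a connected graph, decays toward 1/n for large t, which matches the heat-diffusion interpretation above.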

3 Methodology

3.1 Gaussian mixture approximation of FPKM data

When analyzing RNA-seq data, it is routine to differentiate high and low expression genes, or to detect proteins from several genes expressed below an appropriate threshold [17, 32, 33]. By using visual interpretation and curve fitting to assess the quality of fitting, Hebenstreit and collaborators [32] showed that RNA-seq follows a bimodal distribution of high and low expression genes. In [33], the authors calculated an automatic measure of the presence and prevalence of transcripts from known and previously unknown genes. FPKM (Fragments Per Kilobase Million) computation is gaining attention as an expression for RNA-seq data, as this measure is normalized with respect to gene length and allows us to identify the relative gene expression more intuitively.

An alternative to the use of FPKM values, in particular the Z-scores of log2(FPKM) values, was proposed in [17]. However, they rely on a predefined threshold as a cutoff point and another set of sequencing data known as ChIP-seq data. The cutoff is defined as a point where the ratio of active to repressed gene promoters is less than 1. In order to find the Z-scores of FPKM, they fit a Gaussian curve to each gene expression curve centered at the half-point of the log2(FPKM) values. The fitting of a symmetric distribution to an asymmetric curve is justified by the cutoff point, as anything to the left of that point was ignored. In our study, the ChIP-seq data is not available, so there is no reasonable cutoff point to consider. However, the Z-score approach proves to be useful if the goal is to visualize the gene expression curve centering the highest expressed genes for each individual. In order to capture the full distribution of gene expression levels, we choose to implement the Gaussian mixture model to fit the log2(FPKM) values. We present a motivating example for applying this model in Section 3.1.1. We then estimate standardized scores for GMM, presented in Section 2.1, and their corresponding distributions, to determine the most appropriate score for further analysis.

3.1.1 Motivating example.

We provide a motivating example, exhibiting similar behavior to all other subjects in our data set, to justify our use of the GMM to standardize RNA-seq data in this work. Fig 2(a) shows the distribution of log2(FPKM) values of gene expression in healthy tissue, provided by GTEx, as a purple curve. It also includes the fitted Gaussian curve with mean = 8.76 and standard deviation = 2.08 in red. However, the purple curve is asymmetric and suggests the presence of two subpopulations. Hence we find that a Gaussian mixture model (GMM) of two components is more useful for identifying local features. It should be noted that the Gaussian fitting proposed in [17] was sufficient in their work, as they incorporated the ChIP-seq data and in turn estimated an appropriate cutoff point. This allowed them to ignore the left subpopulation.

Fig 2.

An example of (a) the fitted Gaussian curve and (b) the fitted Gaussian mixture curve of two components to the distribution of log2(FPKM) values for a sample subject from the GTEx data set.

As suggested by the structure of the log2(FPKM) curve, a fitted Gaussian mixture model has two components with means 3.12 and 8.87, and standard deviations 2.61 and 2.11, respectively (see Fig 2(b)). Comparing with Fig 2(a), we observe that one of the components has very similar parameters to the standard Gaussian fitting. The presence of the second subpopulation suggests that the incorporation of GMM is necessary to provide a complete analysis of the data.
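For readers who want to see how a two-component fit of this kind can be obtained, the following is a minimal expectation-maximization (EM) sketch in plain Python. It is an illustration under stated assumptions, not the fitting routine used in this work: the initialization at the sample quartiles, the fixed iteration count, and the synthetic data are all assumptions.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(xs, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (weights, means, standard deviations)."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialization: means at the quartiles, a common wide spread.
    mu = [ordered[n // 4], ordered[(3 * n) // 4]]
    spread = (ordered[-1] - ordered[0]) / 4.0 or 1.0
    sigma = [spread, spread]
    c = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: membership weights of every observation.
        resp = []
        for x in xs:
            dens = [c[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: weighted updates of weights, means, and standard deviations.
        for j in range(2):
            nj = max(sum(r[j] for r in resp), 1e-12)
            c[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigma[j] = max(math.sqrt(var), 1e-6)
    return c, mu, sigma


# Synthetic stand-in for log2(FPKM) values with two well-separated groups.
random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(500)]
data += [random.gauss(9.0, 1.0) for _ in range(500)]
weights, means, sds = fit_gmm_1d(data)
```

On real expression data the two components overlap more than in this synthetic example, so convergence should be checked rather than assumed after a fixed number of iterations.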

3.2 Mapper algorithm

To represent our data graphically, we use a topological algorithm, known as Mapper, developed by Singh, Mémoli, and Carlsson in [6]. Mapper, based on a generalized Reeb graph [34], is a tool used for data analysis and visualization, which relies upon a specified filter function to guide the clustering of data points. The stability of Mapper graphs is discussed in [35]. The filter function assigns a scalar value to each subject, after which subjects are sorted into overlapping bins, according to their corresponding filter function output. Next, subjects within each bin are clustered together, according to a specified clustering algorithm, e.g. DBSCAN or agglomerative clustering, to form nodes of the graph. Two nodes are then connected by an edge if they contain one or more subjects in common.

Thus, Mapper requires the choice of a filter function and clustering algorithm, as well as the specification of input parameters: b, the number of filter output bins of equal length; p, the percent of overlap between adjacent bins; and ϵ, the scale parameter used in the clustering algorithm, with the metric Euclidean distance. We considered several filter functions, including maximum, minimum, and mean correlation, defined for a subject χ to be the maximum, minimum, or mean correlation, respectively, between χ and all other subjects in the data set. When applying the algorithm to a data set composed of tumor and healthy subjects of a given cancer type, we find that mean correlation is most effective in producing a meaningful graphical structure that largely separates tumor and healthy subjects. Thus, we focus primarily on the mean correlation filter function in this work.
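The correlation-based filter functions described above can be sketched as follows. The text does not state which correlation coefficient is used, so the Pearson coefficient here is an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean


def pearson(u, v):
    """Pearson correlation between two equal-length expression vectors."""
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mu_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mu_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def correlation_filter(subjects, reduce=mean):
    """Filter value of each subject: reduce() applied to its correlations
    with all other subjects. Pass max or min for the other two variants."""
    values = []
    for i, x in enumerate(subjects):
        corrs = [pearson(x, y) for j, y in enumerate(subjects) if j != i]
        values.append(reduce(corrs))
    return values


# Toy data: subjects 0 and 1 are perfectly correlated, subject 2 is reversed.
toy = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
mean_corr = correlation_filter(toy)
max_corr = correlation_filter(toy, reduce=max)
```

For hundreds of subjects and thousands of genes, a vectorized correlation matrix would be far faster than this double loop; the loop is kept only to make the definition explicit.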

We vary the number of bins, b, in the interval 60 ≤ b ≤ 110 and vary the parameter p in the interval 30 ≤ p ≤ 80. We use the agglomerative clustering algorithm to cluster subjects within each node. The FPKM expression levels and T0 scores are on different orders of magnitude, so for the parameter ϵ used in clustering data points, we let 600 ≤ ϵ ≤ 1000 when using T0 scores, and let 1 × 10^6 ≤ ϵ ≤ 2 × 10^6 when using FPKM levels for comparison. Since the Mapper algorithm is sensitive to parameter variation, we test the robustness of the graphical structure to variation in parameter values in Section 4.5.
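The roles of b, p, and ϵ can be made concrete with a compact Mapper sketch. Several choices here are assumptions made for illustration: the way the overlap percentage is turned into interval endpoints, the use of single-linkage clustering cut at distance ϵ (one particular agglomerative scheme), and all function names. It is not the implementation used to produce the graphs in this paper.

```python
import math
from itertools import combinations


def cover(filter_values, b, p):
    """b equal-length intervals over the filter range, adjacent ones overlapping by p%."""
    lo, hi = min(filter_values), max(filter_values)
    frac = p / 100.0
    length = (hi - lo) / (1 + (b - 1) * (1 - frac))
    step = length * (1 - frac)
    return [(lo + k * step, lo + k * step + length) for k in range(b)]


def single_linkage(indices, points, eps):
    """Clusters = connected components of the graph joining points within eps."""
    unseen = set(indices)
    clusters = []
    while unseen:
        seed = unseen.pop()
        component, stack = {seed}, [seed]
        while stack:
            i = stack.pop()
            near = {j for j in unseen if math.dist(points[i], points[j]) <= eps}
            unseen -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_values, b, p, eps):
    """Return (nodes, edges); each node is a frozenset of subject indices."""
    tol = 1e-9 * max(1.0, max(filter_values) - min(filter_values))
    nodes = []
    for lo, hi in cover(filter_values, b, p):
        members = [i for i, f in enumerate(filter_values) if lo - tol <= f <= hi + tol]
        nodes.extend(single_linkage(members, points, eps))
    # Two nodes are joined when they share at least one subject.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Six subjects on a line, filtered by their coordinate.
pts = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
nodes, edges = mapper(pts, [q[0] for q in pts], b=3, p=50, eps=1.5)
```

In this toy case the three overlapping bins yield a three-node strand, the same qualitative shape that the position index in Section 3.3 relies on.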

3.3 Position index (PI)

In order to classify the position of subjects in a Mapper graph, we define the position index (PI), which assigns a value of 1, −1, or 0 to a subject, according to its position in the Mapper graph. The PI is applicable for graphs that have a strand-like structure, without significant branching, which are the only types of graphs that Mapper produced on the T0 scores of the RNA-seq data used in this study, with pertinent choices of parameters.

We assign a PI of +1 to all subjects that appear in the 13 nodes on the left side of the visual representation of the Mapper graph. The number 13 is chosen to account for roughly of the total nodes using our baseline set of parameters. Similarly, we assign a PI of −1 to all subjects that appear on the right side of the Mapper graph, and we assign 0 to all remaining subjects.
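A possible implementation of the position index is sketched below. It assumes the nodes of the strand-like graph have already been ordered from left to right, which the text does not describe how to do, and it assumes that a subject appearing at both ends keeps the value +1; both are assumptions made only for illustration.

```python
def position_index(ordered_nodes, n_subjects, k=13):
    """Assign +1 to subjects in the k left-most nodes, -1 to subjects in the
    k right-most nodes, and 0 to everyone else.

    ordered_nodes : list of sets of subject indices, ordered along the strand
    """
    pi = [0] * n_subjects
    for node in ordered_nodes[:k]:
        for subject in node:
            pi[subject] = 1
    for node in ordered_nodes[-k:]:
        for subject in node:
            if pi[subject] == 0:  # assumed tie rule: the left end takes priority
                pi[subject] = -1
    return pi


def summed_position_index(runs):
    """Sum the position indices of each subject over several Mapper runs."""
    return [sum(values) for values in zip(*runs)]


# Toy strand of five nodes over six subjects, with k = 1 for readability.
strand = [{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}]
pi = position_index(strand, n_subjects=6, k=1)
total = summed_position_index([pi, pi, pi])
```

Summing the index over runs with different parameter values gives a simple consistency check: a subject that stays at the same end of the strand accumulates a sum equal in magnitude to the number of runs.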

3.4 Gene expression analysis

For the RNA-seq transcriptome analysis, variation in gene expression is analyzed following the guidelines of DESeq2, an R software package [36]. The differential gene expression testing method in the package uses the negative binomial distribution to model the count data and fits the model with generalized linear models to estimate parameters. The method generates various results for analyzing differential gene expression, and among them we use the absolute value of the logarithmic fold changes (LFCs). The fold change between two groups is the ratio of their count data. For the detection of differentially expressed genes, DESeq2 tests the null hypothesis that the LFC between two groups for a gene’s expression is zero.

As for Gene Ontology (GO) term analysis, we use the Enrichr platform [19–21] by selecting significantly upregulated or downregulated genes between (1) PI = -1 vs. healthy, and (2) PI = +1 vs. healthy subjects. We select genes whose absolute LFC is 0.90 or more (i.e. 2^0.90 ≈ 1.87-fold or higher differential expression levels) in groups (1) and (2), yielding 2,885 and 2,923 genes, respectively. Among them, we subcategorize the shared genes between groups (1) and (2) (1,727 genes) and unique genes in (1) (1,158 genes) or (2) (1,183 genes). Each of these three gene lists is loaded at the input data window in the Enrichr site, and the analyzed results using the “GO Biological Process 2021” or “KEGG 2021 Human”, for GO term analysis and pathway analysis, respectively, are downloaded. The top 10 candidates of GO terms and KEGG signaling pathways are shown in Fig 9.
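The thresholding and the split into shared and unique gene lists amount to a few set operations, sketched here. The gene identifiers and LFC values are made up for illustration, and the sketch takes the LFCs as given rather than reproducing how DESeq2 estimates them.

```python
def select_genes(lfc_by_gene, threshold=0.90):
    """Genes whose absolute log2 fold change is at least the threshold."""
    return {gene for gene, lfc in lfc_by_gene.items() if abs(lfc) >= threshold}


def split_shared_unique(group1, group2):
    """Return (shared, unique to group1, unique to group2)."""
    return group1 & group2, group1 - group2, group2 - group1


# Hypothetical LFC values for the two comparisons against healthy subjects.
lfc_pi_minus = {"geneA": 1.4, "geneB": -0.95, "geneC": 0.2, "geneD": 2.1}
lfc_pi_plus = {"geneA": 1.1, "geneB": 0.3, "geneC": -1.7, "geneD": 0.9}

selected_minus = select_genes(lfc_pi_minus)
selected_plus = select_genes(lfc_pi_plus)
shared, only_minus, only_plus = split_shared_unique(selected_minus, selected_plus)

# An absolute LFC of 0.90 corresponds to roughly a 1.87-fold change.
fold_change = 2 ** 0.90
```

Each of the three resulting sets corresponds to one of the gene lists submitted to Enrichr.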

3.5 Graphical subject scores (GSS)

The structure of the Mapper graph depends on the set of parameters chosen by the user in a brute-force setting. Often it is very challenging to measure the robustness of the structure to variation in parameters. In this work, we propose a method to perform a sensitivity analysis of different parameter choices using the heat kernel signature (HKS). We consider Mapper graphical structures as doubly weighted graphs G = (V, M, W), as defined in Section 2.2. For a Mapper graph with n nodes, we consider the ratio of healthy and tumor subjects as the weights of vertices m1, ⋯, mn. The weight of the edge eij that connects the two vertices i and j is denoted as Wij and is estimated as the number of overlapping subjects between the two nodes i and j. The doubly weighted graph proves to be useful for our analysis, as it can incorporate various levels of information about the underlying data set encoded in the Mapper graph. We then compute the graph Laplacian and the corresponding eigenpairs, as defined in Section 2.2. The HKS of each node is computed using Eq (7). One subject can belong to more than one node, so its GSS is computed as the sum of the HKSs associated with those nodes. The Mapper structure relies on three parameters: the number of filter output bins b, the percent of overlap between bins p, and the scale parameter for the clustering ϵ. We vary one of them while keeping the other two fixed for the sensitivity analysis.
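The bookkeeping behind the edge weights and the GSS can be sketched as follows. The per-node HKS values are taken as given at a single time t (the numbers below are placeholders, not results), and the function names are hypothetical.

```python
def edge_weights(nodes):
    """W[i][j] = number of subjects shared by Mapper nodes i and j."""
    n = len(nodes)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = len(nodes[i] & nodes[j])
    return W


def graphical_subject_scores(nodes, node_hks, n_subjects):
    """GSS of a subject = sum of the HKS values of all nodes containing it."""
    gss = [0.0] * n_subjects
    for node, score in zip(nodes, node_hks):
        for subject in node:
            gss[subject] += score
    return gss


# Two overlapping nodes over three subjects, with placeholder HKS values.
toy_nodes = [{0, 1}, {1, 2}]
W_toy = edge_weights(toy_nodes)
gss = graphical_subject_scores(toy_nodes, [0.5, 0.25], n_subjects=3)
```

Subject 1 sits in both nodes and therefore accumulates both signatures, which is how overlap in the Mapper cover enters the score.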

4 Results

Here, we present a detailed illustration of the data set and the results using the workflow in Fig 1. To reproduce these results, the interested reader may use the GitHub repository genomicsTDA.

4.1 Data set

We apply the methodology described in the previous section to RNA-seq data sets, obtained from lung tissue. The data sets contain sequencing data from healthy tissue and from tumor tissue. The healthy data was collected as part of the Genotype Tissue Expression project (GTEx) [15, 16], and the tumor data was collected from The Cancer Genome Atlas (TCGA) [12]. The data is obtained from the published data in [37], so it is processed and normalized according to the process described in [37], in order to correct for batch effects and to allow for comparison between RNA-seq data from two sources. Combining the two data sets, we have 814 total subjects, with 314 healthy subjects from GTEx, and 500 tumor subjects from TCGA. Every subject contains an RNA-seq expression level, reported in FPKM, for 19648 distinct genes.

4.2 Data preprocessing

We fit a Gaussian mixture model (GMM) to the log2(FPKM) values for each of the 814 subjects by using the process discussed in Section 3.1. To test whether the mixture model fits the data appropriately, we perform a two-sided Kolmogorov-Smirnov (KS) hypothesis test, where the null hypothesis is that the log2(FPKM) values follow a two-component GMM, and the alternative hypothesis is that the values do not follow the GMM. We employ the significance level of α = 0.01. Consequently, a p-value greater than α means that we fail to reject the null hypothesis, which supports the choice of GMM fitting. We compute p-values for all 814 subjects and observe that the values justify the choice of GMM fitting almost consistently. We present the p-values of 6 subjects randomly chosen from both tumor and healthy subjects in Table 1. The p-values corresponding to the tumor subjects (from TCGA) are relatively lower than those of the healthy subjects. This indicates that the GMM fits the healthy subjects better than the tumor subjects, due to more variability present in tumor subjects.
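The test can be sketched in plain Python as below. The p-value uses the classical asymptotic Kolmogorov series, which is an assumption about how the p-value is obtained; it is also only approximate when the mixture parameters are estimated from the same data being tested. Function names are hypothetical.

```python
import math


def gmm_cdf(x, c, mu, sigma):
    """Cumulative distribution function of a univariate Gaussian mixture."""
    return sum(cj * 0.5 * (1.0 + math.erf((x - mj) / (sj * math.sqrt(2.0))))
               for cj, mj, sj in zip(c, mu, sigma))


def ks_statistic(xs, cdf):
    """Two-sided KS statistic: largest gap between empirical and model CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic p-value 2 * sum_k (-1)^(k-1) exp(-2 k^2 n d^2)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:  # the series converges slowly here and the value is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def gmm_fit_not_rejected(xs, c, mu, sigma, alpha=0.01):
    """True when the KS test does not reject the mixture at level alpha."""
    d = ks_statistic(xs, lambda x: gmm_cdf(x, c, mu, sigma))
    return ks_pvalue_asymptotic(d, len(xs)) > alpha
```

A call such as `gmm_fit_not_rejected(values, c, mu, sigma)` returns True when the p-value exceeds α = 0.01, which is the outcome interpreted above as support for the GMM.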

Table 1. p–values based on KS testing, as described in Section 4.2.

All of them exceed the threshold level 0.01, so the GMM null hypothesis is not rejected for these subjects.

Next, we estimate the four standardized scores defined in Eqs (3)–(6). Fig 3 shows the histograms of all four scores, and a fitted Gaussian curve for each, for the sample subject from Fig 2. It can be observed that the distribution of T0 is symmetric and fits the data well, as we desire from standardized scores. We also create the Q-Q plots of all four scores. In Fig 4 we present the plots for the sample subject considered in Fig 3. The Q-Q plots of T0 scores form more of a straight-line structure than all other scores.

Fig 3. The histograms and fitted Gaussian curves for the four scores defined in Section 2.1, for a sample subject from the GTEx data set.

Precisely, (a), (b), (c), and (d) display the T0, T1, T2, and T3 scores, respectively.

Fig 4. The Q-Q plots for the four scores defined in Section 2.1, for the same subject considered in Fig 3.

Precisely, (a), (b), (c), and (d) display the T0, T1, T2, and T3 scores, respectively. We observe that only the points in (a) enjoy a straight-line shape.

The readers may expect that gene expressions of different individuals require different types of standardized scores for analysis purposes. However, we have found in all data samples used in this work, that T0 is sufficient for further investigation. The reason is that out of these four scores, only the T0 score enjoys an approximately normal distribution. Hereafter we restrict our analysis to T0 scores.

4.3 Mapper and data visualization

Fig 5(a) and 5(b) show a comparison of the Mapper graphs when using FPKM data, and Fig 5(c) and 5(d) when using T0 scores. The colors indicate the percentage of tumor subjects versus healthy subjects within each node, with dark purple representing 100% healthy subjects and bright yellow representing 100% tumor subjects. Both of the T0 score graphs have a long strand-like appearance, with highly connected nodes in the p = 70 case, due to the higher percentage of overlap between adjacent bins. In both T0 cases, there are two distinct groups of yellow nodes, indicating that they contain a large number of tumor subjects, separated by all of the healthy subjects, located in the darker nodes near the center of the graphical structure (Fig 5(c) and 5(d)).

Fig 5.

(a) and (b) show Mapper graphs constructed using FPKM data, and (c) and (d) show Mapper graphs constructed using T0 scores. The FPKM data required a larger ϵ, so we use ϵ = 2 × 10^6 in (a) and (b), and ϵ = 600 in (c) and (d). (a) and (c) use the parameter p = 50, and (b) and (d) use p = 70. In all cases, we fix b = 80. The colors indicate the percentage of tumor subjects versus healthy subjects within each node, with dark purple representing 100% healthy subjects and bright yellow representing 100% tumor subjects.

Note that when we compare the Mapper graphs using FPKM data versus T0 scores, there is much more consistency in the graphical structure constructed from T0 scores. There is a higher level of connectivity between nodes with p = 70, but both graphs have a strand-like structure with similar coloring throughout. Using FPKM data, the graphs for p = 50 and p = 70 look particularly different in the regions containing healthy subjects in each case (Fig 5(a) and 5(b)). Notice that when using T0 scores the two outer ends of the strand-like graphs are primarily yellow in color, indicating that they contain mostly tumor subjects. The middle of the graph is primarily purple in color, indicating that the centrally located nodes contain mostly healthy subjects (Fig 5(c) and 5(d)). The separated groups of tumor subjects provided the motivation for the position index (PI) described in Section 3.3, as a method to distinguish between these two groups and for use in further investigations with more traditional differential gene expression analysis.

We compute the PI for the subjects clustered in the Mapper graphs using the parameter values b = 80, p ∈ {50, 70}, and ϵ ∈ {600, 700}. We find consistency in the PI across these parameter sets. Thus, using the PI allowed us to distinguish between tumor subjects clustered in the nodes on either end of the Mapper graph. Fig 6 displays the PI for the strand-like Mapper graph that uses T0 scores, with the parameter values p = 50 and ϵ = 600. The vertical red line separates the healthy subjects, on the left, from the tumor subjects, on the right. Note that almost all of the subjects on the two ends of the Mapper graph are tumor subjects, producing the yellow color that we see on the graph in Fig 5(c). In the next section, we describe the results from a genetic analysis that we performed on the two distinct groups of tumor subjects with PI = +1 and PI = −1, to identify any distinct genetic differences between these two groups. We also explore significant differences between each group of tumor subjects and the healthy subjects (i.e. those clustered in the center of the Mapper graph).

Fig 6. Position index (PI) of +1 and -1 are shown for healthy subjects, with subject ID smaller than 314 (to the left of the vertical red line), and tumor subjects, with subject ID larger than 314.

The PIs are computed from the Mapper graph using T0 scores, with b = 80, p = 50 and ϵ = 600.

We compare the PI for all subjects in Mapper graphs as we vary the parameter ϵ, the scale parameter used in the agglomerative clustering algorithm, in increments of 100 within the interval ϵ ∈ [600, 1000]. Fig 7 displays the sum of the PI over the Mapper graphs using these five distinct ϵ values. We observe that for all but one subject, the sum of the PI is 5, indicating a very high level of consistency in the location of the subjects in the Mapper graph, which is robust to variation in the local clustering scale within the Mapper algorithm. We perform a more extensive sensitivity analysis of the Mapper structure over a varying set of parameters using HKS in Section 4.5.
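The summation described above can be sketched as follows. This is only an illustration: the PI values are invented toy data, the function names are hypothetical, and we assume that every subject receives a PI of +1 or -1 in each graph, so that a subject is consistently placed when the absolute value of its sum equals the number of graphs.

```python
# Toy position indices, invented for illustration only (not the paper's data).
# Keys are clustering scales eps; each list holds one PI (+1 or -1) per subject.
pi_by_eps = {
    600:  [+1, -1, +1, -1],
    700:  [+1, -1, +1, -1],
    800:  [+1, -1, +1, +1],
    900:  [+1, -1, +1, -1],
    1000: [+1, -1, +1, -1],
}


def sum_position_indices(pi_table):
    """Sum each subject's PI across all Mapper graphs (one graph per eps)."""
    return [sum(values) for values in zip(*pi_table.values())]


def inconsistent_subjects(pi_sums, n_graphs):
    """Indices of subjects whose PI changed sign in at least one graph.

    Assumption: consistent placement means |sum| == n_graphs.
    """
    return [i for i, s in enumerate(pi_sums) if abs(s) != n_graphs]


pi_sums = sum_position_indices(pi_by_eps)
flagged = inconsistent_subjects(pi_sums, len(pi_by_eps))
print(pi_sums, flagged)
```

In this toy table only the last subject switches sides at one value of ϵ, so it is the only one flagged.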

Fig 7. The sum of the position indices (PIs) from five different Mapper graphs, with the parameter ϵ varying in the interval [600, 1000].

The healthy subjects have subject ID smaller than 314 (to the left of the vertical red line), and tumor subjects have subject ID larger than 314. All PIs are computed from the Mapper graphs using T0 scores, with b = 80 and p = 50.

Finally, we compare Mapper graphs with one of the most commonly used techniques to visualize RNA-seq data, t-distributed Stochastic Neighbor Embedding (t-SNE) [38, 39]. Although t-SNE is useful for clustering in low-dimensional space, it is unable to produce continuous structures when visualizing gene expression profiles (see [40] and references therein). In our analysis, t-SNE can accurately separate healthy and tumor subjects into two different clusters using both FPKM data and T0 scores (see Fig 8). However, the two tumor subgroups identified by the PI are not observed in these clusters. In this paper we restrict our analysis to the comparison of Mapper with t-SNE only. Detailed comparisons of Mapper with state-of-the-art clustering methods for the visualization of RNA-seq data can be found in [40].

Fig 8. The t-SNE clusters constructed using lung FPKM data (a) and using lung T0 scores (b).

4.4 Genetic analysis

We then run the differential gene expression analyses using the DESeq2 R package, followed by GO term/KEGG pathway analyses (see Section 3.4). Listing the genes that are significantly differentially expressed between case and control in lung tumor, we find large overlaps between the genes in the PI = +1 and PI = -1 groups (GO term analysis and KEGG pathway analysis: S2 File). Under the GO term analysis, these shared genes reveal enriched clusters related to muscle function (e.g., the actin family and myosin heavy/light chains), where many genes are upregulated in the cells in both the PI = -1 and +1 groups (Fig 9; S3 File).

Fig 9.

The biological functions corresponding to the significant genes found uniquely in the PI = -1 group (panels A and D), in the PI = +1 group (C and F), and shared between the +1 and -1 groups (B and E). Panels (A-C) show the biological processes found using GO term analysis, and panels (D-F) show the signaling pathways found using KEGG. The x-axes indicate the p-values of significant enrichment. Parentheses indicate the summary of the enriched processes in each panel.

Among the genes found uniquely in the PI = -1 group, GO terms are more enriched in cell signaling and inflammatory reaction. KEGG pathways for the PI = -1 genes are also enriched in inflammatory reactions, and many of these genes are upregulated (Fig 9; S2 File). In contrast, the genes in the PI = +1 group concentrate in tumor-related functions (retinoid metabolism, epidermis development, p38MAPK enhancement), and the KEGG analysis shows retinol metabolism and some digestion and metabolism pathways (Fig 9; S2 and S3 Files). Below, we discuss the utility of this information and the hidden biological significance revealed by the Mapper graph.

The Mapper graph indicates that the tumor subjects’ cells are clustered in two groups on either end of the strand-like graphical structure, which we label with PI = +1 and -1, while the cells from healthy individuals are largely clustered in the center. The two PI groups suggest two possible processes for forming lung cancer. For example, the PI = -1 group is strongly clustered with the tumor cells (the majority of nodes are yellow in Fig 5(c)), suggesting that the cells upregulating the inflammatory reactions are primarily cancer cells (S2 and S3 Files). In contrast, the PI = +1 group consists of a mixture of healthy and cancer cells (green and yellow nodes in Fig 5(c)), indicating that cells along the PI = +1 arm, although still ‘high risk,’ carry a lower risk of lung cancer than cells along the PI = -1 arm. Genes differentially expressed only in the PI = +1 group are enriched in signaling pathways associated with tumors and nicotine consumption (Fig 9; S3 File), suggesting possible interactions between environmental factors (such as smoking) and tumor genes.

Our pipeline presents a new path to reveal individualized treatment strategies for lung cancer as well as, more generally, to interpret bulk RNA-seq based studies. The vertices in the PI = -1 and +1 groups are continuously connected by edges; therefore, we can predict the risk state of ‘healthy cells’: if the cells belong to vertices close to the PI = -1 or +1 group, these cells are at higher risk of being cancer cells than the cells in the middle of the healthy cell cluster. In a biological context, if the expressions of muscle-related genes plus inflammatory-related genes are upregulated, these cells are at the highest risk. This type of interpretation using vertex-edge patterns is very difficult to obtain using the current popular clustering algorithms, including t-SNE. We also consider the differentially expressed genes shared by the PI = -1 and +1 groups, and we find that they are enriched in muscle physiology/differentiation (Fig 9). This result is surprising to us because, as far as we know, the association between muscle contractions and lung cancer has not been well characterized. One possibility is that the recruitment of actin-myosin and related genes is necessary during metastasis. Further investigation of muscle genes in the context of lung cancer may shed light on previously unknown aspects of lung cancer development.

4.5 Sensitivity analysis

To showcase the use of the GSS, we perform a simple sensitivity analysis over the different parameter sets that are used to generate Mapper graphs. As discussed in Section 3.2, the Mapper pipeline we consider here relies on three parameters, namely the number of bins b, the percent of overlap p, and the scale parameter ϵ used in the clustering algorithm. We estimate the GSS for the subjects in the underlying data set using all possible combinations of the parameter values b ∈ {60, 70, 80, 90, 100, 110}, p ∈ {30, 40, 50, 60, 70, 80}, and ϵ ∈ {600, 700, 800, 900, 1000}.
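The parameter grid above can be enumerated with a short sketch. The parameter values are those quoted in the text; the variable names and the helper `settings_varying_b` are hypothetical and only illustrate how one parameter can be varied while the other two stay fixed.

```python
from itertools import product

# Parameter values quoted in the text above.
BINS = (60, 70, 80, 90, 100, 110)     # number of bins b
OVERLAPS = (30, 40, 50, 60, 70, 80)   # percent of overlap p
SCALES = (600, 700, 800, 900, 1000)   # clustering scale eps

# Every (b, p, eps) combination: 6 * 6 * 5 settings.
grid = list(product(BINS, OVERLAPS, SCALES))


def settings_varying_b(p, eps):
    """Hypothetical helper: settings that share p and eps and differ only in b."""
    return [(b, p_, e) for (b, p_, e) in grid if p_ == p and e == eps]


slice_b = settings_varying_b(p=80, eps=700)
print(len(grid), len(slice_b))
```

Each such slice yields one set of Mapper graphs over which a per-subject spread of scores can be summarized.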

In order to perform the sensitivity analysis we vary one parameter while leaving the other two fixed. We then estimate the 95% confidence interval from the empirical distribution of the GSSs of each subject in the data set. For the sake of brevity, we present selected results that show robustness to the parameters. We find consistency in the GSSs and, consequently, tighter confidence intervals across the parameter sets b ∈ {60, 70, 80, 110} and p ∈ {40, 50, 60, 70}. We observe that several combinations of these parameter sets produce Mapper graphs that are very robust to changes in ϵ. Furthermore, the GSSs show distinguishable patterns between healthy and tumor subjects. Precisely, we observe less variation in the scores of healthy subjects and more in those of tumor subjects. This pattern was anticipated, as the tumor subjects are clustered in the nodes on either end of the Mapper graph, and healthy subjects are located at the center of the graph.
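As a minimal sketch of this step, the code below computes a per-subject mean and an interval from scores collected over several parameter settings. It assumes a percentile-based interval (2.5th and 97.5th empirical quantiles), which is one common reading of "empirical distribution" and may differ from the estimator actually used; the scores are invented toy values and `empirical_ci` is a hypothetical name.

```python
import numpy as np


def empirical_ci(scores, level=0.95):
    """Per-subject mean and percentile interval.

    scores: array of shape (n_settings, n_subjects), one row per Mapper graph.
    Assumption: the interval is taken from empirical quantiles of each column.
    """
    scores = np.asarray(scores, dtype=float)
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(scores, alpha, axis=0)
    upper = np.quantile(scores, 1.0 - alpha, axis=0)
    return scores.mean(axis=0), lower, upper


# Toy scores (not the paper's data): 6 settings x 3 subjects.
toy_scores = [
    [0.50, 0.20, 0.90],
    [0.50, 0.25, 0.70],
    [0.50, 0.22, 0.40],
    [0.50, 0.21, 0.85],
    [0.50, 0.24, 0.30],
    [0.50, 0.23, 0.60],
]

mean, lower, upper = empirical_ci(toy_scores)
width = upper - lower
print(width)
```

A subject whose score never changes across settings gets a zero-width interval, while a subject whose score moves with the parameters gets a wide one, which is the contrast reported between healthy and tumor subjects.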

Fig 10 displays the GSSs and corresponding CIs for the parameter values ϵ ∈ {700, 800, 900, 1000} and p = 80. For all of the cases in Fig 10 we vary the values of b in {60, 70, 80, 90, 100, 110}. The horizontal axis represents the subject index, and the vertical axis represents the GSSs. The red curve indicates the mean of the observations, the black curve represents the upper confidence limit, and the green curve represents the lower confidence limit. We observe a tighter confidence range for the healthy subjects’ scores, especially in Fig 10(a)–10(c). A wider confidence range is generated in Fig 10(d). For all of these cases we find that the CI is wider for some of the tumor subjects. This is because the tumor subjects are clustered in the nodes on either end, and HKS values reflect where the subject is clustered in the Mapper graph. We also note that varying the number of filter output bins, b, changes the number of nodes in the Mapper graph, which impacts the GSS for many subjects.
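To illustrate why HKS values depend on where a node sits in the graph, the sketch below evaluates the standard heat kernel signature, HKS(v, t) = Σ_k exp(−t λ_k) φ_k(v)², on a small path graph that stands in for a strand-like Mapper graph. Several choices here are assumptions made for illustration: the unnormalized Laplacian L = D − A, the diffusion time t = 0.1, the five-node toy graph, and the function names. The mapping from node values to a per-subject GSS is not shown.

```python
import numpy as np


def heat_kernel_signature(adjacency, t):
    """HKS(v, t) = sum_k exp(-t * lam_k) * phi_k(v)**2 for every node v.

    Assumption: the unnormalized graph Laplacian L = D - A is used.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # columns are eigenvectors
    return (eigvecs ** 2) @ np.exp(-t * eigvals)


def path_graph(n):
    """Adjacency matrix of a path on n nodes (a toy strand-like graph)."""
    a = np.zeros((n, n))
    for i in range(n - 1):
        a[i, i + 1] = a[i + 1, i] = 1.0
    return a


hks = heat_kernel_signature(path_graph(5), t=0.1)
print(np.round(hks, 4))
```

On this toy strand the two end nodes share one value and the central node has a different, smaller one, so a score built from such node values separates subjects placed at the ends from those placed in the middle.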

Fig 10.

95% confidence intervals (CIs) of graphical subject scores (GSSs) are shown for 814 subjects with (a) ϵ = 700 and p = 80, (b) ϵ = 800 and p = 80, (c) ϵ = 900 and p = 80, and (d) ϵ = 1000 and p = 80. The CIs were estimated by considering all possible values of the parameter b. The horizontal axis represents the subject index and vertical axis represents the GSSs. The red line is for the mean of the observations, black line is for the upper confidence limit, and green line is lower confidence limit. The first 314 subjects are healthy subjects, and we observe a distinguishable pattern in their GSSs with respect to the other 500 subjects.

Figs 11 and 12 display the GSSs and corresponding CIs for combinations of parameter values b ∈ {60, 70, 80, 110} and p ∈ {40, 50, 60, 70}. For all of the cases in Figs 11 and 12 we vary the values of ϵ in {600, 700, 800, 900, 1000}. The horizontal axis represents the subject index and the vertical axis represents the GSSs. The red curve displays the mean of the observations, the black curve shows the upper confidence limit, and the green curve shows the lower confidence limit. As in the results above, in Fig 11 we observe a tighter confidence range for the healthy subjects’ scores and a wider confidence range for tumor subjects, due to the clustering of the tumor subjects in the nodes on either end of the Mapper graph. We note that in Fig 12, as we vary ϵ, the Mapper graphs are nearly identical, which is why the green, red, and black curves nearly overlap. This demonstrates that the Mapper graphs and the clustering of the subjects are very robust to changes in ϵ in the parameter regime tested here.

Fig 11.

95% confidence intervals (CIs) of graphical subject scores (GSSs) are shown with (a) b = 60 and p = 80, (b) b = 70 and p = 80, (c) b = 80 and p = 80, and (d) b = 110 and p = 80. The CIs were estimated by considering the parameter ϵ ∈ {600, 700, 800, 900, 1000}. The horizontal axis represents the subject index and the vertical axis represents the GSSs. The red curve shows the mean of the observations, the black curve shows the upper confidence limit, and the green curve shows the lower confidence limit. The first 314 subjects are healthy subjects, and we observe distinguishable patterns in their GSSs relative to those of the other 500 subjects.

Fig 12.

95% confidence intervals (CIs) of graphical subject scores (GSSs) are shown with (a) b = 80 and p = 40, (b) b = 80 and p = 50, (c) b = 80 and p = 60, and (d) b = 80 and p = 70. The CIs were estimated by considering the parameter ϵ ∈ {600, 700, 800, 900, 1000}. The horizontal axis represents the subject index and vertical axis represents the GSSs. The red curve is for the mean of the observations, black curve is for the upper confidence limit, and the green curve shows the lower confidence limit.

5 Conclusion and discussion

In this work, we developed a novel workflow to analyze RNA-sequencing data using the topological algorithm known as Mapper and a corresponding scoring method based on the heat kernel signature (HKS). In order to produce a Mapper graph with an informative structure, we first describe the RNA-seq data using a Gaussian mixture model (GMM) and corresponding normalization scores. The number of components in the GMM is selected empirically. A line of future research is to develop a statistical analysis that accounts for uncertainty in the GMM parameters and estimates them automatically. We tested this workflow on a data set containing gene expression levels in lung tissue from tumor and healthy subjects, which revealed a graphical structure with two distinct groups of tumor subjects. These distinct groups are not found using t-SNE, a commonly used method for visualizing and clustering RNA-seq data.

Subsequent genetic analysis, using the differential gene expression analyses with DESeq2, followed by Enrichr, of the two tumor subgroups and the healthy subjects, identified the most significant genes within each group. This analysis reveals two distinct pathways for the development of lung cancer. One pathway involves retinoid metabolism and digestion, and the other primarily involves inflammatory reactions. The mixture of healthy subjects with the PI = +1 group of tumor subjects suggests that the upregulation of metabolism and digestion-related genes is not as high risk as the upregulation of inflammatory-related genes. Additionally, muscle-related genes are upregulated in both tumor groups, so cells with high expressions of muscle-related genes plus inflammatory-related genes could be at the highest risk. However, without further investigation, we cannot accurately interpret why muscle signaling is involved in tumor formation. We suspect the actomyosin complex (a part of the muscle signaling) could contribute to metastasis in lung cancer. The separation of the two tumor subgroups in the Mapper graph allowed us to discover these important genetic distinctions between the groups, which have not been well-characterized previously, to the best of our knowledge. These results suggest a direction for further biological investigation, which could reveal previously unknown lung cancer biomarkers and new mechanisms for the development of lung cancer.

Additionally, we showcase how the HKS can be applied to Mapper graphs in order to provide an associated score for each subject. We call this signature the graphical subject score (GSS), and it can be used to perform a parameter sensitivity analysis for a given data set, as we have demonstrated in this work. In our data set, the tumor subjects are more sensitive to parameter variation, due to the changing number of nodes as b and p vary, and the clustering of tumor subjects on both ends of the graph. However, the positioning of all subjects is very robust to changes in the clustering parameter ϵ. The GSS is useful for identifying Mapper parameter ranges that produce consistent graphical structures. Since it associates a score with each subject over parameter ranges, this method can be used in future work to allow for more robust statistical inference involving Mapper graphs.

Supporting information

S1 File. Derivation of Eq (6).

The .pdf file includes the derivation of standardized score T3 in Eq (6) to follow the suggestion of a reviewer.


S2 File. GO term analysis and KEGG pathway analysis.

The .xlsx file includes an archive of six .xlsx files in total. The first three present GO term analysis results with unique genes in PI = +1, shared genes in PI = +1 and PI = -1, and unique genes in PI = -1, respectively. The next three consist of KEGG pathway analysis results with unique genes in PI = +1, shared genes in PI = +1 and PI = -1, and unique genes in PI = -1, respectively.


S3 File. DESeq2 analysis.

The .xlsx file includes two .xlsx files—one has results from DESeq2 for healthy subjects vs. PI = -1 and healthy subjects vs. PI = +1 (A to H), and another one has a subcategorized list of genes in PI = -1 only, PI = +1 only, and shared genes in PI = +1 and PI = -1.



This research began at the ICERM Workshop: Applied Mathematical Modeling with Topological Techniques, and we thank Vladislav Bukshtynov, Steven Ellis, Elin Farnell, Hwayeon Ryu, and Sarah Tymochko for their ideas during early conversations related to this work.


  1. Kovacev-Nikolic V, Bubenik P, Nikolić D, Heo G. Using persistent homology and dynamical distances to analyze protein binding. Stat Appl Genet Mol Biol. 2016;15(1):19–38. pmid:26812805
  2. Cang Z, Mu L, Wei GW. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLOS Computational Biology. 2018;14(1):1–44. pmid:29309403
  3. Cang Z, Wei GW. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. Int J Numer Meth Biomed Engng. 2018;34(2):e2914. pmid:28677268
  4. Chan JM, Carlsson G, Rabadán R. Topology of viral evolution. PNAS. 2013;110(46):18566–18571. pmid:24170857
  5. Nicolau M, Levine A, Carlsson G. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. PNAS. 2011;108(17):7265–7270. pmid:21482760
  6. Singh G, Memoli F, Carlsson G. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In: Botsch M, Pajarola R, editors. Eurographics Symposium on Point-Based Graphics. Geneva: Eurographics Association; 2007. p. 91–100.
  7. DeWoskin D, Climent J, Cruz-White I, et al. Applications of computational homology to the analysis of treatment response in breast cancer patients. Topology and its Applications. 2010;157(1):157–164.
  8. Arsuaga J, Baas NA, DeWoskin D, et al. Topological analysis of gene expression arrays identifies high risk molecular subtypes in breast cancer. AAECC. 2012;23(3):3–15.
  9. Arsuaga J, Borrman T, Cavalcante R, et al. Identification of Copy Number Aberrations in Breast Cancer Subtypes Using Persistence Topology. Microarrays. 2015;4(3):339–369. pmid:27600228
  10. Jeitziner R, Carriere M, Rougemont J, et al. Two-Tier Mapper, an unbiased topology-based clustering method for enhanced global gene expression analysis. Bioinformatics. 2019;35(18):3339–3347. pmid:30753284
  11. Rizvi A, Camara P, Kandror E, et al. Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nat Biotechnol. 2017;35:551–560. pmid:28459448
  12. National Cancer Institute. The Cancer Genome Atlas Program; 2022. Available from:
  13. Campbell J, Alexandrov A, Kim J, et al. Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas. Nature Genetics. 2016;48:607–616. pmid:27158780
  14. The Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014;511:543–550.
  15. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45:580–585.
  16. GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–660.
  17. Hart T, Komori HK, LaMere S, et al. Finding the active genes in deep RNA-seq gene expression studies. BMC Genomics. 2013;14:778. pmid:24215113
  18. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. pmid:25516281
  19. Xie Z, Bailey A, et al. Gene Set Knowledge Discovery with Enrichr. Curr Protoc. 2021;1. pmid:33780170
  20. Kuleshov M, Jones M, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44. pmid:27141961
  21. Chen E, Tan C, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14(128). pmid:23586463
  22. Sun J, Ovsjanikov M, Guibas L. A Concise and Provably Informative Multi-Scale Signature Based on Heat Diffusion. In: Proceedings of the Symposium on Geometry Processing. SGP’09. Goslar, DEU: Eurographics Association; 2009. p. 1383–1392.
  23. Hu N, Rustamov R, Guibas L. Stable and Informative Spectral Signatures for Graph Matching. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 2313–2320.
  24. Chowdhury S, Needham T. Generalized Spectral Clustering via Gromov-Wasserstein Learning. In: AISTATS; 2021.
  25. Royer M, Chazal F, Levrard C, Ike Y, Umeda Y. ATOL: Measure Vectorization for Automatic Topologically-Oriented Learning. In: AISTATS; 2021.
  26. Carriere M, Chazal F, Ike Y, Lacombe T, Royer M, Umeda Y. PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures. In: Chiappa S, Calandra R, editors. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. vol. 108 of Proceedings of Machine Learning Research. PMLR; 2020. p. 2786–2796. Available from:
  27. Yim K, Leygonie J. Optimization of Spectral Wavelets for Persistence-Based Graph Classification. Frontiers in Applied Mathematics and Statistics. 2021;7.
  28. Xu S, Fang J, Li X. Weighted Laplacian Method and Its Theoretical Applications. IOP Conference Series: Materials Science and Engineering. 2020;768:072032.
  29. Li M, Schwartzman A. Standardization of multivariate Gaussian mixture models and background adjustment of PET images in brain oncology. Ann Appl Stat. 2018;12(4):2197–2227.
  30. Guo M, Yap JT, Van den Abbeele AD, Lin NU, Schwartzman A. Voxelwise single-subject analysis of imaging metabolic response to therapy in neuro-oncology. Stat. 2014;3:1. pmid:24999285
  31. Li L, Cheng WY, Glicksberg BS, Gottesman O, Tamler R, Chen R, et al. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Science Translational Medicine. 2015;7(311):311ra174. pmid:26511511
  32. Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann SA. RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol. 2011;7:497.
  33. Mortazavi A, Williams B, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–628. pmid:18516045
  34. Reeb G. Sur les points singuliers d’une forme de Pfaff complètement integrable ou d’une fonction numérique. C R Acad Sci Paris. 1946;222:847–849.
  35. Carrière M, Michel B, Oudot S. Statistical Analysis and Parameter Selection for Mapper. Journal of Machine Learning Research. 2018;19(12):1–39.
  36. Love M, Anders S, Kim V, Huber W. RNA-Seq workflow: gene-level exploratory analysis and differential expression. F1000Research. 2016;4(1070).
  37. Wang Q, Armenia J, Zhang C. Unifying cancer and normal RNA sequencing data from different sources. Nature Scientific Data. 2018;5(6):180061. pmid:29664468
  38. Hinton GE, Roweis ST. Stochastic Neighbor Embedding. In: Advances in Neural Information Processing Systems. vol. 15. Cambridge, MA, USA: The MIT Press; 2002. p. 833–840.
  39. van der Maaten L, Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008;9(86):2579–2605.
  40. Wang T, Johnson T, Zhang J, Huang K. Topological Methods for Visualization and Analysis of High Dimensional Single-Cell RNA Sequencing Data. Pac Symp Biocomput. 2019;24:350–361.