Tissue-adjusted pathway analysis of cancer (TPAC): A novel approach for quantifying tumor-specific gene set dysregulation relative to normal tissue

H. Robert Frost

doi:10.1371/journal.pcbi.1011717

Abstract

We describe a novel single sample gene set testing method for cancer transcriptomics data named tissue-adjusted pathway analysis of cancer (TPAC). The TPAC method leverages information about the normal tissue-specificity of human genes to compute a robust multivariate distance score that quantifies gene set dysregulation in each profiled tumor. Because the null distribution of the TPAC scores has an accurate gamma approximation, both population and sample-level inference are supported. As we demonstrate through an analysis of gene expression data for 21 solid human cancers from The Cancer Genome Atlas (TCGA) and associated normal tissue expression data from the Human Protein Atlas (HPA), TPAC gene set scores are more strongly associated with patient prognosis than the scores generated by existing single sample gene set testing methods.

Author summary

Cancer biology is highly tissue-specific: most cancer-driving somatic alterations occur in just a limited number of tissues and inherited mutations frequently have a tissue-specific functional impact. To leverage the associations between gene activity in normal and malignant tissue for gene set testing, we have developed a new single sample gene set testing method for tumor-derived transcriptomics data named TPAC (tissue-adjusted pathway analysis of cancer). The TPAC method uses normal tissue-specificity to quantify gene set dysregulation in each profiled tumor. Importantly, we found that TPAC gene set scores are more strongly associated with patient prognosis than the scores generated by existing single sample gene set testing methods.

Citation: Frost HR (2024) Tissue-adjusted pathway analysis of cancer (TPAC): A novel approach for quantifying tumor-specific gene set dysregulation relative to normal tissue. PLoS Comput Biol 20(1): e1011717. https://doi.org/10.1371/journal.pcbi.1011717

Editor: Martin H. Schaefer, Istituto Europeo di Oncologia, ITALY

Received: May 9, 2023; Accepted: November 27, 2023; Published: January 11, 2024

Copyright: © 2024 H. Robert Frost. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Cancer and normal tissue gene expression data was accessed from The Cancer Genome Atlas (TCGA) and Human Protein Atlas (HPA). For TCGA data, the PANCAN RNA-seq and phenotype data contained in the files ‘GDCPANCAN.htseq fpkm.tsv.gz’ and ‘GDCPANCAN.GDC phenotype.tsv’ was downloaded from the GDC Data Portal (https://gdc.cancer.gov/). Alternative cancer outcomes (e.g., progression-free interval) were accessed from the TCGA Pan-Cancer Clinical Data Resource (https://gdc.cancer.gov/about-data/publications/PanCan-Clinical-2018). Note that an equivalent FPKM-normalized version of the TCGA RNA-seq data can be accessed from the HPA project at https://www.proteinatlas.org/download/rna cancer sample.tsv.zip. For HPA data, the HPA staff provided normal tissue gene expression data in the file ‘HPA.normal.FPKM.GDCpipeline.csv’; this data was specially normalized by the HPA group as FPKM using a pipeline similar to that employed by GDC for the TCGA data (this data was generated for the ‘Human Pathology Atlas’ paper and can be accessed at https://hrfrost.host.dartmouth.edu/TPAC). The Hallmark collection gene sets were downloaded from version 7.2 of the Molecular Signatures Database (MSigDB) (as downloaded from http://software.broadinstitute.org/gsea/downloads.jsp). The TPAC R package and imported TPACData package are available on CRAN at https://cran.r-project.org/web/packages/TPAC and https://cran.r-project.org/web/packages/TPACData.

Funding: HRF received funding from National Institutes of Health grants R21CA253408, R35GM146586, P20GM130454 and P30CA023108. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Cancer develops when somatic alterations disrupt pathways associated with genome maintenance, cell proliferation, or cell survival to give a relative growth advantage to the impacted cell population [1]. Given the central role of pathway dysregulation in tumor initiation, growth and metastatic spread, basic and translational cancer research is often focused on pathway-related questions, e.g., How do somatic alterations cause dysregulation of specific pathways? What is the impact of pathway dysregulation on patient prognosis? What types of therapeutic agents can restore normal pathway function? One common approach for answering these questions involves the analysis of tumor-derived genomic data using gene set testing, or pathway analysis, methods. Gene set testing is a hypothesis aggregation technique that evaluates statistics computed on functionally related groups of genes, e.g., the sets maintained in collections like the Molecular Signatures Database (MSigDB) [2]. By focusing on gene sets, rather than individual genes, gene set testing can significantly improve power, interpretation and replication [3–6]. It should be noted that this paper uses the terms pathway and gene set interchangeably to refer to an unordered group of genes, i.e., no knowledge is assumed regarding the regulatory or other relationships between the genes in the set/pathway. This usage is in contrast to a distinction that is often made between these terms with ‘gene set’ representing an unordered group of genes, e.g., a Gene Ontology term [7], and ‘pathway’ a model that captures the regulatory relationships among a set of genes, e.g., Reactome pathways [8].

Although the pathways most commonly impacted in cancer have been identified [1] and progress has been made developing cancer-specific pathway analysis methods that can integrate gene expression and somatic alteration data [9–13], current approaches leverage just tumor-specific genomic data and do not take into account gene activity in the associated normal tissue. This is an important limitation given the strong association between gene activity in normal tissues and the pathophysiology of cancers originating in those tissues. Cancer biology is highly tissue and cell type-specific [14–18] with most driving somatic alterations either occurring in only a small number of cancer types, e.g., KRAS mutations in pancreatic, lung and colorectal cancers [19], or having a functional impact that varies between impacted tissues, e.g., germline BRCA1/BRCA2 mutations that only drive cancer in estrogen-sensitive tissues [20]. As we explored in a recent paper [21], this tissue-specificity means that the pattern of gene activity (as quantified by mRNA expression) in normal tissues carries important information regarding the biology of the associated cancer types. In that paper, we demonstrated through an analysis of solid tumor data from The Cancer Genome Atlas (TCGA) [22] and normal tissue data from the Human Protein Atlas (HPA) [23] that the association between tissue-specific and cancer-specific expression values, i.e., the ratio of gene expression in a specific tissue or cancer relative to the mean across multiple tissues or cancers, can be used to improve survival analysis, the comparative analysis of distinct cancer types, and the analysis of cancer/normal tissue pairs. Specifically, we found that normal tissue-specific genes (i.e., genes that are highly expressed in a given normal tissue relative to other tissue types) are typically down-regulated in the associated cancer (see the ρ_cancer/norm column in Table 1).
We also found that elevated expression of tissue-specific genes in a tumor is associated with improved cancer prognosis (see the ρ_surv column in Table 1). As detailed in that paper [21], these associations can be leveraged via gene filtering/weighting approaches to improve the statistical power for both cancer survival and cancer vs. normal differential expression analyses, and to identify transcriptomic differences between different cancers that are independent of the associated normal tissues.

Table 1. The 21 TCGA cancer types and corresponding HPA normal tissues supported by TPAC.

The ρ_cancer/norm column holds the Spearman rank correlation between normal tissue-specific gene weights (the log fold-change of the mean expression in the normal tissue to the mean in all normal tissues) and the log fold-change of the mean expression in the cancer type to mean expression in the normal tissue. The ρ_surv column holds the Spearman rank correlation between normal tissue-specific gene weights and the signed log of the p-value from a Kaplan-Meier test of the association between gene expression and cancer survival (as computed by Uhlen et al. [23]), which is computed as -log(p-value) for favorable genes and log(p-value) for unfavorable genes. The contents of this table are a synthesis of information from tables 2 and 3 in Frost [21].

https://doi.org/10.1371/journal.pcbi.1011717.t001
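The two quantities that enter the ρ_surv column can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function names are hypothetical, natural logarithms are assumed, and the four example genes use made-up values rather than data from the paper.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def tissue_weight(tissue_mean, all_tissue_mean):
    """Log fold-change of the tissue mean relative to the mean over all tissues."""
    return math.log(tissue_mean / all_tissue_mean)

def signed_log_p(p_value, favorable):
    """-log(p) for favorable genes, log(p) for unfavorable genes."""
    return -math.log(p_value) if favorable else math.log(p_value)

# Hypothetical values for four genes (illustration only, not from the paper).
weights = [tissue_weight(t, a) for t, a in [(8.0, 2.0), (3.0, 2.0), (1.0, 2.0), (0.5, 2.0)]]
surv = [signed_log_p(p, fav) for p, fav in
        [(0.001, True), (0.2, True), (0.3, False), (0.01, False)]]
rho_surv = spearman(weights, surv)
```

In this toy example the most tissue-specific genes are also the most favorable, so the rank correlation is positive.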

While a number of gene set testing approaches have been explored that account for tissue or cell type-specificity [24–27], the primary goal of these techniques was the identification of tissue-specific biological processes. None of these prior methods specifically focuses on cancer or uses the association between normal tissue-specificity and gene activity within dysplastic tissue. Given the lack of gene set testing methods that can leverage the association between normal tissue-specificity and cancer biology, we developed a new single sample gene set testing method for tumor-derived transcriptomics data named TPAC (tissue-adjusted pathway analysis of cancer). The TPAC method uses information about the normal tissue-specificity of human genes to compute a robust multivariate distance score that quantifies pathway dysregulation in each profiled tumor. TPAC currently supports the 21 solid tumor types listed in Table 1. As we demonstrate through an analysis of TCGA gene expression data, TPAC gene set scores are more strongly associated with both patient prognosis and tumor stage than the scores generated by existing single sample gene set testing methods. In the remainder of this paper, we detail the TPAC method and evaluate the performance of TPAC relative to existing single sample methods.

Materials and methods

Data sources

The findings detailed in this paper are based on bulk RNA-seq and clinical data from The Cancer Genome Atlas (TCGA) [22] for 21 human solid cancers and bulk RNA-seq data from the Human Protein Atlas (HPA) [28] for the associated 18 normal human tissues. These cancer types and normal tissues are listed in Table 1. Alternative cancer outcomes (e.g., progression-free interval) were accessed from the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR) [29]. The Hallmark gene set definitions were accessed from version 7.2 of the Molecular Signatures Database (MSigDB) [2]. As mentioned above regarding the distinction between gene sets and pathways, it is worth noting that MSigDB only includes unordered gene sets and, for some collections, e.g., Gene Ontology [30] and KEGG [31], significant filtering is performed. Researchers interested in leveraging the TPAC method should therefore explore not only other MSigDB collections in addition to the Hallmark sets but also the unmodified versions of collections such as the Gene Ontology, KEGG and Reactome [8]. Additional details on the data sources used in this paper can be found in the data availability section and the Methods section in S1 Text.

TPAC method

The TPAC method generates single sample gene set scores from tumor gene expression data using a modified version of the classic Mahalanobis multivariate distance measure [32]. TPAC takes four inputs:

X: n × p matrix that holds the normalized expression measurements for p genes in n tumors all of the same type (e.g., normalized RNA-seq data from TCGA). For the current implementation of TPAC, the tumor type is limited to one of the 21 solid cancers listed in Table 1. Genes with 0 variance are removed.
t: a length p vector that holds the mean gene expression values (as quantified by the HPA) for the normal tissue associated with the cancer type whose data is held in X.
$\bar{t}$: a length p vector that holds the average of the mean gene expression values across all 18 normal tissues in Table 1.
A: m × p matrix that represents the annotation of p genes to m gene sets as defined by a collection from a repository like the Molecular Signatures Database (MSigDB) [2] (a_i,j = 1 if gene j belongs to gene set i).

Given X, t, $\bar{t}$, and A, TPAC computes n × m matrices S, S⁺, and S⁻. These matrices hold single sample scores for each of the m gene sets defined in A and n tumors captured in X. These single sample gene set scores are computed as follows:

Compute normal tissue-specificity: Let the length p vector t* hold values representing the normal tissue-specificity of the p genes in X. These tissue-specific values are computed as $t^* = t/\bar{t}$ (element-wise division), i.e., the fold-change in mean expression between the normal tissue associated with the target cancer type and the average in all 18 normal tissues listed in Table 1.
Compute expression deviation between each tumor and associated normal tissue: Let Δ hold the differences between the expression values in X and the normal tissue means in t, i.e., row i in Δ is computed by subtracting t from row i of X. Define two versions of Δ, Δ⁺ and Δ⁻, that capture just the positive or just the negative deviations. Specifically, element $\delta^+_{i,j}$ of Δ⁺ is set to element δ_i,j of Δ if δ_i,j ≥ 0; otherwise, it is set to 0. Similarly, element $\delta^-_{i,j}$ of Δ⁻ is set to element δ_i,j of Δ if δ_i,j < 0; otherwise, it is set to 0.
Compute weighted sample covariance matrices: Two weighted versions of the unbiased sample covariance matrix, $\hat{\Sigma}^+$ and $\hat{\Sigma}^-$, are computed that adjust the sample variance according to normal tissue-specificity. Specifically, diagonal element $\hat{\sigma}^+_{i,i}$ of $\hat{\Sigma}^+$ is set to the sample variance of gene i as computed on X multiplied by $t^*_i$ (the tissue-specificity value for gene i). Similarly, diagonal element $\hat{\sigma}^-_{i,i}$ of $\hat{\Sigma}^-$ is set to the sample variance of gene i divided by $t^*_i$. All off-diagonal elements in $\hat{\Sigma}^+$ and $\hat{\Sigma}^-$ are set to 0. Table 2 captures the impact of this weighting on the sample variance for gene i (i.e., $\hat{\sigma}^2_i$), e.g., the variance is inflated for genes that are up-regulated in the normal tissue relative to other tissues (i.e., $t^*_i > 1$) and have elevated expression in the tumor relative to normal tissue (i.e., δ_i,j ≥ 0). The impact of these variance changes is discussed in more detail below.
Compute modified Mahalanobis distances for positive expression deviations: Let M⁺ be an n × m matrix of squared values of modified Mahalanobis distances. Each column k of M⁺, which holds the positive component of the sample-specific squared distances for gene set k, is calculated as: $M^+[,k] = \mathrm{diag}\left(\Delta^+_k (\hat{\Sigma}^+_k)^{-1} (\Delta^+_k)^T\right)$ (1) where g is the size of gene set k, $\Delta^+_k$ is an n × g matrix containing the g columns of Δ⁺ corresponding to the members of set k, and $\hat{\Sigma}^+_k$ is a g × g matrix containing the g rows and columns of $\hat{\Sigma}^+$ corresponding to the members of set k.
Compute modified Mahalanobis distances for negative expression deviations: Let M⁻ be an n × m matrix of squared values of modified Mahalanobis distances. Similar to M⁺, each column k of M⁻, which holds the negative component of the sample-specific squared distances for gene set k, is calculated as: $M^-[,k] = \mathrm{diag}\left(\Delta^-_k (\hat{\Sigma}^-_k)^{-1} (\Delta^-_k)^T\right)$ (2) where g is the size of gene set k, $\Delta^-_k$ is an n × g matrix containing the g columns of Δ⁻ corresponding to the members of set k, and $\hat{\Sigma}^-_k$ is a g × g matrix containing the g rows and columns of $\hat{\Sigma}^-$ corresponding to the members of set k.
Compute modified Mahalanobis distances for positive and negative expression deviations: Let M be an n × m matrix of squared values of modified Mahalanobis distances that capture both positive and negative expression deviations from the associated normal tissue. These total squared distances are simply computed as the sum of the positive and negative distances: $M = M^+ + M^-$ (3)
Compute modified Mahalanobis distances on permuted X: To capture the distribution of the squared modified Mahalanobis distances under the H₀ that the expression values in X are uncorrelated with no mean difference between samples, the M, M⁺, and M⁻ matrices are recomputed on a version of X where the row labels of each column are randomly permuted. Let X_p represent the row-permuted version of X and let M_p, $M^+_p$, and $M^-_p$ hold the squared modified Mahalanobis distances computed on X_p according to (1), (2), or (3).
Fit gamma distribution to columns of M_p, $M^+_p$, and $M^-_p$: A separate gamma distribution is fit using the method of maximum likelihood (as implemented by the fitdistr() function in the MASS R package [33]) to the non-zero elements in each column of M_p, $M^+_p$, and $M^-_p$. Let $\hat{\alpha}_k$ and $\hat{\beta}_k$ represent the gamma shape and rate parameters estimated for gene set k on M_p using this procedure. For $M^+_p$ and $M^-_p$, these estimated parameters are $\hat{\alpha}^+_k$, $\hat{\beta}^+_k$, $\hat{\alpha}^-_k$, and $\hat{\beta}^-_k$.
Use gamma cumulative distribution function (CDF) to compute tumor-specific gene set scores: The tumor-specific gene set scores are set to the gamma CDF value for each element of M, M⁺, and M⁻. Specifically, each column k of S, which holds the tumor-specific scores for gene set k, is calculated as: $S[,k] = F_\gamma^{(\hat{\alpha}_k, \hat{\beta}_k)}(M[,k])$ (4) where $F_\gamma^{(\hat{\alpha}_k, \hat{\beta}_k)}()$ is the CDF for the gamma distribution with shape $\hat{\alpha}_k$ and rate $\hat{\beta}_k$. Under the H₀ of uncorrelated expression, valid p-values can be generated by subtracting the elements of S from 1. The elements of S⁺ and S⁻ are populated using the same approach for the elements of M⁺ and M⁻ and gamma distributions fit on $M^+_p$ and $M^-_p$.
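The distance steps above can be sketched in a few lines of Python. This is a minimal illustration, not the TPAC R package implementation: the function names are hypothetical, plain lists stand in for matrices, and zero-variance genes are assumed to have been removed already. Because the weighted covariance matrices are diagonal, each squared distance reduces to a sum over set members of the squared deviation divided by the weighted variance.

```python
import random

def sample_variance(column):
    """Unbiased sample variance of a list of numbers."""
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / (len(column) - 1)

def tpac_distances(X, t, t_bar, gene_set):
    """Squared modified Mahalanobis distances (positive, negative, total) for one gene set.

    X: list of n rows (tumors), each a list of p expression values.
    t: normal tissue mean per gene; t_bar: mean over all normal tissues per gene.
    gene_set: column indices of the set members.
    """
    spec = {j: t[j] / t_bar[j] for j in gene_set}               # tissue-specificity
    var = {j: sample_variance([row[j] for row in X]) for j in gene_set}
    m_plus, m_minus = [], []
    for row in X:
        d_plus = d_minus = 0.0
        for j in gene_set:
            delta = row[j] - t[j]                                # deviation from normal tissue
            if delta >= 0:
                d_plus += delta ** 2 / (var[j] * spec[j])        # variance multiplied by specificity
            else:
                d_minus += delta ** 2 / (var[j] / spec[j])       # variance divided by specificity
        m_plus.append(d_plus)
        m_minus.append(d_minus)
    total = [a + b for a, b in zip(m_plus, m_minus)]
    return m_plus, m_minus, total

def permute_columns(X, seed=0):
    """Independently shuffle the values within each column (gene) of X."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*X)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

# Toy data (made up): 3 tumors, 2 genes, one gene set containing both genes.
X = [[4.0, 1.0], [2.0, 3.0], [6.0, 2.0]]
t, t_bar = [3.0, 2.0], [1.5, 4.0]
m_plus, m_minus, m_total = tpac_distances(X, t, t_bar, gene_set=[0, 1])
null_plus, null_minus, null_total = tpac_distances(permute_columns(X), t, t_bar, [0, 1])
```

In the toy data, gene 0 is tissue-specific (specificity 2) and gene 1 is suppressed in the normal tissue (specificity 0.5), so an up-regulated gene 1 contributes more to the positive distance than an equally up-regulated gene 0 would for the same sample variance.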

Table 2. Impact of tissue-specific weighting on sample variance for gene i.

https://doi.org/10.1371/journal.pcbi.1011717.t002

The TPAC method is motivated in part by our previously developed Variance-adjusted Mahalanobis (VAM) method [34], which uses a modified Mahalanobis distance for cell-level gene set testing of single cell RNA-sequencing data. For the VAM approach, only positive distances, measured relative to the origin, are used and the sample covariance matrix is modified to capture just the technical component of gene expression variance. Similar to the use of gamma CDF values for the VAM method, the use of the gamma CDF to generate TPAC scores has several important benefits: 1) it enables gene set inference for individual tumors, 2) it transforms the distances for gene sets of different sizes into a common scale, and 3) it produces scores that are bounded between 0 and 1 and robust to large expression values.
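How a gamma CDF maps squared distances onto a bounded 0 to 1 scale can be sketched as follows. Two assumptions should be noted: the fit below uses the method of moments as a simple stand-in for the maximum likelihood fit performed by fitdistr(), and the CDF is evaluated with a basic series expansion that is adequate only for moderate arguments. All function names and example values are hypothetical.

```python
import math

def gamma_moment_fit(values):
    """Method-of-moments shape and rate from the non-zero values (stand-in for an ML fit)."""
    nonzero = [v for v in values if v > 0]
    mean = sum(nonzero) / len(nonzero)
    var = sum((v - mean) ** 2 for v in nonzero) / (len(nonzero) - 1)
    return mean ** 2 / var, mean / var

def gamma_cdf(x, shape, rate, max_terms=10000, tol=1e-14):
    """Gamma CDF via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = rate * x
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= z / (shape + n)
        total += term
        if term < tol * total:
            break
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))

# Hypothetical squared distances computed on permuted data for one gene set.
null_distances = [0.8, 1.1, 2.3, 0.0, 1.7, 3.0]
shape, rate = gamma_moment_fit(null_distances)

# Hypothetical observed squared distances for three tumors.
observed = [0.5, 2.0, 9.0]
scores = [gamma_cdf(m, shape, rate) for m in observed]   # values in [0, 1]
p_values = [1.0 - s for s in scores]                      # tumor-level p-values
```

Larger distances yield scores closer to 1 and therefore smaller p-values, regardless of the raw scale of the distances.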

Choice of S, S⁺ or S⁻

The three different TPAC generated score matrices, S, S⁺ and S⁻, capture distinct features of pathway dysregulation within each tumor and the choice of which scores to use will therefore vary depending on the analysis goals. For all matrices, large scores correspond to tumors that are more significantly dysregulated relative to the corresponding normal tissue and are therefore more likely on average to be associated with a poor prognosis or advanced cancer stage.

S⁺: Large values in S⁺ correspond to tumors where expression of gene set members is elevated relative to the associated normal tissue. The use of normal tissue-specificity to adjust sample variances (as detailed in Table 2 above) will prioritize expression differences for genes that are normally suppressed in the associated normal tissue, i.e., a tumor is considered more dysregulated if genes that are expressed at a low level in the associated normal tissue relative to other tissues are up-regulated in the tumor.
S⁻: Large values in S⁻ correspond to tumors where expression of gene set members is down-regulated relative to the associated normal tissue. In contrast to the impact on S⁺, the use of normal tissue-specificity to adjust sample variances leads to larger S⁻ values when tissue-specific genes are down-regulated in the tumor, i.e., a tumor is considered more dysregulated if genes that are expressed at a high level in the associated normal tissue relative to other tissues are down-regulated in the tumor.
S: Large values in S correspond to tumors where expression of gene set members exhibits a combination of up and down-regulation relative to the associated normal tissue.

For most of the analysis results presented below, we use the scores in the S matrix since these capture a wider range of dysregulation patterns.

TPAC implementation

The TPAC method detailed above is implemented by the TPAC R package that is available on CRAN (https://cran.r-project.org/web/packages/TPAC). The required normal tissue expression data from the HPA is made available via the TPACData R package, which is imported by TPAC and is also available on CRAN (https://cran.r-project.org/web/packages/TPACData). Readers interested in using the TPAC method should start by reviewing the liver cancer vignette (included in S2 Text and embedded in the TPAC package) which illustrates the computation of TPAC scores for TCGA liver cancer RNA-seq data using MSigDB Hallmark gene sets. The TPAC R package documentation contains complete details on the supported functions.

Comparison methods

We compared the performance of the TPAC method against four existing single sample gene set testing techniques: GSVA [35], ssGSEA [36], the z-scoring method of Lee et al. [37] (referred to throughout the remainder of the paper as the ‘z-scoring method’), and GRAPE [38]. We also included results for a ‘null’ version of TPAC as a negative control (i.e., TPAC scores with permuted sample labels to break any associations with cancer prognosis or stage), and a version of TPAC that does not use tissue-specific weights to adjust gene expression sample variances.

GSVA and ssGSEA are both widely used competitive techniques that generate sample-level scores using a Kolmogorov-Smirnov (KS) like random walk statistic computed on the gene ranks within each sample following some form of gene standardization across the samples. The z-scoring method is a simple self-contained technique that fits a standard normal distribution to the average of set genes across all samples. For GSVA, ssGSEA and the z-scoring method, the implementation in the GSVA R package was used with default parameter settings. Like the z-scoring technique and TPAC, the GRAPE method only uses the expression values for genes within the evaluated set. Similar to TPAC, and unique among the four comparison methods, GRAPE generates scores for each sample by comparison with a reference expression profile. In particular, GRAPE scores are based on the distance between a binarized version of the set gene expression values for a given sample and a binarized reference template profile (or profile collection). For GRAPE, the binarization of a gene set with m members results in an m(m − 1)/2 length binary vector whose elements are indicator variables for an inequality operator applied to all gene pairs, i.e., the value for the i, j comparison is 1(g_i < g_j) with ties resolved randomly, and the distance between binary vectors is computed as the weighted average absolute difference of the elements. For the results in this paper, GRAPE scores were computed using the makeGRAPE_psMat() function in the GRAPE R package with the reference template set to the HPA expression profile for the associated normal tissue and all other parameters set to default values.
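The pairwise binarization used by GRAPE can be illustrated with a short sketch. This is not the GRAPE R package code: the function names are hypothetical, equal weights are assumed when none are supplied, and tied values are simply coded as 0 rather than resolved randomly.

```python
from itertools import combinations

def binarize(expression):
    """Pairwise-order indicators 1(g_i < g_j) for all gene pairs i < j (ties coded as 0 here)."""
    return [1 if expression[i] < expression[j] else 0
            for i, j in combinations(range(len(expression)), 2)]

def template_distance(sample, reference, weights=None):
    """Weighted average absolute difference between two binarized profiles."""
    b_sample, b_ref = binarize(sample), binarize(reference)
    if weights is None:
        weights = [1.0] * len(b_sample)   # assumption: equal weights for all gene pairs
    diff = sum(w * abs(a - b) for w, a, b in zip(weights, b_sample, b_ref))
    return diff / sum(weights)

# A set with m = 3 genes yields m(m - 1)/2 = 3 pairwise indicators.
sample = [1.0, 3.0, 2.0]       # hypothetical tumor expression for the set genes
reference = [1.0, 2.0, 3.0]    # hypothetical normal tissue profile
distance = template_distance(sample, reference)
```

Because only the ordering of gene pairs is used, rescaling the sample in any order-preserving way leaves the binarized vector unchanged, which is why this representation behaves like a rank-based one.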

Although GRAPE is the most similar of the comparison methods to TPAC, there are a number of important differences between the two approaches: 1) the binarization approach used by GRAPE is equivalent to a rank-based representation whereas TPAC measures distances directly on the normalized expression values, 2) GRAPE computes distances as the averaged weighted absolute difference between binarized vectors and thereby ignores direction; TPAC separately models increased and decreased expression in the sample relative to the reference, 3) TPAC scores have a valid null distribution so can be used for inference; GRAPE scores do not have a probabilistic interpretation, and 4) TPAC is specifically designed for the analysis of cancer gene expression data and adjusts scores based on gene normal tissue-specificity; although the GRAPE paper included TCGA analysis results, the method is not specifically optimized for the analysis of cancer data, the reported analyses did not score tumor expression data relative to the associated normal tissue and the method does not account for normal tissue-specificity.

TCGA analyses

The TPAC method and the comparative methods outlined above in the ‘Comparison methods’ section were used to generate single sample gene set scores for TCGA RNA-seq data from the 21 cohorts listed in Table 1 and the 50 gene sets from the MSigDB Hallmark collection. The TPAC S matrix scores and scores from the comparative methods were used for the following analyses:

Landscape of pan-cancer gene set dysregulation: Single sample gene set scores for tumors from all TCGA cohorts and Hallmark gene sets were clustered and visualized. This analysis was performed to explore the pattern of gene set dysregulation across multiple cancer types.
Survival analysis: Univariable Cox proportional hazards models were fit for each cohort using progression-free interval (PFI) as the outcome and single sample gene set scores as a single predictor variable. The rationale for this analysis is an assumption that cancer prognosis is associated with the level of transcriptional dysregulation of important biological processes (as represented by Hallmark gene sets). Given this assumption, we expected to find more statistically significant associations between TPAC scores and PFI than for the scores generated by comparative single sample gene set testing methods.
Tumor/lymph node stage analysis: For each TCGA cohort, a Wilcoxon rank sum test was performed on the single sample scores generated by each method for all Hallmark gene sets. For the analysis of tumor stage, the TCGA tumor stage was discretized as T01 vs non-T01. For the analysis of lymph node stage, the TCGA lymph node stage was discretized as N0 vs non-N0. Similar to the rationale for performing a survival analysis using single sample gene set scores, this analysis was motivated by the assumption that transcriptional dysregulation is associated with cancer progression (as represented by tumor stage).
Single tumor inference: To evaluate the use of TPAC scores for tumor-level inference, the CDF values in the S matrices for all cohorts were converted to p-values. False discovery rate (FDR) values were then computed for the family containing all gene set scores across the 21 TCGA cohorts and 50 Hallmark gene sets using the method of Benjamini and Hochberg [39].
Kaplan-Meier analysis: To explore the association between TPAC score significance and cancer prognosis, a Kaplan-Meier analysis was performed for the TCGA KIRP cohort and the progression-free interval (PFI) outcome with patients stratified according to the significance of the TPAC score for the MSigDB Hallmark MYC Targets V1 gene set. Significance was determined according to whether the FDR value associated with the TPAC score was < 0.25 where the family of hypotheses included the TPAC scores for all 50 Hallmark gene sets for all 321 KIRP samples with PFI data (16,050 total hypotheses).
Transcription factor activity: To explore the biological features represented by TPAC gene set dysregulation scores, we estimated sample-level transcription factor (TF) activity from the TCGA gene expression data for all analyzed cohorts using the decoupleR method [40]. Following the logic in the ‘Transcription factor activity inference in bulk RNA-seq’ vignette associated with the decoupleR Bioconductor R package, this analysis executed the decoupleR method using the run_wmean() function with transcription factor target information obtained from the Collection of Transcriptional Regulatory Interactions (CollecTRI) database [41]. This decoupleR analysis generated a matrix of sample-level activity estimates for 618 TFs for all tumors in the analyzed TCGA cohorts. Spearman rank correlations were then computed between the TF activity estimates and the TPAC scores for all MSigDB Hallmark gene sets.
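The single tumor inference and Kaplan-Meier stratification steps both rely on converting S matrix values to p-values and then to FDR values. A minimal sketch of that conversion is shown below; the function name and the example scores are hypothetical, and the adjustment follows the standard Benjamini-Hochberg step-up procedure.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical S matrix entries (gamma CDF values) for four tumor/gene set pairs.
scores = [0.999, 0.90, 0.97, 0.50]
p_values = [1.0 - s for s in scores]
fdr = benjamini_hochberg(p_values)
significant = [q < 0.25 for q in fdr]   # threshold used for the KIRP stratification
```

In practice the family passed to the adjustment would contain every tumor and gene set combination under study, e.g., 16,050 hypotheses for the KIRP analysis.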

Results

Inference with TPAC scores maintains type I error control

To explore the statistical properties of the TPAC method, normalized count data was simulated under both the null and alternative hypotheses. To estimate the type I error rate, we generated a 10,000-by-100 matrix of independent Poisson random variables with λ = 5 to which library size normalization was applied. For this null simulation, p-values derived from the S matrix generated by TPAC for a single set containing the first 20 variables maintained a type I error rate of 0.0527 at α = 0.05. To estimate power, a similar 10,000-by-100 matrix of Poisson counts was simulated with a constant offset of λ * δ added to the values for the first 1,000 samples and 20 variables with δ ∈ {0.1, 0.2, …, 2}. Fig 1 illustrates the estimated empirical power for different effect sizes at α = 0.05.
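The data-generating part of this simulation can be sketched at a reduced scale. The sketch rests on assumptions: the function names are hypothetical, Poisson variates are drawn with Knuth's multiplication method, and library size normalization is taken to mean rescaling each sample to the mean library size, which is one common convention and may differ from the exact procedure used for the reported results.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_counts(n_samples, n_genes, lam=5.0, delta=0.0, n_shifted=0, set_size=0, seed=1):
    """Independent Poisson counts; lam * delta is added for the first n_shifted samples
    and the first set_size genes (delta = 0 gives the null simulation)."""
    rng = random.Random(seed)
    counts = []
    for i in range(n_samples):
        row = [float(poisson(rng, lam)) for _ in range(n_genes)]
        if i < n_shifted:
            for j in range(set_size):
                row[j] += lam * delta
        counts.append(row)
    return counts

def library_size_normalize(counts):
    """Rescale each sample so that its total equals the mean library size."""
    totals = [sum(row) for row in counts]
    target = sum(totals) / len(totals)
    return [[v * target / tot for v in row] for row, tot in zip(counts, totals)]

# Reduced-scale version of the design: 200 samples, 100 genes, 20 shifted samples, 20 set genes.
null_data = library_size_normalize(simulate_counts(200, 100))
alt_data = library_size_normalize(
    simulate_counts(200, 100, delta=0.5, n_shifted=20, set_size=20))
```

Type I error and power would then be estimated by scoring the first 20 variables as a single gene set and counting how often the resulting p-values fall below α.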

Fig 1. TPAC empirical power for different simulated effect sizes.

The simulation model is detailed in section ‘Inference with TPAC scores maintains type I error control’.

https://doi.org/10.1371/journal.pcbi.1011717.g001

TPAC scores reveal pan-cancer landscape of gene set dysregulation

Fig 2 illustrates the overall pattern of S matrix TPAC scores for the MSigDB Hallmark gene sets across tumors from all 21 evaluated TCGA cohorts (a similar heatmap for GSVA scores is included as Fig A in S1 Text). As seen in this figure, tumors cluster into four primary groups according to TPAC scores (these clusters are annotated along the bottom of the heatmap):

Overall dysregulation: Tumors that are highly dysregulated across all Hallmark gene sets (i.e., tumors represented by the left-most columns in the heatmap)
Minimal dysregulation: Tumors that exhibit limited gene expression dysregulation (i.e., tumors represented by columns in the middle of the heatmap).
Immune signaling dysregulation: Tumors that show pronounced dysregulation among gene sets related to immune cell signaling (i.e., tumors represented by columns to the immediate right of the minimally dysregulated tumors)
Proliferation dysregulation: Tumors that show pronounced dysregulation among proliferation gene sets (i.e., tumors represented by the right-most columns in the heatmap)

Fig 2. Heatmap illustrating the pan-cancer distribution of S matrix TPAC scores for the MSigDB Hallmark gene sets.

Annotations along the top reflect cancer type and annotations on the bottom represent the four main types of dysregulation pattern (1: overall dysregulation across all gene sets, 2: minimal dysregulation, 3: immune signaling dysregulation, and 4: proliferation dysregulation).

https://doi.org/10.1371/journal.pcbi.1011717.g002

TPAC scores have more significant associations with cancer prognosis than scores generated by comparative methods

Fig 3 illustrates the distribution of p-values from univariable Cox proportional hazards models fit for each TCGA cohort that use progression-free interval (PFI) as the outcome. Separate Cox models were fit for each combination of TCGA cohort, Hallmark gene set, and gene set testing method with the distribution of all p-values associated with each analysis method plotted as a separate curve. As shown in the figure, models using TPAC-generated gene set predictors provided the most significant associations (286), as determined by Cox models with FDR values ≤ 0.1. As expected, the null version of TPAC had 0 significant results. The results for each TCGA cohort are visualized in separate Q-Q plots in Fig C in S1 Text, which demonstrates significant variability in prognostic signal and relative method performance across the different cancer types. The magnitude and direction of the PFI associations for the TPAC-generated gene set scores are visualized in Fig D in S1 Text. As illustrated by Fig D in S1 Text, gene set dysregulation is usually associated with an unfavorable cancer prognosis and the strength of the association varies across the TCGA cohorts.
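The coordinates of a Q-Q plot against the U(0, 1) distribution can be computed as sketched below. The function name is hypothetical and the plotting positions (i + 0.5)/n are one common choice; the figures in this paper may use a different convention or a log transformation.

```python
def uniform_qq_points(p_values):
    """Pair each sorted p-value with its expected U(0, 1) plotting position."""
    n = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 0.5) / n for i in range(n)]
    return list(zip(expected, observed))

# Hypothetical p-values from four Cox models.
points = uniform_qq_points([0.40, 0.01, 0.75, 0.03])
```

Points falling below the diagonal (observed p-values smaller than expected) indicate an excess of significant associations relative to the null.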

Fig 3. Q-Q plot comparing p-values from Cox proportional hazards models that use single sample gene set scores generated by TPAC and each of the comparison techniques as the single predictor and PFI as the outcome against the U(0, 1) distribution expected under the null.

For each evaluated method, a separate test is performed for all 50 MSigDB Hallmark pathways for each of the 21 analyzed TCGA cancer types for a total of 1,050 tests per method. The results for each evaluated single sample gene set testing method are plotted separately with the number of hypothesis tests out of a family of 1,050 tests associated with FDR values ≤ 0.1 listed in parentheses after the method in the legend.

https://doi.org/10.1371/journal.pcbi.1011717.g003

TPAC scores are strongly associated with tumor and lymph node stage

Fig 4 and Fig F in S1 Text provide a visualization similar to Fig 3 and Fig D in S1 Text of the p-values from statistical models based on single sample gene set scores computed using TPAC and the other comparison methods. For these figures, the p-values capture the association between gene set scores and discretized tumor stage (T01 vs. other) as evaluated using a Wilcoxon rank sum test. For these analyses, the GRAPE scores produce the largest number of significant results (330), with the TPAC variants a close second (299 and 265) and all other comparison methods producing a markedly smaller number of significant results (104, 137, and 147). Importantly, the two methods that generate scores based on the deviation in gene expression between each tumor and the corresponding normal tissue (i.e., TPAC and GRAPE) have substantially more significant associations than techniques that do not use a normal tissue reference. It is interesting to note that the version of TPAC that does not include an adjustment for tissue-specificity outperforms the version that includes the tissue-specific adjustment. The results for each TCGA cohort are visualized in separate Q-Q plots in Fig E in S1 Text.
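To make the stage comparison concrete, the following minimal Python sketch implements a two-sided Wilcoxon rank sum test for two groups of gene set scores (e.g., T01 tumors vs. all others). It is an illustration under stated assumptions, not the implementation used for the paper: it uses the normal approximation with tie and continuity corrections, and the function name `rank_sum_test` and the example scores are hypothetical.

```python
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation, tie-corrected)."""
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = mid_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical, so there is no evidence of a shift
        return 1.0
    z = max((abs(u - mean_u) - 0.5) / var_u**0.5, 0.0)  # continuity correction
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical scores for low-stage and higher-stage tumors.
low_stage_scores = [0.10, 0.15, 0.22, 0.30, 0.31, 0.40, 0.42, 0.45]
high_stage_scores = [0.55, 0.60, 0.62, 0.70, 0.75, 0.80, 0.88, 0.95]
p_value = rank_sum_test(low_stage_scores, high_stage_scores)
```

For small samples an exact test would be preferable to the normal approximation used in this sketch.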

Fig 4. Q-Q plot comparing the distribution of p-values from Wilcoxon rank sum tests comparing single sample gene set scores generated by TPAC and each of the comparison techniques for tumors with stage T01 vs. the scores for tumors with higher stages.

For each evaluated method, a separate test is performed for all 50 MSigDB Hallmark pathways for each of the 21 analyzed TCGA cancer types for a total of 1,050 tests per method. The results for each evaluated single sample gene set testing method are plotted separately with the number of hypothesis tests out of a family of 1,050 tests associated with FDR values ≤ 0.1 listed in parentheses after the method in the legend.

https://doi.org/10.1371/journal.pcbi.1011717.g004

Similar to Fig 4 and Fig F in S1 Text, Fig 5 and Fig H in S1 Text visualize the association between single sample gene set scores and lymph node stage associated with each tumor. Similar to the tumor stage analysis, the GRAPE method generated the most significant results (216) for the lymph node stage analysis with TPAC second (173). In this case, the standard TPAC generated substantially more significant associations (173) than the non-tissue specific TPAC (118). The results for each TCGA cohort are visualized in separate Q-Q plots in Fig G in S1 Text.

Fig 5. Q-Q plot comparing the distribution of p-values from Wilcoxon rank sum tests comparing single sample gene set scores generated by TPAC and each of the comparison techniques for tumors associated with lymph node stage N0 vs. the scores for tumors associated with higher lymph node stages.

For each evaluated method, a separate test is performed for all 50 MSigDB Hallmark pathways for each of the 21 analyzed TCGA cancer types for a total of 1,050 tests per method. The results for each evaluated single sample gene set testing method are plotted separately with the number of hypothesis tests out of a family of 1,050 tests associated with FDR values ≤ 0.1 listed in parentheses after the method in the legend.

https://doi.org/10.1371/journal.pcbi.1011717.g005

Sample-level inference on TPAC scores highlights the significant elements of pan-cancer gene set dysregulation

Fig 6 illustrates the use of TPAC scores for tumor-specific inference regarding gene set dysregulation. To generate this heatmap, the CDF values in the S matrices for all TCGA cohorts were converted to p-values. False discovery rate (FDR) values were then computed for the family containing all gene set scores across the 21 TCGA cohorts and 50 Hallmark gene sets (461,900 total hypotheses) using the method of Benjamini and Hochberg [39]. For cells corresponding to FDR values ≥ 0.3, the TPAC scores are set to 0 and the modified TPAC scores are then visualized as a heatmap. For this analysis, 2.6% of the TPAC scores (11,797 individual scores) were significant at an FDR threshold of 0.3.
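The thresholding procedure described above can be sketched in a few lines of Python. This is a hedged illustration, not the code used for the paper: it assumes that each p-value is the upper tail probability 1 − CDF, and the function names `bh_adjust` and `mask_scores` as well as the example values are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def mask_scores(cdf_values, fdr_threshold=0.3):
    """Set scores to 0 when their FDR value is >= fdr_threshold."""
    # Assumption: the p-value for a score is the upper tail, 1 - CDF.
    p_values = [1.0 - c for c in cdf_values]
    fdr_values = bh_adjust(p_values)
    return [c if q < fdr_threshold else 0.0 for c, q in zip(cdf_values, fdr_values)]


# Hypothetical CDF-valued scores for four tumor/gene set combinations.
masked = mask_scores([0.999, 0.5, 0.99, 0.2])
```

In the analysis described above, the family passed to the adjustment would contain all 461,900 scores at once rather than one cohort or gene set at a time.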

Fig 6. Illustration of the pan-cancer significance of S matrix TPAC scores for the MSigDB Hallmark gene sets.

This figure visualizes the same data as Fig 2 but with each TPAC score whose associated FDR value is ≥ 0.3 set to 0 (X total hypotheses).

https://doi.org/10.1371/journal.pcbi.1011717.g006

TPAC score statistical significance effectively stratifies patients according to cancer prognosis

Fig 7 visualizes the use of tumor-level inference for survival analysis. Specifically, the TPAC scores for the TCGA KIRP cohort and Hallmark MYC Targets V1 gene set were discretized according to an FDR threshold of 0.25 (16,050 total hypotheses) and these discretized values were then used for a Kaplan-Meier analysis relative to patient PFI. As shown in this plot, tumors with significant dysregulation of the MYC Targets V1 gene set have a significantly worse prognosis than tumors lacking significant dysregulation. Because only the TPAC method supports score-level inference, comparable results cannot be generated for the other evaluated methods.
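The stratified survival analysis can be illustrated with a small Python sketch of the Kaplan-Meier product-limit estimator applied to two groups defined by an FDR threshold. This is a minimal illustration under stated assumptions, not the code used for the paper: the function names `kaplan_meier` and `stratified_km`, the group labels, and the example data are hypothetical, and events are coded as 1 (progression observed) or 0 (censored).

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Process all subjects sharing the same time together.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


def stratified_km(times, events, fdr_values, threshold=0.25):
    """Kaplan-Meier curves for tumors with FDR < threshold vs. all other tumors."""
    groups = {"significant": ([], []), "not significant": ([], [])}
    for t, e, q in zip(times, events, fdr_values):
        key = "significant" if q < threshold else "not significant"
        groups[key][0].append(t)
        groups[key][1].append(e)
    return {key: kaplan_meier(ts, es) for key, (ts, es) in groups.items()}


# Hypothetical PFI times, event indicators, and per-tumor FDR values.
curves = stratified_km([1, 2, 3, 4], [1, 1, 1, 0], [0.1, 0.5, 0.2, 0.9])
```

A formal comparison of the two curves would additionally require a test such as the log-rank test, which is omitted from this sketch.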

Fig 7. Kaplan-Meier plot for TCGA KIRP cohort and progression-free interval (PFI) outcome with patients stratified according to the significance of the TPAC score for the MSigDB Hallmark MYC Targets V1 gene set.

Significance was determined according to whether the FDR value associated with the TPAC score was < 0.25 where the family of hypotheses included the TPAC scores for all 50 Hallmark gene sets for all 321 KIRP samples with PFI data (16,050 total hypotheses).

https://doi.org/10.1371/journal.pcbi.1011717.g007

Association between TPAC scores and transcription factor activity illuminates the regulatory impact of tumorigenesis

Fig 8 visualizes the rank correlation between tumor-level transcription factor (TF) activity values, as estimated using the decoupleR [40] method, and overall TPAC scores for each analyzed TCGA cohort. Each cell represents the average rank correlation across all 50 MSigDB Hallmark gene sets; only the 25 TFs with the largest average absolute correlation are shown. These results reveal that the estimated activity for most TFs does not have a consistent association with TPAC scores across all Hallmark gene sets and that the pattern of association varies significantly between the different cancer types. However, the cases where TF and TPAC scores do have a consistent direction of association for a cancer type across all of the Hallmark gene sets, e.g., KLF13, THAP11 and IRF7 for GBM, are consistent with prior findings on TFs whose activity is linked to cancer (KLF13 [42], THAP11 [43], and IRF7 [44]). Fig F in S1 Text shows the per-gene set correlation averaged across all cohorts. In this case, the rank correlations are quite small when averaged across all cohorts but, for the top TFs, show a consistent direction of association across all Hallmark gene sets. The TFs captured in Fig F in S1 Text also have known associations with cancer (e.g., KLF4 [45] and PAX2 [46]).
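The averaging behind each cell of Fig 8 can be sketched as follows. This minimal Python illustration assumes that the rank correlation is Spearman's coefficient, which the text does not specify; the function names and the example values are hypothetical, and the sketch assumes that neither input is constant across tumors.

```python
def midranks(values):
    """1-based ranks with ties assigned the average rank of their block."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the mid-ranks."""
    return pearson(midranks(a), midranks(b))


def average_rank_correlation(tf_activity, scores_by_gene_set):
    """Mean Spearman correlation between one TF's activity and each gene set's scores."""
    rhos = [spearman(tf_activity, scores) for scores in scores_by_gene_set.values()]
    return sum(rhos) / len(rhos)


# Hypothetical activity values for one TF and scores for two gene sets in four tumors.
avg_rho = average_rank_correlation(
    [1.0, 2.0, 3.0, 4.0],
    {"GENE_SET_A": [10, 20, 30, 40], "GENE_SET_B": [8, 6, 4, 2]},
)
```

The example also shows why averaged correlations can be small: a TF with opposite associations for two gene sets averages to a value near zero even though each individual correlation is strong.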

Fig 8. Association between transcription factor (TF) activity, as estimated using the decoupleR method, and TPAC scores.

Each cell represents the average rank correlation between overall TPAC scores and TF activity estimates for one of the TCGA cohorts where averaging is performed across all 50 MSigDB Hallmark gene sets. Results are only shown for the 25 TFs with the largest average absolute correlation.

https://doi.org/10.1371/journal.pcbi.1011717.g008

As expected, larger magnitude TF/TPAC correlations are observed when not averaging over cancer types or gene sets. Fig G in S1 Text and Fig H in S1 Text visualize these correlations for the KIRP and KIRC cohorts. Looking specifically at the Hallmark MYC Targets V1 gene set that was used to generate the Kaplan-Meier plot in Fig 7, the 10 TFs with the largest positive and negative rank correlations in the KIRP cohort are listed in Table B in S1 Text and provide some insight into potential regulatory mechanisms underlying the significant association between TPAC score significance for this gene set and KIRP cancer prognosis. Similar to the results shown in Fig 8, Fig F in S1 Text and Fig G in S1 Text, these TFs have known links to cancer (e.g., SIX4 [47] and PURA [48]).

Discussion

Novel features of the TPAC method

The TPAC method incorporates several novel features not found in existing single sample gene set testing methods:

TPAC scores capture the deviation in gene set expression between each analyzed tumor and the associated normal tissue using a novel modified Mahalanobis distance metric. While the GRAPE method also generates scores based on the distance between the expression profile of each sample and a reference, it is not specific to cancer, and, prior to the evaluation in this paper, had not been used to score tumor expression relative to normal tissue.
TPAC adjusts for gene normal tissue-specificity using the weighting scheme detailed in Table 2, which improves the association between the generated gene set scores and both cancer prognosis and lymph node stage.
The TPAC method generates separate scores that capture the positive (S⁺), negative (S⁻) and overall (S) deviations in tumor expression relative to normal tissue. These direction-based scores not only support the use of tissue-specific weights (as detailed in Table 2) but also enable a range of downstream analyses (as detailed in the ‘Choice of S, S⁺ or S⁻’ section). By contrast, the GRAPE method generates scores based on the average weighted absolute difference between binarized vectors, which does not distinguish between positive and negative deviations.
The modified Mahalanobis distances used by TPAC have a null distribution that approximately follows a gamma distribution, which enables both population and sample-level inference. While this probabilistic score model is shared by our previously developed VAM technique [34], the VAM method is not appropriate for the TCGA analyses explored in this paper since it focuses on single cell RNA-seq data and does not support the comparison against a reference profile.
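To illustrate how a gamma null distribution supports sample-level inference, the following Python sketch fits a gamma distribution to a set of null distances and converts an observed distance into a CDF value and p-value. It is a sketch under loud assumptions, not the TPAC implementation: the method-of-moments fit, the function names `fit_gamma_moments` and `gamma_cdf`, and the example distances are all hypothetical, and how TPAC actually estimates the gamma parameters is not described in this section.

```python
import math


def fit_gamma_moments(null_distances):
    """Method-of-moments gamma fit (an assumed estimator): returns (shape, scale)."""
    n = len(null_distances)
    mean = sum(null_distances) / n
    var = sum((d - mean) ** 2 for d in null_distances) / (n - 1)
    return mean * mean / var, var / mean


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series for the regularized lower incomplete gamma function.

    Suitable for moderate x / scale; very large arguments may overflow in this sketch.
    """
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total) and n < 10000:
        term *= z / (shape + n)
        total += term
        n += 1
    log_prefix = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefix))


# Hypothetical null distances and one observed distance for a single gene set.
shape, scale = fit_gamma_moments([1.0, 2.0, 3.0, 4.0, 5.0])
cdf_value = gamma_cdf(9.0, shape, scale)
p_value = 1.0 - cdf_value  # upper tail: large distances indicate dysregulation
```

With shape 1 and scale 1 the gamma distribution reduces to the unit exponential, which gives a simple check that `gamma_cdf(x, 1, 1)` equals 1 − exp(−x).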

Benefits of TPAC for the analysis of cancer transcriptomics

The novel features of the TPAC method underpin several important analytical benefits:

TPAC scores capture tissue-independent features of the tumor transcriptome. A differential expression analysis of two cancer types will generate results that largely mirror the output from a comparison of the associated normal tissues (this effect is visualized in Fig 4 of our prior paper [21]). This strong tissue type signal makes it challenging to disentangle the contribution of normal tissue biology from the impact of mutagenic processes when analyzing cancer transcriptomic data. By quantifying the deviation of each tumor from the associated normal tissue, the TPAC method helps identify the tissue-independent transcriptomic impact of a given cancer. This property of TPAC scores is demonstrated in Fig 2 by the fact that tumors cluster according to common dysregulation patterns rather than by cancer/tissue type.
TPAC scores have the strongest association with cancer prognosis among comparable techniques. Under the assumption that cancer aggressiveness is positively correlated with the disruption of important biological processes (as represented by the MSigDB Hallmark gene sets), we expected to find statistically significant associations between TPAC scores for many of the Hallmark gene sets and cancer prognosis as represented by the progression-free interval (PFI). As shown in Fig 3, TPAC scores have the most significant associations with PFI (286 vs. a range of 164–191 for the other techniques). Importantly, the standard version of TPAC that adjusts for normal tissue-specificity has more significant associations (286) than a variant of TPAC that does not include the tissue-specificity adjustment (247). The strong association between TPAC scores and cancer prognosis is also illustrated by Fig 7, where stratification of KIRP patients according to the statistical significance of the TPAC score for the Hallmark MYC Targets V1 gene set reveals a dramatic difference in PFI (i.e., significant dysregulation of this gene set relative to normal kidney tissue is associated with a much shorter PFI).
TPAC scores have a strong association with both tumor and lymph node stage. Similar to our assumption regarding a link between cancer aggressiveness and pathway dysregulation, we also assumed that cancer progression is marked by an increased dysregulation in key cellular pathways. Given this assumption, we expected to find a statistically significant association between TPAC scores and both tumor and lymph node stage. In contrast to the PFI analysis, scores from the GRAPE method had the most significant associations with tumor stage (330 vs. a range of 104–299 for the other methods) and lymph node stage (216 vs. a range of 108–173). Although TPAC did not produce the most significant associations, its performance was close to GRAPE (299 vs. 330 for tumor stage; 173 vs. 216 for lymph node stage) and markedly above the other techniques (299 vs. a range of 104–147 for tumor stage; 173 vs. a range of 108–116 for lymph node stage). Importantly, the best performing methods (GRAPE and TPAC) both generate scores based on the deviation between tumor expression and expression in the associated normal tissue. In this context, it is worth noting that the version of GRAPE used in the comparative evaluation (i.e., GRAPE with HPA normal tissue expression as the reference profile) is distinct from the version detailed in the GRAPE paper. This use of associated normal tissue data was not explored in the original GRAPE paper and can be viewed as part of the contribution of this manuscript.
TPAC scores are associated with the activity of cancer-linked transcription factors (TFs). As illustrated in Fig 8 and Fig I in S1 Text, Fig J in S1 Text and Fig K in S1 Text, TPAC scores are associated with the estimated activity of TFs that have been experimentally shown to be up- or down-regulated within human cancers. Although the strength of these associations is low and there is significant heterogeneity across different tumor types, the results demonstrate that TPAC scores may be helpful in characterizing the regulatory program driving biological process dysregulation during tumorigenesis.

TPAC limitations

When interpreting the results in this manuscript or considering adoption of the TPAC method, readers should keep several important limitations in mind. First, the results presented in this paper analyzed TCGA bulk RNA-seq data for solid human cancers using expression data for the associated normal tissues from the HPA. While a common processing pipeline was used on both the TCGA and HPA data, batch effects may still exist. The TPAC method (as implemented in the TPAC R package) uses HPA normal tissue bulk RNA-seq data, so users must ensure that an equivalent normalization process is employed if analyzing non-TCGA RNA-seq data. A related limitation is that the TPAC R package only supports the analysis of bulk RNA-seq data for 21 solid human cancer types associated with the 18 normal tissue types whose HPA data is embedded in the package; analysis of other cancer types requires the custom integration of expression data for the relevant normal tissue from HPA or another repository. Analysis of cancer single cell RNA-seq data is not currently supported and will require the development of a cell type, vs. tissue type, reference model.

A second important limitation is that the relative performance of the TPAC method varies considerably across the various analyzed cancer types. As a consequence, potential users of the TPAC method should not assume that the overall analysis results presented in Figs 3, 4 and 5 will necessarily hold for a more focused analysis. The cancer type-specific results in Fig C in S1 Text, Fig E in S1 Text and Fig G in S1 Text should be consulted to understand the expected performance on a specific cancer type and outcome measure. When interpreting the TPAC performance reported in this paper, readers should also keep in mind that TPAC only supports unordered gene sets (i.e., the relationships between genes within a pathway are not considered), that the normal tissue-specific weighting used by TPAC may not improve performance in all cases (e.g., the association between TPAC scores and tumor stage shown in Fig 4), and that an improvement in the number of significant associations (as shown for PFI in Fig 3) is only beneficial as long as the type I error rate is maintained.

Conclusion

Cancer biology is highly tissue-specific: most cancer-driving somatic alterations occur in just a limited number of tissues and inherited mutations frequently have a tissue-specific functional impact. As we have explored in prior work [21], cancer tissue-specificity can be leveraged to improve the power and accuracy of cancer genomic analyses. To leverage the associations between gene activity in normal and malignant tissue for gene set testing, we have developed a new single sample gene set testing method for tumor-derived transcriptomics data named TPAC (tissue-adjusted pathway analysis of cancer). The TPAC method uses the normal tissue-specificity of human genes to compute a robust multivariate distance score that quantifies gene set dysregulation in each profiled tumor and can be used for both population and sample-level inference. As demonstrated through an analysis of TCGA RNA-seq data, TPAC gene set scores are more strongly associated with patient prognosis than the scores generated by existing single sample gene set testing methods.

Supporting information

S1 Text. Includes supplemental methods and results.

https://doi.org/10.1371/journal.pcbi.1011717.s001

(PDF)

S2 Text. Contains a vignette illustrating the analysis of TCGA liver cancer RNA-seq data using MSigDB Hallmark gene sets.

https://doi.org/10.1371/journal.pcbi.1011717.s002

(PDF)

Acknowledgments

We would like to thank the HPA group for providing access to the HPA.normal.FPKM.GDCpipeline.csv file.

References

1. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr, Kinzler KW. Cancer genome landscapes. Science. 2013;339(6127):1546–58. pmid:23539594
2. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40. pmid:21546393
3. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics. 2006;7(1):55–65. pmid:16369572
4. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–15550. pmid:16199517
5. Goeman JJ, Buehlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23(8):980–987. pmid:17303618
6. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology. 2012;8(2):e1002375. pmid:22383865
7. Gene Ontology Consortium. The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res. 2010;38(Database issue):D331–5.
8. Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42(Database issue):D472–7. pmid:24243840
9. Mutation Consequences and Pathway Analysis working group of the International Cancer Genome Consortium. Pathway and network analysis of cancer genomes. Nat Methods. 2015;12(7):615–21.
10. Paull EO, Carlin DE, Niepel M, Sorger PK, Haussler D, Stuart JM. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics. 2013;29(21):2757–64. pmid:23986566
11. Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26(12):i237–45. pmid:20529912
12. Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proc Natl Acad Sci U S A. 2013;110(16):6388–93. pmid:23547110
13. Frost HR, Amos CI. A multi-omics approach for identifying important pathways and genes in human cancer. BMC Bioinformatics. 2018;19(1):479. pmid:30541428
14. Schneider G, Schmidt-Supprian M, Rad R, Saur D. Tissue-specific tumorigenesis: context matters. Nat Rev Cancer. 2017;17(4):239–253. pmid:28256574
15. Sack LM, Davoli T, Li MZ, Li Y, Xu Q, Naxerova K, et al. Profound Tissue Specificity in Proliferation Control Underlies Cancer Drivers and Aneuploidy Patterns. Cell. 2018;173(2):499–514.e23. pmid:29576454
16. Haigis KM, Cichowski K, Elledge SJ. Tissue-specificity in cancer: The rule, not the exception. Science. 2019;363(6432):1150–1151. pmid:30872507
17. Bianchi JJ, Zhao X, Mays JC, Davoli T. Not all cancers are created equal: Tissue specificity in cancer genes and pathways. Curr Opin Cell Biol. 2020;63:135–143. pmid:32092639
18. Davoli T, Xu AW, Mengwasser KE, Sack LM, Yoon JC, Park PJ, et al. Cumulative haploinsufficiency and triplosensitivity drive aneuploidy patterns and shape the cancer genome. Cell. 2013;155(4):948–62. pmid:24183448
19. Buscail L, Bournet B, Cordelier P. Role of oncogenic KRAS in the diagnosis, prognosis and treatment of pancreatic cancer. Nat Rev Gastroenterol Hepatol. 2020;17(3):153–168. pmid:32005945
20. Elledge SJ, Amon A. The BRCA1 suppressor hypothesis: an explanation for the tissue-specific tumor development in BRCA1 patients. Cancer Cell. 2002;1(2):129–32. pmid:12086871
21. Frost HR. Analyzing cancer gene expression data through the lens of normal tissue-specificity. PLoS Comput Biol. 2021;17(6):e1009085. pmid:34143767
22. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–20. pmid:24071849
23. Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, et al. A pathology atlas of the human cancer transcriptome. Science. 2017;357 (6352). pmid:28818916
24. Frost HR. Computation and application of tissue-specific gene set weights. Bioinformatics. 2018. pmid:29659714
25. Wang W, Hao J, Zheng S, Fan Q, He A, Wen Y, et al. Tissue-specific pathway association analysis using genome-wide association study summaries. Bioinformatics. 2017;33(2):243–247. pmid:27651483
26. Sharon M, Vinogradov E, Argov CM, Lazarescu O, Zoabi Y, Hekselman I, et al. The differential activity of biological processes in tissues and cell subsets can illuminate disease-related processes and cell-type identities. Bioinformatics. 2022;38(6):1584–1592. pmid:35015838
27. Sharon M, Gruber G, Argov CM, Volozhinsky M, Yeger-Lotem E. ProAct: quantifying the differential activity of biological processes in tissues, cells, and user-defined contexts. Nucleic Acids Res. 2023;51(W1):W478–W483. pmid:37207335
28. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419. pmid:25613900
29. Liu J, Lichtenberg T, Hoadley KA, Poisson LM, Lazar AJ, Cherniack AD, et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 2018;173(2):400–416.e11. pmid:29625055
30. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–29. pmid:10802651
31. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000;28(1):27–30. pmid:10592173
32. Mahalanobis PC. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta). 1936;2:49–55.
33. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002. Available from: http://www.stats.ox.ac.uk/pub/MASS4.
34. Frost HR. Variance-adjusted Mahalanobis (VAM): a fast and accurate method for cell-specific gene set scoring. Nucleic Acids Res. 2020. pmid:32633778
35. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7. pmid:23323831
36. Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462(7269):108–12. pmid:19847166
37. Lee E, Chuang HY, Kim JW, Ideker T, Lee D. Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008;4(11):e1000217. pmid:18989396
38. Klein MI, Stern DF, Zhao H. GRAPE: a pathway template method to characterize tissue-specific functionality from gene expression profiles. BMC Bioinformatics. 2017;18(1):317. pmid:28651562
39. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Statistical Methodology). 1995; p. 289–300.
40. Badia-I-Mompel P, Vélez Santiago J, Braunger J, Geiss C, Dimitrov D, Müller-Dott S, et al. decoupleR: ensemble of computational methods to infer biological activities from omics data. Bioinform Adv. 2022;2(1):vbac016. pmid:36699385
41. Müller-Dott S, Tsirvouli E, Vázquez M, Flores ROR, i Mompel PB, Fallegger R, et al. Expanding the coverage of regulons from high-confidence prior knowledge for accurate estimation of transcription factor activities. bioRxiv. 2023.
42. Wang Q, Peng R, Wang B, Wang J, Yu W, Liu Y, et al. Transcription factor KLF13 inhibits AKT activation and suppresses the growth of prostate carcinoma cells. Cancer Biomark. 2018;22(3):533–541. pmid:29843216
43. Zhang J, Zhang H, Shi H, Wang F, Du J, Wang Y, et al. THAP11 Functions as a Tumor Suppressor in Gastric Cancer through Regulating c-Myc Signaling Pathways. Biomed Res Int. 2020;2020:7838924. pmid:32908912
44. Qing F, Liu Z. Interferon regulatory factor 7 in inflammation, cancer and infection. Front Immunol. 2023;14:1190841. pmid:37251373
45. He Z, He J, Xie K. KLF4 transcription factor in tumorigenesis. Cell Death Discov. 2023;9(1):118. pmid:37031197
46. Feng Y, Tang Y, Mao Y, Liu Y, Yao D, Yang L, et al. PAX2 promotes epithelial ovarian cancer progression involving fatty acid metabolic reprogramming. Int J Oncol. 2020;56(3):697–708. pmid:31922217
47. Sun X, Ma J, Chen Q, Hou Z, Luo X, Wang G, et al. SIX4 promotes metastasis through STAT3 activation in breast cancer. Am J Cancer Res. 2020;10(1):224–236. pmid:32064163
48. Xu WW, Liao L, Dai W, Zheng CC, Tan XP, He Y, et al. Genome-wide CRISPR/Cas9 screening identifies a targetable MEST-PURA interaction in cancer metastasis. EBioMedicine. 2023;92:104587. pmid:37149929

[ref1] 1. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr, Kinzler KW. Cancer genome landscapes. Science. 2013;339(6127):1546–58. pmid:23539594
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40. pmid:21546393
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics. 2006;7(1):55–65. pmid:16369572
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–15550. pmid:16199517
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Goeman JJ, Buehlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23(8):980–987. pmid:17303618
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology. 2012;8(2):e1002375. pmid:22383865
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Gene Ontology Consortium. The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res. 2010;38(Database issue):D331–5.
View Article
Google Scholar

[26] View Article

[27] Google Scholar

[ref8] 8. Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42(Database issue):D472–7. pmid:24243840

[ref9] 9. Mutation Consequences and Pathway Analysis working group of the International Cancer Genome Consortium. Pathway and network analysis of cancer genomes. Nat Methods. 2015;12(7):615–21.

[ref10] 10. Paull EO, Carlin DE, Niepel M, Sorger PK, Haussler D, Stuart JM. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics. 2013;29(21):2757–64. pmid:23986566

[ref11] 11. Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26(12):i237–45. pmid:20529912

[ref12] 12. Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proc Natl Acad Sci U S A. 2013;110(16):6388–93. pmid:23547110

[ref13] 13. Frost HR, Amos CI. A multi-omics approach for identifying important pathways and genes in human cancer. BMC Bioinformatics. 2018;19(1):479. pmid:30541428

[ref14] 14. Schneider G, Schmidt-Supprian M, Rad R, Saur D. Tissue-specific tumorigenesis: context matters. Nat Rev Cancer. 2017;17(4):239–253. pmid:28256574

[ref15] 15. Sack LM, Davoli T, Li MZ, Li Y, Xu Q, Naxerova K, et al. Profound Tissue Specificity in Proliferation Control Underlies Cancer Drivers and Aneuploidy Patterns. Cell. 2018;173(2):499–514.e23. pmid:29576454

[ref16] 16. Haigis KM, Cichowski K, Elledge SJ. Tissue-specificity in cancer: The rule, not the exception. Science. 2019;363(6432):1150–1151. pmid:30872507

[ref17] 17. Bianchi JJ, Zhao X, Mays JC, Davoli T. Not all cancers are created equal: Tissue specificity in cancer genes and pathways. Curr Opin Cell Biol. 2020;63:135–143. pmid:32092639

[ref18] 18. Davoli T, Xu AW, Mengwasser KE, Sack LM, Yoon JC, Park PJ, et al. Cumulative haploinsufficiency and triplosensitivity drive aneuploidy patterns and shape the cancer genome. Cell. 2013;155(4):948–62. pmid:24183448

[ref19] 19. Buscail L, Bournet B, Cordelier P. Role of oncogenic KRAS in the diagnosis, prognosis and treatment of pancreatic cancer. Nat Rev Gastroenterol Hepatol. 2020;17(3):153–168. pmid:32005945

[ref20] 20. Elledge SJ, Amon A. The BRCA1 suppressor hypothesis: an explanation for the tissue-specific tumor development in BRCA1 patients. Cancer Cell. 2002;1(2):129–32. pmid:12086871

[ref21] 21. Frost HR. Analyzing cancer gene expression data through the lens of normal tissue-specificity. PLoS Comput Biol. 2021;17(6):e1009085. pmid:34143767

[ref22] 22. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–20. pmid:24071849

[ref23] 23. Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, et al. A pathology atlas of the human cancer transcriptome. Science. 2017;357(6352). pmid:28818916

[ref24] 24. Frost HR. Computation and application of tissue-specific gene set weights. Bioinformatics. 2018. pmid:29659714

[ref25] 25. Wang W, Hao J, Zheng S, Fan Q, He A, Wen Y, et al. Tissue-specific pathway association analysis using genome-wide association study summaries. Bioinformatics. 2017;33(2):243–247. pmid:27651483

[ref26] 26. Sharon M, Vinogradov E, Argov CM, Lazarescu O, Zoabi Y, Hekselman I, et al. The differential activity of biological processes in tissues and cell subsets can illuminate disease-related processes and cell-type identities. Bioinformatics. 2022;38(6):1584–1592. pmid:35015838

[ref27] 27. Sharon M, Gruber G, Argov CM, Volozhinsky M, Yeger-Lotem E. ProAct: quantifying the differential activity of biological processes in tissues, cells, and user-defined contexts. Nucleic Acids Res. 2023;51(W1):W478–W483. pmid:37207335

[ref28] 28. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419. pmid:25613900

[ref29] 29. Liu J, Lichtenberg T, Hoadley KA, Poisson LM, Lazar AJ, Cherniack AD, et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 2018;173(2):400–416.e11. pmid:29625055

[ref30] 30. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–29. pmid:10802651

[ref31] 31. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000;28(1):27–30. pmid:10592173

[ref32] 32. Mahalanobis PC. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta). 1936;2:49–55.

[ref33] 33. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002. Available from: http://www.stats.ox.ac.uk/pub/MASS4.

[ref34] 34. Frost HR. Variance-adjusted Mahalanobis (VAM): a fast and accurate method for cell-specific gene set scoring. Nucleic Acids Res. 2020. pmid:32633778

[ref35] 35. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7. pmid:23323831

[ref36] 36. Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462(7269):108–12. pmid:19847166

[ref37] 37. Lee E, Chuang HY, Kim JW, Ideker T, Lee D. Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008;4(11):e1000217. pmid:18989396

[ref38] 38. Klein MI, Stern DF, Zhao H. GRAPE: a pathway template method to characterize tissue-specific functionality from gene expression profiles. BMC Bioinformatics. 2017;18(1):317. pmid:28651562

[ref39] 39. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Statistical Methodology). 1995;57(1):289–300.

[ref40] 40. Badia-I-Mompel P, Vélez Santiago J, Braunger J, Geiss C, Dimitrov D, Müller-Dott S, et al. decoupleR: ensemble of computational methods to infer biological activities from omics data. Bioinform Adv. 2022;2(1):vbac016. pmid:36699385

[ref41] 41. Müller-Dott S, Tsirvouli E, Vázquez M, Flores ROR, i Mompel PB, Fallegger R, et al. Expanding the coverage of regulons from high-confidence prior knowledge for accurate estimation of transcription factor activities. bioRxiv. 2023.

[ref42] 42. Wang Q, Peng R, Wang B, Wang J, Yu W, Liu Y, et al. Transcription factor KLF13 inhibits AKT activation and suppresses the growth of prostate carcinoma cells. Cancer Biomark. 2018;22(3):533–541. pmid:29843216

[ref43] 43. Zhang J, Zhang H, Shi H, Wang F, Du J, Wang Y, et al. THAP11 Functions as a Tumor Suppressor in Gastric Cancer through Regulating c-Myc Signaling Pathways. Biomed Res Int. 2020;2020:7838924. pmid:32908912

[ref44] 44. Qing F, Liu Z. Interferon regulatory factor 7 in inflammation, cancer and infection. Front Immunol. 2023;14:1190841. pmid:37251373

[ref45] 45. He Z, He J, Xie K. KLF4 transcription factor in tumorigenesis. Cell Death Discov. 2023;9(1):118. pmid:37031197

[ref46] 46. Feng Y, Tang Y, Mao Y, Liu Y, Yao D, Yang L, et al. PAX2 promotes epithelial ovarian cancer progression involving fatty acid metabolic reprogramming. Int J Oncol. 2020;56(3):697–708. pmid:31922217

[ref47] 47. Sun X, Ma J, Chen Q, Hou Z, Luo X, Wang G, et al. SIX4 promotes metastasis through STAT3 activation in breast cancer. Am J Cancer Res. 2020;10(1):224–236. pmid:32064163

[ref48] 48. Xu WW, Liao L, Dai W, Zheng CC, Tan XP, He Y, et al. Genome-wide CRISPR/Cas9 screening identifies a targetable MEST-PURA interaction in cancer metastasis. EBioMedicine. 2023;92:104587. pmid:37149929

