A multi-omics framework for survival mediation analysis of high-dimensional proteogenomic data

Seungjun Ahn; Weijia Fu; Maaike van Gerwen; Lei Liu; Zhigang Li

doi:10.1371/journal.pcbi.1014217

Abstract

Survival analysis plays a crucial role in understanding time-to-event (survival) outcomes such as disease progression. Despite recent advancements in causal mediation frameworks for survival analysis, existing methods are typically based on Cox regression and primarily focus on a single exposure or individual omics layers, often overlooking multi-omics interplay. This limitation restricts the potential for integrated biological insights. In this paper, we propose SMAHP, a novel method for survival mediation analysis that simultaneously handles high-dimensional exposures and mediators, integrates multi-omics data, and offers a robust statistical framework for identifying causal pathways on survival outcomes. This is one of the first attempts to introduce the accelerated failure time (AFT) model within a multi-omics causal mediation framework for survival outcomes. Through simulations across multiple scenarios, we demonstrate that SMAHP achieves high statistical power while effectively controlling the false discovery rate (FDR), compared with two other approaches. We further apply SMAHP to the largest head-and-neck carcinoma proteogenomic dataset, detecting a gene whose effect on survival time is mediated by a protein. The R package is freely available on the CRAN repository and published under the General Public License version 3.

Author summary

In this study, we propose SMAHP, a novel multi-omics causal mediation framework that addresses the unique challenges of high-dimensional exposures, high-dimensional mediators, and survival outcomes. To our knowledge, this is the first methodological development specifically focused on survival causal mediation analysis in the context of multi-omics proteogenomic data. SMAHP incorporates a two-stage feature selection procedure combining penalization techniques and sure independence screening to efficiently identify relevant exposure and mediator candidates associated with survival outcomes. Through comprehensive simulation studies, we demonstrate the robustness of our approach. We further illustrate the practical utility of SMAHP by applying it to the largest proteogenomic dataset of head and neck cancer (CPTAC), uncovering a causal mediation pathway where a specific protein negatively mediates the effect of gene expression on survival time in patients with HPV-negative tumors. The proposed methodology is publicly available as an R package, SMAHP, on CRAN, accompanied by a detailed vignette to facilitate reproducibility and application.

Citation: Ahn S, Fu W, van Gerwen M, Liu L, Li Z (2026) A multi-omics framework for survival mediation analysis of high-dimensional proteogenomic data. PLoS Comput Biol 22(4): e1014217. https://doi.org/10.1371/journal.pcbi.1014217

Editor: Chris Amos, UNITED STATES OF AMERICA

Received: July 15, 2025; Accepted: April 9, 2026; Published: April 27, 2026

Copyright: © 2026 Ahn et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The pre-processed RNA-Seq, proteomics, and clinical data (metadata) from the National Cancer Institute-initiated CPTAC are available in the Proteomic Data Commons (https://pdc.cancer.gov/pdc/cptac-pancancer). The SMAHP (all upper cases) R package is freely available in the Comprehensive R Archive Network (CRAN) repository (https://cran.r-project.org/web/packages/SMAHP/index.html).

Funding: The first author (S.A.) was supported in part by National Cancer Institute Cancer Center Support Grant P30CA196521 awarded to the Tisch Cancer Center of the Icahn School of Medicine at Mount Sinai. This work was supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Survival analysis is a powerful tool for understanding time-to-event outcomes, such as patient survival or disease progression. Recent advancements in high-throughput technologies have enabled the profiling of biological data on a scale that was once unimaginable, including proteomics data and RNA-Seq data. These advances open new avenues for uncovering complex relationships between biological entities and survival outcomes. For example, recent studies have identified protein or gene biomarkers associated with survival outcomes in patients with idiopathic pulmonary fibrosis [1], glioblastoma [2], hepatocellular carcinoma [3], small-cell lung carcinoma [4], cardiovascular diseases [5,8], Alzheimer’s disease and related dementias [6], and oropharyngeal carcinoma [7], often using Cox proportional hazards (PH) regression models with or without penalization methods.

While traditional survival analysis methods, such as Cox PH regression, have been widely used to assess the direct effects of individual predictor variables on survival outcomes, they often fall short in capturing indirect effects or mediating pathways. Specifically, models are typically fitted separately to assess associations between genes and survival, and between proteins and survival. Mediation analysis offers a solution by exploring how a hypothesized intermediate variable (i.e., a mediator) transmits the effect of an exposure variable to an outcome. In the context of survival analysis, this approach provides valuable insights into how molecular biomarkers, such as protein expression levels, mediate the relationship between RNA-Seq gene expression and survival.

Over the past few years, pioneering methodological approaches have emerged for causal mediation frameworks with survival outcomes in the analysis of omics data. Notably, High-dimensional Mediation Analysis in Survival Models (HIMAsurvival) [42] first introduced a Cox PH regression framework, incorporating variable selection techniques via sure independence screening (SIS) [39] and minimax concave penalty (MCP)-penalization [37], to estimate and test the effects of high-dimensional mediators on survival outcome. Following this, another study presented a Cox regression-based framework [45], featuring adaptations of de-biased Lasso inference and a joint significance test for better false discovery control compared to HIMAsurvival [42]. More recently, the Mediation Analysis of Survival Outcomes and High-Dimensional Omics Mediators (MASH) [9] proposed a three-step high-dimensional mediator selection procedure (i.e., pre-screening with SIS, MCP for variable selection, and FDR control using the Benjamini-Hochberg (BH) procedure). While the mediator selection procedure is similar to the two methods described earlier [42,45], MASH distinguishes itself by using R²-based measures to estimate mediation effects, with its primary framework again centered on the Cox regression model.

Despite these advances in survival mediation analysis, two interconnected limitations persist. First, existing survival mediation approaches have primarily focused on high-dimensional mediators, with much less attention given to high-dimensional exposures. These studies typically consider a single binary or continuous exposure, overlooking the opportunity to capture the rich information present in high-dimensional exposure variables. A previous study [49] explored a frailty model (or Cox model with random effects) to identify mediation effects in the presence of high-dimensional exposures, but it considers only one mediator at a time. Second, these methods have been applied to individual omics layers (e.g., DNA methylation, metabolomics, and copy number variation data). However, biological systems are complex and driven by interplay across multiple omics layers. As demonstrated in numerous clinical and bioinformatics studies [10–13], multilayer (or multi-omics) analysis has been used to characterize key multi-omics pathways, offering a more comprehensive view of the molecular mechanisms underlying complex diseases such as cancer and Alzheimer’s disease.

Significant gaps still remain, as only one study has attempted to address this issue in the context of survival causal mediation in the analysis of omics data [14]. This unified mediation analysis framework accounts for both multivariable exposures and mediators in relation to survival outcomes, applying it to proteogenomic data to identify genes mediated by proteins associated with survival. However, the primary focus was not specifically on survival outcomes. This study considered a Cox model, which may not be valid if the PH assumption is violated. Additionally, the simulations in the study did not explore a range of censoring rates (set to 50%).

The motivation of this paper is to develop a statistical methodology capable of (1) handling both high-dimensional mediators and exposures, (2) testing the mediation effect between exposure and survival outcomes with proper FDR control, and (3) analyzing multiple omics platforms simultaneously. In addition, from a clinical perspective, most existing mediation methods have primarily been developed for analyses within a single omics layer. In contrast, a growing body of clinical proteogenomic studies [15–21] has demonstrated the clinical relevance of jointly analyzing genomic and proteomic data, highlighting their complementary roles and associations with disease phenotypes across diverse disease types. Despite this clinical importance, high-dimensional causal mediation methods that explicitly integrate proteomic and transcriptomic data remain relatively underdeveloped.

To overcome these challenges and achieve the objectives of this study, we propose a novel approach to mediation analysis, specifically focused on survival outcomes and framed within the counterfactual paradigm. Because we implement and apply this methodology to multilayered, high-dimensional proteogenomic data (high-dimensional genomic exposures from RNA-Seq data and high-dimensional protein mediators from proteomics data), we call the framework Survival Mediation Analysis of High-dimensional Proteogenomic data (SMAHP). The procedure involves three steps. First, we conduct a preliminary screening of high-dimensional mediators and exposures using penalized accelerated failure time (AFT) [35] and MCP-penalized regression models [37], based on their disjoint indirect effects (exposure → mediator and mediator → outcome) or direct effects (exposure → outcome). Second, using the proteins and genes selected in the preliminary screening, we apply SIS [39] to identify “important” protein mediators that relate the gene exposures to the outcome. Third, a joint significance test [47] is performed to determine whether a particular mediator lies in the causal pathway between exposure and survival outcome, with appropriate control of the false discovery rate (FDR).
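
The third step can be illustrated with a minimal sketch. It assumes the common max-P form of the joint significance test (a path is significant only if both of its legs are) followed by the Benjamini-Hochberg step-up procedure; the function names and p-values are hypothetical, and this is not a transcription of the SMAHP package code.

```python
def joint_significance_pvalue(p_exposure_to_mediator, p_mediator_to_outcome):
    """Max-P rule: a path is significant only if both of its legs are."""
    return max(p_exposure_to_mediator, p_mediator_to_outcome)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of hypotheses rejected by the BH step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for four exposure -> mediator -> outcome paths.
alpha_leg = [0.001, 0.020, 0.300, 0.004]  # exposure -> mediator
beta_leg = [0.002, 0.600, 0.010, 0.030]   # mediator -> outcome
joint = [joint_significance_pvalue(a, b) for a, b in zip(alpha_leg, beta_leg)]
selected = benjamini_hochberg(joint, alpha=0.05)
```

In this toy example only the first path survives: the fourth path has two individually small p-values, but its joint p-value of 0.03 exceeds the BH threshold for its rank.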

We demonstrate the advantages of SMAHP through comprehensive simulations and showcase its application to proteogenomic data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) to identify potential causal mediation pathways related to head and neck squamous cell carcinomas (HNSCC). We conclude with a discussion of the summary, challenges, limitations, and intended future directions for both application and methodological research.

Results

Simulation design

We generated exposure variables from a multivariate normal distribution with no correlation between exposures. Each exposure variable had mean 0.4 and standard deviation 0.5. Two covariates, Z₁ and Z₂, were also generated. Mediator variables were then simulated from normal distributions, with a proportion of the mediators associated with the exposures and covariates. Specifically, 40% of the mediators were associated with 10% of the exposures with an effect size of 0.8, as well as with Z₁ and Z₂ with effect sizes of 0.2 each, and had a standard deviation of 0.5. Another 40% of the mediators were associated with 10% of the exposures, also with an effect size of 0.8, but had a standard deviation of 0.3. Additionally, 10% of the mediators were associated only with Z₁ and Z₂, with effect sizes of 0.2 and 0.3, respectively, and had a standard deviation of 0.5. The remaining 10% of the mediators were not associated with any of the exposures or covariates and had a standard deviation of 0.3. Survival outcomes T were simulated using an AFT model, with T associated with randomly selected variables at effect sizes of 0.8, 4.0, and 0.12. The error term in the AFT model followed a normal distribution, and the censoring times were drawn from an exponential distribution, calibrated to achieve a 25% censoring rate. As part of sensitivity analyses, we also explored censoring rates of 50% and 75%. We compared SMAHP with two alternative approaches. The first, which we refer to as SIS + SIS, begins with univariate marginal mediation and outcome models, using SIS to identify exposures, followed by a second SIS step, described as Step 2. The second is a naïve modeling method, in which marginal mediation and outcome models are fitted without any penalization or SIS procedure. In addition, we conducted sensitivity analyses incorporating correlated gene and protein structures, considering correlations among exposure genes and among mediator proteins, with correlation levels set to 0.4.
We also examined robustness by sampling the exposures from a negative binomial distribution with dispersion parameter 3 instead of a multivariate normal distribution. We included additional simulation experiments to assess the robustness of the proposed method to outliers. Specifically, 2% of the exposure values were generated as outliers by sampling from a uniform distribution over an interval defined in terms of the mean and standard deviation of the normal distribution used to generate the exposures. We further evaluated performance using AFT models with two alternative residual distributions (Gamma and logistic error). For each setting, we repeated the simulation 200 times.
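
The calibration of exponential censoring to a target censoring rate can be sketched as follows. This is a simplified illustration with a single exposure, not the authors' simulation code: the function names, the seeds, and the use of bisection on the exponential rate are all assumptions made for the example.

```python
import math
import random


def simulate_aft_times(n, effect=0.8, mean=0.4, sd=0.5, seed=1):
    """Draw one exposure per subject and a log-normal AFT survival time."""
    rng = random.Random(seed)
    exposures = [rng.gauss(mean, sd) for _ in range(n)]
    # AFT model on the log scale: log T = effect * x + normal error.
    times = [math.exp(effect * x + rng.gauss(0.0, 1.0)) for x in exposures]
    return exposures, times


def censored_fraction(times, unit_exp, rate):
    """Share of subjects whose censoring time C = E / rate falls before T."""
    return sum(e / rate < t for e, t in zip(unit_exp, times)) / len(times)


def calibrate_censoring(times, target, seed=2, iters=60):
    """Bisect on the exponential rate until the censored share hits target."""
    rng = random.Random(seed)
    unit_exp = [rng.expovariate(1.0) for _ in times]  # fixed draws, rescaled by rate
    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on the log scale
        if censored_fraction(times, unit_exp, mid) < target:
            lo = mid  # too little censoring: raise the rate
        else:
            hi = mid
    rate = math.sqrt(lo * hi)
    censor = [e / rate for e in unit_exp]
    observed = [min(t, c) for t, c in zip(times, censor)]
    event = [int(t <= c) for t, c in zip(times, censor)]  # 1 = event observed
    return rate, observed, event


exposures, times = simulate_aft_times(2000)
rate, observed, event = calibrate_censoring(times, target=0.25)
censoring_rate = 1 - sum(event) / len(event)
```

Holding the unit-exponential draws fixed and rescaling them by the rate makes the censored fraction monotone in the rate, which is what allows a simple bisection; the same routine can be rerun with targets of 0.50 or 0.75.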

Simulation results

In these simulation experiments, different combinations of n, p, and k were considered. Table 1 summarizes the simulation results when the censoring rate is 25%. Across all scenarios, the naïve method substantially inflated the FDR, even though it achieved high power. In contrast, SMAHP maintained high power while adequately controlling the FDR at the 5% level. The differences in power and FDR between SMAHP and SIS + SIS were small in Scenarios I and II but more pronounced in Scenarios III and IV. For instance, in Scenario III with the smaller sample size (n = 200), our proposed method achieved a power of 0.8296, whereas SIS + SIS had a power of 0.6423. When the censoring rate was increased to 50% (see Table 2), the SIS + SIS approach no longer controlled the FDR, unlike our proposed method. This issue worsened as the censoring rate increased to 75% (see Table 3), regardless of the sample size. In all scenarios at the 75% censoring rate, SIS + SIS exhibited lower power and a more inflated FDR than SMAHP. SMAHP also had an inflated FDR with the smaller sample size, but it quickly regained control over the FDR when the sample size was increased to n = 400. For instance, when the censoring rate was 75%, SMAHP achieved a power of 0.7618 and an FDR of 0.1082, which improved to a power of 0.9843 and an FDR of 0.0580 as the sample size increased.
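
For readers reproducing such comparisons, the sketch below shows one standard way to compute power and FDR from a set of selected paths and a set of true paths, averaged over replicates. The definitions are the usual ones (true positive rate and mean false discovery proportion) and are assumed here rather than taken from the text; the helper names and the toy gene-protein pairs are hypothetical.

```python
def power_and_fdp(selected, truth):
    """Compare a set of selected paths with the set of true paths."""
    selected, truth = set(selected), set(truth)
    true_pos = len(selected & truth)
    power = true_pos / len(truth) if truth else 0.0
    # The false discovery proportion is defined as 0 when nothing is selected.
    fdp = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return power, fdp


def average_over_replicates(results):
    """Average (power, FDP) pairs over simulation replicates."""
    n = len(results)
    return (sum(r[0] for r in results) / n, sum(r[1] for r in results) / n)


# Toy example: four true gene -> protein paths and two replicates.
truth = {("g1", "p1"), ("g2", "p2"), ("g3", "p3"), ("g4", "p4")}
rep1 = power_and_fdp({("g1", "p1"), ("g2", "p2"), ("g3", "p3"), ("g9", "p9")}, truth)
rep2 = power_and_fdp(truth, truth)
mean_power, mean_fdr = average_over_replicates([rep1, rep2])
```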

Table 1. Simulation results using SMAHP (penalizations + SIS), SIS + SIS, and naïve approaches. The data were simulated with censoring rate of 25%.

https://doi.org/10.1371/journal.pcbi.1014217.t001

Table 2. Simulation results using SMAHP (penalizations + SIS), SIS + SIS, and naïve approaches. The data were simulated with censoring rate of 50%.

https://doi.org/10.1371/journal.pcbi.1014217.t002

Table 3. Simulation results using SMAHP (penalizations + SIS), SIS + SIS, and naïve approaches. The data were simulated with censoring rate of 75%.

https://doi.org/10.1371/journal.pcbi.1014217.t003

We also recorded the average computation time for each method (see Tables 1 to 3). Overall, SMAHP required the most computation time compared with the SIS + SIS and naïve methods, which is expected because SMAHP relies on penalized regression. In addition, we explored different penalties for the mediation model in SMAHP, for which MCP is the default. We investigated whether other penalties, such as the elastic-net and Lasso penalties, would yield similar or substantially different results. S1 Table summarizes this experiment, showing similar power and FDR, as well as comparable computational time. Additional sensitivity analyses, suggested by the reviewers, were conducted to further evaluate the performance of SMAHP. Specifically, we examined settings with correlated exposure genes and correlated mediator proteins (S2 Table). Across these settings, the results were broadly consistent with the primary simulation results, with SMAHP showing stable power and FDR control under low to moderate correlation levels. We also evaluated performance when exposures were sampled from a negative binomial distribution (S3 Table). In this setting, SMAHP behaved similarly to the primary simulations, with satisfactory power and FDR control around the nominal 5% level. Sensitivity analyses incorporating outliers (S4 Table) indicated that SMAHP continued to perform well as the sample size increased. Finally, under two alternative residual distributions, Gamma (S5 Table) and logistic (S6 Table), results were consistent with the primary simulations, showing that SMAHP maintained high power and controlled the FDR. The entire set of simulations was run on the high-performance supercomputer Minerva at the Icahn School of Medicine at Mount Sinai, utilizing 10 CPU cores and 8 GB of RAM per node.
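
To make the difference between the MCP and Lasso penalties concrete, the sketch below gives their closed-form solutions for a single standardized predictor, along with the MCP penalty value. These are the textbook univariate formulas, not the SMAHP implementation; the function names and the value gamma = 3 are illustrative assumptions.

```python
def soft_threshold(z, lam):
    """Lasso solution for one standardized predictor."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_threshold(z, lam, gamma=3.0):
    """MCP solution for one standardized predictor (requires gamma > 1)."""
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z  # large signals are left unshrunk


def mcp_penalty(beta, lam, gamma=3.0):
    """Value of the minimax concave penalty at a single coefficient."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # the penalty is flat beyond gamma * lam
```

Both rules set small coefficients exactly to zero, but the Lasso keeps shrinking every retained coefficient by lam, whereas MCP relaxes the shrinkage as the signal grows and applies none beyond gamma * lam.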

Application study: Analysis of clinical proteomic tumor analysis consortium data

We are motivated by the problem of human papillomavirus-negative (HPV-Neg) HNSCC, which remains insufficiently studied, as highlighted by recent clinical research [17,53], despite HNSCC being ranked as the sixth most prevalent epithelial cancer globally [54]. Specifically, a recent CPTAC HNSCC study [17] has emphasized that a comprehensive understanding of how transcriptomic and molecular changes contribute to tumor phenotypes is still lacking, highlighting the critical need for further investigation in this area.

In this study, pre-processed RNA-Seq, proteomics, and clinical data (metadata) from the National Cancer Institute-initiated CPTAC were downloaded from the Proteomic Data Commons (https://pdc.cancer.gov/pdc/cptac-pancancer), which is one of the largest public repositories of proteogenomic data. The CPTAC has been utilized to increase understanding of the molecular mechanisms of cancer through its large-scale, mass spectrometry-based proteomic profiling data of tumor samples, which were previously analyzed by the Cancer Genome Atlas (TCGA) [52]. In comparison to TCGA, CPTAC provides a more extensive proteome coverage.

Overall survival (OS) is defined as the time from cancer diagnosis to death. Patients were censored if the event had not occurred by the last follow-up. The median OS is 46.27 months, and the Kaplan-Meier curve is presented in S1 Fig. The original sample consisted of 109 patients, with 60,669 genes and 9,469 proteins available in the RNA-Seq and proteomics data, respectively. Before applying SMAHP, a univariate AFT model was fitted to each gene and protein separately to pre-screen this ultrahigh-dimensional data. The top 100 genes and top 200 proteins with the smallest p-values were then selected for analysis using SMAHP. Seven patients did not have OS data available and were excluded, resulting in a final sample size of 102 patients, with a censoring rate of 67.6%. After applying SMAHP, there was an indirect effect (p = 0.001) of late cornified envelope 3E protein (LCE3E; Ensembl ID: ENSG00000185966.4) on the association between high-mobility group box 1 pseudogene 23 (HMGB1P23; Ensembl ID: ENSG00000253770.1) and OS, with age included as an additional covariate. Fig 1 provides a summary of the analysis. While the direct effect on survival time is positive, the indirect effect through the mediator appears to be negative. Such a pattern may arise when the exposure influences survival through multiple pathways, with the mediator capturing an effect in the opposite direction to other pathways such as the direct pathway. This observation highlights the complexity of the underlying biological processes and is provided to aid interpretation of the estimated effects.
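
The marginal pre-screening idea described above (score each feature on its own, then keep the top-ranked ones) can be sketched as follows. For brevity, this illustration ranks features by absolute Pearson correlation with an uncensored outcome, which is a stand-in assumption and not the univariate AFT p-value ranking used in the analysis; all names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def marginal_screen(features, outcome, keep):
    """Keep the `keep` feature names with the largest |correlation|."""
    scores = {name: abs(pearson(vals, outcome)) for name, vals in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


# Hypothetical toy data: three features measured on five subjects.
outcome = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.5],
    "geneB": [3.0, 1.0, 4.0, 1.0, 5.0],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.2],
}
kept = marginal_screen(features, outcome, keep=2)
```

Here geneA and geneC are retained because both are strongly (positively or negatively) associated with the outcome, while geneB is screened out.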

Fig 1. A directed acyclic graph summarizing the analysis of CPTAC HNSCC data, highlighting direct and indirect effects with 95% confidence intervals in parentheses using SMAHP.

https://doi.org/10.1371/journal.pcbi.1014217.g001

Although members of the high-mobility group box family have been implicated in carcinogenesis (e.g., high expression of HMGB1 in human nasopharyngeal carcinoma) and autoinflammatory diseases [55], no published literature exists on diseases or disorders specifically associated with HMGB1P23. A search of the MalaCards human disease database (https://www.malacards.org/) [56] also did not yield relevant findings. Interestingly, MalaCards indicated that LCE3E is associated with plantar warts, which are caused by HPV. However, given that the study samples tested HPV-negative, this association should be interpreted with caution. Reported LCE3E-related pathways include keratinization and nervous system development. The total runtime to complete the analysis using SMAHP was 2.28 minutes. The analysis was performed on a 2023 MacBook Pro equipped with an M3 processor and 16 GB of RAM.

Discussion

In this study, we introduce SMAHP, a novel multi-omics causal mediation framework designed to handle high-dimensional exposures, high-dimensional mediators, and survival outcomes. To the best of our knowledge, this is the first framework in the literature focused specifically on identifying causal mediation pathways for time-to-event outcomes using multi-omics data. SMAHP has several key features that make it particularly useful in practical applications where existing methods are not feasible. First, it identifies causal pathways more accurately and tests the validity of the indirect effects within these pathways. Unlike other methods, which often consider a single binary exposure or a single continuous exposure at a time when assessing causal pathways, SMAHP is designed to handle multiple exposures and mediators simultaneously. Second, SMAHP enables researchers to gain a more comprehensive understanding of biological and molecular mechanisms by mapping gene-protein-outcome pathways, rather than analyzing gene-outcome and protein-outcome relationships separately. This is particularly valuable, as highlighted in the introduction, where emerging research underscores the importance of the synergy between proteomics and genomics and their connections to phenotypes across various disease types through proteogenomic analysis [15–21]. Third, the outcome model in our method is based on the AFT model, which is not constrained by the PH assumption, a key requirement of the Cox model. In contrast, all other existing methods rely on the Cox model.

Our simulation study demonstrated that, at least for the scenarios considered, SMAHP maintains high statistical power while appropriately controlling FDR. In general, SMAHP outperforms the SIS + SIS and naïve approaches across different censoring rates, although a larger sample size is needed to restore high power and proper FDR control when the censoring rate is very high (i.e., 75%). The censoring rate is therefore an important consideration when applying the proposed method and has direct implications for study design. Higher censoring rates generally reduce the effective information available for inference, which can lead to decreased statistical power. Our simulation results demonstrate that under high censoring conditions, such as 75%, increasing the sample size can substantially improve power while maintaining appropriate control of the false discovery rate. From a practical perspective, when researchers anticipate a high censoring rate in a biological dataset, larger sample sizes may be required to achieve adequate performance. In contrast, for studies with relatively low censoring rates, the method performs well even with more moderate sample sizes. Therefore, understanding the expected censoring mechanism and rate in a given study can inform decisions regarding sample size planning and the feasibility of applying the proposed framework. In our real-data application to the CPTAC HNSCC dataset, we identified a significant mediating effect of LCE3E on the association between the relatively unexplored HMGB1P23 gene and survival time among the HPV-Neg study population. At present, the proposed method has been demonstrated using a single multi-omics, high-dimensional dataset, and an independent dataset was not available to serve as a traditional validation set. As a result, the same causal pathway may or may not be observed across different datasets, and differences in findings could reflect variations in cohort characteristics, such as age or ethnicity, rather than limitations of the proposed methodology. In future studies, we plan to validate the proposed approach using independent multi-omics datasets with larger sample sizes (if available).

There are several areas where further work is needed, and future extensions are possible. In particular, we explored alternative multiple hypothesis testing controls beyond the BH procedure, including the q-value method [51], Westfall-Young correction [60], and HDMT [59]. However, these approaches were not adopted, as they either resulted in worse performance or were incompatible with modeling high-dimensional exposures (the latter being the case for HDMT). Future methodological development that better accommodates high-dimensional exposure settings could make it possible to incorporate these alternative multiple testing approaches. Given the interdependencies between biological entities, incorporating group-level biological information, such as biological pathways or protein complexes, could be a crucial next step in advancing the analysis [57,58]. By grouping genes (and proteins) based on existing biological knowledge, this approach would help identify how series of interconnected molecular interactions work together to drive specific biological processes, which we loosely describe as a “pathway-level mediation framework”. One potential way to achieve this is to leverage multilevel or generalized linear mixed models to account for clustered data.

Additionally, we considered the penAFT method [35] for the penalized outcome model. We believe that penalization methods for the AFT model are less studied in the literature than those for the Cox model, likely because the latter offers the advantage of estimating hazard ratios, which are commonly used to compare biological conditions. Future research is needed to develop more computationally and statistically robust penalization algorithms to enhance the identification of exposures and mediators in the AFT outcome model. For example, in our study, we assumed a parametric AFT model, with the error term following a normal distribution. In future studies, it would be interesting to consider a nonparametric AFT model. This approach would eliminate the need for pre-processed, normalized data, such as the bioinformatics normalization used to align raw data from different samples, and would help reduce technical noise by not relying on parametric assumptions. Future work may also consider extensions incorporating non-linear or interaction effects. Although feasible in principle, such extensions would substantially increase the computational burden in high-dimensional settings, and interactions between continuous variables may be difficult to interpret biologically, potentially limiting their practical utility. From a computational perspective, further improvements may be achieved by leveraging parallel computing strategies and by incorporating dimensionality reduction or marginal feature screening steps to reduce computational complexity when analyzing increasingly high-dimensional multi-omics data. Finally, a typical limitation of high-dimensional mediation modeling methods is that they cannot disentangle potential sequential causal mediation effects through multiple mediators, owing to the complexity induced by the high dimensionality of the mediators. As with many existing methods, our proposed method does not test sequential causal mediation effects or the cumulative causal mediation effect of multiple mediators.

Materials and methods

Notations and assumptions

Herein, we consider proteogenomic data with a survival outcome T, a vector of proteomes as mediators, a vector of genes as exposures, and additional covariates to be adjusted for, such as age, sex, smoking history, alcohol consumption categories, immune score, and histologic grades, from n i.i.d. observations. For better readability, the subject-level subscripts are suppressed unless otherwise stated.

In this paper, our causal mediation model is implemented within the counterfactual framework [22,23]. In counterfactual notations, we will let denote the potential outcome of the kth proteome when each is set to x, and let be the potential outcome of T with an observed expression levels for all genes when .

The causal effects are assessed by taking the mean expected difference in the counterfactual outcomes that would have been observed [24,25]. On the log scale, we define the interventional indirect effect (IIE) and interventional direct effect (IDE) [61,62] for any two levels of a continuous exposure through the decomposition of a total effect [26].

where for all genes in and or is a random draw from the distribution of or . The exposure values are rescaled in order to assess the changes in causal quantities after the unit increase in original exposure value. The estimates of IDE and IIE are obtained from the mediation and outcome models, respectively, which are discussed in Model Specifications subsection.

We additionally assume that the IDE and IIE are estimated under the assumptions of no unmeasured confounders [27–29] for the exposure-outcome (Eq 1), mediator-outcome (Eq 2), and exposure-mediator (Eq 3) relationships. That is, for each exposure,

(1)

(2)

(3)
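
For orientation, no-unmeasured-confounding conditions of this kind are commonly written as conditional independence statements. The sketch below uses assumed notation, because the original symbols are not recoverable from the text here: X_j denotes an exposure, M the mediators, Z the covariates, and T(x, m) and M(x) the potential outcomes. It illustrates the general form such assumptions take and is not a transcription of Eqs 1 to 3.

```latex
% Assumed notation (hypothetical): X_j exposure, M mediators, Z covariates,
% T(x, m) and M(x) potential outcomes.
\begin{align*}
T(x, m) &\perp X_j \mid Z      && \text{(exposure-outcome)} \\
T(x, m) &\perp M \mid X_j, Z   && \text{(mediator-outcome)} \\
M(x)    &\perp X_j \mid Z      && \text{(exposure-mediator)}
\end{align*}
```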

Model specifications

We consider the following models to describe the causal relationships illustrated in Fig 2. For the outcome model (Eq 4), the high-dimensional accelerated failure time (AFT) model [30–32] is used to estimate and test the effect of the mediators (proteomes) in the causal pathway between continuous exposures (gene expressions) and the survival outcome T, while accounting for clinically meaningful covariates. The data used to fit this outcome model also include the censoring indicator for the log censoring time. Furthermore, without loss of generality, the features are centered to eliminate the intercept. For the mediator model (Eq 5), a linear regression is fitted to model the association between the exposures and each mediator.

(4)

(5)

where is the vector of regression coefficients for the effect of on T; is the vector of regression coefficients for the covariates; is the vector of regression coefficients for the effect of on T in the presence of ; and is a random error following the log-Weibull distribution with scale parameter b [33] (other error distributions, such as the log-normal, can also be accommodated). is the vector of regression coefficients for the effect of on each ; is the vector of regression coefficients for the covariates; and is a normal random error. It is important to note that the estimates of lead to the IDE of on T (i.e., ). The IIE of on the causal pathway between and T (i.e., ) is defined by the product rule of estimates [32,34,45].
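A sketch of how an AFT outcome model and a linear mediator model of this kind are typically written is given below. All symbols are assumed for illustration and are not taken from the authors' Eqs 4 and 5: $X_i$, $M_i$, and $Z_i$ are the exposure, mediator, and covariate vectors of subject $i$, with coefficient vectors $\gamma$, $\beta$, $\theta$, $\alpha_k$, and $\zeta_k$.

```latex
\begin{align*}
\log T_i &= M_i^{\top}\beta + X_i^{\top}\gamma + Z_i^{\top}\theta + b\,\varepsilon_i, \\
M_{ik}   &= X_i^{\top}\alpha_k + Z_i^{\top}\zeta_k + e_{ik}, \qquad k = 1, \dots, K, \\
\mathrm{IDE}_p &= \gamma_p, \qquad \mathrm{IIE}_{pk} = \alpha_{kp}\,\beta_k .
\end{align*}
```

In this assumed notation, the last line is the product rule: for a one-unit increase in the (rescaled) exposure $p$, the direct effect on $\log T$ is the outcome-model coefficient $\gamma_p$, and the indirect effect through mediator $k$ is the exposure-to-mediator coefficient $\alpha_{kp}$ multiplied by the mediator-to-outcome coefficient $\beta_k$.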

Fig 2. A causal diagram to describe the framework of the mediation analysis of high-dimensional mediators with multivariable exposures and survival endpoint.

Here, two types of effects are assessed: (1) the global indirect effect between and T, together with the specific mediation effect transmitted through each proteomic mediator selected by the penalized variable selection (nodes colored in blue), and (2) the direct effect of the multivariable exposures, RNA-Seq gene expressions (nodes colored in orange), on the survival outcome T (the node colored in red).

https://doi.org/10.1371/journal.pcbi.1014217.g002

Step 1: Penalization for selecting mediator and exposure candidates

Penalization for outcome model.

It is likely that proteomes are highly correlated with one another, as are genes. Moreover, it is essential to assess the associations between these biological entities and the outcome. Therefore, we consider a penalized regression technique to identify proteomic mediators and genomic exposures, following recent work [35] on the penalized AFT model that uses a variation of the alternating direction method of multipliers algorithm. For the ith subject, we can derive the following equations from the outcome model (Eq 4)

where the outcome model penalization is applied separately for and , respectively. The estimated active sets, denoted as and , will be determined from each of these models. refers to the active set of pre-screened significant genes for , and refers to the active set of pre-screened significant proteins for . For more details on the penalized AFT regression, refer to the S1 Appendix.

Penalization for mediation model.

We identify important mediators and exposures in the outcome model through the penalization approach detailed in the Penalization for outcome model subsection above. Concurrently, we fit a penalized mediation model for each mediator, using the minimax concave penalty (MCP) [37] for the marginal mediation model (Eq 5). For the ith subject, we express the marginal model for the mediators as follows

See the S1 Appendix for more details on the MCP estimates. From this penalized mediation model, the estimated active set will be identified, consisting of pre-screened significant genes for . In our simulation study, we compared its performance with that of the elastic net [38] and L₁ regularization (Lasso) [36] when modeling the mediator marginally.
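To make the contrast between MCP and the Lasso concrete, the following minimal sketch evaluates the standard MCP function of [37] and the corresponding univariate thresholding rule for a standardized predictor. It is an illustration only, not the SMAHP implementation; the function names and the tuning parameters `lam` (penalty level) and `gamma` (concavity, greater than 1) are hypothetical.

```python
def mcp_penalty(t, lam, gamma):
    """Minimax concave penalty for a single coefficient t (lam > 0, gamma > 1)."""
    a = abs(t)
    if a <= gamma * lam:
        # Starts like the Lasso (slope lam) and flattens as |t| grows.
        return lam * a - a * a / (2.0 * gamma)
    # Constant beyond gamma * lam: large coefficients get no extra penalty.
    return 0.5 * gamma * lam * lam


def lasso_penalty(t, lam):
    """L1 (Lasso) penalty for a single coefficient t."""
    return lam * abs(t)


def mcp_threshold(z, lam, gamma):
    """Univariate MCP solution for a standardized predictor with OLS estimate z."""
    a = abs(z)
    sign = 1.0 if z > 0 else -1.0
    if a <= lam:
        return 0.0                                   # small effects are set to zero
    if a <= gamma * lam:
        return sign * (a - lam) / (1.0 - 1.0 / gamma)  # rescaled soft-thresholding
    return z                                         # large effects are left unshrunk


# Illustrative comparison at lam = 1, gamma = 3 (arbitrary values).
for t in (0.5, 2.0, 5.0):
    print(t, mcp_penalty(t, 1.0, 3.0), lasso_penalty(t, 1.0), mcp_threshold(t, 1.0, 3.0))
```

The flat region of the penalty is what makes MCP estimates nearly unbiased for large coefficients, whereas the Lasso penalty keeps growing linearly and shrinks every non-zero coefficient by the same amount.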

Furthermore, we can rewrite the outcome and mediation models derived from these penalization steps as follows:

(6)

(7)

(8)

where the focus is shifted to (i) being the estimated active set (i.e., non-zero coefficients) of K mediators selected as candidate proteomes to be extensively studied based on the penalized outcome model in Eq 6; (ii) is the estimated active set of P exposures selected as candidate genes from the penalized outcome model in Eq 7; and lastly, (iii) is the estimated active set of P exposures selected as candidate genes from the penalized mediation model in Eq 8.

Step 2: Screening important mediators with control for exposures

In Step 1, candidate proteomes and genes are selected from the penalized AFT and MCP regression models. However, these candidates are chosen based on either the disjoint indirect effect (i.e., and , separately) or the direct effect (i.e., ). Hence, in Step 2, we reformulate the models defined in Eqs 6–8 into those specified for the causal mediation framework in Eqs 4 and 5. The objective of this section is to identify “important” proteomic mediators that account for exposures and to further screen potential mediators in order to reduce dimensionality in high-dimensional data, thereby boosting computational efficiency.

For each , we consider the outcome and mediation models as redefined

where can be estimated by the maximum likelihood estimators of the AFT outcome model, and can be estimated by the ordinary least squares estimators of the linear regression mediation model, respectively.

Leveraging the sure independence screening (SIS) approach [39], the mediator-exposure pairs among the top largest values of are screened. If a pair exhibits meaningful mediation effects, the mediators and genes are selected and denoted as and , where is a subset of , and is the subset of for any , respectively. Otherwise, pairs with minimal mediation effects are dropped. SIS is a computationally efficient method that quickly reduces the dimensionality while retaining the variables with higher correlations [40]. The threshold for selecting pairs is typically and , as used without formal theoretical justification in the original SIS methodology paper [39]. However, the threshold can be increased to a multiple of , such as or , to increase the probability of identifying important mediators, as demonstrated in other studies [41–44].
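The screening step can be illustrated with a small sketch. Two assumptions are made here that are not confirmed by the text above: the screening statistic is taken to be the absolute product of the exposure-to-mediator and mediator-to-outcome estimates, and the number of retained pairs is taken to be a multiple of n/log(n), a threshold commonly used with SIS. The function and argument names are hypothetical, and this is not the SMAHP implementation.

```python
import math


def sis_threshold(n, multiplier=1.0):
    """Number of pairs to keep: floor(multiplier * n / log(n)), at least 1 (assumed rule)."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log(n) > 0")
    return max(1, int(multiplier * n / math.log(n)))


def screen_pairs(alpha, beta, n, multiplier=1.0):
    """Keep the top-d (mediator k, exposure p) pairs ranked by |alpha[k][p] * beta[k]|.

    alpha[k][p]: estimated effect of exposure p on mediator k (mediation model).
    beta[k]:     estimated effect of mediator k on log survival time (outcome model).
    n:           sample size, used only to set the number of retained pairs.
    """
    d = sis_threshold(n, multiplier)
    scored = [
        (k, p, abs(a_kp * beta[k]))
        for k, row in enumerate(alpha)
        for p, a_kp in enumerate(row)
    ]
    # Largest assumed mediation signal first; ties keep their original order.
    scored.sort(key=lambda item: item[2], reverse=True)
    return [(k, p) for k, p, _ in scored[:d]]


# Toy estimates for 3 mediators and 2 exposures (made-up numbers).
alpha_hat = [[0.6, 0.0], [1.0, -2.0], [0.1, 0.1]]
beta_hat = [2.0, 1.0, 0.0]
print(screen_pairs(alpha_hat, beta_hat, n=3))
```

Raising `multiplier` retains more pairs, which trades computational savings for a higher chance of keeping a truly important mediator.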

Step 3: Hypothesis testing

In the earlier subsections, penalized regression approaches are used for variable selection (Step 1) and SIS for identifying “important” mediators (Step 2). By incorporating these mediators, exposures, and clinically meaningful covariates, we can express the outcome and mediation models as

(9)

(10)

where and are the active sets of “important” mediators and exposures identified in Step 2, respectively. The coefficient corresponds to the estimates for the mediators within . We have identified the exposures that exert a greater causal effect when paired with , as derived from the SIS. These exposures are subsequently used as regressors in the mediation model described in Eq 10.

The joint significance (JS) test [47] is adopted to test whether a particular proteomic mediator lies in the causal pathway from a genomic exposure to a survival outcome (i.e., vs. for , where denotes the cardinality of a set ). The JS test (also referred to as the JS-uniform test) has been utilized in several studies [42,46] and is recognized for its ability to control the type I error rate while maintaining statistical power [48]. The primary distinction between our study and the aforementioned studies [42,46] lies in the number of exposures being tested for each mediator. Accordingly, for , where genes, we can articulate the null and alternative hypotheses as follows

(11)

to determine the gene-specific (or elementwise) indirect effect for each proteomic mediator. The null hypothesis in Eq 11 can be further decomposed into three disjoint null sub-hypotheses

Let denote a matrix of the mediator-exposure associations from the r mediation models, be an r-vector of mediator-outcome associations from the penalized outcome model, and be a matrix containing the elementwise p-values. The elementwise p-value for testing the hypotheses in Eq 11 is obtained by (i) comparing each p-value in the sth row of with the corresponding p-value from and (ii) taking the maximum of the two p-values.

(12)

where and are defined as

where and are the regression coefficient estimates from the models fitted using Eqs 9 and 10, respectively; and are the estimated standard errors of and ; and is the standard normal cumulative distribution function.
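The max-of-two-p-values rule can be sketched as follows, assuming two-sided Wald tests based on the standard normal distribution. The function and argument names are hypothetical, and this is an illustration rather than the SMAHP implementation.

```python
import math


def two_sided_p(estimate, se):
    """Two-sided Wald p-value, 2 * (1 - Phi(|estimate / se|)), via the erfc identity."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2.0))


def joint_significance_p(alpha_hat, se_alpha, beta_hat, se_beta):
    """JS p-value for one exposure-mediator pair: the larger of the two path p-values."""
    return max(two_sided_p(alpha_hat, se_alpha), two_sided_p(beta_hat, se_beta))


def js_pvalue_matrix(p_alpha, p_beta):
    """Elementwise JS p-values.

    p_alpha[s][j]: p-value for the effect of exposure j on mediator s.
    p_beta[s]:     p-value for the effect of mediator s on the outcome.
    """
    return [[max(p_sj, p_beta[s]) for p_sj in row] for s, row in enumerate(p_alpha)]


# Toy example with made-up estimates and standard errors.
print(joint_significance_p(0.8, 0.2, -0.5, 0.25))
```

Because the maximum is taken, a pair is declared significant only when both the exposure-to-mediator path and the mediator-to-outcome path are individually significant.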

It is important to appropriately control the false discovery rate (FDR) in multiple hypothesis testing. Therefore, the Benjamini-Hochberg (BH) adjustment [50] is applied to the p-values in the matrix using the stats R package.
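For readers who want to see what the BH adjustment does, the following is a minimal pure-Python sketch of the standard step-up procedure [50]. The function name is hypothetical, and the sketch is not the code used in the SMAHP package; a p-value matrix would first be flattened into a single list.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum enforces monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of pvalues[i] in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with made-up p-values.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

A pair is then reported as a discovery when its adjusted p-value falls below the chosen FDR level.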

Supporting information

S1 Appendix. Penalized outcome and mediation models in Step 1.

Detailed derivation of the penalized AFT model and MCP-penalized mediation model used for screening in Step 1.

https://doi.org/10.1371/journal.pcbi.1014217.s001

(PDF)

S1 Fig. A Kaplan-Meier curve for CPTAC-HNSCC application study.

Kaplan-Meier survival curve of overall survival for HPV-negative patients with head and neck squamous cell carcinoma from the CPTAC HNSCC dataset.

https://doi.org/10.1371/journal.pcbi.1014217.s002

(TIF)

S1 Table. Simulation results of the SMAHP with varying penalties in the mediation model.

The default MCP penalized mediation model is compared with mediation models using elastic-net and Lasso penalties.

https://doi.org/10.1371/journal.pcbi.1014217.s003

(PDF)

S2 Table. Simulation results for the SMAHP model (penalization + SIS) with correlated gene and protein structures.

SMAHP was assessed with correlated gene and protein structures.

https://doi.org/10.1371/journal.pcbi.1014217.s004

(PDF)

S3 Table. Simulation results of the SMAHP where exposures were generated from a negative binomial distribution, with a censoring rate of 25%.

SMAHP was further evaluated under this alternative exposure distribution.

https://doi.org/10.1371/journal.pcbi.1014217.s005

(PDF)

S4 Table. Simulation results of the SMAHP in the presence of outliers, with censoring rates of 25%.

We further evaluated the performance of SMAHP under this setting with outliers.

https://doi.org/10.1371/journal.pcbi.1014217.s006

(PDF)

S5 Table. Simulation results of SMAHP under a Gamma error distribution with censoring rates of 25%.

SMAHP was evaluated under Gamma residual distributions.

https://doi.org/10.1371/journal.pcbi.1014217.s007

(PDF)

S6 Table. Simulation results of SMAHP under a logistic error distribution with censoring rates of 25%.

SMAHP was evaluated under logistic residual distributions.

https://doi.org/10.1371/journal.pcbi.1014217.s008

(PDF)

Acknowledgments

We gratefully acknowledge the Minerva high-performance computing system, provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai. We would also like to thank Dr. Scott Roof (Icahn School of Medicine at Mount Sinai) for his valuable review and support of this manuscript.

References

1. Oldham JM, Huang Y, Bose S, Ma S-F, Kim JS, Schwab A, et al. Proteomic Biomarkers of Survival in Idiopathic Pulmonary Fibrosis. Am J Respir Crit Care Med. 2024;209(9):1111–20. pmid:37847691
2. Stetson LC, Dazard J-E, Barnholtz-Sloan JS. Protein Markers Predict Survival in Glioma Patients. Mol Cell Proteomics. 2016;15(7):2356–65. pmid:27143410
3. Wu Z-H, Yang D-L. Identification of a protein signature for predicting overall survival of hepatocellular carcinoma: a study based on data mining. BMC Cancer. 2020;20(1):720. pmid:32746792
4. Huo Z, Duan Y, Zhan D, Xu X, Zheng N, Cai J, et al. Proteomic Stratification of Prognosis and Treatment Options for Small Cell Lung Cancer. Genomics Proteomics Bioinformatics. 2024;22(2):qzae033. pmid:38961535
5. Schuermans A, Pournamdari AB, Lee J, Bhukar R, Ganesh S, Darosa N, et al. Integrative proteomic analyses across common cardiac diseases yield mechanistic insights and enhanced prediction. Nat Cardiovasc Res. 2024;3(12):1516–30. pmid:39572695
6. Zhang Y-R, Wu B-S, Chen S-D, Yang L, Deng Y-T, Guo Y, et al. Whole exome sequencing analyses identified novel genes for Alzheimer’s disease and related dementia. Alzheimers Dement. 2024;20(10):7062–78. pmid:39129223
7. Liu X, Liu P, Chernock RD, Kuhs KAL, Lewis JS Jr, Luo J, et al. A prognostic gene expression signature for oropharyngeal squamous cell carcinoma. EBioMedicine. 2020;61:102805. pmid:33038770
8. Zhao S, Cang H, Liu Y, Huang Y, Zhang S. Integrated analysis of bulk RNA-seq and single-cell RNA-seq reveals the function of pyrocytosis in the pathogenesis of abdominal aortic aneurysm. Aging (Albany NY). 2023;15(24):15287–323. pmid:38112597
9. Chi S, Flowers CR, Li Z, Huang X, Wei P. Mash: Mediation Analysis Of Survival Outcome And High-dimensional Omics Mediators With Application To Complex Diseases. Ann Appl Stat. 2024;18(2):1360–77. pmid:39328363
10. Sharma A, Debik J, Naume B, Ohnstad HO, Oslo Breast Cancer Consortium (OSBREAC), Bathen TF, et al. Comprehensive multi-omics analysis of breast cancer reveals distinct long-term prognostic subtypes. Oncogenesis. 2024;13(1):22. pmid:38871719
11. Nativio R, Lan Y, Donahue G, Sidoli S, Berson A, Srinivasan AR, et al. An integrated multi-omics approach identifies epigenetic alterations associated with Alzheimer’s disease. Nat Genet. 2020;52(10):1024–35. pmid:32989324
12. Lim J, Park C, Kim M, Kim H, Kim J, Lee D-S. Advances in single-cell omics and multiomics for high-resolution molecular profiling. Exp Mol Med. 2024;56(3):515–26. pmid:38443594
13. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18(1):83. pmid:28476144
14. Huang L, Long JP, Irajizad E, Doecke JD, Do K-A, Ha MJ. A unified mediation analysis framework for integrative cancer proteogenomics with clinical outcomes. Bioinformatics. 2023;39(1):btad023. pmid:36648331
15. Petralia F, Tignor N, Reva B, Koptyra M, Chowdhury S, Rykunov D, et al. Integrated Proteogenomic Characterization across Major Histological Types of Pediatric Brain Cancer. Cell. 2020;183(7):1962-1985.e31. pmid:33242424
16. Zhan X, Cheng J, Huang Z, Han Z, Helm B, Liu X, et al. Correlation Analysis of Histopathology and Proteogenomics Data for Breast Cancer. Mol Cell Proteomics. 2019;18(8 suppl 1):S37–51. pmid:31285282
17. Huang C, Chen L, Savage SR, Eguez RV, Dou Y, Li Y, et al. Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma. Cancer Cell. 2021;39(3):361-379.e16. pmid:33417831
18. Satpathy S, Krug K, Jean Beltran PM, Savage SR, Petralia F, Kumar-Sinha C, et al. A proteogenomic portrait of lung squamous cell carcinoma. Cell. 2021;184(16):4348-4371.e40. pmid:34358469
19. Vasaikar S, Huang C, Wang X, Petyuk VA, Savage SR, Wen B, et al. Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities. Cell. 2019;177(4):1035-1049.e19. pmid:31031003
20. Savage SR, Yi X, Lei JT, Wen B, Zhao H, Liao Y, et al. Pan-cancer proteogenomics expands the landscape of therapeutic targets. Cell. 2024;187(16):4389-4407.e15. pmid:38917788
21. Zhang Y, Chen F, Chandrashekar DS, Varambally S, Creighton CJ. Proteogenomic characterization of 2002 human cancers reveals pan-cancer molecular subtypes and associated pathways. Nat Commun. 2022;13(1):2669. pmid:35562349
22. Rubin D. Estimating causal effects of treatments in randomized and non-randomized studies. J Educ Psychol. 1974;66:688–701.
23. Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986;81:945–60.
24. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143–55. pmid:1576220
25. Albert JM. Mediation analysis via potential outcomes models. Stat Med. 2008;27(8):1282–304. pmid:17691077
26. Vansteelandt S, Vanderweele TJ. Natural direct and indirect effects on the exposed: effect decomposition under weaker assumptions. Biometrics. 2012;68(4):1019–27. pmid:22989075
27. Vanderweele TJ, Arah OA. Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology. 2011;22(1):42–52. pmid:21052008
28. Lange T, Vansteelandt S, Bekaert M. A simple unified approach for estimating natural direct and indirect effects. Am J Epidemiol. 2012;176(3):190–5. pmid:22781427
29. Huang Y-T, Yang H-I. Causal Mediation Analysis of Survival Outcome with Multiple Mediators. Epidemiology. 2017;28(3):370–8. pmid:28296661
30. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med. 1992;11(14–15):1871–9. pmid:1480879
31. VanderWeele TJ. Causal mediation analysis with survival data. Epidemiology. 2011;22(4):582–5. pmid:21642779
32. Fulcher IR, Tchetgen Tchetgen EJ, Williams PL. Mediation Analysis for Censored Survival Data Under an Accelerated Failure Time Model. Epidemiology. 2017;28(5):660–6. pmid:28574921
33. Krishnaiah PR, Rao CR. Handbook of Statistics. Amsterdam, Netherlands: Elsevier Science Publishers. 1988.
34. Clark-Boucher D, Zhou X, Du J, Liu Y, Needham BL, Smith JA, et al. Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons. PLoS Genet. 2023;19(11):e1011022. pmid:37934796
35. Suder PM, Molstad AJ. Scalable algorithms for semiparametric accelerated failure time models in high dimensions. Stat Med. 2022;41(6):933–49. pmid:35014701
36. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1996;58(1):267–88.
37. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38:894–942.
38. Zou H, Hastie T. Regularization and Variable Selection Via the Elastic Net. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2005;67(2):301–20.
39. Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2008;70(5):849–911.
40. Wang X, Leng C. High Dimensional Ordinary Least Squares Projection for Screening Variables. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2015;78(3):589–611.
41. Fu G, Wang G, Dai X. An adaptive threshold determination method of feature screening for genomic selection. BMC Bioinformatics. 2017;18(1):212. pmid:28403836
42. Luo C, Fa B, Yan Y, Wang Y, Zhou Y, Zhang Y, et al. High-dimensional mediation analysis in survival models. PLoS Comput Biol. 2020;16(4):e1007768. pmid:32302299
43. Li R, Zhong W, Zhu L. Feature Screening via Distance Correlation Learning. J Am Stat Assoc. 2012;107(499):1129–39. pmid:25249709
44. Pan W, Wang X, Xiao W, Zhu H. A Generic Sure Independence Screening Procedure. J Am Stat Assoc. 2019;114(526):928–37. pmid:31692981
45. Zhang H, Zheng Y, Hou L, Zheng C, Liu L. Mediation analysis for survival data with high-dimensional mediators. Bioinformatics. 2021;37(21):3815–21. pmid:34343267
46. Zhang H, Zheng Y, Zhang Z, Gao T, Joyce B, Yoon G, et al. Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics. 2016;32(20):3150–4. pmid:27357171
47. MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods. 2002;7(1):83–104. pmid:11928892
48. Barfield R, Shen J, Just AC, Vokonas PS, Schwartz J, Baccarelli AA, et al. Testing for the indirect effect under the null for genome-wide mediation analyses. Genet Epidemiol. 2017;41(8):824–33. pmid:29082545
49. Shao Z, Wang T, Zhang M, Jiang Z, Huang S, Zeng P. IUSMMT: Survival mediation analysis of gene expression with multiple DNA methylation exposures and its application to cancers of TCGA. PLoS Comput Biol. 2021;17(8):e1009250. pmid:34464378
50. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1995;57(1):289–300.
51. Storey JD. A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2002;64(3):479–98.
52. Rudnick PA, Markey SP, Roth J, Mirokhin Y, Yan X, Tchekhovskoi DV, et al. A Description of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) Common Data Analysis Pipeline. J Proteome Res. 2016;15(3):1023–32. pmid:26860878
53. Haughton PD, Haakma W, Chalkiadakis T, Breimer GE, Driehuis E, Clevers H, et al. Differential transcriptional invasion signatures from patient derived organoid models define a functional prognostic tool for head and neck cancer. Oncogene. 2024;43(32):2463–74. pmid:38942893
54. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424.
55. Niu L, Yang W, Duan L, Wang X, Li Y, Xu C, et al. Biological functions and theranostic potential of HMGB family members in human cancers. Ther Adv Med Oncol. 2020;12:1758835920970850. pmid:33224279
56. Rappaport N, Twik M, Plaschkes I, Nudel R, Iny Stein T, Levitt J, et al. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 2017;45(D1):D877–87. pmid:27899610
57. Cheng F, Zhao J, Wang Y, Lu W, Liu Z, Zhou Y, et al. Comprehensive characterization of protein-protein interactions perturbed by disease mutations. Nat Genet. 2021;53(3):342–53. pmid:33558758
58. Paczkowska M, Barenboim J, Sintupisut N, Fox NS, Zhu H, Abd-Rabbo D, et al. Integrative pathway enrichment analysis of multivariate omics data. Nat Commun. 2020;11(1):735. pmid:32024846
59. Dai JY, Stanford JL, LeBlanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. J Am Stat Assoc. 2022;117(537):198–213. pmid:35400115
60. Westfall PH, Troendle JF. Multiple testing with minimal assumptions. Biom J. 2008;50(5):745–55. pmid:18932134
61. VanderWeele TJ, Tchetgen Tchetgen EJ. Mediation analysis with time varying exposures and mediators. J R Stat Soc Series B Stat Methodol. 2017;79(3):917–38.
62. Vansteelandt S, Daniel RM. Interventional Effects for Mediation Analysis with Multiple Mediators. Epidemiology. 2017;28(2):258–65. pmid:27922534

[ref1] 1. Oldham JM, Huang Y, Bose S, Ma S-F, Kim JS, Schwab A, et al. Proteomic Biomarkers of Survival in Idiopathic Pulmonary Fibrosis. Am J Respir Crit Care Med. 2024;209(9):1111–20. pmid:37847691
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Stetson LC, Dazard J-E, Barnholtz-Sloan JS. Protein Markers Predict Survival in Glioma Patients. Mol Cell Proteomics. 2016;15(7):2356–65. pmid:27143410
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Wu Z-H, Yang D-L. Identification of a protein signature for predicting overall survival of hepatocellular carcinoma: a study based on data mining. BMC Cancer. 2020;20(1):720. pmid:32746792
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Huo Z, Duan Y, Zhan D, Xu X, Zheng N, Cai J, et al. Proteomic Stratification of Prognosis and Treatment Options for Small Cell Lung Cancer. Genomics Proteomics Bioinformatics. 2024;22(2):qzae033. pmid:38961535
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Schuermans A, Pournamdari AB, Lee J, Bhukar R, Ganesh S, Darosa N, et al. Integrative proteomic analyses across common cardiac diseases yield mechanistic insights and enhanced prediction. Nat Cardiovasc Res. 2024;3(12):1516–30. pmid:39572695
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Zhang Y-R, Wu B-S, Chen S-D, Yang L, Deng Y-T, Guo Y, et al. Whole exome sequencing analyses identified novel genes for Alzheimer’s disease and related dementia. Alzheimers Dement. 2024;20(10):7062–78. pmid:39129223
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Liu X, Liu P, Chernock RD, Kuhs KAL, Lewis JS Jr, Luo J, et al. A prognostic gene expression signature for oropharyngeal squamous cell carcinoma. EBioMedicine. 2020;61:102805. pmid:33038770
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. Zhao S, Cang H, Liu Y, Huang Y, Zhang S. Integrated analysis of bulk RNA-seq and single-cell RNA-seq reveals the function of pyrocytosis in the pathogenesis of abdominal aortic aneurysm. Aging (Albany NY). 2023;15(24):15287–323. pmid:38112597
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref9] 9. Chi S, Flowers CR, Li Z, Huang X, Wei P. Mash: Mediation Analysis Of Survival Outcome And High-dimensional Omics Mediators With Application To Complex Diseases. Ann Appl Stat. 2024;18(2):1360–77. pmid:39328363
View Article
PubMed/NCBI
Google Scholar

[34] View Article

[35] PubMed/NCBI

[36] Google Scholar

[ref10] 10. Sharma A, Debik J, Naume B, Ohnstad HO, Oslo Breast Cancer Consortium (OSBREAC), Bathen TF, et al. Comprehensive multi-omics analysis of breast cancer reveals distinct long-term prognostic subtypes. Oncogenesis. 2024;13(1):22. pmid:38871719
View Article
PubMed/NCBI
Google Scholar

[38] View Article

[39] PubMed/NCBI

[40] Google Scholar

[ref11] 11. Nativio R, Lan Y, Donahue G, Sidoli S, Berson A, Srinivasan AR, et al. An integrated multi-omics approach identifies epigenetic alterations associated with Alzheimer’s disease. Nat Genet. 2020;52(10):1024–35. pmid:32989324
View Article
PubMed/NCBI
Google Scholar

[42] View Article

[43] PubMed/NCBI

[44] Google Scholar

[ref12] 12. Lim J, Park C, Kim M, Kim H, Kim J, Lee D-S. Advances in single-cell omics and multiomics for high-resolution molecular profiling. Exp Mol Med. 2024;56(3):515–26. pmid:38443594
View Article
PubMed/NCBI
Google Scholar

[46] View Article

[47] PubMed/NCBI

[48] Google Scholar

[ref13] 13. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18(1):83. pmid:28476144
View Article
PubMed/NCBI
Google Scholar

[50] View Article

[51] PubMed/NCBI

[52] Google Scholar

[ref14] 14. Huang L, Long JP, Irajizad E, Doecke JD, Do K-A, Ha MJ. A unified mediation analysis framework for integrative cancer proteogenomics with clinical outcomes. Bioinformatics. 2023;39(1):btad023. pmid:36648331
View Article
PubMed/NCBI
Google Scholar

[54] View Article

[55] PubMed/NCBI

[56] Google Scholar

[ref15] 15. Petralia F, Tignor N, Reva B, Koptyra M, Chowdhury S, Rykunov D, et al. Integrated Proteogenomic Characterization across Major Histological Types of Pediatric Brain Cancer. Cell. 2020;183(7):1962-1985.e31. pmid:33242424
View Article
PubMed/NCBI
Google Scholar

[58] View Article

[59] PubMed/NCBI

[60] Google Scholar

[ref16] 16. Zhan X, Cheng J, Huang Z, Han Z, Helm B, Liu X, et al. Correlation Analysis of Histopathology and Proteogenomics Data for Breast Cancer. Mol Cell Proteomics. 2019;18(8 suppl 1):S37–51. pmid:31285282
View Article
PubMed/NCBI
Google Scholar

[62] View Article

[63] PubMed/NCBI

[64] Google Scholar

[ref17] 17. Huang C, Chen L, Savage SR, Eguez RV, Dou Y, Li Y, et al. Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma. Cancer Cell. 2021;39(3):361-379.e16. pmid:33417831
View Article
PubMed/NCBI
Google Scholar

[66] View Article

[67] PubMed/NCBI

[68] Google Scholar

[ref18] 18. Satpathy S, Krug K, Jean Beltran PM, Savage SR, Petralia F, Kumar-Sinha C, et al. A proteogenomic portrait of lung squamous cell carcinoma. Cell. 2021;184(16):4348-4371.e40. pmid:34358469
View Article
PubMed/NCBI
Google Scholar

[70] View Article

[71] PubMed/NCBI

[72] Google Scholar

[ref19] 19. Vasaikar S, Huang C, Wang X, Petyuk VA, Savage SR, Wen B, et al. Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities. Cell. 2019;177(4):1035-1049.e19. pmid:31031003
View Article
PubMed/NCBI
Google Scholar

[74] View Article

[75] PubMed/NCBI

[76] Google Scholar

[ref20] 20. Savage SR, Yi X, Lei JT, Wen B, Zhao H, Liao Y, et al. Pan-cancer proteogenomics expands the landscape of therapeutic targets. Cell. 2024;187(16):4389-4407.e15. pmid:38917788
View Article
PubMed/NCBI
Google Scholar

[78] View Article

[79] PubMed/NCBI

[80] Google Scholar

[ref21] 21. Zhang Y, Chen F, Chandrashekar DS, Varambally S, Creighton CJ. Proteogenomic characterization of 2002 human cancers reveals pan-cancer molecular subtypes and associated pathways. Nat Commun. 2022;13(1):2669. pmid:35562349
View Article
PubMed/NCBI
Google Scholar

[82] View Article

[83] PubMed/NCBI

[84] Google Scholar

[ref22] 22. Rubin D. Estimating causal effects of treatments in randomized and non-randomized studies. J Educ Psychol. 1974;66:688–701.
View Article
Google Scholar

[86] View Article

[87] Google Scholar

[ref23] 23. Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986;81:945–60.
View Article
Google Scholar

[89] View Article

[90] Google Scholar

[ref24] 24. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143–55. pmid:1576220
View Article
PubMed/NCBI
Google Scholar

[92] View Article

[93] PubMed/NCBI

[94] Google Scholar

[ref25] 25. Albert JM. Mediation analysis via potential outcomes models. Stat Med. 2008;27(8):1282–304. pmid:17691077
View Article
PubMed/NCBI
Google Scholar

[96] View Article

[97] PubMed/NCBI

[98] Google Scholar

[ref26] 26. Vansteelandt S, Vanderweele TJ. Natural direct and indirect effects on the exposed: effect decomposition under weaker assumptions. Biometrics. 2012;68(4):1019–27. pmid:22989075
View Article
PubMed/NCBI
Google Scholar

[100] View Article

[101] PubMed/NCBI

[102] Google Scholar

[ref27] 27. Vanderweele TJ, Arah OA. Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology. 2011;22(1):42–52. pmid:21052008

[ref28] 28. Lange T, Vansteelandt S, Bekaert M. A simple unified approach for estimating natural direct and indirect effects. Am J Epidemiol. 2012;176(3):190–5. pmid:22781427

[ref29] 29. Huang Y-T, Yang H-I. Causal Mediation Analysis of Survival Outcome with Multiple Mediators. Epidemiology. 2017;28(3):370–8. pmid:28296661

[ref30] 30. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med. 1992;11(14–15):1871–9. pmid:1480879

[ref31] 31. VanderWeele TJ. Causal mediation analysis with survival data. Epidemiology. 2011;22(4):582–5. pmid:21642779

[ref32] 32. Fulcher IR, Tchetgen Tchetgen EJ, Williams PL. Mediation Analysis for Censored Survival Data Under an Accelerated Failure Time Model. Epidemiology. 2017;28(5):660–6. pmid:28574921

[ref33] 33. Krishnaiah PR, Rao CR. Handbook of Statistics. Amsterdam, Netherlands: Elsevier Science Publishers. 1988.

[ref34] 34. Clark-Boucher D, Zhou X, Du J, Liu Y, Needham BL, Smith JA, et al. Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons. PLoS Genet. 2023;19(11):e1011022. pmid:37934796

[ref35] 35. Suder PM, Molstad AJ. Scalable algorithms for semiparametric accelerated failure time models in high dimensions. Stat Med. 2022;41(6):933–49. pmid:35014701

[ref36] 36. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1996;58(1):267–88.

[ref37] 37. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38:894–942.

[ref38] 38. Zou H, Hastie T. Regularization and Variable Selection Via the Elastic Net. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2005;67(2):301–20.

[ref39] 39. Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2008;70(5):849–911.

[ref40] 40. Wang X, Leng C. High Dimensional Ordinary Least Squares Projection for Screening Variables. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2015;78(3):589–611.

[ref41] 41. Fu G, Wang G, Dai X. An adaptive threshold determination method of feature screening for genomic selection. BMC Bioinformatics. 2017;18(1):212. pmid:28403836

[ref42] 42. Luo C, Fa B, Yan Y, Wang Y, Zhou Y, Zhang Y, et al. High-dimensional mediation analysis in survival models. PLoS Comput Biol. 2020;16(4):e1007768. pmid:32302299

[ref43] 43. Li R, Zhong W, Zhu L. Feature Screening via Distance Correlation Learning. J Am Stat Assoc. 2012;107(499):1129–39. pmid:25249709

[ref44] 44. Pan W, Wang X, Xiao W, Zhu H. A Generic Sure Independence Screening Procedure. J Am Stat Assoc. 2019;114(526):928–37. pmid:31692981

[ref45] 45. Zhang H, Zheng Y, Hou L, Zheng C, Liu L. Mediation analysis for survival data with high-dimensional mediators. Bioinformatics. 2021;37(21):3815–21. pmid:34343267

[ref46] 46. Zhang H, Zheng Y, Zhang Z, Gao T, Joyce B, Yoon G, et al. Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics. 2016;32(20):3150–4. pmid:27357171

[ref47] 47. MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods. 2002;7(1):83–104. pmid:11928892

[ref48] 48. Barfield R, Shen J, Just AC, Vokonas PS, Schwartz J, Baccarelli AA, et al. Testing for the indirect effect under the null for genome-wide mediation analyses. Genet Epidemiol. 2017;41(8):824–33. pmid:29082545

[ref49] 49. Shao Z, Wang T, Zhang M, Jiang Z, Huang S, Zeng P. IUSMMT: Survival mediation analysis of gene expression with multiple DNA methylation exposures and its application to cancers of TCGA. PLoS Comput Biol. 2021;17(8):e1009250. pmid:34464378

[ref50] 50. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1995;57(1):289–300.

[ref51] 51. Storey JD. A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2002;64(3):479–98.

[ref52] 52. Rudnick PA, Markey SP, Roth J, Mirokhin Y, Yan X, Tchekhovskoi DV, et al. A Description of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) Common Data Analysis Pipeline. J Proteome Res. 2016;15(3):1023–32. pmid:26860878

[ref53] 53. Haughton PD, Haakma W, Chalkiadakis T, Breimer GE, Driehuis E, Clevers H, et al. Differential transcriptional invasion signatures from patient derived organoid models define a functional prognostic tool for head and neck cancer. Oncogene. 2024;43(32):2463–74. pmid:38942893

[ref54] 54. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424.

[ref55] 55. Niu L, Yang W, Duan L, Wang X, Li Y, Xu C, et al. Biological functions and theranostic potential of HMGB family members in human cancers. Ther Adv Med Oncol. 2020;12:1758835920970850. pmid:33224279

[ref56] 56. Rappaport N, Twik M, Plaschkes I, Nudel R, Iny Stein T, Levitt J, et al. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 2017;45(D1):D877–87. pmid:27899610

[ref57] 57. Cheng F, Zhao J, Wang Y, Lu W, Liu Z, Zhou Y, et al. Comprehensive characterization of protein-protein interactions perturbed by disease mutations. Nat Genet. 2021;53(3):342–53. pmid:33558758

[ref58] 58. Paczkowska M, Barenboim J, Sintupisut N, Fox NS, Zhu H, Abd-Rabbo D, et al. Integrative pathway enrichment analysis of multivariate omics data. Nat Commun. 2020;11(1):735. pmid:32024846

[ref59] 59. Dai JY, Stanford JL, LeBlanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. J Am Stat Assoc. 2022;117(537):198–213. pmid:35400115

[ref60] 60. Westfall PH, Troendle JF. Multiple testing with minimal assumptions. Biom J. 2008;50(5):745–55. pmid:18932134

[ref61] 61. VanderWeele TJ, Tchetgen Tchetgen EJ. Mediation analysis with time varying exposures and mediators. J R Stat Soc Series B Stat Methodol. 2017;79(3):917–38.

[ref62] 62. Vansteelandt S, Daniel RM. Interventional Effects for Mediation Analysis with Multiple Mediators. Epidemiology. 2017;28(2):258–65. pmid:27922534
