Skip to main content
Advertisement
  • Loading metrics

Powerful large scale inference in high dimensional mediation analysis

  • Asmita Roy ,

    Roles Formal analysis, Methodology, Software, Visualization, Writing – original draft

    aroy38@jh.edu

    Affiliation Department of Biostatistics/Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America

  • Xianyang Zhang

    Roles Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review & editing

    Affiliation Department of Statistics, Texas A&M University, College Station, Texas, United States of America

Abstract

In genome-wide epigenetic studies, determining how exposures (e.g., Single Nucleotide Polymorphisms) affect outcomes (e.g., gene expression) through intermediate variables, such as DNA methylation, is a key challenge. Mediation analysis provides a framework to identify these causal pathways; however, testing for mediation effects involves a complex composite null hypothesis. Existing methods, such as Sobel’s test or the Max-P test, are often underpowered in this context because they rely on null distributions determined under only a subset of the null space and are not optimized for the multiple testing burden inherent in high-dimensional data. To address these limitations, we introduce MLFDR (Mediation Analysis using Local False Discovery Rates), a novel method for high-dimensional mediation analysis. MLFDR leverages local false discovery rates, calculated from the coefficients of structural equation models, to construct an optimal rejection region. We demonstrate theoretically and through simulation that MLFDR asymptotically controls the false discovery rate and achieves superior statistical power compared to recent high-dimensional mediation methods. In real data applications, MLFDR identified 20%–50% more significant mediators than existing methods, demonstrating its ability to uncover biological signals missed by conventional approaches.

Author summary

The paper presents a novel approach to high-dimensional mediation analysis through a local false discovery rate (MLFDR) screening algorithm. It addresses the limitations of traditional methods like Sobel’s test and maxP, which are underpowered in high dimensional setting. We extend local FDR principles to composite null hypotheses, and derive a screening rule with a closed-form expression for false discovery proportion. We also show that MLFDR has comparable or better results than two recently-proposed methods, MDACT [30], HDMT [3] across a wide range of data types and models. We also provide a two-step global latent factor adjustment using surrogate variable analysis [9].

1 Introduction

Mediation analysis serves as a critical tool for deciphering the biological mechanisms underlying genetic associations with diseases identified in Genome-Wide Association Studies (GWAS). By bridging the gap between genetic variants and clinical outcomes, mediation analysis reveals intermediate pathways and elucidates causal relationships. As GWAS continues to uncover a vast number of genetic associations, translating these findings into actionable insights for precision medicine and therapeutic development becomes increasingly important. For instance, cigarette smoking is known to alter DNA methylation and gene expression [12]; concurrently, DNA methylation often regulates gene expression directly [4,14]. Investigating the mediating effect of DNA methylation on gene expression—particularly in the presence of environmental exposures like smoking—is therefore essential. However, these analyses are complicated by high-dimensional outcomes and clinical confounders, such as patient age, which influences both gene expression and DNA methylation heterogeneity [7,19]. This article addresses the statistical challenges inherent in such high-dimensional mediation problems.

Historically, [1] introduced the regression-based definition of mediation analysis, often referred to as the “product of coefficients method,” which examines the significance of the product of the exposure-mediator and mediator-outcome coefficients. More recently, the literature has expanded through the “counterfactual framework” [8,15,17,2326], which provides a causal interpretation for natural direct and indirect effects across various models, including those with non-linearities and binary or survival outcomes.

Let X denote the exposure, Mi the ith mediator, and Y the outcome. Under the product of coefficients approach, mediation analysis tests the null hypothesis , where represents the effect of X on Mi, and represents the effect of Mi on Y. This creates a composite null hypothesis comprising three distinct cases: (i) ; (ii) ; or (iii) . Assuming no unmeasured confounders, classical tests like Max-P [13] and Sobel’s test [18] are known to be conservative under case (iii), as statistical inference is typically derived from distributions determined by cases (i) and (ii). In genome-wide studies, however, the sparse nature of omics data implies that and hold for the majority of markers. Recent methods such as JS-mixture (HDMT) [3] and DACT [11] attempt to address this by explicitly modeling the composite nature of the null. JS-mixture improves power by using a mixture-null distribution of maximum p-values, adapting [20]’s procedure to estimate component proportions. DACT estimates the proportions of null and separately to combine case-specific p-values. However, [30] recently demonstrated that DACT suffers from False Discovery Rate (FDR) inflation under dense alternatives and proposed a modified version (MDACT) that computes the statistic’s distribution via numerical integration to improve p-value accuracy.

While JS-mixture and MDACT offer improvements over classical methods, they are not theoretically optimal regarding power. The FDR literature is broadly divided into p-value-based and local FDR-based rejection regions. Local FDR, a Bayesian approach, ranks hypotheses by the posterior probability that a case is null given the observed statistics; this ranking often differs from that based on p-values. [22] demonstrated that, except in cases of symmetric alternatives, local FDR and p-value-based orderings diverge. Furthermore, [22] proved that the local FDR-based oracle procedure is optimal: among all methods controlling the marginal FDR (mFDR), the local FDR approach yields the highest number of rejections. While the power advantage is negligible for symmetric alternatives, it becomes significant when the alternative distribution is asymmetric. Motivated by these theoretical properties, we propose MLFDR, a local FDR-based screening algorithm designed specifically for high-dimensional mediation analysis. Our contributions to the literature are as follows:

  1. We extend the concept of local FDR to the composite null hypothesis setting, deriving a screening rule with a closed-form expression for the corresponding false discovery proportion (FDP).
  2. We validate the method across a diverse array of data types—including continuous and binary variables, and scenarios with exposure-mediator interactions—demonstrating robust performance across various model specifications. We specifically incorporate Surrogate Variable Analysis (SVA) to adjust for latent confounding and illustrate the method’s efficacy in multiple mediator setups with univariate or clinical outcomes.
  3. MLFDR offers optimal power improvement over existing methods while maintaining asymptotic FDR control. Extensive simulations confirm its superiority over MDACT and HDMT in terms of power and error rate control.
  4. We provide theoretical guarantees for the identifiability and global optimality of our model under relatively mild assumptions, proving FDR control for both the oracle and adaptive procedures.

The remainder of this paper is organized as follows. Sect 2 contains the main results. Specifically, Sect 2.1 outlines the screening procedure for detecting significant mediators. Sect 2.2 presents simulation studies. Sect 2.3 discusses extensions of MLFDR to composite alternatives and latent factor models, which can account for unmeasured confounding and pleiotropy. Sect 3 provides an in-depth analysis of Prostate Cancer data and Lung cancer data from The Cancer Genome Atlas (TCGA), exploring SNP-CpG-gene expression pathways and causal pathways between smoking habits and gene expression, respectively. Sect 5 details the methodology. An R package implementing the method is available at https://github.com/asmita112358/MLFDR as well as in CRAN. Theorems proving the large-sample FDR control of MLFDR are provided in Section E of S1 Text.

Finally, we distinguish our approach from other recent efforts in high-dimensional mediation analysis. [33], [32], and [16] address the multiple mediator problem specifically for survival outcomes. Closer to our framework, [21] and [5] utilize local FDR-based rejection regions; the former approximates the alternative as a mixture of Gaussian distributions, while the latter constructs regions based on p-values. Our work advances this domain in two specific aspects: (i) we incorporate a general prior for the coefficients α and β to estimate the exact posterior density for computing the local FDR, rather than relying on approximations; and (ii) we offer theoretical guarantees for the local FDR estimates obtained via the EM algorithm, a property not previously established in this context.

2 Results

2.1 Method overview

This section outlines the workflow of MLFDR; a schematic representation of the framework is provided in Fig 1. Consider a study involving n independent samples. For each testing unit , we observe an exposure variable Xi, a mediator Mi, and an outcome Yi. Biologically, these variables may represent distinct contexts: for example, X may denote a patient’s smoking history (shared across i), with representing CpG methylation sites and representing gene expression levels. Alternatively, the analysis may focus on the functional impact of Single Nucleotide Polymorphisms (Xi) on gene expression (Yi) as mediated by CpG methylation (Mi) [3].

The mediation model posits that the exposure Xi influences the outcome Yi through the intermediate variable Mi, rather than solely through a direct relationship. We denote the coefficient for the exposure-mediator relationship () as , and the coefficient for the mediator-outcome relationship () as . In Fig 1, solid arrows indicate these direct effects.

We aim to test the composite null hypothesis against the alternative for each unit i:

(1)

The composite null hypothesis H0,i can be decomposed into three disjoint component nulls, , defined as:

(2)

for

We consider a mixture prior for , where the probability of each disjoint component nulls and H01 occur with probability and respectively. The marginal prior distributions of and , respectively, are degenerate zero under the null and follow a normal prior with an unknown mean and variance under the alternative. The marginal distribution of the least squares coefficient estimates given the latent states is computed, and the unknown parameters including the null proportions are estimated using EM algorithm.

Using these estimates, we compute the local false discovery rate (local FDR) for each coefficient pair, denoted as for . As the local FDR represents the posterior probability that the i-th hypothesis is null given the observed statistics, a lower value indicates stronger evidence against the null. Consequently, we define the rejection region for the composite null hypothesis as . The threshold δ is determined adaptively using the step-up procedure proposed by [22]. The complete algorithm is detailed in the Methods section.

Additionally, we introduce an extended algorithm (MLFDR2) which can deal with scenarios where the marginal priors of and follow mixture normal distributions under the alternative. A composite alternative leads to a joint distribution of with more than 4 mixture components, which can often be computationally burdensome. We introduce a two-step EM alogrithm that estimates the parameters of the marginal distributions of and in the first step, then uses these estimates to run another EM algorithm that computes the probabilities of each mixture component.

We also discuss another extension using Surrogate Variable Analysis [10] which can account for unmeasured confounders and pleiotropy in the model. Details are presented in Sect 2.3.

2.2 Simulation studies

We evaluate the performance of MLFDR through extensive simulations under two distinct mixture proportion scenarios: a dense alternative and a sparse alternative. Following the setups in [3], the latent class probabilities are defined as:

  • Dense alternative:
  • Sparse alternative:

We consider sample sizes of and fix the number of mediators at m = 1000. The parameter controlling the signal strength of mediation, τ, varies from 0.1 to 1.9 in increments of 0.2. The non-zero coefficients are generated as (under H10 and H11) and (under H01 and H11), where the noise terms follow and .

We compare the empirical FDR and power of MLFDR against two competing methods: MDACT [30] and HDMT (JS-mixture) [3]. The simulation settings are detailed below.

  1. Linear Model. The exposure is univariate with . (3)
    where , and error terms . The results for this setting are summarized in Fig 2.
  2. Linear Model with measured Confounder. The exposure is univariate with . We introduce a confounder : (4)
    where the confounder effects are drawn independently from . The results are presented in Fig 3.
  3. Binary Outcome. The exposure is univariate with . The outcome Yi is binary: (5)
    The results are displayed in Fig 4.
thumbnail
Fig 2. FDR and power comparison for the linear model (Setting 1).

Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

https://doi.org/10.1371/journal.pcbi.1013880.g002

thumbnail
Fig 3. FDR and power comparison for the linear model with measured confounders (Setting 2).

Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

https://doi.org/10.1371/journal.pcbi.1013880.g003

thumbnail
Fig 4. FDR and power comparison for the linear model with binary outcomes (Setting 3).

Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

https://doi.org/10.1371/journal.pcbi.1013880.g004

Across all settings, the three methods demonstrated satisfactory FDR control. However, MLFDR consistently exhibited the highest power. Specifically, MLFDR achieved an average power improvement of 10.83% over MDACT and 12.23% over HDMT under dense alternatives. Under sparse alternatives, MLFDR maintained its advantage with an average improvement of 7.47% over MDACT and 8.51% over HDMT.

2.3 Extensions

2.3.1 Latent factors.

Unmeasured latent factors may be addressed by surrogate variable analysis [10]. Briefly, surrogate variable analysis considers the following model:

(6)

where X, Z are measured covariates, and are unmeasured latent factors. Surrogate variable analysis produces a set of K mutually orthogonal vectors (where ), which span the same linear space as the latent factors. Thus, the original equation may be re-written as:

These estimated factors, collected into a matrix , account for unmeasured confounding in the exposure-mediator relationship.

For the mediator-outcome relationship, the latent factors may be modeled as follows:

(7)

In this setting, the latent terms may account for: 1) unmeasured confounding; 2) measured confounders with unknown relationships to the outcome (e.g., global batch effects); and 3) pleiotropy, where Yi is influenced by mediators other than Mi.

Direct application of SVA to model (7) would require estimating surrogate variables for each mediator-outcome pair iteratively, leading to a computational bottleneck. To address this, we propose a global factor adjustment. We estimate a second set of surrogate variables, , based on the outcome null model (excluding mediators):

(8)

We then use the combined set of latent factors (derived from mediators) and (derived from outcome residuals) to model the mediator-outcome relationship. The validity of using from (8) relies on the assumption that any single mediator Mi contributes a relatively small amount of variance to the global outcome matrix Y.

The full procedure is summarized in Algorithm 1. We implemented surrogate variable analysis via the Bioconductor package sva [9].

Algorithm 1 Two-step global factor adjustment for high-dimensional mediation

Require: Outcome matrix , Mediator matrix , Exposure vector , Covariates .

Ensure: Adjusted estimates for and for .

Step 1: Estimate Latent Factors for Mediators ()

1: Regress M on X and Z to obtain the residual matrix :

2: Perform SVA on to extract the top kM principal components:

Step 2: Estimate Latent Factors for Outcomes ()

3: Regress Y on X and Z (excluding M) to obtain the residual matrix :

4: Perform SVA on to extract the top kY principal components:

Step 3: Mediation Analysis with Global Adjustment

5: for do

6:   Mediator Model: Regress Mi on X, Z, and :

7:   Outcome Model: Regress Yi on Mi, X, Z, , and :

8:   Extract coefficients and their standard errors for MLFDR.

9: end for

We apply this method to two data generating scenarios to demonstrate its performance.

Unknown mediator-exposure interactions.

(9)

where S is a randomly selected subset of indices with , treated as unknown during model fitting. The results are shown in Fig 5.

thumbnail
Fig 5. FDR and power comparison for the linear model with unknown mediator-exposure interactions.

Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

https://doi.org/10.1371/journal.pcbi.1013880.g005

Unmeasured confounding and pleiotropy.

(10)

where Z1 and Z2 represent unmeasured confounders not included in the model fitting. S is a randomly selected subset of indices with . The term represents dense pleiotropy (the effect of other mediators on Yi), which acts as an additional source of unmeasured variation. The results are shown in Fig 6.

thumbnail
Fig 6. FDR and power comparison for the linear model with unmeasured confounders.

Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

https://doi.org/10.1371/journal.pcbi.1013880.g006

2.3.2 Composite alternatives.

In this setting, the coefficients follow a Gaussian mixture distribution. The posterior distribution of the estimated coefficients is given by:

Under this framework, the null hypothesis corresponds to a mixture of 5 bivariate Gaussian distributions, while the alternative hypothesis comprises a mixture of 4 bivariate Gaussian distributions. The parameters are estimated using a two-step EM algorithm, the details of which are provided in Section C of S1 Text.

In our simulations, we set the mixture weights to and . The variance parameters were fixed at . The mean parameters were defined as functions of the mediation signal strength τ: , , , and , where τ ranges from 0.1 to 1.9 in increments of 0.2.

Fig 7 presents the results for with m = 1000. All three methods maintained satisfactory FDR control, with MLFDR demonstrating the highest power.

thumbnail
Fig 7. FDR and power comparison for composite alternatives.

Results are displayed for varying degrees of mediation (τ). Gray ribbons indicate error margins.

https://doi.org/10.1371/journal.pcbi.1013880.g007

All methods appear to have satisfactory FDR control. MLFDR is uniformly more powerful than HDMT and MDACT in all cases, with better margin of improvement in dense alternatives.

3 Real data analysis

3.1 TCGA prostate cancer data

We apply MLFDR to the Prostate Cancer dataset from The Cancer Genome Atlas (TCGA), previously analyzed by [3]. The study involves mediation analysis for 147 prostate cancer risk SNPs, integrated with DNA methylation and gene expression data from 495 samples. For each risk SNP, we identified CpG methylation probes within a 500 kb window and recorded the gene expression levels for the corresponding probes. This resulted in SNP-CpG-Gene triplets for mediation testing.

In the first stage, we regressed CpG methylation on the SNPs, adjusting for the top 3 principal components (PCs) of genotypes, the top 15 PCs of CpG methylation, age at diagnosis, and pathological stage. From this, we obtained the slope estimates, variances, and p-values for the SNPs. In the second stage, gene expression was regressed on CpG methylation, conditional on the same set of covariates.

The estimated null proportion components were (0.51,0.033,0.41) for HDMT, compared to (0.39,0.004,0.59) obtained via the EM algorithm in MLFDR.

Due to the wide spread of the methylation coefficients (β), we fitted a composite alternative Gaussian mixture model. The number of components, d2 = 8, was selected based on the Akaike Information Criterion (AIC) (Fig 8). Conversely, the SNP coefficients (α) exhibited a narrower range (–0.2 to 0.4) and were adequately modeled using a d1 = 2 component Gaussian mixture.

thumbnail
Fig 8. AIC for methylation coefficients when a d2 + 1 component Gaussian mixture model was fit.

https://doi.org/10.1371/journal.pcbi.1013880.g008

At an FDR threshold of 0.01, HDMT identified 137 triplets, MLFDR identified 187, and MDACT identified 180. Fig 9 displays a Venn diagram of the overlapping discoveries, along with the number of rejections across FDR cutoffs ranging from 0.001 to 0.05. MLFDR consistently detected more pathways than the competing methods.

thumbnail
Fig 9. SNP-CpG-Gene triplets identified by MDACT, MLFDR, and HDMT out of 69,602 tests.

The Venn diagram (left) corresponds to an FDR cutoff of 0.01.

https://doi.org/10.1371/journal.pcbi.1013880.g009

Table 1 lists 10 additional pathways identified by MLFDR (but missed by HDMT or MDACT), ranked by local FDR. Notably, six of these triplets involve rs12653946, a known prostate cancer risk variant that influences the expression of IRX4, a tumor suppressor gene in the prostate [6]. Another pathway involves rs7767188, a risk SNP associated with prostate cancer through the expression of TRIM26 [29].

thumbnail
Table 1. Top 10 additional pathways detected by MLFDR in TCGA Prostate Cancer Data (ranked by ).

These pathways were not detected by HDMT or MDACT.

https://doi.org/10.1371/journal.pcbi.1013880.t001

These empirical results align with our simulation studies: HDMT yielded the fewest discoveries, followed by MDACT, with MLFDR providing the highest detection rate. We note that the improvement of MLFDR over MDACT is modest in this application. This is likely attributable to the symmetric distribution of α and β; as noted by [22], local FDR-based tests offer limited power gains over p-value-based methods when the alternative distribution is symmetric around the null.

3.2 TCGA lung squamous cell carcinoma

We further extend our analysis to the TCGA Lung Squamous Cell Carcinoma (LUSC) dataset to investigate the mediating role of CpG methylation in the relationship between smoking history and gene expression. The data were acquired using the R package UCSCXenaTools [28]. We restricted the analysis to primary tumor samples, resulting in a sample size of n = 379 after preprocessing.

Smoking history was quantified by pack-years. Using the publicly available probe map data from the TCGA website, each CpG probe was mapped to the expression profiles of potentially multiple genes; each unique CpG-gene pair was treated as a distinct candidate mediation pathway. The analysis proceeded in two stages. In the first stage, CpG methylation beta values were regressed on smoking history (pack-years). In the second stage, a multiple linear regression was fitted for each gene, including all CpG probes mapped to that gene as predictors. The coefficients and p-values for each specific CpG-gene pair were then extracted to form the mediation hypotheses. All models were adjusted for potential confounders, including sex and the age at initial diagnosis. In total, 319,761 CpG-gene pathways were evaluated.

At an FDR threshold of 0.01, HDMT identified 13 pathways, MDACT identified 25 pathways, and MLFDR identified 44 pathways (Fig 10). The results highlight the increased power of MLFDR in detecting subtle mediation signals. Table 2 details the top 10 additional pathways detected by MLFDR (ranked by local FDR) that were not identified by the competing methods. Several of these findings are supported by existing literature. For instance, [27] discuss the relevance of WDR66 in lung cancer progression, while [2] highlights the role of LY6K as a potential therapeutic target in lung squamous cell carcinoma. [31] links TCIRG1 to metastatic potential of hepatocellular carcinoma.

thumbnail
Fig 10. Smoking-CpG-Gene pathways detected by MDACT, MLFDR, and HDMT out of 319,761 tests.

The Venn diagram (left) displays overlaps at an FDR cutoff of 0.01, while the plot (right) shows detection counts across varying FDR thresholds.

https://doi.org/10.1371/journal.pcbi.1013880.g010

thumbnail
Table 2. Top 10 additional pathways detected by MLFDR in the TCGA LUSC dataset (ranked by ).

These pathways were not detected by HDMT or MDACT.

https://doi.org/10.1371/journal.pcbi.1013880.t002

To evaluate the FDR of the three methods, we selected a subset of the data with tests, and permuted the samples to create a “global null” scenario. For each permutation, the smoking-Cpg and CpG-gene expression model is fit, and the three methods are implemented. At a given permutation, if any method has non-zero rejections, the False Positive rate for that permutation is recorded as 1 for that method. This procedure is repeated over 100 permutations, and the number of rejections is recorded for FDR levels varying from 0.0001 to 0.2. Fig 11 presents the proportion of cases in which each method came up with significant rejections under the global null. HDMT and MLFDR appear to have satisfactory control of the FDR, while MDACT appears to have some FDR inflation at low FDR levels.

thumbnail
Fig 11. Smoking-CpG-gene pathways detected by all methods under the global null out of tests.

The proportion of cases reporting positive detection counts is reported across varying FDR thresholds. The gray solid line indicates target FDR levels.

https://doi.org/10.1371/journal.pcbi.1013880.g011

4 Discussion

We have presented a flexible and powerful screening algorithm for detecting causal pathways in high-dimensional mediation analysis. MLFDR is capable of dealing with a wide array of outcome variables, confounders, and mediation exposure interactions. It can work in tandem with surrogate variable analysis to address complex dependence structures like pleiotropy and unmeasured confounding. Through simulations and theoretical analysis, we have shown that the proposed method is a viable alternative to p-value-based screening methods, which may not produce optimal results depending on the structure of the alternative hypothesis. We have also shown theoretical guarantees of FDR control for MLFDR. Overall, we hope that MLFDR will be used as an alternative to p-value based methods for mediator screening in the future.

5 Methods

5.1 The mediation model

Consider a univariate exposure X, a set of mediators , and m outcomes . We assume the following structural equation model:

(11)

where the error terms satisfy and are distributed as:

In settings with high-dimensional exposures, such as Single Nucleotide Polymorphisms (SNPs), we define the hypothesis based on the Exposure-Mediator-Outcome triplet . In such cases, the common exposure X in Eq (11) is replaced by the specific exposure Xi.

To describe the methodology, we focus on the formulation in Model (11). Given n independent samples , our goal is to test the joint significance of and for . Let , , , and define the projection matrix . The ordinary least squares (OLS) estimators are given by:

(12)

Conditional on X and , the estimators follow the distribution:

where and . Note that and are independent, as . We test the composite null hypothesis against the alternative:

(13)

Denote by the latent vector indicating the underlying truth of the i-th hypothesis, where and . The vector takes values in the set . Let represent the count of hypotheses in each state for .

We assume a prior distribution , where and . Conditional on the latent states, the non-zero effect sizes are modeled as Gaussian:

When , (and analogously for ). Assuming that and are independent conditional on , the marginal distribution of the coefficient estimates given the latent states is:

(14)

where denote the variances of the estimates conditional on .

We employ an Expectation-Maximization (EM) algorithm to estimate the unknown parameters ; details are provided in Section B of the S1 Text. Although MLFDR is derived under the framework of Model (11), it is applicable to a broader range of settings, including binary outcomes, mediator-exposure interactions, and models with confounders. The extensive simulations presented earlier demonstrate the method’s robustness across these varied scenarios.

5.2 Extension: Composite alternative

In the last section we introduced the latent variable to characterize the underlying state of the hypothesis. We now extend this framework to a composite alternative setting, where the non-zero effects are drawn from mixture distributions. Specifically, we assume are generated as follows:

Here, the index 0 denotes the null state, such that (i.e., a degenerate distribution at zero). As before, and are assumed independent conditional on the latent state .

Marginalizing over the prior distribution, the joint distribution of the estimators follows a Gaussian Mixture Model (GMM):

(15)

where . In this context, the component null hypotheses map to specific index combinations: H01,i corresponds to , while the alternative H11,i corresponds to .

The corresponding marginal distributions are given by:

(16)

where and .

Two-step EM algorithm.

Directly fitting a (d1  +  1)(d2  +  -component bivariate GMM to estimate the joint probability matrix is computationally intensive. To mitigate this burden, we propose a two-step Expectation-Maximization (EM) algorithm.

In the first step, we estimate the parameters and using the univariate marginal distributions described in (Eq 16). While the marginal mixing proportions and are insufficient to recover the joint distribution (without assuming independence), the moment estimates remain valid.

In the second step, we fix these mean and variance estimates and fit a constrained bivariate GMM to the joint data solely to estimate the joint mixing proportions . This approach significantly reduces the dimensionality of the optimization problem. Details of this algorithm are provided in Section C of S1 Text, and a supporting simulation is presented in Sect 2.3.

This two-step approach is also applicable to the standard case where . In Section D of S1 Text, we compare the standard bivariate EM against this two-step variant. The results indicate that while the two-step method offers substantial computational speedups, it incurs a slight reduction in power. Therefore, we recommend using the standard bivariate EM for simple cases () and reserving the two-step EM for complex composite alternatives where computational efficiency is paramount.

5.3 Step-up procedure based on local FDR

[22] demonstrated that for a simple null hypothesis, the oracle local FDR-based rejection region outperforms p-value-based thresholding. Specifically, it achieves a lower marginal False Non-discovery Rate (mFNR) while controlling the marginal False Discovery Rate (mFDR) at the same level. This advantage is particularly pronounced when the alternative distribution is asymmetric about the null. Motivated by these findings, we implement a local FDR-based step-up procedure to identify significant mediation pathways.

We focus our discussion on the Gaussian Mixture Model described in (14). The extension to the more general mixture model in (15) is straightforward. Conditional on the latent variables, the density function of the transformed statistics under state Hjk,i is given by:

(17)

where , , and denotes the bivariate normal density with mean vector , variances , and zero correlation.

For notational simplicity, let and . Under the composite null hypothesis, the joint local FDR is defined as:

(18)

This quantity can be estimated by substituting the parameter estimates obtained from the EM algorithm; we denote this estimate by . As the local FDR represents the posterior probability of the null hypothesis, a lower value indicates stronger evidence against the null. Consequently, we define the rejection region as , where the threshold δ must be determined to control the error rate.

We assume the cumulative distribution function (CDF) of the joint local FDR is given by:

(19)

where Gjk(t) is the conditional CDF of under hypothesis Hjk:

(20)(21)

Oracle procedure.

The oracle procedure assumes that all parameters and densities fjk are known. For a given threshold δ, we define the following counting processes:

(22)(23)(24)(25)

where represents the number of false rejections, the total number of rejections, and the number of missed discoveries (false negatives). We can express as:

Taking the expectation, we obtain:

(26)

The mFDR at threshold δ is defined as:

(27)

This quantity can be empirically estimated by:

(28)

In Section A of the S1 Text, we prove that for a fixed δ, the numerator and denominator of are unbiased estimators of the numerator and denominator of , respectively. The oracle rejection region for the composite null is defined as , where the threshold is selected as:

(29)

Adaptive procedure.

In practical applications, the true parameters and the component densities fjk are unknown. Consequently, the true local FDR values in Equation (28) are inaccessible. To address this, we substitute the unknown quantities with their estimates obtained via the EM algorithm. By plugging in these estimates, we obtain the estimated local FDR, denoted as , which yields an empirical estimate of :

(30)

Accordingly, the data-adaptive threshold is defined as:

The resulting procedure, which rejects the i-th hypothesis if , is operationally equivalent to the step-up algorithm detailed in Algorithm 2.

Algorithm 2 A data-adaptive procedure for finding the cutoffs in MLFDR.

Require: EM estimates for the parameters

1: for do

2:   Compute

3: end for

4: Sort the estimated local FDR values in ascending order to obtain the set of order statistics , where ties are broken at random.

5: For each , compute the average estimated local FDR for the top k statistics:

6: Determine the cutoff index

7: Reject the null hypotheses corresponding to the smallest k local FDR values, denoted as

Supporting information

S1 Fig. Comparison of empirical FDR and power across different methods for MLFDR and two-step MLFDR.

https://doi.org/10.1371/journal.pcbi.1013880.s001

(TIFF)

S1 Text. Supplementary file detailing the methods and the theory.

https://doi.org/10.1371/journal.pcbi.1013880.s002

(PDF)

References

  1. 1. Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Pers Soc Psychol. 1986;51(6):1173–82. pmid:3806354
  2. 2. Chen Y, Zhou C, Zhang X, Chen M, Wang M, Zhang L, et al. Construction of a novel radioresistance-related signature for prediction of prognosis, immune microenvironment and anti-tumour drug sensitivity in non-small cell lung cancer. Ann Med. 2025;57(1):2447930. pmid:39797413
  3. 3. Dai JY, Stanford JL, LeBlanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. J Am Stat Assoc. 2022;117(537):198–213. pmid:35400115
  4. 4. Dhar GA, Saha S, Mitra P, Nag Chaudhuri R. DNA methylation and regulation of gene expression: Guardian of our health. Nucleus (Calcutta). 2021;64(3):259–70. pmid:34421129
  5. 5. Ding J, Zhu X. Amdp: an adaptive detection procedure for false discovery rate control in high-dimensional mediation analysis. Advances in Neural Information Processing Systems. 2024;36.
  6. 6. Nguyen HH, Takata R, Akamatsu S, Shigemizu D, Tsunoda T, Furihata M, et al. IRX4 at 5p15 suppresses prostate cancer growth through the interaction with vitamin D receptor, conferring prostate cancer susceptibility. Hum Mol Genet. 2012;21(9):2076–85. pmid:22323358
  7. 7. Harris SE, Riggio V, Evenden L, Gilchrist T, McCafferty S, Murphy L, et al. Age-related gene expression changes,, transcriptome wide association study of physical, cognitive aging traits and in the Lothian Birth Cohort 1936. Aging (Albany NY). 2017;9(12):2489–503. pmid:29207374
  8. 8. Lange T, Hansen JV. Direct and indirect effects in a survival context. Epidemiology. 2011;22(4):575–81. pmid:21552129
  9. 9. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28(6):882–3. pmid:22257669
  10. 10. Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3(9):1724–35. pmid:17907809
  11. 11. Zhonghua L, Shen J, Barfield R, Schwartz J, Baccarelli AA, Lin X. Large-scale hypothesis testing for causal mediation effects with applications in genome-wide epigenetic studies. Journal of the American Statistical Association. 2022;117:67–81.
  12. 12. Maas SCE, Mens MMJ, Kühnel B, van Meurs JBJ, Uitterlinden AG, Peters A, et al. Smoking-related changes in dna methylation and gene expression are associated with cardio-metabolic traits. Clinical Epigenetics. 12:1–16, 2020.
  13. 13. MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychological methods. 2002;7(1):83.
  14. 14. Moore LD, Le T, Fan G. DNA methylation and its basic function. Neuropsychopharmacology. 2013;38(1):23–38. pmid:22781841
  15. 15. Pearl J. The causal mediation formula–a guide to the assessment of pathways and mechanisms. Prev Sci. 2012;13(4):426–36. pmid:22419385
  16. 16. Perera C, Zhang H, Zheng Y, Hou L, Qu A, Zheng C, et al. HIMA2: high-dimensional mediation analysis and its application in epigenome-wide DNA methylation data. BMC Bioinformatics. 2022;23(1):296. pmid:35879655
  17. 17. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143–55. pmid:1576220
  18. 18. Sobel ME. Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology. 1982;13:290.
  19. 19. Somel M, Khaitovich P, Bahn S, Pääbo S, Lachmann M. Gene expression becomes heterogeneous with age. Curr Biol. 2006;16(10):R359-60. pmid:16713941
  20. 20. Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2002;64(3):479–98.
  21. 21. Sun R, McCaw Z, Lin X. Testing a large number of composite null hypotheses using conditionally symmetric multidimensional gaussian mixtures in genome-wide studies. arXiv preprint 2023. https://arxiv.org/abs/2309.12584
  22. 22. Sun W, Cai TT. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association. 2007;102(479):901–12.
  23. 23. Tchetgen Tchetgen EJ. On causal mediation analysis with a survival outcome. Int J Biostat. 2011;7(1):Article 33. pmid:22049268
  24. 24. Valeri L, Vanderweele TJ. Mediation analysis allowing for exposure-mediator interactions and causal interpretation: theoretical assumptions and implementation with SAS and SPSS macros. Psychol Methods. 2013;18(2):137–50. pmid:23379553
  25. 25. VanderWeele TJ. Mediation analysis: a practitioner’s guide. Annu Rev Public Health. 2016;37:17–32. pmid:26653405
  26. 26. Vanderweele TJ, Vansteelandt S. Odds ratios for mediation analysis for a dichotomous outcome. Am J Epidemiol. 2010;172(12):1339–48. pmid:21036955
  27. 27. Wang Q, Ma C, Kemmner W. Wdr66 is a novel marker for risk stratification and involved in epithelial-mesenchymal transition of esophageal squamous cell carcinoma. BMC Cancer. 2013;13:137. pmid:23514407
  28. 28. Wang S, Liu X. The UCSCXenaTools R package: a toolkit for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq. JOSS. 2019;4(40):1627.
  29. 29. Wu L, Yang Y, Guo X, Shu X-O, Cai Q, Shu X, et al. An integrative multi-omics analysis to identify candidate DNA methylation biomarkers related to prostate cancer risk. Nat Commun. 2020;11(1):3905. pmid:32764609
  30. 30. Yang H, Liu Z, Wang R, Lai E-Y, Schwartz J, Baccarelli AA, et al. Causal mediation analysis for integrating exposure, genomic, and phenotype data. Annu Rev Stat Appl. 2025;12:337–60. pmid:41001172
  31. 31. Yang HD, Eun JW, Lee K-B, Shen Q, Kim HS, Kim SY, et al. T-cell immune regulator 1 enhances metastasis in hepatocellular carcinoma. Exp Mol Med. 2018;50(1):e420. pmid:29303507
  32. 32. Yu Z, Cui Y, Wei T, Ma Y, Luo C. High-dimensional mediation analysis with confounders in survival models. Front Genet. 2021;12:688871. pmid:34262599
  33. 33. Zhang H, Zheng Y, Hou L, Zheng C, Liu L. Mediation analysis for survival data with high-dimensional mediators. Bioinformatics. 2021;37(21):3815–21. pmid:34343267