A comparison of multiple testing adjustment methods with block-correlation positively-dependent tests

In high-dimensional data analysis (such as gene expression, spatial epidemiology, or brain imaging studies), we often test thousands or more hypotheses simultaneously. As the number of tests increases, the chance of observing at least one statistically significant test becomes very high even when all null hypotheses are true. Consequently, we could reach incorrect conclusions regarding the hypotheses. Researchers frequently use multiplicity adjustment methods to control type I error rates, primarily the family-wise error rate (FWER) or the false discovery rate (FDR), while still desiring high statistical power. In practice, such studies may have dependent test statistics (or p-values), as tests can depend on each other. However, some commonly used multiplicity adjustment methods assume independent tests. We perform a simulation study comparing several of the most common adjustment methods used in multiple hypothesis testing, under varying degrees of block-correlation positive dependence among tests.

Generally, there are two types of multiple testing methods: those that control the FWER and those that control the FDR. Methods that control the FWER are discussed first, followed by methods that control the FDR. Throughout, p_1, p_2, ..., p_m are the p-values of a set of given null hypotheses H_1, H_2, ..., H_m (with corresponding test statistics X_1, ..., X_m); some methods involve the ordered p-values p_(1) ≤ p_(2) ≤ ... ≤ p_(m), with ordered p-value p_(i) corresponding to null hypothesis H_(i) and test statistic X_(i).

Methods involved in controlling the family wise error rate (FWER)
The FWER is defined as the probability of one or more false positive events (or type I errors) among all the hypotheses when conducting multiple hypothesis tests. That is, FWER = P(V ≥ 1), where V is the number of type I errors (see Table 1).

Bonferroni procedure
In multiple comparisons, the Bonferroni correction [2] is widely used to control the probability that a true null hypothesis is incorrectly rejected, although it gives a conservative upper bound on the FWER. The Bonferroni procedure rejects hypothesis H_k whenever p_k ≤ α/m.

Holm procedure
The Holm procedure [6] is a step-down modification of the Bonferroni procedure: it rejects H_(i) for i = 1, 2, ..., as long as p_(i) ≤ α/(m − i + 1), and stops at the first index for which this inequality fails. If all intersection hypotheses containing H_k are significant, then H_k will be declared significant based on the closed testing principle.
It is worth noting that while Holm (1979) [6] mentioned the desirability of independent test statistics from the m null hypotheses ("not for computational reasons but because a good experimental design requires the different hypotheses to be tested by variables 'not related to each others' "), the Holm procedure itself does not require independence of test statistics.
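To make these FWER procedures concrete, the following is a minimal sketch of the Bonferroni rule and the Holm step-down rule (a Python illustration of our own; the function names and list-based interface are assumptions, not code from the article):

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Bonferroni: reject H_k whenever p_k <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm_reject(pvals, alpha=0.05):
    """Holm step-down: scan the ordered p-values and reject while
    p_(i) <= alpha / (m - i + 1); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):          # rank = 0, ..., m - 1
        if pvals[idx] <= alpha / (m - rank):    # i.e., alpha / (m - i + 1)
            reject[idx] = True
        else:
            break                               # step-down scan stops here
    return reject
```

On p-values (0.001, 0.01, 0.02, 0.9) with α = 0.05, Bonferroni (threshold 0.0125) rejects only the first two hypotheses, while Holm also rejects the third, illustrating Holm's uniform power advantage.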

Hochberg procedure
The Hochberg procedure [9] applies a step-up technique: let k = max{i : p_(i) ≤ α/(m − i + 1)}; the procedure then rejects all hypotheses H_(i) with i ≤ k (and rejects none if no such k exists). Like the Hommel procedure, this procedure performs well on independent test statistics, but the Hommel procedure further increases the power of the tests while controlling the FWER. Sarkar (1998) [10] showed that the Hochberg procedure controls the FWER when the test statistics satisfy the multivariate total positivity of order 2 (MTP_2) dependence condition (see the section on dependence below).
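A minimal Python sketch of the Hochberg step-up rule (our own illustration, not the article's code) shows how the scan runs from the largest p-value downward, in contrast to Holm's step-down scan:

```python
def hochberg_reject(pvals, alpha=0.05):
    """Hochberg step-up: find the largest i with p_(i) <= alpha/(m - i + 1),
    then reject the i hypotheses with the smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for i in range(m, 0, -1):                   # scan from the largest p down
        if pvals[order[i - 1]] <= alpha / (m - i + 1):
            k = i                               # first success fixes k
            break
    reject = [False] * m
    for i in range(k):
        reject[order[i]] = True
    return reject
```

For example, on p-values (0.03, 0.04) with α = 0.05, Holm rejects nothing (0.03 > 0.025), while Hochberg rejects both because p_(2) = 0.04 ≤ 0.05.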

Methods controlling the false discovery rate (FDR)
Recalling Table 1 and according to Benjamini and Hochberg (1995) [1], the false discovery proportion (FDP) is the number of false positives (or false rejections) divided by the number of rejections. That is, Q = V/R when R > 0 and Q = 0 when R = 0, where Q is the false discovery proportion, V is the number of false rejections, and R is the total number of rejections. The FDR is defined as the expected FDP; that is, FDR = E(Q).

Benjamini and Hochberg (BH) procedure
Benjamini and Hochberg (1995) [1] provided a completely different idea for controlling error rates in multiple hypothesis testing. The procedure involves a step-up technique for non-parametric statistics in order to control the error rate. They suggested that the multiplicity problem is reduced if we consider the FDP instead of the usual probability of making at least one type I error among all tests. In addition, exact control is accomplished for the FDR while providing a smaller type II error rate than FWER-controlling methods.

Benjamini and Hochberg proposed a linear step-up technique, based on the Simes (1986) [11] test, that controls the FDP at a desired level α in first-order mean (i.e., in expectation). Their proposed method, referred to as the BH procedure, also provides a non-parametric and finite-sample method for choosing the p-value threshold, which is often more powerful than traditional methods for controlling the FWER.
The linear step-up BH procedure rejects all hypotheses H_(k) with k ≤ max{i : p_(i) ≤ iα/m}; if no such i exists, no hypothesis is rejected. For independent tests the BH procedure controls the FDR at level α.
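A minimal Python sketch of this linear step-up rule (our own illustration; not the article's code):

```python
def bh_reject(pvals, alpha=0.05):
    """BH linear step-up: reject the k hypotheses with the smallest
    p-values, where k = max{i : p_(i) <= i * alpha / m} (0 if none)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for i in range(1, m + 1):
        if pvals[order[i - 1]] <= i * alpha / m:
            k = i                    # keep the LARGEST qualifying i
    reject = [False] * m
    for i in range(k):
        reject[order[i]] = True
    return reject
```

Note the step-up character: the scan keeps the largest qualifying index, so a large p_(i) early on does not block rejections, only the final maximum matters.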
Following Table 1 and Eq 3, one important feature of the FDR is that when m_0 = m every rejection is a false rejection, so Q = 1 whenever R > 0 [1]. Consequently, controlling E(V/R | R > 0) (see Eq 2) is not possible when Q = 1. Thus, an alternative formulation of the FDR is FDR = E(V/R | R > 0) P(R > 0).

Benjamini and Yekutieli (BY) procedure
Benjamini and Yekutieli (2001) [12] extended the original BH procedure to dependent test statistics (or dependent p-values). They pointed out that the BH procedure controls the FDR at level (m_0/m)α ≤ α when the test statistics have a positive regression dependence structure (see the section on dependence below). However, when the test statistics are negatively correlated, or under other forms of dependence, a modified BH procedure, referred to here as the BY procedure, is needed to control the FDR at the same level (m_0/m)α ≤ α. The BY method multiplies the BH-adjusted p-values by the harmonic sum Σ_{i=1}^{m} 1/i, truncating at 1 as necessary. Equivalently, the BY method replaces α in the BH method with α / Σ_{i=1}^{m} 1/i.
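The harmonic-sum penalty can be sketched as follows (a self-contained Python illustration of our own, not the article's code); it is simply the BH step-up rule run at the deflated level α / Σ_{i=1}^{m} 1/i:

```python
def by_reject(pvals, alpha=0.05):
    """BY: BH step-up with alpha deflated by the harmonic sum c_m."""
    m = len(pvals)
    c_m = sum(1.0 / i for i in range(1, m + 1))   # harmonic-sum penalty
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for i in range(1, m + 1):
        if pvals[order[i - 1]] <= i * alpha / (m * c_m):
            k = i
    reject = [False] * m
    for i in range(k):
        reject[order[i]] = True
    return reject
```

The penalty makes BY visibly more conservative than BH: on p-values (0.02, 0.8) with α = 0.05, BH rejects one hypothesis but BY rejects none, since 0.02 > 0.05/(2 × 1.5).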

Adaptive Benjamini and Hochberg (ABH) procedure
Benjamini and Hochberg (2000) [13] modified the original BH procedure by estimating m_0 (i.e., the number of true null hypotheses) in Table 1. The original BH procedure does not require estimation of m_0, while this procedure depends on knowledge of m_0. Assuming independent tests, the adaptive BH procedure, referred to here as the ABH procedure, involves the following steps [14]:
Step 1: Use the linear step-up procedure (that is, the original BH procedure) at level α, and if no hypothesis is rejected stop; otherwise, proceed.
Step 2: Estimate m_0(k) = (m + 1 − k)/(1 − p_(k)) for k = 1, 2, ..., m.
Step 3: Starting with k = 2, stop when for the first time m_0(k) > m_0(k − 1); estimate m_0 as min{m_0(k), m}, rounding up to the next integer.
Step 4: Use the linear step-up procedure again at level α* = αm/m_0, with m_0 replaced by its Step 3 estimate.
The ABH procedure controls the FDR exactly at level α, while the original BH procedure was shown to provide a slightly conservative upper bound. In addition, the ABH procedure has greater power than the original BH procedure [13,14].
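A hedged Python sketch of the adaptive idea, assuming the estimator m_0(k) = (m + 1 − k)/(1 − p_(k)) described by Benjamini et al. (2006) [14] (our own illustration, not the article's code):

```python
import math

def bh_count(pvals, alpha):
    """Number of rejections made by the BH linear step-up procedure."""
    m = len(pvals)
    ps = sorted(pvals)
    k = 0
    for i in range(1, m + 1):
        if ps[i - 1] <= i * alpha / m:
            k = i
    return k

def abh_count(pvals, alpha=0.05):
    """Adaptive BH: plug an estimate of m0 back into the BH procedure."""
    m = len(pvals)
    if bh_count(pvals, alpha) == 0:
        return 0                              # Step 1: plain BH rejects nothing
    ps = sorted(pvals)
    prev, m0_hat = float("inf"), m
    for k in range(1, m + 1):                 # m0(k) = (m + 1 - k) / (1 - p_(k))
        cur = (m + 1 - k) / (1 - ps[k - 1]) if ps[k - 1] < 1 else float("inf")
        if k >= 2 and cur > prev:             # stop at the first increase
            m0_hat = min(math.ceil(cur), m)
            break
        prev = cur
    return bh_count(pvals, alpha * m / m0_hat)   # rerun BH at alpha * m / m0_hat
```

When many p-values are small, the estimated m_0 falls below m, the working level α m/m_0 exceeds α, and the ABH procedure can reject hypotheses that plain BH misses.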

Two stage Benjamini and Hochberg (TSBH) procedure
According to Benjamini et al. (2006) [14], the two-stage linear step-up version of the BH procedure, referred to here as the TSBH procedure, is summarized in the following way:
Step 1: Use the linear step-up procedure (the original BH procedure) at level α′ = α/(1 + α). Let r_1 be the number of rejected hypotheses. If r_1 = 0, do not reject any hypothesis and stop; if r_1 = m, reject all m hypotheses and stop; otherwise proceed.
Step 2: Estimate m_0 as m − r_1.
Step 3: Use again the linear step-up procedure with α* = mα′/m_0, using the Step 2 estimate of m_0.
In Step 2, the TSBH procedure estimates m_0 (see Table 1) with the constraint m_0 = m − r_1. The BH procedure is used in the initial stage, ensuring E(V/R) ≤ m_0α/m, so that, loosely, V ≤ αm_0R/m. Substituting this bound for V into Eq 5 yields further relations, and the right-most bound of Eq 7 is inherently used in the TSBH procedure. This procedure also controls the FDR at level α when test statistics are independent. Benjamini et al. (2006) [14] showed that the TSBH procedure controls the FDR below, but close to, the nominal level α and provides higher power than the original BH procedure when tests are correlated.
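A Python sketch of the two stages (a self-contained illustration of our own, not the article's code):

```python
def bh_count(pvals, alpha):
    """Number of rejections made by the BH linear step-up procedure."""
    m = len(pvals)
    ps = sorted(pvals)
    k = 0
    for i in range(1, m + 1):
        if ps[i - 1] <= i * alpha / m:
            k = i
    return k

def tsbh_count(pvals, alpha=0.05):
    """Two-stage BH: estimate m0 = m - r1 from a first BH pass at the
    deflated level alpha' = alpha / (1 + alpha), then rerun BH."""
    m = len(pvals)
    a1 = alpha / (1.0 + alpha)                # Stage 1 level alpha'
    r1 = bh_count(pvals, a1)
    if r1 == 0:                               # Step 1 stopping rules
        return 0
    if r1 == m:
        return m
    m0_hat = m - r1                           # Step 2: m0 estimate
    return bh_count(pvals, a1 * m / m0_hat)   # Step 3: BH at m * alpha' / m0
```

The deflated first-stage level α/(1 + α) is what lets the procedure keep FDR control despite reusing the data to estimate m_0.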

q-value method
In the original BH procedure, the rejection of all hypotheses H_(i) (i = 1, 2, ..., k) relies on k = max{i : p_(i) ≤ iα/m}. However, the estimation process for the maximum number of rejected hypotheses (i.e., k) from each possible combination of hypotheses is not clearly explained in the BH procedure [15]. Storey (2002) [15] also pointed out that the BH procedure may not give a reliable estimate for k when a large number of hypotheses are considered, and hence it affects controlling the FDR at level α "for all values of m_0 (i.e., the number of true null hypotheses) simultaneously".
To address these weaknesses of the BH procedure, Storey (2002) [15] demonstrated a new way of controlling error rates, with the positive false discovery rate (pFDR) controlled at the desired level α. The pFDR preserves information about m_0 by applying point estimation techniques in a Bayesian framework. Storey (2002) [15] introduced the term "q-value" to quantify the pFDR, defining the pFDR as in Eq 8: pFDR = E(V/R | R > 0). That is, the pFDR is the expected proportion of erroneously rejected hypotheses among all rejected hypotheses, given that significant findings have occurred. Where the BH approach uses the threshold k = max{i : p_(i) ≤ iα/m} and rejects all hypotheses H_(i) (i = 1, 2, ..., k), Storey (2002) [15] proposed to fix a threshold t for significance of the p-values and to estimate the resulting FDR; t is chosen so that the estimated FDR is less than or equal to α. Storey (2002) [15] identified the q-value as analogous to the p-value: the q-value measures the error rate with respect to the pFDR, while the p-value measures the error rate with respect to the type I error.
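To illustrate the fixed-threshold idea, here is a rough Python sketch of a Storey-style FDR estimate and the resulting q-values; the tuning parameter λ = 0.5 and the specific estimator form are our assumptions about the standard approach, not the article's code:

```python
def storey_fdr_hat(pvals, t, lam=0.5):
    """Estimated FDR at a fixed p-value threshold t: pi0_hat * m * t / R(t),
    with pi0 estimated from the proportion of p-values above lam."""
    m = len(pvals)
    pi0_hat = sum(p > lam for p in pvals) / (m * (1.0 - lam))
    r_t = max(sum(p <= t for p in pvals), 1)   # avoid dividing by zero
    return min(pi0_hat * m * t / r_t, 1.0)

def q_values(pvals, lam=0.5):
    """q-value of p_(i): smallest estimated FDR over thresholds t >= p_(i)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    for i in range(m - 1, -1, -1):             # step down from the largest p
        idx = order[i]
        running = min(running, storey_fdr_hat(pvals, pvals[idx], lam))
        q[idx] = running
    return q
```

Rejecting every hypothesis whose q-value is at most α then mimics choosing the largest threshold t with estimated FDR ≤ α.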
Principal factor approximation (PFA) method
The authors of [16] presented an alternative approach to estimating the FDR, similar to the q-value framework, but paying particular attention to potential arbitrary dependence among the test statistics. This principal factor approximation (PFA) approach treats the test statistics X_1, ..., X_m as normally distributed with mean vector (µ_1, ..., µ_m) and covariance matrix Σ. The basic idea of the PFA approach "is to first take out the principal factors that derive the strong dependence among observed data" X_1, ..., X_m "and to account for such dependence in calculation of the false discovery proportion" [16]. In our implementation of this method, we treated the covariance matrix Σ as known and used the stated default parameters in the pfa.test function of the R package pfa [17].
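The "take out the principal factors" step can be illustrated with a small sketch (a Python illustration of our own of eigendecomposition-based factor extraction; the pfa package itself does considerably more, such as estimating the false discovery proportion from the factor-adjusted statistics):

```python
import numpy as np

def principal_factors(sigma, k):
    """Top-k principal factors of a known covariance matrix: returns the
    m x k loading matrix B with sigma approximately B @ B.T plus a
    residual term driven by the remaining eigenvalues."""
    vals, vecs = np.linalg.eigh(sigma)      # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]        # indices of the k largest
    return vecs[:, top] * np.sqrt(vals[top])
```

For an equicorrelated block (correlation ρ), a single factor captures the leading eigenvalue 1 + (m − 1)ρ, which is the sense in which a few factors can "drive" strong dependence.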

Dependence among test results: PRDS and MTP 2
The test statistics X = (X_(1), ..., X_(m)), corresponding to the ordered p-values p_(1) ≤ ... ≤ p_(m) (and null hypotheses H_(1), ..., H_(m)), are sometimes assumed independent by multiplicity correction methods, but in many contexts (as in the genomic, spatial epidemiology, and brain imaging studies mentioned in the main article), this assumption is suspect. Benjamini and Yekutieli (2001) [12] define different dependence types among test statistics. While those authors provide technical definitions, we briefly summarize the main dependence types here.
Let I_0 ⊆ X be the set of test statistics corresponding to true null hypotheses. The dependence type emphasized by Benjamini and Yekutieli (2001) [12] is positive regression dependency on each one from a subset I_0, or PRDS. This can be interpreted as saying that H_(1) is least likely to be in I_0, followed by H_(2), and so on, until H_(m) is the most likely to be in I_0. Benjamini and Yekutieli (2001) [12] showed that, when the joint distribution of X is PRDS, the original BH procedure controls the FDR at level (m_0/m)α ≤ α. They also showed that, for any joint distribution of X (even if not PRDS), their BY procedure controls the FDR at level (m_0/m)α ≤ α. A special case of PRDS (which is easier to show than PRDS [12]) is multivariate total positivity of order 2, or MTP_2. In this dependence type, the density function f_X of X satisfies f_X(x) f_X(y) ≤ f_X(min(x, y)) f_X(max(x, y)), where the min and max are evaluated componentwise [12]. According to Nichols and Hayasaka (2003) [18], "Gaussian data with positive correlations will satisfy the PRDS condition," a fact that we use in the simulation section of the article to compare the performance of the aforementioned multiplicity adjustment methods when test results are not independent.

B. Additional simulations for PFA
Based on the FDR control (or lack thereof) exhibited by the PFA method in Fig. 3 of the article, we considered an additional simulation to examine if there were conditions in which the PFA method would provide FDR control. It is beyond the scope of our work to exhaustively characterize the performance of the PFA method; instead, we only consider if the PFA method can be shown to provide reasonable control of the FDR in some simulation settings.
We use a simulation set-up similar to that described in Section 3 of the article, including reference to µ, A, and ρ. Let m be the total number of tests (features); we consider m values ranging from 80 to 2000, and 10 percent of the tests have false null hypotheses. We assume six blocks of correlated tests, with constant within-block correlation ρ, and each block is comprised of 2.5 percent of the m tests. Three blocks correspond to false null hypotheses (using A = 2), and three blocks correspond to true null hypotheses (A = 0). Of the independent tests, most (82.5 percent of the m tests) correspond to true null hypotheses, but some (the remaining 2.5 percent of the m tests) correspond to false null hypotheses (again with A = 2). Across 100 simulations, we calculate the average and standard deviation of the FDR (when aiming for purported FDR control at level α = 0.05), with results summarized (and showing approximate 95 percent confidence intervals) in Fig. S1 below.
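As a hedged sketch of one replicate under this set-up (a Python translation of the design described above; the article's actual code is in R, and the exact generator details here, such as the shared-factor construction of the equicorrelated blocks, are our assumptions):

```python
import math
import numpy as np

def simulate_pvals(m, rho, A=2.0, block_frac=0.025, seed=0):
    """One replicate of m one-sided z-tests: six equicorrelated blocks
    (three false-null blocks at mean A, three true-null blocks at mean 0),
    plus independent tests (2.5% of m false nulls, the rest true nulls)."""
    rng = np.random.default_rng(seed)
    b = int(block_frac * m)                  # tests per block
    mu = np.zeros(m)
    mu[:3 * b] = A                           # three false-null blocks
    mu[6 * b:7 * b] = A                      # independent false nulls
    z = rng.standard_normal(m)
    for j in range(6):                       # impose within-block correlation
        shared = rng.standard_normal()
        sl = slice(j * b, (j + 1) * b)
        z[sl] = math.sqrt(rho) * shared + math.sqrt(1 - rho) * z[sl]
    x = mu + z
    # upper-tail p-values from the standard normal survival function
    return np.array([0.5 * math.erfc(v / math.sqrt(2)) for v in x])
```

Averaging the realized false discovery proportion of each adjustment method over many such replicates gives the Monte Carlo FDR estimates summarized in Fig. S1.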
Based on Fig. S1 below, it does appear that the PFA method provides better FDR control for lower numbers of tests, but it does not appear to control the FDR at the stated α level. It should be noted that this simulation (which, along with the panels of Fig. S1, can be reproduced using R code found in S2 file) treats the covariance matrix Σ as known and uses the stated default parameters in the pfa.test function of the R package pfa [17].

C. Additional perspective on FDR/FWER vs. power
During the review process of this manuscript, a reviewer requested a simultaneous representation of the FDR (or FWER) and power results, to facilitate a comparison of their relative tradeoff. Accordingly, we present Figures S1a, S1b, and S1c. Based on Figures S1a through S1c, as the magnitude of differential abundance (A) increases (i.e., as the plotting characters darken), there is a general trend that the FDR and FWER decrease while the power increases. The shape of this relationship is quite different for the PFA method in Fig. S1a.