Accuracy Evaluation of the Unified P-Value from Combining Correlated P-Values

Meta-analysis methods that combine P-values into a single unified P-value are frequently employed to improve confidence in hypothesis testing. An assumption made by most meta-analysis methods is that the P-values to be combined are independent, which may not always hold. To investigate the accuracy of the unified P-value obtained from combining correlated P-values, we have evaluated a family of statistical methods that combine independent, weighted independent, correlated, and weighted correlated P-values. Statistical accuracy evaluation using simulated correlated P-values showed that correlation among P-values can have a significant effect on the accuracy of the combined P-value obtained. Among the statistical methods evaluated, those that weight P-values compute more accurate combined P-values than those that do not. Also, statistical methods that utilize the correlation information have the best performance, producing significantly more accurate combined P-values. In our study we have demonstrated that statistical methods that combine P-values under the assumption of independence can produce inaccurate P-values when combining correlated P-values, even when the P-values are only weakly correlated. Therefore, to prevent drawing false conclusions during hypothesis testing, our study advises caution when interpreting the P-value obtained from combining P-values of unknown correlation. When the correlation information is available, however, the weighting-capable statistical method first introduced by Brown and recently modified by Hou seems to perform the best amongst the methods investigated.


Introduction
Meta-analysis methods that combine P-values into a single unified P-value are commonly used to rank or score a list of hypotheses [1]. For each hypothesis tested, the P-values to be combined are often acquired from studying different features associated with the hypothesis or from using different data analysis methods (DAMs) to analyze a chosen feature. Either approach, when conducted to test the same list of hypotheses, assigns an overall P-value to each hypothesis tested. These P-values are then usually sorted, with the most significant result ranking first in the list. Given that different features may not be completely independent and that different DAMs may share protocols and use similar information, it is likely that the P-values obtained for a hypothesis are correlated.
Most P-value combining methods assume that the P-values to be combined are independent or weakly correlated [2,3]. When the unified P-value is computed by combining correlated P-values without properly taking the correlation into account, there can be notable effects on the significance assigned to the hypotheses tested. Since the P-values to be combined may be correlated, it is important to investigate the effect that correlation has on the unified P-value. The current study is designed to evaluate the accuracy of the unified P-value computed by combining (positively) correlated P-values using some commonly applied statistical methods. By P-value accuracy, we mean how well, on average, the reported P-value agrees with the one-sided cumulative distribution function of the random variable (associated with the null hypotheses tested) at the critical region. In other words, an accurate P-value means that when one intends to control the type-I error rate at a level $\alpha$, the type-I error rate is really controlled at the level $\alpha$. To keep this paper focused, we will not provide a lengthy introduction. For the methods that we evaluate, more details are provided in the Methods section. For others, we will only provide the readers with appropriate references.
Several studies have been performed to evaluate methods that combine independent P-values [4][5][6][7][8][9][10]. For example, Rosenthal has evaluated nine methods for combining P-values and has summarized advantages, limitations and applications for each method [4]. Loughin [5] has also conducted a systematic comparison of methods for combining P-values and recommended that practitioners choose a method based on the structure and expectation of the problem being studied. Recently, Whitlock [6] has shown that the weighted Z-method has more power and precision than Fisher's test. In other studies, Chen [8] as well as Chen and Nadarajah [9] have shown that either the generalized Fisher method due to Lancaster or a special case of Lancaster's test outperforms the weighted Z-method, while Zaykin [10] has shown that the weighted Z-method has similar power to Lancaster's method when the weights are selected to be the square roots of sample sizes.
As for combining correlated P-values, only a few studies have been conducted to evaluate the accuracy of the unified P-value computed by existing statistical methods [11,12]. Evidently, more comprehensive investigations that incorporate different methods, encompass a wide range of correlation strengths, and use a large number of simulations can further our understanding of the effect that correlation has on computing a unified P-value. To advance in this direction, we systematically investigate a family of statistical methods for combining P-values. Because we are interested in combining P-values obtained from right-tailed tests, we have limited our study to methods that combine P-values based on the Normal distribution (e.g. Stouffer's method) and on the Chi-squared distribution (e.g. Fisher's method), which are, respectively, the general-purpose method and the right-tail method recommended by Loughin [5]. The two aforementioned methods, aside from being frequently used to combine P-values, are useful and important to study for the following reason. Both methods have variations that weight P-values while computing the combined P-value, the Lipták, Good and Bhoj methods [13][14][15], and variations that take into account the correlation among P-values, the Hartung and Hou methods [16,17]. In addition, all methods mentioned above have either closed-form formulas, i.e., distribution functions, or approximation formulas that can provide the unified P-value with minimal computational cost.
In summary, our study presents an accuracy evaluation of the unified P-value obtained from statistical methods designed to combine independent, weighted independent, correlated, and weighted correlated P-values. We have evaluated the accuracy of the unified P-value from combining positively correlated P-value vectors, with the correlation among P-value vectors in the range $[0,1]$.
Our results show that methods designed to combine independent P-values but with the capability of assigning weights to P-values perform better than methods that combine independent P-values without weights. Also, methods that take into account the correlation between P-values perform significantly better than methods designed to combine independent P-values. Based on this study, the method first introduced by Brown [18] to combine correlated P-values and later adapted to include weights by Hou [17] is the best-performing one amongst the methods investigated.

Methods
The main task of combining P-values is described below. Given a list of hypotheses $H = \{H_1, H_2, H_3, \cdots, H_K\}$, let each hypothesis have $m$ P-values associated with it. These $m \times K$ P-values can be organized as $m$ P-value vectors, $P_1, P_2, \ldots, P_m$, each having $K$ components. Each P-value vector may result from analyzing one out of $m$ different features of every hypothesis or may come from analyzing a single feature using one of $m$ different DAMs. The $m$ P-values associated with hypothesis $H_i$ are $\{P_1(i), P_2(i), \ldots, P_m(i)\}$. Given those values, one needs to combine them to form a single unified P-value. This scenario can occur in many applications. As an example, when different studies are performed to test a set of genetic loci for allelic imbalance [19], the number of genetic regions tested corresponds to the number of hypotheses $K$, and each region carries with it $m$ P-values, one from each of the $m$ studies. To fairly rank these $K$ regions, for each region one would need a unified P-value resulting from combining the $m$ P-values associated with it. For database-search-based peptide identification using mass spectrometry, it is possible to analyze the data using multiple analysis methods. Here, for each experimental spectrum, the number of hypotheses tested $K$ equals the number of scored peptides in the database, and each peptide receives a P-value from each of the $m$ analysis methods. To fairly rank the candidate peptides, it is again natural to combine the $m$ P-values associated with each scored peptide [3] to reach a unified P-value. In sequence homology detection, where multiple motifs are used as a query to a sequence database, it is often necessary to combine the P-values, each from one of the $m$ motifs, to assign statistical significance to a sequence in the database [2]. In this case, $K$ is the number of sequences in the database, while $m$ is the number of motifs used as the query.
To make the notation uniform, we will use $F_s$ and $F_s^{-1}$ to represent the cumulative distribution and inverse cumulative distribution. When the subscript $s = n, \chi, \gamma$, $F_s$ represents respectively the cumulative Normal, Chi-squared, and Gamma distributions. All the parameters of these distributions will be shown as arguments enclosed by a pair of parentheses following the symbol $F$.

Combining Independent P-values
We begin this subsection with a brief introduction of Stouffer's (Z-transform test) and Fisher's (Chi-square test) methods. Generalizations of both methods to combine weighted P-values are also described.
Method 1. The combined Z-transform test was first used by Stouffer et al. [20] and later generalized to include weights by Lipták [13]. Under the null hypothesis, the P-values are uniformly distributed between $[0,1]$. Given a list of P-values $(p_{(H_i,1)}, p_{(H_i,2)}, \cdots, p_{(H_i,m)})$ associated with a given $H_i$, one transforms the P-values to a new variable $x_{(H_i,j)}$ by the simple transformation $x_{(H_i,j)} = F_n^{-1}(1-p_{(H_i,j)}; \mu=0, \sigma=1)$. Stouffer's way to combine the above P-values is to define a new variable
\[
t = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} x_{(H_i,j)},
\]
which is also Gaussian distributed, with P-value given by the formula
\[
P = 1 - F_n(t; \mu=0, \sigma=1). \qquad (1)
\]
A generalization of the above equation that assigns weights $w_j$ to the variables $x_{(H_i,j)}$ is known as the weighted Z-transform test [13]:
\[
t' = \frac{\sum_{j=1}^{m} w_j\, x_{(H_i,j)}}{\sqrt{\sum_{j=1}^{m} w_j^2}}.
\]
The variable of the weighted Z-transform, $t'$, also follows the standard Normal distribution, and the formula for its P-value is again given by eq. (1).
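As a concrete sketch of the transformation and combination above, using only the Python standard library's `NormalDist` (the function name `stouffer` and its signature are ours, not from the paper):

```python
from statistics import NormalDist

def stouffer(pvals, weights=None):
    """Weighted Z-transform (Liptak); equal weights recover Stouffer's test."""
    nd = NormalDist()                                  # standard normal, mu=0, sigma=1
    x = [nd.inv_cdf(1.0 - p) for p in pvals]           # x = F_n^{-1}(1 - p)
    w = weights or [1.0] * len(pvals)
    t = sum(wj * xj for wj, xj in zip(w, x)) / sum(wj * wj for wj in w) ** 0.5
    return 1.0 - nd.cdf(t)                             # combined P-value, eq. (1)
```

With equal weights this reduces to Stouffer's $t = \sum_j x_j/\sqrt{m}$, since the common weight cancels between numerator and denominator.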
Method 2. Fisher's method [21] is one of the most used methods to combine independent P-values. The combined Fisher P-value is obtained through the following variable:
\[
t' = -2 \sum_{j=1}^{m} \ln(p_{(H_i,j)}),
\]
which follows a Chi-squared distribution $\mathrm{pdf}_\chi(t; 2m)$ with $2m$ degrees of freedom. Computing the unified P-value using the Chi-squared distribution is not the most efficient approach because of the significant computational cost in calculating the cumulative distribution $F_\chi$. A more efficient way to obtain the unified P-value has been proposed [2,3], where the unified P-value of $\tilde{t} \equiv e^{-t'/2}$ has a closed form given by
\[
P = \tilde{t} \sum_{k=0}^{m-1} \frac{(-\ln \tilde{t})^k}{k!},
\]
or, in terms of the $t'$ variable,
\[
P = e^{-t'/2} \sum_{k=0}^{m-1} \frac{(t'/2)^k}{k!}. \qquad (3)
\]
Note that as $t'$ increases, $\tilde{t}$ decreases, and vice versa. Fisher's method does not assign weights to the P-values to be combined. However, when information is available regarding how the P-values were obtained, it might be beneficial to weight the P-values. Lancaster [22] addresses this issue by replacing the random variable $-2\ln(p_{(H_i,j)})$ with $F_\chi^{-1}(1-p_{(H_i,j)}; d_j)$, a variable following a Chi-squared distribution with $d_j$ degrees of freedom not necessarily equal to two.
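The closed-form evaluation above avoids the Chi-squared CDF entirely; a minimal stdlib-only sketch (the function name is ours):

```python
import math

def fisher_closed_form(pvals):
    """Unified P-value for Fisher's t' = -2 * sum(ln p) without a chi^2 CDF:
    P = e^{-t'/2} * sum_{k=0}^{m-1} (t'/2)^k / k!, with m = number of P-values."""
    m = len(pvals)
    half_t = -sum(math.log(p) for p in pvals)          # t'/2 = -sum(ln p)
    return math.exp(-half_t) * sum(half_t ** k / math.factorial(k)
                                   for k in range(m))
```

For $m = 1$ this returns the input P-value unchanged, as it should.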
In Lancaster's procedure, summarized below, one can exploit the equivalence between the Chi-squared distribution $\mathrm{pdf}_\chi(x; d)$ and the gamma distribution $\mathrm{pdf}_\gamma(x; a=\tfrac{d}{2}, b=\tfrac{1}{2})$ to reach a different weighting generalization. For hypothesis $H_i$, the variable $t'$ can now be written as
\[
t' = \sum_{j=1}^{m} F_\chi^{-1}(1-p_{(H_i,j)}; d_j),
\]
which evidently follows a Chi-squared distribution with $\sum_{j}^{m} d_j$ degrees of freedom. In the expression above, Fisher's method is recovered by setting $d_j = 2$ for all $j$. Another way to incorporate weights is to keep $d_j = 2$ while retaining a general $b$ value. Specifically, one may choose, with $w_j$ being the weight factor, to use the following new variable:
\[
t' = \sum_{j=1}^{m} F_\gamma^{-1}\!\left(1-p_{(H_i,j)}; a=1, b=\frac{b}{w_j}\right) = \frac{1}{b} \sum_{j=1}^{m} w_j \left(-\ln p_{(H_i,j)}\right).
\]
The P-value for $t'$ can be easily evaluated using the same technique as that in [3] and is given below:
\[
P(t') = \sum_{j=1}^{m} \frac{w_j^{m-1}}{\prod_{k \neq j}(w_j - w_k)}\, e^{-b\, t'/w_j}. \qquad (4)
\]
Interestingly, with $b = 1/2$, eq. (4) corresponds to the unified P-value of multiplying weighted independent P-values obtained earlier by Good [14]. This can be seen by the following observation. Good defined his variable
\[
\tilde{t} = \prod_{j=1}^{m} p_{(H_i,j)}^{\,w_j},
\]
and the corresponding P-value is given by
\[
P(\tilde{t}) = \sum_{j=1}^{m} \frac{w_j^{m-1}}{\prod_{k \neq j}(w_j - w_k)}\, \tilde{t}^{\,1/w_j}. \qquad (5)
\]
When expressed in the variable $t \equiv -2 \ln \tilde{t}$, we easily see that $\tilde{t}^{\,1/w_j} = e^{-t/(2 w_j)}$, in agreement with eq. (4) when $b = 1/2$. A question that arises naturally when using methods such as the weighted Z-transform test, Good's test, and Lancaster's test is how to obtain the optimal weights $w_j$. This difficult question has been raised before, and it was suggested that the choice of weights may vary from case to case [23]. Existing methods to assign/estimate the weights include, but are not limited to: (1) weight in proportion to the reciprocal of the variance estimated from each study [6], (2) estimate the weights from one's prior belief about a method or feature [24], (3) select weights to stabilize the variance of the combined test statistics [25], and (4) use weights that improve the testing power [26]. Because there is no universal procedure to compute the optimal weights, in this study the weights, when used, were randomly generated and normalized to sum to one (see Table 1).
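Assuming Good's closed form for the weighted product of P-values with pairwise-distinct weights, a direct sketch (the function name is ours) might look like:

```python
import math

def good_combined(pvals, weights):
    """Good's unified P-value for t~ = prod(p_j ** w_j), valid only for
    pairwise-distinct weights (identical weights cause the singularities
    discussed in the text)."""
    m = len(pvals)
    t_tilde = math.prod(p ** w for p, w in zip(pvals, weights))
    total = 0.0
    for j, wj in enumerate(weights):
        # coefficient w_j^{m-1} / prod_{k != j} (w_j - w_k)
        denom = math.prod(wj - wk for k, wk in enumerate(weights) if k != j)
        total += wj ** (m - 1) / denom * t_tilde ** (1.0 / wj)
    return total
```

Note that nearly equal weights make `denom` tiny, illustrating the numerical instability discussed next.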
There are also two apparent problems with Lancaster's eq. (4) and Good's eq. (5). The first problem is that the weights used cannot be identical, otherwise singularities occur [14,15]. Second, if the differences between some of the weights are small, numerical instability can occur [15,17,27]. In order to address the problem of numerical instability associated with identical and almost identical weights, Bhoj [15] suggested an approximation using a linear combination of $m$ gamma density functions (with $t' = -2\sum_{j=1}^{m} w_j \ln(p_{(H_i,j)})$), where $\gamma(a,x) = \int_0^x t^{a-1} e^{-t}\,dt$ is the incomplete gamma function and $\Gamma(a) = \gamma(a,\infty)$ is the gamma function. Although the approximation provided by Bhoj does reduce to Fisher's distribution when the weights are all equal and does not encounter singularities when weights are identical or nearly identical, this approximation does not lead to Good's distribution when the weights are all different. A recent publication [27] has provided an analytical formula that not only is numerically stable when combining P-values with nearly degenerate or identical weights but also correctly reproduces Fisher's and Good's results as limiting cases.

Combining Dependent P-values
In this subsection we summarize two statistical methods, generalizations of Stouffer's test (Z-transform test) and Fisher's test (Chi-square test), that attempt to account for the correlation among the P-values to be combined.
Method 3. Hartung [16] incorporates the correlation among P-values by introducing into the Z-transform test (eq. (1)) the correlation matrix, with elements $\rho_{j\nu}$ computed from the variable pairs $(x_{H_i,j}, x_{H_i,\nu})$, and by defining a new variable
\[
t' = \frac{\sum_{j=1}^{m} w_j\, x_{(H_i,j)}}{\sqrt{\sum_{j=1}^{m} w_j^2 + \left[\left(\sum_{j=1}^{m} w_j\right)^2 - \sum_{j=1}^{m} w_j^2\right] E[\rho]}}, \qquad (9)
\]
with the average correlation
\[
E[\rho] = \frac{2}{m(m-1)} \sum_{j < \nu} \rho_{j\nu} \qquad (7)
\]
and the pairwise correlation
\[
\rho_{j\nu} = \frac{1}{K-1} \sum_{i=1}^{K} \frac{(x_{H_i,j} - \bar{x}_{H\cdot,j})(x_{H_i,\nu} - \bar{x}_{H\cdot,\nu})}{s_j\, s_\nu}, \qquad (8)
\]
where $\bar{x}_{H\cdot,j} = \frac{1}{K}\sum_{i=1}^{K} x_{H_i,j}$, $K$ is the total number of hypotheses tested, and $s_j^2$ is the variance of $x_{H_i,j}$. The P-value for $t'$ is then approximated by the standard Normal distribution, $P = 1 - F_n(t'; 0, 1)$, which nevertheless becomes exact in the two extreme limits of $\rho_{j\nu} = 1\ \forall j,\nu$ and $\rho_{j\nu} = 0\ \forall j \neq \nu$. Although in general the distribution of $t'$ is only approximately normal, it is arguable that ignoring correlation can cause more damage to the combined P-value than the deviations from normality. Applications and extensions of Hartung's idea can also be found in more recent publications [12,28].

Method 4. Following Satterthwaite's procedure [29], there have been some attempts, when combining correlated P-values, to obtain an approximate unified P-value for Fisher's variable (no weights) [18,30] and for Good's variable (unequal weights) [17]. The main idea of Satterthwaite's procedure is to equate the first two moments of the uncharacterized distribution to those of a Chi-squared distribution. Brown [18] and Kost et al. [30] approximate the distribution of Fisher's variable, and Hou [17] the distribution of Good's variable, by a Chi-squared distribution $\mathrm{pdf}_\chi(t'/c; f)$, with $c$ being a scale factor to be determined.
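The core idea behind Hartung's adjustment, normalising the weighted sum by the full correlation-inflated variance rather than by $\sum_j w_j^2$ alone, can be sketched as follows. This is a simplified illustration of that variance adjustment, not Hartung's exact estimator, and the function name is ours:

```python
from statistics import NormalDist

def correlated_z(pvals, weights, corr):
    """Correlation-adjusted weighted Z-transform: for standard-normal x_j with
    correlation matrix corr, Var(sum_j w_j x_j) = sum_{j,v} w_j w_v corr[j][v],
    so we normalise by that quadratic form instead of sum_j w_j^2."""
    nd = NormalDist()
    x = [nd.inv_cdf(1.0 - p) for p in pvals]           # x = F_n^{-1}(1 - p)
    num = sum(w * xi for w, xi in zip(weights, x))
    var = sum(weights[j] * weights[v] * corr[j][v]
              for j in range(len(weights)) for v in range(len(weights)))
    return 1.0 - nd.cdf(num / var ** 0.5)
```

With an identity correlation matrix this reduces to the weighted Z-transform test; with perfect correlation and equal weights it returns the common P-value unchanged, matching the two exact limits mentioned above.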
The expectation value $E[t']$ and the variance $V[t']$ of $t'$, obtained by formal operation, are given respectively by
\[
E[t'] = 2 \sum_{j=1}^{m} w_j, \qquad (10)
\]
\[
V[t'] = 4 \sum_{j=1}^{m} w_j^2 + 2 \sum_{j < \nu} w_j w_\nu\, \mathrm{cov}\!\left(-2\ln(p_{H_i,j}), -2\ln(p_{H_i,\nu})\right). \qquad (11)
\]
On the other hand, the expectation value and variance of $t'$ using $\mathrm{pdf}_\chi(t'/c; f)$ yield
\[
E[t'] = c f, \qquad (12)
\]
and
\[
V[t'] = 2 c^2 f. \qquad (13)
\]
Equating (10) to (12) and (11) to (13) yields
\[
c = \frac{V[t']}{2 E[t']}, \qquad f = \frac{2 E[t']^2}{V[t']}.
\]
The covariance (cov) term used above was first estimated by Brown [18], and recently an improved estimation (through numerically tabulating the covariance as a function of the correlation and then performing polynomial fits) was provided by Kost and McDermott [30]:
\[
\mathrm{cov}\!\left(-2\ln(p_{H_i,j}), -2\ln(p_{H_i,\nu})\right) \approx 3.263\,\rho_{j\nu} + 0.710\,\rho_{j\nu}^2 + 0.027\,\rho_{j\nu}^3,
\]
where $\rho_{j\nu}$ above is the correlation between $\ln(p_{H_i,j})$ and $\ln(p_{H_i,\nu})$.
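The moment-matching recipe above can be sketched end-to-end. This is a stdlib-only illustration (function names are ours), with a series evaluation of the regularized incomplete gamma function standing in for a proper Chi-squared CDF with fractional degrees of freedom:

```python
import math

def _reg_lower_gamma(a, x, terms=200):
    """Regularized lower incomplete gamma, gamma(a, x)/Gamma(a), via its
    power series sum_n x^n / (a (a+1) ... (a+n)) times x^a e^{-x}/Gamma(a)."""
    s, term = 0.0, 1.0 / a
    for n in range(terms):
        s += term
        term *= x / (a + n + 1)
    return s * math.exp(a * math.log(x) - x) / math.gamma(a)

def hou_combined(pvals, weights, corr):
    """Satterthwaite-style matching: fit t' = -2 * sum(w_j ln p_j) to a scaled
    chi^2 pdf_x(t'/c; f) using the Kost-McDermott covariance polynomial."""
    m = len(pvals)
    t = -2.0 * sum(w * math.log(p) for w, p in zip(weights, pvals))
    e_t = 2.0 * sum(weights)                       # E[t'], first moment
    v_t = 4.0 * sum(w * w for w in weights)        # independent part of V[t']
    for j in range(m):
        for v in range(j + 1, m):
            r = corr[j][v]
            cov = 3.263 * r + 0.710 * r ** 2 + 0.027 * r ** 3
            v_t += 2.0 * weights[j] * weights[v] * cov
    c, f = v_t / (2.0 * e_t), 2.0 * e_t ** 2 / v_t
    # chi^2 CDF with f dof at t/c equals the regularized gamma P(f/2, t/(2c))
    return 1.0 - _reg_lower_gamma(f / 2.0, t / (2.0 * c))
```

With independent P-values and equal unit weights, $c = 1$ and $f = 2m$, recovering Fisher's result; positive correlation inflates $V[t']$ and yields a larger (more conservative) combined P-value.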
The P-value for $t'$ is then approximated by that of a Chi-squared distribution:
\[
P = 1 - F_\chi(t'/c; f). \qquad (14)
\]
Equation (14) reduces to Fisher's formula, eq. (3), when the P-values are independent and the weights are all the same. However, the above equation does not reduce to Good's formula, eq. (5), when the P-values are independent and each carries a different weight.

Generating Correlated P-value Vectors
By definition, the P-values of null hypotheses should be uniformly distributed between 0 and 1, which is often assumed by methods of combining P-values. However, the uniformity of P-values, when assigned by available statistical tools to a group of null hypotheses, is often lost. This would handicap the efficacy of methods for combining P-values from the start. To eliminate the effect of nonuniform null P-values from our evaluation, we enforce the quasi-uniformity of null P-values by first constructing a starter P-value vector $\tilde{P}$ of size $K$ with the $i$th element $\tilde{P}(i) = i/K$, for $1 \le i \le K$. (See the next paragraph for more details.) This guarantees an even sample of the P-values (in the range from $1/K$ to 1). To achieve correlations of various strengths, we have used P-value vectors, each of which is obtained via permuting (pairwise) the elements of a fixed vector, the starter vector with a small perturbation, a randomly chosen number of times. The basic idea is that when the number of pairwise permutations is not large, the resulting P-value vectors will be correlated with the fixed vector and will be correlated among one another. It is worth pointing out that this approach does not generate correlations with a prescribed strength: even with the same number of random pairwise permutations of the vector elements, the correlation between any pair of such permuted vectors does not have a fixed strength. We believe this is closer to the real-world scenario than having a fixed correlation strength among the P-value vectors. The value of $K$ should not matter in terms of testing whether a method can provide an accurate combined P-value. If a small $K$ is used, however, the combined P-value will have a large statistical fluctuation that may reduce the resolution of the comparison. On the other hand, making $K$ large causes a long computational time. We find that using $K = 10{,}000$ yields enough separation among the methods tested without significantly slowing down the computation.
For each method investigated, we have performed a simulation of 500,000 realizations, each of which was conducted as follows. First, pick a random positive integer $r$ with $1 \le r \le K/2$. Second, generate the first P-value vector $P_1$ by adding a small random perturbation $(\pm\delta)$ between 0 and $5\times10^{-5}$ to each vector element of $\tilde{P}$: $P_1(i) \leftarrow \tilde{P}(i) \pm \delta_i$. Evidently, by increasing the upper bound for $\delta$, one will produce P-values with larger deviations from an exactly uniform distribution. In the third step, generate more size-$K$ vectors $P_2, P_3, \cdots, P_m$ and initialize them to $P_1$. For each vector generated, its vector elements are pairwise permuted $r$ (chosen at the first step) times. After that, using $-\ln P_j(i)$ in place of $x_{H_i,j}$, the pairwise correlation $\rho_{j\nu}$ was computed using eq. (8), and the average correlation $E[\rho]$ among vectors was computed using eq. (7). This workflow is illustrated in Figure 1, with $\delta(i) \to 0$ for simplicity. The constructed random P-value vectors $P_1, P_2, \cdots, P_m$ were then combined to obtain a unified P-value vector ($F$) using the various methods listed in Table 1. Once the unified P-value vector ($F$) was calculated, its elements were sorted in increasing order, and it was then compared against the rank ($R$) vector, whose elements are obtained by dividing the rank of an $F$ element by $K$, i.e., $R(i) = i/K$ for $i$ ranging from 1 to $K$. We shall call $R(i)$, the $i$th element of the rank vector, the normalized rank of rank $i$.
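The vector-generation steps of one realization might be sketched as follows (the perturbation $\delta$ is omitted for brevity, as in the $\delta(i) \to 0$ illustration of Figure 1; the function name is ours):

```python
import random

def make_correlated_pvectors(K, m, seed=0):
    """One realization of the permutation scheme: start from the quasi-uniform
    starter vector with i-th element i/K, then derive each of the remaining
    m-1 vectors by r random pairwise swaps of the first vector's elements."""
    rng = random.Random(seed)
    r = rng.randint(1, K // 2)                 # number of pairwise permutations
    p1 = [(i + 1) / K for i in range(K)]       # quasi-uniform starter, P1(i) = i/K
    vectors = [p1]
    for _ in range(m - 1):
        pj = list(p1)
        for _ in range(r):                     # r pairwise swaps
            a, b = rng.randrange(K), rng.randrange(K)
            pj[a], pj[b] = pj[b], pj[a]
        vectors.append(pj)
    return vectors
```

Each derived vector contains exactly the same P-values in a shuffled order, so the null uniformity is preserved while the correlation with $P_1$ weakens as $r$ grows.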

Statistical Accuracy Evaluation of the Combined P-value (F)
If a method yields a unified P-value vector $F$ agreeing with $R$, the scatter plot of $F(i)$ versus $R(i)$ should produce a straight line with slope one and intercept zero [31]. It is also important to mention that the smallest computed P-value is expected to be inversely proportional to the sample size, which for the current case is of the order of $10^{-4}$. An example of a logarithmic plot of $F$ versus $R$ generated from a single iteration of our simulation is shown in Figure 2. Using the textbook definition of P-value, the linear slope obtained from the logarithmic plot of $R$ versus $F$ should be approximately one for methods with accurate statistics. To quantify how well $F$ agrees with $R$, we use four measures: (1) the average weighted sum of squares error (AWSE), (2) the distance ($D$) between $F$ and $R$, (3) the expected rank $E[R(F_c)]$, and (4) the expected error of $F(i)$. Figure 2 also illustrates what is being computed by the above four measures.
The average weighted sum of squares error is
\[
\mathrm{AWSE} = \frac{1}{K} \sum_{i=1}^{K} \frac{1}{R(i)} \left[\ln F(i) - \ln R(i)\right]^2. \qquad (15)
\]
The weight factor $w = 1/R(i)$ in the above equation was chosen so that each point in the transformed variable domain carries the same contribution to the AWSE. By construction, the P-values in the rank vector $R$ are uniformly distributed between $[10^{-4}, 1]$. However, once we make the logarithmic transformation $y_i = -\ln(R(i))$, we find the new variable $y$ to be exponentially distributed, i.e., $\mathrm{pdf}(y) = e^{-y}$. One may thus introduce $w(y)$, a weight factor making $w(y)\,\mathrm{pdf}(y) = 1$, to compensate for the non-uniformity in $y$. This leads to $w(y_i) = e^{y_i} = 1/R(i)$, the weight factor used in eq. (15).
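Assuming the AWSE takes the weighted mean-squared form described above (the equation itself was reconstructed from the weight-factor discussion), a small sketch of the measure is:

```python
import math

def awse(F, R):
    """Average weighted sum-of-squares error between log combined P-values and
    log normalized ranks, with weight 1/R(i) compensating the exponential
    density of y = -ln R(i)."""
    return sum((math.log(f) - math.log(r)) ** 2 / r
               for f, r in zip(F, R)) / len(R)
```

A perfectly calibrated method ($F = R$) gives AWSE $= 0$; exaggerated P-values inflate the small-$R$ terms most strongly because of the $1/R(i)$ weight.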
Angular Distance Between F and R. To compute the distance between $F$ and $R$, we began by first computing the slope $b$ of the logarithmic plot of $R$ versus $F$ using a weighted least-squares regression, which aims to minimize the weighted sum of squares error (WSE)
\[
\mathrm{WSE} = \sum_{i=1}^{K} \frac{1}{R(i)} \left[\ln R(i) - a - b \ln F(i)\right]^2.
\]
Taking the derivatives of the above expression with respect to $a$ and $b$ and setting them equal to zero gives two equations. Solving the two equations simultaneously for $a$ and $b$ gives
\[
b = \frac{\sum_i w_i \left(\ln F(i) - M_F\right)\left(\ln R(i) - M_R\right)}{\sum_i w_i \left(\ln F(i) - M_F\right)^2}, \qquad a = M_R - b\, M_F,
\]
where $M_R$ and $M_F$ are the weighted averages of $\ln(R)$ and $\ln(F)$, respectively. A normalized vector $V_F = (1, b)/\sqrt{1+b^2}$ was computed using the points $(0,a)$ and $(1, b+a)$ along the regression line. Similarly, another normalized vector $V_R = (\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$ was obtained using the points $(0,0)$ and $(1,1)$ along the ideal line. Finally, the (angular) distance between the two unit vectors was computed:
\[
D = \arccos\left(V_F \cdot V_R\right). \qquad (16)
\]
Methods with accurate statistics are expected to have $b = 1$ and $a = 0$. Evidently, $b = 1$ leads to $D = 0$ (see eq. (16)). The independence of the angular distance $D$ from the intercept parameter $a$ implies that $D$ only measures the relative accuracy of the P-value, not the absolute accuracy.

[Figure 2. (A) Scatter plot (red circles) of normalized rank versus computed P-value from a randomly picked iteration (realization) of very weak average correlation; curves like this enable calculating the average weighted sum of squares error using eq. (15) and the distance measure using eq. (16). (B) 1000 curves, each obtained from the same task as in (A) but with different average correlation strengths; the lines that go significantly above the $y = x$ line come from cases with stronger average correlations and yield much exaggerated unified P-values, perhaps because Stouffer's method does not account for correlations. Averaging the normalized rank $R$ along the blue line ($F = F_c$) yields the value $E[R(F_c)]$ (see eq. (17)).]
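Assuming the weighted regression and angular definitions sketched above, the distance measure can be computed as follows (the function name is ours):

```python
import math

def angular_distance(F, R):
    """Weighted least-squares fit of ln R = a + b ln F with weights 1/R(i),
    then the angle between the fitted line's unit vector (1, b)/sqrt(1+b^2)
    and the ideal line's (1, 1)/sqrt(2); b = 1 gives D = 0 regardless of a."""
    w = [1.0 / r for r in R]
    x = [math.log(f) for f in F]
    y = [math.log(r) for r in R]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw     # weighted mean of ln F
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw     # weighted mean of ln R
    b = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    n = math.hypot(1.0, b)
    dot = (1.0 / n) * (1.0 / math.sqrt(2.0)) + (b / n) * (1.0 / math.sqrt(2.0))
    return math.acos(min(1.0, dot))                    # guard float rounding
```

Since only the slope enters, a constant multiplicative bias $F(i) = \eta R(i)$ leaves $D = 0$, as noted in the text.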
For example, if F(i)~gR(i), even when the positive constant g is different from 1, D is still zero.

Expected Rank. For iteration $1 \le a \le N$ ($N = 500{,}000$), we denote by $R_a(F_c)$ the largest normalized rank whose corresponding reported P-value is less than or equal to a selected cutoff P-value $F_c$. The expected rank $E[R(F_c)]$ is computed by averaging $R_a$ over all realizations and can be written as
\[
E[R(F_c)] = \frac{1}{N} \sum_{a=1}^{N} R_a(F_c). \qquad (17)
\]
In the ideal case of absolute accuracy, $R_a(F_c) = F_c$. In reality, this is hardly the case, and that is why we use the expectation value of $R_a(F_c)$ versus $F_c$ as the measure. For methods with accurate statistics, a plot of $E[R(F_c)]$ versus $F_c$ should trace closely the line $y = x$.
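A minimal sketch of this measure over a set of realizations (the function name is ours; each realization's $F$ is assumed sorted in increasing order, as in the simulation):

```python
def expected_rank(F_runs, R, Fc):
    """E[R(Fc)]: for each realization, take the largest normalized rank whose
    reported P-value is <= Fc, then average over all realizations."""
    total = 0.0
    for F in F_runs:
        ranks = [r for f, r in zip(F, R) if f <= Fc]
        total += max(ranks) if ranks else 0.0
    return total / len(F_runs)
```

A method that exaggerates significance reports small $F$ for many entries, so the largest qualifying rank exceeds $F_c$ and the curve rises above $y = x$.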
Expected Error of F(i). The expected error of $F(i)$ relative to $R(i) = i/K$ (for $1 \le i \le K$) is defined as
\[
E\!\left[\ln\frac{R(i)}{F(i)}\right] = \frac{1}{N} \sum_{a=1}^{N} \ln\frac{R(i)}{F_a(i)},
\]
and the standard deviation as
\[
\sigma(i) = \sqrt{\frac{1}{N} \sum_{a=1}^{N} \left[\ln\frac{R(i)}{F_a(i)} - E\!\left[\ln\frac{R(i)}{F(i)}\right]\right]^2}.
\]
For methods with accurate statistics, plotting $E[\ln(R(i)/F(i))]$ versus $R$ should track the line $y = 0$ well and have small standard deviations for various $R(i)$.
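A sketch of these last two measures over a collection of realizations (the function name is ours):

```python
import math

def expected_log_error(F_runs, R):
    """Per-rank mean and standard deviation of ln(R(i)/F(i)) across the N
    realizations in F_runs; accurate methods give means near zero."""
    N = len(F_runs)
    means, sds = [], []
    for i in range(len(R)):
        errs = [math.log(R[i] / F[i]) for F in F_runs]
        mu = sum(errs) / N
        means.append(mu)
        sds.append((sum((e - mu) ** 2 for e in errs) / N) ** 0.5)
    return means, sds
```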

Results and Discussion
The four measures mentioned in the Methods section are used to evaluate the accuracy of the computed unified P-value. In Figures 3, 4, 5 and 6, we show the results of combining a list of 12 P-values. The layout of each of these figures is identical. For each method considered, our simulation includes a total of $N = 500{,}000$ iterations. At each iteration, we generated $K$ lists, within which the $i$th list is obtained by taking the $i$th entry of each of the 12 P-value vectors, $(P_1, P_2, \cdots, P_{12})$. By computing the pairwise correlation (see eq. (8)) among the P-value vectors, one obtains the average pairwise correlation $E[\rho]$ given by eq. (7). Each iteration, generating a 12-tuple of P-value vectors, thus yields an average correlation $E[\rho]$.
For Figures 3, 4, 5 and 6, the data points in panels A and B respectively display the expected average weighted sums of squares errors ($E[\mathrm{AWSE}]$) and expected distances ($E[D]$) versus $E[\rho]$. More specifically, every data point plotted with x-axis value $r_k = 0.025 + 0.05(k-1)$ represents an average over 25,000 iterations, each of which has its 12-tuple's average correlation $E[\rho]$ fall in the range $r_k \pm 0.025$. For panels C, D, E and F, each data point plotted is computed using all $N$ iterations from our simulation. The curves in panel C show the expected number of events with computed unified P-value less than or equal to a cutoff value $F_c$. For methods with accurate statistics, by the definition of P-value, a plot of $E[R(F_c)]$ versus $F_c$ should follow the line $y = x$. Panels D and E (and F for Figures 3 and 4) display the expected $F$ value together with its standard deviation as a function of $R$. Similar plots for the combination of 4 and 8 P-value vectors can be found in File S1. Figure 3 displays the results for methods that assume the P-values to be combined are independent: Fisher's (eq. 3), Stouffer's (eq. 1) and Bhoj's (eq. 6) methods. These methods are expected to compute accurate combined P-values for $E[\rho] \approx 0$, corresponding to the first few data points of panels A and B. The data points in panels A and B show that as $E[\rho]$ increases so do $E[\mathrm{AWSE}]$ and $E[D]$, indicating the methods' inadequacy for handling correlation among P-values. All three curves in panel C lie above the $y = x$ line, indicating that all three methods exaggerate significance when combining correlated P-values. The curves in panels D, E and F show that the average value (red solid curve) of $\ln(F/R)$ can deviate significantly from the $y = 0$ axis with wild fluctuations (error bars shown in blue). Also, a comparison of the plots obtained from combining 4, 8, and 12 P-value vectors indicates that the accuracy of the unified P-value decreases as the number of P-values combined increases from 4 to 12.
Figure 4 shows the results for methods that combine weighted independent P-values: Good's (eq. 5), Lipták's (eq. 1) and Bhoj's (eq. 6) methods. These three methods may be viewed as extensions of the previous three methods with P-value weighting enabled. Comparison of the panels of Figure 4 with those of Figure 3 shows a noticeable improvement in the accuracy of the combined P-values. Although the accuracy has improved by weighting the P-values, the computed P-value still differs significantly from the expected value. The observed improvement suggests that weighting P-values might weaken the effect of correlation by promoting one P-value over the rest in the list of P-values to be combined. Other studies have also recommended [32,33] weighting P-values to improve statistical power. Even though weighting P-values is recommended, there exists no consensus on how to determine the optimal weights [6,[24][25][26]. This is why in our simulation we have assigned random weights to the P-values to be combined. In principle, the accuracy of the computed P-value from the three methods above could be improved by using a different procedure to compute the weights. Such an investigation, although worth pursuing in its own right, is beyond the scope of the current study. Figure 5 shows the results from using methods designed to combine correlated P-values: Hartung's (eq. 9) and Hou's (eq. 14) methods. The curves in Figure 5, when compared with the curves of Figures 3 and 4, show a significant improvement in the accuracy of the computed combined P-value. From the curves of Figure 5, Hou's method seems to be the better-performing one; it has a smaller expected error and standard deviation when compared with the curves obtained from Hartung's method. As shown in panel C of Fig. 6, Hou's $E[R(F_c)]$ vs $F_c$ curve also traces reasonably well the line $y = x$, deviating from it only by a factor of about 4.0 for $F \le 0.1$.
Finally, in Figure 6 we have the evaluation results of methods that combine weighted correlated P-values: Hartung's (eq. 9) and Hou's (eq. 14) methods. When the curves of Figure 6 are compared with those of Figure 5, as before, it shows that weighting P-values tends to improve the accuracy of the computed P-value. The curves also show that Hou's method gains a larger improvement in accuracy from using weights than Hartung's method does. As articulated earlier and supported by the observed results, there is a possibility that the accuracy of the combined P-value could be further improved by having a statistically and mathematically rigorous procedure that could render the optimal weights to be used.
In brief summary, methods designed for combining independent P-values tend to yield exaggerated P-values when used to combine correlated P-values. On the other hand, most methods designed to handle correlated P-values tend to provide conservative estimates for the unified P-values. The first case can be understood easily, since one is effectively using nearly identical pieces of evidence to corroborate one another. For the latter case, however, we cannot provide an intuitive interpretation except that it might result from the heuristics those methods employ. Weighting P-values seems to weaken the effect of correlation. This can be roughly understood as follows. By weighting each of the $m$ P-values, only the P-values assigned the highest weights play a role. This increases the likelihood of having the highest-weighted P-values be nearly independent, thereby reducing the effect of correlations. Not only does it help the methods designed for combining independent P-values, it also helps the ones for combining correlated P-values, as most of these methods are heuristic-based and produce more accurate results when the correlation is weaker. Based on these results, when the lists of the P-value vectors are complete, it is best to calculate the corresponding pairwise correlations between any two P-value vectors, introduce weights, and then assign the final unified statistical significance to each hypothesis.
In real applications, however, one is often faced with incomplete lists of P-values. That is, one only has the P-values for the highest-ranking hypotheses, not for all hypotheses tested. This prevents one from computing the correlations needed for the formalism for combining correlated P-values. In this case, i.e., when combining P-values of unknown correlation, one should exercise caution. Absent the correlation information, a better option might be to use the smallest of the P-values to be combined and then apply the Bonferroni correction by multiplying the smallest P-value by $m$, the number of P-values to be combined. This guarantees a conservative statistic. However, under this approach, one might run into cases where the smallest P-value considered is larger than $1/m$, thereby obtaining a corrected P-value that is larger than 1. Even if each of the P-value lists is complete, there are still scenarios not covered in this paper. For example, it is possible that higher-order correlations (such as three-body or four-body) exist among the P-value vectors. We did not consider these cases since we are not aware of any readily available methods designed to deal with such higher-order correlations.
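The Bonferroni fallback described above is a one-liner; here is a sketch that also caps the result at 1 to avoid the greater-than-one artifact just mentioned (the cap is our choice of convention, not from the paper):

```python
def bonferroni_combined(pvals):
    """Conservative fallback when correlations are unknown: m times the
    smallest P-value, capped at 1."""
    return min(1.0, len(pvals) * min(pvals))
```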
In conclusion, our study recommends that the unified P-value obtained from combining P-values of unknown correlation be used with caution, to prevent drawing false conclusions. Results from our study agree with previous investigations [6,8,10], supporting the hypothesis that weighting P-values has the potential to improve the accuracy of the combined P-value. However, the important issues of choosing the weights to optimize a method's power and of estimating the correlation matrix elements among P-values from small sample sizes remain challenging [34,35]. Our results also show that when combining independent or weighted independent P-values, Bhoj's method produces more accurate P-values than the other methods tested. In the case when the correlation information is available, among the methods investigated, Hou's method, able to accommodate P-value weighting, seems to be the best-performing method.

Supporting Information
File S1 This pdf file contains eight figures showing P-value accuracy evaluation of methods considered in this manuscript when combining 4 and 8 P-value vectors. (PDF)