Combining Independent, Weighted P-Values: Achieving Computational Stability by a Systematic Expansion with Controllable Accuracy

Given the expanding availability of scientific data and tools to analyze them, combining different assessments of the same piece of information has become increasingly important for social, biological, and even physical sciences. This task demands, to begin with, a method-independent standard, such as the -value, that can be used to assess the reliability of a piece of information. Good's formula and Fisher's method combine independent -values with respectively unequal and equal weights. Both approaches may be regarded as limiting instances of a general case of combining -values from groups; -values within each group are weighted equally, while weight varies by group. When some of the weights become nearly degenerate, as cautioned by Good, numeric instability occurs in computation of the combined -values. We deal explicitly with this difficulty by deriving a controlled expansion, in powers of differences in inverse weights, that provides both accurate statistics and stable numerics. We illustrate the utility of this systematic approach with a few examples. In addition, we also provide here an alternative derivation for the probability distribution function of the general case and show how the analytic formula obtained reduces to both Good's and Fisher's methods as special cases. A C++ program, which computes the combined -values with equal numerical stability regardless of whether weights are (nearly) degenerate or not, is available for download at our group website http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/CoinedPValues.html.


Introduction
Forming a single statistical significance out of multiple independent tests has been an important procedure in many scientific disciplines, including social psychology [1,2], medical research [3], genetics [4], proteomics [5], genomics [6], bioinformatics [7,8] and so on. Among the best known approaches are Fisher's method [9] and Good's formula [10]. To form a single significance assignment out of L independent tail-area probabilities, Fisher's method combines these L probabilities democratically while Good's formula weights every probability differently. Being able to weight more on better trusted P-values, Good's formula is versatile. Nevertheless, it suffers from numerical instabilities when weights are nearly degenerate [10]. This paper provides an analytic formula (see eq. (33)) to properly handle nearly degenerate weights. Employing complex variable theory, we have derived this controlled expansion, in powers of differences in inverse weights, that affords for the first time both accurate statistics and stable numerics.
In addition to the scenarios covered by Fisher's method and Good's formula, one may foresee the occurrence of the following general case (GC): independent P-values are categorized into groups within each of which P-values have the same weight, while weight varies by group. The criterion for grouping can be very general, ranging from previously known attributes to differences in experimental protocols. As an example, one may wish to group data and their associated P-values by type of experimental instruments and assign each group a different weight. When there is only one instrument type, the GC reduces to Fisher's consideration. When there exist no replicates within each instrument type, the GC coincides with the consideration of Good. In [10], Good also mentioned the possibility of obtaining an analytic expression for the GC, but did not provide it. Since Good's formula [10] contains, in the denominator, pairwise differences between weights, he cautiously remarked that his formula may become ill-conditioned when weights of similar magnitudes exist and thus calculations should be done by holding more decimal places. This statement has been paraphrased by numerous authors [11][12][13][14][15], and many of them have tried to seek numerically stable alternatives at the expense of using uncontrolled approximations. However, what remained elusive was a proper procedure that both provides accurate statistics and deals with nearly degenerate weights in a numerically stable manner.
The main result of this paper is an explicit formula (eq. (33)) that can properly handle nearly degenerate weights for the GC, including Good's formula of course. This derived, controlled expansion, in powers of differences in inverse weights, affords for the first time both accurate statistics and stable numerics. Employing a complex variable integral formulation, we also provide a novel derivation of the distribution function for the GC and thus become the first, in the context of combining P-values, to make available an analytic formula for the probability distribution function for the GC.
In the statistics community, attempts to obtain an overall significance level for the results of independent runs of studies date back to the 1930s [9,[16][17][18], if not earlier. Nevertheless, one should note that the mathematical underpinnings of combining Pvalues also appear in other areas of research. For example, the equivalent of Good's formula had emerged in 1910 in the context of sequential radioactive decay [19], while the first analytic expression for Fisher's combined P-value had emerged in 1960 as a special case of the former when all the decay constants are identical [20]. After Good's work [10], Good's formula was rederived by McGill and Gibbon [21], and later on by Likes [22]. As for the GC, Fisher's method included, the mathematical equivalents appear in different areas of studies mainly under the consideration of sum of exponential/gamma variables. The distribution functions of linear combinations of exponential/ gamma variables are useful in various fields. When limited to exponential variables, it results in the Erlang distribution that is often encountered in queuing theory [23]. It is also connected to the renewal theory [24] and time series problem [25], and it can be applied to model reliability [26]. The intimate connections between these seemingly different problems are not obvious at first glance. Consequently, it is not surprising that the distribution function of the GC has been rediscovered/rederived many times and that some information about it has not been widely circulated. Our literature searches show that the first explicit result (without further derivatives involved) for the distribution function for the GC was obtained by Mathai [27]. Subsequently, motivated by different contexts, Harrison [28], Amari and Mirsa [29], and Jasiulewicz and Kordecki [26] all rederived the same distribution function.
There also exist numerical approaches for combining independent P-values. These typically involve inverting cumulative distribution functions. For example, Stouffer's z-methods [1], whether unweighted [30] or weighted [31,32], require inverting the error function. Lancaster's generalization [33,34] of Fisher's formalism also requires inverting gamma distribution function to incorporate unequal weighting for P-values combined. Since our main focus is on analytic approaches, we shall refrain from delving into any numerical method.
In the Methods section, we will first summarize Fisher's and Good's methods for combining P-values, then present the mathematical definition of the GC. In the Results section, the subsection headed by ''Derivation of Prob(Qƒt)'' is devoted to the derivation of the probability distribution function and cumulative probability for the GC. Since both Fisher's and Good's considerations arise as special limiting cases of the GC, we also illustrate there that our cumulative probability distribution for the GC indeed reduces to the appropriate limiting formulas upon taking appropriate parameters. In the subsection headed by ''Accommodation of arbitrary weights'', we delve into our main innovative part -taming the instability caused by nearly degenerate weights -and provide a formula with controllable accuracy for combining P-values. A few examples of using the main results are then provided in the Example subsection. This paper then concludes with the Discussion section. A C++ program CoinedPValues, which combines independent weighted P-values with equal numerical stability regardless of whether weights are (nearly) degenerate, is available for download at our group website: http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/ CoinedPValues.html.

Summary of Fisher's and Good's methods for combining P-values
Assume that a piece of information is assessed by L independent tests, each yielding a P-value. Each P-value obtained is between zero and one since, by definition, it is the probability for the experimental outcome to arise from the null model. Prior to combining these L independent P-values (p 1 ,p 2 , . . . ,p L ) to form a single significance level, we note the following. Although for any null model P-value must distribute uniformly over (0,1, the L Pvalues obtained need not have their average close to 0:5. This is especially the case when the piece of information we are evaluating is not well described by the null model(s) considered.
For later convenience, let us define where w i w0 is the weight associated with the ith P-value. To form a unified significance, Fisher and Good considered respectively the stochastic quantities Q F and Q G , defined by where each x i represents a random variable drawn from an uniform, independent distribution over (0,1. The following probabilities provide the unified statistical significances, corresponding respectively to Fisher's and Good's considerations, from combining L independent P-values. In eq. (6), the prefactor L l is given by Apparently, L l is ill-defined when the weight w l coincides with or is numerically close to any other weights w k . Although Fisher did not derive (5), from this point on, we shall refer to (5) as Fisher's formula and (6) as Good's formula.

General case including Fisher's and Good's formulas
Let us divide the L independent P-values into m groups with 1ƒmƒL. Within each group k, we weight the n k P-values equally; while P-values in different groups are weighted differently. Therefore, when m~L and n k~1 Vk, we have the Good's case; when m~1 and n 1~L , we reach Fisher's case. We will hence define the following quantities of interest where each x k;j represents again a random variable drawn from an uniform, independent distribution over (0,1. The quantity of interest Prob(Qƒt), if obtained, should cover results of both Fisher and Good as the limiting cases. In the next section, we will start by deriving an exact expression for Prob(Qƒt) and describing how to recover the results of Fisher and Good.

Derivation of Prob(Qƒt)
Let F (t):Prob(Qƒt), we may then write where h(x) is the heaviside step function, taking value 1 when xw0 and value 0 when xv0. Upon taking a derivative with respect to t, we obtain where d(x) is Dirac's delta function that takes value 0 everywhere except at x~0 and that Vaw0, To proceed, let us make the following change of variables we may then rewrite (11) as :e t f e (t) : Note that f e (t) is exactly the probability density function of a weighted, linear sum of exponential variables.
By introducing the integral representation of the d function we may re-express (12) as where r k :1=w k is introduced for the ease of analytic manipulation andf f is introduced for later convenience. Since all w k w0, implying that all r k w0, the poles of the integrand in (14) lie completely at the lower half of the q-plane. Consequently, the integral of q may be extended to enclose the lower half q-plane to result in Comparing eq. (16) with eqs. (12) and (13), we see that the right hand side of (16) is composed of the product of the factor e t and f e (t) of eq. (13). In fact, the explicit expression for f e (t), in addition to the new derivation presented here in eq. (16), was derived much earlier [27] under different context and was rediscovered/ rederived multiple times [26,28,29] by different means. Its connection to combining P-values, however, was never made explicit until now.
From (10), we know that F (t~0)~0, implying that where the function H is defined as Eq. (17) represents the most general formula that interpolates the scenarios considered by both Fisher and Good. Let us take the limiting cases from (17). For Fisher's formula, one weights every P-value equally, and thus corresponds to m~1 and n 1~L . The constraint in the sum of (17) forces g 1~n1 {1~L{1. Consequently, we have (by calling r 1 by r for simplicity) Notice that regardless whatever the weight w one assigns to all the P-values, the final answer is independent of the weight. This is because t~{ ln t~{w ln t F~( { ln t F )=r and therefore rt~ln (1=t F ). This results in exactly what one anticipates from (5). To obtain the results of Good, one simply makes m~L and n k~1 Vk, implying all g i~0 . In this case, (17) becomes (with r l~1 =w l , e {t~t G and H(r l t,0)~e {r l t~t G 1=w l ) reproducing exactly (6). One may also re-express eq. (17) in a slightly different form Note that in the expression (22), we have isolated an overall multiplying factor and have kept explicit the n 1ƒkƒm dependence for later convenience. As cautioned by Good [10] regarding Good's formula, the products of the inverse weight differences in eq. (22) may cause numerical instability in computing the combined P-values when some of the inverse weights become nearly degenerate. To see this point, let us consider varying r j from a bit smaller than r k to a bit larger than r k . Although the change of weight 1=r j is infinitesimal, some terms in (22) do change abruptly. We will provide some numerical examples in the Example subsection.

Accommodations of arbitrary weights
In our derivation of (20) in the previous subsection, it is explicitly shown that the final P-value obtained is independent of the weight w that was assigned to all the individual P-values, p 1 ,p 2 , . . . ,p L . It is thus natural to ask, if one starts by weighing each P-value differently, upon making the weights close to one another, will one recover Fisher's formula (5) from Good's formula (6) in the limit of degenerate weights? By continuity, the answer is expected to be affirmative. In the broader context of the GC, one would like to have a formal protocol to compute the combined Pvalue when some of the weights become (nearly) degenerate.
In this subsection, we first illustrate the transition from Good's formula to Fisher's formula by combining two P-values with almost degenerate weights. We will then provide a general protocol to deal explicitly with the numerical instability caused by nearly degenerate weights. Possible occurrences of this instability were first cautioned by Good [10] and subsequently by many authors [13][14][15].
Let us consider combining p 1 and p 2 with weights w 1 and w 2 using Good's formula. One has Without loss of generality, one assumes w 1 ww 2 and hence writes w 1 =w 2~1 zE with Ew0. We are interested in the case when the weights get close to each other, or when E?0. We now rewrite eq. (23) as In the limit of small [, we may rewrite (24) as Note that when the small weight difference w 1 {w 2 is near the machine precision of a digital computer, using formula (6) directly will inevitably introduce numerical instability caused by rounding errors.
To construct a protocol to deal with nearly degenerate weights, one first observes from eqs. (14)(15)(16)(17)(18)(19)(20)(21)(22) that it is the inverse weights r k :1=w k that permeate the derivation of the unified P-value. The closeness between weights is thus naturally defined by closeness in the inverse weights. As shown in eqs. (2) and (6), the combined Pvalue yielded by Good's formula depends only on the pairwise ratios of the weights. Making the observation that r k t in eq. (17) only depends on the ratios r k =r j=k , one deduces that for the GC the combined P-values (see (17)) also depend only on the ratios of weights, not the individual weights. We are thus free to choose any scale we wish. For simplicity, we normalize the inverse weight associated with each method by demanding the sum of inverse weights equal the total number of methods where 1=r j represents the weight associated method j and L represents the total number of P-values (or methods) to be combined. For the GC described in the Methods section, L~X m k~1 n k . This normalization choice makes the average inverse weight of participating methods be 1.
The next step is to determine, for a given list of inverse weights and the radius for clustering, the number of clusters needed. This task may be achieved in a hierarchical manner. After normalizing the inverse weights r k using eq. (26), one may sort the inverse weights in either ascending or descending order. For a given radius gw0, one starts to seek the pair of inverse weights that are closest but not identical, and check if their difference is smaller than the radius g. If yes, one will merge that pair of inverse weights by using their average, weighted by number of occurrences, as the new center and continue the process until every inverse weight in the list is separated by a distance farther than g. We use an example of L~8 to illustrate the idea. Let the normalized inverse weights where the number 2 inside the pair of parentheses after 0:70 simply indicates that there are two identical inverse weights 0:70 to start with. Assume that one chooses g, the radius for clustering, to be 0:005. Since every pair of adjacent inverse weights are separated by more than 0:005, no further clustering procedures is needed and one ends up having seven effective clusters: one cluster with two identical inverse weights 0:70, and six singletons. This corresponds to m~7, n 1~1 , n 2~2 , n 3~n4~Á Á Á~n 7~1 . Suppose one chooses the clustering radius g to be 0:05. In the first step, we identify that 0:70 and 0:71 are the closest pair of inverse weights. The weighted average between them is indicating that we have m~4 ( four clusters), with number of members being n 1~1 , n 2~4 , n 3~1 and n 4~2 . The centers of the four clusters are specified by the averaged inverse weights: 0:50,0:7125,1:03,1:81. This is a good place for us to introduce some notation. We shall denote by r k zg k;j the jth inverse weights of cluster k, whose averaged inverse weight is r k . With this definition, for the example above, we have g 1;1~0 , g 2;1~g2;2~{ 0:0125, g 2;3~{ 0:0025, g 2;4~0 :0275, g 3;1~0 , g 4;1~{ 0:01, and g 4;2~0 :01.
Using the hierarchical protocol mentioned above, the number of clusters m, the center r k of the inverse weight of cluster k, and the numbers of members n k of cluster k are all obtained along with fg k;j g once and for all. The g k;j , as will be shown later, constitute the key expansion parameters that yield, upon multiplying bỹ F F (t; n l ) with different fn l g, the higher order terms in our key result. We show below how this is done.
Following the derivation in the previous subsection, we obtain a probability density function very similar to (14) f (t)P From the preceding subsection, we see that the ill-conditioned situations emerge when some weights are nearly degenerate and the source of difference in inverse weights comes from obtaining F F (t; n 1 ,n 2 , . . . ,n m ) in (22) fromf f (t; n 1 ,n 2 , . . . ,n m ) in (15). Therefore, one may leave the prefactor P m l~1 P n l j~1 (r l zg l;j ) h i untouched and focus on the rest of the right hand side of eq. (27).
To proceed, we write Consequently, we may write where Y k;g : The product in eq. (27) may now be formally written as We now note a simplification by choosing r k to be the average inverse weight of the kth cluster. In this case, we have X n k j~1 g k;j~0 Vk. That is, Y k;1~0 always. This allows us to write eq. (30) as The key idea here is to Taylor expand the exponential and collect terms of equal number of 1=(qzir). Evidently, the first correction term starts with 1=(qzir) 2 . Furthermore, before the 1=(qzir) 4 order, there is no mixing between different clusters. Below, we rewrite eq. (27) to include the first few orders of correction terms This immediately leads to Note that when the clustering radius g is chosen to be zero, the only clusters are from groups of identical weights, and all g k;j must be zero. In this case, only the first term on the right hand side of (33) exists and the result derived in the previous subsection is recovered exactly. Since allF F are finite positive quantities, the errors resulting from truncating the expression in eq. (33) at certain order of g can be easily bounded. Therefore, any desired precision may be obtained via including the corresponding number of higher order terms. As the main result of the current paper, our expansion provides a systematic, numerically stable method to achieve desired accuracy in computing combined P-values.

Examples
Example (a). This example provides a numerical work flow to compute theF F(t; fn l g) function present in eq. (22). Assuming m~4, we show below how to open up the sum in eq. (22). The constraint P i g i~nk {1 implies that one only has m{1 (~3 here) independent g i s. Once the (m{1) g i s are specified, the remaining one is determined. To simplify the exposition, let us Note that in eq. (33), in the zeroth order term, the argument n l ofF F represents the number of members associated with cluster l. However, for higher order correction terms, the n l s enteringF F no longer carry the same meaning. Therefore, in the example shown here, one should not assume that n j is the number of methods associated with cluster j.
Example (b). This example illustrates the possibility of numerical instability associated with eqs. (6) and (22) when they are used to combine P-values with nearly equal weights. This instability arises from adding numbers with nearly identical magnitude but different signs, yielding a value containing few or no significant figures. We also show how such instabilities are resolved by using eq. (33). Consider the case of combining five Pvalues, {0.008000257, 0.008579261, 0.0008911761, 0.006967988, 0.004973110}, weighted respectively by {0.54531152, 0.54532057, 0.54531221, 0.54531399, 0.54531776}. Using eq. (2), one obtains t G~4 :30656196|10 {7 . The combined P-value is then obtained as the probability of attaining a random variable Q G , defined in eq. (4), such that it is less than or equal to t G .
Combining P-values using eq. (6) gives Example (c). One natural question to ask is how well does eq. (33) work when one chooses a larger clustering radius and group weights that are clearly distinguishable into a few clusters? To consider this case, let us use the five P-values from example (b) but with weights chosen differently. Let us assume that the inverse weights (r k :1=w k ) associated with these five P-values are f0:6,0:65,1:2,1:25,1:3g. For this case, t~t G~1 :935663| 10 {13 . Combining P-value using formulas (6)  When one uses g~0:1 as the clustering radius, one obtains two clusters: one with average inverse weight 0:625 and the other with average inverse weight 1:25. If one then uses eq. (33) to combine P-values, one attains the following results which contains no sign alternation and agrees well with the results from both (6) and (22). This illustrates the robustness of eq. (33) in combining P-values. Note that the third term on the right hand side of (35) is zero. This is because the multiplying factor Y k;3 is zero for both clusters. In general, Y k;3 measures the skewness of inverse weights associated with cluster k and for our case here both clusters of inverse weights are perfectly symmetrical with respect to their centers, leading to zero skewness. If the inverse weights of cluster k distribute perfectly symmetrically with respect to its center, it is evident from eq. (29) that Y k;g~0 for odd g.
Evidently if one chooses a large clustering radius g and then uses eq. (33) to combine P-values, many higher order terms in the expansion will be required to achieve high accuracy in the final combined P-value.

Discussion
Although the expression (17) provides access to exact statistics for a broader domain of problems and our expansion formula (33) provides accurate and stable statistics even when nearly degenerate weights are present, there remain a few unanswered questions that should be addressed by the community in the near future. For example, even though we can accommodate any reasonable Pvalue weighting, thanks to (33), the more difficult question is how does one choose the right set of weights when combining statistical significance [35][36][37][38][39]. The weights chosen should reflect how much one wishes to trust various obtained P-values. Ideally, a fully systematic method should also provide a metric for choosing appropriate weights. How to obtain the best set of weights remains an open problem and definitely deserves further investigations.
Another limitation of (17) and (33), and consequently of Fisher's and Good's formulas, is that one must assume the P-values to be combined are independent. In real applications, it is foreseeable that P-values reported by various methods may exhibit nonnegligible correlations. How to obtain the correlation [40][41][42] and how to properly incorporate P-value correlations [15,43,44] while combining P-values are challenging problems that we hope to address in the near future.