
The aggregation paradox for statistical rankings and nonparametric tests

Abstract

The relationship between social choice aggregation rules and non-parametric statistical tests has been established for several cases. An outstanding, general question at this intersection is whether there exists a non-parametric test that is consistent upon aggregation of data sets (not subject to Yule-Simpson Aggregation Paradox reversals for any ordinal data). Inconsistency has been shown for several non-parametric tests, where the property bears fundamentally upon robustness (ambiguity) of non-parametric test (social choice) results. Using the binomial(n, p = 0.5) random variable CDF, we prove that aggregation of r(≥2) constituent data sets—each rendering a qualitatively-equivalent sign test for matched pairs result—reinforces and strengthens constituent results (sign test consistency). Further, we prove that magnitude of sign test consistency strengthens in significance-level of constituent results (strong-form consistency). We then find preliminary evidence that sign test consistency is preserved for a generalized form of aggregation. Application data illustrate (in)consistency in non-parametric settings, and links with information aggregation mechanisms (as well as paradoxes thereof) are discussed.

Introduction

The relationship between social choice aggregation rules and non-parametric statistical analysis is well-established [see, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9]. In a related literature, fundamental relationships between discrete choice and non-parametric statistical analysis have been developed [see 10, 11, 12, 13, 14, 15, 16, 17]. While violations of social choice principles can inform us as to the possible paradoxes present in non-parametric tests, the mapping is imperfect due to the issue of statistical significance. Consider a comparison of three groups, where the term “group” refers to sampled elements of the same population. If, for example, a violation of transitivity occurs in a set of raw, ordinal data when comparing three groups pairwise, this does not imply that any sort of paradox is evident in the results of a corresponding set of pairwise significance tests. This distinction is studied prominently in Bargagliotti and Greenwell [9].

In a seminal work, Haunsperger and Saari [4] find that “…two or more data sets each may individually support a certain ordering of the samples under [the] Kruskal-Wallis [test], yet their union, or aggregate, yields a different outcome” [18, p. 261]. In other words, the Kruskal-Wallis (KW) test [19] is not necessarily consistent upon aggregation of data. The works of Haunsperger and Saari [4] and later of Haunsperger [18] provide characterizations as to aggregation paradoxes in non-parametric statistics. These works extend a prominent literature concerning the classic Yule-Simpson (association or statistical aggregation) Paradox [see 20 or 21 and, e.g., 22, 23, 24, 25, 26, 27, 28, or 29 for later applications] to the non-parametric realm. The Yule-Simpson Paradox has far-reaching implications within statistics and econometrics. Wardrop [25] demonstrates that the Paradox has an important bearing upon the existence (non-existence) of the hot-hand fallacy and related law of small numbers. Specifically, Wardrop notes that researchers studying whether a hot hand fallacy exists should analyze data at the level of aggregation that (potentially fallacy-prone) fans view the data [see, e.g., Miller and Sanjurjo [30] for recent, related work on the hot hand fallacy].

Bickel et al. [31] find evidence that the Paradox can influence evaluation of gender bias in graduate admissions data. Specifically, statistically significant results at the program level need not hold when evaluating graduate admissions at the university level. Albers [32] finds a related result in evaluating gender bias in research funding. In light of the Yule-Simpson Paradox, these studies establish the importance of analyzing data at the appropriate level of aggregation, whatever it may be. Charig et al. [33] examine treatment effects for small and large kidney stones, both separately and pooled. Consistent with the Paradox, they find evidence that treatment efficacy reverses from separate to pooled tests.

Further developing the Paradox for non-parametric settings, Haunsperger [18] discovers a computable criterion by which the consistency (upon aggregation) of an ordinal data set (when KW-tested) can be characterized. Bargagliotti [34] finds that Bhapkar’s V test [35] and the Wilcoxon-Mann-Whitney (WMW) test [36, 37], two other central non-parametric tests, are also not generally consistent upon aggregation. Hao and Houser [38] state, “Both robust and simple to implement, [the WMW test] has gained exceptional popularity among empirical scientists, even social scientists. For example, in 2009 almost half of papers in Experimental Economics used the WMW test.” Hao and Houser [38] demonstrate that adaptive procedures for the WMW test can lead to “substantial improvements in the ability to detect differences in locations” [p. 1940]. Specifically, Gastwirth [39] and Hogg et al. [40] develop adaptive rank tests with such a quality. In some respects, the Yule-Simpson Paradox relates to the power and sensitivity of given statistical tests, and how these qualities (may) relate to sample size. The Paradox also relates to a broader literature on information aggregation and paradoxes, in which more information does not always result in better (expected) outcomes [see, e.g., Kaufmann and Weber [41], Koessler et al. [42], Bennouri et al. [43], Axelrod et al. [44], Hanson et al. [45]]. Of course, information aggregation paradoxes may be symptomatic of the decision-maker, of the information aggregation mechanism, or of both. Bennouri et al. [43] demonstrate that accepted information aggregation mechanisms can vary substantially in terms of effectiveness (i.e., in supporting optimal decisions).

The KW test [see, e.g., 46; 47; 48 for more recent descriptions and generalizations of the test] represents a non-parametric statistical analog of the n(≥2) group rank sum aggregation rule in social choice theory. The WMW test represents a related two-sample rank sum test. The Wilcoxon rank sum test and the Mann-Whitney U test are nominally distinct (rank sum) tests; their respective test statistics are linearly dependent upon one another such that the two tests lead to the same end (in terms of significance results). Bhapkar’s V test, on the other hand, compares categorically-assigned data across groups. That these non-parametric tests fail to attain general consistency upon aggregation is both surprising and problematic. Philosophically, these results are akin to a social aggregation rule that takes as inputs two Yes votes (e.g., from an election governed by a Yes-No voting rule), aggregates them, and returns No as the social choice (outcome). We might rather expect two such primitive data sets to strengthen the case for the original result upon aggregation. That such an expectation is not met for this set of tests suggests not only that aggregated statistical rankings are potentially arbitrary but that the primitive rankings—or any statistical rankings from the test—are themselves potentially arbitrary. That is, even a primitive data set is an aggregation of data subsets from the primitive set’s power set.

While Bargagliotti [34] shows a necessary condition (of the data) to ensure consistent (upon aggregation) results for the KW and Bhapkar’s V tests, respectively, the literature has not determined whether inconsistency upon aggregation is a general paradox among non-parametric statistical tests. Both the title of Haunsperger and Saari (“The Lack of Consistency for Statistical Decision Procedures”) and that of Haunsperger (“Aggregated Statistical Rankings are Arbitrary”) certainly leave open the intriguing possibility of a general result. Selvitella [49] further considers the “ubiquity” of the Yule-Simpson Paradox. Based upon early results within the literature, Haunsperger [18] writes, “Haunsperger and Saari [4] show that this Simpson-like paradox occurs with many, if not most, statistical decision procedures. Because of the results [therein], we must expect KW to be the non-parametric system most immune to such difficulties” [p. 263]. We consider, then, whether general consistency under aggregation is an impossibility among non-parametric tests. Alternative tests, such as the sign test for matched pairs, apply distinct aggregation rules to non-parametric data in order to assign aggregated statistical rankings of groups. Herein, we examine whether the sign test for matched pairs (hereafter, “sign test”) is generally consistent upon aggregation, where the sign test represents another central non-parametric test with a social choice aggregation rule analog. It is important to note that the McNemar [50] test is closely related to the sign test for matched pairs. Specifically, the latter test represents an exact test of the former.

The sign test uses a binomial(n, p = 0.5) test statistic to pairwise rank two groups from inter-group (inter-sample) comparisons of matched-pair elements. The sign test statistic is calculated in a manner that is procedurally-equivalent to the two-candidate case of Borda rule aggregation and to the two-candidate case of majority rule aggregation (the latter two of which are, then, themselves equivalent to one another in the two-group case). That is to say, each paired-element comparison of the sign test yields a mutually exclusive “vote” in favor of one of the two groups. This is equivalent to two-candidate Borda or two-candidate majority rule voting, whereby each voter makes a binary preference comparison of two candidates and (assuming voting is sincere) casts a mutually exclusive vote in favor of her preferred candidate. The two candidates are then ranked according to majority rule type aggregation.
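This procedural equivalence can be sketched directly in code: each matched-pair comparison casts one mutually exclusive “vote,” the vote total for a group is the sign test statistic, and the p-value comes from the binomial(n, p = 0.5) upper tail. A minimal sketch (the paired values are hypothetical illustration data, not from the paper):

```python
from math import comb

def sign_test_upper_p(wins: int, n_pairs: int) -> float:
    """Upper-tail p-value P(X >= wins) for X ~ binomial(n_pairs, 1/2)."""
    return sum(comb(n_pairs, j) for j in range(wins, n_pairs + 1)) / 2 ** n_pairs

# Each matched pair casts one mutually exclusive "vote" for A or B,
# exactly as in two-candidate majority (or Borda) rule voting.
pairs = [(3, 1), (5, 2), (4, 6), (7, 3), (9, 8)]  # hypothetical (a_i, b_pi(i)) values
votes_for_A = sum(a > b for a, b in pairs)

p = sign_test_upper_p(votes_for_A, len(pairs))
```

Here A wins 4 of 5 “votes,” and the upper-tail p-value is the probability of at least that many wins under a fair coin.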

In Section 2 of this paper, we construct a theoretical framework by which to consider whether the sign test is consistent upon aggregation. In Section 3, we utilize this framework to explore the properties of aggregated statistical rankings from the sign test. We ultimately find that statistical rankings from the sign test are consistent upon aggregation, and we present a general proof of the result. That is, we utilize the cumulative distribution function (cdf) of a binomial(n, p = 0.5) random variable to show generally that the aggregation of two or more data sets—each rendering a sign test result of the same quality—can only strengthen the original result (i.e., can only decrease the p-value of the sign test statistic). This result is shown for any number of aggregated (primitive) data sets. We prove an additional result as to the nature of sign test consistency. Namely, the sign test p-value exhibits a greater proportional decrease (in aggregation) as the original (primitive data sign test) p-value itself becomes smaller. In aggregation, then, we expect a greater proportional p-value decline given significant sign test results for the constituent, primitive data (than for insignificant sign test results). This represents something of a strong form consistency result, whereby data aggregation reinforces significant sign test results to a greater (proportional) degree. Lastly, we construct and define a generalized notion of aggregation to consider the aggregation of data sets of different sample sizes. We find preliminary evidence that the sign test is also consistent upon generalized aggregation. Analysis of the generalized form is of great practical importance, as data sets of different sample sizes are often aggregated (e.g., in the case of many longitudinal and unbalanced panel data sets).

The matched pairs nature of the sign test is crucial to the result of consistency upon aggregation. In the aggregation of ordinal or numerical data with pre-assigned matched pairs, data are not aggregated into an overall outcome sequence but, rather, into a larger set of matched pairs. As such, sign test aggregation is reinforcing or consistent rather than potentially inconsistent. We find that it is possible for a non-parametric test to exhibit consistency.

In Section 4, we develop a generic application in which we present a set of ordinal data for two distinct groups. Elements of the data can be paired across group (based, e.g., on some characteristic such as “twin-ness” in a control-treatment study) and analyzed via the sign test (for matched pairs). Alternatively, elements for the respective groups may remain unpaired, and elements can then be aggregated into an outcome or rank-sequence such that overall group rankings are generated from a WMW rank sum test. Within the application, we demonstrate a case of inconsistency upon aggregation for the WMW test. We do so to show readers possible characteristics of an ordinal data set that is susceptible to inconsistency. For the same ordinal data, we demonstrate an example of the (proved, general) consistency upon aggregation of the sign test. As such, the procedural differences between the two tests are also highlighted (with respect to the consistency property). For the application, we aggregate sets of ordinal data twice, thrice, and four times. It is shown that the sign test p-value shrinks monotonically in the number of data sets aggregated. This follows from the general results of Section 3. For the same ordinal data, we observe a case in which the WMW test p-value rises monotonically in the number of data sets aggregated (as was shown to be possible by Bargagliotti [34]).

Section 5 identifies and develops an empirical application from the sport of high school team cross country running, a sport that uses rank sum scoring to map from an individual outcome sequence (of individual finishing positions) to team scores. The empirical application uncovers evidence of rank sum “inconsistency” and sign test consistency for high school team cross country meet data. Section 6 concludes.

2 Theoretical set-up and preliminary Lemmas

2.1 Preliminary definitions

Let us begin our exposition with some definitions concerning data aggregation.

Definition of n-aggregation of data: Consider a primitive, n-element data set, A. Let n-aggregation involving A be an aggregation of A and r − 1 (r ≥ 2) other data set(s), each of sample size n, to form an aggregated data set.

Haunsperger and Saari [4] and Haunsperger [18] construct a powerful methodology to assess consistency upon aggregation for KW. Namely, they find a condition on a matrix of ordinal data rankings that is equivalent to consistency (mutually exclusive with inconsistency). As they utilize a matrix approach, their methodology considers balanced sample size data aggregation (i.e., n-aggregation). Note that n-aggregation is a restricted form of data aggregation, as it does not consider the aggregation of two or more data sets of different sample sizes. The non-parametric aggregation paradox has focused on this restricted form of data aggregation to date (see, e.g., any of the seminal papers mentioned in the introduction). This is perhaps due to the tractability of examining this form of aggregation, as we will observe in the present analysis. In this study, our primary theorems will concern n-aggregation of ordinal data. However, we also present preliminary results for a more general version of data aggregation of the following form.

Definition of ni-aggregation of data: Consider a primitive, n0-element data set, A. Let ni-aggregation involving A be an aggregation of A and r − 1 (r ≥ 2) other data set(s) of respective sample sizes n1, n2, …, nr−1 to form an aggregated data set.

For tractability, n-aggregation has been incorporated as a standard approach in examining the Yule-Simpson Paradox for ordinal data. However, analysis of the generalized ni-aggregation form is an important consideration. Often, data sets of different sample sizes are pooled (e.g., in the case of many longitudinal and unbalanced panel data sets).

Let us now define consistency upon aggregation largely as in Haunsperger [18]. She states, “A statistical procedure that endows a set of data with an ordinal ranking of the candidates is consistent [upon] aggregation if the aggregate of any [r] sets of data, each of which corresponds to a given ordering of the candidates, gives rise to the same ordering of the candidates for any positive integer [r]” [p. 264]. We refine the definition slightly to incorporate the notion of statistical significance explicitly as follows.

Definition of Consistency upon Aggregation: A statistical procedure that endows a set of data with an ordinal ranking of the candidates is consistent upon aggregation if the aggregate of any r sets of data, each of which corresponds to a given statistically-significant ordering of the candidates, gives rise to the same statistically-significant ordering (i.e., at the same α-level) of the candidates for any positive integer r and for all possible α levels.

Thinking of consistency from the perspective of significance testing, we can evaluate the consistency of a procedure by simply verifying whether it is possible for the p-value of a test result to rise following aggregation. If so, statistically-significant strict orderings can be lost in aggregation such that the procedure is not generally consistent.

2.2 Theoretical setup

Let Xn be a Binomial random variable with parameters n and p = 1/2, and let N = rn for some integer r > 1. Then, the sign test for matched pairs is consistent upon n-aggregation if and only if (iff), for any integer k between n/2 and n,

(1) P(XN ≥ rk) < P(Xn ≥ k).

From (1), it may appear that we are aggregating primitive data sets each of which renders the very same test statistic value, k. In fact, (1) provides a general condition for consistency upon aggregation. Let us think of k as the lowest binomial(n, p = 0.5) test statistic value (from tests of given sets of primitive data) such that a given statistically-significant ordering emerges from the primitive data (i.e., at a stipulated α-level). As such, the term rk represents a lower bound on the aggregated data test statistic value, and the left-hand side p-value in (1) represents an upper-bound for the p-value from a test of the aggregated data. Then, (1) states that the p-value for an associated test of the aggregated data will always be less than the highest (but still significant) p-value among the associated primitive data sets.
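Condition (1) can be checked by brute force for small parameter values. The following sketch (our own illustration, not part of the original derivation) enumerates n, r, and all integers k with n/2 ≤ k ≤ n and records any violation of (1):

```python
from math import comb

def upper_tail(n: int, k: int) -> float:
    """P(X_n >= k) for X_n ~ binomial(n, 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

# Exhaustively check (1): P(X_rn >= rk) < P(X_n >= k)
# for small n and r and all integers k with n/2 <= k <= n.
violations = []
for n in range(1, 13):
    for r in range(2, 5):
        for k in range((n + 1) // 2, n + 1):
            if not upper_tail(r * n, r * k) < upper_tail(n, k):
                violations.append((n, r, k))

print(violations)  # expected: empty list
```

Consistent with Theorem 1 below, no violation arises in this range.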

To establish (1), we first define

(2) aj = aj(0) + aj(1) + ⋯ + aj(r − 1), bj = C(n, j), cj = aj/bj, and dj = (aj + aj+1 + ⋯ + an)/(bj + bj+1 + ⋯ + bn),

where

(3) aj(i) = C(rn, rj + i),

and (n − 1)/2 < k ≤ j ≤ n. Here, aj counts the distinct number of ways of obtaining between rj and rj + r − 1 successes in rn Bernoulli trials, and bj counts the number of ways of obtaining j successes in n Bernoulli trials. The coefficients cj and dj are useful for studying the relative variation in aj and bj and their upper tail partial sums. We now show that cj and consequently dj are strictly decreasing in j over the range of our interest.

Lemma 1 For sequences aj, bj, cj given in (2), and aj(i) given in (3), the following properties hold.

(a) For a fixed i > 0, aj(i)/bj strictly decreases as j increases in the interval [(n − 1)/2 − i/r, n].
(b) aj(0)/bj strictly decreases as j increases in the interval ((n − 1)/2, n], and when j = (n − 1)/2 is an integer, aj(0)/bj = aj+1(0)/bj+1.
(c) cj = aj/bj strictly decreases for j in the interval [(n − 1)/2, n].

Proof. From the expressions for aj(i) and bj given in (3) and (2), respectively, we note that

aj(i)/bj = C(rn, rj + i)/C(n, j) and aj+1(i)/bj+1 = C(rn, r(j + 1) + i)/C(n, j + 1).

Canceling out the common factors in the above two expressions, we conclude that the condition aj(i)/bj > aj+1(i)/bj+1 holds iff

(4) [(n − j)(r(j + 1) + i)] / [(j + 1)(r(n − j) − i)] × ∏t=1,…,r−1 [(rj + i + t)/(r(n − j) − i − t)] > 1

for any j ≤ (n − 1) in the specified range. The first factor on the left side of (4) exceeds 1 iff (n − j)i > −i(j + 1). For any j, this is true for all i > 0, and when i = 0, the first factor equals 1. The factor of the product corresponding to index t exceeds 1 iff rj + i + t > r(n − j) − i − t, and pairing the factors for t and r − t shows that the product exceeds 1 iff j > (n − 1)/2 − i/r. When j = (n − 1)/2 − i/r is an integer, the product equals 1, and for i > 0 the first factor exceeds 1, so the strict monotonicity of aj(i)/bj holds in [(n − 1)/2 − i/r, n]; thus the conclusion in part (a) is established.

When j = (n − 1)/2 is an integer, the first factor as well as the product term equal 1, leading us to the conclusion in part (b), and strict monotonicity holds for j > (n − 1)/2.

Since aj is the sum of the aj(i), cj = aj/bj is the sum of the ratios aj(i)/bj; as each of these ratios is non-increasing in j and at least one is strictly decreasing, the strict monotonicity of cj follows. This claim holds for an integer j = (n − 1)/2 also, completing the proof of part (c).

Lemma 2 For the dj defined in (2), dj < dj−1, where j − 1 ranges down to (n − 1)/2 for odd n and down to n/2 for even n; that is, dj strictly increases as j decreases toward n/2.

Proof. We use an induction argument. For this, first we prove that dn < dn−1. Recall that dn = an/bn and dn−1 = (an + an−1)/(bn + bn−1), where the aj and bj are defined in (2). Hence, the condition dn < dn−1 is equivalent to the condition

(5) an/bn < (an + an−1)/(bn + bn−1).

Since the denominators above are positive, (5) holds iff an bn + an bn−1 < an bn + an−1 bn, or equivalently, an bn−1 < an−1 bn.

This condition follows from part (c) of Lemma 1 since cj = aj/bj.

Next, assuming dj+1 < dj, we will prove that dj < dj−1. From the definition of dj given in (2), we conclude that

(6) bj (aj+1 + ⋯ + an) < aj (bj+1 + ⋯ + bn)

upon cross-multiplication and cancellation of common terms. Similarly, the condition dj < dj−1 is equivalent to the condition

(7) bj−1 (aj + ⋯ + an) < aj−1 (bj + ⋯ + bn).

The left side sum in (7) satisfies

(8) bj−1 (aj + ⋯ + an) = bj−1 aj + bj−1 (aj+1 + ⋯ + an) < bj−1 aj + (bj−1 aj/bj)(bj+1 + ⋯ + bn) = (aj bj−1/bj)(bj + ⋯ + bn),

where the inequality above follows from (6). From part (c) of Lemma 1, it follows that cj < cj−1; that is, (aj/bj) < (aj−1/bj−1), or aj bj−1 < aj−1 bj. Hence, the upper bound in (8) is strictly less than aj−1 (bj + ⋯ + bn), the right side of the (strict) inequality in (7). This strict inequality (in (7)) also holds for j − 1 = (n − 1)/2 for odd n, and j − 1 = n/2 for even n. This establishes our claim.

Remark 1. This monotonicity property of the ratio of the partial sums is perhaps a well-known result in the “Inequalities” literature. We have not found a convenient reference, and have chosen to prove it with elementary principles; we have also fine-tuned the result for the even n case.
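The coefficient sequences and the monotonicity claims of Lemma 1(c) and Lemma 2 can be verified computationally with exact rational arithmetic. A sketch, assuming the definitions in (2) and (3) as reconstructed above:

```python
from math import comb
from fractions import Fraction

def coeffs(n: int, r: int):
    """a_j, b_j, c_j, d_j from (2)-(3) for j = 0..n, in exact arithmetic."""
    a = [sum(comb(r * n, r * j + i) for i in range(r)) for j in range(n + 1)]
    b = [comb(n, j) for j in range(n + 1)]
    c = [Fraction(a[j], b[j]) for j in range(n + 1)]
    d = [Fraction(sum(a[j:]), sum(b[j:])) for j in range(n + 1)]
    return a, b, c, d

# Verify that c_j and d_j strictly decrease in j over the relevant range.
for n in range(2, 10):
    for r in range(2, 5):
        _, _, c, d = coeffs(n, r)
        lo = n // 2  # smallest integer j with j >= (n - 1)/2
        assert all(c[j] > c[j + 1] for j in range(lo, n))  # Lemma 1(c)
        assert all(d[j] > d[j + 1] for j in range(lo, n))  # Lemma 2
```

For example, with n = 2 and r = 2 one obtains a = (5, 10, 1), b = (1, 2, 1), and d1 = 11/3 > d2 = 1.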

3 Main results

Theorem 1 Consistency upon Aggregation: Let Xn be a Binomial random variable with parameters n and p = 1/2. Then, with N = rn, the inequality in (1) holds for all integers k with n/2 ≤ k ≤ n and all integers r ≥ 2.

Proof. In terms of the coefficients introduced in (2) and (3), and with N = rn, the inequality in (1) can be expressed as

(9) P(XN ≥ rk)/P(Xn ≥ k) = 2^(n − N) dk < 1.

We have seen in Lemma 2 that dk strictly increases as k decreases toward n/2. Thus, the maximum value for the ratio of the two binomial probabilities in (9) is attained with k = (n + 1)/2 for odd n and k = n/2 for even n. We now show that this maximum value is less than 1 for both odd and even n.

When n is odd and n = 2m − 1, we take k = m = (n + 1)/2 in (9). By the symmetry of the binomial probabilities, P(Xn ≥ m) = 1/2.

Now N (= rn) may be even or odd. When N is even, r (≥2) is also even, and rn/2 is the median of XN. Further, P(XN ≥ rn/2) > 1/2 and P(XN ≥ rn/2 + 1) < 1/2. Thus,

P(XN ≥ rm) = P(XN ≥ rn/2 + r/2) ≤ P(XN ≥ rn/2 + 1) < 1/2 = P(Xn ≥ m),

where the first inequality holds since r/2 ≥ 1. When N (= rn) is odd, r (≥3) is also odd, and P(XN ≥ (rn + 1)/2) = 1/2. Thus,

P(XN ≥ rm) = P(XN ≥ (rn + r)/2) < P(XN ≥ (rn + 1)/2) = 1/2 = P(Xn ≥ m),

where the strict inequality holds since (rn + r)/2 and (rn + 1)/2 are both integers and the difference (r − 1)/2 ≥ 1. Thus, the strict inequality in (9) holds for all k ≥ n/2 when n is odd.

When n is even and n = 2m, we take k = n/2 = m, and note that N is always even. By the symmetry of the binomial probabilities,

P(Xn ≥ m) = 1/2 + (1/2)P(Xn = m) and P(XN ≥ rm) = 1/2 + (1/2)P(XN = rm).

Thus, the strict inequality in (9) holds iff

(10) P(XN = rm) < P(Xn = m).

We will now establish (10). Note that XN has the same distribution as Xn + Y where Y is Binomial with parameters Nn and p = 1/2, and Xn and Y are independent.

Since C(n, i) is largest when i = n/2 = m, so is P(Xn = i). Hence,

P(XN = rm) = Σi P(Xn = i) P(Y = rm − i) < P(Xn = m) Σi P(Y = rm − i) ≤ P(Xn = m),

which establishes (10). That is, (9) holds for all k ≥ n/2 when n is even as well.

A similar result on the lower tail probabilities can be established by symmetry. As such, we have shown that any significant sign test result is consistent upon n-aggregation. This result has important implications for classic results (e.g., by [31] and [33]). If we apply the sign test to the underlying data of these studies, while normalizing the sample size to equality across group and across treatment, it is not possible for an aggregation paradox to occur. Later in the paper, we will consider whether the sign test is susceptible to instances of aggregation paradox given unequal sample sizes across group. We can refine Theorem 1 by observing that we have proved a great deal more. In fact, we have established the following.

Theorem 2 Strong Form Consistency upon Aggregation: Let XN be a Binomial random variable with parameters N = rn and p = 1/2, where n ≥ 1 and r ≥ 2 is an integer. Then, as the integer k decreases in [n/2, n], P(XN ≥ rk)/P(Xn ≥ k) monotonically increases, with a strict upper bound of 1.

That the ratio is bounded by 1 follows from Theorem 1. Theorem 2 tells us that the n-aggregated data associated p-value rises toward the primitive data associated p-value as the primitive data p-value rises. In other words, n-aggregation creates a greater proportional decrease in the sign test p-value as the original (primitive) p-value itself becomes smaller. In aggregation, then, we expect a greater proportional p-value decline given significant sign test results for the constituent, primitive data (than for insignificant sign test results). This represents something of a strong form consistency result, whereby data aggregation reinforces significant sign test results to a greater (proportional) degree. A more general result of the following form is highly desirable.
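Theorem 2's monotonicity can be illustrated numerically. The sketch below fixes sample values n = 10 and r = 3 (arbitrary choices of ours) and computes the ratio P(XN ≥ rk)/P(Xn ≥ k) exactly as k falls from n toward n/2:

```python
from math import comb
from fractions import Fraction

def upper_tail(n: int, k: int) -> Fraction:
    """P(X_n >= k) for X_n ~ binomial(n, 1/2), exactly."""
    return Fraction(sum(comb(n, j) for j in range(k, n + 1)), 2 ** n)

# Theorem 2: for fixed n and r, the ratio P(X_rn >= rk)/P(X_n >= k)
# increases as k decreases toward n/2, staying strictly below 1.
n, r = 10, 3
ks = list(range(n, (n + 1) // 2 - 1, -1))  # k = n down to n/2
ratios = [upper_tail(r * n, r * k) / upper_tail(n, k) for k in ks]

assert all(x < y for x, y in zip(ratios, ratios[1:]))  # rises as k falls
assert all(x < 1 for x in ratios)                      # strict upper bound of 1
```

The exact-fraction arithmetic avoids any floating-point ambiguity in the monotonicity check.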

Theorem 3 (Conjecture) Let XN denote a Binomial random variable with parameters N and p = 1/2. Then for positive integers n1 < n2 and constant c ∈ [1/2, 1] such that cn1 and cn2 are integers, the following inequality holds:

(11) P(Xn2 ≥ cn2) < P(Xn1 ≥ cn1).

We are able to establish this claim for the boundary values of c (that is, c = 1/2 or 1) using the arguments presented earlier, and conjecture that the result is true for other c values in [1/2, 1]. Limited computational work supports this conjecture. Further, a large-sample approximation, described below, leads us to believe in the conjecture, especially when n1 and n2 are large and far apart.

From the central limit theorem, as n → ∞, we know that (Xn − n/2)/(√n/2) converges in distribution to a standard normal random variable. Further, since the binomial distribution is already unimodal and symmetric (as p = 1/2), the convergence is fast. So, if n is large, we have the following approximation:

(12) P(Xn ≥ cn) ≈ 1 − Φ((cn − n/2)/(√n/2)) = 1 − Φ((2c − 1)√n),

where Φ is the cdf of a standard normal random variable. This “close” approximation to the tail probability strictly decreases as n increases as long as c > 1/2, and hence we believe (11) holds for large n1 < n2.

Remark 2. When c = 1/2 and n1 and n2 are odd integers, the strict inequality claimed in (11) does not hold, and each of the probabilities equals 1/2, as for an odd integer n,

P(Xn ≥ n/2) = P(Xn ≥ (n + 1)/2) = 1/2.

Also, it follows from (12) that, irrespective of the odd or even nature of n, the limiting value of this probability is 1/2.
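The conjecture in (11) and the approximation in (12) can be spot-checked numerically. In the sketch below, c = 0.6 is a hypothetical choice of ours for which cn is an integer whenever n is divisible by 5:

```python
from math import comb, erf, sqrt

def upper_tail(n: int, k: int) -> float:
    """P(X_n >= k) for X_n ~ binomial(n, 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def phi(z: float) -> float:
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Spot-check (11): for c > 1/2 with cn1, cn2 integers, the upper tail
# P(X_n >= cn) should shrink as n grows.
c = 0.6
for n1, n2 in [(5, 10), (10, 20), (20, 40)]:
    p1, p2 = upper_tail(n1, round(c * n1)), upper_tail(n2, round(c * n2))
    assert p2 < p1

# Normal approximation (12): P(X_n >= cn) ~ 1 - Phi((2c - 1) * sqrt(n)).
n = 100
exact = upper_tail(n, round(c * n))
approx = 1 - phi((2 * c - 1) * sqrt(n))
```

These checks are consistent with, but of course do not prove, the conjecture.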

4 Application 1: Demonstrating aggregation of statistical rankings for Sign and WMW tests

4.1 Tests of a primitive, ordinal data set

Consider two groups, A and B. Each group consists of nine elements or data points (i.e., n = 18 elements in total). That is, A = {a1, a2, …, a9} and B = {b1, b2, …, b9} such that the primitive (unaggregated) data for the WMW rank sum test set-up is an outcome sequence of 18 ordinal (rank-ordered) data points. For the application, let FAB represent the ordinal data (sequence). For the sign test set-up, each ai ∈ A is matched with a counterpart bj ∈ B according to the matching criterion. As we are a priori unsure (notationally) which bj will be matched with each given ai, we say that each ai is matched with some bπ(i), where π: D → D is defined as a bijection between D = {1, 2, …, 9} and itself such that element ai ∈ A and element bπ(i) ∈ B uniquely pair with one another for all i ∈ D. As such, the following matched pairs data set, MAB, is generated (uniquely in the present example) from the outcome sequence FAB:

The set MAB is a set of rank-ordered (matched) pairs of elementary data points. That is to say, MAB is a set whose elements are outcome sequences on matched pairs. If an element of MAB is represented as (ai, bj), this is equivalent to finding that ai ≻ bj (“ai ranks higher than bj”) in outcome sequence FAB. If an element of MAB is represented as (bj, ai), this is equivalent to saying that bj ≻ ai in outcome sequence FAB. As such, FAB and MAB are different representations of the same ordinal data set. Whereas FAB is a sequence representation of the data, MAB is a matched-pairs representation. For the application data (but not generally), MAB is uniquely derived from FAB. That is to say, the elements of FAB are sequenced such that the ordering of any given matched pair in FAB is invariant to the specific match formed. That is, when i ∈ {1, 2, …, 8}, we know that ai ≻ bj regardless of which j is paired with i. When i = 9, we know that bj ≻ ai regardless of which j is paired with i. However, it is never possible to derive FAB uniquely from MAB. That is to say, FAB sometimes maps to a unique MAB. Given only MAB, however, one can never map to a unique FAB. In such a case, one is always missing some rank comparison information needed to support a super-ordering or outcome sequence of the overall data. Specifically, one cannot obtain rank comparisons of unmatched elements (ai and bπ(j), j ≠ i) from MAB.

From FAB, we can calculate rank sum scores for A and B. Generally, consider two groups, X and Y, and let FXY represent an outcome sequence between elements of the two groups. For each element xi ∈ X, let the rank of xi in the sequence FXY, r(xi|FXY), equal the position of xi in FXY, and let the rank sum score for group X given FXY be the sum of these ranks over all xi ∈ X. Given the outcome sequence FAB specified earlier in this application, we obtain the two groups' rank sum scores and apply the WMW rank sum test. The upper-tail WMW rank sum p-value for this sample difference is equal to 0.0027. Then, we conclude from a WMW test of the primitive data that A ≻α = 0.01 B, where the notation “≻α = 0.01” reads “ranks significantly higher than at the α = 0.01 significance level.” In other words, A has a significantly lower rank sum score than B such that we conclude from FAB that group A ranks significantly higher than group B at the assigned significance level.

Let us consider the same data in matched pairs form, MAB. Of the 9 matched pairs, an element of A outranks the corresponding element of B for 8 of 9 matched pairs. Formally, let us define the sign test statistic value for A as the number of elements ai ∈ A such that ai ≻ bπ(i). Therefore, we obtain the sign test values for the primitive data as 8 for A and 1 for B.

The upper-tail sign test for matched pairs p-value for this sample difference is equal to 0.0098. Then, we conclude from a sign test for matched pairs of the primitive data that A ≻α = 0.01 B. For this primitive application data, the WMW test and the sign test each provide corresponding results at all standard significance levels.

4.2 Tests of aggregated, ordinal data

Now, let us replicate the (ordinality of the) primitive data once, combine the primitive data and its ordinal replicate, and re-test the aggregated data. One should note that an ordinally-replicated data set may be quite different numerically from the primitive (source) data from which it arises, such that two ordinally-identical data sets can combine to form various feasible aggregated sequences. The aggregated outcome sequence consists of 36 rank-ordered elements. One possible aggregated outcome sequence splices the replicate sequence (ordinally equivalent to FAB) into FAB such that the replicate is strung (consecutively) between the penultimate and ultimate elements of the original sequence. The implication of this aggregated outcome sequence is that, for the most part, the numerical values of the original data outrank those of the ordinal-replicate data (despite the identical ordinal quality of the two data sets). From this aggregated sequence, we can recompute the rank sum scores for A and B.

The upper-tail WMW rank sum p-value for this sample difference is equal to 0.0438. At the α = 0.01 significance level, then, we conclude from a WMW test of the aggregated data that A ∼α = 0.01 B (i.e., that A and B are rank-indistinguishable at the α = 0.01 significance level), where “∼α = 0.01” reads “is rank-indistinguishable from at the α = 0.01 significance level.”
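These WMW figures can be checked with the normal approximation to the null rank sum distribution. The outcome sequences themselves are not displayed above, so the sketch below uses a sequence consistent with the reported statistics (A outranks B in 8 of 9 matched comparisons, A's weakest element a9 ranks last, and the rank sums equal 54 and 279 as implied by the reported p-values); the function name is hypothetical.

```python
import math

def wmw_upper_p(seq, group="a"):
    """Upper-tail WMW p-value for `group` via the normal approximation
    to the rank sum (a low rank sum indicates a high-ranking group)."""
    n = sum(1 for g in seq if g == group)
    m = len(seq) - n
    r = sum(i + 1 for i, g in enumerate(seq) if g == group)  # rank sum score
    mu = n * (n + m + 1) / 2                                 # null mean
    sd = math.sqrt(n * m * (n + m + 1) / 12)                 # null std. dev.
    return 0.5 * math.erfc((mu - r) / sd / math.sqrt(2))     # P(Z >= z)

# Assumed primitive sequence consistent with the reported statistics.
primitive = ["a"] * 8 + ["b"] * 9 + ["a"]          # a1..a8, b1..b9, a9
# Replicate spliced between the penultimate and ultimate elements.
aggregated = primitive[:17] + primitive + primitive[17:]

print(round(wmw_upper_p(primitive), 4))   # ≈ 0.0027
print(round(wmw_upper_p(aggregated), 4))  # ≈ 0.0438
```

The splice places most replicate elements ahead of a9, inflating A's aggregated rank sum and eroding significance.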

We now consider a sign test for matched pairs upon the same aggregated data. Specifically, we combine MAB and its ordinal replicate into a single data set to obtain an aggregated set of matched pairs.

As MAB is an unordered set of matched pairs (i.e., the set of elements is unordered, whereas each element is itself an ordered pair), the aggregation of MAB and its ordinal replicate is also an unordered set of matched pairs. As such, the aggregated set features the same matched pair ordinal comparisons as does MAB, while representing each such pair twice. Matches for the sign test are pre-assigned at the primitive data level according to the (also pre-assigned) matching criterion (e.g., “twinness”). Therefore, matched pairings are preserved in aggregation. This can be said in a more straightforward manner: the identity of one’s twin (match), once assigned, is invariant to the level of data aggregation. Matched pairing preservation in data aggregation is an important property that distinguishes the sign test from independent sample non-parametric tests. As the profile of matched pairings is itself a feature of the primitive data in the case of the sign test (in much the same way that the primitive data outcome sequence is a feature of the primitive data in the case of a WMW test), one would not be strictly aggregating the primitive data if one were to permute the matched pairings in aggregation. Rather, one would thereby change a feature of the primitive data at the same time (in the same manner that a re-ordering of primitive data would go beyond aggregation in the case of a WMW test). Given this feature of the sign test, the matched pair elements of MAB and its replicate, respectively, are not rank-compared with one another (across primitive group) within a sign test analysis of the aggregated set.

Of the eighteen matched pairs in the once ordinally-replicated data, an element of A outranks the corresponding element of B for sixteen matched pairs. Therefore, we obtain sign test statistic values of 16 for A and 2 for B for the aggregated data.

Then, the upper-tail sign test for matched pairs p-value for this test statistic value is equal to 0.000484. Then, we conclude from a sign test for matched pairs of the aggregated data that A ≻α = 0.01 B. For the once ordinally-replicated data, the sign test retains significance at the α = 0.01 level. Consistent with our general results, the sign test p-value diminishes with data aggregation in this case. On the other hand, the WMW test loses significance at the α = 0.01 level when applied to the once ordinally-replicated data.
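The reported sign test p-values (0.0098 and 0.000484) correspond to the normal approximation to the Binomial(n, p = 0.5) distribution of the sign test statistic. A minimal sketch, with a hypothetical function name:

```python
import math

def sign_upper_p(wins, n):
    """Upper-tail sign test p-value via the normal approximation to
    Binomial(n, 1/2): z = (wins - n/2) / (sqrt(n)/2)."""
    z = (wins - n / 2) / (math.sqrt(n) / 2)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

print(round(sign_upper_p(8, 9), 4))    # primitive data: ≈ 0.0098
print(round(sign_upper_p(16, 18), 6))  # aggregated data: ≈ 0.000484
```

Doubling both the wins and the sample size scales the z-statistic by sqrt(2), which is why the aggregated p-value falls rather than rises.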

To consider the result of Theorem 2, we alter MAB as follows.

We also alter as follows.

Aggregating and , we obtain .

Given and , we have the following values for the sign test statistic:

From our previous analysis of MAB and , we have that:

With respect to Theorem 2, then, we have that

We also have that

This result illustrates the (general) property shown in Theorem 2. Namely, upon replication, the sign test p-value exhibits a greater proportional decrease as the p-value for the sign test of the primitive data decreases (i.e., for more significant sign test results of the primitive data). Not only does the sign test exhibit consistency upon aggregation; sign test results are also more strongly reinforced in aggregation, in terms of proportional p-value decrease, as the original sign test p-value (of the primitive data) decreases.
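The strong-form property can be illustrated numerically under a normal approximation to the Binomial(n, 1/2) sign test distribution: the ratio of the aggregated p-value to the primitive p-value shrinks as the primitive result becomes more significant. The win counts below (6 through 9 wins out of 9 matched pairs) are hypothetical illustration values, not data from the text.

```python
import math

def sign_upper_p(wins, n):
    """Upper-tail sign test p-value, normal approximation to Binomial(n, 1/2)."""
    z = (wins - n / 2) / (math.sqrt(n) / 2)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Proportional p-value change upon one replication, for increasingly
# significant primitive results: p(2k of 2n) / p(k of n).
ratios = [sign_upper_p(2 * k, 18) / sign_upper_p(k, 9) for k in (6, 7, 8, 9)]
print([round(r, 3) for r in ratios])  # strictly decreasing sequence
```

Under this approximation the ratio equals Φ̄(√2·z)/Φ̄(z), which is strictly decreasing in z, matching the Theorem 2 pattern.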

Returning to FAB, MAB, , and , we iterate this aggregation to combine three and then four instances of the primitive ordinal data as follows.

The case of 3 Ordinal Replicates Aggregated.

We now take an aggregated outcome sequence comprising three ordinally-equivalent instances of FAB, spliced in the same manner as before, and re-test the aggregated data.

For this case, the WMW test p-value is 0.122, and the sign test p-value is 0.000027. The WMW test p-value has risen once again in level of aggregation, whereas the sign test p-value has once again fallen.

The case of 4 Ordinal Replicates Aggregated.

Now take an aggregated outcome sequence comprising four ordinally-equivalent instances of FAB, again spliced in the same manner.

For this case, the WMW test p-value is 0.209, and the sign test p-value is less than 0.00001. The WMW test p-value has risen once again in level of aggregation, whereas the sign test p-value has once again fallen. The overall significance testing results for the primitive and aggregated data are represented in Table 1.
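The full progression summarized in Table 1 can be reproduced under assumptions consistent with the reported figures: a primitive outcome sequence placing a1 through a8 first, then b1 through b9, then a9; each further replicate spliced immediately before the final element; and normal-approximation p-values. These are reconstructions rather than details taken verbatim from the text.

```python
import math

def norm_sf(z):
    """Standard normal upper-tail probability."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def wmw_upper_p(seq):
    n = seq.count("a")
    m = len(seq) - n
    rank_sum = sum(i + 1 for i, g in enumerate(seq) if g == "a")
    mu = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    return norm_sf((mu - rank_sum) / sd)

def sign_upper_p(wins, n):
    return norm_sf((wins - n / 2) / (math.sqrt(n) / 2))

S = ["a"] * 8 + ["b"] * 9 + ["a"]  # assumed primitive sequence
agg, results = S, []
for r in range(1, 5):              # 1 through 4 aggregated instances
    results.append((wmw_upper_p(agg), sign_upper_p(8 * r, 9 * r)))
    agg = S[:17] + agg + S[17:]    # splice the next replicate before a9

for r, (w, s) in enumerate(results, start=1):
    print(r, round(w, 4), round(s, 6))
```

The WMW column rises toward insignificance while the sign test column falls monotonically, mirroring the reported 0.0027/0.0438/0.122/0.209 versus 0.0098/0.000484/0.000027/<0.00001 pattern.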

Why does the WMW test display inconsistency upon aggregation?

Boudreau et al. [52] explore the social choice properties of rank sum scoring. They find that rank sum scoring approaches may invite various, related (social choice) paradoxes when a group with a superior rank sum “falls behind” (i.e., has fewer elements represented) at some point in the outcome sequence. In the application, group A has a superior rank sum in the primitive outcome sequence. However, group A also possesses the weakest overall element in the primitive outcome sequence. This makes group A’s significant rank superiority in the primitive outcome sequence vulnerable upon aggregation. In the aggregated data, if sufficiently many elements fall after element b9 but before element a9 (as occurred here), this interposition disproportionately inflates group A’s rank sum score. Such an outcome is not possible for matched pairs ordinal data aggregation. In that case, pairs remain assigned such that the primitive ordinal data and its replicate(s) reinforce one another.

5 (Empirical) Application 2: The aggregation paradox in ranking cross-country running team performance

As in Hammond [51] and Boudreau et al. [52], we use the setting of team cross country races to understand aggregation properties of ordinal data. In a cross country running meet, rank sum scoring is used as the official scoring methodology to assign team rankings. Boudreau et al. [52] describe the scoring methodology in detail:

In cross country running, for example, teams are compared on the performances of individual runners in a race; the outcome sequence is the overall ranking as to how runners finish. Rank sum scoring means that each group (team) receives a number of points equal to an element-wise ranking in the outcome sequence (1 for 1st place, 2 for 2nd, and so on), and comparison of groups is based on the sum of their element scores: groups with lower scores rank above those with higher scores [p. 220].

In the present application, we have identified two high school cross country programs for comparison: Carmel High School (Carmel, Indiana) and Fishers High School (Fishers, Indiana). We consider these two schools because their respective Varsity (Boys) teams competed against one another during the 2016 season, as did their respective Junior Varsity (Boys) teams. The two meets—the Varsity Indiana High School Athletic Association Noblesville Sectional meet and the Junior Varsity Hamilton County meet—occurred on the same running course (White River Elementary School Course; Noblesville, Indiana) such that the aggregation of the two data sets is valid (i.e., runners across meets faced the same basic performance elements). Moreover, Carmel’s Varsity and Junior Varsity teams won each respective meet by the very same convincing margin when rank sum scored against Fishers without the consideration of third-party teams (as in the WMW test). That is to say, respective pairwise rank sum scores were the same for each school in each race. Lastly, this particular setting was chosen because both the Varsity and Junior Varsity meets have been publicly recorded on Athletic.net (from which we obtained the data). It is typically difficult to find publicly-available Junior Varsity meet results. The results of these two cross country meets can be found at https://www.athletic.net/CrossCountry/meet/119851/results/481308 and at https://www.athletic.net/CrossCountry/Results/Meet.aspx?Meet=119845&show=all respectively.

This setting provides something of a natural experiment by which we can compare the two Cross Country programs at different levels of aggregation (i.e., a comparison of Varsity teams, a comparison of Junior Varsity teams, and an aggregated comparison of the two programs across the levels of competition) under the same basic competitive conditions. From the finishing time results, we obtain the observed primitive outcome sequence for the Varsity meet, FV(C, S), where FV(C, S) symbolizes the observed Varsity race outcome sequence for our two Varsity teams, C = {c1, c2, c3, c4, c5, c6, c7} is a set representing the 7-element (7-runner) Carmel Varsity team, and S = {s1, s2, s3, s4, s5, s6, s7} is a set representing the 7-element (7-runner) Fishers Varsity team. The resulting rank sum scores are 36 for Carmel and 69 for Fishers, and the upper-tail WMW rank sum p-value for this sample difference is equal to 0.019. In a WMW rank sum test of the Varsity data, then, we conclude that C ≻α = 0.025 S. At the α = 0.025 significance level or any higher α-level, that is, we conclude that Carmel’s 2016 Varsity team was significantly faster than Fishers’ 2016 Varsity team.

For the Junior Varsity meet, the observed race outcome sequence is given by FJV, which symbolizes the observed outcome sequence for our two Junior Varsity teams in the Junior Varsity race, where each of the Carmel and Fishers Junior Varsity teams is represented as a 7-element (7-runner) set. This leads to rank sum scores of 36 for Carmel and 69 for Fishers.

We see that the rank sum scores for the two teams are the same for the Varsity and Junior Varsity meets. Then, the upper-tail WMW rank sum p-value for the Junior Varsity sample difference is also equal to 0.019. In a WMW rank sum test of the Junior Varsity data, then, we conclude that C ≻α = 0.025 S. At the α = 0.025 significance level or at any higher α-level, that is, we conclude that Carmel’s 2016 Junior Varsity team was significantly higher in quality than Fishers’ 2016 Junior Varsity team.

One might ponder the utility of ranking teams at the Junior Varsity level. Junior Varsity competition often provides a leading indicator of a program’s future strength. Further, the quality of a Junior Varsity team can indicate program depth and robustness against injury. We might therefore expect a strong overall program to be strong at each level.

Using finishing times, we pool the two outcome sequences to create the aggregated sequence and obtain pooled rank sum scores of 170 for Carmel and 236 for Fishers.

As with its constituent outcome sequences, the aggregated sequence represents the true (observed) outcome sequence when the two meets (races) are pooled and runners are sequenced in ascending order of finishing time. Given the observed primitive race sequences, the pooled (aggregated) outcome sequence is simply the 14 Varsity elements (runners) in their primitive order followed by the 14 Junior Varsity elements. In other words, the first element in outcome sequence FJV follows the final element in outcome sequence FV. Each meet represents a distinct level of competition by which these two cross country programs are ranked against one another.

Given these rank sum scores, the WMW rank sum p-value for this sample difference is equal to 0.069. In a WMW rank sum test of the aggregated data, then, we conclude that C ∼α = 0.025 S, where “∼α = 0.025” reads “ranks indifferently to at the α = 0.025 significance level.” In fact, we also find that C ∼α = 0.05 S. At the α = 0.05 significance level or any smaller α-level, that is, we conclude from the aggregated data that Carmel’s 2016 Boys Cross Country program was not significantly different from that of Fishers. We obtain this result despite finding significant evidence that each of Carmel’s Boys Cross Country (primitive or constituent) teams (i.e., the Varsity and Junior Varsity teams) separately ranks significantly higher than its respective counterpart team at Fishers High School. In other words, we find an empirical example of an aggregation paradox when comparing these two Cross Country programs. This aggregation paradox is fairly extreme. Not only does the test of aggregated data lose significance at the α = 0.025 level but also at the α = 0.05 level. Further, the p-value for the aggregated data is more than the sum of the p-values for the two constituent data sets!
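The exact WMW p-values for this application can be checked by computing the null rank sum distribution directly. The Mann-Whitney U statistics below (U = 8 for each constituent meet, U = 65 for the pooled sequence) are inferred from the reported rank sums and p-values rather than taken from the text, so they are assumptions; the partition-counting recursion itself is standard.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def npart(u, n, m):
    """Number of partitions of u into at most n parts, each part at most m
    (the null frequency of Mann-Whitney U = u for samples of size n and m)."""
    if u < 0:
        return 0
    if u == 0:
        return 1
    if n == 0 or m == 0:
        return 0
    # Either no part equals m, or remove one part of size m.
    return npart(u, n, m - 1) + npart(u - m, n - 1, m)

def wmw_exact_lower_p(u_obs, n, m):
    """Exact P(U <= u_obs) under the null hypothesis (no ties)."""
    return sum(npart(u, n, m) for u in range(u_obs + 1)) / comb(n + m, n)

p_meet = wmw_exact_lower_p(8, 7, 7)       # each 7-on-7 constituent meet
p_pooled = wmw_exact_lower_p(65, 14, 14)  # pooled 14-on-14 comparison
print(round(p_meet, 3))  # 0.019
print(round(p_pooled, 3))
```

Each constituent meet is significant at α = 0.025, while the pooled comparison exceeds 0.05, reproducing the paradox.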

Of course, there are alternatives in sport to rank sum scoring. For example, team tennis matches typically employ a methodology that is consistent with the sign test methodology. For a dual team tennis match, each team assigns an intra-squad rank to each player or set of players, and each player for a given team is match paired to play against the player on the opposing squad who possesses the same intra-squad rank. Matches then take place and a team winner is assigned based on majority rule aggregation of individual match results. We can apply such a scoring rule to the case of team cross country. More specifically, we can apply this sign test type team tennis scoring to the application data from the two high school cross country meets considered.

In sign test type team tennis scoring, each Carmel runner defeated his assigned pair on the Fishers team (for each of the two competition levels). As such, each Carmel team wins all 7 of its 7 matched pairs at its respective competition level.

Then, we compute the corresponding upper-tail sign test p-value for each constituent meet. By aggregating the two data sets, we obtain 14 matched pairs, all 14 of which favor Carmel.

Computing the upper-tail sign test p-value for the aggregated data, we find that the sign test p-value diminishes in level of aggregation for the empirical application. Whereas the results of the WMW test are inconsistent upon aggregation in this empirical application, we again observe an instance of the general consistency upon aggregation quality of the sign test.
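With every matched pair won by Carmel, the sign test p-values are simple to compute exactly from the Binomial(n, 1/2) tail; the specific p-values are not reported above, so the exact-binomial convention here is an assumption.

```python
from math import comb

def sign_exact_upper_p(wins, n):
    """Exact upper-tail sign test p-value: P(X >= wins), X ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

p_meet = sign_exact_upper_p(7, 7)      # all 7 matched pairs won at one level
p_pooled = sign_exact_upper_p(14, 14)  # all 14 pairs won in the aggregated data
print(p_meet, p_pooled)  # 0.0078125 6.103515625e-05
```

A clean sweep of 7 pairs gives p = 2^(-7); pooling the two sweeps gives p = 2^(-14), so aggregation strictly reinforces the constituent results.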

6 Application 3: Aggregation properties of the binomial test for equality of dependent proportions

McNemar’s test for the equality of proportions of matched pairs considers the discordant pairs with responses (Yes, No) and (No, Yes). Let n be the total number of discordant pairs, and let X be the number of discordant pairs with response (Yes, No). Then the test statistic is given by T2 = (X − (n − X))2/n, and the null hypothesis is rejected if T2 is too large (see, for example, Rosner ([53], p. 375)). Under the null hypothesis, conditioned on n, X has a Binomial distribution with parameters n and p = 1/2. When n is small (say, under 20), the exact Binomial distribution of X is used ([53], p. 377) to compute the p-value, and the null hypothesis of equality of proportions is rejected when X is too small or too large. When n is large, the chi-square approximation with 1 degree of freedom is used, and the upper-tail probability associated with the observed T2 is used as the p-value. Thus, when the numbers of discordant pairs in two data sets are equal, say n each, we can use Theorem 2 to establish strong form consistency upon aggregation of the test. When the numbers of discordant pairs vary, at least when the ni are large, (11) can be used to establish consistency upon aggregation.
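A sketch of the two computation regimes described above, using a hypothetical count of discordant pairs (n = 10, X = 9): the small-sample branch uses the exact Binomial(n, 1/2) tail, and the large-sample branch uses the 1-df chi-square tail of T2.

```python
import math
from math import comb

def mcnemar_p(x, n):
    """Exact two-sided binomial p-value and the chi-square (1 df)
    approximation for McNemar's T^2 = (x - (n - x))^2 / n."""
    lower = sum(comb(n, k) for k in range(0, x + 1)) / 2 ** n
    upper = sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n
    p_exact = min(1.0, 2 * min(lower, upper))     # reject if X too small or large
    t2 = (x - (n - x)) ** 2 / n
    p_chi2 = math.erfc(math.sqrt(t2 / 2))         # P(chi-square_1 > t2)
    return p_exact, p_chi2

# Hypothetical example: 9 of 10 discordant pairs have response (Yes, No).
p_exact, p_chi2 = mcnemar_p(9, 10)
print(round(p_exact, 4), round(p_chi2, 4))  # ≈ 0.0215, ≈ 0.0114
```

The chi-square identity used above follows from T2 being the square of a standard normal under the null, so P(chi2_1 > t) = erfc(sqrt(t/2)).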

7 Application 4: Computation of Yule-Simpson Paradox for a case of rank sum scoring

Here, we consider a case of rank sum scoring in which two groups, each with two elements, are compared. The outcome of the comparison is a 4-element sequence (e.g., a, b, b, a). There are 4!/(2!2!) = 6 such sequences. For each sequence, we replicate that same sequence and pool the two ordinally-identical data sets to create every possible pooled sequence of eight elements for the two groups that preserves the within-sample ordering for each original sequence. For example, a, b, b, a can be replicated and pooled with its replicate, a′, b′, b′, a′, to create a′, b′, b′, a′, a, b, b, a or, alternatively, a, b, b, a′, a, b′, b′, a′. Essentially, there are up to 5 “bins” in which to place elements of the replicate data set into the original data set, but the number of bins available for a given element is constrained because one cannot re-order the original, constituent sequences. A decision tree shows that there are 8!/(4!4!) = 70 possible poolings for each original sequence. Across all 6 original sequences, then, there are 70 ⋅ 6 = 420 possible poolings. We find that 72 of these poolings (17.14 percent) result in an instance of Yule-Simpson Paradox of one of the following types: Rc(A) ≤ Rc(B) but Rp(A) > Rp(B), or Rc(A) ≥ Rc(B) but Rp(A) < Rp(B), where Rc and Rp denote constituent and pooled rank sum scores, respectively. That is, one group has at least as low a rank sum score as the other in the constituent data set, but this is not the case in the pooled data set. This result suggests that instances of Yule-Simpson Paradox occur fairly regularly in small sample cases. Qualitatively, the result is in line with previous small sample results on Yule-Simpson Paradox incidence for other tests (see [28]; [54]).
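The 420-pooling enumeration and the count of 72 paradox instances can be verified by brute force; a sketch:

```python
from itertools import combinations, permutations

def rank_sums(seq):
    """Rank sum scores for groups 'a' and 'b' in an outcome sequence."""
    ra = sum(i + 1 for i, g in enumerate(seq) if g == "a")
    return ra, sum(range(1, len(seq) + 1)) - ra

originals = sorted(set(permutations("aabb")))  # the 6 constituent sequences
total = reversals = 0
for seq in originals:
    rca, rcb = rank_sums(seq)                  # constituent rank sums
    for slots in combinations(range(8), 4):    # positions taken by the replicate
        it1, it2 = iter(seq), iter(seq)        # replicate is ordinally identical
        pooled = [next(it2) if p in slots else next(it1) for p in range(8)]
        rpa, rpb = rank_sums(pooled)           # pooled rank sums
        total += 1
        if (rca <= rcb and rpa > rpb) or (rca >= rcb and rpa < rpb):
            reversals += 1

print(total, reversals)  # 420 72
```

Notably, all 72 reversals arise from the two constituent sequences with tied rank sums (a, b, b, a and b, a, a, b); a strict constituent ordering cannot reverse here, since the first two elements of each copy always outrank the last two.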

8 Conclusion

In this work, we have used social choice theory and non-parametric statistical theory to examine the Yule-Simpson Aggregation Paradox as it applies to non-parametric statistical tests. As discussed, the Paradox relates to a broader literature concerning the effectiveness of information aggregation mechanisms. Herein, we have shown that the sign test for matched pairs exhibits general consistency upon aggregation. This is the first non-parametric test for which this property has been demonstrated generally, while several non-parametric tests have been shown in the prior literature to not possess this property. This result illustrates that there exists a non-parametric statistical test for which consistency upon aggregation is possible. In the present work, moreover, we find paired matching of ordinal data (across groups) to be important toward the result of general consistency. We prove an additional result as to the nature of sign test consistency. Namely, the sign test p-value exhibits a greater proportional decrease in aggregation as the original (primitive data sign test) p-value itself becomes smaller. In aggregation, then, we expect a greater proportional p-value decline given significant sign test results for the constituent, primitive data (than for insignificant sign test results). This represents something of a strong form consistency result, whereby data aggregation reinforces significant sign test results to a greater (proportional) degree.

Incorporating a generalized form of data aggregation, we also generate preliminary evidence that the sign test possesses a generalized form of consistency upon aggregation, in which the primitive data sets have varying sample sizes. For tractability, n-aggregation has been incorporated as a standard approach in examining the Yule-Simpson Paradox for ordinal data. However, analysis of the generalized, ni-aggregation form is an important consideration. Often, data sets of different sample sizes are pooled (e.g., in the case of many longitudinal and unbalanced panel data sets). While we believe that the result holds for the ni-aggregation problem as well, our proof is incomplete; we can formally claim the consistency only for large samples (see Theorem/Conjecture 3). With the n-aggregation form, we proved the consistency for any n by directly comparing P(Xn ≥ k) and P(XN ≥ rk), where N = rn. This approach does not seem to work when the ni are unequal. We focused on the p = 1/2 case, as our interest was in the p-value, or in the property of the test under the typical null hypothesis. If one is interested in power properties, similar results are needed for p ≠ 1/2. Using the normal approximation, we can conclude that the inequality stated in (11) holds for c > p when n1 and n2 are large and sufficiently far apart.

We further incorporate a generic application that tests the same (aggregated) data by both the sign test and the WMW rank sum test. In the example, the WMW test results exhibit inconsistency upon aggregation, whereas the sign test results demonstrate the predicted consistency upon aggregation. In the application, we further verify that the sign test results exhibit “strong form consistency” in aggregation.

An empirical application obtained from Indiana Boys High School Cross Country running data demonstrates a real-world application exhibiting both rank sum inconsistency and sign test consistency. Future work might evaluate alternative matched pairs style non-parametric tests to determine if there is a family of such tests possessing consistency upon aggregation. Indeed, the property is of central importance. As all data sets can be viewed as a potential aggregation of primitive data sets (i.e., from the power set of the aggregated set of data), consistency tells us generally whether a given statistical test evaluates data in an unambiguous manner.

References

1. Tideman T. N., & Plassmann F. (2013). Developing the aggregate empirical side of computational social choice. Annals of Mathematics and Artificial Intelligence, 68(1-3), 31–64.
2. Yule G. U. (1903). Notes on the theory of association of attributes in statistics. Biometrika, 2(2), 121–134.
3. Yaari G., & Eisenmann S. (2011). The hot (invisible?) hand: can time sequence patterns of success/failure in sports be modeled as repeated random independent trials? PloS One, 6(10), e24532. pmid:21998630
4. Gehrlein W. V., & Plassmann F. (2014). A comparison of theoretical and empirical evaluations of the Borda Compromise. Social Choice and Welfare, 43(3), 747–772.
5. Kock N. (2015). How likely is Simpson’s Paradox in path models? International Journal of e-Collaboration (IJeC), 11(1), 1–7.
6. Haunsperger D. B. (1992). Dictionaries of paradoxes for statistical tests on k samples. Journal of the American Statistical Association, 87(417), 149–155.
7. Matzkin R. L. (1992). Nonparametric and distribution-free estimation of the binary threshold crossing and the binary choice models. Econometrica, 239–270.
8. Wilcoxon F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
9. Pavlides M. G., & Perlman M. D. (2009). How likely is Simpson’s paradox? The American Statistician, 63(3), 226–233.
10. Haunsperger D. B. (1996). Paradoxes in nonparametric tests. Canadian Journal of Statistics, 24(1), 95–104.
11. Kruskal W. H., & Wallis W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621.
12. Pearl J. (2014). Comment: understanding Simpson’s paradox. The American Statistician, 68(1), 8–13.
13. Bhapkar V. P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 61(313), 228–235.
14. Hanson R., Oprea R., & Porter D. (2006). Information aggregation and manipulation in an experimental market. Journal of Economic Behavior & Organization, 60(4), 449–459.
15. Hammond T. H. (2007). Rank injustice?: How the scoring method for cross-country running competitions violates major social choice principles. Public Choice, 133(3-4), 359–375.
16. Datta S., & Satten G. A. (2005). Rank-sum tests for clustered data. Journal of the American Statistical Association, 100(471), 908–915.
17. Charig C. R., Webb D. R., Payne S. R., & Wickham J. E. (1986). Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. Br Med J (Clin Res Ed), 292(6524), 879–882.
18. Gautier E., & Kitamura Y. (2013). Nonparametric estimation in random coefficients binary choice models. Econometrica, 81(2), 581–607.
19. Simpson E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological), 13, 238–241.
20. Bargagliotti A. E. (2009). Aggregation and decision making using ranked data. Mathematical Social Sciences, 58(3), 354–366.
21. Bhattacharya D. (2015). Nonparametric welfare analysis for discrete choice. Econometrica, 83(2), 617–649.
22. De Neve J., & Thas O. (2015). A regression framework for rank tests based on the probabilistic index model. Journal of the American Statistical Association, 110(511), 1276–1283.
23. Gastwirth J. L. (1965). Percentile modifications of two sample rank tests. Journal of the American Statistical Association, 60(312), 1127–1141.
24. Wardrop R. L. (1995). Simpson’s paradox and the hot hand in basketball. The American Statistician, 49(1), 24–28.
25. Rosner B. (2011). Fundamentals of Biostatistics, Seventh Edition. Brooks/Cole, Boston.
26. Bickel P. J., Hammel E. A., & O’Connell J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398–404. pmid:17835295
27. Mann H. B., & Whitney D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60.
28. McNemar Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153–157. pmid:20254758
29. Vartia Y. O. (1983). Efficient methods of measuring welfare change and compensated income in terms of ordinary demand functions. Econometrica, 51(1), 79–98.
30. Kaufmann C., & Weber M. (2013). Sometimes less is more—The influence of information aggregation on investment decisions. Journal of Economic Behavior & Organization, 95, 20–33.
31. Gou J., & Zhang F. (2017). Experience Simpson’s Paradox in the Classroom. The American Statistician, 71(1), 61–66.
32. Hausman J. A., & Newey W. K. (1995). Nonparametric estimation of exact consumers surplus and deadweight loss. Econometrica, 63(6), 1445–1476.
33. Akritas M. G., Antoniou E. S., & Kuha J. (2006). Nonparametric analysis of factorial designs with random missingness: bivariate data. Journal of the American Statistical Association, 101(476), 1513–1526.
34. Evans T. A., Seaton S. E., & Manktelow B. N. (2013). Quantifying the potential bias when directly comparing standardised mortality ratios for in-unit neonatal mortality. PLOS One, 8(4), e61237. pmid:23577213
35. Saari D. G. (1999). Explaining all three-alternative voting outcomes. Journal of Economic Theory, 87(2), 313–355.
36. Haunsperger D. B. (2003). Aggregated statistical rankings are arbitrary. Social Choice and Welfare, 20(2), 261–272.
37. Stringer S., Wray N. R., Kahn R. S., & Derks E. M. (2011). Underestimated effect sizes in GWAS: fundamental limitations of single SNP analysis for dichotomous phenotypes. PloS One, 6(11), e27964. pmid:22140493
38. Selvitella A. (2017). The ubiquity of the Simpson’s Paradox. Journal of Statistical Distributions and Applications, 4(1), 2.
39. Albers C. J. (2015). Dutch research funding, gender bias, and Simpson’s paradox. Proceedings of the National Academy of Sciences, 112(50), E6828–E6829.
40. Miller J. B., & Sanjurjo A. (2018). Surprised by the hot hand fallacy? A truth in the law of small numbers. Econometrica, 86(6), 2019–2047.
41. Hao L., & Houser D. (2015). Adaptive procedures for the Wilcoxon-Mann-Whitney test: Seven decades of advances. Communications in Statistics-Theory and Methods, 44(9), 1939–1957.
42. Bargagliotti A. E., & Greenwell R. N. (2011). Statistical significance of ranking paradoxes. Communications in Statistics: Theory and Methods, 40(5), 916–928.
43. Matzkin R. L. (1994). Restrictions of economic theory in nonparametric methods. Handbook of Econometrics, 4, 2523–2558.
44. Boudreau J., Ehrlich J., Raza M. F., & Sanders S. (2018). The likelihood of social choice violations in rank sum scoring: algorithms and evidence from NCAA cross country running. Public Choice, 174(3-4), 219–238.
45. Mattei N. (2011, October). Empirical evaluation of voting rules with strictly ordered preference data. In International Conference on Algorithmic Decision Theory (pp. 165–177). Springer, Berlin, Heidelberg.
46. Hauert C., De Monte S., Hofbauer J., & Sigmund K. (2002). Volunteering as red queen mechanism for cooperation in public goods games. Science, 296(5570), 1129–1132. pmid:12004134
47. Saari D. G. (1995). A chaotic exploration of aggregation paradoxes. SIAM Review, 37(1), 37–52.
48. Hogg R. V., Fisher D. M., & Randles R. H. (1975). A two-sample adaptive distribution-free test. Journal of the American Statistical Association, 70(351a), 656–661.
49. Koessler F., Noussair C., & Ziegelmeyer A. (2012). Information aggregation and belief elicitation in experimental parimutuel betting markets. Journal of Economic Behavior & Organization, 83(2), 195–208.
50. Briesch R. A., Chintagunta P. K., & Matzkin R. L. (2010). Nonparametric discrete choice models with unobserved heterogeneity. Journal of Business & Economic Statistics, 28(2), 291–307.
51. Bennouri M., Gimpel H., & Robert J. (2011). Measuring the impact of information aggregation mechanisms: An experimental investigation. Journal of Economic Behavior & Organization, 78(3), 302–318.
52. Axelrod B. S., Kulick B. J., Plott C. R., & Roust K. A. (2009). The design of improved parimutuel-type information aggregation mechanisms: Inaccuracies and the long-shot bias as disequilibrium phenomena. Journal of Economic Behavior & Organization, 69(2), 170–181.
53. McFadden D., & Train K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5), 447–470.
54. Haunsperger D. B., & Saari D. G. (1991). The lack of consistency for statistical decision procedures. The American Statistician, 45(3), 252–255.