The aggregation paradox for statistical rankings and nonparametric tests

The relationship between social choice aggregation rules and non-parametric statistical tests has been established for several cases. An outstanding general question at this intersection is whether there exists a non-parametric test that is consistent upon aggregation of data sets (i.e., not subject to Yule-Simpson aggregation paradox reversals for any ordinal data). Inconsistency has been shown for several non-parametric tests, and the property bears fundamentally upon the robustness (ambiguity) of non-parametric test (social choice) results. Using the CDF of a binomial(n, p = 0.5) random variable, we prove that aggregation of r (≥ 2) constituent data sets, each rendering a qualitatively equivalent sign test for matched pairs result, reinforces and strengthens the constituent results (sign test consistency). Further, we prove that the magnitude of sign test consistency strengthens in the significance level of the constituent results (strong-form consistency). We then find preliminary evidence that sign test consistency is preserved under a generalized form of aggregation. Application data illustrate (in)consistency in non-parametric settings, and links with information aggregation mechanisms (as well as paradoxes thereof) are discussed.


Introduction
The relationship between social choice aggregation rules and non-parametric statistical analysis is well-established [see, e.g., 1,2,3,4,5,6,7,8,9]. In a related literature, fundamental relationships between discrete choice and non-parametric statistical analysis have been developed [see 10,11,12,13,14,15,16,17]. While violations of social choice principles can inform us as to the possible paradoxes present in non-parametric tests, the mapping is imperfect due to the issue of statistical significance. Consider a comparison of three groups, where the term "group" refers to sampled elements of the same population. If, for example, a violation of transitivity occurs in a set of raw, ordinal data when comparing three groups pairwise, this does not imply that any sort of paradox is evident in the results of a corresponding set of pairwise significance tests. This distinction is studied prominently in Bargagliotti and Greenwell [9].
In a seminal work, Haunsperger and Saari [4] find that ". . . two or more data sets each may individually support a certain ordering of the samples under [the] Kruskal-Wallis [test], yet their union, or aggregate, yields a different outcome" [18, p. 261]. In other words, the Kruskal-Wallis (KW) test [19] is not necessarily consistent upon aggregation of data. Subsequent works establish an expectation of consistency upon aggregation for such tests. That such an expectation is not met for this set of tests suggests not only that aggregated statistical rankings are potentially arbitrary but that the primitive rankings (or any statistical rankings from the test) are themselves potentially arbitrary. That is, even a primitive data set is an aggregation of data subsets from the primitive set's power set.
While Bargagliotti [34] shows a necessary condition (of the data) to ensure consistent (upon aggregation) results for the KW and Bhapkar's V tests, respectively, the literature has not determined whether inconsistency upon aggregation is a general paradox among non-parametric statistical tests. Both the title of Haunsperger and Saari ("The Lack of Consistency for Statistical Decision Procedures") and that of Haunsperger ("Aggregated Statistical Rankings are Arbitrary") certainly leave open the intriguing possibility of a general result. Selvitella [49] further considers the "ubiquity" of the Yule-Simpson Paradox. Based upon early results within the literature, Haunsperger [18] writes, "Haunsperger and Saari [4] show that this Simpson-like paradox occurs with many, if not most, statistical decision procedures. Because of the results [therein], we must expect KW to be the non-parametric system most immune to such difficulties" [p. 263]. We consider, then, whether general consistency under aggregation is an impossibility among non-parametric tests. Alternative tests, such as the sign test for matched pairs, apply distinct aggregation rules to non-parametric data in order to assign aggregated statistical rankings of groups. Herein, we examine whether the sign test for matched pairs (hereafter, "sign test") is generally consistent upon aggregation, where the sign test represents another central non-parametric test with a social choice aggregation rule analog. It is important to note that the McNemar [50] test is closely related to the sign test for matched pairs. Specifically, the latter test represents an exact test of the former.
The sign test uses a binomial(n, p = 0.5) test statistic to pairwise rank two groups from intergroup (inter-sample) comparisons of matched-pair elements. The sign test statistic is calculated in a manner that is procedurally equivalent to the two-candidate case of Borda rule aggregation and to the two-candidate case of majority rule aggregation (the latter two of which are, then, themselves equivalent to one another in the two-group case). That is to say, each paired-element comparison of the sign test yields a mutually exclusive "vote" in favor of one of the two groups. This is equivalent to two-candidate Borda or two-candidate majority rule voting, whereby each voter makes a binary preference comparison of two candidates and (assuming voting is sincere) casts a mutually exclusive vote in favor of her preferred candidate. The two candidates are then ranked according to majority-rule-type aggregation.
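This procedural equivalence can be sketched in code. The snippet below is our own minimal illustration (not taken from the paper): each matched pair casts a vote exactly as a two-candidate majority-rule voter would, and the winning vote count is referred to a binomial(n, p = 1/2) upper tail. Conventions for the exact sign test p-value (treatment of ties, mid-p corrections) vary across texts, so this is only one common variant.

```python
from math import comb

def sign_test_upper_p(pairs):
    """Upper-tail sign test for matched pairs.

    Each pair (a, b) casts a mutually exclusive "vote" for the group
    whose element ranks higher (a smaller rank number is better);
    ties are discarded, as is standard for the sign test.
    """
    votes_a = sum(1 for a, b in pairs if a < b)
    votes_b = sum(1 for a, b in pairs if b < a)
    n = votes_a + votes_b              # effective number of "voters"
    k = max(votes_a, votes_b)          # winning vote count
    # p = P(X_n >= k) for X_n ~ binomial(n, p = 1/2)
    p = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return k, p

# hypothetical ranks: A's element wins 8 of 9 matched pairs
k, p = sign_test_upper_p([(1, 2)] * 8 + [(2, 1)])
# k = 8, p = P(X_9 >= 8) = 10/512 ≈ 0.0195
```

The vote-counting step is literally two-candidate majority rule; only the final referral of the vote count to a binomial tail is statistical.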
In Section 2 of this paper, we construct a theoretical framework by which to consider whether the sign test is consistent upon aggregation. In Section 3, we utilize this framework to explore the properties of aggregated statistical rankings from the sign test. We ultimately find that statistical rankings from the sign test are consistent upon aggregation, and we present a general proof of the result. That is, we utilize the cumulative distribution function (cdf) of a binomial(n, p = 0.5) random variable to show generally that the aggregation of two or more data sets, each rendering a sign test result of the same quality, can only strengthen the original result (i.e., can only decrease the p-value of the sign test statistic). This result is shown for any number of aggregated (primitive) data sets. We prove an additional result as to the nature of sign test consistency. Namely, the sign test p-value exhibits a greater proportional decrease (in aggregation) as the original (primitive data sign test) p-value itself becomes smaller. In aggregation, then, we expect a greater proportional p-value decline given significant sign test results for the constituent, primitive data (than for insignificant sign test results). This represents something of a strong form consistency result, whereby data aggregation reinforces significant sign test results to a greater (proportional) degree. Lastly, we construct and define a generalized notion of aggregation to consider the aggregation of data sets of different sample sizes. We find preliminary evidence that the sign test is also consistent upon generalized aggregation. Analysis of the generalized form is of great practical importance, as data sets of different sample sizes are often aggregated (e.g., in the case of many longitudinal and unbalanced panel data sets).
The matched pairs nature of the sign test is crucial to the result of consistency upon aggregation. In the aggregation of ordinal or numerical data with pre-assigned matched pairs, data are not aggregated into an overall outcome sequence but, rather, into a larger set of matched pairs. As such, sign test aggregation is reinforcing or consistent rather than potentially inconsistent. We find that it is possible for a non-parametric test to exhibit consistency.
In Section 4, we develop a generic application in which we present a set of ordinal data for two distinct groups. Elements of the data can be paired across group (based, e.g., on some characteristic such as "twin-ness" in a control-treatment study) and analyzed via the sign test (for matched pairs). Alternatively, elements for the respective groups may remain unpaired, and elements can then be aggregated into an outcome or rank-sequence such that overall group rankings are generated from a WMW rank sum test. Within the application, we demonstrate a case of inconsistency upon aggregation for the WMW test. We do so to show readers possible characteristics of ordinal data that are susceptible to inconsistency. For the same ordinal data, we demonstrate an example of the (proved, general) consistency upon aggregation of the sign test. As such, the procedural differences between the two tests are also highlighted (with respect to the consistency property). For the application, we aggregate sets of ordinal data twice, thrice, and four times. It is shown that the sign test p-value shrinks monotonically in the number of data sets aggregated. This follows from the general results of Section 3. For the same ordinal data, we observe a case in which the WMW test p-value rises monotonically in the number of data sets aggregated (as was shown to be possible by Bargagliotti [34]). Section 5 identifies and develops an empirical application from the sport of high school team cross country running, a sport that uses rank sum scoring to map from an individual outcome sequence (of individual finishing positions) to team scores. The empirical application uncovers evidence of rank sum "inconsistency" and sign test consistency for high school team cross country meet data. Section 6 concludes.

Preliminary definitions
Let us begin our exposition with some definitions concerning data aggregation.
Definition of n-aggregation of data: Consider a primitive, n-element data set, A. Let n-aggregation involving A be an aggregation of A and r − 1 (r ∈ ℤ⁺) other data set(s), each of sample size n, to form an aggregated data set.
Haunsperger and Saari [4] and Haunsperger [18] construct a powerful methodology to assess consistency upon aggregation for KW. Namely, they find a condition on a matrix of ordinal data rankings that is equivalent to consistency (mutually exclusive with inconsistency). As they utilize a matrix approach, their methodology considers balanced sample size data aggregation (i.e., n-aggregation). Note that n-aggregation is a restricted form of data aggregation, as it does not consider the aggregation of two or more data sets of different sample sizes. The literature on the non-parametric aggregation paradox has focused on this restricted form of data aggregation to date (see, e.g., any of the seminal papers mentioned in the introduction). This is perhaps due to the tractability of examining this form of aggregation, as we will observe in the present analysis. In this study, our primary theorems will concern n-aggregation of ordinal data. However, we also present preliminary results for a more general version of data aggregation of the following form.
Definition of n_i-aggregation of data: Consider a primitive, n_0-element data set, A. Let n_i-aggregation involving A be an aggregation of A and r − 1 (r ∈ ℤ⁺) other data set(s) of respective sample sizes n_1, n_2, . . ., n_{r−1} to form an aggregated data set.
For tractability, n-aggregation has been incorporated as a standard approach in examining the Yule-Simpson Paradox for ordinal data. However, analysis of the generalized n i -aggregation form is an important consideration. Often, data sets of different sample sizes are pooled (e.g., in the case of many longitudinal and unbalanced panel data sets).
Let us now define consistency upon aggregation largely as in Haunsperger [18]. She states, "A statistical procedure that endows a set of data with an ordinal ranking of the candidates is consistent [upon] aggregation if the aggregate of any [r] sets of data, each of which corresponds to a given ordering of the candidates, gives rise to the same ordering of the candidates for any positive integer [r]" [p. 264]. We refine the definition slightly to incorporate the notion of statistical significance explicitly as follows.
Definition of Consistency upon Aggregation: A statistical procedure that endows a set of data with an ordinal ranking of the candidates is consistent upon aggregation if the aggregate of any r sets of data, each of which corresponds to a given statistically-significant ordering of the candidates, gives rise to the same statistically-significant ordering (i.e., at the same α-level) of the candidates for any positive integer r and for all possible α levels.
Thinking of consistency from the perspective of significance testing, we can evaluate the consistency of a procedure by simply verifying whether it is possible for the p-value of a test result to rise following aggregation. If so, statistically-significant strict orderings can be lost in aggregation such that the procedure is not generally consistent.

Theoretical setup
Let X_n be a binomial random variable with parameters n and p = 1/2, and let N = rn for some integer r > 1. Then, the sign test for matched pairs is consistent upon n-aggregation if and only if (iff), for any k between n/2 and n,

P(X_N ≥ rk) < P(X_n ≥ k). (1)

From (1), it may appear that we are aggregating primitive data sets each of which renders the very same test statistic value, k. In fact, (1) provides a general condition for consistency upon aggregation. Let us think of k as the lowest binomial(n, p = 0.5) test statistic value (from tests of given sets of primitive data) such that a given statistically-significant ordering emerges from the primitive data (i.e., at a stipulated α-level). As such, the term rk represents a lower bound on the aggregated data test statistic value, and the left-hand side p-value in (1) represents an upper bound for the p-value from a test of the aggregated data. Then, (1) states that the p-value for an associated test of the aggregated data will always be less than the highest (but still significant) p-value among the associated primitive data sets.
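Condition (1) is straightforward to probe numerically before turning to the proof. The sketch below is ours (not the paper's): it exhaustively checks the inequality over a small grid of n, r, and all integers k ≥ n/2.

```python
from math import comb

def upper_tail(n, k):
    """P(X_n >= k) for X_n ~ binomial(n, p = 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def consistency_holds(n, r, k):
    """Inequality (1): P(X_{rn} >= r*k) < P(X_n >= k)."""
    return upper_tail(r * n, r * k) < upper_tail(n, k)

# exhaustive check over a small grid, for all integers k >= n/2
assert all(
    consistency_holds(n, r, k)
    for n in range(2, 15)
    for r in range(2, 5)
    for k in range((n + 1) // 2, n + 1)
)
```

The grid bounds are illustrative; Theorem 1 below establishes the inequality for all n, all integers r ≥ 2, and all integers k ≥ n/2.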
To establish (1), we first define

a_j = Σ_{i=0}^{r−1} C(rn, rj + i), b_j = C(n, j), c_j = a_j / b_j, d_j = (Σ_{t=j}^{n} a_t) / (Σ_{t=j}^{n} b_t), (2)

where C(n, k) denotes the binomial coefficient, and

a_j(i) = C(rn, rj + i), i = 0, 1, . . ., r − 1, (3)

with (n − 1)/2 < k ≤ j ≤ n. Here, a_j counts the distinct number of ways of obtaining between rj and rj + r − 1 successes in rn Bernoulli trials, and b_j counts the number of ways of obtaining j successes in n Bernoulli trials. The coefficients c_j and d_j are useful for studying the relative variation in a_j and b_j and their upper-tail partial sums. We now show that c_j, and consequently d_j, strictly increase as j decreases in the range of our interest. Lemma 1 For sequences a_j, b_j, c_j given in (2), and a_j(i) given in (3), the following properties hold.
(a). For each i > 0, a_j(i)/b_j strictly decreases as j increases in [(n − 1)/2 − i/r, n].
(b). a_j(0)/b_j strictly decreases as j increases in the interval ((n − 1)/2, n], and when j = (n − 1)/2 is an integer, the ratio takes equal values at j = (n − 1)/2 and j = (n + 1)/2.
(c). c_j strictly decreases as j increases in ((n − 1)/2, n].
Proof. From the expressions for a_j(i) and b_j given in (3) and (2), respectively, we form the ratio of consecutive terms of a_j(i)/b_j. Canceling out the common factors in the two expressions, we obtain a product condition, (4), for any j ≤ (n − 1) in the specified range. The first factor on the left side of (4) exceeds 1 iff (n − j)i > −i(j + 1). For any j, this is true for all i > 0, and when i = 0, the first factor equals 1. The ratio corresponding to the t-th factor in the product exceeds 1 iff j > (n − 1)/2 − i/r. When j = (n − 1)/2 − i/r is an integer, the product equals 1, and for i > 0 the first factor exceeds 1, so the strict monotonicity of a_j(i)/b_j holds in [(n − 1)/2 − i/r, n]; thus the conclusion in part (a) is established. When j = (n − 1)/2 is an integer, the first factor as well as the product term equal 1, leading us to the conclusion in part (b), and strict monotonicity holds for j > (n − 1)/2.
Since a_j is the sum of the a_j(i), and as long as at least one of the a_j(i)/b_j shows strict monotonicity in j, the strict monotonicity of c_j follows. This claim holds for an integer j = (n − 1)/2 also, completing the proof of part (c).

Lemma 2 For the d_j defined in (2), d_j strictly increases as j decreases toward n/2.
Proof. We use an induction argument. First, we prove that d_n < d_{n−1}. Recall that d_n = a_n/b_n and d_{n−1} = (a_n + a_{n−1})/(b_n + b_{n−1}), where the a_j and b_j are defined in (2). Hence, the condition d_n < d_{n−1} is equivalent to the condition

a_n/b_n < (a_n + a_{n−1})/(b_n + b_{n−1}). (5)

Since the denominators above are positive, (5) holds iff a_n b_n + a_n b_{n−1} < a_n b_n + a_{n−1} b_n, or equivalently, a_n/b_n < a_{n−1}/b_{n−1}. This condition follows from part (c) of Lemma 1, since c_j = a_j/b_j. Next, assuming d_{j+1} < d_j, we prove that d_j < d_{j−1}. From the definition of d_j given in (2), we obtain an equivalent inequality, (6), upon cross-multiplication and cancellation of common terms. Similarly, the condition d_j < d_{j−1} reduces to an inequality, (7), on the upper-tail partial sums. The left-side sum in (7) is bounded by way of (6), and, from part (c) of Lemma 1, it follows that this bound is the right side of the (strict) inequality in (7). This strict inequality (in (7)) also holds for j − 1 = (n − 1)/2 for odd n, and for j − 1 = n/2 for even n. This establishes our claim. Remark 1. This monotonicity property of the ratio of partial sums is perhaps a well-known result in the "inequalities" literature. We have not found a convenient reference and have chosen to prove it with elementary principles; we have also fine-tuned the result for the even-n case.
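The claimed monotonicity can also be verified numerically. The sketch below is ours and assumes our reading of (2): a_j = Σ_{i=0}^{r−1} C(rn, rj+i), b_j = C(n, j), c_j = a_j/b_j, and d_j as the ratio of upper-tail partial sums.

```python
from math import comb

def coefficients(n, r):
    """c_j = a_j / b_j and the partial-sum ratios d_j of (2),
    for j running over the upper range of interest."""
    lo = (n + 1) // 2                  # smallest integer j > (n - 1)/2
    a = {j: sum(comb(r * n, r * j + i) for i in range(r)) for j in range(lo, n + 1)}
    b = {j: comb(n, j) for j in range(lo, n + 1)}
    c = {j: a[j] / b[j] for j in range(lo, n + 1)}
    d = {j: sum(a[t] for t in range(j, n + 1)) / sum(b[t] for t in range(j, n + 1))
         for j in range(lo, n + 1)}
    return c, d

# Lemma 1(c) and Lemma 2: c_j and d_j strictly increase as j decreases
for n in range(2, 12):
    for r in range(2, 5):
        c, d = coefficients(n, r)
        js = sorted(c)                 # ascending j
        assert all(c[js[i]] > c[js[i + 1]] for i in range(len(js) - 1))
        assert all(d[js[i]] > d[js[i + 1]] for i in range(len(js) - 1))
```

Such a check is no substitute for the proof, but it guards the algebra over a small grid of n and r.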

Main results
Theorem 1 Consistency upon Aggregation: Let X_n be a binomial random variable with parameters n and p = 1/2. Then, with N = rn, the inequality in (1) holds for all integers k ≥ n/2 and integers r ≥ 2.
Proof. In terms of the coefficients introduced in (2) and (3),

P(X_N ≥ rk) = 2^{−N} Σ_{j=k}^{n} a_j and P(X_n ≥ k) = 2^{−n} Σ_{j=k}^{n} b_j, (8)

where N = rn. The inequality in (1) can thus be expressed as

P(X_N ≥ rk) / P(X_n ≥ k) = 2^{n−N} d_k < 1. (9)

We have seen in Lemma 2 that d_k strictly increases as k decreases toward n/2. Thus, the maximum value for the ratio of the two binomial probabilities in (9) is attained with k = (n + 1)/2 for odd n and k = n/2 for even n. We now show that this maximum value is less than 1 for both odd and even n.
When n is odd and n = 2m − 1, we take k = m = (n + 1)/2 in (9). By the symmetry of the binomial probabilities, P(X_n ≥ m) = 1/2. Now N (= rn) may be even or odd. When N is even, r (≥ 2) is also even, and rn/2 is the median of X_N. Further, P(X_N ≥ rn/2) > 1/2 and P(X_N ≥ rn/2 + 1) < 1/2. Thus, P(X_N ≥ rm) = P(X_N ≥ rn/2 + r/2) ≤ P(X_N ≥ rn/2 + 1) < 1/2 = P(X_n ≥ m), where the first inequality holds since r/2 ≥ 1. When N (= rn) is odd, r (≥ 3) is also odd, and P(X_N ≥ (rn + 1)/2) = 1/2. Thus, P(X_N ≥ rm) = P(X_N ≥ (rn + r)/2) < P(X_N ≥ (rn + 1)/2) = 1/2 = P(X_n ≥ m), where the strict inequality holds since (rn + r)/2 and (rn + 1)/2 are both integers and the difference (r − 1)/2 ≥ 1. Thus, the strict inequality in (9) holds for all k ≥ n/2 when n is odd. When n is even and n = 2m, we take k = n/2 = m, and note that N is always even. By the symmetry of the binomial probabilities, P(X_n ≥ m) = (1 + P(X_n = m))/2. Thus, the strict inequality in (9) holds iff

P(X_N ≥ rm) < (1 + P(X_n = m))/2. (10)

We will now establish (10). Note that X_N has the same distribution as X_n + Y, where Y is binomial with parameters N − n and p = 1/2, and X_n and Y are independent.
A direct conditioning computation then establishes (10). That is, (9) holds for all k ≥ n/2 when n is even as well.
A similar result on the lower tail probabilities can be established by symmetry. As such, we have shown that any significant sign test result is consistent upon n-aggregation. This result has important implications for classic results (e.g., by [31] and [33]). If we apply the sign test to the underlying data of these studies, while normalizing the sample size to equality across group and across treatment, it is not possible for an aggregation paradox to occur. Later in the paper, we will consider whether the sign test is susceptible to instances of aggregation paradox given unequal sample sizes across group. We can refine Theorem 1 by observing that we have proved a great deal more. In fact, we have established the following.
Theorem 2 Strong Form Consistency upon Aggregation: Let X_N be a binomial random variable with parameters N = rn and p = 1/2, where n ≥ 1 and r ≥ 2 is an integer. Then, as the integer k decreases in [n/2, n], P(X_N ≥ rk)/P(X_n ≥ k) monotonically increases with a strict upper bound of 1.
That the ratio is bounded by 1 follows from Theorem 1. Theorem 2 tells us that the n-aggregated-data p-value rises toward the primitive-data p-value as the primitive-data p-value itself rises. In other words, n-aggregation creates a greater proportional decrease in the sign test p-value as the original (primitive) p-value itself becomes smaller. In aggregation, then, we expect a greater proportional p-value decline given significant sign test results for the constituent, primitive data (than for insignificant sign test results). This represents something of a strong form consistency result, whereby data aggregation reinforces significant sign test results to a greater (proportional) degree. A more general result of the following form is highly desirable.
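The monotone-ratio behavior of Theorem 2 can be illustrated numerically. The sketch below is our own, computing the ratio for the illustrative choice n = 9 and r = 2 as k falls from n to ⌈n/2⌉.

```python
from math import comb

def tail(n, k):
    """P(X_n >= k) for X_n ~ binomial(n, p = 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def ratios(n, r):
    """P(X_{rn} >= r*k) / P(X_n >= k) for k = n, n - 1, ..., ceil(n/2)."""
    ks = range(n, (n + 1) // 2 - 1, -1)
    return [tail(r * n, r * k) / tail(n, k) for k in ks]

rs = ratios(9, 2)
# the ratio rises monotonically as k falls, yet stays strictly below 1
assert all(x < y for x, y in zip(rs, rs[1:]))
assert rs[-1] < 1
```

A small ratio means aggregation shrinks the p-value by a large proportional factor; the most significant primitive results (largest k) are thus the most strongly reinforced.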
Theorem 3 (Conjecture): Let X_N denote a binomial random variable with parameters N and p = 1/2. Then, for positive integers n_1 < n_2 and a constant c ∈ [1/2, 1] such that cn_1 and cn_2 are integers, the following inequality holds:

P(X_{n_2} ≥ cn_2) < P(X_{n_1} ≥ cn_1). (11)
We are able to establish this claim for the boundary values of c (that is, c = 1/2 or 1) using the arguments presented earlier, and conjecture that the result is true for other c values in [1/2, 1]. Limited computational work supports this conjecture. Further, a large-sample approximation, described below, leads us to believe in the conjecture, especially when n 1 and n 2 are large and far apart.
From the central limit theorem, as n → ∞, we know that √n (X_n/n − 1/2) →_d N(0, 1/4). Further, the binomial distribution being already unimodal and symmetric (as p = 1/2), the convergence is fast. So, if n is large, we have the following approximation:

P(X_n ≥ cn) ≈ 1 − Φ(2√n (c − 1/2)), (12)

where Φ is the cdf of a standard normal random variable. This "close" approximation to the tail probability strictly decreases as n increases as long as c > 1/2, and hence we believe (11) holds for large n_1 < n_2. Remark 2. When c = 1/2 and n_1 and n_2 are odd integers, the strict inequality claimed in (11) does not hold, and each of the probabilities equals 1/2, as for an odd integer n, P(X_n ≥ n/2) = P(X_n ≥ (n + 1)/2) = 1/2. Also, it follows from (12) that, irrespective of the odd or even nature of n, the limiting value of this probability is 1/2.
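A small computation supports both the conjecture and the approximation in (12). The values of c and n below are illustrative choices of ours, selected so that cn is an integer.

```python
from math import comb, erf, sqrt

def tail(n, k):
    """P(X_n >= k) for X_n ~ binomial(n, p = 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def norm_sf(z):
    """1 - Phi(z) for a standard normal variable, via erf."""
    return 0.5 * (1 - erf(z / sqrt(2)))

c = 0.6                               # illustrative c in (1/2, 1]
ns = [10, 20, 50, 100]                # chosen so that c*n is an integer
exact = [tail(n, round(c * n)) for n in ns]
approx = [norm_sf(2 * sqrt(n) * (c - 0.5)) for n in ns]

# both the exact tail and approximation (12) fall as n grows, as (11) predicts
assert all(x > y for x, y in zip(exact, exact[1:]))
assert all(x > y for x, y in zip(approx, approx[1:]))
```

This is consistent with, but of course does not prove, the conjectured inequality (11) for general n_1 < n_2.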

Tests of a primitive, ordinal data set
Consider two groups, A and B. Each group consists of nine elements or data points (i.e., n/2 = 9 or n = 18). That is, A = {a_1, a_2, . . ., a_9} and B = {b_1, b_2, . . ., b_9} such that the primitive (unaggregated) data for the WMW rank sum test set-up is an outcome sequence of 18 ordinal (rank-ordered) data points. For the application, let

F_AB = (a_1, a_2, . . ., a_8, b_1, b_2, . . ., b_9, a_9)

represent the ordinal data (sequence). For the sign test set-up, each a_i ∈ A is matched with a counterpart b_j ∈ B according to the matching criterion. As we are a priori unsure (notationally) which b_j will be matched with each given a_i, we say that each a_i is matched with some b_π(i), where π: D → D is defined as a bijection between D = {1, 2, 3, . . ., 8, 9} and itself such that element a_i ∈ A and element b_π(i) ∈ B uniquely pair with one another for all i ∈ D. As such, the following matched pairs data set is generated (uniquely in the present example) from the outcome sequence F_AB: M_AB = {(a_1, b_π(1)), (a_2, b_π(2)), (a_3, b_π(3)), . . ., (b_π(9), a_9)}. The set M_AB is a set of rank-ordered (matched) pairs of elementary data points. That is to say, M_AB is a set whose elements are outcome sequences on matched pairs. If an element of M_AB is represented as (a_i, b_j), this is equivalent to finding that a_i ≻ b_j ("a_i ranks higher than b_j") in outcome sequence F_AB. If an element of M_AB is represented as (b_j, a_i), this is equivalent to saying that b_j ≻ a_i in outcome sequence F_AB. As such, F_AB and M_AB are different representations of the same ordinal data set. Whereas F_AB is a sequence representation of the data, M_AB is a matched-pairs representation. For the application data, but not generally, M_AB is uniquely derived from F_AB. That is to say, the elements of F_AB are sequenced such that the ordering of any given matched pair in F_AB is invariant to the specific match formed. That is, when i ∈ {1, 2, 3, . . ., 8}, we know that a_i ≻ b_j regardless of which j is paired with i.
When i = 9, we know that b_j ≻ a_i regardless of which j is paired with i. For the application data (but not generally), one can uniquely derive M_AB from F_AB. However, it is never possible to derive F_AB uniquely from M_AB. That is to say, F_AB sometimes maps to a unique M_AB. Given only M_AB, however, one can never map to a unique F_AB. In such a case, one is always missing some rank comparison information needed to support a super-ordering or outcome sequence of the overall data. Specifically, one cannot obtain rank comparisons of unmatched elements (a_i and b_π(j), i ≠ j). From F_AB, we can calculate rank sum scores for A and B, R_{F_AB}(A) and R_{F_AB}(B). Generally, we define R_{F_XY}(·) as follows. Consider two groups, X and Y, and let F_XY represent an outcome sequence between elements of the two groups. For each element x_i ∈ X, let x_i^+ denote the set of elements that rank above x_i in F_XY. Then, the rank of x_i in the sequence F_XY, r(x_i | F_XY), equals |x_i^+| + 1, and the rank sum score for group X given F_XY is represented as R_{F_XY}(X) = Σ_{x_j ∈ X} r(x_j | F_XY). Given the outcome sequence F_AB specified earlier in this application, we obtain R_{F_AB}(A) = 54 and R_{F_AB}(B) = 117 and apply the WMW rank sum test. Let R_{F_AB} = max {R_{F_AB}(A), R_{F_AB}(B)}. Then, the upper-tail WMW rank sum p-value for this sample difference, P(R_{F_AB} ≥ 117 | n = 18), is equal to 0.0027.
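The rank sum calculation can be reproduced with a short sketch. The helper function and group labels are ours; the sequence encodes the configuration described in the text (A's elements occupying the first eight positions, B's the next nine, and A's weakest element last), which yields the rank sum of 117 referenced in the WMW test.

```python
def rank_sums(sequence):
    """Rank sum score per group from an outcome sequence: the element
    in position i receives rank i + 1, and a group's score is the sum
    of its elements' ranks (a lower score is better)."""
    scores = {}
    for position, group in enumerate(sequence, start=1):
        scores[group] = scores.get(group, 0) + position
    return scores

# F_AB: A's elements in the first eight positions, B's nine next,
# and A's weakest element (a_9) last
F_AB = ["A"] * 8 + ["B"] * 9 + ["A"]
print(rank_sums(F_AB))                # {'A': 54, 'B': 117}
```

The total of all ranks is 1 + 2 + . . . + 18 = 171, so the two group scores necessarily sum to 171.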
Then, we conclude from a WMW test of the primitive data that A ≻_{α=0.01} B, where the notation "≻_{α=0.01}" reads "ranks significantly higher than at the α = 0.01 significance level." In other words, A has a significantly lower rank sum score than B such that we conclude from F_AB that group A ranks significantly higher than group B at the assigned significance level.
Let us consider the same data in matched pairs form, M_AB. Of the 9 matched pairs, an element of A outranks the corresponding element of B for 8 of 9 matched pairs. Formally, let us define the sign test statistic value for A, S_{M_AB}(A), as the number of elements a_i ∈ A such that a_i ≻ b_π(i): S_{M_AB}(A) = |{a_i ∈ A : a_i ≻ b_π(i)}|. Therefore, we obtain the sign test values for the primitive data as S_{M_AB}(A) = 8 and S_{M_AB}(B) = 1, and we let S_{M_AB} = max {S_{M_AB}(A), S_{M_AB}(B)}. The upper-tail sign test for matched pairs p-value for this sample difference, P(S_{M_AB} ≥ 8 | n/2 = 9), is equal to 0.0098. Then, we conclude from a sign test for matched pairs of the primitive data that A ≻_{α=0.01} B. For this primitive application data, the WMW test and the sign test each provide corresponding results at all standard significance levels.

Tests of aggregated, ordinal data
Now, let us replicate the (ordinality of the) primitive data once, combine the primitive data and its ordinal replicate, and re-test the aggregated data. One should note that an ordinally-replicated data set may be quite different numerically from the primitive (source) data from which it arises, such that two ordinally-identical data sets can combine to form various, feasible aggregated sequences. The aggregated outcome sequence consists of 36 rank-ordered elements. One possible aggregated outcome sequence is

FF′_AB = (a_1, . . ., a_8, b_1, . . ., b_9, a′_1, . . ., a′_8, b′_1, . . ., b′_9, a′_9, a_9),

where the replicate sequence F′_AB is ordinally equivalent to F_AB. In this aggregated sequence, we have simply spliced the replicate sequence, F′_AB, into F_AB such that F′_AB is strung (consecutively) between the penultimate and ultimate elements of the original sequence. The implication of this aggregated outcome sequence is that, for the most part, the numerical values of the original data outrank those of the ordinal-replicate data (despite the identical ordinal quality of the two data sets). From FF′_AB, we find that R_{FF′_AB}(A) = 279 and R_{FF′_AB}(B) = 387. The upper-tail WMW rank sum p-value for this sample difference, P(R_{FF′_AB} ≥ 387 | N = 2n = 36), is equal to 0.0438. At the α = 0.01 significance level, then, we conclude from a WMW test of the aggregated data that A ∼_{α=0.01} B (i.e., that A and B are rank-indistinguishable at the α = 0.01 significance level), where "∼_{α=0.01}" reads "is rank-indistinguishable from at the α = 0.01 significance level." We now consider a sign test for matched pairs upon the same aggregated data. Specifically, we combine M_AB and its replicate into the same data set to obtain the set MM′_AB = {(a_1, b_π(1)), (a_2, b_π(2)), (a_3, b_π(3)), . . ., (b_π(9), a_9), (a′_1, b′_π(1)), (a′_2, b′_π(2)), (a′_3, b′_π(3)), . . ., (b′_π(9), a′_9)}. As M_AB is an unordered set of matched pairs (i.e., the set of elements is unordered, whereas each element is itself an ordered pair), the aggregation of M_AB and its ordinal replicate, M′_AB, is also an unordered set of matched pairs.
As such, MM′_AB features the same matched pair ordinal comparisons as does M_AB, while representing each such pair twice. Matches for the sign test are pre-assigned at the primitive data level according to the (also pre-assigned) matching criterion (e.g., "twinness"). Therefore, matched pairings are preserved in aggregation. This can be said in a more straightforward manner: the identity of one's twin (match), once assigned, is invariant to the level of data aggregation. Matched pairing preservation in data aggregation is an important property that distinguishes the sign test from independent-sample non-parametric tests. As the profile of matched pairings is itself a feature of the primitive data in the case of the sign test (in much the same way that the primitive data outcome sequence is a feature of the primitive data in the case of a WMW test), one would not be strictly aggregating the primitive data if one were to permute the matched pairings in aggregation. Rather, one would thereby change a feature of the primitive data at the same time (in the same manner that a re-ordering of primitive data would go beyond aggregation in the case of a WMW test). Given this feature of the sign test, the matched pair elements of M_AB and M′_AB, respectively, are not rank-compared with one another (across primitive group) within a sign test analysis of MM′_AB. Of the eighteen matched pairs in the once ordinally-replicated data, an element of A outranks the corresponding element of B for sixteen matched pairs. Therefore, we have the following sign test statistic for the aggregated data: S_{MM′_AB} = max {S_{MM′_AB}(A), S_{MM′_AB}(B)} = 16. Then, the upper-tail sign test for matched pairs p-value for this test statistic value, P(S_{MM′_AB} ≥ 16), is equal to 0.000484. Then, we conclude from a sign test for matched pairs of the aggregated data that A ≻_{α=0.01} B. For the once ordinally-replicated data, the sign test retains significance at the α = 0.01 level.
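The aggregated rank sums can be reproduced with a short sketch of ours. The splice places the ordinal replicate between the penultimate and ultimate elements of the primitive sequence, as described in the text; note that the matched-pairs representation is unaffected by where the replicate is spliced, so the sign statistic simply doubles to 16 of 18.

```python
def rank_sums(sequence):
    """Rank sum score per group from an outcome sequence."""
    scores = {}
    for position, group in enumerate(sequence, start=1):
        scores[group] = scores.get(group, 0) + position
    return scores

# splice the ordinal replicate between the penultimate and ultimate
# elements of the primitive sequence, as described in the text
F_AB = ["A"] * 8 + ["B"] * 9 + ["A"]
FF_AB = F_AB[:-1] + F_AB + F_AB[-1:]
print(rank_sums(FF_AB))               # {'A': 279, 'B': 387}
```

The interposed replicate elements inflate B's rank sum disproportionately, which is exactly the mechanism behind the WMW test's loss of significance upon aggregation.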
Consistent with our general results, the sign test p-value diminishes with data aggregation in this case. On the other hand, the WMW test loses significance at the α = 0.01 level when applied to the once ordinally-replicated data.
To consider the result of Theorem 2, we alter M_AB as follows.
M̃_AB = {(ã_1, b̃_π(1)), (ã_2, b̃_π(2)), (a_3, b_π(3)), . . ., (b̃_π(8), ã_8), (b̃_π(9), ã_9)}. We also alter M′_AB in the corresponding manner. This result illustrates the (general) property shown in Theorem 2. Namely, the sign test p-value exhibits a greater proportional decrease when replicated as the p-value for the sign test of the primitive data decreases (i.e., for more significant sign test results of the primitive data). Not only does the sign test exhibit consistency upon aggregation; sign test results are more strongly reinforced in aggregation, in terms of proportional p-value decrease, as the original sign test p-value (of the primitive data) decreases.
Returning to F_AB, M_AB, FF′_AB, and MM′_AB, we iterate this aggregation to combine three and then four instances of the primitive ordinal data as follows.
The case of four ordinal replicates aggregated proceeds analogously. A WMW reversal becomes possible when a group's rank sum "falls behind" (i.e., has fewer elements represented) at some point in the outcome sequence. In the application, group A has a superior rank sum in the primitive outcome sequence. However, group A also possesses the weakest overall element in the primitive outcome sequence. This makes group A's significant rank superiority in the primitive outcome sequence vulnerable upon aggregation. In the aggregated data, if sufficiently many elements position after element b_9 but before element a_9 (as did occur), this particular interposition disproportionately inflates group A's rank sum score. Such an outcome is not possible for matched pairs ordinal data aggregation. In that case, pairs remain assigned such that the primitive ordinal data and its replicate(s) reinforce one another.

(Empirical) Application 2: The aggregation paradox in ranking cross-country running team performance
As in Hammond [51] and Boudreau et al. [52], we use the setting of team cross country races to understand aggregation properties of ordinal data. In a cross country running meet, rank sum scoring is used as the official scoring methodology to assign team rankings. Boudreau et al. [52] describe the scoring methodology in detail: In cross country running, for example, teams are compared on the performances of individual runners in a race; the outcome sequence is the overall ranking as to how runners finish. Rank sum scoring means that each group (team) receives a number of points equal to an element-wise ranking in the outcome sequence (1 for 1st place, 2 for 2nd, and so on), and comparison of groups is based on the sum of their element scores: groups with lower scores rank above those with higher scores [p. 220].
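The rank sum scoring rule quoted above is straightforward to implement. A minimal sketch (function name ours), where an outcome sequence is given as a string of team labels in finishing order:

```python
def rank_sum_scores(outcome: str) -> dict:
    """Rank sum score per team: each finisher earns points equal to his
    position (1-indexed) in the outcome sequence; lower totals rank higher."""
    scores: dict = {}
    for place, team in enumerate(outcome, start=1):
        scores[team] = scores.get(team, 0) + place
    return scores

# e.g., team C finishes 1st, 2nd, and 5th; team S finishes 3rd, 4th, and 6th
print(rank_sum_scores("CCSSCS"))  # → {'C': 8, 'S': 13}
```

Here C's total of 8 beats S's total of 13, since lower rank sums rank higher.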
In the present application, we have identified two high school cross country programs for comparison: Carmel High School (Carmel, Indiana) and Fishers High School (Fishers, Indiana). We consider these two schools because their respective Varsity (Boys) teams competed against one another during the 2016 season, as did their respective Junior Varsity (Boys) teams. The two meets (the Varsity Indiana High School Athletic Association Noblesville Sectional meet and the Junior Varsity Hamilton County meet) occurred on the same running course (White River Elementary School Course; Noblesville, Indiana), such that the aggregation of the two data sets is valid (i.e., runners across meets faced the same basic performance elements). Moreover, Carmel's Varsity and Junior Varsity teams won each respective meet by the very same convincing margin when rank sum scored against Fishers without the consideration of third-party teams (as in the WMW test). That is to say, respective pairwise rank sum scores were the same for each school in each race. Lastly, this particular setting was chosen because it provides something of a natural experiment by which we can compare the two Cross Country programs at different levels of aggregation (i.e., a comparison of Varsity teams, a comparison of Junior Varsity teams, and an aggregated comparison of the two programs across the levels of competition) under the same basic competitive conditions. From the finishing time results, we obtain the observed primitive outcome sequence for the Varsity meet, F_V(C, S), where F_V(C, S) symbolizes the observed Varsity race outcome sequence for our two Varsity teams, C = {c_1, c_2, c_3, c_4, c_5, c_6, c_7} is a set representing the 7-element (7-runner) Carmel Varsity team, and S = {s_1, s_2, s_3, s_4, s_5, s_6, s_7} is a set representing the 7-element (7-runner) Fishers Varsity team.
The resulting rank sum scores follow. As with its constituent outcome sequences, FF′_VJV represents the true (observed) outcome sequence when the two meets (races) are pooled and runners are sequenced in ascending order of finishing time. Given the observed primitive race sequences, the pooled (aggregated) outcome sequence is simply the 14 Varsity elements (runners) in their primitive order followed by the 14 Junior Varsity elements. In other words, the first element in outcome sequence F_JV follows the final element in outcome sequence F_V. Each meet represents a distinct level of competition by which these two cross country programs are ranked against one another.
Given that R′_VJV(C) = 170 and R′_VJV(F) = 236, the WMW rank sum p-value for this sample difference, P(R′_VJV ≥ 236 | n = 28), is equal to 0.069. In a WMW rank sum test of the aggregated data, then, we conclude that (C *_{α=0.025} S), where "*_{α=0.025}" reads "ranks indifferently to at the α = 0.025 significance level." In fact, we also find that (C *_{α=0.05} S). At the α = 0.05 significance level or any smaller α-level, that is, we conclude from the aggregated data that Carmel's 2016 Boys Cross Country program was not significantly different from that of Fishers. We obtain this result despite finding significant evidence that each of Carmel's Boys Cross Country (primitive or constituent) teams (i.e., the Varsity and Junior Varsity teams) separately ranks significantly higher than its respective counterpart team at Fishers High School. In other words, we find an empirical example of an aggregation paradox when comparing these two Cross Country programs. This aggregation paradox is fairly extreme. The test of aggregated data loses significance not only at the α = 0.025 level but also at the α = 0.05 level. Further, the p-value for the aggregated data is more than the sum of the p-values for the two constituent data sets! Of course, there are alternatives in sport to rank sum scoring. For example, team tennis matches typically employ a methodology that is consistent with the sign test methodology. For a dual team tennis match, each team assigns an intra-squad rank to each player or set of players, and each player for a given team is match-paired to play against the player on the opposing squad who possesses the same intra-squad rank. Matches then take place, and a team winner is assigned based on majority rule aggregation of individual match results. We can apply such a scoring rule to the case of team cross country.
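The aggregated WMW rank sum p-value reported above can be approximated without tables via the normal approximation to the rank sum distribution. The sketch below assumes a one-sided, continuity-corrected test (the exact method behind the reported 0.069 may differ slightly); the function name is ours.

```python
from math import erfc, sqrt

def wmw_upper_p(rank_sum: float, n1: int, n2: int) -> float:
    """One-sided P(R >= rank_sum) for the WMW rank sum of a group of size n2,
    via the normal approximation with continuity correction."""
    u = rank_sum - n2 * (n2 + 1) / 2             # convert rank sum to U statistic
    mu = n1 * n2 / 2                             # mean of U under the null
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard deviation under the null
    z = (u - 0.5 - mu) / sigma                   # continuity-corrected z-score
    return 0.5 * erfc(z / sqrt(2))               # upper-tail normal probability

# Aggregated comparison: rank sum 236 for one program, 14 runners per program.
p = wmw_upper_p(236, 14, 14)
print(round(p, 3))
```

The approximation lands near the reported value of 0.069, well above both the α = 0.025 and α = 0.05 thresholds.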
More specifically, we can apply this sign test type team tennis scoring to the application data from the two high school cross country meets considered. In sign test type team tennis scoring, each Carmel runner defeated his assigned pair on the Fishers team (for each of the two competition levels). As such, we have that P(S_{M_V} ≥ 7 | n = 7) = (1/2)^7 ≈ 0.0078.
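Under this sign test type scoring, Carmel wins all 7 pairings at each competition level, and hence all 14 pairings in the aggregated comparison. A quick sketch (function name ours) confirms that, unlike the WMW result above, the sign test result strengthens upon aggregation:

```python
from math import comb

def sign_test_p(wins: int, n: int) -> float:
    """Upper-tailed sign test p-value: P(X >= wins) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

p_level = sign_test_p(7, 7)      # one competition level: 7 of 7 pairs won
p_pooled = sign_test_p(14, 14)   # both levels pooled: 14 of 14 pairs won
print(p_level, p_pooled)         # the pooled p-value is far smaller
```

Whereas the aggregated WMW test loses significance, the pooled sign test p-value of (1/2)^14 is orders of magnitude below the constituent level's (1/2)^7.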

Application 3: Aggregation properties of the binomial test for equality of dependent proportions
McNemar's test for the equality of proportions of matched pairs considers the discordant pairs with responses (Yes, No) and (No, Yes). Let n be the total number of discordant pairs, and X be the number of discordant pairs with response (Yes, No). Then the test statistic is given by T = {|X − (n − X)| − 1}/√n, and the null hypothesis is rejected if T² is too large (see, for example, Rosner [53], p. 375). Under the null hypothesis, conditioned on n, X has a Binomial distribution with parameters n and p = 1/2. When n is small (say, under 20), the exact Binomial distribution of X is used ([53], p. 377) to compute the p-value, and the null hypothesis of equality of proportions is rejected when X is too small or too large. When n is large, the chi-square approximation with 1 degree of freedom is used, and the upper tail probability associated with the observed T² is used as the p-value. Thus, when the numbers of discordant pairs in two data sets are equal, say n each, we can use Theorem 2 to establish strong form consistency of the test upon aggregation. When the numbers of discordant pairs vary, at least when the n_i are large, (11) can be used to establish consistency upon aggregation.

Application 4: Computation of Yule-Simpson Paradox for a case of rank sum scoring
Here, we consider a case of rank sum scoring in which two groups, each with two elements, are compared. The outcome of the comparison is a 4-element sequence (e.g., a, b, b, a). There are 4!/(2!2!) = 6 such sequences. For each sequence, we replicate that same sequence and pool the two ordinally-identical data sets to create every possible pooled sequence of eight elements for the two groups that preserves the within-sample ordering for each original sequence. For example, a, b, b, a can be replicated and pooled with its replicate, a′, b′, b′, a′, to create a′, b′, b′, a′, a, b, b, a or, alternatively, a, b, b, a′, a, b′, b′, a′. For each of the six sequences in this case, there are 70 possible poolings. This can be easily verified via a decision tree. Essentially, there are up to 5 "bins" in which to place elements of the replicate data set into the original data set, but the number of bins for a given element is constrained because one cannot re-order the original, constituent sequences. A decision tree shows that there are C(8,4) = 70 possible poolings for each original sequence (equivalently, the number of ways to distribute the 4 replicate elements among the 5 bins). Across all 6 original sequences, there are 70 × 6 = 420 possible poolings. We find that 72 of these poolings (17.14 percent) result in an instance of Yule-Simpson Paradox of the following type: one group has at least as low a rank sum score as the other in the constituent data set, but this is not the case in the pooled data set. This result suggests that instances of Yule-Simpson Paradox occur fairly regularly in small sample cases. Qualitatively, the result is in line with previous small sample results on Yule-Simpson Paradox incidence for other tests (see [28]; [54]).
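The enumeration above can be reproduced by brute force. The sketch below generates, for each of the 6 constituent sequences, all order-preserving interleavings with its replicate and counts poolings in which a weakly lower constituent rank sum becomes strictly higher in the pool; this reflects our reading of the paradox definition, and the function names are ours (the paper reports 72 of 420 poolings, or 17.14 percent).

```python
from itertools import combinations

def interleavings(seq):
    """All poolings of seq with its (ordinally identical) replicate that
    preserve each copy's internal ordering."""
    n = len(seq)
    pools = []
    for spots in combinations(range(2 * n), n):   # positions taken by the original copy
        pool = [None] * (2 * n)
        it1, it2 = iter(seq), iter(seq)           # replicate has the same label sequence
        for i in range(2 * n):
            pool[i] = next(it1) if i in spots else next(it2)
        pools.append(pool)
    return pools

def rank_sum(outcome, team):
    """Rank sum score: sum of 1-indexed positions occupied by the team."""
    return sum(i + 1 for i, t in enumerate(outcome) if t == team)

sequences = ["aabb", "abab", "abba", "baab", "baba", "bbaa"]
total = paradoxes = 0
for seq in sequences:
    ra, rb = rank_sum(seq, "a"), rank_sum(seq, "b")
    for pool in interleavings(list(seq)):
        total += 1
        pa, pb = rank_sum(pool, "a"), rank_sum(pool, "b")
        # paradox: weakly lower constituent rank sum, strictly higher pooled rank sum
        if (ra <= rb and pa > pb) or (rb <= ra and pb > pa):
            paradoxes += 1

print(total, paradoxes)
```

Counting replicate elements as distinguishable, each constituent sequence admits exactly C(8,4) = 70 poolings, for 420 in total, matching the decision-tree count in the text.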

Conclusion
In this work, we have used social choice theory and non-parametric statistical theory to examine the Yule-Simpson Aggregation Paradox as it applies to non-parametric statistical tests. As discussed, the Paradox relates to a broader literature concerning the effectiveness of information aggregation mechanisms. Herein, we have shown that the sign test for matched pairs exhibits general consistency upon aggregation. This is the first non-parametric test for which this property has been demonstrated generally, whereas several non-parametric tests have been shown in the prior literature not to possess it; the result establishes that consistency upon aggregation is possible for a non-parametric statistical test. Moreover, we find paired matching of ordinal data (across groups) to be important toward the result of general consistency. We prove an additional result as to the nature of sign test consistency. Namely, the sign test p-value exhibits a greater proportional decrease in aggregation as the original (primitive data sign test) p-value itself becomes smaller. In aggregation, then, we expect a greater proportional p-value decline given significant sign test results for the constituent, primitive data than given insignificant sign test results. This represents something of a strong form consistency result, whereby data aggregation reinforces significant sign test results to a greater (proportional) degree.
Incorporating a generalized form of data aggregation, we also generate preliminary evidence that the sign test possesses a generalized form of consistency upon aggregation, in which the primitive data sets have varying sample sizes. For tractability, n-aggregation has been incorporated as a standard approach in examining the Yule-Simpson Paradox for ordinal data. However, analysis of the generalized, n_i-aggregation form is an important consideration. Often, data sets of different sample sizes are pooled (e.g., in the case of many longitudinal and unbalanced panel data sets). While we believe that the result holds for the n_i-aggregation problem as well, our proof is incomplete; we can formally claim consistency only for large samples (see Theorem/Conjecture 3). With the n-aggregation form, we proved consistency for any n by directly comparing P(X_n ≥ k) and P(X_N ≥ rk), where N = rn. This approach does not seem to work when the n_i are unequal. We focused on the p = 1/2 case, as our interest was in the p-value, that is, in the property of the test under the typical null hypothesis. If one is interested in power properties, similar results are needed for p ≠ 1/2. Using the normal approximation, we can conclude that the inequality stated in (11) holds for c > p when n_1 and n_2 are large and sufficiently far apart.
We further incorporate a generic application that demonstrates tests of the same (aggregated) data by the sign test and by the WMW rank sum test, respectively. In the example, the WMW test results exhibit inconsistency upon aggregation. Sign test results demonstrate (the predicted) consistency upon aggregation, however. In the application, we further verify that the sign test results exhibit "strong form consistency" in aggregation.
An empirical application obtained from Indiana Boys High School Cross Country running data demonstrates a real-world application exhibiting both rank sum inconsistency and sign test consistency. Future work might evaluate alternative matched pairs style non-parametric tests to determine if there is a family of such tests possessing consistency upon aggregation. Indeed, the property is of central importance. As all data sets can be viewed as a potential aggregation of primitive data sets (i.e., from the power set of the aggregated set of data), consistency tells us generally whether a given statistical test evaluates data in an unambiguous manner.