Factoring a 2 × 2 contingency table

We show that a two-component proportional representation provides the necessary framework to account for the properties of a 2 × 2 contingency table. This corresponds to the factorization of the table as a product of proportion and diagonal row or column sum matrices. The row and column sum invariant measures for proportional variation are obtained. Geometrically, these correspond to displacements of two point vectors in the standard one-simplex, which are reduced to a center-of-mass coordinate representation, (δ, μ) ∈ ℝ². Then, effect size measures, such as the odds ratio and relative risk, correspond to different perspective functions for the mapping of (δ, μ) to ℝ¹. Furthermore, variations in δ and μ will be associated with different cost-benefit trade-offs for a given application. Therefore, pure mathematics alone does not provide the specification of a general form for the perspective function. This implies that the question of the merits of the odds ratio versus relative risk cannot be resolved in a general way. Expressions are obtained for the marginal sum dependence and the relations between various effect size measures, including the simple matching coefficient, odds ratio, relative risk, Yule's Q, ϕ, and Goodman and Kruskal's $\tau_{c|r}$. We also show that Gini information gain ($IG_G$) is equivalent to ϕ² in the classification and regression tree (CART) algorithm. Consequently, $IG_G$ can yield misleading results due to its dependence on marginal sums. Monte Carlo methods facilitate the detailed specification of stochastic effects in the data acquisition process and provide a practical way to estimate the confidence interval for an effect size.


Introduction
In research with contingency tables, the ability to compare experimental results from different studies is essential for studying the dependence between categorical variables and how it is maintained. However, the data acquisition is controlled by sample size parameters that appear as row and column sums for the various categories. Association coefficients that are not adjusted for unbalanced sample size can differ between tables even if the underlying system response is unchanged [1,2]. The dependence of the ϕ coefficient on the margins led to the development of the normalized form, $\phi/\phi_{max}$ [3,4]. Recently, VanLiere and Rosenberg investigated the allele frequency dependence of the r² linkage disequilibrium measure [5]; note that ϕ and r refer to the same coefficient. Olivier and Bell discussed the limitations of the ϕ coefficient and proposed effect size thresholds for the odds ratio because it is a measure that is "not problematic" [6]. The odds ratio is invariant to scaling of rows or columns, but there is continuing debate on the merits of the odds ratio versus the relative risk [7–10]. Warrens [11] showed that members of the general family of association coefficients that are linear transformations of the simple matching coefficient do not satisfy all three desiderata for a well-behaved coefficient. The lack of consensus on the utility of the many alternative effect size measures [11,12] led us to consider whether there might be a core set of principles and elementary properties for 2 × 2 tables that might broadly apply. In this paper, we review coordinate systems for representing proportional variation in a 2 × 2 table, which corresponds to a two-component system of point vectors in the standard one-simplex with two degrees of freedom. Then, we examine the equivalence class of tables induced by an odds ratio.
The scaling invariance corresponds to a diagonal symmetry such that an odds ratio does not possess a simple interpretation in terms of proportional effects. We discuss the connections between proportion difference, odds ratio, Yule's Q, and relative risk and show that an effect size statistic is more generally regarded as a perspective function, i.e., a linear fractional transformation [13] of proportional variation. A contingency table factors into a product of proportion and diagonal row or column sum matrices. Rows and columns of the proportion matrix correspond to different representations of the relation between categorical variables. Therefore, a 2 × 2 table is associated with four different forms of proportional variation. Together, these constitute the full implementation of the Goodman and Kruskal proposal that adjustment for unbalanced sample size is needed in the estimation of effect size [2]. Various forms of stochastic effects can affect a data acquisition process, so a 2 × 2 table is associated with a distribution. We discuss the use of Monte Carlo methods as a practical way to simulate a distribution of tables and estimate the confidence interval for an effect size. Finally, our interest in effect size measures developed in the course of plant breeding research at DuPont to identify agriculturally beneficial genetic variation in maize [14]. These studies involved high-dimensional search to assess linkage disequilibrium and genome-wide association (GWAS) in maize populations, including the use of the classification and regression tree (CART) algorithm. An essential step in CART is an exhaustive search over the range of each independent variable for an optimal binary partition of the response data [15,16]. We show that the Gini information gain is equivalent to ϕ², and we compare their behavior with a scaling invariant effect size measure using a publicly available data set.
Satisfactory resolution of these longstanding issues in the application of effect size for statistics would have broad implications for high-dimensional data analysis and machine learning. The main novel contributions of this work are: 1) identification of the correspondence between factoring the 2 × 2 table and effect size, 2) identification of the four forms of proportional variation with row or column sum invariance, 3) identification of an effect size measure for a 2 × 2 table as a mapping of proportional variation for a two-component system in Δ¹ × Δ¹ to ℝ², 4) identification of the equivalence between Gini information gain and the ϕ coefficient, 5) development of an improved CART association algorithm using a proportional displacement measure with correction for unbalanced sample size for the response.

Notation
In this work, we study the connection between odds ratio, proportion and ϕ for a 2 × 2 table. Our notation for the three required coordinate systems is briefly summarized here. We deviate slightly from convention and use the symbol Δ¹ to designate the standard one-simplex [13] such that the dot product of a vector, $\mathbf{u} \in \Delta^1$, with the one-vector satisfies the condition $\mathbf{u} \cdot \mathbf{1} = 1$. Ratio vectors, (α, 1) and (β, 1), with α, β ∈ ℝ¹, are elements of the projective line, P¹. (α, 1) corresponds to the proportion, $p_\alpha = \alpha/(\alpha + 1)$, and the proportion vector, $\mathbf{p}_\alpha = (p_\alpha, 1 - p_\alpha)$, in Δ¹. The subscript for a proportion corresponds to its P¹ coordinate. Similarly, (β, 1) corresponds to the proportion vector $\mathbf{p}_\beta = (p_\beta, 1 - p_\beta)$. (a, b), (c, d), (a, c), and (b, d) are vectors in ℝ². (a, b) corresponds to the ratio vector, (a/b, 1), in P¹. (a/b, 1) corresponds to the proportion, $p_{a/b} = a/(a + b)$, and the proportion vector, $(p_{a/b}, p_{b/a}) = (p_{a/b}, 1 - p_{a/b})$, in Δ¹. Ratio and proportion vectors are defined in a similar way for the other ℝ² vectors. The slightly cumbersome subscript notation is necessary because we are working with proportions for both row space, such as $p_{a/b}$, and column space, such as $p_{a/c}$. However, in subscripts for marginal sum proportions the division by N is dropped; e.g., $p_{a+c} = (a + c)/N$, where N = a + b + c + d. Ratio and proportion vectors are examples of perspective functions of the general form $P(\mathbf{u}, t) = \mathbf{u}/t$ for $\mathbf{u} \in \mathbb{R}^N$, $t \in \mathbb{R}^1$, and t > 0 [13]. Another familiar example is normalization by the Euclidean norm, $P(\mathbf{u}, \|\mathbf{u}\|) = \mathbf{u}/\|\mathbf{u}\|$.
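The notation above can be made concrete with a minimal sketch. The function names `perspective`, `ratio_vector`, and `proportion_vector` are hypothetical and chosen only for illustration; exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction


def perspective(u, t):
    """Perspective function P(u, t) = u / t, defined for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return tuple(x / t for x in u)


def ratio_vector(a, b):
    """Map the count vector (a, b) to the ratio vector (a/b, 1)."""
    return perspective((a, b), b)


def proportion_vector(a, b):
    """Map the count vector (a, b) to a proportion vector in the one-simplex."""
    return perspective((a, b), a + b)


# Hypothetical counts, used only to illustrate the notation.
a, b = Fraction(10), Fraction(30)

alpha, unit = ratio_vector(a, b)      # ratio coordinate a/b and the perspective coordinate 1
p = proportion_vector(a, b)           # (p_{a/b}, p_{b/a})

# The proportion obtained from the ratio coordinate agrees with a / (a + b).
p_from_ratio = alpha / (alpha + 1)
print(alpha, p, p_from_ratio)
```

Both normalizations are instances of the same perspective function; only the choice of t differs (t = b for the ratio vector, t = a + b for the proportion vector).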

Coordinate systems for proportion and odds ratio
In this section, we discuss coordinate systems for representing binary proportional variation in categorical data analysis. For the point vector (a, b) ∈ ℝ², the ratio corresponds to a linear fractional transformation
$$\frac{a}{b} = \frac{(a + b) + (a - b)}{(a + b) - (a - b)} = \frac{1 + \delta_s}{1 - \delta_s}, \qquad (1)$$
where $\delta_s$ is the difference in proportion
$$\delta_s = \frac{a - b}{a + b}. \qquad (2)$$
The 's' designation arises from the connection with the proportional displacement, $\delta_s$, between the pair of vectors (a, b) and (b, a), and the correspondence of these vectors to a diagonally 'symmetric' 2 × 2 table as described in Section 1.4. We will encounter several expressions of the form Eq (1), indicating that elements of projective geometry [13,17] provide the framework for the analysis of proportional variation. Consequently, our objective is to identify vector algebraic structures for representing proportional variation in asymmetric 2 × 2 tables. They provide the framework for analyzing the relationships between binary proportion, odds ratio, Yule's Q, relative risk, and ϕ. Proportional normalization of a ratio vector produces a proportion vector, which is an element of Δ¹ (Fig 1). Then, a proportion vector has the form $\mathbf{v} = (v_1, 1 - v_1)$, with $0 \le v_1 \le 1$. In contrast, the corresponding ratio vector has the form $(v_1/(1 - v_1), 1)$. Then, the difference between proportion vectors $\mathbf{u}$ and $\mathbf{v}$, $\boldsymbol{\delta} = \mathbf{u} - \mathbf{v}$, is parameterized by a single parameter, $u_1 - v_1$, and variation in binary proportion corresponds to translation in Δ¹. The difference between ratio vectors is also parameterized by a single parameter, the difference between their first coordinates. Therefore, Δ¹ and P¹ correspond to different constraints in representing proportional variation. However, the order of categories in a contingency table is arbitrary, and it is not possible to identify a unique category that should serve as the perspective coordinate for a ratio. This introduces ambiguity, as we will see later in the discussion of the odds ratio.
On the other hand, in factoring out the effects of marginal sums, the Δ¹ representation provides an important function in the analysis of 2 × 2 tables. Now, we discuss the representation of a two-component system of binary proportions in Δ¹ and P¹ coordinate systems, and describe intrinsic properties of various effect size measures. The formulae take on a more compact, intuitive form because scaling invariance is built in. The algebraic intuition gained here helps in comprehending the more cumbersome expressions obtained later using the ℝ² representation. The exception is the ϕ coefficient, which does not possess a Δ¹ representation due to the lack of scaling invariance (Section 1.4). In particular, we discuss properties of the odds ratio, ω = β/α, where α, β ≥ 0, corresponding to (α, 1) and (β, 1) on the P¹ line, respectively. Then, relative risk is defined as $\rho = p_\beta/p_\alpha$, where $p_\beta = \beta/(\beta + 1)$ and $p_\alpha = \alpha/(\alpha + 1)$. The corresponding proportional basis consists of $\mathbf{p}_\alpha = (p_\alpha, 1 - p_\alpha)$ and $\mathbf{p}_\beta = (p_\beta, 1 - p_\beta)$. Next, we introduce the center-of-mass basis with the parameters $\delta = (p_\beta - p_\alpha)/2$ and $\mu = (p_\beta + p_\alpha)/2$; note that the alternative basis $\delta_{\alpha-\beta}$ and $\mu_{\alpha+\beta}$ would also suffice. Then, variation is represented by the two-parameter vector (δ, μ), reflecting the fact that there are two degrees of freedom. Using the relations $\alpha = p_\alpha/(1 - p_\alpha)$ and $\beta = p_\beta/(1 - p_\beta)$, the odds ratio is expressed in terms of δ and μ (Eq 3). Then, we introduce Yule's Q [1] and obtain its expression in terms of δ and μ (Eq 4). Similarly, the relative risk (Eq 5) and the ratio difference (Eq 6) are expressed in terms of δ and μ. Inspection of Eqs (3–6) shows that the odds ratio and relative risk correspond to linear fractional transformations of proportional variation, and an effect size statistic corresponds to a perspective function P((δ, μ), t) = (δ/t, μ/t), where t is a polynomial function of δ and μ. However, algebraic considerations alone are not sufficient to explain why a particular form might be preferred for t or to provide operational interpretations for the different perspective normalizations in Eqs (4–6).
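As a worked sketch of how such expressions arise, suppose the center-of-mass parameters satisfy $p_\alpha = \mu - \delta$ and $p_\beta = \mu + \delta$; this is an assumption consistent with the bounds on μ ± δ used in this section. The resulting forms below follow from the definitions of ω, Q, and ρ alone and may differ in presentation from the numbered equations of the paper. The auxiliary symbol $t$ is introduced here only for brevity.

```latex
% Assumption: p_alpha = mu - delta, p_beta = mu + delta, and t := mu(1 - mu) + delta^2.
\begin{align*}
\omega &= \frac{\beta}{\alpha}
        = \frac{p_\beta\,(1 - p_\alpha)}{p_\alpha\,(1 - p_\beta)}
        = \frac{(\mu + \delta)(1 - \mu + \delta)}{(\mu - \delta)(1 - \mu - \delta)}
        = \frac{t + \delta}{t - \delta}, \\[4pt]
Q      &= \frac{\omega - 1}{\omega + 1}
        = \frac{(t + \delta) - (t - \delta)}{(t + \delta) + (t - \delta)}
        = \frac{\delta}{t}
        = \frac{\delta}{\mu(1 - \mu) + \delta^{2}}, \\[4pt]
\rho   &= \frac{p_\beta}{p_\alpha}
        = \frac{\mu + \delta}{\mu - \delta}.
\end{align*}
```

Under this assumption, Q divides δ by a quadratic polynomial in (δ, μ), whereas ρ uses a linear denominator, which illustrates how the two measures apply different perspective normalizations to the same proportional variation.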
In his 1912 paper, Yule remarked that the Q coefficient has the merit of possessing a simple form "but the demerit of not possessing an equal simplicity of interpretation" [1]. Given the lack of an interpretation for the different normalizations, we find that Yule's remark also extends to the odds ratio and relative risk. Furthermore, rearranging Eqs (4) and (5) gives the corresponding relations between δ and μ, with 0 ≤ μ − δ ≤ 1 and 0 ≤ μ + δ ≤ 1. Each of the four forms of proportional variation identified in Section 1.3 satisfies these relations. Thus, there is a range of values of (δ, μ) for a fixed value of either Q or ρ (Fig 2). This ambiguity in proportional effects explains why the question of the merits of the odds ratio versus relative risk is still not resolved [18,19]. A more precise approach would take into account the two-dimensional nature of the proportional variation, which could involve separate thresholds for δ and μ. In any case, the specification of a perspective function should be based on the assessment of cost-benefit trade-offs for variations in δ and μ, which will depend on the particular application.
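The ambiguity described above can be illustrated numerically. The following is a minimal sketch, assuming $p_\alpha = \mu - \delta$ and $p_\beta = \mu + \delta$ (consistent with the stated bounds); the helper names `effect_sizes` and `delta_for_q` are hypothetical. For a fixed target value of Q, it solves for δ at several values of μ and reports the corresponding odds ratio and relative risk.

```python
import math


def effect_sizes(delta, mu):
    """Return (odds ratio, relative risk, Yule's Q) for center-of-mass coordinates.

    Assumes p_alpha = mu - delta and p_beta = mu + delta.
    """
    p_alpha, p_beta = mu - delta, mu + delta
    odds_ratio = (p_beta * (1 - p_alpha)) / (p_alpha * (1 - p_beta))
    relative_risk = p_beta / p_alpha
    yule_q = (odds_ratio - 1) / (odds_ratio + 1)
    return odds_ratio, relative_risk, yule_q


def delta_for_q(q, mu):
    """Smaller root of q*delta**2 - delta + q*mu*(1 - mu) = 0, for 0 < q < 1.

    The quadratic follows from Q = delta / (mu*(1 - mu) + delta**2) under the
    assumption stated above.
    """
    return (1 - math.sqrt(1 - 4 * q**2 * mu * (1 - mu))) / (2 * q)


q_target = 0.5
for mu in (0.1, 0.3, 0.5):
    delta = delta_for_q(q_target, mu)
    omega, rho, q = effect_sizes(delta, mu)
    print(f"mu={mu:.1f} delta={delta:.4f} omega={omega:.3f} rho={rho:.3f} Q={q:.3f}")
```

In this sketch the odds ratio and Q are the same in every row, while δ and the relative risk change with μ, so a single value of Q does not identify a unique proportional effect.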

Decomposition of proportional variation for a 2 × 2 contingency table
In this section, the two-component framework is used in the analysis of proportional variation for a 2 × 2 table (Table 1). We are particularly concerned with the confounding effect of the row and column sums in the formulation of association measures [2,5,11]. Each marginal sum corresponds to a categorical sample size that is determined by experimental procedure.
Suppose the first row of Table 1 is multiplied by a number k to reflect a change in sample size; then, (a, b) ↦ (ka, kb). Then, the simple matching coefficient [11], $s_M$, is expressed as
$$s_M = \frac{ka + d}{ka + kb + c + d},$$
which is not invariant to scaling by k. Alternatively, each marginal sum serves as a proportional normalization factor; e.g., P((a, b), a + b). Then, $s_M$ can be expressed as the weighted sum of proportions for columns or rows,
$$s_M = x_1\, p_{a/c} + x_2\, p_{d/b}, \qquad x_1 = p_{a+c},\; x_2 = p_{b+d}, \qquad (9)$$
$$s_M = x_1\, p_{a/b} + x_2\, p_{d/c}, \qquad x_1 = p_{a+b},\; x_2 = p_{c+d}, \qquad (10)$$
respectively. The proportions are invariant to scaling of either rows or columns, but the corresponding weights ($x_i$) are not, because the overall sum, a + b + c + d, does not distinguish between row and column sums. Therefore, $s_M$ can differ between two tables because of differences in sample size even though the underlying system response properties might be unchanged. Warrens [11] has shown that members of the general family of coefficients that are linear transformations of $s_M$ do not satisfy the criteria for a well-behaved coefficient. As discussed by Goodman and Kruskal [2], dependence on sample size parameters complicates the interpretation of association coefficients. The concepts discussed in this paper support their proposal that normalization to adjust for unbalanced sample sizes is necessary.
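A minimal numerical sketch of this point follows. The table counts and the scale factor k are hypothetical, and the function names are illustrative only. It evaluates the simple matching coefficient before and after scaling the first row, and checks that the column-weighted form reproduces the direct definition.

```python
def simple_matching(a, b, c, d):
    """Simple matching coefficient: fraction of counts on the main diagonal."""
    return (a + d) / (a + b + c + d)


def weighted_column_form(a, b, c, d):
    """Same coefficient written as a weighted sum of column proportions."""
    n = a + b + c + d
    x1, x2 = (a + c) / n, (b + d) / n        # weights: column-sum proportions
    return x1 * a / (a + c) + x2 * d / (b + d)


# Hypothetical table (a, b, c, d) and row scale factor k.
a, b, c, d = 20, 10, 5, 15
k = 5

s_original = simple_matching(a, b, c, d)
s_scaled = simple_matching(k * a, k * b, c, d)   # first row multiplied by k
s_weighted = weighted_column_form(a, b, c, d)

print(s_original, s_scaled, s_weighted)
```

The first two values differ even though only the sample size of the first row changed, which is the dependence on marginal sums described in the text.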
The invariance of the odds ratio to scaling of either rows or columns is expressed as
$$\omega = \frac{ad}{bc} = \frac{(ka)\,d}{(kb)\,c} = \frac{(ka)\,d}{b\,(kc)},$$
with k > 0. This expression remains valid if either bc ↦ cb or ad ↦ da. Thus, the odds ratio does not distinguish between ratios for rows and columns [18,20], which introduces ambiguity with respect to proportional effects. Consider the equivalence class of tables obtained by unitary scaling of the diagonal elements ('u-scaling'); this class is related to Yule's Y [1], also known as the coefficient of colligation [21]. However, row and column sums are linearly related by a column proportion matrix
$$\begin{pmatrix} a + b \\ c + d \end{pmatrix} = \begin{pmatrix} p_{a/c} & p_{b/d} \\ p_{c/a} & p_{d/b} \end{pmatrix} \begin{pmatrix} a + c \\ b + d \end{pmatrix}.$$
This linear relation is not preserved by u-scaling because of the mixing of effects between rows and columns (Table 2), so the odds ratio by itself is not suitable as an effect size measure. The linear relation also implies that row and column sums play equal roles as sample size parameters, directly or indirectly, and that either rows or columns can be equalized, but not both simultaneously. It is necessary to choose between rows and columns in conditioning a contingency table for unbalanced sample sizes.
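Both statements can be checked with a short sketch, assuming the standard layout in which the rows of the table are (a, b) and (c, d). The counts are hypothetical and the variable names are illustrative.

```python
import numpy as np


def odds_ratio(table):
    """Odds ratio ad / (bc) for a 2 x 2 table with rows (a, b) and (c, d)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts.
table = np.array([[20.0, 10.0],
                  [5.0, 15.0]])

row_scaled = table * np.array([[5.0], [1.0]])   # first row multiplied by 5
col_scaled = table * np.array([[1.0, 3.0]])     # second column multiplied by 3

print(odds_ratio(table), odds_ratio(row_scaled), odds_ratio(col_scaled))

# Column proportion matrix: each column divided by its column sum.
col_sums = table.sum(axis=0)
row_sums = table.sum(axis=1)
P_col = table / col_sums

# Row sums are a linear function of column sums through P_col.
print(P_col @ col_sums, row_sums)
```

The odds ratio is unchanged by either scaling, while the product of the column proportion matrix with the column sums returns the row sums, so the two sets of marginal sums cannot be varied independently once the proportions are fixed.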
A self-consistent representation of proportional variation must account for the scaling invariance of the odds ratio. Therefore, our objective is to obtain a decomposition of the odds ratio in terms of elementary proportions by conditioning for the effect of the marginal sums. Consider scaling of the expression ωbc − ad = 0 by column sums to obtain the fractional representation (Eq 12), where $n_1$ and $n_2$ are normalization factors for the subsequent conversion to proportion vectors. Since there are two ways to express the odds ratio as a product of ratios, there are also two ways to group the fractional products to form proportion vectors. The standard grouping is formed from the columns of the table with $n_1 = n_2 = 1$ to obtain the two vectors $(p_{a/c}, p_{c/a})$ and $(p_{b/d}, p_{d/b})$ (Eq 13). However, we also obtain a second pair of vectors formed from the rows with a different choice of normalization factors (Eq 14). The proportions in both Eqs (13) and (14) are invariant to scaling of columns, as required. The second form of proportional variation corresponds to an effect size measure with the normalization needed for experimental work, and has not been previously mentioned in the effect size literature to the best of my knowledge. Proportion vectors invariant to the scaling of rows are obtained in a similar way. A more concise way to obtain the proportion vectors is to observe that a matrix can be factored as a product of a diagonal column sum matrix ($M_{csum}$) or row sum matrix ($M_{rsum}$) and a proportion matrix, $P_{csum,c|r}$ or $P_{rsum,c|r}$, respectively (Eqs 15 and 16).
The $N_{csum,c|r}$ and $N_{rsum,c|r}$ proportion normalization factors provide the different scaling structures (Eq 12) needed for column and row proportion matrices, which correspond to different projective representations of the relationship between variables (Fig 4). The standard protocol is to equalize the marginal sums for the response or dependent variable, and calculate the response effect size for variation of the treatment or independent variable. Depending on whether the response variable is listed in columns or rows, the corresponding representation would be either $P_{csum,r}$ or $P_{rsum,c}$, respectively. Examples of corresponding proportion difference measures, $\delta_{c,a-c}$ and $\delta_{r,a-b}$, are also shown in Fig 4. Our subscript notation is illustrated by $\delta_{r,a-b}$, which corresponds to the difference between the 'a' and 'b' elements of the $P_{rsum,c}$ proportion matrix. Then, calculation of an effect size requires the specification of a perspective function for mapping the relevant (δ, μ) vector to ℝ¹ (Section 1.2). Proper practice also requires that an effect size estimate be qualified by a confidence interval (Section 1.5).
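The basic factorization can be sketched as follows. This is a minimal illustration that covers only the two plain proportion matrices (columns normalized by column sums, rows normalized by row sums); it does not reproduce the additional normalization factors that distinguish the four forms discussed above. The names `P_col` and `P_row` and the table counts are hypothetical.

```python
import numpy as np

# Hypothetical 2 x 2 table with rows (a, b) and (c, d).
T = np.array([[20.0, 10.0],
              [5.0, 15.0]])

M_csum = np.diag(T.sum(axis=0))      # diagonal matrix of column sums
M_rsum = np.diag(T.sum(axis=1))      # diagonal matrix of row sums

P_col = T @ np.linalg.inv(M_csum)    # columns are proportion vectors
P_row = np.linalg.inv(M_rsum) @ T    # rows are proportion vectors

# The table is recovered as proportion matrix times diagonal sum matrix.
T_from_columns = P_col @ M_csum
T_from_rows = M_rsum @ P_row

# Scaling a column changes the column sums but not the column proportions.
T_scaled = T * np.array([[1.0, 4.0]])
P_col_scaled = T_scaled / T_scaled.sum(axis=0)

print(P_col)
print(P_row)
```

Right-multiplication by the diagonal column sum matrix rescales columns, and left-multiplication by the diagonal row sum matrix rescales rows, so each proportion matrix isolates the part of the table that is invariant to the corresponding sample size parameters.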

The ϕ coefficient
In this section, we discuss why ϕ does not serve as a well-behaved effect size measure and further explain the connection between $\delta_s$ and diagonally symmetric 2 × 2 tables. The ϕ coefficient is of particular importance in GWAS because it serves as a standard measure of linkage disequilibrium between molecular markers [3,5]. The popularity of ϕ is due to its correspondence with Pearson's correlation coefficient. Binary {0, 1} representations are invoked for the categorical variables, then the correlation coefficient formula is applied to obtain
$$\phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}, \qquad (17)$$
which is also often referred to as 'r'. However, the limitations of ϕ as an association measure are well known [3,5,6,11,22]. Alternatively, ϕ is obtained from the relation with Pearson's chi-squared statistic, χ² = (a + b + c + d)ϕ² [3,23], which also averages over rows and columns, resulting in confounding effects. Introducing the ratio product for marginal sums,
$$\omega_M = \frac{(a + b)(c + d)}{(a + c)(b + d)}, \qquad (18)$$
ϕ can be written as a row sum factorization. Consequently, each proportion difference, $\delta_i$, is associated with a factorization $\phi = M_i \delta_i$, where $M_i$ depends on marginal sums. Therefore, ϕ corresponds to a weighted average of the $\delta_i$. The multiplication of row and column sums together in each $M_i$ has a compounding effect because the sums are not independent.
Consider a diagonally symmetric 2 × 2 table with d = a and c = b in Table 1, and equal row and column sums. Then, Eq (12) becomes
$$\frac{\omega b^2 - a^2}{(a + b)^2} = 0,$$
which corresponds to the proportion vectors $\frac{1}{a+b}(a, b)$ and $\frac{1}{a+b}(b, a)$, and proportion difference $\delta_s$ (Eq 2). Since $\frac{1}{a+b}[(a, b) + (b, a)] = (1, 1)$, the Δ¹ coordinates are $(\delta_s, \tfrac{1}{2})$, so there is only one degree of freedom. Thus, there is a correspondence between 2 × 1 tables [23] and diagonally symmetric 2 × 2 tables. However, $M_i = 1$ for diagonally symmetric tables, and Eq (17) simplifies to give $\phi_s = \delta_s$. Thus, $\delta_s$ and $\phi_s$ are equivalent measures of proportional variation. Conversely, the $\delta_i$ in Fig 4 can be regarded as constituting an extension of $\phi_s$ to asymmetric tables. The ϕ coefficient per se does not account for the loss of symmetry when $M_i \ne 1$, because it does not distinguish between the $\delta_i$. However, when $M_i \approx 1$ the four expressions collapse into one, or nearly so, and the values of ϕ and $\delta_i$ will be approximately the same. This includes the case where either b = c = 0 or a = d = 0, resulting in a diagonal 2 × 2 table. The connection with ϕ suggests that Cohen's recommendations of effect sizes of 0.1, 0.3 and 0.5 for small, medium, and large effects, respectively, for ϕ [6,24] can also be invoked for the various forms of $\delta_i$, but this assumes that the $\mu_i$ coordinate is irrelevant.
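The following sketch checks the symmetric-table equivalence and illustrates the general idea of factoring ϕ into a marginal sum term and a proportion difference. It uses the plain row and column proportion differences, which are an assumption made for illustration and are not necessarily the normalized $\delta_i$ of Fig 4; the marginal ratio follows the form of $\omega_M$ used later in the CART discussion. The counts are hypothetical.

```python
import math


def phi(a, b, c, d):
    """phi coefficient for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Diagonally symmetric table: d = a and c = b.
a, b = 30, 10
phi_s = phi(a, b, b, a)
delta_s = (a - b) / (a + b)

# Asymmetric table (hypothetical counts).
a, b, c, d = 20, 10, 5, 15
omega_m = ((a + b) * (c + d)) / ((a + c) * (b + d))   # ratio product of marginal sums
row_diff = a / (a + b) - c / (c + d)                  # plain difference of row proportions
col_diff = a / (a + c) - b / (b + d)                  # plain difference of column proportions

phi_value = phi(a, b, c, d)
phi_from_rows = math.sqrt(omega_m) * row_diff
phi_from_cols = col_diff / math.sqrt(omega_m)

print(phi_s, delta_s)
print(phi_value, phi_from_rows, phi_from_cols)
```

For the symmetric table the marginal factor is 1 and ϕ equals the proportion difference; for the asymmetric table the same value of ϕ is obtained from two different proportion differences, each multiplied by a different function of the marginal sums.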

Confidence interval for proportional effects
Each step of a data acquisition process is subject to stochastic effects, and data quality can vary between data sets. Therefore, the specification of a confidence interval (CI) for the effect size is an integral part of data analysis [25,26]. A contingency table for experimental data is associated with a distribution of tables, $\mathcal{P}(\theta)$, and corresponding distributions for the effect size. The specification of $\mathcal{P}(\theta)$ must be based on a realistic assessment of all sources of error and uncertainty to form an error model for the data, $\mathcal{E}(\theta)$. For binary variables, a common approach is to estimate variance from a binomial distribution; the normal distribution is a useful approximation for large sample sizes. Then, estimating the CI for an effect size requires a propagation of error calculation, which is often not straightforward. Analytical approaches for estimating confidence intervals for ratios [27,28], proportion and difference of two proportions [29,30], correlation coefficients [31,32], and odds ratios [9] are already quite involved. Fractional transformation, the bounded range, and the discrete properties of an effect size for proportional variation introduce complications that make it difficult to obtain convenient expressions for error propagation. Alternatively, Monte Carlo (MC) methods [33,34] provide a more practical approach to estimate confidence intervals for quantities such as $\delta_{r,b-a}$ and $\delta_{c,c-a}$. In an MC simulation, a 2 × 2 MC table is obtained by generating the N = a + b + c + d events by making random draws according to specified sample proportions [9] and $\mathcal{E}(\theta)$. A set of MC tables is obtained by repeating the sampling process many times; MC distributions are formed for proportions and effect size from the MC tables. Many MC runs are performed, collecting the relevant statistics for each MC distribution, including the mean, median, variance, and histogram.
Finally, the degree of convergence for the MC simulation is estimated from the statistics for the MC runs. Fig 5A and 5C show constrained MC simulations with fixed column sums $n_1 = a + c$ and $n_2 = b + d$ and sampling proportions $\frac{1}{a+c}(a, c)$ and $\frac{1}{b+d}(b, d)$, respectively. Fig 5B and 5D show greater internal scatter because only the overall sum, N, is fixed, with corresponding sampling proportions $\frac{1}{N}(a, b, c, d)$. Even though the underlying distributions are discrete, the ±2σ interval for a normal distribution serves as a good approximation for the $\delta_{c,c-a}$ confidence interval in this example. More generally, the distribution of effect size is asymmetric, which would be represented by separate confidence intervals for positive and negative deviation from the median. The advantage of the MC method is that the simulation can accommodate a detailed specification of $\mathcal{E}(\theta)$, including heteroscedasticity [25,35] and correction for attenuation from misclassification [35,36]. This capability is essential in accounting for the effects of instrumental and other operational factors on the quality of data produced by a data acquisition system.
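The two sampling schemes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the error model contains only binomial or multinomial sampling noise (no misclassification or heteroscedasticity), the effect size is the plain difference of column proportions a/(a + c) − b/(b + d) rather than the normalized $\delta_{c,c-a}$, and the table counts, the function name `mc_delta`, and its parameters are hypothetical.

```python
import numpy as np


def mc_delta(a, b, c, d, n_tables=20000, fixed_columns=True, seed=0):
    """Monte Carlo distribution of the column proportion difference.

    fixed_columns=True  : column sums are held fixed (two binomial draws).
    fixed_columns=False : only the overall sum N is fixed (one multinomial draw).
    """
    rng = np.random.default_rng(seed)
    if fixed_columns:
        n1, n2 = a + c, b + d
        a_mc = rng.binomial(n1, a / n1, size=n_tables)
        b_mc = rng.binomial(n2, b / n2, size=n_tables)
        c_mc, d_mc = n1 - a_mc, n2 - b_mc
    else:
        n = a + b + c + d
        counts = rng.multinomial(n, [a / n, b / n, c / n, d / n], size=n_tables)
        a_mc, b_mc, c_mc, d_mc = counts.T
    col1, col2 = a_mc + c_mc, b_mc + d_mc
    valid = (col1 > 0) & (col2 > 0)          # discard tables with an empty column
    return a_mc[valid] / col1[valid] - b_mc[valid] / col2[valid]


# Hypothetical observed table.
a, b, c, d = 12, 28, 30, 10
point_estimate = a / (a + c) - b / (b + d)

delta_fixed = mc_delta(a, b, c, d, fixed_columns=True)
delta_free = mc_delta(a, b, c, d, fixed_columns=False)

# Percentile interval: separate lower and upper limits around the median.
ci_fixed = np.percentile(delta_fixed, [2.5, 50.0, 97.5])
ci_free = np.percentile(delta_free, [2.5, 50.0, 97.5])

print(point_estimate, ci_fixed, ci_free)
```

A percentile interval is used here instead of ±2σ so that an asymmetric distribution is reported with separate lower and upper limits; a more detailed error model would replace the sampling step inside `mc_delta`.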

Decomposition of proportional effects for an r × c table
A table with more than two rows or columns is commonly referred to as an r × c table. The matrix factorization (Eqs 15 & 16) extends in a straightforward way to produce the r × c proportion matrices. For independent and dependent variables with r and c categories, respectively, proportional variation is represented as r points in the standard $\Delta^{c-1}$ simplex, with r(c − 1) degrees of freedom. Various multicategorical association measures have been proposed for r × c tables. However, we choose Cramér's V² [37,38] as an example to illustrate the difficulties. V² is defined as a normalization of Pearson's χ² such that χ² = n(q − 1)V², where n is the total event count and q = min(r, c). V is equivalent to ϕ for 2 × 2 tables. Similarly, it is straightforward to show that Goodman and Kruskal's $\tau_c$ and $\tau_r$ [37] are both equivalent to ϕ² for 2 × 2 tables. These equivalences confirm that Pearson's χ², V², $\tau_c$ and $\tau_r$ are composite statistical quantities that average over alternative forms of variation and are therefore subject to ambiguous interpretation. The $\mathbb{R}^{r(c-1)} \mapsto \mathbb{R}^1$ mappings consist of multidimensional sums and products across rows and columns, resulting in confounding effects because of dependence between them.
In the absence of an engineering or functional model, the specification of a vector basis for proportional variation for an r × c table is not a well-posed problem [39]; i.e., there is not a unique solution. This constitutes a fundamental limitation for the formulation of an effect size measure. Consider a two-component proportional system represented by vectors $\mathbf{u}, \mathbf{v} \in \Delta^N$ with N > 1, and $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{N+1}$. The two default center-of-mass vectors are $\boldsymbol{\mu} = (\mathbf{u} + \mathbf{v})/2$ and $\boldsymbol{\delta} = (\mathbf{u} - \mathbf{v})/2$. However, there is not a standard procedure for choosing the additional 2N − 2 vectors needed to form a complete basis.
Alternatively, a single coordinate or a sum of coordinates could serve as the basis for estimating an effect size. This corresponds to choosing a Δ¹ × Δ¹ subspace for the representation of proportional variation. A representation of the 2N degrees of freedom for a two-component $\Delta^N \times \Delta^N$ system would require the specification of N separate 2 × 2 tables. Therefore, the 2 × 2 table serves an elementary role in the decomposition of multiproportional variation due to the minimal properties of Δ¹. The recommended approach is to adopt a multidimensional representation of proportional variation and "reduce any multiple-level or multiple-variable relationship to a set of two-variable relationships" [25]. Similar advice has been given for avoiding the compounding effect of the ANOVA null hypothesis, to break down "complicated hypotheses into smaller, more easily understood pieces" [40]. Ways in which an r × c table might be partitioned and marginalized have been described by Kateri [41]. The objective is to construct a set of 2 × 2 tables that encompass relevant forms of proportional variation for the particular application. This multidimensional representation should be combined with the specification of cost-benefit trade-offs in assessing the effect size for proportional variation. In the next section, we discuss the use of 2 × 2 tables in the CART algorithm. However, high-dimensional search is still a developing area [42,43], and a detailed assessment of the pros and cons for various approaches is beyond the scope of this paper.
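The relation χ² = n(q − 1)V² stated earlier in this section, and the equivalence of V with ϕ for 2 × 2 tables, can be checked with a short sketch. The function name `cramers_v` and the table counts are hypothetical; Pearson's χ² is computed directly from expected counts under independence.

```python
import numpy as np


def cramers_v(table):
    """Cramér's V from Pearson's chi-squared: chi2 = n * (q - 1) * V**2."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / n   # independence model
    chi2 = ((t - expected) ** 2 / expected).sum()
    q = min(t.shape)
    return np.sqrt(chi2 / (n * (q - 1)))


# 2 x 2 case: V coincides with the absolute value of phi.
a, b, c, d = 20, 10, 5, 15
v_2x2 = cramers_v([[a, b], [c, d]])
phi_2x2 = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# 3 x 2 case: a single number summarizes several distinct row contrasts.
v_3x2 = cramers_v([[20, 10], [5, 15], [12, 12]])

print(v_2x2, phi_2x2, v_3x2)
```

In the 3 × 2 example the single value of V pools contrasts between three rows, which illustrates why a set of 2 × 2 tables can be more informative than one composite coefficient.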

Gini information gain and ϕ 2
In this section, we examine connections between effect size and information gain (IG) measures used in standard implementations of the CART algorithm. CART creates a binary decision tree by the recursive partitioning of the association between response and independent variables [44–46]. Each node of the tree corresponds to a binary partition of the range of an independent variable. Each terminal node is a classification identified by a unique combination of intervals of the independent variables. In standard implementations, the partition parameters for a node are determined by maximizing IG for the response variable in an exhaustive search of associations over all independent variables. In each iteration, the set of statistics obtained for the binary partitions of an independent variable constitutes a CART association graph. Our objective is to compare CART graphs for effect sizes including IG. To simplify the discussion, we consider the case where the response variable is binary. Then, the data for a partition correspond to a 2 × 2 table [47]. Then, IG is defined as the parent node impurity, I(S), minus the weighted impurities for the subnodes, $I(S_1)$ and $I(S_2)$,
$$IG = I(S) - x_1 I(S_1) - x_2 I(S_2), \qquad (23)$$
where the weight factor is $x_i = n_i/n$, $n_i$ is the number of elements in node $S_i$, and $n = n_1 + n_2$. Two popular impurity measures are the entropy, $E = -\sum_j p_j \ln p_j$, and the Gini impurity, $G = 1 - \sum_j p_j^2$, where $p_j$ is the proportion of class 'j' items in a set [16]. For a binary proportion vector, $(p_{m/n}, p_{n/m}) = \frac{1}{m+n}(m, n)$, the Gini impurity becomes $G(p_{m/n}, p_{n/m}) = 2 p_{m/n} p_{n/m}$. However, the $x_i$ are subject to the same limitations as the weight factors for $s_M$ (Eqs 9 & 10), and both $IG_E$ and $IG_G$ depend on the marginal sums. More concretely, we show that $IG_G$ and ϕ² are equivalent in CART. Let the rows and columns of Table 1 correspond to the subnodes and categories for the response variable, respectively.
Then, G(S) for the parent node is $G(S) = 2 p_{a+c}\, p_{b+d}$. $G(S_1)$ and $G(S_2)$ are calculated from proportions for the row vectors (a, b) and (c, d), respectively. Then,
$$IG_G = G(S) - x_1 G(S_1) - x_2 G(S_2) = G(S)\, \phi^2, \qquad (24)$$
with substitution of the ϕ coefficient from Eq (17). Since G(S) is a constant for binary partitions at a parent node, we conclude that $IG_G$ is equivalent to ϕ². This confirms that $IG_G$ depends on marginal sums due to the $x_i$, in which the normalization factor N = a + b + c + d does not distinguish between rows and columns. Information gain measures of the form Eq (23) will be subject to this limitation, including $IG_E$. It is known that $IG_E$ and $IG_G$ yield very similar results in CART [48], which confirms that $IG_E$ is subject to dependence on marginal sums (Table 4). The limitations of $IG_G$ raise the question of whether the column sum invariant $\delta_{c,a-c}$ statistic might be more appropriate for CART, which we consider in the next section.
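The proportionality between the Gini information gain and ϕ² can be verified numerically. The sketch below assumes the layout stated above (rows are subnodes, columns are response classes); the function names and table counts are hypothetical.

```python
def gini(m, n):
    """Gini impurity 2*p*(1 - p) for a binary count vector (m, n)."""
    p = m / (m + n)
    return 2 * p * (1 - p)


def gini_gain(a, b, c, d):
    """Parent impurity minus size-weighted subnode impurities.

    Rows (a, b) and (c, d) are the two subnodes; columns are the classes.
    """
    n = a + b + c + d
    parent = gini(a + c, b + d)
    return parent - (a + b) / n * gini(a, b) - (c + d) / n * gini(c, d)


def phi_squared(a, b, c, d):
    """Square of the phi coefficient."""
    return (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical tables with different degrees of imbalance.
tables = [(1, 2, 3, 4), (20, 10, 5, 15), (10, 30, 30, 10), (400, 5, 100, 95)]
for a, b, c, d in tables:
    gain = gini_gain(a, b, c, d)
    scaled_phi2 = gini(a + c, b + d) * phi_squared(a, b, c, d)
    print((a, b, c, d), gain, scaled_phi2)
```

For each table the two printed values agree to rounding error. Because the parent impurity is the same for every candidate partition of a node, ranking partitions by Gini gain is the same as ranking them by ϕ².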

Data preparation
The Centers for Medicare & Medicaid Services (CMS) conducts regular inspections of nursing homes to assess compliance with regulations and surveys residents to assess the quality of patient care. The CMS quality measures data and Five-Star rating assignments are publicly available from the Nursing Home Compare (NHC) website [49]. The analysis of NHC data is an important problem in itself [50–52] and is the subject of our ongoing work [53]. Nursing homes are dynamic systems where the measurement of performance is essential for managing cost, but this constitutes a complex problem for which there is not a unique or 'best' solution. The challenge is to develop data analysis methods that can help identify public health criteria for classifying the quality of patient care in nursing homes, or some approximation thereof. However, in this work our interest is limited to the comparison of CART association graphs for effect size measures. First Quarter, 2018 NHC data for eighteen quality measures were retrieved, selecting only those nursing homes with either a 1 star or 5 star overall rating, corresponding to 1394 and 2649 nursing homes, respectively. Selecting '1 star, 5 star' rating data creates a binary response data set, which is convenient for our purpose; otherwise, data for all five ratings would be included in the CART analysis. The distributions of NHC 'Percentage of short-stay residents who were rehospitalized after a nursing home admission' (Rehospitalized) quality measure data for 1 star and 5 star overall ratings are broad and largely overlap (Fig 6A). This result implies that the $M_i$ for the corresponding contingency tables will tend to be much less than 1, as required for our demonstration.

Effect size in CART
In demonstrating the marginal sum dependence of various effect size measures, we must choose an elementary contingency table analysis problem. CART analysis for a binary response variable (bCART) is well suited for this purpose. In searching for an optimal binary partition of an independent variable, bCART generates a set of 2 × 2 tables where the sample sizes, $n_1$ and $n_2$, of the two subnodes vary over almost the entire range of the fixed sum $N = n_1 + n_2$; a minimum size is usually specified because a partition where either of the subnodes is too small is not informative. We let the rows and columns of Table 1 correspond to the two subnodes and the '1 star, 5 star' rating for the response variable, respectively. Effect size results for a bCART scan for association between the Rehospitalized quality measure and the NHC '1 star, 5 star' overall rating are shown in Fig 6. The exact match between $IG_G$ and $\phi^2$ (Fig 6A) is consistent with Eq (24) because G(S) is constant. The parabolic variation of $\phi^2$ is explained by Eq (21) because the variation in the marginal sum factor, $M_{c,a-c}$, outweighs the much smaller variation in the proportional effect, $\delta_{c,a-c}$ (Fig 6B). The parabolic variation of $M_{c,a-c}^2$ is in turn explained by the approximate similarity with $\omega_M$. Replacing each marginal sum in Eq (18) by the corresponding proportion yields

$$\omega_M = \frac{p_{a+b}\,p_{c+d}}{p_{a+c}\,p_{b+d}} = \frac{p_{a+b}\,(1 - p_{a+b})}{p_{a+c}\,(1 - p_{a+c})}.$$

The denominator corresponds to the binomial variance for the parent set, which is constant. The numerator corresponds to the binomial variance for subnode size proportions, (a + b):(c + d), so $\omega_M$ has a maximum when a + b = c + d, which coincides with the median Rehospitalized value. Consequently, the parabolic dependence of $\phi^2$, with the maximum near the median value, largely reflects the variation in the subnode sample size instead of the '1 star, 5 star' composition.
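The behavior of ω_M during a partition scan can be illustrated with a short sketch. The code below assumes a simple threshold scan in which rows are the two subnodes and columns are the two response classes; the helper names and the toy data are hypothetical. The labels alternate, so the subnode composition carries almost no information, yet ω_M still peaks at the split that gives equal subnode sizes.

```python
def omega_m(a, b, c, d):
    """Ratio of the subnode-size binomial variance to the parent-set binomial variance."""
    n = a + b + c + d
    p_row = (a + b) / n   # proportion of samples in the first subnode
    p_col = (a + c) / n   # proportion of the first response class (fixed for the parent)
    return p_row * (1 - p_row) / (p_col * (1 - p_col))


def scan_partitions(values, labels, min_size=1):
    """Yield (threshold, (a, b, c, d)) for each split 'value <= threshold'.

    Rows are the subnodes (<= threshold, > threshold); columns are labels 1 and 0.
    Splits with a subnode smaller than min_size are skipped.
    """
    for t in sorted(set(values))[:-1]:
        a = sum(1 for v, y in zip(values, labels) if v <= t and y == 1)
        b = sum(1 for v, y in zip(values, labels) if v <= t and y == 0)
        c = sum(1 for v, y in zip(values, labels) if v > t and y == 1)
        d = sum(1 for v, y in zip(values, labels) if v > t and y == 0)
        if min(a + b, c + d) >= min_size:
            yield t, (a, b, c, d)


# Hypothetical toy data: one measurement and one binary rating per unit.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

results = [(t, omega_m(*table)) for t, table in scan_partitions(values, labels)]
best_threshold = max(results, key=lambda r: r[1])[0]
print(best_threshold)  # the split with equal subnode sizes
```

In this toy scan the maximum falls at the threshold that divides the eight units into two subnodes of four, which is the sample-size effect described above.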
In contrast, $\delta_{c,a-c}$ is column sum invariant and yields very similar results to Yule's Q, which is invariant to scaling of either rows or columns (Fig 7); the correlation is higher than 0.99 for 16 NHC quality measures, and the lowest is 0.91. Note that this similarity does not represent a special relation with Q; it results from the numerical properties of $\delta_{c,a-c}$ and $\mu_{c,a+c}$ for these data (Eq 4). The lower correlation (r = 0.78) between $\delta_{c,a-c}$ and $\delta_{r,a-c}$ confirms that different forms of proportional variation can be distinguished; $\delta_{r,a-c}$ also measures the difference in subnode composition but is row sum invariant. The U-shaped $\delta_{c,a-c}$ association graph has two maxima, so there are two possible CART partitions (Table 4). The relatively small subnode with Rehospitalized below 13.3% is enriched in the 5 star rating, corresponding to better than average patient care. Above 32.6%, the patient care is worse than average because it is associated with enrichment of the 1 star rating. The middle range from 13.3% to 32.6% includes the majority of nursing homes with average performance. In comparison, $IG_G$ and $IG_E$ produce subnodes that are nearly equal in size and with much lower degrees of enrichment in the '1 star, 5 star' proportions. Thus, $\delta_{c,a-c}$ is more effective than $IG_G$ and $IG_E$ in identifying partitions that correspond to a difference in the '1 star, 5 star' composition.

The logistic regression method provides a graphical view of the effect of sample size parameters on proportional variation in categorical data analysis (Fig 8A). The '1 star, 5 star' rating data were analyzed using the LogisticRegression function in the scikit-learn library with the 'lbfgs' solver [54]. A moving average of the '5 star' rating proportion is included in the graph as a reference for the logistic curve. The normalized '5 star' proportion adjusted for inequality in the '1 star, 5 star' sample sizes and the corresponding adjusted logistic curve are shown in Fig 8B.
The variation in proportion confirms that the left and right tails of the Rehospitalized distribution correspond to nursing homes with above and below average performance, respectively, consistent with the CART association results. The logistic model for the '5 star' proportion, $y = c_5/(c_1 + c_5)$, is usually expressed as

$$y = \frac{1}{1 + e^{-(a + bx)}}, \qquad (25)$$

where the parameters, (a, b), are determined from the curve fit. The adjustment for the logistic curve was obtained using the change in coordinates

$$a = -b x_0 - \ln\!\left(\frac{n_1}{n_5}\right),$$

where $n_1$ and $n_5$ are the sample sizes for the 1 star and 5 star ratings in the data set, respectively. Substitution into Eq (25) yields

$$y = \frac{1}{1 + \dfrac{n_1}{n_5}\, e^{-b(x - x_0)}},$$

such that $y(x_0) = n_5/(n_1 + n_5)$. In a data set where $n_1 = n_5$, $y(x_0) = 1/2$, and $x_0$ corresponds to the midpoint value for the logistic curve. Then, there are two sample-size-independent parameters, b and $x_0$.
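The change of coordinates can be verified with a few lines of code. This is a minimal sketch, assuming the logistic form y = 1/(1 + exp(−(a + bx))) and assuming that the adjusted curve is obtained by setting n_1 = n_5; the fitted values of a and b below are hypothetical and are not taken from the paper, while the sample sizes are those quoted in the Data preparation section.

```python
import math


def logistic(x, a, b):
    """Logistic curve y = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def midpoint_from_intercept(a, b, n1, n5):
    """Solve a = -b*x0 - ln(n1/n5) for x0."""
    return -(a + math.log(n1 / n5)) / b


def adjusted_logistic(x, b, x0):
    """Curve for equal sample sizes (n1 = n5), so that y(x0) = 1/2."""
    return 1.0 / (1.0 + math.exp(-b * (x - x0)))


n1, n5 = 1394, 2649   # 1 star and 5 star sample sizes quoted in the text
a, b = 3.0, -0.1      # hypothetical fitted parameters, for illustration only
x0 = midpoint_from_intercept(a, b, n1, n5)

# At x0 the unadjusted curve equals the overall '5 star' fraction n5/(n1 + n5).
print(logistic(x0, a, b), n5 / (n1 + n5))
# At x0 the adjusted curve passes through 1/2.
print(adjusted_logistic(x0, b, x0))
```

The slope b is unchanged by the adjustment; only the intercept absorbs the sample size ratio, which is why b and x_0 can be treated as sample-size-independent.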

Discussion
The renewed warnings from the statistics community about the limitations of statistical significance methodology have created a perplexing situation, given that there is a wide range of opinion on the underlying causes and solutions [55,56]. Effect size has also been proposed as a better alternative [25,26,57], but the lack of consensus on the utility of commonly used association coefficients, such as the odds ratio [8,10], the simple matching coefficient and ϕ [5,11], hinders development of this approach. In this paper, we describe a rigorous framework for representing proportional variation in a 2 × 2 table, which helps in resolving the marginal sum dependence problem for association coefficients. We show that a 2 × 2 table is associated with four forms of proportional variation resulting from the factorization as a product of proportion and diagonal row or column sum matrices. Association coefficients, such as ϕ, the odds ratio, and the simple matching coefficient, which do not distinguish between rows and columns, correspond to averages of proportional effects and lack clear interpretation. The two-component structure implies that there are two degrees of freedom corresponding to the displacement of two point vectors in the standard one-simplex, $\Delta^1$. An effect size measure then requires the specification of a perspective function of the center-of-mass coordinates, (δ, μ), which is potentially unique for each application because of differences in cost-benefit trade-offs.

In practice, classification problems vary widely in difficulty depending on the degree of overlap between the underlying distributions. Fisher's iris data set [58] is an example of a classification problem for well separated distributions, where different association coefficients achieve similar results because of degeneracy, particularly when the 2 × 2 table is diagonally symmetric or the effects are highly correlated. Conversely, differences in performance between association coefficients are best observed when the underlying distributions overlap. We also show that both Gini and entropy information gain are subject to dependence on marginal sums, which degrades the performance of the CART algorithm. Alternatively, the proportion difference with marginal sum invariance for the response variable provides a significant improvement in the performance of the CART algorithm. The results in this paper demonstrate that equalization of either row or column sums of a 2 × 2 table serves as a correction for unbalanced sample sizes, as suggested by Goodman and Kruskal [2].
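The effect of unbalanced sample sizes, and of equalizing column sums, can be illustrated with a small numerical sketch. The tables and function names below are hypothetical; equalization is implemented here simply as division of each column by its sum, which is one possible reading of the correction and is stated as an assumption.

```python
def phi(a, b, c, d):
    """Phi coefficient of the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5


def odds_ratio(a, b, c, d):
    """Odds ratio ad / bc."""
    return (a * d) / (b * c)


def equalize_column_sums(a, b, c, d):
    """Assumed correction: rescale each column so that it sums to 1."""
    s1, s2 = a + c, b + d
    return a / s1, b / s2, c / s1, d / s2


# Hypothetical tables: 'scaled' is 'table' with the second column multiplied by 10.
table = (30, 10, 20, 40)
scaled = (30, 100, 20, 400)

print(odds_ratio(*table), odds_ratio(*scaled))   # unchanged by column scaling
print(phi(*table), phi(*scaled))                 # changes with the column sums

# After equalizing column sums, both tables give the same phi.
phi_eq_table = phi(*equalize_column_sums(*table))
phi_eq_scaled = phi(*equalize_column_sums(*scaled))
print(phi_eq_table, phi_eq_scaled)
```

In this example the odds ratio is identical for the two tables, whereas ϕ differs until the column sums are equalized.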