
Factoring a 2 x 2 contingency table

Abstract

We show that a two-component proportional representation provides the necessary framework to account for the properties of a 2 × 2 contingency table. This corresponds to the factorization of the table as a product of proportion and diagonal row or column sum matrices. The row and column sum invariant measures for proportional variation are obtained. Geometrically, these correspond to displacements of two point vectors in the standard one-simplex, which are reduced to a center-of-mass coordinate representation, (δ, μ). Then, effect size measures, such as the odds ratio and relative risk, correspond to different perspective functions for the mapping of (δ, μ) to ℝ. Furthermore, variations in δ and μ will be associated with different cost-benefit trade-offs for a given application. Therefore, pure mathematics alone does not provide the specification of a general form for the perspective function. This implies that the question of the merits of the odds ratio versus relative risk cannot be resolved in a general way. Expressions are obtained for the marginal sum dependence and the relations between various effect size measures, including the simple matching coefficient, odds ratio, relative risk, Yule’s Q, ϕ, and Goodman and Kruskal’s τc|r. We also show that the Gini information gain (IGG) is equivalent to ϕ² in the classification and regression tree (CART) algorithm. Then, IGG can yield misleading results due to the dependence on marginal sums. Monte Carlo methods facilitate the detailed specification of stochastic effects in the data acquisition process and provide a practical way to estimate the confidence interval for an effect size.

Introduction

In research with contingency tables, the ability to compare experimental results from different studies is essential for studying the dependence between categorical variables and how it is maintained. However, the data acquisition is controlled by sample size parameters that appear as row and column sums for the various categories. Association coefficients that are not adjusted for unbalanced sample size can differ between tables even if the underlying system response is unchanged [1, 2]. The dependence of the ϕ coefficient on the margins led to the development of the normalized form, ϕ/ϕmax [3, 4]. Recently, VanLiere and Rosenberg investigated the allele frequency dependence of the r² linkage disequilibrium measure [5]; note that ϕ and r refer to the same coefficient. Olivier and Bell discussed the limitations of the ϕ coefficient and proposed effect size thresholds for the odds ratio because it is a measure that is “not problematic” [6]. The odds ratio is invariant to scaling of rows or columns, but there is continuing debate on the merits of the odds ratio versus the relative risk [7–10]. Warrens [11] showed that members of the general family of association coefficients that are linear transformations of the simple matching coefficient do not satisfy all three desiderata for a well-behaved coefficient. The lack of consensus on the utility of the many alternative effect size measures [11, 12] led us to consider whether there might be a core set of principles and elementary properties for 2 × 2 tables that might broadly apply. In this paper, we review coordinate systems for representing proportional variation in a 2 × 2 table, which corresponds to a two-component system of point vectors in the standard one-simplex with two degrees of freedom. Then, we examine the equivalence class of tables induced by an odds ratio.
The scaling invariance corresponds to a diagonal symmetry such that an odds ratio does not possess a simple interpretation in terms of proportional effects. We discuss the connections between proportion difference, odds ratio, Yule’s Q, and relative risk and show that an effect size statistic is more generally regarded as a perspective function, i.e., a linear fractional transformation [13] of proportional variation. A contingency table factors into a product of proportion and diagonal row or column sum matrices. Rows and columns of the proportion matrix correspond to different representations of the relation between categorical variables. Therefore, a 2 × 2 table is associated with four different forms of proportional variation. Together, these constitute the full implementation of the Goodman and Kruskal proposal that adjustment for unbalanced sample size is needed in the estimation of effect size [2]. Various forms of stochastic effects can affect a data acquisition process, so a 2 × 2 table is associated with a distribution. We discuss the use of Monte Carlo methods as a practical way to simulate a distribution of tables and estimate the confidence interval for an effect size. Finally, our interest in effect size measures developed in the course of plant breeding research at DuPont to identify agriculturally beneficial genetic variation in maize [14]. These studies involved high-dimensional search to assess linkage disequilibrium and genome-wide association (GWAS) in maize populations, including the use of the classification and regression tree (CART) algorithm. An essential step in CART is an exhaustive search over the range of each independent variable for an optimal binary partition of the response data [15, 16]. We show that the Gini information gain is equivalent to ϕ², and we compare their behavior with a scaling invariant effect size measure using a publicly available data set.
Satisfactory resolution of these longstanding issues in the application of effect size for statistics would have broad implications for high-dimensional data analysis and machine learning. The main novel contributions of this work are: 1) identification of the correspondence between factoring the 2 × 2 table and effect size, 2) identification of the four forms of proportional variation with row or column sum invariance, 3) identification of an effect size measure for a 2 × 2 table as a mapping of proportional variation for a two-component system in △1 × △1 to ℝ, 4) identification of the equivalence between Gini information gain and the ϕ coefficient, 5) development of an improved CART association algorithm using a proportional displacement measure with correction for unbalanced sample size for the response.

1 Methods

1.1 Notation

In this work, we study the connection between odds ratio, proportion, and ϕ for a 2 × 2 table. Our notation for the three required coordinate systems is briefly summarized here. We deviate slightly from convention and use the symbol △1 to designate the standard one-simplex [13] such that the dot product of a vector, u ∈ △1, with the one-vector satisfies the condition u · 1 = 1. Ratio vectors, (α, 1) and (β, 1), with α, β ≥ 0, are elements of the projective line, ℙ¹. (α, 1) corresponds to the proportion, pα = α/(α + 1), and the proportion vector, pα = (pα, 1 − pα), in △1. The subscript for a proportion corresponds to its coordinate. Similarly, (β, 1) corresponds to the proportion vector pβ = (pβ, 1 − pβ). (a, b), (c, d), (a, c), and (b, d) are vectors in ℝ². (a, b) corresponds to the ratio vector, (a/b, 1), in ℙ¹. (a/b, 1) corresponds to the proportion, pa/b = a/(a + b), and the proportion vector, (pa/b, pb/a) = (pa/b, 1 − pa/b), in △1. Ratio and proportion vectors are defined in a similar way for the other vectors. The slightly cumbersome subscript notation is necessary because we are working with proportions for both row space, such as ‘pa/b’, and column space, such as ‘pa/c’. However, in subscripts for marginal sum proportions the division by N is dropped; e.g., pa+c = (a + c)/N, where N = a + b + c + d. Ratio and proportion vectors are examples of perspective functions of the general form P(u, t) = u/t for u ∈ ℝⁿ and t > 0 [13]. Another familiar example is normalization by the Euclidean norm, P(u, ‖u‖2) = u/‖u‖2.
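A short numerical illustration of this notation may be helpful; the counts below are hypothetical, and the perspective normalization P(u, t) = u/t is applied to convert a ratio vector to a proportion vector in △1.

```python
# Hypothetical counts (a, b, c, d); the perspective function P(u, t) = u/t
# converts a ratio vector to a proportion vector in the standard one-simplex.
a, b, c, d = 10, 30, 30, 20
N = a + b + c + d

p_ab = a / (a + b)            # p_a/b: proportion for the row vector (a, b)
p_ba = b / (a + b)            # complementary coordinate, p_b/a = 1 - p_a/b

# The proportion vector (p_a/b, p_b/a) lies in the one-simplex:
# its dot product with the one-vector is 1.
assert abs(p_ab + p_ba - 1.0) < 1e-12

# Marginal sum proportion, with the division by N implicit in the subscript.
p_a_plus_c = (a + c) / N
```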

1.2 Coordinate systems for proportion and odds ratio

In this section, we discuss coordinate systems for representing binary proportional variation in categorical data analysis. For the point vector (a, b) ∈ ℝ², the ratio corresponds to a linear fractional transformation, a/b = (1 + δs)/(1 − δs), (1) where δs is the difference in proportion. The ‘s’ designation arises from the connection with the proportional displacement, δs, between the pair of vectors (a, b) and (b, a), δs = pa/b − pb/a = (a − b)/(a + b), (2) and the correspondence of these vectors to a diagonally ‘symmetric’ 2 × 2 table, as described in Section 1.4. We will encounter several expressions of the form Eq (1), indicating that elements of projective geometry [13, 17] provide the framework for the analysis of proportional variation. Consequently, our objective is to identify vector algebraic structures for representing proportional variation in asymmetric 2 × 2 tables. They provide the framework for analyzing the relationships between binary proportion, odds ratio, Yule’s Q, relative risk, and ϕ.
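The linear fractional relation can be checked numerically; the sketch below assumes the reading δs = pa/b − pb/a = (a − b)/(a + b), with hypothetical counts.

```python
# Numerical check of the linear fractional relation between ratio and
# proportional displacement, assuming delta_s = (a - b)/(a + b).
a, b = 10, 30
delta_s = (a - b) / (a + b)                       # proportional displacement
assert abs(a / b - (1 + delta_s) / (1 - delta_s)) < 1e-12
```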

Proportional normalization of a ratio vector produces a proportion vector, which is an element of △1 (Fig 1). Then, a proportion vector has the form v = (v1, 1 − v1), with derivative dv = dv1(1, −1) such that dv · 1 = 0 for 0 ≤ v1 ≤ 1. In contrast, the corresponding ratio vector has the form (v1/(1 − v1), 1), with derivative (dv1/(1 − v1)²)(1, 0). Then, the difference between proportion vectors u and v is parameterized by a single parameter, u1 − v1, and variation in binary proportion corresponds to translation in △1. The difference between ratio vectors is also parameterized by a single parameter, u1/(1 − u1) − v1/(1 − v1). Therefore, △1 and ℙ¹ correspond to different constraints in representing proportional variation. However, the order of categories in a contingency table is arbitrary, and it is not possible to identify a unique category that should serve as the perspective coordinate for a ratio. This introduces ambiguity, as we will see later in the discussion of the odds ratio. On the other hand, in factoring out the effects of marginal sums, the △1 representation provides an important function in the analysis of 2 × 2 tables.

Fig 1. Coordinates for a two-component binary proportional system.

Proportional variation for the vectors, (a, b) and (c, d), is represented either as points, (a/b, 1) and (c/d, 1), in ℙ¹, or as points in the standard one-simplex, △1. δ is the proportional displacement between the vectors. Proportion and ratio are related by a linear fractional transformation, as indicated by the dashed lines.

https://doi.org/10.1371/journal.pone.0224460.g001

Now, we discuss the representation of a two-component system of binary proportions in the △1 and ℙ¹ coordinate systems, and describe intrinsic properties of various effect size measures. The formulae take on a more compact, intuitive form because scaling invariance is built in. The algebraic intuition gained here helps in comprehending the more cumbersome expressions obtained later using the ℝ² representation. The exception is the ϕ coefficient, which does not possess a △1 representation due to the lack of scaling invariance (Section 1.4). In particular, we discuss properties of the odds ratio, ω = β/α, where α, β ≥ 0, corresponding to (α, 1) and (β, 1) on the line, respectively. Then, relative risk is defined as ρ = pβ/pα, where pβ = β/(β + 1) and pα = α/(α + 1). The corresponding proportional basis consists of pα = (pα, 1 − pα) and pβ = (pβ, 1 − pβ). Next, we introduce the center-of-mass basis with the parameters μ = (pα + pβ)/2 and δ = (pβ − pα)/2; note that the alternative basis δαβ and μα+β would also suffice. Then, variation is represented by the two-parameter vector (δ, μ), reflecting the fact that there are two degrees of freedom. Using the relations α = pα/(1 − pα), β = pβ/(1 − pβ), pα = μ − δ, and pβ = μ + δ, we obtain ω = (μ + δ)(1 − μ + δ)/[(μ − δ)(1 − μ − δ)]. (3) Then, we introduce Yule’s Q [1] to obtain Q = (ω − 1)/(ω + 1) = δ/[μ(1 − μ) + δ²]. (4) Similarly, the relative risk is expressed as ρ = (μ + δ)/(μ − δ), (5) and the ratio difference is expressed as β − α = 2δ/[(1 − μ)² − δ²]. (6) Inspection of Eqs (3–6) shows that the odds ratio and relative risk correspond to linear fractional transformations of proportional variation, and an effect size statistic corresponds to a perspective function P((δ, μ), t) = (δ/t, μ/t), where t is a polynomial function of δ and μ. However, algebraic considerations alone are not sufficient to explain why a particular form might be preferred for t or to provide operational interpretations for the different perspective normalizations in Eqs (4–6). In his 1912 paper, Yule remarked that the Q coefficient has the merit of possessing a simple form “but the demerit of not possessing an equal simplicity of interpretation” [1].
Given the lack of an interpretation for the different normalizations, we find that Yule’s remark also extends to the odds ratio and relative risk. Furthermore, rearranging Eqs (4) and (5) gives the corresponding relations μ(1 − μ) = δ(1 − Qδ)/Q (7) and δ = μ(ρ − 1)/(ρ + 1), (8) with 0 ≤ μ − δ ≤ 1 and 0 ≤ μ + δ ≤ 1. Each of the four forms of proportional variation identified in Section 1.3 satisfies these relations. Thus, there is a range of values of (δ, μ) for a fixed value of either Q or ρ (Fig 2). This ambiguity in proportional effects explains why the question of the merits of the odds ratio versus relative risk is still not resolved [18, 19]. A more precise approach would take into account the two-dimensional nature of the proportional variation, which could involve separate thresholds for δ and μ. In any case, the specification of a perspective function should be based on the assessment of cost-benefit trade-offs for variations in δ and μ, which will depend on the particular application.
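These relations can be verified numerically. The following sketch uses hypothetical values and closed forms derived algebraically from the center-of-mass relations pα = μ − δ and pβ = μ + δ:

```python
# Check that odds ratio, Yule's Q, relative risk, and the ratio difference
# are different perspective normalizations of the same (delta, mu) vector.
p_alpha, p_beta = 0.4, 0.6            # hypothetical proportions
mu = (p_alpha + p_beta) / 2           # center of mass
delta = (p_beta - p_alpha) / 2        # half-displacement

alpha = p_alpha / (1 - p_alpha)
beta = p_beta / (1 - p_beta)

omega = beta / alpha                  # odds ratio
rho = p_beta / p_alpha                # relative risk
Q = (omega - 1) / (omega + 1)         # Yule's Q

# Odds ratio as a linear fractional transformation of (delta, mu):
omega_dm = (mu + delta) * (1 - mu + delta) / ((mu - delta) * (1 - mu - delta))
assert abs(omega - omega_dm) < 1e-12

# Yule's Q and relative risk as different perspective normalizations:
Q_dm = delta / (mu * (1 - mu) + delta ** 2)
assert abs(Q - Q_dm) < 1e-12
assert abs(rho - (mu + delta) / (mu - delta)) < 1e-12

# Ratio difference:
assert abs((beta - alpha) - 2 * delta / ((1 - mu) ** 2 - delta ** 2)) < 1e-12
```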

Fig 2. Center-of-mass coordinates for a two-component biproportional system.

In the △1 representation, the center-of-mass coordinates are μ = (pα + pβ)/2 and δ = (pβ − pα)/2. The proportional variation, (μ, δ), for fixed odds ratio, ω = 2, and relative risk, ρ = 1.2, is shown. The odds ratio and relative risk are perspective functions of the center-of-mass coordinates.

https://doi.org/10.1371/journal.pone.0224460.g002

1.3 Decomposition of proportional variation for a 2 × 2 contingency table

In this section, the two-component framework is used in the analysis of proportional variation for a 2 × 2 table (Table 1). We are particularly concerned with the confounding effect of the row and column sums in the formulation of association measures [2, 5, 11]. Each marginal sum corresponds to a categorical sample size that is determined by experimental procedure. Suppose the first row of Table 1 is multiplied by a number k to reflect a change in sample size; then, (a, b) ↦ (ka, kb). Then, the simple matching coefficient [11], sM = (a + d)/N with N = a + b + c + d, is not invariant to scaling by k. Alternatively, each marginal sum serves as a proportional normalization factor; e.g., P((a, b), a + b). Then, sM can be expressed as the weighted sum of proportions sM = [(a + c)/N] pa/c + [(b + d)/N] pd/b (9) for columns, or sM = [(a + b)/N] pa/b + [(c + d)/N] pd/c (10) for rows. The proportions are invariant to scaling of either rows or columns, but the corresponding weights (xi) are not, because the overall sum, N, does not distinguish between row and column sums. Therefore, sM can differ between two tables because of differences in sample size even though the underlying system response properties might be unchanged. Warrens [11] has shown that members of the general family of coefficients that are linear transformations of sM do not satisfy the criteria for a well-behaved coefficient. As discussed by Goodman and Kruskal [2], dependence on sample size parameters complicates the interpretation of association coefficients. The concepts discussed in this paper support their proposal that normalization to adjust for unbalanced sample sizes is necessary.
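The sample size dependence of the simple matching coefficient can be demonstrated directly; the counts below are hypothetical.

```python
# The simple matching coefficient s_M = (a + d)/N is not invariant to row
# scaling, while the row proportions are.
def s_match(a, b, c, d):
    return (a + d) / (a + b + c + d)

a, b, c, d = 10, 30, 30, 20
k = 10                                  # scale the first row by k
sm1 = s_match(a, b, c, d)
sm2 = s_match(k * a, k * b, c, d)
assert sm1 != sm2                       # s_M shifts with the sample size change

# The row proportion p_a/b is unchanged by the scaling.
assert abs(a / (a + b) - (k * a) / (k * a + k * b)) < 1e-12
```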

The invariance of the odds ratio to scaling of either rows or columns is expressed as ω = ad/(bc) = (ka)d/((kb)c) = a(kd)/(b(kc)), (11) k > 0. This expression remains valid if b and c are exchanged (bc ↦ cb) or a and d are exchanged (ad ↦ da). Thus, the odds ratio does not distinguish between ratios for rows and columns [18, 20], which introduces ambiguity with respect to proportional effects. Consider the equivalence class of tables obtained by unitary scaling of the diagonal elements (‘u-scaling’), (a, b, c, d) ↦ (ja, kb, c/k, d/j), with j, k > 0 (Table 2). The two numerical examples of such tables shown in Fig 3 demonstrate that while the odds ratio and Q are invariant, the proportions are not. Furthermore, in the special case where j = √(d/a) and k = √(c/b), the row and column sums are equalized due to the geometric averaging of the diagonal elements, and the Yule symmetric table (Table 3) is obtained. This table serves as the basis for Yule’s Y coefficient [1], also known as the coefficient of colligation [21]. However, row and column sums are linearly related by a column proportion matrix; e.g., (a + b, c + d)ᵀ = Pcsum,c (a + c, b + d)ᵀ. This linear relation is not preserved by u-scaling because of the mixing of effects between rows and columns (Table 2), so the odds ratio by itself is not suitable as an effect size measure. The linear relation also implies that row and column sums play equal roles as sample size parameters, directly or indirectly, and that either rows or columns can be equalized, but not both simultaneously. It is necessary to choose between rows or columns in conditioning a contingency table for unbalanced sample sizes.

Fig 3. Contingency tables with fixed odds ratio.

While the odds ratio, ω = ad/(bc), is fixed in these tables, the proportions are not. The Yule Q statistic is also invariant because it is related to ω by the linear fractional transformation Q = (ω − 1)/(ω + 1).

https://doi.org/10.1371/journal.pone.0224460.g003

A self-consistent representation of proportional variation must account for the scaling invariance of the odds ratio. Therefore, our objective is to obtain a decomposition of the odds ratio in terms of elementary proportions by conditioning for the effect of the marginal sums. Consider scaling of the expression ωbc − ad = 0 by the column sums to obtain the fractional representation ω (c/(a + c))(b/(b + d))/(n1n2) − (a/(a + c))(d/(b + d))/(n1n2) = 0, (12) where n1 and n2 are normalization factors for the subsequent conversion to proportion vectors. Since there are two ways to express the odds ratio as a product of ratios, there are also two ways to group the fractional products to form proportion vectors. The standard grouping is formed from the columns of the table with n1 = n2 = 1 to obtain the two vectors (pa/c, pc/a) and (pb/d, pd/b). (13) However, we also obtain a second pair of vectors formed from the rows with n1 = pa/c + pb/d and n2 = pc/a + pd/b, yielding (pa/c/n1, pb/d/n1) and (pc/a/n2, pd/b/n2). (14) The proportions in both Eqs (13) and (14) are invariant to scaling of columns, as required. The second form of proportional variation corresponds to an effect size measure with the normalization needed for experimental work, and has not been previously mentioned in the effect size literature to the best of our knowledge. Proportion vectors invariant to the scaling of rows are obtained in a similar way. A more concise way to obtain the proportion vectors is to observe that a matrix can be factored as a product of a diagonal column sum (Mcsum) or row sum (Mrsum) matrix and proportion matrices, Pcsum,c|r or Prsum,c|r, respectively: M = Pcsum,c Mcsum, with Mcsum = diag(a + c, b + d), (15) and M = Mrsum Prsum,r, with Mrsum = diag(a + b, c + d). (16) The Ncsum,c|r and Nrsum,c|r proportion normalization factors provide the different scaling structures (Eq 12) needed for column and row proportion matrices, which correspond to different projective representations of the relationship between variables (Fig 4). The standard protocol is to equalize the marginal sums for the response or dependent variable, and calculate the response effect size for variation of the treatment or independent variable.
Depending on whether the response variable is listed in columns or rows, the corresponding representation would be either Pcsum,r or Prsum,c, respectively. Examples of corresponding proportion difference measures, δc,ac and δr,ab, are also shown in Fig 4. Our subscript notation is explained by the following example: δr,ab corresponds to the difference between the ‘a’ and ‘b’ elements of the Prsum,c proportion matrix. Then, calculation of an effect size requires the specification of a perspective function for mapping the relevant (δ, μ) vector to ℝ (Section 1.2). Proper practice also requires that an effect size estimate be qualified by a confidence interval (Section 1.5).
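The two factorizations can be illustrated with a small numerical sketch; the counts are hypothetical, and the proportion matrices are built by dividing each entry by its column or row sum.

```python
# The table equals a column proportion matrix times a diagonal matrix of
# column sums, or a diagonal matrix of row sums times a row proportion matrix.
# The margins are then linearly related through the column proportion matrix.
a, b, c, d = 10.0, 30.0, 30.0, 20.0
M = [[a, b], [c, d]]
csum = [a + c, b + d]                     # column sums
rsum = [a + b, c + d]                     # row sums

def close(x, y, tol=1e-9):
    return abs(x - y) < tol

# Column-sum factorization: columns of P_csum sum to 1.
P_csum = [[M[i][j] / csum[j] for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        assert close(P_csum[i][j] * csum[j], M[i][j])

# Row-sum factorization: rows of P_rsum sum to 1.
P_rsum = [[M[i][j] / rsum[i] for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        assert close(rsum[i] * P_rsum[i][j], M[i][j])

# Row sums and column sums are linearly related by the column
# proportion matrix: (a + b, c + d) = P_csum (a + c, b + d).
for i in range(2):
    assert close(P_csum[i][0] * csum[0] + P_csum[i][1] * csum[1], rsum[i])
```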

Fig 4. Four forms of proportional variation for a 2 × 2 table.

Separate proportion matrices are obtained in factoring a 2 × 2 matrix for scaling by the column sum (csum) or the row sum (rsum). Columns and rows of a proportion matrix correspond to different representations of the relationship between categorical variables.

https://doi.org/10.1371/journal.pone.0224460.g004

1.4 The ϕ coefficient

In this section, we discuss why ϕ does not serve as a well-behaved effect size measure and further explain the connection between δs and diagonally symmetric 2 × 2 tables. The ϕ coefficient is of particular importance in GWAS because it serves as a standard measure of linkage disequilibrium between molecular markers [3, 5]. The popularity of ϕ is due to its correspondence with Pearson’s correlation coefficient. Binary {0, 1} representations are invoked for the categorical variables, and the correlation coefficient formula is then applied to obtain ϕ = (ad − bc)/√((a + b)(c + d)(a + c)(b + d)), (17) which is also often referred to as ‘r’. However, the limitations of ϕ as an association measure are well known [3, 5, 6, 11, 22]. Alternatively, ϕ is obtained from the relation with Pearson’s chi-squared statistic, χ² = (a + b + c + d)ϕ² [3, 23], which also averages over rows and columns, resulting in confounding effects. Introducing the ratio product for the marginal sums, M = √((a + c)(b + d)/((a + b)(c + d))), (18) ϕ can be written as the row sum factorization ϕ = M⁻¹(a/(a + b) − c/(c + d)), which corresponds to the scaling of Prsum,r. Therefore, ϕ corresponds to u-scaling of the 2 × 2 table. Alternatively, the column sum factorization for ϕ is ϕ = M(a/(a + c) − b/(b + d)), which corresponds to the u-scaling of Pcsum,c. The following factorizations also hold: ϕ = M(a/(a + c) − b/(b + d)), (19) ϕ = M(d/(b + d) − c/(a + c)), (20) ϕ = M⁻¹(a/(a + b) − c/(c + d)), (21) ϕ = M⁻¹(d/(c + d) − b/(a + b)). (22) Consequently, each proportion difference, δi, is associated with a factorization ϕ = Miδi, where Mi depends on the marginal sums. Therefore, ϕ corresponds to a weighted average of the δi. The multiplication of row and column sums together in each Mi has a compounding effect because the sums are not independent.
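The marginal sum factorizations of ϕ can be verified numerically; the marginal factor M below follows from the definition of ϕ, and the counts are hypothetical.

```python
# Check that phi factors as a marginal sum term times a proportion difference
# for both the column- and row-conditioned differences.
from math import sqrt

a, b, c, d = 10.0, 30.0, 20.0, 40.0
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Column-conditioned proportion difference and its marginal sum factor:
delta_col = a / (a + c) - b / (b + d)
M = sqrt((a + c) * (b + d) / ((a + b) * (c + d)))
assert abs(phi - M * delta_col) < 1e-12

# Row-conditioned proportion difference pairs with the reciprocal factor:
delta_row = a / (a + b) - c / (c + d)
assert abs(phi - delta_row / M) < 1e-12
```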

Consider a diagonally symmetric 2 × 2 table with d = a and c = b in Table 1, and equal row and column sums. Then, Eq (12) becomes ω(b/(a + b))² = (a/(a + b))², which corresponds to the proportion vectors (pa/b, pb/a) and (pb/a, pa/b), and the proportion difference δs (Eq 2). Since pβ = 1 − pα, the △1 coordinates reduce to (δ, μ) = (δs/2, 1/2), so there is only one degree of freedom. Thus, there is a correspondence between 2 × 1 tables [23] and diagonally symmetric 2 × 2 tables. However, Mi = 1 for diagonally symmetric tables, and Eq (17) simplifies to give ϕs = δs. Thus, δs and ϕs are equivalent measures of proportional variation. Conversely, the δi in Fig 4 can be regarded as constituting an extension of ϕs to asymmetric tables. The ϕ coefficient per se does not account for the loss of symmetry when Mi ≪ 1, because it does not distinguish between the δi. However, when Mi ≈ 1, the four expressions collapse into one, or nearly so, and the values of ϕ and δi will be approximately the same. This includes the case where either b = c = 0 or a = d = 0, resulting in a diagonal 2 × 2 table. The connection with ϕ suggests that Cohen’s recommendations of effect sizes of 0.1, 0.3, and 0.5 for small, medium, and large effects, respectively, for ϕ [6, 24] can also be invoked for the various forms of δi, but this assumes that the μi coordinate is irrelevant.

1.5 Confidence interval for proportional effects

Each step of a data acquisition process is subject to stochastic effects, and data quality can vary between data sets. Therefore, the specification of a confidence interval (CI) for the effect size is an integral part of data analysis [25, 26]. A contingency table for experimental data is associated with a distribution of tables and corresponding distributions for the effect size. The specification of this distribution must be based on a realistic assessment of all sources of error and uncertainty to form an error model for the data. For binary variables, a common approach is to estimate variance from a binomial distribution; the normal distribution is a useful approximation for large sample sizes. Then, estimating the CI for an effect size requires a propagation of error calculation, which is often not straightforward. Analytical approaches for estimating confidence intervals for ratios [27, 28], proportion and difference of two proportions [29, 30], correlation coefficients [31, 32], and odds ratios [9] are already quite involved. Fractional transformation, the bounded range, and the discrete properties of an effect size for proportional variation introduce complications that make it difficult to obtain convenient expressions for error propagation. Alternatively, Monte Carlo (MC) methods [33, 34] provide a more practical approach to estimate confidence intervals for quantities such as δr,ba and δc,ca. In an MC simulation, a 2 × 2 MC table is obtained by generating the N = a + b + c + d events by making random draws according to specified sample proportions [9]. A set of MC tables is obtained by repeating the sampling process many times; MC distributions are formed for proportions and effect size from the MC tables. Many MC runs are performed, collecting the relevant statistics for each MC distribution, including the mean, median, variance, and histogram. Finally, the degree of convergence for the MC simulation is estimated from the statistics for the MC runs.
Fig 5A and 5C show constrained MC simulations with fixed column sums, n1 = a + c and n2 = b + d, and sampling proportions a/n1 and b/n2, respectively. Fig 5B and 5D show greater internal scatter because only the overall sum, N, is fixed, with corresponding sampling proportions a/N, b/N, c/N, and d/N. Even though the underlying distributions are discrete, the ±2σ interval for a normal distribution serves as a good approximation for the δc,ca confidence interval in this example. More generally, the distribution of effect size is asymmetric, which would be represented by separate confidence intervals for positive and negative deviations from the median. The advantage of the MC method is that the simulation can accommodate a detailed specification of the error model, including heteroscedasticity [25, 35] and correction for attenuation from misclassification [35, 36]. This capability is essential in accounting for the effects of instrumental and other operational factors on the quality of data produced by a data acquisition system.
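A minimal MC sketch of the fixed-column-sum variant is given below; note that the effect size used here is the plain column proportion difference a/(a + c) − b/(b + d), a simplification of the δc,ca measure (which applies a further row normalization), and the counts are hypothetical.

```python
# Constrained Monte Carlo: draw each column count from a binomial with the
# observed column proportion, then take percentiles of the resulting
# effect size distribution as an approximate 95% interval.
import random

random.seed(1)
a, b, c, d = 10, 30, 30, 20
n1, n2 = a + c, b + d                     # fixed column sums
p1, p2 = a / n1, b / n2                   # sampling proportions

def binom(n, p):
    """Binomial draw as a sum of Bernoulli trials."""
    return sum(random.random() < p for _ in range(n))

deltas = []
for _ in range(2000):
    a_mc = binom(n1, p1)                  # column 1 draw; c_mc = n1 - a_mc
    b_mc = binom(n2, p2)                  # column 2 draw; d_mc = n2 - b_mc
    deltas.append(a_mc / n1 - b_mc / n2)

deltas.sort()
lo = deltas[int(0.025 * len(deltas))]     # 2.5th percentile
hi = deltas[int(0.975 * len(deltas))]     # 97.5th percentile
```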

Fig 5. Two sets of constrained Monte Carlo (MC) simulations of the distribution of proportional variation, (δc,ca, μc,a+c), for a 2 × 2 table with a, b, c, d = [10, 30, 30, 20].

A,C: MC with fixed column sums, n1 = a + c and n2 = b + d. B,D: MC with fixed overall sum N = a + b + c + d. A,B: Data for 10000 MC tables. The dashed lines indicate the expected values, δc,ca = 0.18 and μc,a+c = 0.473. C,D: Each histogram is the mean of 64 MC runs with 10000 MC tables per run. Each whisker is the ±2 standard deviation interval. The normal distribution, μ = 0.18 and σ = 0.0506, is shown as a dashed curve.

https://doi.org/10.1371/journal.pone.0224460.g005

1.6 Decomposition of proportional effects for an r × c table

A table with more than two rows or columns is commonly referred to as an r × c table. The matrix factorization (Eqs 15 & 16) extends in a straightforward way to produce the r × c proportion matrices. For independent and dependent variables with r and c categories, respectively, proportional variation is represented as r points in the standard △c−1 simplex, with r(c − 1) degrees of freedom. Various multicategorical association measures have been proposed for r × c tables. However, we choose Cramér’s V² [37, 38] as an example to illustrate the difficulties. V² is defined as a normalization of Pearson’s χ² such that χ² = n(q − 1)V², where n is the total event count and q = min(r, c). V is equivalent to ϕ for 2 × 2 tables. Similarly, it is straightforward to show that Goodman and Kruskal’s τc and τr [37] are both equivalent to ϕ² for 2 × 2 tables. These equivalences confirm that Pearson’s χ², V², τc, and τr are composite statistical quantities that average over alternative forms of variation and are therefore subject to ambiguous interpretation. The mappings consist of multidimensional sums and products across rows and columns, resulting in confounding effects because of the dependence between them.
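The 2 × 2 equivalence can be checked directly; the sketch below uses hypothetical counts, computes Pearson's χ² from the expected counts, and recovers V = |ϕ| (V is nonnegative, so it matches the magnitude of ϕ).

```python
# Cramer's V reduces to |phi| for a 2 x 2 table, via chi2 = n (q - 1) V^2
# with q = min(r, c) = 2.
from math import sqrt

a, b, c, d = 10.0, 30.0, 30.0, 20.0
n = a + b + c + d
obs = [[a, b], [c, d]]
rsum = [a + b, c + d]
csum = [a + c, b + d]

chi2 = sum((obs[i][j] - rsum[i] * csum[j] / n) ** 2 / (rsum[i] * csum[j] / n)
           for i in range(2) for j in range(2))
V = sqrt(chi2 / (n * (2 - 1)))

phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
assert abs(V - abs(phi)) < 1e-12
```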

In the absence of an engineering or functional model, the specification of a vector basis for proportional variation for an r × c table is not a well-posed problem [39]; i.e., there isn’t a unique solution. This constitutes a fundamental limitation for the formulation of an effect size measure. Consider a two-component proportional system represented by vectors, u, v ∈ △N, with N > 1. The two default center-of-mass vectors are μ = (u + v)/2 and δ = (u − v)/2. However, there isn’t a standard procedure for choosing the additional 2N − 2 vectors needed to form a complete basis. Alternatively, a single coordinate or a sum of coordinates could serve as the basis for estimating an effect size. This corresponds to choosing a △1 × △1 subspace for the representation of proportional variation; e.g., δ = (ui + uj) − (vi + vj), with (ui + uj, 1 − ui − uj), (vi + vj, 1 − vi − vj) ∈ △1. A representation of the 2N degrees of freedom for a two-component △N × △N system would require the specification of N 2 × 2 tables. Therefore, the 2 × 2 table serves an elementary role in the decomposition of multiproportional variation due to the minimal properties of △1. The recommended approach is to adopt a multidimensional representation of proportional variation and “reduce any multiple-level or multiple-variable relationship to a set of two-variable relationships” [25]. Similar advice has been given for avoiding the compounding effect of the ANOVA null hypothesis, to break down “complicated hypotheses into smaller, more easily understood pieces” [40]. Ways in which an r × c table might be partitioned and marginalized have been described by Kateri [41]. The objective is to construct a set of 2 × 2 tables that encompass relevant forms of proportional variation for the particular application. This multidimensional representation should be combined with the specification of cost-benefit trade-offs in assessing the effect size for proportional variation.
In the next section, we discuss the use of 2 × 2 tables in the CART algorithm. However, high-dimensional search is still a developing area [42, 43], and a detailed assessment of the pros and cons for various approaches is beyond the scope of this paper.

1.7 Gini information gain and ϕ²

In this section, we examine connections between effect size and information gain (IG) measures used in standard implementations of the CART algorithm. CART creates a binary decision tree by the recursive partitioning of the association between response and independent variables [4446]. Each node of the tree corresponds to a binary partition of the range of an independent variable. Each terminal node is a classification identified by a unique combination of intervals of the independent variables. In standard implementations, the partition parameters for a node are determined by maximizing IG for the response variable in an exhaustive search of associations over all independent variables. In each iteration, the set of statistics obtained for the binary partitions of an independent variable constitutes a CART association graph. Our objective is to compare CART graphs for effect sizes including IG. To simplify the discussion, we consider the case where the response variable is binary. Then, the data for a partition correspond to a 2 × 2 table [47]. Then, IG is defined as the parent node impurity, I(S), minus the weighted impurities for the subnodes I(S1) and I(S2), (23) where the weight factor is , ni is the number of elements in node Si, and n = n1 + n2. Two popular impurity measures are the entropy, E = −∑pjlnpj, and Gini impurity, , where pj is the proportion of class ‘j’ items in a set [16]. For a binary proportion vector, , and the Gini impurity becomes G(pm/n, pn/m) = 2pm/n pn/m. However, the xi are subject to the same limitations as the weight factors for sM (Eqs 9 & 10), and both IGE and IGG depend on the marginal sums. More concretely, we show that IGG and ϕ2 are equivalent in CART. Let the rows and columns of Table 1 correspond to the subnodes and categories for the response variable, respectively. Then, G(S) for the parent node is G(S1) and G(S2) are calculated from proportions for the row vectors (a, b) and (c, d), respectively. 
Then, IGG = G(S) − x1G(S1) − x2G(S2) = ϕ2G(S), (24) with substitution of the ϕ coefficient from Eq (17). Since G(S) is constant for binary partitions at a parent node, we conclude that IGG is equivalent to ϕ2. This confirms that IGG depends on the marginal sums through the xi, in which the normalization factor N = a + b + c + d does not distinguish between rows and columns. Information gain measures of the form of Eq (23), including IGE, are subject to this limitation. It is known that IGE and IGG yield very similar results in CART [48], which indicates that IGE shares the dependence on marginal sums (Table 4). The limitations of IGG raise the question of whether the column sum invariant δc,ac statistic might be more appropriate for CART, which we consider in the next section.
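As a numerical check, the equivalence in Eq (24) can be verified for an arbitrary 2 × 2 table. The sketch below is our illustration rather than code from the CART implementations cited above; it computes IGG directly from the impurity definitions and compares it with ϕ2G(S).

```python
# Verify IG_G = phi^2 * G(S) for a 2x2 table [[a, b], [c, d]]
# (rows = subnodes S1, S2; columns = binary response categories).
from math import sqrt

def gini(p):
    """Gini impurity of a binary proportion vector (p, 1 - p): G = 2p(1 - p)."""
    return 2.0 * p * (1.0 - p)

def gini_gain(a, b, c, d):
    """IG_G = G(S) - x1*G(S1) - x2*G(S2), with weight factors x_i = n_i / n."""
    n1, n2 = a + b, c + d
    n = n1 + n2
    g_parent = gini((a + c) / n)          # parent impurity from column sum proportions
    g1, g2 = gini(a / n1), gini(c / n2)   # subnode impurities from row vectors
    return g_parent - (n1 / n) * g1 - (n2 / n) * g2

def phi(a, b, c, d):
    """phi coefficient: (ad - bc) / sqrt of the marginal sum product."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 12, 30, 25, 8
ig = gini_gain(a, b, c, d)
rhs = phi(a, b, c, d) ** 2 * gini((a + c) / (a + b + c + d))
print(ig, rhs)  # the two values agree
```

Because the identity is algebraic, it holds for any table with nonzero marginal sums, not just this example.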

Table 4. Classification tree partitions for NHC ‘short-stay rehospitalized’ data.

https://doi.org/10.1371/journal.pone.0224460.t004

2 Data analysis and results

2.1 Data preparation

The Centers for Medicare & Medicaid Services (CMS) conduct regular inspections of nursing homes to assess compliance with regulations and survey residents to assess the quality of patient care. The CMS quality measures data and Five-Star rating assignments are publicly available from the Nursing Home Compare (NHC) website [49]. The analysis of NHC data is an important problem in itself [50–52] and is the subject of our ongoing work [53]. Nursing homes are dynamic systems where the measurement of performance is essential for managing cost, but this constitutes a complex problem for which there is no unique or ‘best’ solution. The challenge is to develop data analysis methods that can help identify public health criteria for classifying the quality of patient care in nursing homes, or some approximation thereof. However, in this work our interest is limited to the comparison of CART association graphs for effect size measures. First Quarter, 2018 NHC data for eighteen quality measures were retrieved, selecting only those nursing homes with either a 1 star or 5 star overall rating, corresponding to 1394 and 2649 nursing homes, respectively. Selecting ‘1 star, 5 star’ rating data creates a binary response data set, which is convenient for our purpose; otherwise, data for all five ratings would have to be included in the CART analysis. The distributions of the NHC ‘Percentage of short-stay residents who were rehospitalized after a nursing home admission’ (Rehospitalized) quality measure data for the 1 star and 5 star overall ratings are broad and largely overlap (Fig 6A). This result implies that the Mi for the corresponding contingency tables will tend to be much less than 1, as required for our demonstration.
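The ‘1 star, 5 star’ selection step can be sketched in a few lines. The record layout and field names below are hypothetical, since the NHC download schema is not reproduced here; the point is only the filtering and binary response encoding.

```python
# Sketch of the binary response selection step: keep only nursing homes
# with a 1 star or 5 star overall rating. The record layout and field
# names are hypothetical; the actual NHC download schema differs.
records = [
    {"provider_id": "015009", "overall_rating": 1, "rehospitalized_pct": 28.4},
    {"provider_id": "015012", "overall_rating": 3, "rehospitalized_pct": 21.0},
    {"provider_id": "015019", "overall_rating": 5, "rehospitalized_pct": 14.7},
    {"provider_id": "015027", "overall_rating": 5, "rehospitalized_pct": 19.9},
]

# Select the '1 star, 5 star' subset and encode a binary response:
# y = 0 for a 1 star rating, y = 1 for a 5 star rating.
subset = [r for r in records if r["overall_rating"] in (1, 5)]
for r in subset:
    r["y"] = 1 if r["overall_rating"] == 5 else 0

print(len(subset))  # 3 of the 4 hypothetical records are retained
```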

Fig 6. CART association graph.

A: Stacked histograms for First Quarter, 2018 ‘Percentage of short-stay residents who were rehospitalized after a nursing home admission’ data for nursing homes with a 1 star or 5 star overall rating; the dashed line is the median value. CART associations between the Nursing Home Compare ‘Rehospitalized’ quality measure and the ‘1 star, 5 star’ overall rating for IGG and ϕ2 are also shown. IGG was scaled to match ϕ2. Both Mc,ac and ωM were scaled by 1/50. B: Column scaling invariant center-of-mass coordinates, (δc,ac, μc,a+c), for the two-component proportional variation in the standard one-simplex, △1.

https://doi.org/10.1371/journal.pone.0224460.g006

2.2 Effect size in CART

In demonstrating the marginal sum dependence of various effect size measures, we must choose an elementary contingency table analysis problem. CART analysis for a binary response variable (bCART) is well suited for this purpose. In searching for an optimal binary partition of an independent variable, bCART generates a set of 2 × 2 tables where the sample sizes, n1 and n2, of the two subnodes vary over almost the entire range of the fixed sum N = n1 + n2; a minimum size is usually specified because a partition with a very small subnode is not informative. We let the rows and columns of Table 1 correspond to the two subnodes and the ‘1 star, 5 star’ rating for the response variable, respectively. Effect size results for a bCART scan for association between the Rehospitalized quality measure and the NHC ‘1 star, 5 star’ overall rating are shown in Fig 6. The exact match between IGG and ϕ2 (Fig 6A) is consistent with Eq (24) because G(S) is constant. The parabolic variation of ϕ2 is explained by Eq (21) because the variation in the marginal sum factor, Mc,ac, outweighs the much smaller variation in the proportional effect, δc,ac (Fig 6B). The parabolic variation of Mc,ac is in turn explained by its approximate similarity with ωM. Replacing each marginal sum in Eq (18) by the corresponding proportion yields ωM = [(a + b)(c + d)/N2]/[(a + c)(b + d)/N2]. The denominator corresponds to the binomial variance for the parent set, which is constant. The numerator corresponds to the binomial variance for the subnode size proportions, (a + b) : (c + d), so ωM has a maximum when a + b = c + d, which coincides with the median Rehospitalized value. Consequently, the parabolic dependence of ϕ2, with the maximum near the median value, largely reflects the variation in subnode sample size rather than ‘1 star, 5 star’ composition.
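The marginal sum dependence of ϕ, in contrast with a proportion difference, can also be seen in a minimal numerical example. In the sketch below (ours, for illustration), the row-composition difference a/(a + b) − c/(c + d) stands in for the δ statistics; it is not necessarily the paper’s exact δ definition (Eq 4).

```python
# Two tables with identical subnode ('row') compositions but different
# subnode sizes: row 1 of the second table is row 1 of the first scaled
# by 3. phi changes with the marginal sums; the row-composition
# difference (a stand-in for the delta statistics) does not.
from math import sqrt

def phi(a, b, c, d):
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def row_diff(a, b, c, d):
    """Difference in subnode composition: a/(a+b) - c/(c+d)."""
    return a / (a + b) - c / (c + d)

t1 = (10, 30, 30, 10)
t2 = (30, 90, 30, 10)   # row 1 scaled by 3; compositions unchanged

print(round(phi(*t1), 3), round(phi(*t2), 3))   # -0.5 -0.447: phi differs
print(row_diff(*t1), row_diff(*t2))             # -0.5 -0.5: unchanged
```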
In contrast, δc,ac is column sum invariant and yields very similar results to Yule’s Q, which is invariant to scaling of either rows or columns (Fig 7); the correlation is higher than 0.99 for 16 NHC quality measures, and the lowest is 0.91. This similarity does not represent a special relation with Q; it results from the numerical properties of δc,ac and μc,a+c for these data (Eq 4). The lower correlation (r = 0.78) between δc,ac and δr,ac confirms that different forms of proportional variation can be distinguished; δr,ac also measures the difference in subnode composition but is row sum invariant. The U-shaped δc,ac association graph has two maxima, so there are two possible CART partitions (Table 4). The relatively small subnode with Rehospitalized below 13.3% is enriched in the 5 star rating, corresponding to better than average patient care. Above 32.6%, enrichment of the 1 star rating indicates worse than average patient care. The middle range, from 13.3% to 32.6%, includes the majority of nursing homes, with average performance. In comparison, IGG and IGE produce subnodes that are nearly equal in size and have much lower degrees of enrichment in the ‘1 star, 5 star’ proportions. Thus, δc,ac is more effective than IGG and IGE in identifying partitions that correspond to a difference in the ‘1 star, 5 star’ composition.
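The scaling invariance of Yule’s Q invoked here is easy to check directly. The following sketch (our illustration) scales a row or a column of a 2 × 2 table and recomputes Q.

```python
# Yule's Q = (ad - bc)/(ad + bc) is unchanged when any single row or
# column of the 2x2 table is multiplied by a positive constant, because
# the constant factors out of both the numerator and the denominator.
def yules_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

a, b, c, d = 10, 20, 30, 5
q = yules_q(a, b, c, d)
q_col = yules_q(3 * a, b, 3 * c, d)   # column 1 scaled by 3
q_row = yules_q(5 * a, 5 * b, c, d)   # row 1 scaled by 5

print(q, q_col, q_row)  # all three values are equal
```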

Fig 7. Scaling invariant effect size statistics for CART.

Yule’s Q and δc,ac yield similar results in the CART association between the First Quarter, 2018 Nursing Home Compare ‘Rehospitalized’ data and the ‘1 star, 5 star’ overall rating. δc,ac and δr,ac are the column and row scaling invariant proportion differences, respectively. Q is invariant to scaling of either columns or rows.

https://doi.org/10.1371/journal.pone.0224460.g007

The logistic regression method provides a graphical view of the effect of sample size parameters on proportional variation in categorical data analysis (Fig 8A). The ‘1 star, 5 star’ rating data were analyzed using the LogisticRegression function in the scikit-learn library with the ‘lbfgs’ solver [54]. A moving average of the ‘5 star’ rating proportion is included in the graph as a reference for the logistic curve. The normalized ‘5 star’ proportion adjusted for inequality in the ‘1 star, 5 star’ sample sizes and the corresponding adjusted logistic curve are shown in Fig 8B. The variation in proportion confirms that the left and right tails of the Rehospitalized distribution correspond to nursing homes with above and below average performance, respectively, consistent with the CART association results. The logistic model for the ‘5 star’ proportion, y(x), is usually expressed as y(x) = 1/(1 + ea+bx), (25) where the parameters, (a, b), are determined from the curve fit. The adjustment for the logistic curve was obtained using the change in coordinates y → ỹ = (y/n5)/(y/n5 + (1 − y)/n1), where n1 and n5 are the sample sizes for the 1 star and 5 star ratings in the data set, respectively. Substitution into Eq (25) yields ỹ(x) = 1/(1 + (n5/n1)ea+bx), such that ỹ(x0) = 1/2 at x0 = −[a + ln(n5/n1)]/b. In a data set where n1 = n5, y(x0) = 1/2, and x0 corresponds to the mid-point value for the logistic curve. Then, there are two sample-size-independent parameters, b and x0.
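Assuming the logistic form y(x) = 1/(1 + ea+bx) and an adjustment that rescales the odds by n5/n1 (our reading of the adjustment; the fitted parameter values below are hypothetical, not the NHC fit), the midpoint behavior can be checked numerically:

```python
# Sample-size adjustment for a fitted logistic curve y(x) = 1/(1 + e^(a+bx)).
# Rescaling the odds by n5/n1 gives the adjusted curve
# y~(x) = 1/(1 + (n5/n1) e^(a+bx)), with midpoint x0 = -(a + ln(n5/n1))/b.
from math import exp, log

def logistic(x, a, b):
    return 1.0 / (1.0 + exp(a + b * x))

def adjusted(x, a, b, n1, n5):
    """'5 star' proportion normalized for unequal class sample sizes."""
    return 1.0 / (1.0 + (n5 / n1) * exp(a + b * x))

a_fit, b_fit = -6.7, 0.3   # hypothetical curve-fit parameters
n1, n5 = 1394, 2649        # '1 star' and '5 star' sample sizes

x0 = -(a_fit + log(n5 / n1)) / b_fit
print(round(adjusted(x0, a_fit, b_fit, n1, n5), 6))  # 0.5 at the midpoint
```

When n1 = n5 the log-ratio term vanishes and x0 reduces to −a/b, the midpoint of the unadjusted curve, as stated in the text.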

Fig 8. Sample size effects in logistic regression.

A: Logistic model for the Nursing Home Compare ‘Rehospitalized’ data and ‘1 star, 5 star’ overall rating. The moving average ‘5 star’ proportion is included for reference. The ‘5 star’ sample size proportion, n5/(n1 + n5), is shown as a horizontal line; n1 = 1394, and n5 = 2649. B: Normalized ‘5 star’ proportion adjusted for unequal sample sizes, and the adjusted logistic curve. The midpoint value, x0 = 22.4, for the logistic curve is shown as a vertical line; the ‘Rehospitalized’ median is 22.2.

https://doi.org/10.1371/journal.pone.0224460.g008

3 Discussion

The renewed warnings from the statistics community about the limitations of statistical significance methodology have created a perplexing situation, given the wide range of opinion on the underlying causes and solutions [55, 56]. Claims have also been made for effect size [25, 26, 57] as a better alternative, but the lack of consensus on the utility of commonly used association coefficients, such as the odds ratio [8, 10], the simple matching coefficient, and ϕ [5, 11], hinders the development of this approach. In this paper, we describe a rigorous framework for representing proportional variation in a 2 × 2 table, which helps resolve the marginal sum dependence problem for association coefficients. We show that a 2 × 2 table is associated with four forms of proportional variation resulting from its factorization as a product of proportion and diagonal row or column sum matrices. Association coefficients that do not distinguish between rows and columns, such as ϕ, the odds ratio, and the simple matching coefficient, correspond to averages of proportional effects and lack a clear interpretation. The two-component structure implies that there are two degrees of freedom, corresponding to the displacement of two point vectors in the standard one-simplex, △1. An effect size measure then requires the specification of a perspective function of the center-of-mass coordinates, (δ, μ), which is potentially unique for each application because of differences in cost-benefit trade-offs. In practice, classification problems vary widely in difficulty depending on the degree of overlap between the underlying distributions. Fisher’s iris data set [58] is an example of a classification problem with well separated distributions, where different association coefficients achieve similar results because of degeneracy, particularly when the 2 × 2 table is diagonally symmetric or the effects are highly correlated.
Conversely, differences in performance between association coefficients are best observed when the underlying distributions overlap. We also show that both Gini and entropy information gain depend on the marginal sums, which degrades the performance of the CART algorithm. In contrast, the proportion difference with marginal sum invariance for the response variable provides a significant improvement in the performance of the CART algorithm. The results in this paper demonstrate that equalization of either the row or column sums of a 2 × 2 table serves as a correction for unbalanced sample sizes, as suggested by Goodman and Kruskal [2].

Acknowledgments

It is a pleasure to acknowledge helpful discussions and suggestions from many colleagues in the DuPont Genetic Discovery group, particularly Ada Ching, Antoni Rafalski, and Scott Tingey. I also thank my colleagues at the Science, Technology and Research Institute of Delaware for their support, and Open Data Delaware for supporting the development of the NursingHomeMeasures.com website.

References

1. Yule GU. On the Methods of Measuring Association Between Two Attributes. Journal of the Royal Statistical Society. 1912;75(6):579–652.
2. Goodman LA, Kruskal WH. Measures of Association for Cross Classifications. Journal of the American Statistical Association. 1954;49:732–764.
3. Hedrick PW. Gametic disequilibrium measures: proceed with caution. Genetics. 1987;117(2):331–341.
4. Davenport EC, El-Sanhurry NA. Phi/Phimax: Review and Synthesis. Educational and Psychological Measurement. 1991;51(4):821–828.
5. VanLiere JM, Rosenberg NA. Mathematical properties of the r2 measure of linkage disequilibrium. Theoretical Population Biology. 2008;74(1):130–137. pmid:18572214
6. Olivier J, Bell ML. Effect Sizes for 2 × 2 Contingency Tables. PLoS ONE. 2013;8(3):e58777. pmid:23505560
7. Haddock CK, Rindskopf D, Shadish WR. Using odds ratios as effect sizes for meta-analysis of dichotomous data: A primer on methods and issues. Psychological Methods. 1998;3(3):339–353.
8. Kraemer HC. Reconsidering the odds ratio as a measure of 2 × 2 association in a population. Statistics in Medicine. 2004;23(2):257–270. pmid:14716727
9. Ruxton GD, Neuhäuser M. Review of alternative approaches to calculation of a confidence interval for the odds ratio of a 2 × 2 contingency table. Methods in Ecology and Evolution. 2013;4(1):9–13.
10. Grant RL. Converting an odds ratio to a range of plausible relative risks for better communication of research findings. BMJ. 2014;348:f7450. pmid:24464277
11. Warrens MJ. On Association Coefficients for 2 × 2 Tables and Properties That Do Not Depend on the Marginal Distributions. Psychometrika. 2008;73(4):777–789. pmid:20046834
12. Hubálek Z. Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: An Evaluation. Biological Reviews. 1982;57(4):669–689.
13. Boyd SP, Vandenberghe L. Convex Optimization. New York, NY: Cambridge University Press; 2004.
14. Beló A, Zheng P, Luck S, Shen B, Meyer DJ, Li B, et al. Whole genome scan detects an allelic variant of fad2 associated with increased oleic acid levels in maize. Molecular Genetics and Genomics. 2008;279(1):1–10.
15. Loh WY. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2011;1(1):14–23.
16. Krzywinski M, Altman N. Points of Significance: Classification and regression trees. Nature Methods. 2017;14(8):757–758.
17. Reid M, Szendröi B. Geometry and Topology. New York: Cambridge University Press; 2005.
18. Bland JM, Altman DG. Statistics Notes: The odds ratio. BMJ. 2000;320(7247):1468. pmid:10827061
19. Newcombe RG. A deficiency of the odds ratio as a measure of effect size. Statistics in Medicine. 2006;25(24):4235–4240. pmid:16927451
20. Sistrom CL, Garvan CW. Proportions, Odds, and Risk. Radiology. 2004;230(1):12–19. pmid:14695382
21. Pearson K, Heron D. On Theories of Association. Biometrika. 1913;9:159–315.
22. Zysno PV. The modification of the phi-coefficient reducing its dependence on the marginal distributions. Methods of Psychological Research. 1997;2(1):41–53.
23. Richardson JT. The analysis of 2 × 1 and 2 × 2 contingency tables: an historical review. Statistical Methods in Medical Research. 1994;3(2):107–133. pmid:7952428
24. Cohen J. A power primer. Psychological Bulletin. 1992;112(1):155–159. pmid:19565683
25. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews of the Cambridge Philosophical Society. 2007;82(4):591–605. pmid:17944619
26. Cumming G. Understanding The New Statistics. New York, NY: Routledge; 2012.
27. Marsaglia G. Ratios of Normal Variables. Journal of Statistical Software. 2006;16(4):1–10.
28. von Luxburg U, Franz VH. A Geometric Approach to Confidence Sets for Ratios: Fieller’s Theorem, Generalizations, and Bootstrap. Statistica Sinica. 2009;19:1095–1117.
29. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine. 1998;17(8):873–890. pmid:9595617
30. Agresti A. Dealing with discreteness: making ‘exact’ confidence intervals for proportions, differences of proportions, and odds ratios more exact. Statistical Methods in Medical Research. 2003;12(1):3–21. pmid:12617505
31. Banik S, Kibria BM. Confidence Intervals for the Population Correlation Coefficient ρ. International Journal of Statistics in Medical Research. 2016;5(2):99–111.
32. Bishara AJ, Hittner JB. Confidence intervals for correlations when data are not normal. Behavior Research Methods. 2017;49(1):294–309. pmid:26822671
33. Bevington PR, Robinson DK. Data Reduction and Error Analysis for the Physical Sciences. 3rd ed. New York, NY: McGraw-Hill; 2003.
34. Kroese DP, Brereton T, Taimre T, Botev ZI. Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Computational Statistics. 2014;6(6):386–392.
35. Buonaccorsi JP. Measurement Error: Models, Methods, and Applications. Boca Raton: Chapman and Hall/CRC; 2010.
36. Höfler M. The effect of misclassification on the estimation of association: a review. International Journal of Methods in Psychiatric Research. 2005;14(2):92–101.
37. Berry KJ, Johnston JE, Mielke PW. A Measure of Effect Size for R × C Contingency Tables. Psychological Reports. 2006;99(1):251–256. pmid:17037476
38. Thomson G, Single RM. Conditional Asymmetric Linkage Disequilibrium (ALD): Extending the Biallelic r2 Measure. Genetics. 2014;198(1):321–331. pmid:25023400
39. Logan JD. Applied Mathematics. 2nd ed. New York, NY: John Wiley & Sons, Inc.; 1997.
40. Casella G, Berger RL. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury; 2002.
41. Kateri M. Contingency Table Analysis. New York, NY: Springer New York; 2014.
42. Kettenring JR. Coping with high dimensionality in massive datasets. Wiley Interdisciplinary Reviews: Computational Statistics. 2011;3(2):95–103.
43. Coveney PV, Dougherty ER, Highfield RR. Big data need big theory too. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016;374(2080):20160153.
44. Duda RO, Hart PE, Stork DG. Pattern Classification. Wiley; 2001.
45. de Ville B. Decision trees. Wiley Interdisciplinary Reviews: Computational Statistics. 2013;5(6):448–455.
46. Loh WY. Fifty Years of Classification and Regression Trees. International Statistical Review. 2014;82(3):329–348.
47. Mingers J. An empirical comparison of selection measures for decision-tree induction. Machine Learning. 1989;3(4):319–342.
48. Krzywinski M, Altman N. Error bars. Nature Methods. 2013;10(10):921–922. pmid:24161969
49. Nursing Home Compare datasets; 2018. Available from: https://data.medicare.gov/data/nursing-home-compare.
50. Quartararo M, Glasziou P, Kerr CB. Classification Trees for Decision Making in Long-Term Care. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences. 1995;50A(6):M298–M302.
51. Alexander GL. An analysis of nursing home quality measures and staffing. Quality Management in Health Care. 2008;17(3):242–251. pmid:18641507
52. Raju D, Su X, Patrician PA, Loan LA, McCarthy MS. Exploring factors associated with pressure ulcers: A data mining approach. International Journal of Nursing Studies. 2015;52(1):102–111. pmid:25192963
53. Nursing Home Quality Measures; 2019. Available from: https://nursinghomemeasures.com/.
54. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
55. Wasserstein RL, Lazar NA. The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129–133.
56. Leek J, McShane BB, Gelman A, Colquhoun D, Nuijten MB, Goodman SN. Five ways to fix statistics. Nature. 2017;551(7682):557–559. pmid:29189798
57. Grissom RJ, Kim JJ. Effect Sizes for Research. 2nd ed. New York, NY: Routledge; 2011.
58. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;7(2):179–188.