The quantification of Simpson’s paradox and other contributions to contingency table theory

Abstract

The analysis of contingency tables is a powerful statistical tool used in experiments with categorical variables. This study improves parts of the theory underlying the use of contingency tables. Specifically, the linkage disequilibrium parameter as a measure of two-way interactions applied to three-way tables makes it possible to quantify Simpson’s paradox by a simple formula. With tests on three-way interactions, there is only one that determines whether the partial interactions of all variables agree or whether there is at least one variable whose partial interactions disagree. To date, there has been no test available that determines whether the partial interactions of a certain variable agree or disagree, and the presented work closes this gap. This work reveals the relation of the multiplicative and the additive measure of a three-way interaction. Another contribution addresses the question of which cells in a contingency table are fixed when the first- and second-order marginal totals are given. The proposed procedure not only detects fixed zero counts but also fixed positive counts. This impacts the determination of the degrees of freedom. Furthermore, limitations of methods that simulate contingency tables with given pairwise associations are addressed.

1 Introduction

Categorical variables are observed in many branches of science. Contingency table theory serves to draw inferences from such data. A broad spectrum of analytical methods was presented by Agresti [1]. In the present paper, some parts of the theory are improved and some methods are added.

In their historical overview, Fienberg and Rinaldo [2] recognized Bartlett’s [3] important contribution to the theory of contingency tables. Simpson [4] clarified some remaining questions from Bartlett’s paper on the three-way interaction in a 2×2×2 table. In addition to theoretical results, Simpson gave an example in which the health benefits of a drug appeared separately in both males and females. However, if the data were merged, no effect was seen. Furthermore, Blyth [5] showed that the merged data might even indicate a strong negative effect of the drug. This phenomenon was called “Simpson’s paradox”.

Several examples have been found in real life, demonstrating the principle’s great practical relevance and the many different situations in which it may arise. Many studies have investigated how to circumvent this paradox, how best to deal with it, or how to interpret it (e.g., [6–14]). However, no short and elucidating presentation has so far succeeded in showing the relation of the paradox and the inner structure of the table.

Different measures are used for the association between two categorical variables, particularly for a 2×2 table (odds ratio, Yule’s Q, Pearson’s φ and ρ). Quantitative genetics, for example, uses the so-called linkage disequilibrium (LD). The application of LD to the two-way and partial associations delivers a closed formula quantifying Simpson’s paradox. The formula is derived in Section 2 and applied in Section 7.1, and it allows a clear, correct, and straightforward interpretation of a famous Berkeley data set.

In a strict sense, Bartlett [3] did not investigate a “three”-way interaction but a “third”-way interaction for a 2×2×2 table. He considered the question of whether a third variable (sex) has an effect on the association between the other two variables (success and treatment). He suggested comparing the odds ratios of the partial 2×2 tables (one for males and one for females). When they agree, the third variable has no effect.

Simpson [4] realized that Bartlett’s definition of no three- (or third-) way interaction implies a symmetry property: when the third variable has no effect on the interaction between variables one and two (agreeing odds ratios of both sub-tables), then automatically, the first variable has no effect on the interaction between variables two and three, and the second variable has no effect on the interaction between variables one and three. Therefore, Bartlett’s [3] test on “no three-way interaction” is a global one, and the alternative hypothesis would be “there is at least one variable with three-way interaction”. Although such a test is not senseless at all, it is hard to believe that someone is interested in whether the interaction between treatment and sex for the group of successful patients equals the interaction between treatment and sex for the group of failed patients.

Therefore, a test for a single variable (“sex has no influence on the effect of a drug” versus “the effect of the drug differs between males and females”) is still needed. It is clear that, for such a test, the odds ratio is not a suitable measure. A measure of association is needed that does not have the symmetry property. Simpson [4] mentioned that symmetry is lost for the root mean square contingency parameter, what we now call the correlation coefficient. However, he did not investigate this measure. It appears that this important issue has not been treated elsewhere so far, possibly because it does not fit the hierarchical log-linear model approach. In Section 3, this gap in the theory is closed. The method is applied to the Berkeley data in Section 7.2.

In quantitative genetics, the concept of LD has been generalized to three and four variables as the so-called three- or four-locus LD [15–20]. The three-locus LD of Bennett [15] is an additive measure and related to the additive measure of Lancaster [21, 22]. It was shown [23, 24] that this measure is not consistent with Bartlett’s criterion, which is actually the solution of a cubic equation.

Bartlett’s criterion, although appearing intuitive, turned out to agree with the maximum-likelihood equation of the log-linear model. Streitberg [25, 26] discussed the shortcomings of the log-linear model, treated the tables as multinomial distributions, and argued for additive measures. Obviously (and unfortunately), he was not aware of the investigations into tables and entropy performed by Good [27].

Shannon’s [28] principle of entropy is a successful concept in physics, engineering, information theory, and statistics. Khinchin [29] delivered mathematical foundations for this principle. In particular, he investigated a measure H for the information content of an experiment (with a finite number n of possible events) as a functional of the probability function. The higher the value of H, the lower the information content of the experiment. He made two assumptions: (i) H is largest when all events have the same probability 1/n and (ii) if an experiment consists of two experimental parts, A and B, then the information content of the whole experiment, H(AB), should be the sum of the information content of the first part H(A) and the information content of the second part, given the first part, denoted by H(B|A), i.e.,

$$H(AB) = H(A) + H(B|A). \tag{1}$$

He showed that, under these reasonable assumptions, there is only one measure that is continuous: the entropy $H = -\lambda \sum_{i=1}^{n} p_i \log p_i$ of the event probabilities $p_1,\ldots,p_n$, where λ is a positive constant and often set to one, i.e.,

$$H = -\sum_{i=1}^{n} p_i \log p_i. \tag{2}$$

Good [27] treated contingency tables as multinomial distributions and determined the distribution with maximum entropy and given restraints, such as one- and two-way marginals. It turned out that his solution for 2×2×2 tables agreed with Bartlett’s criterion.

There is another point speaking against Bennett’s linear measure. In genetic multi-locus linkage analyses, Hill [23] showed that a table with an absence of three-way interactions may have negative “probabilities”. That is, given a table with a three-way interaction, the corresponding hypothetical table without a three-way interaction would not exist. Such a dilemma cannot arise by applying the entropy principle because of its concavity.

It can be concluded that, for 2×2×2 tables, the multiplicative measure has a deeper impact than the additive one. On the other hand, the additive measure is much more tractable. Therefore, we ask which additive measure comes nearest (is most similar) to the multiplicative one. Section 4 examines whether Bennett’s measure is the first-order Taylor expansion of Bartlett’s measure.

A central theme in the progress of contingency table theory is the introduction and development of the log-linear model. In their historical overview, Fienberg and Rinaldo [2] show that a special point was the difficulty in handling zero counts. The nonexistence of the maximum-likelihood estimator (MLE) was indicated by the lack of convergence of the algorithms used to compute the MLE. Later, Fienberg and Rinaldo [30, 31] generated a numerical procedure specifically designed to check for the existence of the MLE. They based their approach on investigations of extended exponential families and the geometrical properties of log-linear models. Practically, the question about zero counts was whether the marginal totals enforce the cells to have zero counts. In such cases, the cell is fixed and this therefore also influences the degrees of freedom. So far, it has been overlooked that not only zero count cells but also positive count cells might be fixed. Section 5 presents an elementary algorithm that detects all fixed cells.

There are variables with categories that have an obvious order, and such variables are called ordinally scaled. References [32–35] documented the progress and problems with simulating ordinally scaled variables with given pairwise Pearson’s correlation coefficients. The techniques are modifications and adaptations of simulation techniques for multivariate normally distributed variables with a given correlation matrix. However, there are no procedures available that work for every admissible correlation matrix. Section 6 presents a simulation method that has no such theoretical limitations.

Lee [35] handled the same task but with demanded pairwise associations measured with Goodman and Kruskal’s γ. Ibrahim and Suliadi [36] generated a program for Lee’s procedure. Although the authors did not mention it, the method is not suitable for simulating all admissible scenarios. These shortcomings are overcome in Section 6.

In Section 7, a real data set reflecting Simpson’s paradox is analyzed with tools derived in Sections 2 and 3.

The paper concludes with a discussion of the issues. Special attention is given to the application of the entropy principle.

2 The quantification of Simpson’s paradox

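Let X and Y be two random categorical variables with $I_X$ and $I_Y$ categories, respectively. In an experiment, n objects are inspected to identify which categories of variables X and Y apply. The counts $n_{i,j}$, $i = 1,2,\ldots,I_X$, $j = 1,2,\ldots,I_Y$, are written in an $I_X \times I_Y$ contingency table. The probability that an object matches categories $X_i$ and $Y_j$ is $p_{i,j} = P(X = X_i, Y = Y_j)$, and its estimate is $n_{i,j}/n$. The association between categories $X_i$ and $Y_j$ is defined by the linkage disequilibrium (LD) measures:

$$D_{i,j} = p_{i,j} - p_{i,\bullet}\,p_{\bullet,j}. \tag{3}$$

The point indicates summation over the assigned variable, e.g., $p_{i,\bullet} = \sum_{j} p_{i,j}$, delivering marginal probabilities.

The LD is assigned to the pair $(X_i, Y_j)$ of categories. The relation of this pair to all other pairs can be summarized by collapsing the $I_X \times I_Y$ table into the 2×2 table $\begin{pmatrix} p_{i,j} & p_{i,\bar{j}} \\ p_{\bar{i},j} & p_{\bar{i},\bar{j}} \end{pmatrix}$, where the bar over an index means summation over all categories with exception of the category defined by the index. The 2×2 table then takes the form

$$\begin{pmatrix} p_{i,\bullet}\,p_{\bullet,j} + D_{i,j} & p_{i,\bullet}(1-p_{\bullet,j}) - D_{i,j} \\ (1-p_{i,\bullet})\,p_{\bullet,j} - D_{i,j} & (1-p_{i,\bullet})(1-p_{\bullet,j}) + D_{i,j} \end{pmatrix}. \tag{4}$$

It is easy to check that $D_{i,j} = D_{\bar{i},\bar{j}} = -D_{i,\bar{j}} = -D_{\bar{i},j}$ holds. Pearson’s correlation coefficient can then be written as

$$\rho_{i,j} = \frac{D_{i,j}}{\sqrt{p_{i,\bullet}(1-p_{i,\bullet})\,p_{\bullet,j}(1-p_{\bullet,j})}}, \tag{5}$$

which coincides with Pearson’s φ.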

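With Z being a third categorical variable, the cell probabilities of the associated $I_X \times I_Y \times I_Z$ table are now $p_{i,j,k} = P(X = X_i, Y = Y_j, Z = Z_k)$, $k = 1,2,\ldots,I_Z$. Eq (3) then changes to

$$D_{i,j,\bullet} = p_{i,j,\bullet} - p_{i,\bullet,\bullet}\,p_{\bullet,j,\bullet}. \tag{6}$$

Using the definition of conditional probabilities, $p_{i,j|k} = p_{i,j,k}/p_{\bullet,\bullet,k}$, the conditional analogue to Eq (3) is

$$D_{i,j|k} = p_{i,j|k} - p_{i,\bullet|k}\,p_{\bullet,j|k}. \tag{7}$$

Because $p_{i,\bullet,k} = D_{i,\bullet,k} + p_{i,\bullet,\bullet}\,p_{\bullet,\bullet,k}$ and, analogously, $p_{\bullet,j,k} = D_{\bullet,j,k} + p_{\bullet,j,\bullet}\,p_{\bullet,\bullet,k}$, the weighted sum is

$$\sum_{k} p_{\bullet,\bullet,k}\,D_{i,j|k} = p_{i,j,\bullet} - \sum_{k} \frac{p_{i,\bullet,k}\,p_{\bullet,j,k}}{p_{\bullet,\bullet,k}} = D_{i,j,\bullet} - \sum_{k} \frac{D_{i,\bullet,k}\,D_{\bullet,j,k}}{p_{\bullet,\bullet,k}},$$

where $\sum_k D_{i,\bullet,k} = \sum_k D_{\bullet,j,k} = 0$ was used.

The result is formulated as a theorem.

Theorem: For an $I_X \times I_Y \times I_Z$ table, the difference between the two-way LD and the weighted sum of the partial LDs is

$$D_{i,j,\bullet} - \sum_{k=1}^{I_Z} p_{\bullet,\bullet,k}\,D_{i,j|k} = \sum_{k=1}^{I_Z} \frac{D_{i,\bullet,k}\,D_{\bullet,j,k}}{p_{\bullet,\bullet,k}}. \tag{8}$$

For a 2×2×2 table, the difference becomes

$$D_{i,j,\bullet} - \sum_{k=1}^{2} p_{\bullet,\bullet,k}\,D_{i,j|k} = \frac{D_{i,\bullet,1}\,D_{\bullet,j,1}}{p_{\bullet,\bullet,1}\,p_{\bullet,\bullet,2}}. \tag{9}$$

The simplification for the 2×2×2 table follows from inserting $I_Z = 2$ into Eq (8), noting that $D_{i,\bullet,2} = -D_{i,\bullet,1}$ and $D_{\bullet,j,2} = -D_{\bullet,j,1}$, and regarding the well-known formula $1/p + 1/(1-p) = 1/[p(1-p)]$.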
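The identity of Eq (9) can be checked numerically. The following is a minimal sketch, assuming a 2×2×2 table stored as a nested list `p[i][j][k]` and weights equal to the marginal probabilities of the third variable; the function name `ld_terms` and the counts are made up for illustration and are not taken from the paper.

```python
def ld_terms(p):
    """Terms of the LD identity for the category pair (1, 1) of a 2x2x2 table.

    p[i][j][k] holds cell probabilities (index 0 stands for category 1).
    Returns (two-way LD, weighted sum of partial LDs,
             D_xz * D_yz / (p_z1 * p_z2)).
    """
    idx = range(2)
    px = sum(p[0][j][k] for j in idx for k in idx)            # p_{1,.,.}
    py = sum(p[i][0][k] for i in idx for k in idx)            # p_{.,1,.}
    pz = [sum(p[i][j][k] for i in idx for j in idx) for k in idx]

    # two-way LD of the collapsed X-by-Y table
    d_xy = p[0][0][0] + p[0][0][1] - px * py

    # partial LDs inside each Z category, weighted by p_{.,.,k}
    weighted = 0.0
    for k in idx:
        pxk = p[0][0][k] + p[0][1][k]                         # p_{1,.,k}
        pyk = p[0][0][k] + p[1][0][k]                         # p_{.,1,k}
        d_partial = p[0][0][k] / pz[k] - (pxk / pz[k]) * (pyk / pz[k])
        weighted += pz[k] * d_partial

    # two-way LDs of X with Z and of Y with Z
    d_xz = p[0][0][0] + p[0][1][0] - px * pz[0]
    d_yz = p[0][0][0] + p[1][0][0] - py * pz[0]
    return d_xy, weighted, d_xz * d_yz / (pz[0] * pz[1])


# made-up counts n[i][j][k], for illustration only
counts = [[[30, 10], [10, 5]], [[5, 10], [15, 15]]]
n = sum(c for plane in counts for row in plane for c in row)
p = [[[c / n for c in row] for row in plane] for plane in counts]
d_xy, weighted, correction = ld_terms(p)
```

The gap `d_xy - weighted` is the quantity that can flip the sign of the merged association relative to the partial ones; in this sketch it agrees with `correction` up to rounding.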

3 Testing the equality of partial interactions for one variable

The null hypothesis for Bartlett’s test concerning 2×2×2 tables is the agreement of all partial interactions (measured as the odds ratio), while the alternative hypothesis is that at least one pair of partial interactions is unequal. Here, the effect (if any) of the third variable on the interaction between the first and the second variables is inferred. Let the first variable be the outcome of the experiment with categories “success” and “no success”, the second variable be the applied treatment with categories “1” and “2” (one treatment could be a placebo), and the third variable be the sex of the patient with categories “male” and “female”.

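The null hypothesis is that both partial interactions of the third variable coincide. The hypothetical table with agreeing partial interactions and the observed table have several parameters in common: three two-way marginal totals, three one-way marginal totals, and the sample size (zero-way marginal total). The equational system for the probabilities is then

$$\begin{pmatrix} 1&1&1&1&1&1&1&1\\ 1&1&1&1&0&0&0&0\\ 1&1&0&0&1&1&0&0\\ 1&0&1&0&1&0&1&0\\ 1&1&0&0&0&0&0&0\\ 1&0&1&0&0&0&0&0\\ 1&0&0&0&1&0&0&0 \end{pmatrix} \begin{pmatrix} p_{1,1,1}\\ p_{1,1,2}\\ p_{1,2,1}\\ p_{1,2,2}\\ p_{2,1,1}\\ p_{2,1,2}\\ p_{2,2,1}\\ p_{2,2,2} \end{pmatrix} = \begin{pmatrix} p_{\bullet,\bullet,\bullet}\\ p_{1,\bullet,\bullet}\\ p_{\bullet,1,\bullet}\\ p_{\bullet,\bullet,1}\\ p_{1,1,\bullet}\\ p_{1,\bullet,1}\\ p_{\bullet,1,1} \end{pmatrix} = \begin{pmatrix} 1\\ p_1\\ p_2\\ p_3\\ p_{1,2}\\ p_{1,3}\\ p_{2,3} \end{pmatrix}, \tag{10}$$

where the second vector defines the abbreviations of the first one. Solving system (10) gives

$$\begin{aligned} p_{1,1,2} &= p_{1,2} - p_{1,1,1},\\ p_{1,2,1} &= p_{1,3} - p_{1,1,1},\\ p_{2,1,1} &= p_{2,3} - p_{1,1,1},\\ p_{1,2,2} &= p_1 - p_{1,2} - p_{1,3} + p_{1,1,1},\\ p_{2,1,2} &= p_2 - p_{1,2} - p_{2,3} + p_{1,1,1},\\ p_{2,2,1} &= p_3 - p_{1,3} - p_{2,3} + p_{1,1,1},\\ p_{2,2,2} &= 1 - p_1 - p_2 - p_3 + p_{1,2} + p_{1,3} + p_{2,3} - p_{1,1,1}. \end{aligned} \tag{11}$$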

With eight cells and seven conditions, there is one free parameter, p1,1,1. The partial tables for male and female patients are presented in Table 1.

Table 1. Success × treatment sub-tables for the sexes with given one- and two-way marginals.

https://doi.org/10.1371/journal.pone.0262502.t001

There is certainly no effect due to sex if both sub-tables agree. Solving this system of linear equations gives p1,2 = p1 p2, p1,3 = p1 p3 and p2,3 = p2 p3; i.e., all variables were pairwise independent.

If the odds ratios of both sub-tables agree, this would apply also for the sub-tables of the other variables, as acknowledged by Simpson [4]. Hence, it would not be a specific property of sex.

Two candidate measures remain: LD and the correlation coefficient. Because the maximum and minimum of LD depend on the one-way marginals, LD is only a relative measure, so the correlation coefficient is the better choice for judging whether associations agree.

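Determining the correlation coefficient for both sub-tables gives

$$\rho_{|1} = \frac{p_3\,p_{1,1,1} - p_{1,3}\,p_{2,3}}{w_1} \tag{12}$$

$$\rho_{|2} = \frac{(1-p_3)(p_{1,2} - p_{1,1,1}) - (p_1 - p_{1,3})(p_2 - p_{2,3})}{w_2} \tag{13}$$

with $w_1 = \sqrt{p_{1,3}(p_3 - p_{1,3})\,p_{2,3}(p_3 - p_{2,3})}$ and $w_2 = \sqrt{(p_1 - p_{1,3})(1 - p_3 - p_1 + p_{1,3})(p_2 - p_{2,3})(1 - p_3 - p_2 + p_{2,3})}$.

Solving $\rho_{|1} = \rho_{|2}$ with respect to $p_{1,1,1}$ finds

$$p_{1,1,1} = \frac{w_2\,p_{1,3}\,p_{2,3} + w_1\left[(1-p_3)\,p_{1,2} - (p_1 - p_{1,3})(p_2 - p_{2,3})\right]}{w_2\,p_3 + w_1(1-p_3)}. \tag{14}$$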

Now we have the observed table and, via Eq (11), the table under the null hypothesis (no sex effect). The χ2 test with one degree of freedom can be used for the decision between the null and alternative hypothesis (sex has an effect). A measure for the third-way interaction can be defined by setting .
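The test of this section can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's implementation: it equates the correlation coefficients of the two sub-tables, solves the resulting linear equation for p1,1,1, rebuilds the remaining cells as in Eq (11), and forms a Pearson χ² statistic. The helper names (`cells_from_p111`, `partial_corr`, `null_table`, `chi_square`), the abbreviations `w1` and `w2`, and the counts are hypothetical.

```python
from math import sqrt


def cells_from_p111(x, p1, p2, p3, p12, p13, p23):
    """All eight cells t[i][j][k] from the free parameter x = p_{1,1,1}."""
    return [[[x, p12 - x],
             [p13 - x, p1 - p12 - p13 + x]],
            [[p23 - x, p2 - p12 - p23 + x],
             [p3 - p13 - p23 + x,
              1 - p1 - p2 - p3 + p12 + p13 + p23 - x]]]


def partial_corr(t, k):
    """Correlation coefficient (phi) of the 2x2 sub-table at category k."""
    pk = sum(t[i][j][k] for i in range(2) for j in range(2))
    a = (t[0][0][k] + t[0][1][k]) / pk      # conditional row marginal
    b = (t[0][0][k] + t[1][0][k]) / pk      # conditional column marginal
    return (t[0][0][k] / pk - a * b) / sqrt(a * (1 - a) * b * (1 - b))


def null_table(p):
    """Table with the marginals of p and equal partial correlations."""
    idx = range(2)
    p1 = sum(p[0][j][k] for j in idx for k in idx)
    p2 = sum(p[i][0][k] for i in idx for k in idx)
    p3 = sum(p[i][j][0] for i in idx for j in idx)
    p12 = p[0][0][0] + p[0][0][1]
    p13 = p[0][0][0] + p[0][1][0]
    p23 = p[0][0][0] + p[1][0][0]
    # w1, w2: denominators of the two partial correlations (hypothetical names)
    w1 = sqrt(p13 * (p3 - p13) * p23 * (p3 - p23))
    w2 = sqrt((p1 - p13) * (1 - p3 - p1 + p13)
              * (p2 - p23) * (1 - p3 - p2 + p23))
    x = ((w2 * p13 * p23 + w1 * (p12 * (1 - p3) - (p1 - p13) * (p2 - p23)))
         / (w2 * p3 + w1 * (1 - p3)))
    return cells_from_p111(x, p1, p2, p3, p12, p13, p23)


def chi_square(counts):
    """Pearson statistic of observed counts against the null table."""
    n = sum(c for plane in counts for row in plane for c in row)
    p = [[[c / n for c in row] for row in plane] for plane in counts]
    q = null_table(p)
    return n * sum((p[i][j][k] - q[i][j][k]) ** 2 / q[i][j][k]
                   for i in range(2) for j in range(2) for k in range(2))


# made-up counts n[i][j][k], for illustration only
counts = [[[30, 10], [10, 5]], [[5, 10], [15, 15]]]
p = [[[c / 100 for c in row] for row in plane] for plane in counts]
q = null_table(p)
stat = chi_square(counts)
```

The statistic would be compared with a χ² distribution with one degree of freedom, as described above. The sketch does not check whether the null table has negative entries, which would have to be handled separately.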

4 A linear expression for Bartlett’s measure for three-way association

Bartlett’s measure D for a three-way association in a 2×2×2 table is determined by solving (15) for D. Here, the probabilities pi,j,k are assigned to counts ni,j,k by pi,j,k = ni,j,k/n•,•,•. Vanishing D indicates the absence of a three-way association.

Bennett [15] introduced an additive measure of the three-way association:

$$D_{1,2,3} = p_{1,1,1} - p_1\,p_2\,p_3 - p_1 D_{2,3} - p_2 D_{1,3} - p_3 D_{1,2}. \tag{16}$$

In the introduction, it was concluded that the multiplicative measure (15) has more impact and the linear measure (16) could be an approximation. Therefore, it can be checked whether the linear measure could be the first-order Taylor expansion of the nonlinear one.

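Substituting the LD expressions $p_{i,j} = D_{i,j} + p_i\,p_j$ for the two-way probabilities of Eq (11) leads to

$$\begin{aligned} p_{1,1,2} &= p_1 p_2 + D_{1,2} - p_{1,1,1},\\ p_{1,2,1} &= p_1 p_3 + D_{1,3} - p_{1,1,1},\\ p_{2,1,1} &= p_2 p_3 + D_{2,3} - p_{1,1,1},\\ p_{1,2,2} &= p_1(1 - p_2 - p_3) - D_{1,2} - D_{1,3} + p_{1,1,1},\\ p_{2,1,2} &= p_2(1 - p_1 - p_3) - D_{1,2} - D_{2,3} + p_{1,1,1},\\ p_{2,2,1} &= p_3(1 - p_1 - p_2) - D_{1,3} - D_{2,3} + p_{1,1,1},\\ p_{2,2,2} &= 1 - p_1 - p_2 - p_3 + p_1 p_2 + p_1 p_3 + p_2 p_3 + D_{1,2} + D_{1,3} + D_{2,3} - p_{1,1,1}. \end{aligned} \tag{17}$$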

Inserting the cell probabilities into Eq (15) gives a cubic equation of argument p1,1,1. Using Mathematica, the roots of (15) were determined and the first-order Taylor expansions at LDs of zero and one-way probabilities of one half were carried out. The real solution was (18)

The appropriate measure for the three-way interaction would be . It can be seen that Bennett’s measure differs from this but covers the four simplest terms. Therefore, Bennett’s criterion can be interpreted as a simplified version of the first-order Taylor expansion of Bartlett’s criterion. The second-order expansion was also available. The only unexpected result was that one coefficient was not a power of 2: the largest term was . The other coefficients were the following (excluding linear terms): $2^{12}$ (three terms), $2^{9}$ (three terms), $2^{8}$ (three terms), and $2^{5}$ (six terms).

5 The determination of fixed cells

5.1 The application of linear programming

Assume an observed I1×I2×⋯×Ic contingency table n, i.e., there are c categorical variables, with Ii categories per variable i, i∈{1,2,⋯,c}. The contingency table n is characterized by the counts , with counts overall. The one-way marginal totals can be written as (19) where ki, 1≤ki≤Ii, is a category of variable i.

Analogously, the two-way marginal totals can be written as (20) where i and j define the involved variables and ki and kj define the appropriate categories.

The marginal totals with indices Ii, i∈{1,2,⋯,c} can be determined by the others: (21)

Let m be the vector of the considered marginal totals. Then, m can be written as (22)

We order the cells of n into a vector , with (23) and thus establish a one-to-one relation between and , with . If j is given, the corresponding c-tuple i1, i2,…,ic−1, ic must be evaluated sequentially. The length of is .

Then, the restraints can be formulated as (24) where matrix A has entries zero or one and ensures the addition of the demanded components of . For a 2×2×2 table, matrix A can be seen in Eq (10). In the first row of A, there are only ones, ensuring the addition of all components of . In the second row, there is a one if the corresponding component of has category one at the first variable, etc.

We now introduce a table with the same dimensions as the observed table n. As with Eq (23), the unknown cell counts are ordered into a vector , with (25)

Since we are looking for a table that has the same zero-, one-, and two-way marginals as the observed table n,

$$A\,x = m \tag{26}$$

must be valid, where x is the vector of unknown cell counts and matrix A is the same as before. Then, the set S with

$$S = \{\,x : A\,x = m,\ x \ge 0\,\}, \tag{27}$$

where $x \ge 0$ is meant componentwise and ensures nonnegativity, contains all admissible tables satisfying the constraints.

There is at least one element of S: the observed table n. Assuming that there are two elements in S, n1 and n2, then the linear combination λ n1+(1−λ) n2 with 0≤λ≤1 is also admissible. Hence, the set of admissible tables is convex and, furthermore, the theory of linear optimization is applicable. In particular, there exist unique solutions (e.g., see [37]) for the linear optimization problems

$$x_i^{\max} = \max_{x \in S} x_i \tag{28A}$$

and

$$x_i^{\min} = \min_{x \in S} x_i \tag{28B}$$

with i∈{1,2,⋯,d}. An upper bound can be obtained from Eq (28A) and a lower bound from Eq (28B) for $x_i$. This is of importance, since $x_i^{\min} = x_i^{\max}$ means that $x_i$ is fixed. The aim is to find such components. Sequential checking of $x_i$, i = 1,2,⋯,d, leads to the set, say Ω, of the components for which the equality is valid. Several numerical software packages contain a linear programming or optimization procedure. As an example, the 4×4×4 data from Table 6 of Fienberg and Rinaldo [2] were analyzed, with the results presented in Table 2. There were 24 zero and 12 nonzero counts, which turned out to be fixed.

Table 2. Cell counts for given zero-, one-, and two-way marginal totals for Table 6 of Fienberg and Rinaldo [2].

(The original table is obtained by inserting and ).

https://doi.org/10.1371/journal.pone.0262502.t002

Table 2 provides more information than can be obtained from the described algorithm. We will come to that now.

5.2 The application of algebraic software

Applying algebraic software such as Mathematica [38] to problems (28A) and (28B) has an advantage compared to pure numerical algorithms. The system of equations can be solved using the procedure “Solve”, thus decreasing the number of variables. For a 2×2×2 table, the solution of Eq (24) is (29)

Here, the representation x = xh+xs is used [37], where xh is the solution of the homogeneous system, i.e., 0 = A xh, and xs is a special solution, i.e., m = A xs. Eq (29) is just Eq (11) times n. From the eight initial variables of the table, there is only one left: .
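For the 2×2×2 case, the single free variable makes the bounds of problems (28A) and (28B) available without a linear programming solver. The sketch below rests on one observation from Eq (11): moving the free variable by t adds t to cells (1,1,1), (1,2,2), (2,1,2), (2,2,1) and subtracts t from the other four. The function names are hypothetical and the counts are made up.

```python
def admissible_shift(n):
    """Range (t_min, t_max) of shifts that keep a 2x2x2 count table valid.

    n[i][j][k] are observed counts. A shift t is added to the cells
    111, 122, 212, 221 and subtracted from 112, 121, 211, 222, which
    leaves all zero-, one-, and two-way marginal totals unchanged.
    """
    plus_cells = (n[0][0][0], n[0][1][1], n[1][0][1], n[1][1][0])
    minus_cells = (n[0][0][1], n[0][1][0], n[1][0][0], n[1][1][1])
    return -min(plus_cells), min(minus_cells)


def all_cells_fixed(n):
    """True when the marginal totals determine every cell of the table."""
    t_min, t_max = admissible_shift(n)
    return t_min == 0 and t_max == 0


# two zero counts, one in each group, pin down all six positive counts as well
pinned = [[[0, 0], [3, 4]], [[5, 6], [7, 8]]]
free = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
```

In the `pinned` table the two zero cells fix the whole table, which illustrates that positive counts can be fixed too. For larger tables the general procedure with a linear programming routine is needed.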

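In the general case of an I1×I2×⋯×Ic table, there are $d = \prod_{i=1}^{c} I_i$ cells or initial variables. In matrix A of Eq (24), there are $\sum_{i=1}^{c}(I_i - 1)$ linearly independent rows for the one-way totals, $\sum_{i<j}(I_i - 1)(I_j - 1)$ linearly independent rows for the two-way totals, and one row for the overall total n. The number, r, of linearly independent rows is therefore

$$r = 1 + \sum_{i=1}^{c}(I_i - 1) + \sum_{i<j}(I_i - 1)(I_j - 1) \tag{30}$$

$$\phantom{r} = \sum_{i<j} I_i I_j - (c-2)\sum_{i=1}^{c} I_i + \frac{(c-1)(c-2)}{2}. \tag{31}$$

The system of linear equations can be solved for r components of x. There remain f = d − r free variables, i.e.,

$$f = \prod_{i=1}^{c} I_i - 1 - \sum_{i=1}^{c}(I_i - 1) - \sum_{i<j}(I_i - 1)(I_j - 1). \tag{32}$$

For a three-way table, investigated by Roy and Kastenbaum [39], this number is the well-known f = (I1−1)(I2−1)(I3−1). For c≥3 and a common number of categories I for all variables, f turns out to be

$$f = I^{c} - 1 - c\,(I-1) - \frac{c(c-1)}{2}(I-1)^{2}. \tag{33}$$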
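The count of free variables is easy to tabulate. The following minimal sketch assumes the usual rank count for first- and second-order marginal totals (one row for the overall total, Ii−1 rows per variable, and (Ii−1)(Ij−1) rows per pair of variables); the function name `free_variables` is made up for illustration.

```python
from itertools import combinations
from math import prod


def free_variables(sizes):
    """Free cells of a table with given zero-, one-, and two-way marginals.

    sizes lists the number of categories of each variable, e.g. [4, 4, 4].
    """
    d = prod(sizes)                                   # number of cells
    r = (1
         + sum(size - 1 for size in sizes)            # one-way totals
         + sum((a - 1) * (b - 1)                      # two-way totals
               for a, b in combinations(sizes, 2)))
    return d - r
```

For three variables the result equals (I1−1)(I2−1)(I3−1); for example, `free_variables([3, 3, 3])` gives the eight free variables of a 3×3×3 table. Cells that turn out to be fixed reduce this number further.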

Let y be the vector of the f free variables. Then, the linear optimization problems have the Mathematica forms (34A) and (34B) and must be calculated for i = 1,…,d. Let Ω be the set , i.e., the set of indices of fixed cells.

When Ω is not empty, the analysis can be refined. In Eq (26), the fixed counts substitute for the variables of the fixed cells. This means that vector in has to be renewed by setting . Analogously, the list of variables is reduced by canceling the variables of the fixed cells. The solution of the new system of equations then further reduces the number of free variables. Applying this to Table 6 of Fienberg and Rinaldo [2] yielded Table 2 of this study. Only four free variables are left over. In this way, a very compact expression for admissible tables was reached.

6 The simulation of ordinally scaled variables with predefined associations

6.1 Association measured by Pearson’s ρ

The aim is to simulate an I1×I2×⋯×Ic contingency table of c numerical variables with given one-way marginals and Pearson’s correlation coefficients ρi,j, i,j∈{1,2,⋯,c}. The categories of the variables are characterized by numbers. These numbers, , i∈{1,2,⋯,c}, ki∈{1,…,Ii}, may agree with the index of the category, i.e., vi(i1) = 1, vi(i2) = 2,…, for variable i.

The contingency table is characterized by the probabilities , where ki∈{1,…,Ii} defines the category of variable i. The probabilities are unknown at present. Now the equations are collected to ensure the validity of the given conditions.

The one-way probabilities, , are assumed to be known. Here, i is the number of the variable and ki is the category of that variable. As before, may be obtained from the c-way probabilities, , by summing over all indices except index i, i.e., (35)

Hence, the expectations and the variances of the variables i∈{1,…,c} are also known.

The pairwise correlation coefficients, ρi,j, are defined by (36) where are the two-way probabilities of variables i and j, which can be determined via . Hence, (37) must hold. The left-hand side of this equation only involves constants; i.e., the left-hand side is a constant. The right-hand side of Eq (37) is a linear combination of the cell probabilities; therefore, the theory of linear programming can be applied.

For convenience, Mathematica and the principles of Section 5.2 are used to proceed. First, the system W of the linear equations ( equations for the one-way marginals and (c−1)c/2 equations for the two-way correlations) is solved.

Then, the lower and upper bounds for the first free variable are determined using the procedures Minimize and Maximize of Mathematica. Three cases need to be considered. (1) The procedure finds no solution, in which case there is no table satisfying the demanded correlations (in the literature, there was no practical and sufficient criterion for the existence of a table). (2) The lower bound and upper bound agree. Then, the first free variable is fixed. (3) The lower and upper bound differ, so there are a variety of tables satisfying the demanded correlations. Therefore, it must be decided whether an average table or an extreme one is preferred. We suggest simulating at least an average table and possibly afterward simulating extreme tables. For the average table, we assign the mean of the bounds to the first free variable. For extreme tables, either the lower or the upper bound can be assigned to the first free variable.

In either case, the first free variable is now assigned to a constant value, and the system of equations W is updated by inserting that value for the variable. Then, the lower and upper bounds for the new first free variable are determined. (The output “no solution” may no longer appear.) This algorithm is repeated until there are no free variables left and all cell probabilities are determined.

Now, we have all cell probabilities. We interpret them to define a d-point distribution. Then, the inversion algorithm of Lee [35] can be used to simulate the table.

For the case with no solution for the restraints, admissible scenarios can be determined. Instead of maximizing and minimizing the cell probabilities, we determine the bounds for the correlation parameters. For example, let the one-way marginals of a 3×3×3 table be p1 = (0.1, 0.3, 0.6)′, p2 = (0.2, 0.4, 0.4)′, and p3 = (0.3, 0.3, 0.4)′. The appropriate expectations and variances are thereby defined. Then, the procedures Maximize and Minimize are used to calculate the bounds for the correlation parameters. The obtained bounds are −0.797≤ρ1,2≤0.797, −0.808≤ρ1,3≤0.808, and −0.837≤ρ2,3≤0.933. Maximizing ρ1,2+ρ1,3+ρ2,3 yields 2.537, with ρ1,2 = 0.797, ρ1,3 = 0.808, and ρ2,3 = 0.933. Minimizing ρ1,2+ρ1,3+ρ2,3 yields −1.400, with ρ1,2 = −0.598, ρ1,3 = −0.449, and ρ2,3 = −0.354. For ρ = ρ1,2 = ρ1,3 = ρ2,3, we obtain the admissible interval −0.598≤ρ≤0.797.

6.2 Association measured by Goodman and Kruskal’s γ

Lee [35] developed an algorithm for the simulation of a table with given one-way marginal totals and given pairwise association measures in terms of Goodman and Kruskal’s γ. Ibrahim and Suliadi [36] provided a macro program of this algorithm.

This section is organized as follows. First, the algorithm of Lee [35] is described, including three improvements. Then, we use it in two examples showing scenarios of association parameters where a table satisfying the demands does not exist and, even when such a table exists, it cannot be determined with Lee’s [35] method. Later, hints are provided for how to handle these problems.

Consider two ordinally scaled categorical variables Y1 and Y2 with I1 and I2 categories, respectively. Let the (unknown) joint probabilities be denoted by $p_{i,j} = P(Y_1 = i, Y_2 = j)$. Consider two random objects with observations of both variables, O1 = (Y1, Y2) and O2 = (Y1, Y2). The probability that the first object has categories i and j and the second object has categories i′ and j′ is then $p_{i,j}\,p_{i',j'}$. In addition to being objects with observations, O1 and O2 are also two points of the I1×I2 table, and they may be concordant (i<i′ and j<j′ or i>i′ and j>j′), discordant (i<i′ and j>j′ or i>i′ and j<j′), or indifferent (at least one equality sign appears). Adding the concordant and the discordant cases, the definition of γ becomes

$$\gamma = \frac{\sum_{i,j} p_{i,j}\Big(\sum_{i'>i}\sum_{j'>j} p_{i',j'} + \sum_{i'<i}\sum_{j'<j} p_{i',j'} - \sum_{i'>i}\sum_{j'<j} p_{i',j'} - \sum_{i'<i}\sum_{j'>j} p_{i',j'}\Big)}{\sum_{i,j} p_{i,j}\Big(\sum_{i'>i}\sum_{j'>j} p_{i',j'} + \sum_{i'<i}\sum_{j'<j} p_{i',j'} + \sum_{i'>i}\sum_{j'<j} p_{i',j'} + \sum_{i'<i}\sum_{j'>j} p_{i',j'}\Big)}. \tag{38}$$

Note that this definition deviates from that of Lee [35]. (This is the first improvement by the author.) In the version from [35] or [40] the right-hand-side double sums do not appear. In that case, however, we can obtain differing association values if we rename or interchange the variables. Since this is not judicious in the actual context, the symmetrical version (38) is applied. However, this does not affect the ideas of Lee [35] in an essential way.

Following Lee [35], for given one-way marginals, i.e., for given $p_{i,\bullet}$ and $p_{\bullet,j}$, the maximum gamma is γ = 1. The probabilities $p_{i,j}$ carrying this property can be determined by the following routine. With an outer loop i = 1,2,⋯,I1 and for each i with an inner loop j = 1,2,⋯,I2 (or vice versa), set

$$p_{i,j} = \min\Big(p_{i,\bullet} - \sum_{j'<j} p_{i,j'},\; p_{\bullet,j} - \sum_{i'<i} p_{i',j}\Big). \tag{39}$$

(This is the second improvement by the author. Lee [35] presumably intended the same routine, but his formulation was hard to follow.)

For negative gammas, the method must be modified. Lee [35] and Ibrahim and Suliadi [36] stated that a two-way table with perfect negative association (i.e., γ = −1) can be obtained from the two-way table with perfect positive association (i.e., γ = 1) by reversing the order of categories for one of the variables. To see that this is not correct, consider a table with three variables where all association parameters are γ = 1. Reversing the order of categories of the first variable changes two associations to γ = −1. If one then reverses the order of categories of the second or third variable, there remain two associations with γ = −1 and one with γ = 1. If we then reverse the order of categories of the remaining variable that has not changed so far, we again have three associations of γ = 1. Therefore, a table with three variables, where all association parameters are γ = −1, cannot be generated.

However, the joint probabilities for γ = −1 can be determined by reversing the components of the one-way marginal totals p, i.e., , applying routine (39), and next reversing the rows of matrix , thus obtaining the table pmin with the originally demanded one-way marginal totals. (This was the third improvement by the author.)

Denote the generated I1×I2 table with popt and the I1×I2 table for independent Y1 and Y2 with p0, i.e., $p^{0}_{i,j} = p_{i,\bullet}\,p_{\bullet,j}$. Then, the convex linear combination p(λ) = (1−λ) p0+λ popt, 0≤λ≤1, defines a table p(λ) satisfying the one-way marginal totals. For λ = 0, p(0) = p0 holds and the appropriate gamma is zero, i.e., γ[p(0)] = 0. Also, for λ = 1, p(1) = popt holds and the appropriate gamma is one, i.e., γ[p(1)] = 1. Since γ[p(λ)] is a continuous function of λ, there must be a λ* so that γ[p(λ*)] = Γ, 0≤Γ≤1, where Γ is the nominal amount of association. Therefore, Lee [35] solves numerically the equation

$$\gamma[p(\lambda)] = \Gamma \tag{40}$$

with respect to λ. With solution λ*, the table pΓ = p(λ*) = (1−λ*) p0+λ* popt satisfies the nominal Γ.
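The two-way part of the procedure can be sketched as follows. This is an illustration under assumptions, not Lee's or the author's code: γ is computed by summing over all ordered pairs of cells, the table with γ = 1 is built with a greedy corner fill (assumed here to be what routine (39) does), and Eq (40) is solved by bisection, which relies only on the continuity of γ[p(λ)]. All function names are hypothetical.

```python
def gamma(p):
    """Goodman and Kruskal's gamma of a two-way probability table p[i][j]."""
    rows, cols = len(p), len(p[0])
    conc = disc = 0.0
    for i in range(rows):
        for j in range(cols):
            for a in range(rows):
                for b in range(cols):
                    sign = (a - i) * (b - j)
                    if sign > 0:            # concordant pair of cells
                        conc += p[i][j] * p[a][b]
                    elif sign < 0:          # discordant pair of cells
                        disc += p[i][j] * p[a][b]
    return (conc - disc) / (conc + disc)


def max_gamma_table(r, c):
    """Greedy corner fill for row marginals r and column marginals c."""
    p = [[0.0] * len(c) for _ in r]
    for i in range(len(r)):
        for j in range(len(c)):
            row_left = r[i] - sum(p[i][:j])
            col_left = c[j] - sum(p[a][j] for a in range(i))
            p[i][j] = max(0.0, min(row_left, col_left))  # guard rounding
    return p


def mix(p0, popt, lam):
    """Convex combination (1 - lam) * p0 + lam * popt."""
    return [[(1 - lam) * x + lam * y for x, y in zip(row0, row1)]
            for row0, row1 in zip(p0, popt)]


def table_for_gamma(r, c, target, tol=1e-12):
    """Table with marginals r, c and gamma close to target, 0 < target <= 1."""
    p0 = [[ri * cj for cj in c] for ri in r]            # independence table
    popt = max_gamma_table(r, c)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gamma(mix(p0, popt, mid)) < target:
            lo = mid
        else:
            hi = mid
    return mix(p0, popt, (lo + hi) / 2)


r, c = [0.1, 0.3, 0.6], [0.2, 0.4, 0.4]
corner = max_gamma_table(r, c)
table = table_for_gamma(r, c, 0.6023)
```

Because every mixture keeps the one-way marginal totals, only the association changes along the path from the independence table to the corner-fill table.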

The main aim is to generate an I1×I2×⋯×Ic table for c categorical variables with given one-way marginal totals and nominal pairwise associations Γi,j, i,j∈{1,2,⋯,c}, i<j. For each Γi,j, routine (40) is applied, leading to c(c−1)/2 two-way marginal totals . Each entry , with i′∈{1,2,⋯,Ii} and j′∈{1,2,⋯,Ij}, can be expressed as a sum of the c–way probabilities, thus exhibiting linear equations. Lee [35] acknowledged that a solution of a system of linear equations with additional inequalities, pi≥0, can be found by applying linear programming. Having determined an admissible table, the simulation is carried out with the inversion algorithm.

The described method of determining an admissible table will be called the γ−method from here on.

Neither Lee [35] nor Ibrahim and Suliadi [36] mentioned any problems finding a solution and gave the impression that the procedure always finds one. An example is given to prove that this is not always the case.

Consider a 2×2×2 table with given one-way marginal totals (0.2, 0.8)′, (0.4, 0.6)′, and (0.5, 0.5)′ for variables one, two, and three, respectively. It can be confirmed that a table exists that satisfies the nominal pairwise association parameters 1 = Γ1,2 = Γ1,3 = Γ2,3. Now, the nominal pairwise association parameters are set to Γ1,2 = −1, Γ1,3 = 1, and Γ2,3 = 1. Routine (39) delivers the probabilities for the three 2×2 sub-tables $p^{(1,2)} = \begin{pmatrix} 0 & 0.2 \\ 0.4 & 0.4 \end{pmatrix}$, $p^{(1,3)} = \begin{pmatrix} 0.2 & 0 \\ 0.3 & 0.5 \end{pmatrix}$, and $p^{(2,3)} = \begin{pmatrix} 0.4 & 0 \\ 0.1 & 0.5 \end{pmatrix}$.

It is not necessary to solve (40), since a priori, λ = 1 holds.

Now, the zero-, one-, and two-way marginal totals are known, and the system of linear equations can be established. There is one free parameter, and the three-way table satisfying the restraints is presented in Table 3.

Table 3. A: Cell probabilities for given one- and two-way marginal totals.

B: Table for association parameters Γ1,2 = −1, Γ1,3 = 1, and Γ2,3 = 0.714.

https://doi.org/10.1371/journal.pone.0262502.t004

From p1,1,2 = −p1,1,1 and p1,1,1, p1,1,2≥0, it follows that p1,1,1 = p1,1,2 = 0 must hold. Therefore, p2,2,1 = −0.1 would follow; i.e., there is no table satisfying the restraints.
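The nonexistence argument can be replayed numerically. The sketch below assumes the parametrization of Eq (11), in which every cell of a 2×2×2 table is a constant plus or minus the free parameter p1,1,1, so that nonnegativity of all cells confines p1,1,1 to an interval. The function names and the tolerance are made up for illustration; the two-way values are those implied by the relations quoted above (p1,1,• = 0, p1,•,1 = 0.2, p•,1,1 = 0.4).

```python
def p111_interval(p1, p2, p3, p12, p13, p23):
    """Bounds on p_{1,1,1} implied by nonnegativity of all eight cells."""
    lower = max(0.0,
                p12 + p13 - p1,      # from p_{1,2,2} >= 0
                p12 + p23 - p2,      # from p_{2,1,2} >= 0
                p13 + p23 - p3)      # from p_{2,2,1} >= 0
    upper = min(p12, p13, p23,       # from p_{1,1,2}, p_{1,2,1}, p_{2,1,1} >= 0
                1 - p1 - p2 - p3 + p12 + p13 + p23)   # from p_{2,2,2} >= 0
    return lower, upper


def table_exists(p1, p2, p3, p12, p13, p23, tol=1e-12):
    """True when at least one nonnegative table has the given marginals."""
    lower, upper = p111_interval(p1, p2, p3, p12, p13, p23)
    return lower <= upper + tol


# Gamma_{1,2} = -1, Gamma_{1,3} = Gamma_{2,3} = 1: no admissible table
mixed_signs = table_exists(0.2, 0.4, 0.5, 0.0, 0.2, 0.4)
# all three association parameters equal to 1: a table exists
all_positive = table_exists(0.2, 0.4, 0.5, 0.2, 0.2, 0.4)
```

In the first case the lower bound 0.1 (forced by p2,2,1 ≥ 0) exceeds the upper bound 0 (forced by p1,1,2 ≥ 0), which is the contradiction derived above.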

One could think that, if an admissible table exists, it can be determined by the γ-method. We now show that this is not correct. As an example, our task is to generate a 3×3×3 table with one-way marginal totals p1 = (0.1, 0.3, 0.6)′, p2 = (0.2, 0.4, 0.4)′, and p3 = (0.3, 0.3, 0.4)′ and pairwise Goodman and Kruskal’s association parameters −Γ1,2 = Γ1,3 = Γ2,3 = 0.6023. The sub-tables with maximum associations are determined via (39) and presented in Table 4.

Table 4. 3×3 sub-tables for association parameters Γ1,2 = −1, Γ1,3 = 1, and Γ2,3 = 1.

https://doi.org/10.1371/journal.pone.0262502.t005

The 3×3 sub-tables for independent variables were determined and are presented in Table 5.

Now, the 3×3 sub-tables for association parameters −Γ1,2 = Γ1,3 = Γ2,3 = 0.6023 are generated by determining the coefficient λ due to Eq (40). The results are given in Table 6.

Table 6. 3×3 sub-tables for association parameters −Γ1,2 = Γ1,3 = Γ2,3 = 0.6023.

https://doi.org/10.1371/journal.pone.0262502.t007

These two-way marginal totals are written as a linear system. Together with the inequalities pi,j,k≥0, they should be solved by linear programming. As it turns out in this case, there is no solution. To see why, the linear system is solved to reduce the number of variables. From 3×3×3 variables pi,j,k, there are eight free variables. It is not necessary to present the complete table. Five cell probabilities are the following:

From the first four equations, it follows that p1,1,1≤0.0107, p1,2,1≤0.0213, p2,1,1≤0.032, and p2,2,1≤0.064. Hence, the sum p1,1,1+ p1,2,1+ p2,1,1+ p2,2,1 is less than or equal to 0.128. Then, p3,3,1, given by the last equation, is smaller than zero. Therefore, the γ-method is not able to find a solution for the formulated task.

However, there is a table satisfying the conditions that was found with the procedure NMaximize from Mathematica. The variable Γ was maximized under the restraints of the zero- and one-way marginal totals, Γ = −Γ1,2 = Γ1,3 = Γ2,3, and nonnegative variables. The maximum was Γ = 0.6023, and the obtained table is given in Table 7.

Table 7. 3×3×3 table satisfying 0.6023 = −Γ1,2 = Γ1,3 = Γ2,3 and one-way marginal totals p1 = (0.1, 0.3, 0.6)′, p2 = (0.2, 0.4, 0.4)′, and p3 = (0.3, 0.3, 0.4)′.

https://doi.org/10.1371/journal.pone.0262502.t008

To simplify the check of the side conditions, the sub-tables are given in Table 8. Although there are similarities to the two-way sub-tables in Table 4, there is one specific difference: zeros do appear, supporting an extreme table.

The tool to determine an admissible association parameter scenario is also applied to the 2×2×2 table from above. Since there was no solution for −Γ1,2 = Γ1,3 = Γ2,3 = 1, it would be interesting to find an extreme constellation for which a solution would exist. The term −Γ1,2 + Γ1,3 + Γ2,3 was maximized under the restraints of the one-way marginal totals and nonnegative variables using the procedure NMaximize from Mathematica. The maximum was −Γ1,2 + Γ1,3 + Γ2,3 = 2.714, and the obtained table is given in Table 3B. The determination of the sub-tables and the comparison with the sub-tables of Table 2 shows that the first two sub-tables agree, but the third differs.

The association parameter turned out to be 0.714. Therefore, it is possible to simulate a table for −Γ1,2 = Γ1,3 =1 and Γ2,3 = 0.714. To simulate a table with agreeing association parameters (absolute values), we can determine an admissible table by applying NMaximize with the restraints Γ = −Γ1,2 = Γ1,3 = Γ2,3 and maximize Γ. In this case, we obtain Γ = 0.859. If we infer the restraints Γ = −Γ1,2 = Γ1,3 = Γ2,3, we obtain the admissible interval −0.859≤Γ≤1.

6.3 Association measured by Somers’ d

There is one similarity between Pearson’s ρ and Goodman and Kruskal’s γ: both take values between −1 and 1. A difference is that ρ = 1 means determinism, i.e., the observation of the category of one variable of an object is sufficient to know the category of the second variable.

This is not generally true for γ = ±1, since such an event only indicates that the table with maximum or minimum association is present. In fact, Lee [35] called it misleading perfect (negative) association. The three 2×2 tables shown here for γ = 1, 0, −1 have the same one-way marginals, p1 = (0.8, 0.2)′ and p2 = (0.1, 0.9)′.

The respective Pearson’s correlation coefficients are ρ = 0.1667, ρ = 0, and ρ = −0.6667. That means, for the given one-way marginals, that γ = 1 stands for low (positive) association, while γ = −1 stands for large (negative) association. Hence, the γ-scale is a relative one and worthless without additional information. Following [40], Somers’ d is a better measure of association (dependence) between ordinal variables. It is a modification of Goodman and Kruskal’s γ. Since a symmetrical version is needed here, the definition of T becomes (42) and the definition of d is (43)

The right-hand-side version results from and .

It is easy to see that d = 0 holds if the variables are independent, and d = ±1 holds if the category of one variable can be deduced from knowing the category of the other variable, i.e., when the table has an (anti-)diagonal structure. For the 2×2 tables from above, d = 0.087, d = 0, and d = −0.471 hold, respectively.
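The quoted values of ρ and d for the three 2×2 tables can be retraced with a short sketch. The symmetric form used for d below, (C − D)/(C + D + T) with T the probability of pairs tied on exactly one variable, is an assumption: it is not copied from Eqs (42) and (43) but was chosen because it reproduces the values stated in the text. The helper names are likewise illustrative.

```python
import math

def two_by_two(p11, r1, c1):
    """2x2 table with first-row total r1, first-column total c1, and cell p11."""
    return [[p11, r1 - p11], [c1 - p11, 1 - r1 - c1 + p11]]

def pearson_rho(t):
    r1, c1 = t[0][0] + t[0][1], t[0][0] + t[1][0]
    return (t[0][0] - r1 * c1) / math.sqrt(r1 * (1 - r1) * c1 * (1 - c1))

def somers_d_symmetric(t):
    # ASSUMED form: (C - D) / (C + D + T); T counts pairs tied on one variable only.
    conc = t[0][0] * t[1][1]
    disc = t[0][1] * t[1][0]
    ties = (t[0][0] * t[0][1] + t[1][0] * t[1][1]      # same row, different column
            + t[0][0] * t[1][0] + t[0][1] * t[1][1])   # same column, different row
    return (conc - disc) / (conc + disc + ties)

r1, c1 = 0.8, 0.1  # first categories of p1 = (0.8, 0.2)' and p2 = (0.1, 0.9)'
tables = {
    "gamma = 1": two_by_two(min(r1, c1), r1, c1),             # maximum association
    "gamma = 0": two_by_two(r1 * c1, r1, c1),                 # independence
    "gamma = -1": two_by_two(max(0.0, r1 + c1 - 1), r1, c1),  # minimum association
}
results = {name: (pearson_rho(t), somers_d_symmetric(t))
           for name, t in tables.items()}
```

With this assumed form, the denominator equals C + D only when no ties occur, which is the (anti-)diagonal case in which d reaches ±1.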

To give an impression of the relation of Somers’ d and Pearson’s ρ, the parameters were calculated for the sub-tables of Table 8. Somers’ values were d1,2 = −0.686, d1,3 = 0.622, and d2,3 = 0.857, and Pearson’s values were ρ1,2 = −0.797, ρ1,3 = −0.808, and ρ2,3 = 0.933.

To find an admissible table satisfying the one-way marginals and the nominal pairwise association parameters Δi,j, it is possible to apply a slightly modified version of the γ-method. For a certain pair i,j of variables, p0 and popt (which are pmax for Δ>0 and pmin for Δ<0) are determined as before. It is useful to calculate d for the table popt. The nominal Δ should reflect less association than d. Then, similar to the γ-method, for each pair of variables, (44) must be solved numerically. As with the γ-method, the obtained two-way marginals p(λ*) are written as a system of linear equations. These are solved by linear programming software. If no solution is obtained, the nominal association parameters need to be weakened.

This was the analog to the γ-method. The additional tools presented in Sections 6.1 and 6.2 can be adapted.

Consider again the example of the 3×3×3 table with one-way marginals p1 = (0.1, 0.3, 0.6)′, p2 = (0.2, 0.4, 0.4)′, and p3 = (0.3, 0.3, 0.4)′. Then, the bounds −0.686≤d1,2≤0.595, −0.667≤d1,3≤0.622, and −0.667≤d2,3≤0.857 are obtained. Maximizing d1,2+d1,3+d2,3 yields 2.074, with d1,2 = 0.594, d1,3 = 0.622, and d2,3 = 0.857. Minimizing d1,2+d1,3+d2,3 yields −0.984, with d1,2 = −0.244, d1,3 = −0.073, and d2,3 = −0.667. For d = d1,2 = d1,3 = d2,3, the admissible interval is −0.308≤d≤0.594.

For the example of Table 3B, d1,2 = −0.250, d1,3 = 0.323, and d2,3 = 0.286 are calculated. (For comparison, Pearson’s correlation coefficients were ρ1,2 = −0.408, ρ1,3 = 0.500, and ρ2,3 = 0.408.)

For the sub-tables of Table 8, d1,2 = −0.255, d1,3 = 0.282, and d2,3 = 0.318 are calculated. (For comparison, Pearson’s correlation coefficients were ρ1,2 = −0.390, ρ1,3 = 0.429, and ρ2,3 = 0.475.)

6.4 How to obtain tables for all admissible associations measured by Somers’ d

As was worked out in the last section, the adapted γ-method does not always allow the determination of a table satisfying nominal associations measured by Somers’ d, although one exists. It was also reported that a numerical maximization was able to find the solution. However, a large number of iterations were necessary, and the method may fail if the number of variables increases.

Assume that a table p* exists that satisfies the nominal two-way associations measured by Somers’ d. Let the expression d* = d(p*) define this property, where d* is the vector of the nominal two-way associations and d(p) indicates the vector of the actual two-way associations from table p.

From Section 6.1, it is known how to determine a table with given Pearson’s correlation coefficients ρ, where ρ is the vector of the two-way correlation coefficients. Denote the generated table with p(ρ). We are looking now for ρ, so that p(ρ) has the desired property with respect to d*, i.e., d* = d[p(ρ)]. This is realized by a minimization procedure: (45)

We used the function FindMinimum of Mathematica with starting points ρ = d* and the Euclidean norm. The function p(ρ) had to be specified in two ways, and subsequently, the minima and maxima of the free variables were evaluated. The cell of the actual variable was set to the mean of the minimum and maximum. We denote the specification with p(ρ, mean). When ρ left the admissible region, i.e., when there was no solution for the restraints, the penalty term ‖d*−d[p(ρ, mean)]‖ was set to a large value. The argument for which the norm is minimum is named ρ*.

The procedure was applied to the repeatedly used one-way marginals of a 3×3×3 table. In Section 6.3, the admissible range −0.308≤d≤0.594 was determined for d = d1,2 = d1,3 = d2,3. Nearly extreme scenarios were chosen. For d* = (−0.3, −0.3, −0.3)′, the appropriate ρ* vector became (−0.437, −0.447, −0.456)′. The obtained table is not presented here but can be determined via p(ρ*, mean) and the technique from Section 6.1. For d* = (0.59, 0.59, 0.59)′, the appropriate ρ* vector became (0.791, 0.764, 0.750)′. It was confirmed with other examples that Pearson’s coefficients were often larger in absolute value than Somers’. But this is not a general rule, as shown with d* = (0, 0, 0)′. Then, the appropriate ρ* vector became (0.182, 0.057, 0.013)′ and deviated considerably from the expected ρ*≈(0, 0, 0)′.

Recall that a solution need not be unique. Assuming independence between all pairs of variables, the related table is determined by multiplying the one-way marginals involved in the specified cells. For this independence table, d = (0, 0, 0)′ and ρ = (0, 0, 0)′ hold. If we wish to generate the independence table, given the demand d* = (0, 0, 0)′, we must give up the choice of the mean value of the admissible intervals (bl, bu) of the free variables. Instead, we take that value of the interval that is nearest to the value pind of the independence table. If bl ≤ pind ≤ bu holds, p = pind is taken, and if pind < bl holds, p = bl is taken. For bu < pind, the choice is p = bu. We denote this specification with p(ρ, ind). The application of this principle led indeed to ρ* = (0, 0, 0)′. Applied to d* = (−0.3, −0.3, −0.3)′, the appropriate ρ* vector became (−0.442, −0.440, −0.454)′. For d* = (0.59, 0.59, 0.59)′, the appropriate ρ* vector became (0.790, 0.750, 0.735)′. Obviously, for high associations, the difference between p(ρ, ind) and p(ρ, mean) was not great. For the most extreme associations, −0.308 and 0.594, p(ρ, ind) and p(ρ, mean) result in the same table.
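The three-case rule for choosing a free cell under the specification p(ρ, ind) amounts to clamping the independence value to the admissible interval. A minimal sketch follows; the function name and the numbers in the usage lines are hypothetical.

```python
def choose_free_cell(p_ind, b_l, b_u):
    """Return the value in [b_l, b_u] that is nearest to p_ind."""
    return min(max(p_ind, b_l), b_u)

# Hypothetical admissible interval (0.05, 0.12) for one free cell.
inside = choose_free_cell(0.08, 0.05, 0.12)   # p_ind lies inside the interval
below = choose_free_cell(0.02, 0.05, 0.12)    # p_ind < b_l, so b_l is taken
above = choose_free_cell(0.20, 0.05, 0.12)    # b_u < p_ind, so b_u is taken
```

The specifications p(ρ, max) and p(ρ, min) would correspond to always returning b_u or b_l, respectively.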

For d* = (0, 0, 0)′, p(ρ, max) was still evaluated, i.e., the maximum was always chosen from the admissible intervals for the free variables. Then, the appropriate ρ* vector became (0.193, 0.130, 0.026)′. Analogously, with p(ρ, min), the appropriate ρ* vector became (−0.100, −0.090, −0.086)′. This might suffice to illustrate the admissible range of tables satisfying nominal associations.

When the minimum of (45) was not zero for a nominal d*, no table for the demands was found. Then, the nearest admissible table due to the used norm was obtained.

7 Application to the Berkeley data

7.1 Why the two-way LD differs from the partial LDs and their mean

One real-life example for Simpson’s paradox is particularly impressive. The University of California, Berkeley, was sued for bias against women who had applied for admission. The reduced data version found at https://en.wikipedia.org/wiki/Simpson%27s_paradox is presented in columns 1–5 of Table 9.

Table 9. Numbers of denied and admitted applications at six departments as part of the study [41].

Variable 1 is sex (men—women), variable 2 is admittance (denied–admitted), and variable 3 is the department (1 to 6). is the LD between variable 1 and variable 3 (which is now dichotomous: Department i versus the rest). is the LD between variable 2 and variable 3 (which is again dichotomous: Department i versus the rest). Parameter stands for the frequency of applications to department i. is the LD between the first category of variable 1 and the first category of variable 2 within Department i, and is the corresponding correlation coefficient.

https://doi.org/10.1371/journal.pone.0262502.t012

Dividing the number of admitted men by the number of applying men shows a rate of 1198/2691 = 44.5%, while dividing the number of admitted women by the number of applying women shows a rate of 557/1835 = 30.4%. The large difference between 44.5% and 30.4% resulted in a perception of discrimination against women. Therefore, the question was whether women were really handicapped or if there were other reasons that led to the differing rates.

Bickel and colleagues [41] examined the department-level data and did not find clear evidence of discrimination against women. Averaged over the departments, they found a moderate preference for women. In principle and qualitatively, this corresponds to the inspection of the last two columns of Table 9. Note that positive values mean that more men than women relative to their frequencies were denied; i.e., more women were admitted. The authors of [41] also worked out the reason for the great discrepancy between the apparent overall handicap for women and the almost absent handicaps within the departments. The reason was that women preferentially applied to departments with low admission rates. However, this reason was not found by straightforward theory but by good detective work.

With the LD approach of Section 2, the parameter for the overall association between sex and admission is D1,2. The overall LD is D1,2 = (1493×557−1198×1278)/4526² = −0.0341, showing the handicap for women. (Significance tests should and can be applied, but they are not the focus here.) This approach does not account for the influence of the departments. Therefore, the averaged LD of the departments is a more reliable parameter. Direct evaluation of via the eighth column of Table 9 gives , reflecting a small preference for women. The difference between both parameters is .
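The admission rates and the overall LD follow directly from the four pooled counts quoted in the text. The sketch below only retraces that arithmetic; the variable names are illustrative.

```python
# Pooled counts quoted in the text (sex by admittance, all departments merged).
denied_men, admitted_men = 1493, 1198
denied_women, admitted_women = 1278, 557

n = denied_men + admitted_men + denied_women + admitted_women

rate_men = admitted_men / (denied_men + admitted_men)
rate_women = admitted_women / (denied_women + admitted_women)

# LD between the first categories (men, denied): D = p11*p22 - p12*p21.
d12 = (denied_men * admitted_women - admitted_men * denied_women) / n**2
```

The negative sign of `d12` means that men appear less often among the denied applicants than expected under independence, which is the pooled disadvantage for women described above.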

Now, the result (8) of Section 2 is applied. With it, the difference can be determined in a completely different manner. The difference between D1,2 and is . The first summand, , is determined as follows. belongs to the 2×2 table for Department 1, where the first column is assigned to “men” and the second column to “women”. The first row is assigned to “Department i”. The second row is assigned to the complement, i.e., to the rest of departments. In Department 1, 825 men and 108 women applied. Overall, 2691 men and 1835 women applied; i.e., 2691−825 men and 1835−108 women applied to the other departments. The numbers are presented in Table 10 together with those for denied and admitted applicants at department 1.

Table 10. Numbers of men and women with application to department 1 and numbers of denied and admitted applicants at department 1.

“Rest” means departments two to six.

https://doi.org/10.1371/journal.pone.0262502.t013
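The sex-by-department part of Table 10 can be rebuilt from the applicant numbers quoted in the text, which gives the LD between sex and the dichotomous variable "Department 1 versus rest" as well as the application frequency of Department 1. This is a minimal sketch of that arithmetic with illustrative variable names; the admission part of Table 10 is not retraced because its counts are not quoted in the text.

```python
# Applicant numbers quoted in the text.
men_dept1, women_dept1 = 825, 108
men_total, women_total = 2691, 1835

men_rest = men_total - men_dept1
women_rest = women_total - women_dept1
n = men_total + women_total

# LD for the 2x2 table with rows (Department 1, rest) and columns (men, women).
d_sex_dept1 = (men_dept1 * women_rest - women_dept1 * men_rest) / n**2

# Frequency of applications to Department 1.
share_dept1 = (men_dept1 + women_dept1) / n
```

The value of `d_sex_dept1` is positive, i.e., men are over-represented among the applicants to Department 1.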

We obtain . The positive sign says that the applications of men appeared more often than the average. Analogously, is calculated. The negative sign indicates that admittance was more often than the average. We need still , the probability of application to Department 1, which is . Hence, . When columns 6, 7, and 8 are completed, the difference between D1,2 and can be calculated: (46)

It is easy to see that the difference becomes particularly large (positive) when and have the same sign within the departments, because then all summands are positive. Also, the difference becomes particularly small (negative) when and have different signs within the departments, because then all summands are negative. Inspection of columns 6 and 7 of Table 9 proves the negative correlation of and within the Berkeley data.

The interpretation is that the apparent discrimination against women was caused by a property of the departments. Those with high admittance rates had more male applicants and those with low admittance rates had more female applicants. While Bickel and colleagues [41] had to be good detectives to discover this trend, the new approach makes it obvious immediately. The remark in [41] concerning the role of the size of the departments has to be verified, since appears in the denominator in (46). See also Eq (49B).

7.2 The determination of parsimonious models fitting the data

The aim is to infer whether a proven three-way interaction is caused by all three-way interaction parameters or only by a subset of them. With the Berkeley data, it is shown that the search for a parsimonious model fitting the data can be successful.

The multinomial distribution of Table 9 has 2×2×6 = 24 parameters. Eliminating zero-, one-, and two-way marginals results in five free variables. To determine the distribution without a three-way interaction, the entropy was maximized for these variables. The maximum 2.888 was reached for , and . Comparison with the observed table yields a χ2 value of 18.8, indicating disagreement (p = 0.002) and thus the existence of a three-way interaction.
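The quoted p-value can be retraced with the χ2 survival function. The sketch below assumes that the degrees of freedom equal the number of free variables (five here) and uses the standard closed-form recursion for integer degrees of freedom; the function name is illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution, integer df >= 1."""
    if df % 2:                       # odd df: start from df = 1
        q, nu = math.erfc(math.sqrt(x / 2)), 1
    else:                            # even df: start from df = 2
        q, nu = math.exp(-x / 2), 2
    while nu < df:                   # recursion Q(x; nu + 2) = Q(x; nu) + increment
        q += (x / 2) ** (nu / 2) * math.exp(-x / 2) / math.gamma(nu / 2 + 1)
        nu += 2
    return q

# Test of "no three-way interaction" for the Berkeley data; df = 5 is assumed.
p_three_way = chi2_sf(18.8, 5)
```

The same function with one degree of freedom applies to comparisons of a single partial 2×2 table, e.g., `chi2_sf(10.8, 1)`.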

The three-way interaction is quantified by the three-way interaction parameters . The counts for the variables were n1,1,1 = 313, n1,1,2 = 207, n1,1,3 = 205, n1,1,4 = 279, and n1,1,5 = 138. The corresponding three-way interaction parameters are therefore D1,1,1 = 0.0038, D1,1,2 = 0.00014, D1,1,3 = −0.0024, D1,1,4 = −0.00018, and D1,1,5 = −0.0016. The sixth three-way disequilibrium parameter, D1,1,6, linearly depends on the others. Actually, the sum is zero, i.e., D1,1,6 = 0.00021. The largest absolute value appeared for the first department.

All three-way interaction parameters differing from zero reflect a contribution to three-way interaction. To quantify these contributions, the partial 2×2 tables under the hypothesis of an absence of three-way interactions were compared with the observed ones. The χ2 values for the six categories were 20.6, 0.1, 2.4, 0.01, 2.1, and 0.1, respectively. Obviously, the first department indeed plays a dominant role.

A table {pi,j,k} fitting the observed table must therefore trim D1,1,1 to zero. This can be guaranteed by setting the free parameter p1,1,1 to n1,1,1/4526. Then, there remain four free parameters. Theoretically, one could now derive maximum-likelihood estimates to fit them, but the use of the maximum entropy principle under restraints is easier. The restraints are the zero-, one-, and two-way marginals and p1,1,1 = 0.0692. Then, the four remaining free parameters are determined by numerically maximizing the entropy. The maximum H = 2.886 was reached for p1,1,2 = 0.0454, p1,1,3 = 0.0463, p1,1,4 = 0.0605, and p1,1,5 = 0.0314. A comparison of the corresponding table with the observed one yielded a χ2 value of 2.56; i.e., the model fits the data. For the hypothetical table, the partial correlations for Departments 1 to 6 were 0.136, −0.003, −0.007, −0.007, −0.006, and −0.004, respectively; i.e., with the exception of , they were small in absolute value. The complete table, multiplied by n, is presented in Table 11 under Method A.

Table 11. Fitted counts of the Berkeley data.

Five free variables were fitted in three ways. A: First variable n1,1,1 taken from Table 9, four from maximizing entropy. B: Five from D1,1,i = 0, i = 2,3,…,6. C: Four from agreeing , with the fifth the log-likelihood estimate.

https://doi.org/10.1371/journal.pone.0262502.t014

Now an alternative method is investigated. In Section 3, we worked out for 2×2×2 tables how to infer the agreement of partial correlations. This approach is straightforward to generalize to 2×2×I3 tables with I3>2. (For that, the right-hand-side expression of Eq (14) is useful. There, i substitutes for index “1” and I3 substitutes for index “2”. The same substitutions are necessary for the third indices of A and B.) Then, the global hypothesis for the Berkeley data would be that all partial correlations with respect to the third variable agree. It can be formulated by demanding ; i.e., there are actually five equations. Since there are also five degrees of freedom, the hypothetical table may be calculated explicitly. The hypothetical agreeing partial correlation coefficients became 0.019. A comparison of the hypothetical 2×2×6 table with the observed data gave a χ2 value of 17.4 (p = 0.0036); i.e., the hypothesis of a common correlation coefficient is not supported. Comparisons of the partial 2×2 tables with the observed ones showed one significant deviation. For Department 1, the χ2 value became 19.2. All other χ2 values were smaller than 2.1.

Comparisons of the partial 2×2 tables with the tables for independence showed no significant deviation. All χ2 values were smaller than 0.4. That means can be assumed. These five equations are solved by the five arguments p1,1,1 = 0.0683, p1,1,2 = 0.0455, p1,1,3 = 0.0466, p1,1,4 = 0.0608, and p1,1,5 = 0.0316. The comparison of the associated table with the observed data gave a χ2 value of 3.69; i.e., the model fits the data. Comparisons of the partial 2×2 tables with the observed ones also showed no significant deviation. All χ2 values were less than 1.2. Comparisons of the partial 2×2 tables with those under independence showed one significant deviation. For Department 1, the χ2 value became 10.8 (p = 0.001, the correlation was , somewhat smaller than before). All other χ2 values were of course zero. The table is presented under Method B in Table 11.

Method B gained from the finding that five partial correlations could be set to zero. Under different circumstances, it could be possible that the five partial correlations agree but are not zero. In that case, can be assumed for i∈{2, 3, 4, 5}. For these four equations, four variables can be eliminated. The last variable, p1,1,1, is determined via maximum likelihood, i.e., by maximizing ∑i,j,kni,j,kln(pi,j,k). The solution can be viewed under Method C in Table 11. The comparison of the table with the observed data gave a χ2 value of 2.73; i.e., the model fits the data. Comparisons of the partial 2×2 tables with the observed ones also showed no significant deviation. All χ2 values were less than 0.72. Comparisons of the partial 2×2 tables with those under independence showed one significant deviation. For Department 1, the χ2 value became 16.7 (p = 0.00004, the correlation was ). All other χ2 values were less than 0.03.

Thus, the apparent discrimination against women with respect to admittance turned out to be untrue. In Departments 2 to 6, men and women were admitted equally. In the first department, men had a significant handicap.

8 Discussion

In Section 2, the LD parameter was used to quantify Simpson’s paradox. The difference between a two-way interaction and the averaged partial interactions was derived. For a 2×2×2 table, the difference was (47)

(Note that the notation D1,2, e.g., means the LD between the first category of variable 1 and the first category of variable 2, formerly denoted by .) In many experiments, there is one response variable (here it is the first one) and two explanatory variables. The latter ones can be arranged to ensure D2,3 = 0, for example, by applying the treatments to the same fraction of males and females. In this way, the difference is zero and Simpson’s paradox is circumvented.

Let D1,3 and D2,3 differ from zero. One could think that the difference is largest when p3 is near zero or one. However, the value of LD depends on the one-way marginal totals. Using the correlation coefficients (5) instead gives (48)

Thus, the absolute difference is largest when p1 and p2 are one half, and it is smallest when p1 or p2 are zero or one. On the other hand, when p1 and p2 are zero or one, the associated LD, i.e., D1,2, is zero. Therefore, it is useful to also consider the relative difference: (49A)

For an I1×I2×I3 table, the appropriate expression is (49B)

When we are interested in the association between two categorical variables, such as sex and admission at a university, it is useful to determine D1,2 or ρ1,2. If one finds preference for one sex, this does not mean that the other sex experienced discrimination. The reason could be that the abilities of the sexes happened to be different. Therefore, it was reasonable to consider an index for the high school report as a factor. With the Berkeley data, it turned out that the departments need to be considered as a factor. When a third factor has an effect, then gives a better estimate for the association of the two variables than D1,2. However, the value of D1,2 is still useful. If the difference is greater than zero, it follows automatically from (47) that D1,3 and D2,3 cannot be zero and they have the same signs. If the difference is zero, it follows that D1,3 or D2,3 are zero. If the difference is smaller than zero, D1,3 and D2,3 cannot be zero and they have different signs; i.e., the interactions have different directions.

Hence, Eqs (8) and (9) are a great help for interpreting the tables. It is particularly interesting that their validity is independent of the free parameters.

Section 3 was dedicated to the question of whether the amount of interaction between two variables depends on another categorical variable. In Section 7.2, the approach was generalized to 2×2×I3 tables and applied to the Berkeley data. It was possible to find parsimonious models that fit the data.

If all variables have more than two categories, i.e., for a general I1×I2×I3 table, the third variable has no effect on the associations between the first and second variable if (50) holds for i = 1,2,⋯,I1−1,j = 1,2,⋯,I2−1, and k = 1,2,⋯,I3−1. Analogously to (12), (13), and (14), this system of linear equations can be solved, thereby delivering the hypothetical table that can be compared with the observed one. If it does not fit the data, subsequently it can be checked whether the k−th category of the third variable plays a special role. For each k, the (I1−1)(I2−1) equations of (50) are solved. The remaining (I1−1)(I2−1)(I3−2) free variables are found by maximizing entropy. For each k, the hypothetical table can again be tested against the observed table.

When there is still no hit, the largest deviations from an average ρ can be searched in different ways. There is a need for further investigations into an optimal systematic strategy to find a parsimonious model. The model choice and multiple testing theories have to be kept in mind.

However, the new approach is suited to answering important questions and surely enriches the theory of contingency tables.

In Section 4, the relation between Bartlett’s and Bennett’s measure on the three-way interaction was investigated. As summarized in the introduction, Bartlett’s measure (which he mentioned came from R.A. Fisher) had a high degree of impact, while Bennett’s measure was a generalization of the two-way LD based on intuition. The meaning and correctness of this measure could therefore only be checked through its relation to Bartlett’s measure. As it turned out, it is a simplified version of the first-order Taylor expansion of the latter one.

For 2×2×2×2 tables, the criterion for an absence of four-way interaction is a straightforward generalization of Bartlett’s multiplicative criterion, shown by Good [27]. A generalization of Bennett’s linear three-way measure is not straightforward [16–20].

Unfortunately, the roots of the seventh-degree polynomial arising for the multiplicative measure cannot be determined algebraically, and the Taylor expansion cannot be generated directly. However, the criterion is a function of the parameter p1,1,1,1, which is itself a function of the one-, two-, and three-way marginal totals. Thus, further progress depends on the availability of an effective algorithm to derive multivariate Taylor expansions for implicit functions. Since the implicit function is a polynomial (where the order of derivatives unequal to zero is finite) and there is a high amount of symmetry, there appears to be hope.

The focus in Section 5 was on tables with zero counts. The question was whether these counts appeared by chance or whether they were a necessary consequence from the given two-way marginal totals. The application of linear programming was successful in obtaining fixed zero counts. Furthermore, fixed nonzero counts can be determined.

One example of Fienberg and Rinaldo [2] was reanalyzed and led to Table 2. For completeness, the other examples were also investigated. For their Tables 4 and 7, no fixed cells were obtained. For their Table 5, all cells turned out to be fixed. Fienberg and Rinaldo [2] characterized this table as yielding no MLE and wrote, “In fact, the values of both goodness of fit statistics will always be almost zero, no matter what the positive counts are”. This underlines that they did not acknowledge that the contingency table was the only one with the given marginal totals. So far, there was no tool available to find this simple but important truth.

In cases where the number of variables could be reduced, such as with Table 2, the question arises whether the determination of the MLEs can be optimized. One way would be to modify commonly used procedures. Alternatively, Good’s [27] method of maximizing the entropy under restraints can be used. Numerical maximization of the entropy of the data of Table 2 (divided by n = 113), given the two-way marginal totals, yielded p1,1,2 = 0.0316, p1,1,3 = 0.0120, p2,3,1 = 0.0102, and p3,2,1 = 0.0081. Due to the concavity of entropy, the convergence was excellent. The other cell counts can be calculated according to the expressions in Table 2.

In Section 6, improvements were achieved for the simulation of ordinally scaled variables. The main task was to determine an admissible table satisfying the demands. As noted above, there might be difficulties in obtaining an admissible table, simply because such a table does not exist when the restraints are too strong. Ignoring the assumptions pi≥0, a solution would exist (if the number of equations expressing the restraints does not exceed the number of variables). However, the bona fide table would have negative entries. Therefore, one can search for an admissible solution by minimizing (51) where the free variables must be fit. If the obtained minimum is zero, an admissible solution is found. Alternatively, the maximum entropy principle under restraints can be used. When a bona fide table has negative entries, the entropy becomes complex. Therefore, it is useful to minimize (52) where Im(x) is the imaginary part of x. Our limited experience favors the latter method.

Lee [35] derived an algorithm for simulating nominal variables with given pairwise correlations measured with Goodman and Kruskal’s τ, 0≤τ≤1. Since the original measure does not ensure τ(Y1, Y2) = τ(Y2, Y1), Lee suggested the symmetric measure τ = Max[τ(Y1, Y2), τ(Y2, Y1)]. This measure is one when τ(Y1, Y2) or τ(Y2, Y1) is one. However, a maximum correlation of one should only appear when both τ(Y1, Y2) and τ(Y2, Y1) showed a correlation of one. Therefore, it is more appropriate to define τ = [τ(Y1, Y2)+τ(Y2, Y1)]/2.
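The difference between the two symmetric versions can be illustrated with a small sketch. It uses the usual proportional-prediction form of Goodman and Kruskal’s τ; the function names and the 3×2 table are hypothetical and chosen so that the row category determines the column category but not vice versa.

```python
def gk_tau(table):
    """Goodman and Kruskal's tau for predicting the column variable from the row variable."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    within = sum(p * p / row_tot[i]
                 for i, row in enumerate(table) for p in row if row_tot[i] > 0)
    base = sum(c * c for c in col_tot)
    return (within - base) / (1 - base)

def transpose(table):
    return [list(col) for col in zip(*table)]

# Hypothetical table of probabilities: rows are Y1 (3 categories), columns Y2 (2 categories).
table = [[0.3, 0.0],
         [0.3, 0.0],
         [0.0, 0.4]]

tau_col_from_row = gk_tau(table)             # Y2 is fully determined by Y1
tau_row_from_col = gk_tau(transpose(table))  # Y1 is only partly determined by Y2

tau_max = max(tau_col_from_row, tau_row_from_col)
tau_mean = (tau_col_from_row + tau_row_from_col) / 2
```

For this table the maximum version equals one although only one direction is deterministic, whereas the averaged version stays below one.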

It can be shown that nominal variables result in similar problems as with ordinally scaled variables. Their treatment is analogous to that presented in Sections 6.2 and 6.3. Unfortunately, the method from Section 6.4 cannot be applied. The reason is that a measure for nominal variables is invariant with respect to permutations of the categories, while this does not apply to the correlation coefficient generally.

Although a lot of care was spent on the simulation of tables with nominal pairwise association measures, it seems that the meaning of such scenarios is limited. In practice, when an observed table is analysed, it is more important to simulate either tables under a null hypothesis or tables under different alternative hypotheses. In both cases, the two-way marginal totals can be viewed as fixed. With given two-way marginal totals, the two-way associations can be determined. (When there are c variables, even the c-way marginal totals can be viewed as fixed.) Then, it remains to define the properties of the table to simulate and to determine the cell properties. The simulation can then be carried out with the inversion method of Lee [35].

The statements resulting from such simulations are normally about the effect of certain properties of a table. This is only correct when there is just one table with the properties. Commonly, there are several such tables; i.e., it is necessary to simulate at least some extreme tables (where the cell probabilities are edges of the convex set of admissible tables) and an average table (e.g., the table with the given restraints and maximum entropy).

In this study, Good’s [27] investigations on maximum entropy under restraints were repeatedly used. They allow us to determine hypothetical tables without knowing the MLEs of the log-linear model explicitly, as it suffices to formulate the equations of the hypotheses.

The 2×2×2 data of Mood [42] were repeatedly used to demonstrate improved theories. The observations were n1,1,1 = 79, n1,1,2 = 73, n1,2,1 = 62, n1,2,2 = 168, n2,1,1 = 177, n2,1,2 = 81, n2,2,1 = 121 and n2,2,2 = 75. Application of the maximum entropy principle led to the results in Table 12. Due to the concavity of entropy, convergence was excellent. Comparison with results of previous theories shows the appropriateness of the principle. A comment on the results [43] is given below.
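As an illustration with the counts listed above, the following sketch fits the table that reproduces all observed two-way margins and has no three-way interaction. It uses iterative proportional fitting started from a uniform table, which is a substituted, standard technique and not necessarily the numerical procedure used in this study; the helper names are hypothetical. No χ2 value is quoted here, since Table 12 is not reproduced in the text.

```python
import itertools

# Counts n[i][j][k] of Mood's 2x2x2 data as listed in the text (0-based indices)
n = [[[79, 73], [62, 168]], [[177, 81], [121, 75]]]
idx = list(itertools.product(range(2), repeat=3))


def margin(t, axes):
    """Marginal totals of table t (dict keyed by cell) over the given axes."""
    m = {}
    for c in idx:
        key = tuple(c[a] for a in axes)
        m[key] = m.get(key, 0.0) + t[c]
    return m


obs = {c: float(n[c[0]][c[1]][c[2]]) for c in idx}
total = sum(obs.values())
fit = {c: total / len(idx) for c in idx}  # uniform start (maximal entropy)

pairs = [(0, 1), (0, 2), (1, 2)]
targets = {ax: margin(obs, ax) for ax in pairs}

for _ in range(1000):
    old = dict(fit)
    for ax in pairs:
        cur = margin(fit, ax)
        for c in idx:
            key = tuple(c[a] for a in ax)
            fit[c] *= targets[ax][key] / cur[key]  # rescale to match this margin
    if max(abs(fit[c] - old[c]) for c in idx) < 1e-10:
        break

# Pearson chi-square of the observed counts against the fitted table
chi2 = sum((obs[c] - fit[c]) ** 2 / fit[c] for c in idx)
print(fit, chi2)
```

Because every update multiplies the cells by a factor that depends on two indices only, the fitted table keeps equal cross-product ratios in both layers of the third variable, which is Bartlett’s criterion of no three-way interaction.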

Table 12. The χ² values for certain interaction hypotheses concerning the data of Mood (1950).

https://doi.org/10.1371/journal.pone.0262502.t015

Note that the row for maximum entropy satisfies the theoretical results of Roy and Kastenbaum [39] on the MLE for the log-linear model with given zero-, one-, and two-way margins. The approach [43] yielded the correct result only in one case.

Due to the concavity of entropy, the maximum is a global one. This explains why Bartlett’s criterion, which is a cubic equation, has one real and two complex solutions under all admissible circumstances.

This fact makes it easy to prove that Bartlett’s and Bennett’s criteria agree if p1 = p2 = p3 = 1/2 holds or if at least two of the three two-way interactions are zero. When these conditions are substituted into (17), the criterion (15) with the appropriate cell frequencies can be expanded. The result is in both cases (53)

Good’s [27] paper on maximum entropy under restraints, however, was sometimes overlooked. For example, Streitberg [25, 26] did not include this approach in his contemplations. When Fienberg and Rinaldo [2] cited Good, they did not write about entropy, and when Fienberg and Rinaldo [30, 31] wrote about entropy, they did not cite Good. The authors of [43] wrote about entropy without recognizing Good’s results. They stated that an equivalence test for the independence between one variable and the remaining two in [39] was not correct. For a 2×2×2 table, the statement of [39] can be formulated as follows: Assuming D1,3 = D2,3 = 0, the validity of Bartlett’s criterion, i.e., D = 0, is equivalent to pi,j,k = p•,•,k pi,j,• for i,j,k∈{1,2}.

Applying D1,3 = D2,3 = 0 to (16), we get via (53)

(54)

A proof for Roy and Kastenbaum’s [39] statement is now given. Starting the proof with D = 0, we get p1,1,1 = p3 p1,2. Substituting p3 p1,2 for p1,1,1 in (11), together with p1,3 = p1p3 and p2,3 = p2p3, gives (55)

i.e., indeed pi,j,k = p•,•,k pi,j,• holds for i,j,k∈{1,2}.

Conversely, starting with p1,1,1 = p3 p1,2 and regarding (54), we immediately get D = 0. Therefore, the statement in [43] was not correct. This caused the wrong results in Table 12.

Another advantage of the entropy principle is that there are no problems with zero counts. As Khinchin [29] noted, an event with probability zero need not be considered. It might be viewed as mythical to exclude an event from a contingency table, but this view is overcome when the table is considered as a multinomial distribution. In that case, the dimension simply reduces. In this way, there are also no problems with the evaluation of the χ² or G² test statistics, since singularities cannot appear.

Hence, a numerical procedure that maximizes entropy should test whether a probability pi is larger than, say, 10^−8. Otherwise, the term pi ln pi is set to zero when summing up entropy via (2).
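A minimal sketch of such a guarded entropy sum follows. It assumes that (2) is the Shannon entropy H = −Σ pi ln pi; the function name and the default threshold argument are hypothetical.

```python
import math


def entropy(probs, eps=1e-8):
    """Shannon entropy with natural logarithm; cells with p_i <= eps contribute zero."""
    return -sum(p * math.log(p) for p in probs if p > eps)


# A zero cell is simply skipped, so no log(0) singularity can occur
h_with_zero = entropy([0.5, 0.5, 0.0])
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])
print(h_with_zero, h_uniform)
```

The table with one empty cell has the same entropy as the corresponding two-cell distribution, which reflects the reduction of the dimension mentioned above.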

The concept of entropy is of great importance in thermodynamics. There, a system tends toward an equilibrium state, one with maximum entropy. Similar processes are observable in population genetics, where large populations with random mating converge to independence of genotypes, even for closely linked loci. (Only the one-way margins are maintained.) The obtained state is named the linkage equilibrium, while the presence of a two-way interaction is called the linkage disequilibrium. In population genetics, there are also events that decrease entropy, such as mutations, inbreeding, and selection. While the aspects of processes concerning populations are complex, they are simple compared with social or metabolic processes. In human society, there are forces toward an increase of entropy and forces toward a reduction of entropy, from the smallest groups up to the human race.

Hence, analyzing disequilibria using contingency tables encompasses the task of thinking about forces that affect a process.

In this study, two points were repeatedly applied: the maximization of entropy and the treatment of a contingency table as a multinomial distribution. The question arises whether entropy has an analytical relation to the likelihood function of a multinomial distribution.

The k-dimensional multinomial probability distribution is $P(n_1,\dots,n_k) = \frac{n!}{n_1!\cdots n_k!}\prod_{i=1}^{k} p_i^{n_i}$. In the ideal case pi = ni/n (which is at least asymptotically satisfied), the log-likelihood of the factor $\prod_{i} p_i^{n_i}$ is $\sum_{i} n_i \ln p_i = n\sum_{i} p_i \ln p_i = -nH$. This relation suggests that the maximal likelihood is related to the minimal entropy and the maximum entropy to the minimal likelihood. In the given context, however, the observations ni underlie constraints, such as the given one- and two-way marginal totals. Therefore, the multinomial coefficient is no longer a constant. The application of Stirling’s formula leads to (56)

Therefore, (57) is an asymptotic expansion of the log-likelihood function. The likelihood function is then (58) where the right-hand-side expression corresponds to the formulas given in [45].
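The quality of a Stirling-based approximation can be checked numerically. The sketch below compares the exact multinomial log-likelihood at pi = ni/n with the standard leading-order result ln L ≈ −((k−1)/2) ln(2πn) − (1/2) Σ ln pi. That this expression coincides term by term with (57) and (58) is an assumption, and the counts are hypothetical.

```python
import math


def exact_log_lik(counts):
    """ln of the multinomial probability of `counts`, evaluated at p_i = n_i / n."""
    n = sum(counts)
    out = math.lgamma(n + 1)  # ln n!
    for m in counts:
        out -= math.lgamma(m + 1)  # subtract ln n_i!
        if m > 0:
            out += m * math.log(m / n)  # add n_i ln p_i
    return out


def stirling_log_lik(counts):
    """Leading-order Stirling approximation; requires all counts to be positive."""
    n = sum(counts)
    k = len(counts)
    return (-0.5 * (k - 1) * math.log(2 * math.pi * n)
            - 0.5 * sum(math.log(m / n) for m in counts))


counts = [12, 18, 28, 42]  # hypothetical 2x2 table with n = 100
gap = exact_log_lik(counts) - stirling_log_lik(counts)
print(exact_log_lik(counts), stirling_log_lik(counts), gap)
```

The remaining gap is of the order of the neglected Stirling correction terms 1/(12 ni), so it shrinks as the cell counts grow.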

Leaving out the constants, the derivative of the likelihood function is (59)

The constraints investigated in this study, given one-, two-, or three-way marginal totals, lead to cells ni or pi, which are linear combinations of free parameters and given constants. Actually, each free parameter x appears in the cells either as x or −x; see Eqs (4), (11), (17) and Table 2. Therefore, dni/dx is either 1 or −1. Thus, the derivative (59) is zero if (60) is satisfied. The plus sign means that summation has to be taken over all cells where x appears as +x, and the minus sign means that summation has to be taken over all cells where x appears as −x. Both cases appear with the same frequency. Therefore, Eq (60) indicates that agreement of the harmonic means guarantees an optimum. It can be shown that the second derivative is strictly positive; i.e., solving Eq (60) gives the value x for which the likelihood is minimal.

Two examples of 2×2 tables are presented in Fig 1.

Fig 1. The log-likelihood function ln L and the entropy H for the free parameter x of two 2×2 tables.

The broken lines correspond to the asymptotic expression (57).

https://doi.org/10.1371/journal.pone.0262502.g001

One can see in Fig 1 that the minimum likelihood corresponds to the maximum entropy. For the first example, the condition (60) for the minimum of L, and therefore also for ln L, is (61)

The solution is x≈12.3, while the maximum entropy appears for x = 12. The second example particularly shows the goodness of the asymptotic expression, as it nearly agrees with the exact one.

The asymptotic condition for the minimum likelihood (60) has a special relation to the maximum entropy principle. When Good [27] determined the maximum entropy in the same context as here, he found the condition (62)

A comparison of this condition with condition (60) proves that the equality of geometric means applies for maximizing the entropy while the equality of harmonic means applies for minimizing the likelihood.
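The two conditions can be compared numerically for a 2×2 table with fixed margins and one free parameter x. The sketch below is written under stated assumptions: the margins are hypothetical (they are not the examples of Fig 1), the harmonic condition is taken as equality of the sums of 1/ni over the +x and −x cells, and the geometric condition as equality of the corresponding products of the cells.

```python
import math

# Hypothetical margins: first row total, first column total, sample size
R1, C1, N = 30, 40, 100


def cells(x):
    """Cells of the 2x2 table; x enters as +x in the first and last cell, as -x otherwise."""
    return x, R1 - x, C1 - x, N - R1 - C1 + x


def harmonic_gap(x):
    a, b, c, d = cells(x)
    return (1 / a + 1 / d) - (1 / b + 1 / c)


def geometric_gap(x):
    a, b, c, d = cells(x)
    return math.log(a * d) - math.log(b * c)


def entropy(x):
    return -sum((m / N) * math.log(m / N) for m in cells(x))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming exactly one sign change."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x_lo, x_hi = 1e-6, min(R1, C1) - 1e-6  # interior of the admissible range
x_entropy = bisect(geometric_gap, x_lo, x_hi)  # maximum entropy
x_likelihood = bisect(harmonic_gap, x_lo, x_hi)  # asymptotic minimum likelihood
print(x_entropy, x_likelihood)
```

For these margins the geometric condition reproduces the independence table, while the harmonic condition gives a nearby but different value of x, in line with the statement that the two solutions are similar but not identical.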

The most elementary restraint is that the number of observations is just n. Then, we can write $n_k = n - \sum_{i=1}^{k-1} n_i$, and both Eqs (60) and (62) give the same results nk = ni, i = 1,2,⋯,k−1. From this, ni = n/k and pi = 1/k result for i = 1,2,⋯,k; i.e., the maximum entropy solution is identical with the asymptotic minimum likelihood solution.

9 Conclusions

Five methods contributed to a considerable improvement in the theory of contingency tables: (1) the use of the LD measure, (2) the treatment of a table as a multinomial distribution, (3) the use of algebraic software, (4) the consistent use of linear programming, and (5) the application of the maximization of entropy under restraints.

Using the linkage disequilibrium parameter D as a measure of association between two categorical variables, which is essentially the determinant D = p11p22 − p12p21 of a 2×2 table, sufficed to quantify Simpson’s paradox. The difference between a two-way interaction and the averaged partial interactions for the categories of a third variable was derived. For a 2×2×2 table, the difference was . It became particularly clear that the agreement of D1,2 and can only arise when the third variable is independent of the first or second one (because Di,j = 0 means independence between variables i and j).

In many experiments, there is one response variable together with two explanatory variables. The latter ones can be arranged to ensure D2,3 = 0, for example, by applying the treatments to the same fraction of males and females. In this way, the difference is zero and Simpson’s paradox is circumvented.

However, with unplanned experiments or with two or three response or random variables, Simpson’s paradox, which is essentially , is to be expected.
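The role of D1,3 and D2,3 can be made concrete with the law of total covariance for indicator variables, a standard identity: the two-way D1,2 equals the probability-weighted mean of the partial D1,2 within the categories of the third variable plus D1,3 D2,3/[p3(1 − p3)]. The sketch below checks this identity on the Mood counts given earlier. The weighting by the marginal probabilities of the third variable is an assumption and may differ from the definition of the averaged partial interaction used in this study; all variable names are hypothetical.

```python
# Counts n[i][j][k] of Mood's 2x2x2 data (0-based indices for categories 1 and 2)
n = [[[79, 73], [62, 168]], [[177, 81], [121, 75]]]
r = range(2)
N = sum(n[i][j][k] for i in r for j in r for k in r)
p = [[[n[i][j][k] / N for k in r] for j in r] for i in r]

# One-way and two-way probabilities of the first categories
p1 = sum(p[0][j][k] for j in r for k in r)
p2 = sum(p[i][0][k] for i in r for k in r)
p3 = sum(p[i][j][0] for i in r for j in r)
p12 = sum(p[0][0][k] for k in r)
p13 = sum(p[0][j][0] for j in r)
p23 = sum(p[i][0][0] for i in r)

D12 = p12 - p1 * p2
D13 = p13 - p1 * p3
D23 = p23 - p2 * p3


def partial_d12(k):
    """Weight of layer k and the D of variables 1 and 2 within that layer."""
    pk = sum(p[i][j][k] for i in r for j in r)
    cell = p[0][0][k] / pk
    row = sum(p[0][j][k] for j in r) / pk
    col = sum(p[i][0][k] for i in r) / pk
    return pk, cell - row * col


averaged = sum(pk * d for pk, d in (partial_d12(k) for k in r))
difference = D12 - averaged
predicted = D13 * D23 / (p3 * (1 - p3))
print(difference, predicted)
```

The difference vanishes exactly when D1,3 or D2,3 is zero, which matches the remark that a design with D2,3 = 0 circumvents the paradox.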

It is of particular interest that, with knowledge of merely one- and two-way parameters, implications for the three-way structure are possible. This was demonstrated with the Berkeley data and could be shown with other real data reflecting Simpson’s paradox.

With the log-linear model, a variety of important hypotheses can be tested. However, some practically relevant hypotheses have received little attention. One such hypothesis is that the actual degree of interaction between two categorical variables is the same within all levels of another categorical variable. This study derived a model for this hypothesis and applied it to the Berkeley data. It was possible to show that the model of agreeing associations between sex and admittance (measured with Pearson’s correlation coefficient) within the departments does not fit the data. However, with a refinement, it could also be shown that just one department caused the heterogeneity. Within the other departments, there was no significant association between sex and admittance.

There is a need for further investigations into an optimal systematic strategy to find a parsimonious model. However, the new approach presented here is suitable for answering important questions and surely enriches the theory of contingency tables.

Tables with zero counts provoke the question of whether these counts appeared by chance or whether they were a necessary consequence of the given two- or more-way marginal totals. The application of linear programming provided a much simpler and more successful way to obtain fixed zero counts than other methods used so far. Furthermore, fixed nonzero counts can be determined; thus, the number of independent variables (degrees of freedom) could be further reduced.

Improvements were achieved for the simulation of categorical variables with given relationships, and the restrictions of formerly used procedures could be circumvented.

In this study, Good’s [27] investigations on maximum entropy under restraints were repeatedly used. Maximizing entropy under restraints means determining the table or multinomial distribution that is characterized by a minimum of information, the largest disorder, or as-uniform-as-possible cell frequencies under given assumptions. It allows the numerical determination of hypothetical tables by incorporating the equations of the hypotheses. It was recalled that, with appropriate hypotheses, the results of the maximum entropy principle agree with those of the MLEs of the log-linear model. Recent doubts about the validity of hierarchical log-linear models could be eliminated.

The relation between Bartlett’s multiplicative and Bennett’s additive measure of the three-way interaction was investigated. As it turned out, Bennett’s measure is a simplified version of the first-order Taylor expansion of Bartlett’s measure. Since Bartlett’s measure (which is in concordance with the maximum entropy principle and with the log-linear model) has a deeper meaning than Bennett’s measure, it is concluded that Bartlett’s measure is the first choice. When an easy-to-calculate measure is preferred, the full first-order Taylor expansion should be applied instead of Bennett’s measure.

It was shown for contingency tables that the concept of entropy is related to the likelihood principle for the multinomial distribution. In particular, a hypothetical table with maximum entropy under linear restraints (like the given marginal totals) and a table with minimum likelihood under the same restraints are similar but not identical. The tables at the bounds of the admissible region yield local minima of entropy and local maxima of the likelihood function.

It is hoped that practitioners feel encouraged to test not only the classical hypotheses but also those of particular interest and that theoreticians further improve the suggested methods.

Acknowledgments

The author thanks two referees of an earlier version of this manuscript for valuable suggestions.

References

  1. Agresti A. Categorical Data Analysis. 2nd edition. John Wiley & Sons, New York; 2013.
  2. Fienberg SE, Rinaldo A. Three centuries of categorical data analysis: Log-linear models and maximum likelihood estimation. Journal of Statistical Planning and Inference. 2007;137:3430–45.
  3. Bartlett MS. Contingency table interactions. Journal of the Royal Statistical Society (Suppl). 1935;2:248–252.
  4. Simpson EH. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B. 1951;13:238–41.
  5. Blyth CR. On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association. 1972;67:364–6.
  6. Shapiro SH. Collapsing contingency tables–A geometric approach. The American Statistician. 1982;36:43–6.
  7. Wagner CH. Simpson’s paradox in real life. The American Statistician. 1982;36:46–8.
  8. Haunsperger DB, Saari DG. The lack of consistency for statistical decision procedures. The American Statistician. 1991;45:252–5.
  9. Appleton DR, French JM, Vanderpump MPJ. Ignoring a covariate: An example of Simpson’s paradox. The American Statistician. 1996;50:340–1.
  10. Pavlides MG, Perlman MD. How likely is Simpson’s paradox? The American Statistician. 2009;63:226–33.
  11. Alin A. Simpson’s paradox. WIREs Computational Statistics. 2010;2:247–250.
  12. Selvitella A. The ubiquity of the Simpson’s Paradox. Journal of Statistical Distributions and Applications. 2017;4:2. Available from: https://doi.org/10.1186/s40488-017-0056-5.
  13. Wang B, Wu P, Kwan B, Tu XM, Feng C. Simpson’s Paradox: Examples. Shanghai Archives of Psychiatry. 2018 Apr 25;30(2):139–143. pmid:29736137
  14. Rojanaworarit C. Misleading Epidemiological and Statistical Evidence in the Presence of Simpson’s Paradox: An Illustrative Study Using Simulated Scenarios of Observational Study Designs. Journal of Medicine and Life. 2020 Jan-Mar;13(1):37–44. pmid:32341699
  15. Bennett JH. On the theory of random mating. Annals of Eugenics. 1954;18:311–7. pmid:13148997
  16. Slatkin M. On treating the chromosome as the unit of selection. Genetics. 1972;72:157–68. pmid:4672513
  17. Nijenhuis LE, D’Amaro J. Three-locus haplotype interactions in the analysis of linkage disequilibrium. Tissue Antigens. 1985;26:215–26. pmid:3865455
  18. Gorelick R, Laubichler MD. Decomposing multilocus linkage disequilibrium. Genetics. 2004;166:1581–3. pmid:15082571
  19. Nielsen DM, Ehm MG, Zaykin DV, Weir B. Effect of two- and three-locus linkage disequilibrium on the power to detect marker/phenotype associations. Genetics. 2004;168(2):1029–40. pmid:15514073
  20. Kim Y, Feng S, Zeng Z-B. Measuring and partitioning the high-order linkage disequilibrium by multiple order Markov chains. Genetic Epidemiology. 2008;32:301–12. pmid:18330903
  21. Lancaster HO. Complex contingency tables treated by the partition of chi-square. Journal of the Royal Statistical Society, Series B. 1951;13:242–9.
  22. Lancaster HO. The Chi-Squared Distribution. London; 1969.
  23. Hill WG. Non-random association of neutral linked genes in finite populations. In: Population Genetics and Ecology (eds. Karlin S. and Nevo E.), Academic Press Inc., New York, 1976;339–76.
  24. Töwe J, Bock J, Kundt G. Interactions in contingency table analysis. Biometrical Journal. 1985;27:17–24.
  25. Streitberg B. Lancaster interactions revisited. The Annals of Statistics. 1990;18:1878–85.
  26. Streitberg B. Exploring interactions in high-dimensional tables: A bootstrap alternative to log-linear models. The Annals of Statistics. 1999;27:405–13.
  27. Good IJ. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics. 1963;34:911–34.
  28. Shannon CE. The mathematical theory of communication. Bell System Technical Journal. 1948;27:379–423.
  29. Khinchin AI. Mathematical Foundations of Information Theory. Dover Publications, Inc., New York; 1957.
  30. Fienberg SE, Rinaldo A. Maximum likelihood estimation in log-linear models. The Annals of Statistics. 2012;40:996–1023.
  31. Fienberg SE, Rinaldo A. Maximum likelihood estimation in log-linear models. Supplementary material: Algorithms. Technical Report, Carnegie Mellon University. 2012. Available from: http://www.stat.cmu.edu/~arinaldo/Fienberg_Rinaldo_Supplementary_Material.pdf.
  32. Gange SJ. Generating multivariate categorical variates using the iterative proportional fitting algorithm. The American Statistician. 1995;49:134–8.
  33. Demirtas H. A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation. 2007;76:1017–25.
  34. Kaiser S, Träger D, Leisch F. Generating Correlated Ordinal Random Values. Technical Report Number 94, Department of Statistics, University of Munich; 2011.
  35. Lee AJ. Some simple methods for generating correlated categorical variates. Computational Statistics and Data Analysis. 1997;26:133–48.
  36. Ibrahim NA, Suliadi S. Generating correlated discrete ordinal data using R and SAS IML. Computer Methods and Programs in Biomedicine. 2011;104:e122–32. pmid:21764167
  37. Zeidler E. Oxford Users’ Guide to Mathematics. Oxford University Press; 2004.
  38. Wolfram S. The Mathematica Book, 4th edition. Wolfram Media/Cambridge University Press; 1999.
  39. Roy SN, Kastenbaum MA. On the hypothesis of no “interaction” in a multi-way contingency table. Annals of Mathematical Statistics. 1956;27:749–57.
  40. Upton G, Cook I. A Dictionary of Statistics, 3rd edition. Oxford University Press; 2014.
  41. Bickel PJ, Hammel EA, O’Connell JW. Sex bias in graduate admissions: Data from Berkeley. Science. 1975;187:398–404. pmid:17835295
  42. Mood AM. Introduction to the Theory of Statistics. McGraw-Hill; 1950.
  43. Cheng PE, Liou JW, Liou M, Aston JAD. Data information in contingency tables: A fallacy of hierarchical loglinear models. Journal of Data Science. 2006;4:387–98.
  44. Snedecor GW. Chi-squares of Bartlett, Mood, and Lancaster in a 2^3 contingency table. Biometrics. 1958;14:560–2.
  45. Feller W. An introduction to probability theory and its applications. Third edition. John Wiley & Sons, New York; 1970.