A discussion on significance indices for contingency tables under small sample sizes

Hypothesis testing in contingency tables is usually based on asymptotic results, thereby restricting its proper use to large samples. To study these tests in small samples, we consider the likelihood ratio test (LRT) and define an accurate index for the celebrated hypotheses of homogeneity, independence, and Hardy-Weinberg equilibrium. The aim is to understand the use of the asymptotic results of the frequentist Likelihood Ratio Test and the Bayesian FBST (Full Bayesian Significance Test) under small-sample scenarios. The proposed exact LRT p-value is used as a benchmark to understand the other indices. We perform the analysis in different scenarios, considering different sample sizes and different table dimensions. The conditional Fisher's exact test for 2 × 2 tables and Barnard's exact test are also discussed. The main message of this paper is that all indices behave very similarly, except for the Fisher and Barnard tests, whose p-values behave discretely. The most powerful test was the one based on the asymptotic p-value from the likelihood ratio test, suggesting that it is a good alternative for small sample sizes.


Introduction
We discuss indices for homogeneity, independence, and Hardy-Weinberg equilibrium hypotheses [1,2] in contingency tables. We propose an exact evaluation of the Likelihood Ratio Test (LRT) as a benchmark significance index. Based on the work of [3], the idea is to evaluate the probability distribution of all possible tables in the sample space under the null hypothesis. Once the distribution for sampling contingency tables under the hypothesis is known, we are able to compute the exact distribution of the LRT statistic. The main difficulty of this procedure is that it is computationally time-consuming, being feasible only for small sample sizes and/or for tables of small dimension.
The exact LRT p-value is presented as a way to do exact inference. The aim is to compare the behavior of the frequentist LRT asymptotic p-value [4], the exact LRT p-value, the Fisher's exact test p-value [5], the Chi-Square test asymptotic p-value [6,7] and the Barnard's exact test p-value [8][9][10][11]. These frequentist indices are also compared to the e-value from the Full Bayesian Significance Test (FBST) [12,13]. We consider both the asymptotic e-value and an approximation (based on a Markov Chain Monte Carlo procedure) of the exact e-value. The choice of adding a Bayesian index to the comparison study originates from the known asymptotic relationship between the LRT and the FBST [14]. Moreover, the FBST and its e-value can be viewed as a Bayesian counterpart of the p-value, and therefore it is interesting to understand this Bayesian method when compared to frequentist methods. It is important to point out that we are mainly interested in the values of the indices, not in the acceptance or rejection of the hypothesis; that is, our focus is on the significance test, which consists of the evaluation of the p-(e-)values. In an applied setting, researchers can, based on the indices, make their own decisions about their applications. We are not interested in comparing the values of the indices with some fixed significance level (generally 5%) to decide whether the hypothesis should be accepted or rejected. With this goal in mind, all significance indices considered here are in agreement with the ASA's statement on significance indices [15].
From a historical perspective, hypothesis testing has been the most widely used statistical tool in many fields of science [16][17][18]. For categorical data, [19] discusses some exact procedures to perform inference and [20] presents methodological procedures for hypothesis testing for contingency tables. Tests for the homogeneity hypothesis in contingency tables have been compared by [21], who compared the conditional and unconditional tests, and by [22], who compares, from an asymptotic perspective, two tests for equality of two proportions considering Goodman's $Y^2$ and $\chi^2$ statistics. Regarding tests for the independence of two classifiers in contingency tables, [23] presents an algorithm for finding the exact permutation significance level for r × c contingency tables, and [24] studies a simple way to compare two correlated proportions. More recently, [25] presents the exact likelihood ratio test for equality of two normal populations, and [26] discusses exact unconditional tests for the homogeneity hypothesis in 2 × 2 tables.
One important aspect that differentiates the test procedures is how each one deals with the elimination of the nuisance parameter. Basu [27] lists several methods but focuses on marginalization and conditioning. He defines marginalization as every procedure that replaces the observed sample x by the observed value of a suitable statistic T(x) = t. Therefore, instead of working with the original experiment E and data x, one should use the marginal experiment $E_T$ and the recorded value T(x), since the marginal statistical model would depend only on the parameter of interest. To justify these procedures, Basu adds that researchers usually resort to invariance or partial sufficiency arguments.
By conditioning, Basu means methods of elimination that also consist of choosing a suitable statistic, but such that the conditional distribution of the observed sample, x, given the observed value of the statistic depends on the full parameter space only through the parameter of interest. Another commonly used approach that Basu describes is the one he calls maximization. In this case, the nuisance parameter is eliminated from the risk function by some sort of maximization (or minimax) principle, or directly from the likelihood, usually by maximizing it with respect to the nuisance parameters.
A final important strategy mentioned by Basu is the one he called Bayesian solution. In this case, one should derive the full posterior and integrate out the nuisance parameters, obtaining the posterior marginal distribution necessary to perform the required statistical inference. It is important to point out that the FBST does not follow this Bayesian strategy, since its evidence value is computed considering the full posterior. The proposed exact LRT p-value is based on the idea of integrating out the nuisance parameter, which is in some way related to Basu's Bayesian solution [26]. The methods for elimination of nuisance parameters, maximization and Bayesian solution can be considered as unconditional methods.
The Likelihood Ratio Test (LRT) asymptotic p-value [28], the Chi-Square test asymptotic p-value [29], Fisher's homogeneity exact test [29,30], Barnard's exact test [8], and the Full Bayesian Significance Test (FBST) asymptotic and exact e-value [12,13] are presented in detail for the case of 2 × 2 contingency tables considering homogeneity hypothesis (Section 1.1). The theoretical results for homogeneity and independence hypotheses for tables of any dimension and Hardy-Weinberg equilibrium hypothesis are presented in sections 1.2, 1.3 and 1.4.
We study the relationship between indices in Section 2.1. A similar study is performed in [14]; however, it considers continuous random variables, using the e-value and the LRT p-value, and shows that these indices share an asymptotic relationship. In our case, the asymptotic LRT p-value, the exact LRT p-value and the Chi-Square p-value have similar behavior, including in small-sample scenarios. Both Fisher's exact test and Barnard's exact test have p-values with a discrete behavior, which is more evident for the Barnard's exact test p-value. All tests are unconditional, except for Fisher's, which is a conditional test. It is important to draw attention to the fact that the present results are not based on a simulation study; we compute the indices for all possible tables in the sample space.
In addition to our focus on the study of significance indices, we also provide, for the frequentist indices, a study of the power functions to compare the tests considering the homogeneity hypothesis (2 × 2 tables) and the Hardy-Weinberg equilibrium hypothesis (Section 2.2). Fisher's exact test was the least powerful, followed by Barnard's exact test, the Chi-Square test, the exact LRT and the asymptotic LRT, the most powerful one. We did not evaluate the power function for the FBST; firstly, because it is not the aim of the Bayesian paradigm, and secondly, because doing so would require defining a decision rule for the FBST, which is not in the scope of this paper. We also note that, under the null hypothesis, considering the significance level 5%, all frequentist indices achieved 5% rejection, as expected.

Homogeneity test for 2 × 2 contingency tables
Let $X_1$ and $X_2$ be two random variables, representing the rows (1 and 2) of Table 1, with $x_{11}$ and $x_{21}$ being their observed values and $n_{1\cdot}$ and $n_{2\cdot}$ fixed sample sizes. Consider the distributions of $X_1$ as Binomial($n_{1\cdot}$, $\theta_{11}$) and of $X_2$ as Binomial($n_{2\cdot}$, $\theta_{21}$), describing the chances that a subject belongs to category (column) $C_1$ in two distinct populations. Both populations are partitioned into two categories (columns) $C_1$ and $C_2$, and the objective is to test homogeneity between the two unknown population frequencies, H: $\theta_{11} = \theta_{21} = \theta$. This hypothesis is geometrically represented by a diagonal line of the unit square.
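Table 1. 2 × 2 contingency table, with rows $(x_{11}, x_{12}, n_{1\cdot})$ and $(x_{21}, x_{22}, n_{2\cdot})$.
The likelihood function is specified by
$$L(\theta_{11}, \theta_{21} \mid x_{11}, x_{21}) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta_{11}^{x_{11}} (1 - \theta_{11})^{x_{12}} \theta_{21}^{x_{21}} (1 - \theta_{21})^{x_{22}}, \tag{1}$$
where $0 \le \theta_{i1} \le 1$ and $x_{i2} = n_{i\cdot} - x_{i1}$, $i = 1, 2$. Under H, the likelihood function simplifies to
$$L(\theta \mid x_{11}, x_{21}, H) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta^{x_{11} + x_{21}} (1 - \theta)^{x_{12} + x_{22}}, \tag{2}$$
and the LRT test statistic is
$$\lambda(x_{11}, x_{21}) = \frac{\sup_{\theta \in \Theta_H} L(\theta \mid x_{11}, x_{21}, H)}{\sup_{(\theta_{11}, \theta_{21}) \in \Theta} L(\theta_{11}, \theta_{21} \mid x_{11}, x_{21})}, \tag{3}$$
in which $\Theta_H$ is the parametric set defined by the hypothesis.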
• Exact LRT p-value: To define this p-value, we use the predictive distributions of $X_1$ and $X_2$ before any data were observed. The proposed p-value is an alternative way to calculate an exact p-value for the LRT. The goal is to find a distribution for the contingency table under H that is not a function of θ. We consider θ a nuisance parameter in the likelihood function in (2) and integrate over θ in order to eliminate it, as suggested by [27]. The idea is to incorporate the concept of the Bayesian solution for nuisance parameter elimination, but in a frequentist setting, which means using the likelihood function instead of a posterior distribution. That is,
$$h(x_{11}, x_{21}) = \int_0^1 L(\theta \mid x_{11}, x_{21}, H)\, d\theta = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \mathcal{B}(x_{11} + x_{21} + 1,\; x_{12} + x_{22} + 1), \tag{4}$$
where $\mathcal{B}(\cdot, \cdot)$ denotes the Beta function. To obtain the probability function $\Pr(X_1 = x_{11}, X_2 = x_{21} \mid H)$, one needs to find a normalization constant:
$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid H) = \frac{h(x_{11}, x_{21})}{\sum_{i=0}^{n_{1\cdot}} \sum_{j=0}^{n_{2\cdot}} h(i, j)}. \tag{5}$$
Note that to calculate (5), we evaluate $h(\cdot, \cdot)$ for all possible tables. In the case of a homogeneity hypothesis for 2 × 2 contingency tables, $\sum_{i} \sum_{j} h(i, j) = 1$. We present the table's probability in terms of this sum to obtain a general formula for all hypotheses and table dimensions considered here, since in other scenarios this quantity does not sum up to 1 (for example, the sum of h for all possible 2 × 2 tables considering independence hypothesis with n = 2 is 2304). The exact p-value calculation follows directly from the test statistic distribution:
$$p = \sum_{(i, j) \in R} \Pr(X_1 = i, X_2 = j \mid H),$$
in which R is the set of all pairs (i, j) such that $\lambda(i, j) \le \lambda(x_{11}, x_{21})$, and $\lambda(x_{11}, x_{21})$ is the observed test statistic, as in (3).
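The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`log_lambda`, `h`, `exact_lrt_pvalue`) and the tie tolerance `tol` are hypothetical, and h is taken to be the likelihood under H integrated over θ on (0, 1), which for the Binomial model reduces to a Beta function.

```python
from math import comb, exp, lgamma, log


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_lambda(x11, x21, n1, n2):
    """Log of the LRT statistic; the binomial coefficients cancel in the ratio."""
    s, n = x11 + x21, n1 + n2
    log_num = xlogy(s, s / n) + xlogy(n - s, 1 - s / n)
    log_den = (xlogy(x11, x11 / n1) + xlogy(n1 - x11, 1 - x11 / n1)
               + xlogy(x21, x21 / n2) + xlogy(n2 - x21, 1 - x21 / n2))
    return log_num - log_den


def h(i, j, n1, n2):
    """Likelihood under H integrated over theta in (0, 1)."""
    return comb(n1, i) * comb(n2, j) * exp(log_beta(i + j + 1, n1 + n2 - i - j + 1))


def exact_lrt_pvalue(x11, x21, n1, n2, tol=1e-9):
    """Sum the table probabilities over all tables at least as extreme as the observed one."""
    tables = [(i, j) for i in range(n1 + 1) for j in range(n2 + 1)]
    total = sum(h(i, j, n1, n2) for i, j in tables)
    observed = log_lambda(x11, x21, n1, n2)
    tail = sum(h(i, j, n1, n2) for i, j in tables
               if log_lambda(i, j, n1, n2) <= observed + tol)
    return tail / total
```

The small tolerance only guards against floating-point ties between tables that share the same value of λ; the cost grows with the number of tables, which is why the approach is restricted to small samples.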

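• Barnard's Exact Test:
Consider that $n_{1\cdot}$ and $n_{2\cdot}$ are fixed in Table 1. The random variables $X_1$ and $X_2$ are independent and Binomial distributed with parameters $\theta_{11}$ and $\theta_{21}$. The probability that a sample $\{x_{11}, x_{21}\}$ is drawn is
$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid \theta_{11}, \theta_{21}) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta_{11}^{x_{11}} (1 - \theta_{11})^{n_{1\cdot} - x_{11}} \theta_{21}^{x_{21}} (1 - \theta_{21})^{n_{2\cdot} - x_{21}}$$
and, under hypothesis H,
$$\Pr(X_1 = x_{11}, X_2 = x_{21} \mid \theta, H) = \binom{n_{1\cdot}}{x_{11}} \binom{n_{2\cdot}}{x_{21}} \theta^{x_{11} + x_{21}} (1 - \theta)^{n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21}}.$$
We define the critical region as $R = \{\lambda(X_1, X_2) \le \lambda(x_{11}, x_{21})\}$; the Barnard's exact p-value is then obtained by
$$p = \sup_{\theta \in (0, 1)} \sum_{(i, j) \in R} \Pr(X_1 = i, X_2 = j \mid \theta, H).$$
That is, Barnard's exact test considers the p-values for all possible points of the parameter space under H and takes the maximum p-value. In this test, the chosen approach for nuisance parameter elimination among the ones presented by Basu is maximization.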
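As a rough illustration of the maximization step, the sketch below evaluates the tail probability on a grid of θ values and keeps the largest one. The grid search is an assumption made here for simplicity (it only approximates the supremum), and the function names and the `grid` and `tol` parameters are hypothetical rather than taken from the paper.

```python
from math import comb, log


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_lambda(x11, x21, n1, n2):
    """Log of the LRT statistic for the 2 x 2 homogeneity hypothesis."""
    s, n = x11 + x21, n1 + n2
    log_num = xlogy(s, s / n) + xlogy(n - s, 1 - s / n)
    log_den = (xlogy(x11, x11 / n1) + xlogy(n1 - x11, 1 - x11 / n1)
               + xlogy(x21, x21 / n2) + xlogy(n2 - x21, 1 - x21 / n2))
    return log_num - log_den


def barnard_pvalue(x11, x21, n1, n2, grid=1000, tol=1e-9):
    """Approximate sup over theta of Pr(lambda(X) <= lambda(x) | theta, H) on a grid."""
    observed = log_lambda(x11, x21, n1, n2)
    # Critical region: tables whose LRT statistic is at most the observed one.
    region = [(i, j) for i in range(n1 + 1) for j in range(n2 + 1)
              if log_lambda(i, j, n1, n2) <= observed + tol]
    best = 0.0
    for k in range(1, grid):
        theta = k / grid
        prob = sum(comb(n1, i) * comb(n2, j)
                   * theta ** (i + j) * (1 - theta) ** (n1 + n2 - i - j)
                   for i, j in region)
        best = max(best, prob)
    return best
```

A finer grid, or a numerical optimizer started from the best grid point, would tighten the approximation at additional cost.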
• Full Bayesian Significance Test: The Bayesian approach considered is based on the FBST (Full Bayesian Significance Test) [12,13].
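Definition 1 Let $\pi(\theta \mid x)$ be the posterior density function of θ given the observed sample and $T(x) = \{\theta \in \Theta : \pi(\theta \mid x) \ge \sup_{\theta \in \Theta_H} \pi(\theta \mid x)\}$. The supporting evidence measure for the hypothesis $\theta \in \Theta_H$ is defined as $ev(\Theta_H; x) = 1 - \Pr(\theta \in T(x) \mid x)$.
Consider that, a priori, $\theta_{11}$ and $\theta_{21}$ are independent and both follow a Uniform(0, 1) distribution. Uniform priors are chosen to avoid a subjective prior and to allow a fair comparison with the frequentist indices. Recall that $X_1$ and $X_2$ given $\theta_{11}$ and $\theta_{21}$ are Binomial distributed. Hence, the posterior distributions for $\theta_{11}$ and $\theta_{21}$ are independent Beta($x_{11} + 1$, $n_{1\cdot} - x_{11} + 1$) and Beta($x_{21} + 1$, $n_{2\cdot} - x_{21} + 1$). Under the hypothesis H, the posterior distribution is
$$\pi(\theta \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}, H) = \frac{\theta^{x_{11} + x_{21}} (1 - \theta)^{n_{1\cdot} + n_{2\cdot} - x_{11} - x_{21}}}{\mathcal{B}(x_{11} + 1, n_{1\cdot} - x_{11} + 1)\, \mathcal{B}(x_{21} + 1, n_{2\cdot} - x_{21} + 1)},$$
and by maximizing it in θ we obtain $\sup_{\theta \in (0, 1)} \pi(\theta \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}, H)$, attained at $\hat{\theta} = (x_{11} + x_{21}) / (n_{1\cdot} + n_{2\cdot})$, where $\mathcal{B}(\cdot, \cdot)$ is the Beta function. Since $x_{11}$, $x_{21}$, $n_{1\cdot}$ and $n_{2\cdot}$ are integers, the hypothesis' tangent set, T, is
$$T = \{(\theta_{11}, \theta_{21}) \in (0, 1)^2 : \pi(\theta_{11}, \theta_{21} \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}) \ge \sup_{\theta \in (0, 1)} \pi(\theta \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}, H)\}.$$
To calculate the approximate e-value, we use the following algorithm:
1. A random sample of size k is generated from the posterior distribution of $(\theta_{11}, \theta_{21})$, obtaining $\{\theta_{11}^{(1)}, \theta_{21}^{(1)}\}, \ldots, \{\theta_{11}^{(k)}, \theta_{21}^{(k)}\}$.
2. The e-value is calculated by
$$ev \approx 1 - \frac{1}{k} \sum_{i=1}^{k} I\left[\pi(\theta_{11}^{(i)}, \theta_{21}^{(i)} \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}) \ge \sup_{\theta \in (0, 1)} \pi(\theta \mid x_{11}, x_{21}, n_{1\cdot}, n_{2\cdot}, H)\right].$$
• Other indices: For the LRT, the statistic $-2 \ln[\lambda(X_1, X_2)]$ has asymptotically a chi-square distribution with 1 degree of freedom, which is $\dim(\Theta) - \dim(\Theta_H)$ [28]. The FBST uses the same statistic; however, its asymptotic distribution is a chi-square with 2 degrees of freedom [13], which is $\dim(\Theta)$. For the chi-square test and Fisher's exact test for homogeneity, see [29].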
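The two-step algorithm and the asymptotic indices for the 2 × 2 case can be sketched as follows. This is an illustrative sketch only: it draws independent samples from the two Beta posteriors with Python's `random.betavariate` (plain Monte Carlo, not necessarily the sampler used by the authors), the names `approx_e_value` and `asymptotic_indices` are hypothetical, and the closed forms used for the chi-square tail probabilities hold only for 1 and 2 degrees of freedom.

```python
import random
from math import erfc, exp, lgamma, log, sqrt


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_posterior(t1, t2, x11, x21, n1, n2):
    """Log joint posterior density: independent Beta posteriors under uniform priors."""
    return (xlogy(x11, t1) + xlogy(n1 - x11, 1 - t1) - log_beta(x11 + 1, n1 - x11 + 1)
            + xlogy(x21, t2) + xlogy(n2 - x21, 1 - t2) - log_beta(x21 + 1, n2 - x21 + 1))


def approx_e_value(x11, x21, n1, n2, k=10_000, seed=0):
    """Monte Carlo estimate of 1 - Pr(theta in tangent set | data)."""
    rng = random.Random(seed)
    t_hat = (x11 + x21) / (n1 + n2)  # maximizer of the posterior along H
    log_sup = log_posterior(t_hat, t_hat, x11, x21, n1, n2)
    eps = 1e-12  # keeps sampled values strictly inside (0, 1)
    in_tangent = 0
    for _ in range(k):
        t1 = min(max(rng.betavariate(x11 + 1, n1 - x11 + 1), eps), 1 - eps)
        t2 = min(max(rng.betavariate(x21 + 1, n2 - x21 + 1), eps), 1 - eps)
        if log_posterior(t1, t2, x11, x21, n1, n2) >= log_sup:
            in_tangent += 1
    return 1.0 - in_tangent / k


def lrt_statistic(x11, x21, n1, n2):
    """-2 ln(lambda) for the 2 x 2 homogeneity hypothesis."""
    s, n = x11 + x21, n1 + n2
    log_num = xlogy(s, s / n) + xlogy(n - s, 1 - s / n)
    log_den = (xlogy(x11, x11 / n1) + xlogy(n1 - x11, 1 - x11 / n1)
               + xlogy(x21, x21 / n2) + xlogy(n2 - x21, 1 - x21 / n2))
    return max(0.0, -2.0 * (log_num - log_den))


def asymptotic_indices(x11, x21, n1, n2):
    """Asymptotic LRT p-value (1 df) and asymptotic e-value (2 df)."""
    stat = lrt_statistic(x11, x21, n1, n2)
    p_value = erfc(sqrt(stat / 2.0))  # chi-square tail probability, 1 df
    e_value = exp(-stat / 2.0)        # chi-square tail probability, 2 df
    return p_value, e_value
```

Because the same statistic is referred to a chi-square with more degrees of freedom, the asymptotic e-value is never smaller than the asymptotic LRT p-value for the same table.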

Homogeneity hypothesis for ℓ × c contingency tables
Let $X_i$, $i = 1, \ldots, \ell$, be random variables that are represented by the rows of Table 2, where $n_{1\cdot}, n_{2\cdot}, \ldots, n_{\ell\cdot}$ are known constants.
In this setting, and we can obtain the e-value from Definition 1.

• Other indices:
Both the asymptotic LRT p-value and the asymptotic e-value are calculated as $\Pr[-2 \ln(\lambda(X)) \ge -2 \ln(\lambda(x))]$, but while the LRT considers that this statistic follows a $\chi^2$ distribution with $(\ell - 1)(c - 1)$ degrees of freedom, the FBST considers that it follows a $\chi^2$ distribution with $\ell(c - 1)$ degrees of freedom. The Chi-Square homogeneity test is also obtained.

Hardy-Weinberg equilibrium
An individual's genotype is formed by a combination of alleles. If there are two possible alleles for one characteristic (say A and a), the possible genotypes are AA, Aa or aa. Considering a few premises true [31], the principle says that the allele probability in a population does not change from generation to generation. It is a fundamental principle for the Mendelian mating allelic model. If the probabilities of alleles are θ and 1 − θ, the expected genotype probabilities are $(\theta^2, 2\theta(1 - \theta), (1 - \theta)^2)$, $0 \le \theta \le 1$.
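Considering the Hardy-Weinberg equilibrium, the aim is to verify whether a population follows these genotype proportions. Therefore, the equilibrium hypothesis is
$$H: (\theta_1, \theta_2, \theta_3) = (\theta^2, 2\theta(1 - \theta), (1 - \theta)^2), \quad 0 \le \theta \le 1,$$
in which $\theta_1$, $\theta_2$, $\theta_3$ are the proportions of AA, Aa, and aa, respectively. This hypothesis is geometrically represented by a curve in the parameter simplex.
Let X be a random vector. Table 3 represents the genotype frequencies for the population in question. Considering n known, we assume that X follows a Trinomial(n, $\theta_1$, $\theta_2$, $\theta_3$) distribution. The likelihood function for this model is
$$L(\theta_1, \theta_2, \theta_3 \mid x) = \frac{n!}{x_1!\, x_2!\, x_3!}\, \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3},$$
in which $x = \{x_1, x_2, x_3\}$, $\theta_1 + \theta_2 + \theta_3 = 1$ and $\theta_i > 0$, $i = 1, 2, 3$. Under the hypothesis H,
$$L(\theta \mid x, H) = \frac{n!}{x_1!\, x_2!\, x_3!}\, 2^{x_2}\, \theta^{2x_1 + x_2} (1 - \theta)^{x_2 + 2x_3}.$$
The maximum likelihood estimator for θ under H is $\hat{\theta} = (2x_1 + x_2)/(2n)$ and the LRT λ statistic is
$$\lambda(x) = \frac{2^{x_2}\, \hat{\theta}^{2x_1 + x_2} (1 - \hat{\theta})^{x_2 + 2x_3}}{(x_1/n)^{x_1} (x_2/n)^{x_2} (x_3/n)^{x_3}}.$$
• Exact LRT p-value: Calculations follow as for the other indices, and in this scenario
$$h(x) = \int_0^1 L(\theta \mid x, H)\, d\theta = \frac{n!}{x_1!\, x_2!\, x_3!}\, 2^{x_2}\, \mathcal{B}(2x_1 + x_2 + 1,\; x_2 + 2x_3 + 1).$$
• Barnard's Exact Test: The critical region is $R = \{\lambda(X) \le \lambda(x)\}$, and the Barnard's exact p-value is obtained by
$$p = \sup_{\theta \in (0, 1)} \sum_{y \in R} \Pr(X = y \mid \theta, H).$$
• FBST: Assuming a Dirichlet(1, 1, 1) prior for θ and that X follows a Trinomial(n, $\theta_1$, $\theta_2$, $\theta_3$) distribution, the posterior distribution is $\theta \mid x \sim$ Dirichlet($x_1 + 1$, $x_2 + 1$, $x_3 + 1$). In this setting, the e-value is obtained from Definition 1.
• Other indices: Both the asymptotic LRT p-value and the asymptotic e-value are obtained, the p-value considering that $-2 \ln(\lambda(X))$ follows a $\chi^2$ distribution with 1 degree of freedom and the FBST considering that it follows a $\chi^2$ distribution with 2 degrees of freedom.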
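For the Hardy-Weinberg case, the exact LRT p-value can be sketched by enumerating all genotype tables with a fixed total n. The code below is a hedged illustration: the function names are hypothetical, the weight of each table is assumed to be the likelihood under H integrated over θ on (0, 1), and a small tolerance is used to treat numerically tied statistics as equal.

```python
from math import exp, lgamma, log


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_lambda_hw(x1, x2, x3):
    """Log of the LRT statistic for Hardy-Weinberg equilibrium."""
    n = x1 + x2 + x3
    theta_hat = (2 * x1 + x2) / (2 * n)  # MLE of the allele probability under H
    log_num = (x2 * log(2.0) + xlogy(2 * x1 + x2, theta_hat)
               + xlogy(x2 + 2 * x3, 1 - theta_hat))
    log_den = xlogy(x1, x1 / n) + xlogy(x2, x2 / n) + xlogy(x3, x3 / n)
    return log_num - log_den


def log_h_hw(x1, x2, x3):
    """Log of the trinomial likelihood under H integrated over theta in (0, 1)."""
    n = x1 + x2 + x3
    log_coef = lgamma(n + 1) - lgamma(x1 + 1) - lgamma(x2 + 1) - lgamma(x3 + 1)
    return log_coef + x2 * log(2.0) + log_beta(2 * x1 + x2 + 1, x2 + 2 * x3 + 1)


def exact_lrt_pvalue_hw(x1, x2, x3, tol=1e-9):
    """Exact LRT p-value obtained by enumerating all tables with the same n."""
    n = x1 + x2 + x3
    tables = [(a, b, n - a - b) for a in range(n + 1) for b in range(n - a + 1)]
    weights = [exp(log_h_hw(*t)) for t in tables]
    observed = log_lambda_hw(x1, x2, x3)
    tail = sum(w for t, w in zip(tables, weights)
               if log_lambda_hw(*t) <= observed + tol)
    return tail / sum(weights)
```

For example, the table (5, 10, 5) matches the equilibrium proportions exactly when n = 20, so its statistic is the largest possible and the p-value equals 1.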

Relations between the indices
In many practical situations, mainly in biological studies, asymptotic distributions are used to evaluate indices even for small samples. With that in mind, one of our interests is to understand how the use of asymptotic results for small sample size settings compares to the use of an exact index. Surprisingly, the values of the exact and asymptotic indices do not diverge considerably.
As our objective is to compare the indices, we consider different scenarios for each hypothesis. For each scenario, we evaluate the significance indices of all test procedures presented here. Note that this is not a simulation study; for each sample size, we evaluate the indices for all possible contingency tables of a fixed dimension and size. For example, considering the homogeneity hypothesis in a 2 × 2 table with marginals (10, 10), there are 121 possible tables, and considering the independence hypothesis in a 2 × 3 table with marginal 15, there are 15504 possible tables. We evaluated the indices for all tables that fit into each specification. For the e-value computation, non-informative priors for the parameters are considered (that is, $\pi(\theta) \propto 1$). This way, no extra information is added besides the data, allowing fair comparisons between frequentist and Bayesian indices.
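The table counts quoted above can be checked with a short combinatorial sketch. The helper names are hypothetical; the counting assumes fixed row totals for the homogeneity hypothesis and only the grand total fixed for the independence hypothesis, in which case the number of tables is a stars-and-bars count.

```python
from math import comb, prod


def n_tables_homogeneity(row_totals, n_cols):
    """Number of tables when each row total is fixed (one composition per row)."""
    return prod(comb(n + n_cols - 1, n_cols - 1) for n in row_totals)


def n_tables_independence(n, n_rows, n_cols):
    """Number of tables when only the grand total n is fixed."""
    cells = n_rows * n_cols
    return comb(n + cells - 1, cells - 1)


print(n_tables_homogeneity((10, 10), 2))  # 2 x 2 table with marginals (10, 10)
print(n_tables_independence(15, 2, 3))    # 2 x 3 table with grand total 15
```

These counts grow quickly with the sample size and the table dimension, which is the practical limit on full enumeration.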
For each scenario, plots are drawn to illustrate differences between the indices' values. The indices studied are the exact LRT p-value, the asymptotic p-value for the LRT, the asymptotic p-value for the chi-square test, the e-value and the asymptotic e-value. For the homogeneity hypothesis in 2 × 2 tables, the Fisher and Barnard exact tests were also considered, and for the Hardy-Weinberg equilibrium hypothesis the Barnard's exact test was also obtained. We considered many different scenarios; however, since the aim is to understand the indices in small samples, only the scenarios listed in Table 4 are presented here. Figs 3, 4 and 5 illustrate the results of the discussion above. For all hypotheses, exact and asymptotic e-values are very similar for both large and small sample sizes. Looking into the frequentist indices, exact LRT p-values and asymptotic p-values, both LRT and Chi-Square, are also very similar to each other. The difference found between e-values and the asymptotic LRT p-value results from the way these indices are formulated: while e-values consider the full dimension of the parameter space, p-values consider the complementary dimension of the set corresponding to hypothesis H. This is expected from the asymptotic relationship between the e-value and the p-value from the LRT [13,14]. Since the exact LRT p-value is directly related to the asymptotic LRT p-value, we observe the same behavior of the differences between e-values and the asymptotic LRT p-value. Fisher

Power function
Power functions are a useful tool to compare hypothesis tests. For all $\theta \in \Theta$, the power function provides the probability of rejecting the hypothesis for a given θ. In fact, we look for a test that does not reject the hypothesis for $\theta \in \Theta_H$ and whose probability of rejection increases the further θ is from the hypothesis. The power functions presented are the ones that we are able to represent in $\mathbb{R}^3$, which are the power functions for the homogeneity hypothesis in 2 × 2 contingency tables and for the Hardy-Weinberg equilibrium hypothesis.
We used p-values less than 0.05 as the decision rule to reject the hypothesis. This choice follows the decision rule most widely used across fields of science. In this case, Power($\theta_1$, $\theta_2$) = P(reject H | ($\theta_1$, $\theta_2$)), with H rejected if index < 0.05.
We obtain the power function for all tests but the FBST. The FBST is a Bayesian significance test and in order to obtain a power function, one would need a decision rule. Since its construction differs from that of the p-values, we cannot use the same decision rule, and constructing a decision rule is not in the scope of this paper.
We used a Monte Carlo procedure to evaluate the power function of these tests. We consider a grid for the unit square with 100 × 100 points on the axes (θ 1 , θ 2 ). For each point in the grid we generated 1000 tables. From these 1000 tables we evaluate the proportion of rejections, which is an approximation of the power function.
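A minimal sketch of this Monte Carlo procedure at a single grid point is given below, using the asymptotic LRT p-value for the 2 × 2 homogeneity hypothesis as the index. The function names, the default seed and the Bernoulli-sum sampler are assumptions made for illustration; looping `power` over a grid of (θ1, θ2) values would produce the full surface.

```python
import random
from math import erfc, log, sqrt


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def asymptotic_lrt_pvalue(x11, x21, n1, n2):
    """Asymptotic LRT p-value: chi-square tail with 1 degree of freedom."""
    s, n = x11 + x21, n1 + n2
    log_num = xlogy(s, s / n) + xlogy(n - s, 1 - s / n)
    log_den = (xlogy(x11, x11 / n1) + xlogy(n1 - x11, 1 - x11 / n1)
               + xlogy(x21, x21 / n2) + xlogy(n2 - x21, 1 - x21 / n2))
    stat = max(0.0, -2.0 * (log_num - log_den))
    return erfc(sqrt(stat / 2.0))


def power(theta1, theta2, n1, n2, n_tables=1000, alpha=0.05, seed=0):
    """Proportion of simulated tables for which the hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tables):
        # Binomial draws built from independent Bernoulli trials.
        x11 = sum(rng.random() < theta1 for _ in range(n1))
        x21 = sum(rng.random() < theta2 for _ in range(n2))
        if asymptotic_lrt_pvalue(x11, x21, n1, n2) < alpha:
            rejections += 1
    return rejections / n_tables
```

With 1000 tables per point, the Monte Carlo standard error of each estimated power is at most about 0.016, so small differences between tests at a single grid point should be read with that in mind.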
We plot pairs of power functions to illustrate and compare their shapes. For the homogeneity hypothesis in a table with marginals (10, 10), Fig 6 shows that Fisher's exact test is less powerful than Barnard's exact test, that Barnard's exact test has power similar to that of the Chi-square test, and that the Chi-square test is less powerful than the proposed exact LRT p-value, which in turn is less powerful than the asymptotic p-value for the LRT. To have a clear picture, we plot the power functions from different tests against each other. Fig 7a consists of the power functions for tables with marginals equal to (10, 10). It shows that the use of the asymptotic p-value for the LRT results in a more powerful test than the other indices. When comparing the proposed exact p-value to the other indices, it is more powerful than the Chi-square test and Fisher's exact test. Between the Chi-square test and Fisher's exact test, the Chi-square test is more powerful. For tables with marginals equal to (100, 100), the graphs are more concentrated near the identity line (Fig 7b), showing that all indices are more alike. The ordering still exists, but it is less pronounced. It is interesting to point out that, as expected, the Chi-square test works better with larger samples.
For the Hardy-Weinberg hypothesis, the results are similar to the ones obtained for the homogeneity hypothesis and are shown in Figs 8 and 9. In this case, the most powerful test was the one based on the asymptotic p-value for the LRT, followed by the exact p-value for the LRT, which is more powerful than the Chi-square test, which in turn is similar to Barnard's exact test. We call attention to the fact that, under hypothesis H, the power function achieves the value of 0.05, as expected, since this is the significance level chosen to build the power functions.

Conclusion
After evaluating the indices for tables in different scenarios, we noticed that all of them had very similar behaviors, independently of the perspective (Bayesian or frequentist), sample size and table dimension. The exceptions are the p-values for Fisher's and Barnard's exact tests for the homogeneity hypothesis in 2 × 2 tables, and for Barnard's exact test for Hardy-Weinberg equilibrium, which show a discretized behavior. Studying the power functions considering the homogeneity hypothesis in 2 × 2 tables and the Hardy-Weinberg equilibrium hypothesis, the LRT presented itself as a powerful test for small sample sizes, while Fisher's exact test was the least powerful one for the homogeneity hypothesis and Barnard's exact test was the least powerful for the Hardy-Weinberg equilibrium hypothesis. As sample sizes increase, the power of these tests increases accordingly.
Finally, we list our main conclusions:
• The LRT asymptotic p-value seems to be a good frequentist alternative for small sample sizes.
• Since there is an asymptotic relationship between the p-value for the LRT and the e-value (FBST), we consider that both indices are equivalent in the explored settings.
• In cases where there is information available besides the data that is to be taken into account, represented by informative priors, we consider the e-value a more appropriate index than a frequentist one, since the e-value offers a mechanism to incorporate that information.