Fisher’s exact approach for post hoc analysis of a chi-squared test

This research is motivated by one of our survey studies to assess the potential influence of introducing zebra mussels to the Lake Mead National Recreation Area, Nevada. One research question in this study is to investigate the association between the boating activity type and the awareness of zebra mussels. A chi-squared test is often used for testing independence between two factors with nominal levels. When the null hypothesis of independence between two factors is rejected, we are often left wondering where does the significance come from. Cell residuals, including standardized residuals and adjusted residuals, are traditionally used in testing for cell significance, which is often known as a post hoc test after a statistically significant chi-squared test. In practice, the limiting distributions of these residuals are utilized for statistical inference. However, they may lead to different conclusions based on the calculated p-values, and their p-values could be over- o6r under-estimated due to the unsatisfactory performance of asymptotic approaches with regards to type I error control. In this article, we propose new exact p-values by using Fisher’s approach based on three commonly used test statistics to order the sample space. We theoretically prove that the proposed new exact p-values based on these test statistics are the same. Based on our extensive simulation studies, we show that the existing asymptotic approach based on adjusted residual is often more likely to reject the null hypothesis as compared to the exact approach due to the inflated family-wise error rates as observed. We would recommend the proposed exact p-value for use in practice as a valuable post hoc analysis technique for chi-squared analysis.


Background
This research is motivated by one survey study conducted by Gerstenberger et al. [1] to assess potential influence of introducing zebra mussels to the Lake Mead National Recreation Area (LMNRA), Nevada, USA. Zebra mussels are relative small (finger-nail-sized for adult zebra mussels). Their extremely high reproductive rates raise the concern that they could clog water intakes in the LMNRA as it is the main water resource for the city [2]. They can be easily moved from an affected lake to an unaffected one by attaching to boats, nets, docks, and so on. In this study, surveys approved by United States Fish and Wildlife Service were used a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 to collect data on six different sites in the LMNRA between 2002 and 2003 [1]. All the 274 participants were asked in person about their boating activity types (Pleasure, Angler, Jet Ski, and Other) and their awareness of zebra mussels (Yes/No), see Table 1 for data from this study. The chi-squared test was used to test independence between boater activity type and awareness of zebra mussels, and a very small p-value indicated a strong association between the two factors.
Researchers are often interested in identifying significant cells/relationships after a statistically significant chi-squared test [3,4]. Two test statistics are commonly used to test the significance for each cell. The first test is standardized residual that is calculated as raw residual divided by the squared root of the expected value, where raw residual is defined as the difference between the observed value and the expected value. The second test is adjusted residual: raw residual divided by its standard error. Both tests follow the standard normal distribution asymptotically. These two tests have different conclusions for testing the cells of data in Table 1. In addition to that, statistical inference of these two tests relies on how close the limiting distribution is to the true distribution. For a cell with a relatively small value, asymptotic approaches are often not reliable. Recently, Sharpe [5] reviewed several approaches to conduct a post hoc test after a statistically significant chi-squared test: residual comparison, ransacking, and partitioning. The goal of a post hoc test is to find the source of overall significance.
To overcome the unsatisfactory performance from the existing asymptotic approaches for testing each individual cell in a contingency table after a significant chi-squared test, we propose using Fisher's approach to compute exact p-value by enumerating all possible tables with the same marginal row and column totals as the observed data. The two aforementioned test statistics can be used to order the sample space, and so does raw residual. It could be very computationally intensive to enumerate all possible tables due to the exponentially increased size of the searching sample space, even with the utilization of efficient numerical search algorithms [6]. For this particular problem, we find that the complete sample space can be reduced to a set of 2 × 2 tables instead of all possible R × C tables to test the significance of each cell. In addition, we theoretically show that the exact p-values based on the three test statistics are the same, thus they have the same conclusion.
The rest of the article is organized as follows. In Section 2, we review the commonly used approaches to test the significance of each cell after a statistically significant chi-squared test, and propose the exact p-value calculation by using Fisher's approach. We theoretically prove the relationship between exact p-values based on different test statistics considered in this article. In Section 3, we illustrate the application of the proposed exact p-value by using two real examples including the motivation example from our survey study. We then conduct extensive Monte Carlo simulation studies to compare the performance between the proposed exact approach and the existing asymptotic approaches. Finally, we conclude our research with some remarks in Section 4.

Methods
In the case that the overall chi-squared test is significant, the next step is to perform a post hoc test to find out which cells from the contingency

Residuals
Raw residual is computed as the difference between the observed value and the expected value, which is where e ij is the expected value of the ij-th cell under the independence hypothesis. It has been pointed that the T RawR is insufficient for hypothesis testing since the T RawR value tends to be large when the value in that cell is large [5]. For this reason, the following two test statistics were traditionally used for testing independence in the ij-th cell. Standardized residual is the component from the chi-squared test, which is and adjusted residual uses the standard error of x ij − e ij in the test statistic [7,8] where m i , n j , and N are the row marginal total, the column marginal total, and the total sample size, respectively. Both T StdR and T AdjR follow the standard normal asymptotically [9]. Both T StdR and T AdjR can be used for testing the independence hypothesis for each cell by comparing the calculated test statistics to the critical value from the standard normal distribution. It should be noted that they could reach a different conclusion based on their asymptotic p-values. It is easy to find out that the p-value based on T AdjR is always less than that based on T StdR , because |T AdjR | is always larger than |T StdR | for an observed data. For this reason, T AdjR is often recommended for use in practice as compared to T StdR as the latter test could be too conservative [10].

Exact post-hoc p-value
The accuracy of the limiting distribution for p-value calculation relies on multiple factors: marginal row and column totals, and whether the observed value in that cell is too small. In addition to that, the type I error control by using the limiting distribution is often unsatisfactory [11][12][13][14][15][16][17]. To overcome these limitations from using asymptotic approaches for statistical inference, we propose using Fisher's exact approach in testing the independence. All the possible data with the same marginal row and column totals as the observed data are enumerated and used in the p-value calculation, and the rejection region is determined by using any of the three test statistics: T RawR , T StdR and T AdjR . Suppose that the marginal row and column totals are m 1 , m 2 , Á Á Á, m R , and n 1 , n 2 , Á Á Á, n C in a R × C contingency table. The probability of observing a data with values X = {x ij , i = 1, Á Á Á, R, and j = 1, Á Á Á, C} is computed as which is often known as the hypergeometric probability. Let T be the test statistic to order the sample space, and X Ã be the observed data. Then, the exact p-value based on Fisher's approach is calculated as X |} is the rejection region, and P(X) is the probability of data X as given in Eq (1). It is very computational to calculate exact p-values without using network search algorithms to find the rejection region effectively. The network algorithm developed by Mehta and Patel [6] has been utilized by many statistical software in computing exact Fisher's p-value for categorical data that can be organized in a contingency table. Obviously, this algorithm provides a much faster method to find the rejection region than a direct and naive full enumeration which could quickly become impossible as the table size and the total sample size increase. For this particular problem, we can simplify the exact p-value because two data sets having the same n ij would have the same test statistic. In other words, if a data is in the rejection region, then a set of data that have the same n ij as that data, should also be in the rejection region. For this reason, the sample space in exact p-value calculation is the collection of data as in Table 2.
This new sample size is a collection of data , and the probability of data Y is calculated as For a 2 by 2 table as in Table 2, it is much easier to enumerate all possible data without the involvement of efficient network search algorithms. Suppose Y Ã is the observed data. The new exact p-value based on Fisher's exact approach is computed as where I(a) is an indicator function with I(a) = 1 when a is true, and zero otherwise. Theorem 2.1 Exact p-value calculations based on the three test statistics are the same. Proof. The proposed exact p-value by using Fisher's approach depends on the test statistic T to order the sample space. The rejection region is defined as In the new exact p-value calculation, the row and column marginal totals in Table 2 are considered as fixed. It follows that e ij and e ij (1 − m i /N)(1 − n j /N) in the denominate of T StdR and T AdjR Table 2. Reorganized data for testing the independence from the ij-th cell.

C j
Other columns combined Total By the definition of exact p-value in Eq (2), exact p-values based on these three test statistics are the same for a given data.
We have shown that the three test statistics lead to the same exact p-value from this theorem. They agree with each other for testing individual independence in each cell. For simplicity, we use T AdjR for sample space ordering to compute exact p-value by using Fisher's approach.
The classic approach to adjust the significance level for multiple comparisons is the Bonferroni method, which is α/W, where W is the number of comparisons. This correction method is widely used for a problem with independent multiple comparisons. However, in the considered problem for all cells in a contingency table, they are correlated, where the Holm-Bonferroni method can be used. In this method, all W p-values are sorted from the smallest to the largest, and the k-th smallest p-value is compared with α/(W + 1 − k). This method is uniformly more powerful than the traditionally used Bonferroni method. Later, Simes proposed an improved method for multiple comparisons with the adjusted significance level of αk/W for the k-th smallest p-value [18]. The method by Simes is often more powerful than the two aforementioned methods for multiple comparisons. For this reason, we use the method by Simes for both the asymptotic approach and the proposed approach.

Results
We first use two real examples to illustrate the application of the proposed exact p-value calculation for a post hoc test after a chi-squared test, then we conduct extensive numerical studies to compare the proposed exact approach with the existing approaches.

Real data application
The first example is a cross-sectional study to study malignant melanoma [19,20]. In this study, 408 cases were randomly selected from all patients from New South Wales, Australia who was diagnosed with malignant melanoma. Tumor types (4 categories: Hutchinson's melanotic freckle (H), Indeterminate (I), Nodular (N), and Superficial spreading melanoma (S)) and tumor site (3 categories: Head and neck, Trunk, and Extremities) were recorded for each case. Data of this study is presented in a 4 × 3 contingency table: Table 3. The chi-squared test statistic is calculated as 65.81, with the p-value of 2.9×10 −12 which is much less than 0.05. Since the overall chi-squared test is significant, we would reject the null hypothesis that tumor type and tumor site are independent.
We compute p-values for each cell in this contingency table of this example. First, we use the limiting distributions of test statistics T StdR and T AdjR for p-value calculation, see Table 4. This table is sorted by the T AdjR test statistic from the largest to the smallest. As can be seen from the table, T StdR is relatively conservative as compared to T AdjR since T AdjR has three cells with significant results as compared to one based on T StdR . Suppose T AdjR is used for statistical inference. We can conclude that the expected count is significantly different from the observed count for tumor types of H at all three tumor sites, and S when head and neck is the tumor site. In addition to these results by using asymptotic approaches, we also provide the proposed exact p-value based on T AdjR to order the sample space in the last column of We revisit the awareness survey in Introduction section as the second example. This personal interview survey data is presented in Table 1, and the overall p-value to test the independence between boater activity type and awareness of zebra mussels in the Lake Mead is calculated as 1.4 × 10 −5 , which indicates a significant association between boater activity type and awareness of zebra mussels. Following a significant chi-squared test, we compute the three test statistics, asymptotic p-values based on T StdR and T AdjR , and the proposed exact p-value, see Table 5. No significant cell is found by using T StdR , while boaters for pleasure, Jet ski, or other are shown to be significant by using either T AdjR or the exact approach. In this example, Table 4. P-value calculation for each cell of data from the malignant melanoma example. The calculated p-value for each cell is compared to the multiple comparison correction method by Simes [18]. The cells with significant p-values are bold.  Table 5. P-value calculation for each cell of data from the survey for the awareness of zebra mussels. The calculated p-value for each cell is compared to the multiple comparison correction method by Simes [18]. The cells with significant p-values are bold. a few observations have the same cell p-values. For such cases, we use the largest adjusted p-value for those having the same p-value. In this example, T AdjR and the proposed exact approach for p-value calculation have the same conclusion. It should be noted that when a factor only has 2 levels (the awareness in this example, j = 1, 2), T AdjR is the same within each level of the other factor (T AdjR (x i1 ) = T AdjR (x i2 )) [9]. This leads to the same exact p-values for the these two cells as observed in the table.

Simulation study
We conduct an extensive simulation study to further compare the existing asymptotic approach based on T AdjR and the proposed exact approach. It has been observed that the asymptotic approach based on T StdR is relatively conservative as compared to that based on T AdjR . For this reason, we exclude T StdR in the comparison. For a given total sample size (N) and the size of table (R × C), we first simulate the row and column marginal totals, (m 1 , Á Á Á, m R ) and (n 1 , Á Á Á, n C ). We simulate 1,000 sets of the marginal totals. For each simulated marginal totals, we then use an R function, r2dtable, to randomly generate 2,000 R × C contingency tables by using Patefield's algorithm [21,22]. For each simulated data from these 2,000 tables, we compute the asymptotic p-value based on the limiting distribution of T AdjR and the exact p-value. We compute the family-wise error rate (FWER) for each approach when performing R × C hypotheses at the same time for each simulated data. The FWER is calculated as the average of the number of tables whose hypotheses are rejected from at least one cell. The significance level is set as 0.05k/(R × C), k = 1, 2, Á Á Á, R × C by using the Simes correction method for multiple comparisons. Fig 1 shows the FWERs for both asymptotic and exact approaches for a contingency table size with sizes of 3 × 3, 5 × 5, and 8 × 8, and sample sizes from 50 to 500. It can be seen that the asymptotic approach does not guarantee the type I error in the majority of cases, and it is almost 5 times the nominal level in one case. The performance of the asymptotic approach gets worse as the size of table increases. It could be caused by the reason that the chance of rejecting at least one of the null hypotheses is increased when more hypotheses are tested simultaneously. The proposed exact approach guarantees the type I error rate.
Suppose Γ Asy and Γ Exact are the numbers of cells with significant p-values by using the asymptotic approach and the exact approach, respectively. We include the cases that have at least one cell being significant based on one of the two approaches, max(Γ Asy , Γ Exact ) > 0. In other words, the cases with Γ Asy = 0 and Γ Exact = 0 are excluded in the performance comparison.
In Table 6, we compare the existing asymptotic approach based on T AdjR and the proposed exact approach by using all cases with max(Γ Asy , Γ Exact ) > 0 for given N and the table size (R = 3 and C = 3). The last row of this table shows the total number of such cases from the total 1,000× 2,000 = 2,000,000 simulated data. We find that the proportion of the two approaches having the same conclusion Γ Asy = Γ Exact , increases as the total sample size goes up, and the proportion of Γ Asy > Γ Exact (the number of cell rejected by the asymptotic approach is more than that by using the exact approach), is a decreasing function of N. Among the cases with Γ Asy > Γ Exact , the majority of them are the ones that the exact approach has no significant pvalue from any cell. The number of cases such that the exact approach has more rejected cells than the asymptotic approach, is relatively low, which is less 0.15% for the cases studied. When N is small, such as N = 50, the asymptotic approach always rejects at least the same number of cells as the exact approach, Γ Asy ! Γ Exact .
We present the frequency and proportion of simulated data from a 3 × 5 contingency table in Table 7 and a 5 × 5 contingency table in Table 8. When the total sample size is small, the proportion of Γ Asy = Γ Exact is less than that of Γ Asy > Γ Exact , and this trend is reversed as the sample size increases. As the table size increases, the proportion of two approaches having different numbers of rejected cells (Γ Asy 6 ¼ Γ Exact ), goes up. Similar to Table 6, these two tables show that the proportion of Γ Asy > Γ Exact is relatively large as compared to that of Γ Asy < Γ Exact .
When we compare the three tables in Eqs (6), (7), and (8) with different table sizes, we find that the proportion of max(Γ Asy , Γ Exact ) > 0 among the total 2 million simulations, is increased as the table size increases. Within the 3 × 5 or 5 × 5 contingency table, the proportion of max (Γ Asy , Γ Exact ) > 0 is a decreasing function of N, while in Table 6 for a 3 × 3 contingency table, this proportion is almost constant across different total sample sizes.

Discussion
It is well known that asymptotic approaches could lead to different conclusions based on their limiting distributions for p-value calculation. In this article, we theoretically prove that exact p-values produce the same result by using any of the three commonly used test statistics. For this reason, we would like to recommend the proposed exact p-value for use in practice. We develop the software program to compute exact p-value by using the statistical software R [23], and it is available from the first author's website at: https://faculty.unlv.edu/gshan/ under the Software development section. In addition to that, we also provide a website for researchers who do not use R, which is: http://gshan.i2.unlv.edu/ZPostHoc. We would appreciate any comments from users to further improve the R function and the website. We do not find an alternative approach based on the exact framework. The existing approaches are generally based on asymptotic limiting distributions or simulation. For the approach based on simulation, it can only simulate a certain number of cases, and it may delete some cases (e.g., the ones with one or more zeros in the table). Although simulation is an approach to utilize when it is difficult to enumerate all possible samples, especially for a study with the total sum fixed [24][25][26][27].
In addition to the considered three test statistics for testing cells, several other approaches were developed after a significant chi-squared test. Partitioning is one of them, and this approach basically divide a contingency table into a set of 2 × 2 tables. Obviously, the total possible number of set is R . Due to the large number of partitioning, a set of orthogonal partitions was proposed [28] to avoid having too many unnecessary comparisons [5,29]. Alternatively, Jin and Wang [30] suggested to implement multiple comparisons on one factor. When that factor has R levels, each data for a post hoc test is a 2 × C contingency table. Then, the total number of comparisons is R 2 . They compute p-value for each 2 × C contingency table by using the chi-squared test. One can always consider using exact approaches for pvalue calculation for such data [13]. We consider this as future work.

Conclusions
In this article, we propose using Fisher's approach to compute exact p-value for each cell in a contingency table after a significant overall chi-squared test [31][32][33][34][35]. The existing approaches are often based on asymptotic limiting distributions of their associated test statistics. From our extensive simulation studies conducted in this article, we find that the FWERs of the asymptotic approach based on T AdjR could be much larger than the nominal level, while the proposed exact approach guarantee the FWER. Due a lack of an existing approach with the FWER guaranteed, we do not have another approach to be included to compare with the proposed exact approach with regards to power.