Homogeneity Test for Correlated Binary Data

In ophthalmologic studies, measurements obtained from both eyes of an individual are often highly correlated. Ignoring the correlation could lead to incorrect inferences. An asymptotic method was proposed by Tang and others (2008) for testing equality of proportions between two groups under Rosner's model. In this article, we investigate three testing procedures for general g ≥ 2 groups. Our simulation results show the score testing procedure usually produces satisfactory type I error control and has reasonable power. The three test procedures get closer when sample size becomes larger. Examples from ophthalmologic studies are used to illustrate our proposed methods.


Introduction
In randomized clinical trials [2], patients are usually randomized into two or more treatment groups, and patients within each group receive the same treatment. Often a control group or a group with standard treatment is included for testing the efficiency of new treatments. After the randomization, all patients are followed up in exactly the same way as designed, and the only difference is the treatment assigned to each group. A randomized clinical trial is a good choice to eliminate many of the biases and to avoid ethical problems that may arise from comparing treatments [3] [4]. For example, in a double-blinded two-arm clinical trial for an ophthalmologic study, all patients are randomized into two treatment groups and the same treatment is applied to both eyes of patients from the same group. Such clustered data with a cluster size of two often arises from statistical and medical applications, for example, ophthalmologic studies, orthopaedic studies, otolaryngological studies and twin studies.
We wish to test if the outcomes are identical among the two or more treatment groups. Obviously, the information collected from two eyes of a single person tends to be highly correlated. Any statistical methods that ignore the feature of dependence, such as t tests, analyses of variance, or chi-square tests, could lead to incorrect inferences (see, [5] [6] [7] [8] [9]).
In this article, we consider the case of a dichotomous outcome, such as the presence of a disease or some other binary trait. Several statistical tests have been proposed. Rosner [5] proposed a parametric model and a test statistic for testing homogeneity of proportions among g groups. However, the maximum likelihood estimates (MLEs) and likelihood-based tests were not given. [1] [10] considered this problem for two groups only and proposed several asymptotic testing procedures, including the score test. It is difficult to extend the testing procedures from 2 groups to g groups (g > 2), due to the complexity of deriving the information matrix and maximum likelihood estimates which can be obtained only by numerical iterations. The score test statistic has demonstrated better type I error control and power than other testing procedures, when comparing two treatment groups [1]. We expect the score test, investigated for comparing multiple treatment groups in this article, to perform well as compared to other procedures.
In this article, we present the methods for comparing proportions among any g groups, with g ! 2. The maximum likelihood estimate under Rosner's model and three different methods (Likelihood Ratio test, Wald-type test, Score test) are derived and investigated in Section 2. In Section 3, Monte Carlo simulation studies are conducted to compare the performance of various tests and comparisons are evaluated with respect to empirical type I error rates and powers. Examples from otolaryngological studies are illustrated to demonstrate our methodologies in Section 4. Finally, we give some concluding remarks in Section 5.

Methods
Suppose we wish to compare g groups of individuals from an ophthalmologic study with m i individuals in the ith group, i = 1, . . ., g; N = ∑m i total subjects (Table 1). Let Z ijk = 1 if the kth eye of jth individual in the ith group has a response at the end of the study, and 0 otherwise, i = 1, . . ., g, j = 1, . . ., m i , k = 1, 2. Let m li denote the number of subjects who has exactly l responses in the ith group, and S l be the number of subjects who has exactly l responses (e.g., affected eyes) A parametric model proposed by [5] is given as i = 1, . . ., g, j = 0, . . ., m i , k = 1, 2 for some positive R. The constant R is a measure of dependence between two eyes of the same person. If R = 1, the two eyes from the same patient are completely independent. If Rπ i = 1, the eyes of each patient in the i-th group are completely dependent. The R satisfies 0 < R 1/a, if a 1/2; (2 − 1/a)/a R 1/a, if a > 1/2; where a = max{π i , i = 1, . . ., g}. From the conditional probability in Eq (1), it is easy to show that the We wish to test whether the response rates of the g groups are identical. The hypotheses are given as Based on the observed dataM ¼ ðm 01 ; Á Á Á ; m 0g ; m 11 ; Á Á Á ; m 1g ; m 21 ; Á Á Á ; m 2g Þ, the corresponding log-likelihood can be expressed as Differentiating l(π 1 , . . ., π g ; R) with respect to parameters π 1 , . . ., π g and R yields Under the null hypothesis H 0 : π 1 = Á Á Á = π g = π, the maximum likelihood estimates of π and R satisfy @l @R ¼ 0 and @l @p A direct algebra calculation results in the MLEs of p 0 i s and R . . . ; g andR as the maximum likelihood estimate of π i , i = 1, . . ., g and R, respectively.p i ; i ¼ 1; . . . ; g andR are the solution of the following equations @l @p i ¼ 0; i ¼ 1; . . . ; g; @l @R ¼ 0: There is no closed form solution and it has to be solved iteratively. We can simplify the formula in Eq (3) as the following 3rd order polynomial (for i = 1, . . ., g) The (t + 1)th update for π i can be directly obtained by the real root of the above equation, and R can be updated by the Fisher scoring method R ðtþ1Þ ¼ R ðtÞ À @ 2 l @R 2 ðp ðtÞ 1 ; . . . ; p ðtÞ g ; R ðtÞ Þ À1 @l @R ðp ðtÞ 1 ; . . . ; p ðtÞ g ; R ðtÞ Þ: See next section for the formula of @ 2 l @R 2 .

Information matrix
Differentiating @l @p i ; i ¼ 1; . . . ; g and @l @R with respect to π i , i = 1, . . ., g and R respectively yields Then we have The (g + 1) × (g + 1) information matrix is denoted as I(π 1 , . . ., π g ; R) = (I ij ). Under the null hypothesis H 0 : π 1 = Á Á Á = π g = π, it is straightforward but tedious to show that the inverse of the information matrix can be expressed as where With the MLEs and information matrix derived, we consider the following test statistics.

Likelihood ratio test (T LR )
The likelihood ratio (LR) test is given by Under the null hypothesis, T LR is asymptotically distributed as a chi-square distribution with g − 1 degrees of freedom.

Wald-type test (T W )
The null hypothesis H 0 : π 1 = Á Á Á = π g can be alternatively expressed as C β T = 0 where β = (π 1 , Á Á Á, π g , R) and . . . Wald-type test statistic (T W ) for testing H 0 can be expressed as where I is the Fisher information matrix for and T W is asymptotically distributed as a chisquare distribution with g − 1 degrees of freedom. T W can be simplified as Other multivariate tests of π's can be similarly done by choosing the corresponding C matrix in the above statistic. Further, a Wald-type test statistic for testing H 0a : π i = π j vs H 1a : π i 6 ¼ π j , i 6 ¼ j can be given by where c = (0, . . ., 1, . . ., −1, . . ., 0) with 1 in the ith element and −1 in the jth element. T Wa is asymptotically distributed as a chi-square distribution with 1 degree of freedom. T Wa (i, j) can be simplified as The score test statistic T SC is given by and see (Eq 5) for the formula of the inverse of the information matrix I(π, R) −1 . It can be simplified as after lengthy algebra calculations. T 2 SC is asymptotically distributed as a chi-square distribution with g − 1 degree of freedom.
Remark 1. One limitation of the score statistic is that it cannot be computed if S 0 = 0 or S 1 = 0. We dealt with this problem by adding 1/(2g) to m ij for such situations. Remark 2.
[1] derived a score test T SC for g = 2 as which is equivalent to (Eq 6).

Monte Carlo simulation studies
We now investigate the performance of proposed statistics and testing procedures discussed in the previous section. First, we investigate the behavior of the type I error rates of various procedures for g = 2,3,4,5; sample size m 1 = Á Á Á = m g = 20, 40, 60, 80 and 100; π 1 = Á Á Á = π g = π 0 = 0.5 (0.1)0.8; and R = 1 + ρ(1 − π 0 )/π 0 where ρ = 0.4(0.1)0.6. An imbalanced sample setting is also considered. In each configuration, 50,000 replications are generated based on the null hypothesis, and empirical type I error rates are computed as the number of rejections/50000. The results are presented in Table 2. Following [1], we calculated the ratio of empirical type I error rate to the nominal type I error rate. A test is said to be liberal if the ratio is greater than 1.2 (i.e., > 6% for α = 5%), conservative if the ratio is less than 0.8 (i.e., < 4%), and robust if the ratio is between 0.8 and 1.2. Generally, score tests T 2 SC produce satisfactory type I error controls for any configuration while LR tests and Wald tests are liberal, especially for small samples and larger numbers of groups (g). When g > 2, Wald tests are more liberal than LR tests and these tests get closer when sample size becomes larger.
LR tests and Wald-type tests are extremely liberal for a small sample size (i.e., m = 20), and their actual sizes inflate with the increase of the correlation coefficient (i.e., ρ).
Next, we evaluate the power performance of proposed methods. We consider the alternative hypotheses with  Table 3.
Based on the simulation results, LR and Wald tests are generally more powerful than score tests and Ronser's T generally has less power. However, power of LR and Wald tests is often overestimated in small sample size due to the inflated type I error rates from these tests being observed (see Table 2). For moderate or large sample sizes, the powers of the three proposed methods are close. Overall, the score test is highly recommended as it has reasonable power with satisfactory type I error control.

Examples
We first reanalyze the data presented by [5] to illustrate the newly proposed methods. The data consists of 218 persons aged 20-39 with retinitis pigmentosa (RP) who were seen at Massachusetts Eye and Ear Infirmary. They were classified into four genetic types, namely, autosomal dominant RP (DOM), autosomal recessive RP (AR), sex-linked RP (SL), and isolate RP (ISO). The differences between these four groups on the Snellen visual acuity (VA) were assessed. An eye was considered affected if VA was 20/50 or worse, and normal if VA was 20/40 or better. The sample used for this analysis consists of 216 persons out of the sample of 218 persons each of whom had complete information for VA on both eyes (Table 4).
An overall significant difference between the proportions of affected eyes in the four groups is from 0.0769 to 0.1173 based on proposed methods and 0.010 on Rosner's statistic T (Table 5). Table 2. The type I error rates (percent) of various procedures under H 0 : π 1 = Á Á Á = π g = π 0 at α = 0.05 based on 50,000 replicates. The maximum likelihood estimates and pairwise comparisons based on Wald-type test T Wa (i, j) are shown in Table 6. It shows a significant difference between DOM and AR (p = 0.0207).
Another example was a recent study from a cross-sectional, population-based sample in Iran assessing the prevalence of avoidable blindness [11]. Nearly 3000 persons were examined and blindness was assessed for seven age groups (Table 7). Test statistics T 2 LR ¼ 134:7, T 2 W ¼ 89:1, T 2 SC ¼ 161:1, and T = 202.0 consistently show the significant age differences (pvalues < 0.0001), and MLER ¼ 3:35 (r i ¼p i 1Àp i ðR À 1Þ are from 0.05 to 0.59) shows positive correlations between eyes for the same person.

Concluding remarks
In this article, we investigated three procedures for testing the homogeneity of correlated data with a cluster size of two. We derived the maximum likelihood estimate algorithm by utilizing  the root of third order polynomial equations. The Fisher scoring method is usually criticized for converging slowly, especially when the number of parameters is large (e.g. g is large). However, the algorithm derived in this paper is very efficient. This is because only R is updated by Fisher scoring iterations, and π i , i = 1, . . ., g are the roots of third order polynomials, a closed form solution.
Simulation results showed that the proposed approach (score test) has satisfactory type I error control and reasonable power, regardless of number of groups, sample size, or parameter configurations. On the other hand, the LR test and Wald test have inflated type I error in small sample size. When sample size becomes larger, the three test procedures get closer.
For binary correlated data, there are alternative ways to solve the MLE iteratively or perform hypothesis testing by model-based methods, e.g., GENMOD or GLIMMIX in SAS. However, neither iterative version of test statistics nor model-based method provides the explicit form of the test statistic. The explicit form of the test statistic in our proposed method is useful not only for its simplicity, but also for further development of the exact test. For example, in small sample situation, an exact test may overcome the inflated type I error rate and thus is highly desirable. To perform exact test, it will requires extensive calculations which makes it nearly impossible using iterative versions of test statistics or model-based methods.
To overcome inflated type I error control in asymptotic tests, [10] and [12] considered exact tests for g = 2. We consider the exact tests for g > 2 as interesting future work.
A user-friendly web-based calculator, including model estimations, hypothesis testings, and simulations, is available from the corresponding author upon request.