Analysis of Group Randomized Trials with Multiple Binary Endpoints and Small Number of Groups

The group randomized trial (GRT) is a common study design to assess the effect of an intervention program aimed at health promotion or disease prevention. In GRTs, groups rather than individuals are randomized into intervention or control arms. Then, responses are measured on individuals within those groups. A number of analytical problems beset GRT designs. The major problem emerges from the likely positive intraclass correlation among observations of individuals within a group. This paper provides an overview of the analytical method for GRT data and applies this method to a randomized cancer prevention trial, where multiple binary primary endpoints were obtained. We develop an index of extra variability to investigate group-specific effects on response. The purpose of the index is to understand the influence of individual groups on evaluating the intervention effect, especially, when a GRT study involves a small number of groups. The multiple endpoints from the GRT design are analyzed using a generalized linear mixed model and the stepdown Bonferroni method of Holm.


Introduction
This paper addresses data analysis issues related to group randomized trials (GRTs) with multiple endpoints and a small number of groups. In GRT studies, groups serve as the primary sampling unit in the selection process; groups are randomized into two or more arms, and then responses are measured on individuals within those groups. Typical examples of groups include clinics, schools, work sites, churches, or communities.
GRT is becoming a standard study design to assess the effect of an intervention program for health promotion or disease prevention, especially when the intervention is delivered more efficiently to groups than directly to individuals [1].
Consider the data from a study of the cancer Screening Office System (cancer SOS) [2], which motivated this research and are analyzed herein. In the study, eight primary care clinics were randomly selected and assigned to either the intervention or control arms. Then, patients from the clinics were randomly selected and a chart review was conducted to assess whether or not they took a given cancer screening test sometime during the preceding year. This assessment was made at three time points: baseline, 12 and 24 months post intervention.
In contrast to standard randomized clinical trials in which individuals are the sampling units for randomization, GRT designs have additional analytical problems. The major problem is due to the expected positive intraclass correlation (ICC) among observations from individuals in the same group. In addition, the number of groups involved in GRT studies is generally quite small, and thus problems arise in obtaining an accurate estimate of the ICC.
In spite of its difficulties, however, the GRT approach may sometimes be the only feasible approach; and if the characteristics of the GRT design are not properly accounted for in the analysis, the statistical significance of the intervention effect is typically overestimated, resulting in misleading public health information.
Data analysis issues of GRTs have been intensively discussed by health science researchers [1,[3][4][5] and a number of studies that used inappropriate statistical analysis methods have been identified [4,[6][7][8][9][10]. Recently, a statement from the Consolidated Standard of Reporting Trials (CONSORT) group emphasized the major analytical problems associated with GRTs, and recommended guidelines for reporting about them [6]. Many journals now require that the reports conform to the CONSORT guidelines. We believe that further education regarding the proper analysis of GRT-type data is necessary.
Analytically, it is most straightforward for a clinical trial to have a single primary question. For some research, however, multiple primary endpoints are important and relevant scientifically, medically, or for public health purposes, as it is often difficult to fully assess the efficacy of a new intervention using a single endpoint. Moreover, these studies are often expensive in both cost and effort, so researchers would like to get answers to many questions with a given study, if feasible. For example, the cancer SOS intervention study was designed to increase the use of three major cancer screening tests. For each patient, evaluations of the intervention were made for each test. Thus, patients were measured for the three tests as co-primary endpoints to better characterize the efficacy of the intervention, leading to an outcome variable that is a vector of three responses.
Multivariate analysis involving multiple responses per subject is a major research area in statistics, and several statistical computing packages (e.g., SAS, S-PLUS) have software procedures for analyzing such data. In spite of recent progress though, multivariate analysis is still not a standard approach for analysts who do not routinely use mixed models. Although many papers have explained the theoretical basis for multivariate analysis, few practical introductions to GRTs with multiple endpoints have been written, especially in the context of public health medicine.
Generalized linear mixed models (GLMMs) [11,12] provide a suitable framework for handling a GRT design. In this paper, we show how to use GLMMs to analyze GRT data arising from a randomized cancer prevention trial. We can incorporate covariates into GRT analysis by formulating the problem in the context of GLMMs.
When a study involves multiple endpoints addressing equally important objectives of the proposed intervention, the potential for drawing false positive conclusions exists unless an appropriate adjustment for multiplicity is used to control the overall statistical error rate. Several possible approaches exist, with appropriateness depending upon the study design. In this study, the multiple endpoints will be adjusted using the stepdown Bonferroni method, which we will refer to as Holm's method [13].
Use of a small number of groups in a GRT raises additional concerns regarding the statistical analysis of the intervention effect. Also, statistical power of a GRT generally depends more on the number of groups randomized than on the average number of individuals within a group [3]. Unfortunately, due to logistics and cost, a GRT commonly involves a small number of groups. From an analytical perspective, a GRT involving few groups intensifies the effect that any single group's behavior has on the overall intervention effect. Thus, it is critical for investigators to understand each group's influence, which poses greater problems if one group is markedly different than the others. In this study, we will develop an index of extra variability to investigate groupspecific effects on response. The index is designed to gauge the heterogeneity of response among the individual groups and provide insights into their influence on the GRT study.
This article provides practical advice and methodology for analyzing GRTs data, especially with multiple endpoints and a small number of groups. The overarching goal of this work is to disseminate statistical knowledge for public health benefit. This paper can be viewed as a tutorial for researchers who have little theoretical background or practical experience analyzing GRTs data, especially with multiple endpoints and a small number of groups. Mathematical details are avoided unless needed to illustrate important concepts.

Group randomized trials and statistical issues
A GRT is a randomized clinical trial to investigate an intervention. Unlike standard randomized clinical trials, however, the intervention is delivered to individuals through ''groups'', which are assigned to either intervention or control arms. Responses are measured on individuals within those groups nested in the arms. Consequently, the responses for individuals within the same group are expected to be positively correlated. This correlation is called the intraclass correlation coefficient (ICC), and is denoted by r.
Although r in most GRTs is usually rather small, this dependence can substantially influence the design and analysis of the GRT. If the ICC is ignored, the point estimate of the intervention effect is not affected. However, statistical inferences through the standard errors, p-values, and confidence intervals can be substantially affected.
To explain the concept of the ICC simply, suppose we conduct a nested cross-sectional GRT design. Let the two arms have the same number of groups, g, and all groups have the same number of members, m, following Murray's notation [1]. In addition, suppose we measure the endpoint at least two times post intervention. The total variance of the response s 2 y is defined as s 2 ð Þ , and s 2 e are the between-group, group-by-time interaction, and within-group variances, respectively. Here, g a ð Þ and gt a ð Þ represent group and interaction between group and time are nested within arm. Then, the ICC is defined as Note that the ICC varies depending upon the endpoint, the design, and the analyses. For example, with a nested cohort study in which the endpoints are repeatedly observed over time from the same subject, the ICC above will include the within-subject variance over time.
As seen in equation (1), ICC reflects the proportion of the total variance explained by the variance of the group-by-time interaction. The variance of the group mean in a GRT is then, Due to the GRT design, the variance includes a multiplicative factor 1z m{1 ð Þr ½ , called the variance inflation factor (VIF) [3]. Note that VIF = 1 if r~0. If r is greater than 0, ignoring the VIF results in underestimating the standard error of the intervention effect, and hence overestimating the statistical significance of the intervention effect.
As equation (1) shows, estimation of the ICC requires estimates of the between-group variance and the group-by-time interaction variance, with the number of groups as degrees-of-freedom. Primarily for logistical reasons, GRTs commonly have few groups (e.g., the four clinics per arm used in the cancer SOS study), resulting in a small number of degrees-of-freedom (df). This small df will be used to estimate the standard error for the intervention effect, and will yield an inflated Type I error. Further, this will reduce the statistical power of the test. Consequently, both the design stage and the analysis stage should account for the GRTs' characteristics.

Analysis of GRTs using GLMM
In this section, we briefly review the generalized linear mixed model (GLMM) to analyze GRTs. Let's assume that a vector of binary response data Y follows a Bernoulli distribution with unknown parameter m. The response probability, m, needs a link function so that all possible values of a set of linear predictors map into the interval between 0 and 1. This is accomplished by using the log of the odds of m, called the logit link function, written by where X is a matrix of observed explanatory variables and Z is a matrix for the random effects. The vector b contains the unknown fixed effects that need to be estimated, while the vector c of random effects is not estimable, and is assumed to be normally distributed with mean 0 and a variance matrix. The variance of the binary data is defined as, We analyze the cancer SOS study by testing the intervention effect for the multiple endpoints in a global fashion. Specifically, for each of the three screening tests, let Y ijk l ð Þ be the binary response (yes/no) and m ijk l ð Þ~P r Y ijk(l)~1 À Á be the probability of taking a screening test for the ith member nested within the k th clinic and the l th arm and observed at the j th time. The odds ratio for the effect of intervention can be obtained using a logistic regression model based on equation (2), and it is given by where i~1, Á Á Á ,m patients; j~0, Á Á Á ,t times; k~1, Á Á Á ,c groups; and l~0,1 arms. Time is modeled as a continuous variable. In the model, A l is the intervention indicator (1 for intervention, 0 for the control) that estimates the difference between the intercepts, T j is the j th time point, and the coefficient represents an average slope for the control arm, and TA j l ð Þ is the time-by-intervention interaction that is the main test of interest as it tests for the average departure from the slope due to the intervention arm.
In order to account for the expected ICC, group-related factors were included in the analysis as random effects: the random effect of the k th group nested within arm l, denoted G k l ð Þ , as a groupspecific intercept and the random effect of the combination of the j th time and the k th group nested within arm l, denoted TG jk l ð Þ , as a group-specific slope. TG jk l ð Þ accounts for the possibility that the random effect G k l ð Þ may not have an identical distribution at each time point. These random effects allow for correlation among members within a group and for correlation among members within a group | time combination. We assume that the simple diagonal covariance matrix models a different variance component for each random effect, although other covariance matrices are possible. The estimated odds ratios for the fixed effects are then given by expb b . More precisely, the intervention effect is tested by considering the statistical significance of the interaction term,b b 3 . For the final analysis, we added fixed effects for the baseline covariates, age (as a continuous variable) and race (a categorical variable, as shown in Table 1), to the above model to adjust for possible confounding factors.
When a GRT involves a small number of groups, a few statistical methods are available to adjust the degrees-of-freedom in mixed models. For example, Kenward and Roger [15] proposed a general purpose method based on restricted maximum likelihood. Their method uses an adjusted F-statistic that reduces the small sample bias. Its distribution is approximated by the Fdistribution with an approximate denominator degrees-of-freedom. This method is easily implemented in some commercial statistical packages including SAS.
Once the model (4) is fitted, the ICC and the VIF can be estimated based on the equation (1), with the estimated variances for the within group, between groups, and a between group-bytime interaction.
Several possible approaches exist to adjust for multiplicity due to multiple endpoints, and the appropriateness depends upon the study design. The adjustment approach could be very conservative (e.g., Bonferroni method) or less conservative (e.g., Hochberg or Hommel method). We prefer Holm's method [13], also known as the stepdown Bonferroni method (a modification from the Bonferroni procedure), as it controls the familywise error rate (FWE) under very general conditions. More powerful methods rely on additional assumptions, and these methods do not always control the FWE. Holm's method is now being widely used and has been cited over 3,000 times to date (www.isiknowledge.com). We refer readers to Brown and Russell [16] for detailed comparisons of various multiple testing procedures. We fit a GLMM for each endpoint and obtain a p-value corresponding to b 3 of TA j l ð Þ from equation (4). These marginal p-values are then adjusted for multiplicity, using Holm's method (implemented in several statistical packages, including SAS).

Group and time-specific influences
While VIF provides an interpretation of the overall variance inflation due to a GRT design, it is critical to understand the influence of individual groups on evaluating the intervention effect, especially, when the number of groups is small.
Let us define the change on each endpoint within a particular interval as, If we assume that the responses at two time points are independent, the nominal variance of the average change { d j is based on the binomial distribution. We denote the standard deviation of this binomial variance by STD. It is obtained as where p p j is the estimated sample mean of the proportions computed over all groups in each arm, and { m j is the average number of individuals within arm at time j. Similarly, p p jz and { m jz represent the analogous values for time jz1. Using STD, the associated 1{a ð Þ100% confidence interval for { d j can be computed.
As an index of extra variability (IEV kj ), STD d kj À Á is compared to the variability based on the binomial distribution, using the ratio; IEV kj~S TD d kj À Á STD: The IEV index describes how spread out the individual group's responses are from the mean, within an arm for a particular period. IEV kj values close to 1 can be interpreted as an indication of strong similarity in response across the groups.

Cancer SOS intervention
Roetzheim and others [2] developed a low-cost office-systems intervention called cancer Screening Office Systems (cancer SOS), for primary care clinics serving disadvantaged populations. Disadvantaged populations were defined as patients: 1) belonging to racial or ethnic minorities, 2) of low socioeconomic status, 3) uninsured, or 4) insured by Medicaid. The scientific question for this study was whether or not the intervention prompts patients to take the cancer screening tests described in the next paragraph. In previous reports [2,14], both men and women were included in the analyses. The present study focuses only on women.
The intervention was implemented in a county-funded health insurance plan in Hillsborough County, FL from 2002 to 2004. Eight clinics were randomly selected, and each was randomly assigned to either the intervention or control arms.
The intervention included a cancer screening checklist completed by patients which indicated whether or not they were due for screening. The intervention targeted three cancer screening tests: Papanicolaou (Pap) smears, mammograms, and fecal occult blood tests (FOBT). It is generally recommended that each of these tests be performed annually for women who are age 50 or older. One hundred fifty patient's charts from the clinics were randomly selected and abstracted to obtain the demographic and clinical variables at baseline, and outcome variables at baseline, 12, and 24 months after the intervention. By chance, a few patients were selected more than once. However, as the study was not intended to follow individuals over time, this design is called a nested crosssectional study rather than a cohort study. Table 1 provides descriptive statistics of the cancer SOS study sample at the individual level. The table includes p-values from Fisher's exact tests comparing control and intervention arms. Patients attending intervention clinics were more likely to be African American, married, and have fewer health care visits. However, there was no significant difference in age, language, health insurance, or smoking status at baseline.

Application
This section illustrates applications of GLMMs to a GRT study where the scientific question is whether or not the cancer SOS intervention convinces patients to take cancer screening tests or not. Men were excluded from the analysis in this paper, as two of the three target screening tests applied to women only. Also, the following exclusion criteria for the women were applied: personal history of breast cancer for the analysis of mammography, personal history of cervical cancer or hysterectomy for the analysis of Pap smear screening, and personal history of colon cancer, or colonoscopy of double-contrast barium enema in the previous 10 years for the analysis of FOBT.
In each of eight clinics, patients' usage of the three screening tests (Pap smear, mammography, and FOBT) was assessed at three time points (baseline, year 1, and year 2). The responses were binary (1 if the patient took the test and 0 otherwise). On average, 150 patients per clinic were measured at each time point, with four clinics assigned to each of two arms (intervention or control).
The distribution of each cancer screening test (numbers and percentages) at each time period is summarized by arm in Table 2. The number of eligible women for the Pap smear test was lower than for the other tests due to a substantial number of hysterectomies. As an initial informal analysis, it is useful to examine the screening rates by time. For all tests, the difference between the two arms was lower at the 24-month compared to the 12-month follow-up. For FOBT, the two arms showed a considerable difference at baseline (24 vs. 36% for controls and interventions, respectively), which was maintained over time. Figure 1 depicts the data for the three screening tests plotted against time for the 8 clinics. The screening rate trajectories vary among clinics (solid and dot lines represent intervention and control groups, respectively). For the intervention arm, the screening rates increased slightly at year 1 for all three tests, but had declined by year 2. By contrast, the 1-year screening rate declined for all tests in the control arm. Individual clinic trends were fairly similar for mammography and Pap smears, but strikingly different for FOBT. For the most part, the screening rates for interventions were higher than those for controls.
The results of the analysis are shown in Table 3. Age and race (for mammography), and race (for Pap smear) were not significant and were removed from the final model. None of the screening tests showed statistical significance at a~0:05 for the intervention effect (Intervention | Time). Notably, this result differs from the original analysis [2,14], which inappropriately ignored the GRT design: those studies reported that the intervention was significant for Pap smear and FOBT screening tests after one year and two years, respectively. Table 4 shows the estimated covariances of the random effects from the above model as well as the estimated ICC and VIF values. Even though the ICCs might appear to be modest in size, the VIFs are substantial. This indicates that the variances of the intervention effects range from 1.3 to 19.9 times larger than they would have been with random assignment of individual members, providing a major explanation for the differences in findings from the original analyses.
The huge differences of VIF across screening tests are more clearly explained in Figure 2, which depicts the change of screening rates (%) between two time points by clinic and arm for each screening test. The symbol of star and the brackets represent the mean change across clinics and the 90% confidence intervals (CIs) based on equation (5). Overall, for the Pap smear test, the changes for the individual clinics (denoted by circle symbols) were near the mean, and the index of extra variability (IEV) relatively close to 1. By contrast, the IEV values were much higher for the FOBT test, except for the 1-2 year change for the control arm. Specifically, for the intervention arm with the period of 0 to 1 year (IEV = 5.6), two of the four clinics showed marked distances from the group mean at 234% and 51%. The IEV values for mammography were usually somewhat higher than they were for the Pap smear test, in spite of the fact that the VIF for the mammography test was much lower (1.3. vs. 5.7). This contrast can be explained by the relative lack of consistency in when the shifts occurred for Pap smears; the rate was fairly stable from baseline to one year, but decreased about 15% in the second year for the intervention groups, whereas the most of the decline among the control groups was in the first year. The FOBT, likewise, shows large shifts whose pattern differs by treatment arm.

Discussion
GRT designs are becoming increasingly important in public health interventions, as it is often easier to randomize at the group level. Investigators are often surprised by the degree to which information is attenuated due to intraclass correlation, even though ICC levels are often quite low in practice. Thus, GRT designs are usually improved by having as many groups as are logistically possible. We have developed an index of extra variability (IEV) with a corresponding graphical presentation for understanding in greater detail how the effects from individual groups influence the overall findings. We believe that the IEV concept will assist researchers in the planning future GRT studies.
The results of the cancer SOS GRT study were disappointing. Further, the primary investigator of the study, a non-statistician, was understandably confused at the discrepancy in findings based  on the statistical method employed. The original analysis, a generalized estimating equations (GEE) [17] approach for each individual endpoint [2,14], found that the intervention increased all three screening tests at one year follow-up, and had a persistent effect on the mammography at two year follow-up. While the GEE method allows for correlation among patients within a group, it is inappropriate for GRT study designs when the number of clinics is small, with less than 20 groups per arm, as noted by Murray et al. [5] and Bellamy et al. [18]. It is known that the GEE approach generally yields biased estimates of the variance of fixed effects when the number of groups/clusters is small [19,20]. Recently, standard software packages are beginning to correct for the bias issues for GEE with small samples. In addition, while commonly ignored, the univariate approach for each individual endpoint requires an adjustment for multiplicity.
While analyzing the cancer SOS study, we faced a few additional analytical issues that should be discussed. As the endpoints were measured at three times, some patients were seen at multiple timepoints by chance. Among the patients surveyed at baseline, 22% were seen at year 1, and 8% were selected at all three timepoints. Ideally, the expected intra-patient correlation, which exists as a consequence of repeated measurements across time, should be adjusted for when performing statistical inference. However, in this study we did not consider this additional source of correlation. The effect of the possible correlation across time is unknown and needs further investigation. Also, each patient had three endpoints, resulting in another source of intra-patient correlation. As the univariate approach ignores the stochastic dependence among the individual endpoints, it may yield conservative results. To account for this type of intra-correlation, one may use a multivariate approach; all three endpoints per subject unit are simultaneously analyzed by adding a random effect for individuals. By accounting for the correlation among endpoints, improved power to detect the overall intervention effect is expected. In addition, this multivariate approach needs no multiplicity adjustment, since only one test is being carried out. However, it should be noted that if the endpoints describe unrelated aspects of the individual response, or if there is considerable discrepancy across endpoints, this multivariate approach is not a reasonable analytical strategy. In this case, individual tests that are suitably adjusted, as we have done here, should be used. The cancer SOS showed striking differences in Table 3. Summary results of the univariate GLMM analyses for the cancer SOS study. VIF across the three endpoints (i.e., 1.3 to 19.9); thus, averaging the three endpoints provides no reasonable interpretation. What appears to have happened with the SOS GRT is that the intervention groups were either stable or had increases in the first year, while the control groups declined. However, these modest gains were short-lived, and resulted in large year 2 drops for the intervention groups. The GRT was not designed to assess this kind of change, or cope with this degree of heterogeneity, given the small number of groups. As VIF, IEV, and the graphical presentation in Figure 2 all provide different, albeit related, information for GRTs, we believe that they all have a useful role in their analysis. Selecting an appropriate number of groups and the number of subjects sampled per group in GRTs depends on several factors: the endpoint variable type, the method of the analysis, the expected effect size of intervention, and the estimated intra-cluster correlation coefficient. The sample size issue of GRTs is another topic that requires separate discussion; we refer to Donner and Klar [3], and Murray [1] for details.
Our hope is that this paper will contribute to the responsible analysis of GRTs, thereby helping with the scientific accuracy of research findings. Multivariate analysis for multiple endpoints in GRTs is obviously one area that we need to continue further investigation. Further research on small group issues also is warranted at both the design and analysis stages.
The SAS code used for this article is available at the first author's web: http://personal.health.usf.edu/jlee2/software. Note that it also includes a Newton-Raphson optimization option to deal with convergence problems, which we otherwise frequently ran into for moderately complex mixed models.