Power and sample size calculations for comparison of two regression lines with heterogeneous variances

The existence of interactive effects of a dichotomous treatment variable on the relationship between the continuous predictor and response variables is an essential issue in biological and medical sciences. Also, considerable attention has been devoted to raising awareness of the often-untenable assumption of homogeneous error variance among treatment groups. Although the procedures for detecting interactions between treatment and predictor variables are well documented in the literature, the corresponding problem of power and sample size calculations has received relatively little attention. In order to facilitate interaction design planning, this article describes power and sample size procedures for the extended Welch test of difference between two regression slopes under heterogeneity of variance. Two different formulations are presented to explicate the implications of appropriate reliance on the predictor variables. The simplified method only utilizes the partial information of predictor variances and has the advantages of statistical and computational simplifications. However, extensive numerical investigations showed that it is relatively less accurate than the more profound procedure that accommodates the full distributional features of the predictors. According to the analytic justification and empirical performance, the proposed approach gives reliable solutions to power assessment and sample size determination in the detection of interaction effects. A numerical example involving kidney weigh and body weigh of crossbred diabetic and normal mice is used to illustrate the suggested procedures with flexible allocation schemes. Moreover, the organ and body weights data is incorporated in the accompany SAS and R software programs to illustrate the ease and convenience of the proposed techniques for design planning in interactive research.


Introduction
Interactive analysis has received fairly extensive attention in biological and medical sciences. The occurrence of interaction implies that the influence of a predictor variable on the response variable varies as a function of the level of a treatment variable. In general, the existence and magnitude of interaction effects can be measured and appraised by the statistical methods for analyzing the product terms of treatment and predictor variables. Alternatively, the procedure for evaluating the interaction effects between treatment and predictor variables is methodologically identical to that for testing the equality of regression slope coefficients in two or more regression lines. The common testing procedures of multiple linear regression described by Fleiss [1], Kleinbaum et al. [2], and Kutner et al. [3] can be readily applied for the detection of different forms of interaction effects. Among various pedagogical development and exposition of interaction studies, particular emphasis has been paid to the fundamental attributes of two treatment groups, such as Hewett and Lababidi [4], Hill and Padmanabhan [5], Spurrier, Hewett, and Lababidi [6], and Tsutakawa and Hewett [7], among others. The traditional multiple linear regression adopts the fundamental assumptions of independence, normality, and constant variance. Accordingly, homogeneity of error variance is one of the essential assumptions for the appropriate use of common test procedures. The detrimental effects of heterogeneous error variance on the regular test procedures have been investigated in Alexander and DeShon [8], DeShon and Alexander [9], Dretzke, Levin, and Serlin [10], Overton [11], and Shieh [12]. The numerical findings indicated that the resulting Type I error rate and power levels may be considerably affected when group sample sizes are equal, and severely distorted when group sample sizes are unequal. Therefore, the traditional tests of interaction effects are not robust to the heterogeneity of error variance for evaluating regression slope differences across groups. It is emphasized in Dretzke et al. [10] and Shieh [12] that the importance of choosing a proper test procedure for two regression slope differences when group error variances are unequal. Moreover, under the prevalent scenario of variance heterogeneity, a theory-based approach to power calculation for assessing the existence of treatment by predictor interactions remains unavailable.
To enhance the adoption of appropriate techniques for statistical inference and research design, this article aims to present power and sample size procedures for the well-recognized Welch [13] test of parallelism of two regression lines with heterogeneous variances. Accordingly, the suggested techniques are direct extensions of those considered in Shieh [14] under a homogeneous variance setting, just as the Welch [13] procedure to the traditional two-sample t test for mean comparisons. Four important aspects of this study should be pointed out. First, the primitive Welch solution to the Behrens-Fisher problem for comparing the means of two groups under heterogeneity of variance has been widely discussed in the literature (Kim and Cohen [15]). Although there exist several viable alternatives to mitigate the impact of violating the assumption of homogeneous error variance, the accurate performance and computational ease demonstrated in Dretzke et al. [10] and Shieh [12] ensure that the extended Welch procedure for regression slope equality can be recommended. Second, the actual values of predictors cannot be known in advance before conducting a research study just as the primary responses. In order to account for the stochastic nature of predictors, two different general or unconditional distributions of the extended Welch statistic are described to obtain feasible power functions and sample size formulas. Third, comprehensive numerical assessments were conducted to illustrate the discrepancy between two potential approaches under a wide range of model configurations. The empirical findings revealed that the more elaborate procedure generally outperforms the alternative simplified method for correctly accommodating the predictor features. Fourth, to ease the computational demands of the proposed power and sample size calculations, both SAS and R computer algorithms are presented. The program codes are computationally efficient and will give accurate outputs when all the necessary information is properly specified. With the most plausible model configurations, the recommended techniques are practically useful for power analysis and sample size determination in planning interaction studies.

Methods
Consider the two simple linear regression models with unknown and possibly unequal coefficient parameters and error variances: where ε 1j and ε 1k are iid Nð0; s 2 1 Þ and Nð0; s 2 2 Þ) random variables, respectively, j = 1, . . ., N 1 , and k = 1, . . ., N 2 . Note that Shieh [14] focused on the scenario with homogeneous error variance s 2 1 ¼ s 2 2 . Moreover, the regression model with unequal slopes in Eq 1 can be formulated as an interactive model using a dichotomous treatment variable M: Evidently, the slope difference between the two regression lines β 1D = β 11 −β 12 is also the coefficient associated with the cross-product term between the predictor and treatment variables. The existence and magnitude of β 1D indicates the influences of a dichotomous treatment indicator on the direction and/or the strength of the relationship between the response and predictor variables. For the purpose of detecting the interaction effect, the problem is naturally concerned with the hypothesis and the distributional properties of the corresponding least squares estimators of slope coefficientsb 11 andb 12 . Under the model formulation in Eq 1, the standard results show thatb 11 � ðX 2k À � X 2 Þ 2 , and � X 1 and � X 2 are the respective sample means of the X 1j and X 2k observations.
To detect the non-parallelism of two regression lines or the interactive effects of a dichotomous treatment variable between the predictor and response variables in terms of H 0 : β 1D = 0 versus H 1 : β 1D 6 ¼ 0, Welch [13] proposed an extension of the well-known approximate degrees of freedom solution to the Behren-Fisher problem of comparing the difference between two means. Accordingly, the extended Welch statistic has the form: where tðnÞ is a t distribution with degrees of freedomn and The null hypothesis is rejected at the significance level α if where tn ;a=2 is the 100(1−α/2) percentile of the t distribution tðnÞ with degrees of freedomn.
The statistical inferences about the interactive effect are based on the conditional distribution of the continuous outcomes {X 1j , j = 1, . . ., N 1 } and {X 2k , k = 1, . . ., N 2 }. Consequently, the particular results would depend on the observed values of the predictor variables. At the design stage, however, the actual values of both the response and predictor variables cannot be determined. Hence, it is more appropriate to consider the random or unconditional framework as demonstrated in Binkley and Abbot [16], Cramer and Appelbaum [17], Gatsonis and Sampson [18], and Sampson [19]. Although the unconditional properties of the test procedure are relatively more involved, the associated hypothesis testing and parameter estimation remain the same under both conditional and unconditional modeling settings. The fundamental distinction between the two modeling scenarios is essential when performing power and sample size calculations. It is of theoretical importance to recognize the random characteristic of the predictor variables and to appraise the unconditional property of the test statistic over the possible values of the predictors.
In order to explicate the critical notion of accommodating the distributional properties of the predictor variables, two different power functions are presented in Appendix. The simplified t method only utilizes the partial information of predictor variances. The associated power function C ST defined in Eq A3 has the advantages of statistical and computational simplifications. On the other hand, the mixed t approach considers the improved power function C MT given in Eq A6 and accommodates the full distributional features of the predictors. Despite the close resemblance between the two power formulas, the corresponding behavior may differ substantially for finite samples. In practice, a research study requires adequate statistical power and sufficient sample size to detect scientifically credible effects. For the purpose of planning interaction design, the prescribed two power formulas can be employed to determine the sample sizes N 1 and N 2 needed to attain the specified power through a simple iterative search for the chosen significance level and model configurations. Accordingly, it is of practical interest and theoretical importance to examine the underlying characteristics of the two approaches in power and sample size calculations.

Simulation study
Due to the complexity of the power and sample size problem, a complete theoretical evaluation of the associated issues is not feasible. To clarify the similarities and differences of the prescribed two different distributions for the extended Welch statistic, extensive numerical appraisals were conducted for their performance in power and sample size calculations. The investigation was carried out in two steps. The first step involved sample size determinations across a wide range of manipulated model configurations. In the second step, a Monte Carlo simulation study was performed to compare the accuracy of the alternative sample size formulas through power assessments under the design characteristics specified in the first step.

Sample size determinations
The determination of necessary sample size for testing regression slope differences requires detailed specifications of the magnitudes of slope coefficients, error variances, predictor variances, nominal power, and significance level. Notably, the impact of each of these critical factors on sample size calculations not only differs but also varies with the coexisting magnitudes of other components. To provide a systematic evaluation, the empirical examinations are administered by fixing all but one of the following critical attributes and by varying a single component in the appraisal: (1) slope differences, (2) sample size ratios, (3) predictor variances, and (4)  The diverse settings not only include both balanced and unbalanced schemes, but also generate direct-and inverse-pairing with variance component patterns. Note that the empirical investigation involves a total of 90 combined conditions. It should be emphasized that these model configurations are designated to represent the possible extent of characteristics in practical study without unrealistic or excessive settings. In fact, similar setups were considered in Overton [11] and Shieh [12]. Throughout these numerical comparisons, the significance level and nominal power are set as α = 0.05 and 1−β = 0.80, respectively.
With the aforementioned specifications, the necessary sample sizes can be computed by the supplemental SAS/IML (SAS Institute [20]) or R (R Development Core Team [21]) programs for the simplified t and mixed t methods with the power functions defined in Eqs A3 and A6, respectively. The resulting sample sizes {N 1 , N 2 } associated with the slope difference β 1D = {0.05, 0.75, 1.00} in combination with sample size ratio r = {1, 3} are summarized in Tables 1-6, respectively. Accordingly, there are 15 combined cases of ft 2 1 ; t 2 2 g and fs 2 1 ; s 2 2 g in each table.   For ease of illustration, the total sample size N T and the working effect size δ � given in Eq A4 are also presented. As expected, the computed sample sizes increase with smaller magnitude of slope difference β 1D or effect size δ � . Also, the necessary sample sizes associated with larger error variances fs 2 1 ; s 2 2 g are consistently larger than those of the smaller error variances when   all other factors are held constant. However, the situation is reversed in terms of the predictor variances. In other words, the scenarios with larger ft 2 1 ; t 2 2 g incurred smaller sample sizes than those with smaller predictor variances when all other characteristics remain the same.
Also, the unbalanced designs often demand larger sample size to attain the same nominal power than the balanced designs. But there are some exceptions, for example, Cases 3, 11, 12, and 13 in Tables 1 and 2 signify the notable situations. Specifically, the total sample size N T = 320 and 324 are slightly larger than those of N T = 300 and 304 for the two methods associated with Case 3 in Tables 1 and 2, respectively. More importantly, the sample sizes of the simplified t method are relatively smaller than those of the mixed t approach. The only one situation with the same sample sizes is associated with {N 1 , N 2 } = {24, 72} of Case 13 in Table 4 when β 1D = 0.75, ft 2 1 ; t 2 2 g = {4, 1}, and fs 2 1 ; s 2 2 g = {1, 4}. However, with the two different power functions C ST and C MT , the corresponding estimated powers are 0.8164 and 0.8028, respectively. Note that if problems or deficiencies were to be seen with the alternative power and sample size procedures, they would be most apparent with small sample sizes. Thus, some of these sample sizes are designated to be smaller than those in related studies. In order to evaluate the accuracy of the two power functions, the estimated power or computed power are also listed. Because of the underlying metric of integer sample sizes, the attained values are slightly larger than the nominal level for both procedures.

Simulation results
To investigate the accuracy of sample size determinations, Monte Carlo simulations were performed in the second stage. With the designated parameter configurations and computed sample sizes, estimates of the true power were computed via Monte Carlo simulation of 10,000 independent data sets for both approaches. For each replicate, {N 1 , N 2 } predictor values were generated from the selected normal distributions. The resulting values of predictor variables in turn determined the mean responses for generating {N 1 , N 2 } normal outcomes with the chosen simple linear regression models. Without loss of generality, the intercepts of the two linear equations and means of the two normal predictors were set as β 01 = β 02 = θ 1 = θ 2 = 0. Next, the test statistic T � was computed and the simulated power was the proportion of the 10,000 replicates whose test statistics |T � | exceeded the corresponding critical value t^n ; 0:025. Therefore, the adequacy of the two sample size procedures is determined by the error (= estimate power- Table 6. Computed sample size, estimated power, and simulated power when the slope difference β 1D = 1.00, sample size ratio r = 3, Type I error α = 0.05, and nominal power 1−β = 0.80.

Simplified t method
Mixed t method simulated power) between the estimated power computed from analytic formulas and the simulated power of Monte Carlo study. The simulated power and error are also summarized in Tables 1-6 for all 90 design conditions. The results in Tables 1-6 indicate that there exists certain discrepancy between the two competing sample size procedures. The outcomes of the simplified t method deteriorate to some extent for larger slope differences β 1D and sample size ratios r. Correspondingly, the resulting errors in Table 6 with β 1D = 1.00 and r = 3 are clearly greater than those in Table 1 with β 1D = 0.50 and r = 1 for the identical combination of variance patterns {t 2 1 ; t 2 2 } and fs 2 1 ; s 2 2 g. It is notable that the largest error of the simplified t technique is 0.1096 associated with Case 11 in Table 6, whereas the error of Case 11 in Table 1 is only 0.0134. Therefore, the power function C ST does not maintain a consistent performance for the conditions examined here and the corresponding sample size computation is especially less accurate for large effect sizes and unbalanced designs. On the other hand, the power discrepancy suggests a close agreement between the estimated power and the simulated power for the MT procedure regardless of the model configurations. Specifically, all the absolute errors of the 90 designs are less than 0.01 and the only exception has an error of 0.0143 occurred with Case 7 in Table 3. Relatively, the error induced by the simplified t method is 0.0288 for the same setting. From a practical standpoint of providing a versatile solution without specifically confining to any particular settings, the failure to give reliable performance is obvious limitation of the simplified t method. Considering the potentially diverse interaction and predictor configurations of field studies, the mixed t approach is relatively more consistent and accurate than the simplified t method to be considered as a general tool. In addition, the fundamental behaviors of Type I error control of the two contending power procedures were also examined. Following the sample sizes and other settings in Tables 1-6, simulated Type I errors were computed with 10,000 iterations under the null values β 11 = β 12 = 0. The results show that all the simulated Type I error rates are contained within a close range of the nominal level α = 0.05, Specifically, the maximum absolute errors of the simplified t and mixed t methods are 0.0048 and 0.0056, respectively. Hence, the two contending power approximations of the extended Welch test provide adequate Type I error performance to justify the presented power appraisals for the model configurations considered here.

Empirical study
To exemplify the typical situation most frequently encountered in the planning stage of a research study, the numerical data involving the organ weights of crossbred diabetic and normal mice in Bishop [22] is exploited to illustrate the computational aspects of the suggested power and sample size procedures for assessing the interaction effects between a continuous predictor and a dichotomous treatment variable. The example concerns whether the diabetic mice have kidney weights which are disproportionately larger, relative to their body weights, than those of the normal mice.
The subgroup simple regression with body weight (in grams) predicting kidney weight was conducted for the crossed diabetic and normal groups with sample size {N 1 , N 2 } = {9, 25}. The resulting slope estimates and error variances are fb 11 ;b 12 g = {15.9286, 3.8398} and fŝ 2 1 ;ŝ 2 2 g = {10124.8980, 9097.9625}, respectively. Moreover, the summary statistics for the body weights are � X 1 = 42.8889, � X 2 = 33.8800,t 2 1 ¼ SSX 1 =k 1 = 31.1111, andt 2 2 ¼ SSX 2 =k 2 = 23.8600. The traditional analysis leads to a test statistic T = 1.6492 with degrees of freedom df = 30 and pvalue = 0.1098. Alternatively, the computed value of the extended Welch test statistic is T � = 1.6073 with degrees of freedomn = 12.9349 and p-value = 0.1321. Hence, the two tests for a difference between the two slope coefficients is reported as statistically non-significant for α = 0.05. In other words, it is concluded that there is no group effect (crossed diabetic and normal groups) on the relationship between the body weight and kidney weight.

Power and sample size calculations
In view of the advance nature of research design, the practice guidelines suggest that typical sources like published finding or pilot study can give creditable and sensible planning values for the model configurations. The reader is referred to the excellent texts of Cohen [23], Ryan [24], and Shao, Wang, and Chow [25] for essential issues and further details on power analysis and sample size determination. Therefore, the prescribed data of Bishop [22] is employed to provide planning values of the regression parameters, error components, and predictor configurations to guide future diabetic and body weight interaction studies. With the sample sizes of  Moreover, in an actual experiment, the available resources are generally limited and the cost for obtaining a crossbred diabetic mouse is often more expensive than a normal one. Consequently, an unbalanced design may be more efficient and a flexible approach allowing unequal sample size allocation is desired. Using a sample size ratio r = N 2 /N 1 = 3, the necessary sample sizes to attain the nominal power levels of 0.80 and 0.90 are {N 1 , N 2 } = {28, 84} and {37, 111}, respectively. The two computed sample sizes of 112 (= 28 + 84) and 148 (= 37 + 111) are (112-34)/34 = 2.29 and (148-34)/34 = 3.35 times more than is employed for the original design, respectively. The prescribed explications demonstrate the usefulness of power and sample size assessments in planning a desirable research design. In contrast, without a proper and thorough evaluation, it may lead to some distorted power performance and unsatisfactory research outcome for the conducted study. Note that these exemplifying configurations are listed in the user specifications of the SAS/IML and R programs in the supplemental files. Following the numerical demonstration, researchers can easily identify the input statements in the computer code and then replace the required values with their own model specifications.

Conclusions and discussion
Assessing interaction effects of categorical treatment variables or testing the equality of regression slopes is a fundamental procedure in biology, medicine, and other fields of science. The present study focuses on the two-group case because it represents the great majority of interactive studies. In view of the frequent violation of the homogeneity of error variance assumption in applications, the extended Welch approach has been proposed as a vital alternative to the ordinary least squares method for the detection of interaction effect between a dichotomous treatment variable and a continuous predictor. However, the associated power computation and sample size determination for interaction analysis remain unavailable. To alleviate the difficulty in detecting interaction due to insufficient statistical power, this article seeks to contribute to the literature by presenting practical solutions to power and sample size calculations for the extended Welch tests of regression slope difference under variance heterogeneity. Accordingly, two distinct techniques are described to emphasize and account for the stochastic nature of the predictor variables for approximating the general distribution and power behavior of the extended Welch statistic. In addition to the detailed theoretical explications for the inherent relationship and distinct property of the two approaches, numerical investigations are conducted to reveal the relative performance of the contending procedures as a reliable technique to accommodate the predictor features that make interaction research particularly distinctive. The suggested approach has the important advantages over the simplified method in theoretical justification, great accuracy, and wide applicability. Due to the limitation of existing software packages and the practical usefulness of the proposed methodology, computer algorithms are presented to implement the recommended power calculations and sample size determinations.
Within the context of statistical point estimation, unbiasedness is an essential principle in the selection of desirable point estimators. However, the unbiased criterion does not guarantee the chance of obtaining a correct estimate. For example, it is well known that the sample mean is unbiased of the population mean. Although the probability of making a correct estimate is zero for the case of continuous variables, the sample mean was not deprived of its role as a reliable and established estimator. For the problem of sample size determination, some researchers may suggest that it makes no difference to apply a simple formula given the many unknowns in the planning stage of research design. Nevertheless, the underlying principle of research design planning is to conduct advance analysis based on the best knowledge of the model configurations. It is prudent to adopt a feasible procedure that incorporates into power and sample size calculations as much information about the model configurations as possible. Note that hierarchical linear models (Raudenbush & Bryk [26]) have the distinct structure for accommodating the notion of random coefficients in the traditional linear models. Here, the stochastic nature of predictor variables is taken into account for the described power and sample size methodology for interaction research. Essentially, the developed algorithms will yield accurate power appraisal and sample size determination provided that all the required features are suitably specified.
In addition to the relaxed scenario of variance heterogeneity, it should be stressed that the recommended power and sample size approaches also require the independence and normality assumptions. The robust feature clearly depends on the extent of how poorly the underlying response and predictor distributions diverge from the normality setting. However, Poncet et al. [27] and Schmider et al. [28] reinvestigated the robustness of two-sample t test and ANOVA F test against violations of the normal distribution assumption by Monte Carlo simulations. Their results gave strong support for the robustness of the t and F tests under application of non-normally distributed data. Although they did not evaluate the Welch method, the updated findings in some sense reinforce the documented robustness for the Welch-type procedures under mild departures from normality assumption. More importantly, Lix and Keselman [29] noted that the Welch-type tests preserve reasonably good Type I error control and compare favorably with the traditional method in terms of statistical power for many types and degrees of non-normality likely to be encountered in practical research. Moreover, the established class of generalized linear models offers an excellent and distinct method for analyzing non-normal data. Related details and follow-up procedures can be found in McCullagh and Nelder [30] and McCulloch, Searle, and Neuhaus [31].