Applications of the Wei-Lachin Multivariate One-Sided Test for Multiple Outcomes on Possibly Different Scales

Many studies aim to assess whether a therapy has a beneficial effect on multiple outcomes simultaneously relative to a control. Often the joint null hypothesis of no difference for the set of outcomes is tested using separate tests with a correction for multiple tests, or using a multivariate $T^2$-like MANOVA or global test. However, a more powerful test in this case is a multivariate one-sided or one-directional test directed at detecting a simultaneous beneficial treatment effect on each outcome, though not necessarily of the same magnitude. The Wei-Lachin test is a simple 1 df test obtained from a simple sum of the component statistics that was originally described in the context of a multivariate rank analysis. Under mild conditions this test provides a maximin efficient test of the null hypothesis of no difference between treatment groups for all outcomes versus the alternative hypothesis that the experimental treatment is better than control for some or all of the component outcomes, and not worse for any. Herein applications are described to a simultaneous test for multiple differences in means, proportions or life-times, and combinations thereof, all on potentially different scales. The evaluation of sample size and power for such analyses is also described. For a test of means of two outcomes with a common unit variance and correlation 0.5, the sample size needed to provide 90% power for two separate one-sided tests at the 0.025 level is 64% greater than that needed for the single Wei-Lachin multivariate one-directional test at the 0.05 level. Thus, a Wei-Lachin test with these operating characteristics is 39% more efficient than two separate tests. Likewise, compared to a $T^2$-like omnibus test on 2 df, the Wei-Lachin test is 32% more efficient. An example is provided in which the Wei-Lachin test of multiple components has superior power to a test of a composite outcome.


Introduction
In many studies an objective is to assess whether an experimental therapy (E) versus control (C) has beneficial effects on multiple component outcomes. This is becoming increasingly common in the evaluation of the comparative effectiveness of therapies. For example, the NIDDK-funded "Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness" (GRADE) Study will compare four agents commonly used to control glucose levels in type 2 (adult) diabetes [1], clinicaltrials.gov NCT01794143. The primary objective is to evaluate the durability of glucose control over 3-6 years of treatment, the primary outcome being the time to a confirmed rise of HbA1c (a measure of average glucose levels) to ≥7% (the therapeutic target being a value <7%) using a logrank test. A secondary objective is to compare each pair of treatments with respect to multiple components of effectiveness, specifically whether one treatment is superior to the other with respect to durability of control (event times), absence of hypoglycemia over 3 years of treatment (proportions), and a lower mean body weight at 3 years. Herein we describe how such a test could be conducted and evaluate the power of the test or the required sample size.
For illustration, throughout we consider the case of two outcomes, say A and B, although all the procedures herein generalize to ≥2 outcomes. We wish to test the null hypothesis $H_0: (A_E \cong A_C) \cap (B_E \cong B_C)$ that the experimental therapy is equivalent to control for both outcomes versus the alternative $H_1: (A_E \succeq A_C) \cap (B_E \succeq B_C)$ with at least one strict superiority, where "$\cong$" means equality for an outcome and where "$\succ$" means superiority. The test against such an alternative is called a multivariate one-directional (or one-sided) test.
Wei and Lachin [2] proposed a simple 1 df test for such a hypothesis that was described as a test against an ordered alternative, or a test of stochastic ordering. The test was later studied by Lachin [3] and Frick [4,5]. Herein the application of this test to multiple outcomes is described for a test of means, a test of proportions, a test of event times and a test with mixed components such as where one outcome is quantitative (using means) and another qualitative (using proportions). For each application, equations are also derived for evaluation of sample size and power of the test. Multiple model-based tests are also described. For an analysis of multiple mean differences we show that the Wei-Lachin test is more powerful than an analysis based on either separate tests for each outcome, multiplicity adjusted, or a multivariate $T^2$-like omnibus test. An example from a major clinical trial is presented.
Many other tests have also been proposed, principally in the setting of tests for differences in means. These are reviewed in the discussion section.

Wei-Lachin Multivariate One-Directional Test and Its Power
Three versions of the Wei-Lachin test are described. The first employs the measurements on their original scale. This test, however, is not invariant to scale transformations of the individual components. Two scale-invariant tests are also described, one based on standardized values and another based on scale-independent Z-tests.

Scale-Based Test For Multiple Outcomes
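Let $X_{ij}$ designate the jth outcome variable in the ith group with expectation $E(X_{ij}) = \mu_{ij}$, i = E, C; j = a, b. The subscripts a, b are used throughout to refer to the two outcomes. The jth outcome could be a quantitative measure or a binary variable (among others). Assume that a more favorable outcome is represented by a decreasing expectation for X. Let $\delta_j = \mu_{Cj} - \mu_{Ej}$ denote the group difference for the jth outcome, so that positive values favor the experimental therapy. The hypotheses of interest are then
$$H_0: \delta_a = \delta_b = 0 \tag{1}$$
$$H_{1S}: \delta_a \ge 0 \text{ and } \delta_b \ge 0, \text{ with } \delta_a > 0 \text{ or } \delta_b > 0. \tag{2}$$
Thus, $H_{1S}$ designates that the experimental therapy is at least as effective as control for both outcomes and is superior to control for either or both outcomes. This is called the multivariate one-directional hypothesis.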
In the context of an analysis of repeated measures, or multivariate observations, Wei and Lachin [2] described a multivariate one-directional test, what they termed a test of stochastic ordering, i.e. a test of the null hypothesis that is directed towards an alternative hypothesis of the form $H_{1S}$ in (2). Lachin [3,6] contrasts this test with other tests, such as the omnibus test.
Consider group-specific estimates $\hat\mu_{ij}$ with expectation $\mu_{ij}$. Let $\hat d_j = \hat\mu_{Cj} - \hat\mu_{Ej}$ and $\hat D = (\hat d_a \;\; \hat d_b)'$, where asymptotically
$$\hat D \sim N(\Delta, \Sigma), \qquad \Delta = (\delta_a \;\; \delta_b)', \qquad \Sigma = \begin{pmatrix} \sigma_a^2 & \sigma_{ab} \\ \sigma_{ab} & \sigma_b^2 \end{pmatrix}. \tag{3}$$
The Wei-Lachin test is then provided by
$$Z_S = \frac{J'\hat D}{\sqrt{J'\hat\Sigma J}} = \frac{\hat d_a + \hat d_b}{\hat\sigma_S}, \qquad \hat\sigma_S^2 = \hat V(\hat d_a + \hat d_b) = \hat\sigma_a^2 + \hat\sigma_b^2 + 2\hat\sigma_{ab}, \tag{5}$$
using consistent estimates of the variances and covariance, where $J = (1 \;\; 1)'$. Asymptotically $Z_S \sim N(0,1)$ under $H_0$ from Slutsky's theorem. The test rejects $H_0$ in favor of $H_{1S}$ when $Z_S \ge Z_{1-\alpha}$ at level $\alpha$ one-sided. The above generalizes to K > 2 outcomes. Note that the test can also be obtained from the unweighted average of the group differences relative to its standard error, which provides a convenient average measure of the group differences when all outcomes are measured on the same scale.
Specific applications include a large sample test of means [3] or proportions [7]; a generalized linear regression model using quasi likelihoods with a covariance matrix estimated using the information sandwich, i.e. GEE [8]; a normal errors model for the analysis of repeated measures [9]; or a proportional hazards model using the information sandwich [10]. Alternatively, these estimates can be based on a distribution-free estimate such as the Mann-Whitney difference that provides a Wilcoxon test [3,11] with the Wei-Lachin [2] estimate of the covariance matrix. These and other methods allow for some observations for some outcomes in some subjects to be missing either completely at random or at random (conditionally).
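As a minimal sketch of the computation just described, the following Python function forms the sum of the estimated differences and divides by the square root of the sum of all elements of the estimated covariance matrix. The function name `wei_lachin_z` and the numerical inputs are hypothetical and chosen only for illustration; the differences are assumed to be signed so that positive values favor the experimental therapy.

```python
from math import sqrt
from statistics import NormalDist


def wei_lachin_z(d_hat, sigma_hat):
    """Return (Z_S, one-sided p-value) for K outcomes.

    d_hat     : list of K estimated group differences, signed so that
                positive values favor the experimental therapy
    sigma_hat : K x K estimated covariance matrix of d_hat (list of lists)
    """
    total = sum(d_hat)                              # J' D-hat
    var_sum = sum(sum(row) for row in sigma_hat)    # J' Sigma-hat J
    z = total / sqrt(var_sum)
    p_one_sided = 1.0 - NormalDist().cdf(z)
    return z, p_one_sided


# Hypothetical estimates, not taken from any study
z_s, p_s = wei_lachin_z([0.30, 0.20], [[0.02, 0.01], [0.01, 0.03]])
reject_at_005 = z_s >= NormalDist().inv_cdf(0.95)
```

Because the statistic depends on the data only through the sum of the differences and the sum of the covariance elements, the same function applies unchanged to more than two outcomes.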
Although often termed a multivariate one-directional (one-sided) test, it is possible to conduct a two-sided one-directional test that either E is superior to C for all components, or C is superior to E. In that case, the Wei-Lachin 1 df test statistic is referred to the two-sided critical value rather than the one-sided value. Herein we describe the one-sided test.
If beneficial values of $X_a$ are lower, but those for $X_b$ are higher, such as for a test of LDL and HDL, respectively, then the test would be constructed using the negative of the values for $X_b$ such that $\delta_b = \mu_{Eb} - \mu_{Cb}$. If higher values of both measures demonstrate benefit for the treatment, then both $\delta_a$ and $\delta_b$ can be defined as the difference of treated minus control. This test would be appropriate when all of the outcome measurements were on the same scale; for example, as for a test of a beneficial effect on both systolic and diastolic blood pressure (both mm Hg), or a test of a beneficial effect on both LDL and HDL (both mg/dl). Other variations described below would be appropriate for outcomes with different variances, or measures on different scales or mixtures of different types of measures, such as A being a quantitative variable and B being a binary variable.
An alternative approach commonly applied to test the superiority of an experimental therapy is to base the inference on the two separate one-sided tests. These tests would require a correction for multiple tests such as using the Holm [12] improved Bonferroni procedure which requires that the minimum of the two p-values be ≤0.025 (one-sided) and the other ≤0.05 in order to declare significance at the 0.05 level for the two tests. The corresponding alternative hypothesis is
$$H_{1P}: \delta_a > 0 \text{ or } \delta_b > 0. \tag{6}$$
However, the alternative $H_{1P}$ includes the case where the experimental therapy is beneficial for one outcome but harmful for the other, such as where $\delta_a > 0$ and $\delta_b < 0$ or vice versa.
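The Holm step-down rule for two one-sided p-values can be sketched as follows. This is only an illustration of the decision rule stated above (smallest p-value compared with 0.025, the other with 0.05); the function name `holm_two_one_sided` and the example p-values are hypothetical.

```python
def holm_two_one_sided(p_a, p_b, alpha=0.05):
    """Holm step-down for two one-sided p-values.

    Returns (reject_a, reject_b). The smaller p-value is compared with
    alpha/2; only if it is rejected is the larger compared with alpha.
    """
    ordered = sorted([("a", p_a), ("b", p_b)], key=lambda item: item[1])
    reject = {"a": False, "b": False}
    for step, (name, p_value) in enumerate(ordered):
        threshold = alpha / (2 - step)   # alpha/2 first, then alpha
        if p_value <= threshold:
            reject[name] = True
        else:
            break                        # stop at the first non-rejection
    return reject["a"], reject["b"]


both = holm_two_one_sided(0.02, 0.04)     # both outcomes significant
neither = holm_two_one_sided(0.03, 0.04)  # smallest p exceeds 0.025
only_a = holm_two_one_sided(0.01, 0.20)   # only outcome A significant
```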
Yet another possible test would be the omnibus test using a $T^2$-like test of the null hypothesis $H_0$ versus $H_{1O}: \delta_a \ne 0$ or $\delta_b \ne 0$ that is provided by
$$T^2 = \hat D'\hat\Sigma^{-1}\hat D \tag{7}$$
which is asymptotically distributed as chi-square on 2 (or more generally K) df. This is likewise inappropriate because the alternative includes cases where the experimental therapy is worse than control for either or both outcomes.
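The point about the omnibus alternative can be illustrated numerically. The sketch below assumes the usual quadratic form for the two-outcome omnibus statistic (the estimated differences weighted by the inverse of their covariance matrix) and compares it with the simple sum statistic. The helper names and the inputs are hypothetical. The omnibus statistic is identical whether both differences favor E or both favor C, whereas the sum statistic changes sign.

```python
from math import exp, sqrt


def omnibus_t2(d_hat, sigma_hat):
    """Quadratic form D' Sigma^{-1} D for two outcomes and its 2 df p-value."""
    (s_aa, s_ab), (_, s_bb) = sigma_hat
    det = s_aa * s_bb - s_ab ** 2
    d_a, d_b = d_hat
    t2 = (s_bb * d_a ** 2 - 2.0 * s_ab * d_a * d_b + s_aa * d_b ** 2) / det
    p_value = exp(-t2 / 2.0)   # chi-square upper tail, exact for 2 df
    return t2, p_value


def sum_z(d_hat, sigma_hat):
    """Sum of the two differences divided by the SE of the sum."""
    (s_aa, s_ab), (_, s_bb) = sigma_hat
    return (d_hat[0] + d_hat[1]) / sqrt(s_aa + 2.0 * s_ab + s_bb)


sigma = [[0.02, 0.01], [0.01, 0.03]]   # hypothetical covariance matrix
benefit = [0.30, 0.20]                 # both differences favor E
harm = [-0.30, -0.20]                  # both differences favor C

t2_benefit, p_benefit = omnibus_t2(benefit, sigma)
t2_harm, p_harm = omnibus_t2(harm, sigma)
z_benefit = sum_z(benefit, sigma)
z_harm = sum_z(harm, sigma)
```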

Maximin Efficiency of the Wei-Lachin Test
For the case of two measures as herein, the restricted alternative multivariate one-directional hypothesis $H_{1S}$ in (2) corresponds to all points in the positive orthant of the two-dimensional parameter space for $(\delta_a, \delta_b)$. Since the test is a sum of the two estimates, the rejection region is defined by the line of values $(\hat d_a, \hat d_b)$ satisfying $Z_S = Z_{1-\alpha}$ that simply connects the points $(d_\alpha, 0)$ and $(0, d_\alpha)$ where $d_\alpha = Z_{1-\alpha}\hat\sigma_S$. Thus the rejection region principally includes an area of the positive orthant away from the origin, but also includes elements of the sample space where either $\hat d_a < 0$ or $\hat d_b < 0$, but not both. With large sample sizes, the probability of such points is negligible for true values $(\delta_a, \delta_b)$ away from zero, i.e. towards the central projection (the 45° line) of the positive orthant. Lachin [6] provides figures to illustrate these relationships.
For a given pair of values $\Delta_1 = (\delta_{a1} \;\; \delta_{b1})'$ specifying a point in the positive orthant $(\delta_{a1}, \delta_{b1})$, it is readily shown [13] that the optimal likelihood ratio test of $H_0: \Delta = (0 \;\; 0)'$ versus the point alternative $H_{\Delta_1}: \Delta = \Delta_1$ based on (3) is
$$X^2_{LR} = \frac{(\Delta_1'\hat\Sigma^{-1}\hat D)^2}{\Delta_1'\hat\Sigma^{-1}\Delta_1}$$
where $X^2_{LR}$ is distributed as chi-square on 1 df under $H_0$. Note that $X^2_{LR}$ is based on a weighted sum of the estimated differences with weights $\Delta_1'\hat\Sigma^{-1}$. Thus, for a given $\Sigma$, every point $\Delta_1 = (\delta_{a1}, \delta_{b1})$ that defines a unique alternative hypothesis value in the two-dimensional parameter space entails a different optimal linear combination of the observed $\hat D$. Further, the same weights are optimal for any alternative hypothesis defined by points proportional to $(\delta_{a1}/\sigma_a, \delta_{b1}/\sigma_b)$ with the same correlation, such as the point $(c\delta_{a1}/\sigma_a, c\delta_{b1}/\sigma_b)$ for any $c > 0$. This implies that the same weights would be optimal for all points in the parameter space falling on the vector projection defined by the specified $(\delta_{a1}/\sigma_a, \delta_{b1}/\sigma_b)$. Thus, there are an infinite number of alternative hypotheses corresponding to all possible projections in the positive orthant, each with a different optimal test. Unfortunately it is not known which projection is optimal since the actual parameter values $(\delta_a, \delta_b)$ are unknown. However, Frick [4,5] showed that the Wei-Lachin test is maximin efficient with respect to whichever weighted test is in fact optimal under the condition that $\hat\Sigma J \ge 0$. That is, among the family of linear combinations of the estimates, the Wei-Lachin test minimizes the maximum loss in efficiency (power) relative to the unknown optimal linear combination when this condition applies, in which case it is the optimal robust linear test of $H_{0S}$ versus $H_{1S}$. For two or more measures with positive correlations, as would be the case under the alternative hypothesis, Frick's condition $\Sigma J \ge 0$ is satisfied.
When this simple condition does not apply, Frick [4] shows that a simple weighted test is provided by
$$Z_{S,L} = \frac{L'\hat D}{\sqrt{L'\hat\Sigma L}} \tag{10}$$
that is also maximin efficient where $L$ satisfies the restriction $L'\hat\Sigma J = 1$. For a given $\hat\Sigma$, the vector $L$ is obtained as $L' = B'\hat\Sigma^{-1}$ where $B$ is the quadratic program solution to $\min_y\,[y'\hat\Sigma^{-1}y]$ under the constraints that $y_i \ge 0\ \forall i$ and $y'J = 1$, so that $L'\hat\Sigma J = B'J = 1$. This test will principally be required in cases where the null hypothesis applies, or the treatment is inferior for some of the component outcome measures. A SAS program for this computation is available from the author (see Discussion).
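For two outcomes the quadratic program has a closed-form solution, which the following sketch uses in place of a general solver. It assumes that the weight vector is the inverse covariance matrix applied to the program solution B (which makes the stated restriction hold automatically, since the elements of B sum to one) and that the weighted statistic is the weighted sum divided by its standard error. The function names and inputs are hypothetical, and this is not the SAS program mentioned in the text.

```python
from math import sqrt


def frick_weights_2(sigma):
    """Weights L = Sigma^{-1} B for two outcomes.

    B minimizes y' Sigma^{-1} y subject to y >= 0 and y_a + y_b = 1.
    With y = (t, 1 - t) the objective is convex in t, so the constrained
    solution is the unconstrained minimizer clipped to [0, 1].
    """
    (s_aa, s_ab), (_, s_bb) = sigma
    det = s_aa * s_bb - s_ab ** 2
    a_aa, a_ab, a_bb = s_bb / det, -s_ab / det, s_aa / det   # Sigma^{-1}
    t = (a_bb - a_ab) / (a_aa + a_bb - 2.0 * a_ab)
    t = min(1.0, max(0.0, t))
    b = (t, 1.0 - t)
    return (a_aa * b[0] + a_ab * b[1], a_ab * b[0] + a_bb * b[1])


def weighted_z(d_hat, sigma, weights):
    """Weighted statistic L'D / sqrt(L' Sigma L) for two outcomes."""
    (s_aa, s_ab), (_, s_bb) = sigma
    l_a, l_b = weights
    num = l_a * d_hat[0] + l_b * d_hat[1]
    var = l_a ** 2 * s_aa + 2.0 * l_a * l_b * s_ab + l_b ** 2 * s_bb
    return num / sqrt(var)


# Positive covariance: Sigma J >= 0, so the weights are equal and the
# weighted test coincides with the simple sum test.
sigma_pos = [[1.0, 0.5], [0.5, 1.0]]
w_pos = frick_weights_2(sigma_pos)
z_pos = weighted_z([1.0, 1.0], sigma_pos, w_pos)

# Strongly negative covariance: the first element of Sigma J is negative,
# so unequal weights result.
sigma_neg = [[1.0, -1.5], [-1.5, 4.0]]
w_neg = frick_weights_2(sigma_neg)
restriction = w_neg[0] * (1.0 - 1.5) + w_neg[1] * (-1.5 + 4.0)   # L' Sigma J
```

In the first case the weights are proportional to (1, 1), which is consistent with the statement that the unweighted sum is itself maximin efficient when the condition holds.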

Scale-based Test for Multiple Means
To illustrate the construction of the Wei-Lachin test, consider a large sample test for a difference between groups in the means of two outcomes where it is assumed that $X_{ij} \sim f(\mu_{ij}, \psi_{ij}^2)$ with some distribution $f$ where $\psi_{ij}^2 = V(X_{ij})$ is the variance of the observations for the jth outcome in the ith group, or the residual variance after adjusting for other covariates, and $\psi_{iab} = Cov(X_{ia}, X_{ib})$, i = E, C; j = a, b. To simplify, assume that there is a common covariance matrix in the two groups (homoscedasticity) with correlation $\rho_{ab} = \psi_{ab}/(\psi_a\psi_b)$. Then asymptotically
$$\begin{pmatrix}\hat\mu_{ia}\\ \hat\mu_{ib}\end{pmatrix} \sim N\left[\begin{pmatrix}\mu_{ia}\\ \mu_{ib}\end{pmatrix}, \begin{pmatrix}\dfrac{\psi_a^2}{n_{ia}} & \dfrac{\psi_{ab}n_{iab}}{n_{ia}n_{ib}}\\[2ex] \dfrac{\psi_{ab}n_{iab}}{n_{ia}n_{ib}} & \dfrac{\psi_b^2}{n_{ib}}\end{pmatrix}\right]$$
where $(n_{ia}, n_{ib}, n_{iab})$ are the numbers in the ith group with observed values for outcome A and B separately and jointly, i = E, C [3].
Then $\hat D = (\hat d_a \;\; \hat d_b)'$ is asymptotically distributed as in (3) where the variances $\psi_a^2$, $\psi_b^2$ and covariance $\psi_{ab}$ can be estimated directly from the available observations [3] under the homoscedasticity assumption. The estimated variance of the sum of mean differences is
$$\hat\sigma_S^2 = \sum_{i=E,C}\left[\frac{\hat\psi_a^2}{n_{ia}} + \frac{\hat\psi_b^2}{n_{ib}} + \frac{2\hat\psi_{ab}n_{iab}}{n_{ia}n_{ib}}\right]. \tag{13}$$
These then provide the test statistic $Z_S$ in (5), or $Z_{S,L}$ in (10) if Frick's condition is not satisfied.
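The variance of the sum of the two mean differences can be sketched as below, assuming a common covariance matrix in the two groups and allowing different numbers of subjects with each outcome observed. Each group contributes the variance of each mean plus twice the covariance of the two means, the latter scaled by the number of subjects with both outcomes observed. The function name, argument layout and counts are hypothetical.

```python
def var_sum_mean_diffs(psi_a, psi_b, psi_ab, counts):
    """Variance of (d_a + d_b) under a common covariance matrix.

    psi_a, psi_b : standard deviations of outcomes A and B
    psi_ab       : covariance of A and B within a group
    counts       : dict mapping group label to (n_a, n_b, n_ab), the numbers
                   with A observed, B observed, and both observed
    """
    total = 0.0
    for n_a, n_b, n_ab in counts.values():
        total += psi_a ** 2 / n_a
        total += psi_b ** 2 / n_b
        total += 2.0 * psi_ab * n_ab / (n_a * n_b)
    return total


# Hypothetical design: 100 per group, no missing observations
complete = {"E": (100, 100, 100), "C": (100, 100, 100)}
var_complete = var_sum_mean_diffs(13.0, 7.0, 54.6, complete)

# Same design with some observations missing for each outcome
partial = {"E": (90, 80, 75), "C": (100, 95, 90)}
var_partial = var_sum_mean_diffs(13.0, 7.0, 54.6, partial)
```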

Standardized Score Test for Multiple Means
For an analysis of the means of quantitative variables, the Wei-Lachin test Z S is not invariant to a change of scale for either of the two measures. In cases where there is a mixture of quantitative variables with different dispersions or units, such as LDL measured in mg/dl and systolic blood pressure measured in mm Hg, it is more meaningful to compute a scale-invariant test using the average of the corresponding standardized differences. This might also be preferred when the variances of the measures differ substantially, even though measured on the same scale.
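Let $Y_{ij}$ denote the standardized value $Y_{ij} = X_{ij}/\psi_j$ with $V(Y_{ij}) = 1$. Then the standardized difference between groups for the jth outcome is $\hat d_j/\hat\psi_j$. The resulting standardized Wei-Lachin test is then provided by
$$Z_{S,Y} = \frac{\hat d_a/\hat\psi_a + \hat d_b/\hat\psi_b}{\hat\sigma_{S,Y}}, \qquad \sigma_{S,Y}^2 = \sum_{i=E,C}\left[\frac{1}{n_{ia}} + \frac{1}{n_{ib}} + \frac{2\rho_{ab}n_{iab}}{n_{ia}n_{ib}}\right] \tag{16}$$
that is consistently estimated from the estimate of the correlation $\hat\rho_{ab}$. When the variances of the outcomes are equal ($\hat\psi_a = \hat\psi_b$), then $Z_{S,Y} = Z_S$. With equal sample sizes and no missing values, $n_{ia} = n_{ib} = n_{iab} = n = N/2$ (i = E, C), then $\sigma_{S,Y}^2 = 4(1 + \rho_{ab})/n = 8(1 + \rho_{ab})/N$. As above, with positive correlations, Frick's condition $\Sigma_Y J \ge 0$ is satisfied. If not, then the weighted test is provided by $Z_{S,L}$ using $\hat D_Y = (\hat d_a/\hat\psi_a \;\; \hat d_b/\hat\psi_b)'$ and $\hat\Sigma_Y$ in (10).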

Z-Based Test
In some cases, it may be desired to conduct a test with mixtures of quantitative and qualitative outcomes (or other types), e.g. combining tests for means, proportions and/or life-times. In such cases a multivariate one-directional test with respect to the multiple outcomes can be obtained from a combination of the individual Z-test values of the form
$$Z_{S,z} = \frac{z_a + z_b}{\sqrt{\hat V(z_a + z_b)}} = \frac{z_a + z_b}{\sqrt{2(1 + \hat\rho_z)}} \tag{19}$$
where $z_j = \hat d_j/\hat\sigma_j$ and the covariance matrix of the Z-tests ($\Sigma_z$) has unit variances and covariance $\rho_z = Corr(\hat d_a, \hat d_b) = \sigma_{ab}/(\sigma_a\sigma_b)$. Under the alternative hypothesis where the components $\{\hat d_j\}$ or $\{z_j\}$ are expected to be positive, then the covariance will likewise be expected to be positive and Frick's condition $\Sigma_z J \ge 0$ is readily satisfied. If this condition is not satisfied, we would use the test $Z_{S,L}$ using $Z = (z_a \;\; z_b)'$ and $\hat\Sigma_z$ in lieu of $\hat D$ and $\hat\Sigma$ in (10).
It should be noted that this Z-based test is analogous to the Gastwirth [14] maximin efficient robust test (MERT) that is obtained using the sum of the extreme Z-tests from a set of tests against a closed family of alternatives. For a family with only 2 alternatives (or tests), the MERT is equivalent to the above Z-based test.

Comparison of the Tests for Means
When the variances are equal ($\hat\psi_a = \hat\psi_b$), it can readily be shown that the standardized scores test equals the scale-based test ($Z_S = Z_{S,Y}$) regardless of the sample sizes or sample fractions. When the group sample sizes are equal with no missing values, it can also be shown that the standardized scores test equals the Z-based test ($Z_{S,Y} = Z_{S,z}$). When both the variances and sample sizes are equal, then all three tests are equal.
Direct computation of the three tests ($Z_S$, $Z_{S,Y}$, $Z_{S,z}$) over a range of sample sizes, variances and group differences shows that $\bar Z_{S,z} \approx 1.009\,\bar Z_{S,Y} \approx 1.032\,\bar Z_S$, i.e. with the given proportionalities. Thus, $Z_{S,Y}$ and $Z_{S,z}$ are virtually equivalent with $corr(Z_{S,Y}, Z_{S,z}) = 0.988$ over the range of alternatives considered. These two tests are about 3% greater than the scale-based test with respective correlations of 0.977 and 0.953. Thus, on this basis the standardized scores or Z-based test would appear to be preferable.
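The two equalities stated above can be checked numerically with a small sketch that computes all three statistics from summary quantities for a test of two means. The formulas assume a common covariance matrix in the two groups, and the Z-based statistic assumes unit variances for the component Z-values with correlation equal to that of the two estimated differences. The function name and all inputs are hypothetical, and this does not reproduce the range of settings used for the comparison reported in the text.

```python
from math import sqrt


def three_tests(d_a, d_b, psi_a, psi_b, rho_ab, counts):
    """Return (Z_S, Z_SY, Z_Sz) for two mean differences.

    counts maps each group label to (n_a, n_b, n_ab).
    """
    psi_ab = rho_ab * psi_a * psi_b
    inv_a = sum(1.0 / n_a for n_a, _, _ in counts.values())
    inv_b = sum(1.0 / n_b for _, n_b, _ in counts.values())
    joint = sum(n_ab / (n_a * n_b) for n_a, n_b, n_ab in counts.values())

    # Variances and covariance of the two estimated differences
    var_a = psi_a ** 2 * inv_a
    var_b = psi_b ** 2 * inv_b
    cov_ab = psi_ab * joint

    # Scale-based test: sum of raw differences
    z_s = (d_a + d_b) / sqrt(var_a + var_b + 2.0 * cov_ab)

    # Standardized-score test: sum of differences in SD units
    z_sy = (d_a / psi_a + d_b / psi_b) / sqrt(inv_a + inv_b + 2.0 * rho_ab * joint)

    # Z-based test: sum of the component Z-values
    rho_z = cov_ab / sqrt(var_a * var_b)
    z_sz = (d_a / sqrt(var_a) + d_b / sqrt(var_b)) / sqrt(2.0 * (1.0 + rho_z))
    return z_s, z_sy, z_sz


# Equal variances, unequal group sizes with missing data: Z_S equals Z_SY
unequal_n = {"E": (90, 80, 75), "C": (100, 95, 90)}
eq_var = three_tests(2.0, 3.0, 10.0, 10.0, 0.5, unequal_n)

# Unequal variances, equal group sizes without missing data: Z_SY equals Z_Sz
equal_n = {"E": (100, 100, 100), "C": (100, 100, 100)}
eq_n = three_tests(3.25, 1.75, 13.0, 7.0, 0.6, equal_n)
```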

General Expressions for Power and Sample Size for the Tests
For each variation of the test, expressions for the evaluation of sample size and power are readily obtained.
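Let $\sigma_S^2 = V(\hat d_a + \hat d_b)$ denote the variance of the sum of the estimated differences that may be a function of $(\delta_a, \delta_b)$ depending on the underlying model. Also, let $\sigma_S^2 = \phi_S^2/N$ represent the factorization of this variance into a term $\phi_S^2$ and $N$. Therefore, from standard equations [15], the power of the test to reject $H_0$ is provided by
$$Z_{1-\beta} = \frac{\sqrt{N}\,(\delta_a + \delta_b)}{\phi_S} - Z_{1-\alpha} \tag{20}$$
and $\phi_S^2 = \phi_a^2 + \phi_b^2 + 2\phi_{ab}$, where the variance $\sigma_S^2$ is factored as presented below. Conversely, the sample size required to provide power $1-\beta$ to detect specified values $(\delta_a, \delta_b)$ is provided by
$$N = \left[\frac{(Z_{1-\alpha} + Z_{1-\beta})\,\phi_S}{\delta_a + \delta_b}\right]^2. \tag{22}$$
To evaluate these equations, it is necessary to provide the components of $\phi_S^2$, i.e. $(\phi_a^2, \phi_b^2, \phi_{ab})$, and to specify the values $(\delta_a, \delta_b)$ representing the minimal degree of superiority of treatment for both outcomes of clinical interest.
For the standardized scores test in (16) the variance is likewise factored as $\sigma_{S,Y}^2 = \phi_{S,Y}^2/N$, power is obtained from
$$Z_{1-\beta} = \frac{\sqrt{N}\,(\delta_a/\psi_a + \delta_b/\psi_b)}{\phi_{S,Y}} - Z_{1-\alpha} \tag{23}$$
and the required sample size from
$$N = \left[\frac{(Z_{1-\alpha} + Z_{1-\beta})\,\phi_{S,Y}}{\delta_a/\psi_a + \delta_b/\psi_b}\right]^2. \tag{24}$$
Likewise, for the Z-based test in (19), power is obtained from
$$Z_{1-\beta} = \frac{\sqrt{N}\,(\delta_a/\phi_a + \delta_b/\phi_b)}{\sqrt{2(1 + \rho_z)}} - Z_{1-\alpha} \tag{25}$$
and the required sample size from
$$N = \left[\frac{(Z_{1-\alpha} + Z_{1-\beta})\sqrt{2(1 + \rho_z)}}{\delta_a/\phi_a + \delta_b/\phi_b}\right]^2$$
where $\rho_z = Corr(\hat d_a, \hat d_b) = \phi_{ab}/(\phi_a\phi_b)$. Expressions for the correlation are provided below for specific cases. Also, each of the above expressions for power can be expressed as $E(Z) = Z_{1-\alpha} + Z_{1-\beta}$ where $E(Z)$ is also termed the noncentrality parameter of the test. Thus, the first term on the right hand side of (20), (23) and (25) is the respective expression for $E(Z)$.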
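The relation between the noncentrality parameter, power and sample size described above can be sketched in a few lines. The sketch assumes that the expected value of the test statistic equals the sum of the one-sided critical value and the standard normal deviate for power, and that the variance of the combined estimate factors as a constant divided by N. The function names are hypothetical; `effect_sum` stands for the combined effect in the numerator of the test and `phi` for the square root of the variance factor.

```python
from math import sqrt
from statistics import NormalDist

_N01 = NormalDist()


def power_one_directional(n_total, effect_sum, phi, alpha=0.05):
    """Power of a one-sided 1 df test with E(Z) = sqrt(N) * effect_sum / phi."""
    expected_z = sqrt(n_total) * effect_sum / phi
    return _N01.cdf(expected_z - _N01.inv_cdf(1.0 - alpha))


def sample_size_one_directional(effect_sum, phi, alpha=0.05, power=0.90):
    """Total N for which E(Z) = Z_(1-alpha) + Z_(1-beta)."""
    z_sum = _N01.inv_cdf(1.0 - alpha) + _N01.inv_cdf(power)
    return (z_sum * phi / effect_sum) ** 2


# Hypothetical case: two outcomes with unit variance, correlation 0.5,
# equal groups, and a difference of one SD on each outcome, so that the
# variance factor is 4 * (1 + 1 + 2 * 0.5) = 12.
n_needed = sample_size_one_directional(2.0, sqrt(12.0))
power_check = power_one_directional(n_needed, 2.0, sqrt(12.0))
```

The same two functions serve the standardized-score and Z-based variations once the corresponding combined effect and variance factor are supplied.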

Sample Size and Power for Tests for Means
To assess sample size and power for a test, let $E(n_{ia}, n_{ib}, n_{iab}) = N(\xi_{ia}, \xi_{ib}, \xi_{iab})$ denote the expected numbers observed in the ith group, where N is the total sample size in the two groups with at least one observed measurement (not including any subject missing both A and B measurements).

The Scale-Based Test
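From (13), $\sigma_S^2 = \phi_S^2/N$ with
$$\phi_S^2 = \sum_{i=E,C}\left[\frac{\psi_a^2}{\xi_{ia}} + \frac{\psi_b^2}{\xi_{ib}} + \frac{2\psi_{ab}\xi_{iab}}{\xi_{ia}\xi_{ib}}\right].$$
When the groups are of equal size with the same fractions $(\xi_a, \xi_b, \xi_{ab})$ observed in each group,
$$\phi_S^2 = 2\left[\frac{\psi_a^2}{\xi_a} + \frac{\psi_b^2}{\xi_b} + \frac{2\psi_{ab}\xi_{ab}}{\xi_a\xi_b}\right].$$
When there are equal-sized groups with no missing observations then $\xi_a = \xi_b = \xi_{ab} = 0.5$ and
$$\phi_S^2 = 4\left[\psi_a^2 + \psi_b^2 + 2\psi_{ab}\right]. \tag{30}$$
Then the power or sample size required to detect specified values $\delta_a$ and $\delta_b$ are provided by (20) or (22), respectively.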
For example, suppose we desire to test the treatment group differences in both systolic (A) and diastolic (B) blood pressures, lower values of each being better. From existing data the respective SDs are $\psi_a = 13$ mm Hg and $\psi_b = 7$ mm Hg. The correlation of the two is $\rho_{ab} = 0.6$ which yields $\psi_{ab} = (0.6)(13)(7) = 54.6$. Assume that we wish to detect a treatment group difference equal to 0.25 SD for each measure, so that $\delta_a = (0.25)(13) = 3.25$ and $\delta_b = (0.25)(7) = 1.75$. For equal-sized groups with no missing observations then $\xi_a = \xi_b = \xi_{ab} = 0.5$ and $\phi_S^2 = 4[13^2 + 7^2 + 2(54.6)] = 1308.8$. For a one-sided test at the 0.05 level, the sample size required to provide power of at least 0.9 is provided by
$$N = \left[\frac{(1.645 + 1.282)\sqrt{1308.8}}{3.25 + 1.75}\right]^2 \approx 449.$$
For the standardized-scores test, from (16),
$$\phi_{S,Y}^2 = \sum_{i=E,C}\left[\frac{1}{\xi_{ia}} + \frac{1}{\xi_{ib}} + \frac{2\rho_{ab}\xi_{iab}}{\xi_{ia}\xi_{ib}}\right].$$
When there are equal-sized groups with no missing observations (all $\{\xi\} = 0.5$) then $\phi_{S,Y}^2 = 8(1 + \rho_{ab})$. Power and sample size are then obtained from (23) and (24).
For the above example, with equal sample sizes and no missing data, then $corr(\hat d_a, \hat d_b) = corr(X_a, X_b) = \rho_{ab} = 0.6$. Since the difference is specified as a fraction of the standard deviation, $\delta_a = (0.25)\psi_a$ and $\delta_b = (0.25)\psi_b$, then $\delta_a/\psi_a = \delta_b/\psi_b = 0.25$ and the required sample size is
$$N = \left[\frac{(1.645 + 1.282)\sqrt{8(1.6)}}{0.25 + 0.25}\right]^2 \approx 439$$
that is slightly less than the N required for the scale-based test. This indicates that for this example, the test based on standardized scores would have greater power for a given N.
The same numerical result is also obtained using the Z-based test since in this case the two tests are equal.
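The blood pressure example can be reproduced with the short sketch below. It assumes equal-sized groups with no missing data, a variance factor of four times the sum of the two variances and twice the covariance for the scale-based test, and a variance factor of eight times one plus the correlation for the standardized-score test. Variable names are hypothetical, and the results are rounded up to whole subjects.

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()
z_sum = norm.inv_cdf(0.95) + norm.inv_cdf(0.90)   # one-sided 0.05, power 0.90

psi_a, psi_b, rho_ab = 13.0, 7.0, 0.6             # SDs (mm Hg) and correlation
psi_ab = rho_ab * psi_a * psi_b                   # covariance, 54.6
delta_a, delta_b = 0.25 * psi_a, 0.25 * psi_b     # 3.25 and 1.75 mm Hg

# Scale-based test
phi2_scale = 4.0 * (psi_a ** 2 + psi_b ** 2 + 2.0 * psi_ab)
n_scale = (z_sum * sqrt(phi2_scale) / (delta_a + delta_b)) ** 2

# Standardized-score test
phi2_std = 8.0 * (1.0 + rho_ab)
n_std = (z_sum * sqrt(phi2_std) / (delta_a / psi_a + delta_b / psi_b)) ** 2

n_scale_total = ceil(n_scale)
n_std_total = ceil(n_std)
```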

Relative Efficiency Versus Other Tests
It is also instructive to compare the efficiency of the Wei-Lachin test versus two one-sided tests or an omnibus test. We do so here in the context of a test for means, and these results apply in general to other tests as well. Standard methods for the evaluation of the asymptotic relative (Pitman) efficiency (ARE) of two tests under a local alternative would not account for the necessary adjustment to the significance level for two tests. However, the ARE can be interpreted as the ratio of sample sizes needed to provide the same level of power for a specific alternative. This ratio of sample sizes can be derived directly from (22) relative to the like expression for either two separate tests or the omnibus test.
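Pairwise Tests. Consider the power of the test for means with equal group sample sizes and residual variance $\psi_j^2$ for the jth outcome where each is measured on the same scale so that the original scale-based test is appropriate, for a given alternative $(\delta_a > 0, \delta_b > 0)$. For two tests with equal-sized groups, each being of size N/2, with no missing data ($\xi_{ia} = \xi_{ib} = \xi_{iab} = 1/2$), the variance of the difference for the jth outcome is
$$\sigma_j^2 = \frac{4\psi_j^2}{N}$$
assuming homoscedasticity. Then the equivalent expression for the total sample size required based on the separate tests is provided, for the jth outcome, by
$$N_P = \left[\frac{2(Z_{1-\alpha/2} + Z_{1-\beta})\,\psi_j}{\delta_j}\right]^2$$
using the Bonferroni correction for 2 one-sided tests. To simplify, assume that the differences of interest are a common fraction $\nu$ of the standard deviations, i.e. $\delta_a = \nu\psi_a$ and $\delta_b = \nu\psi_b$ in which case
$$N_P = \frac{4(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\nu^2}.$$
Let $N_S$ denote the total sample size required for the Wei-Lachin test as obtained from (22) with the value $\phi_S^2$ that is obtained from (30) to yield
$$N_S = \frac{4(Z_{1-\alpha} + Z_{1-\beta})^2\,(\psi_a^2 + \psi_b^2 + 2\psi_{ab})}{\nu^2(\psi_a + \psi_b)^2}. \tag{38}$$
Thus, the ratio of sample sizes needed with the two pairwise one-sided tests versus the Wei-Lachin test is
$$\frac{N_P}{N_S} = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\,(\psi_a + \psi_b)^2}{(Z_{1-\alpha} + Z_{1-\beta})^2\,(\psi_a^2 + \psi_b^2 + 2\psi_{ab})}.$$
Since $Z_{1-\alpha/2} > Z_{1-\alpha}$ and $\psi_a\psi_b \ge \psi_{ab}$, then $N_P > N_S$. For example, consider a one-sided test at the 0.05 level (0.025 adjusted for two tests) with 90% power to detect an improvement of E versus C at any level $\nu$. Assume a correlation among the A and B measures of 0.5 and variances $\psi_a^2 = \psi_b^2 = 1$. Then $N_P = 42.03/\nu^2$ and $N_S = 25.69/\nu^2$ so that $N_P/N_S = 1.636$, i.e. the two separate tests require a sample size 64% greater than that for the Wei-Lachin test, which is thus 39% more efficient.
Omnibus Test. The $T^2$-like omnibus test in (7) is provided by $T^2 = \hat D'\hat\Sigma^{-1}\hat D$ that is asymptotically distributed as chi-square on 2 df. The corresponding non-centrality parameter is
$$\theta^2 = \Delta'\Sigma^{-1}\Delta$$
where the inverse covariance matrix is
$$\Sigma^{-1} = \frac{N}{4(\psi_a^2\psi_b^2 - \psi_{ab}^2)}\begin{pmatrix}\psi_b^2 & -\psi_{ab}\\ -\psi_{ab} & \psi_a^2\end{pmatrix}.$$
Thus $\theta^2 = N\Omega^2$ where
$$\Omega^2 = \frac{\delta_a^2\psi_b^2 + \delta_b^2\psi_a^2 - 2\delta_a\delta_b\psi_{ab}}{4(\psi_a^2\psi_b^2 - \psi_{ab}^2)}.$$
The non-centrality parameter for a test at level $\alpha$ on K df that provides power $1-\beta$, designated as $\theta^2(\alpha, \beta, K)$, is readily obtained, such as from the SAS function CNONCT. Then the required sample size is provided by
$$N_O = \frac{\theta^2(\alpha, \beta, K)}{\Omega^2}.$$
For the above example, $\theta^2(0.05, 0.10, 2) = 12.654$ and $\Omega^2 = \nu^2/3$, so that $N_O = 37.96/\nu^2$. Then, for the above example, the inverse efficiency relative to the Wei-Lachin test is provided by the ratio of $N_O$ to $N_S$ in (38), $N_O/N_S = 37.96/25.69 = 1.478$, so that the Wei-Lachin test is 32% more efficient than the omnibus test.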
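The sample size ratios for this example can be checked with the following sketch. It assumes unit variances, correlation 0.5, equal-sized groups with no missing data, and a common standardized difference that cancels from both ratios. The noncentrality parameter for the 2 df omnibus test is found by bisection on the power of a noncentral chi-square test, evaluated as a Poisson mixture of central chi-square distributions with even df; this stands in for the SAS function named in the text. All function names are hypothetical.

```python
from math import exp, log
from statistics import NormalDist


def chi2_cdf_even(x, half_df):
    """Central chi-square CDF for df = 2 * half_df (closed form, even df)."""
    term, total = 1.0, 1.0
    for k in range(1, half_df):
        term *= (x / 2.0) / k
        total += term
    return 1.0 - exp(-x / 2.0) * total


def power_chi2_2df(noncentrality, alpha=0.05, terms=200):
    """Power of a 2 df chi-square test for a given noncentrality parameter."""
    crit = -2.0 * log(alpha)               # upper alpha point on 2 df
    weight = exp(-noncentrality / 2.0)     # Poisson probability of j = 0
    cdf = 0.0
    for j in range(terms):
        cdf += weight * chi2_cdf_even(crit, 1 + j)
        weight *= (noncentrality / 2.0) / (j + 1)
    return 1.0 - cdf


def noncentrality_2df(alpha=0.05, power=0.90):
    """Noncentrality giving the desired power, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if power_chi2_2df(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


norm = NormalDist()
rho = 0.5
z_beta = norm.inv_cdf(0.90)
z_alpha = norm.inv_cdf(0.95)      # single one-directional test
z_alpha_half = norm.inv_cdf(0.975)  # each of two Bonferroni-adjusted tests

# Sample sizes multiplied by the squared standardized difference
n_pairwise = 4.0 * (z_alpha_half + z_beta) ** 2
n_wei_lachin = (z_alpha + z_beta) ** 2 * 2.0 * (1.0 + rho)
theta2 = noncentrality_2df(0.05, 0.90)
n_omnibus = theta2 * 2.0 * (1.0 + rho)

ratio_pairwise = n_pairwise / n_wei_lachin
ratio_omnibus = n_omnibus / n_wei_lachin
```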

Test for Multiple Proportions
Now consider a large sample test for a difference between groups in the probabilities ($\pi_{ij}$) of two Bernoulli variables $X_a$ and $X_b$ where the corresponding sample proportions are distributed as $p_{ij} \sim N(\pi_{ij}, \psi_{ij}^2/n_{ij})$ with Bernoulli variance $\psi_{ij}^2 = \pi_{ij}(1 - \pi_{ij})$ for the jth outcome within the ith group and sample sizes $n_{ij} = N\xi_{ij}$, i = E, C; j = a, b. The covariance of the Bernoulli variables within the ith group, $Cov(X_{ia}, X_{ib})$, is simply
$$\psi_{iab} = \pi_{iab} - \pi_{ia}\pi_{ib}$$
where $\pi_{iab}$ is the probability that both variables are positive [7]. Again we assume that a lower probability is better. If not, the (0, 1) categories should be reversed.
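Then $\hat D = (\hat d_a \;\; \hat d_b)'$, with $\hat d_j = p_{Cj} - p_{Ej}$, is asymptotically distributed as in (3) with a covariance matrix $\Sigma$ having elements $\sigma_j^2 = \sum_i \psi_{ij}^2/n_{ij}$ and $\sigma_{ab} = \sum_i \psi_{iab}n_{iab}/(n_{ia}n_{ib})$ that is consistently estimable from the sample quantities [7]. Then the statistic $Z_S$ is constructed as in (5) based on the sample estimate of the variance $\sigma_S^2$ as in (13). Note that in this case, since all measures are based on Bernoulli variables, there is no advantage to using the test based on standardized scores. Alternately, the Z-based test would be constructed as in (19).
For the assessment of sample size or power the covariance would be factored as $\Sigma = \Phi/N$ with terms $(\phi_a^2, \phi_b^2, \phi_{ab})$ where $\phi_j^2 = \sum_i \psi_{ij}^2/\xi_{ij}$ and $\phi_{ab} = \sum_i \psi_{iab}\xi_{iab}/(\xi_{ia}\xi_{ib})$. For example, assume that the outcomes in the control group are expected to have probabilities $\pi_{Ca} = \pi_{Cb} = 0.4$ with joint probability $\pi_{Cab} = 0.2$ and that the respective probabilities in the experimental group are $\pi_{Ea} = \pi_{Eb} = 0.3$ with joint probability $\pi_{Eab} = 0.15$. Then $\psi_{Ca}^2 = \psi_{Cb}^2 = 0.24$, $\psi_{Ea}^2 = \psi_{Eb}^2 = 0.21$, $\psi_{Cab} = 0.04$ and $\psi_{Eab} = 0.06$ [...] The correlation of the estimates is [...] Thus, the Z-based test is again more efficient than the scale-based test.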
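The within-group Bernoulli variances and covariance for the probabilities assumed in this example can be computed as in the sketch below. It covers only these within-group moments and their correlation, not the sample size calculation. The function name is hypothetical.

```python
from math import sqrt


def bernoulli_moments(p_a, p_b, p_ab):
    """Variances and covariance of two Bernoulli variables in one group.

    p_a, p_b : marginal probabilities of outcomes A and B
    p_ab     : probability that both outcomes are positive
    """
    var_a = p_a * (1.0 - p_a)
    var_b = p_b * (1.0 - p_b)
    cov_ab = p_ab - p_a * p_b
    return var_a, var_b, cov_ab


# Probabilities stated in the example
control = bernoulli_moments(0.4, 0.4, 0.2)
experimental = bernoulli_moments(0.3, 0.3, 0.15)

# Within-group correlation of the two binary outcomes
corr_control = control[2] / sqrt(control[0] * control[1])
corr_experimental = experimental[2] / sqrt(experimental[0] * experimental[1])
```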

Tests for Means and Proportions
Scale-Based Test. It is also possible to determine the joint distribution of a test for means of one outcome and a test for proportions of another. Let $X_a$ denote a quantitative measurement with means $\mu_{ia}$ and variance $\psi_a^2$, assuming homoscedasticity, and $X_{ib}$ denote a binary variable with probability $\pi_{ib}$ and variance $\psi_{ib}^2 = \pi_{ib}(1 - \pi_{ib})$ in the ith group (i = E, C). The covariance of the two in the ith group is provided by
$$\psi_{iab} = Cov(X_{ia}, X_{ib}) = \pi_{ib}\left[\mu_{ia(1)} - \mu_{ia}\right]$$
where $\mu_{ia(1)}$ is the mean of $X_a$ among those with $X_b = 1$ in the ith group. The covariance of the estimated group differences is then
$$\sigma_{ab} = \frac{Cov(X_{Ea}, X_{Eb})\,n_{Eab}}{n_{Ea}n_{Eb}} + \frac{Cov(X_{Ca}, X_{Cb})\,n_{Cab}}{n_{Ca}n_{Cb}} = \frac{\psi_{Eab}n_{Eab}}{n_{Ea}n_{Eb}} + \frac{\psi_{Cab}n_{Cab}}{n_{Ca}n_{Cb}}.$$
To conduct the test these variances and covariances can be estimated consistently from the corresponding sample estimates. Sample size and power can then be evaluated as above.
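The covariance between a quantitative and a binary variable can be written as the probability of the binary outcome multiplied by the difference between the subgroup mean (among those positive for the binary outcome) and the overall mean. The sketch below checks this identity against the directly computed covariance on a small made-up data set; the values and function names are hypothetical.

```python
def covariance_direct(x, b):
    """Covariance of x and b computed as E(xb) - E(x)E(b), divisor n."""
    n = len(x)
    mean_x = sum(x) / n
    mean_b = sum(b) / n
    mean_xb = sum(xi * bi for xi, bi in zip(x, b)) / n
    return mean_xb - mean_x * mean_b


def covariance_from_subgroup_mean(x, b):
    """Same covariance as prob(b = 1) * (mean of x given b = 1 - mean of x)."""
    n = len(x)
    mean_x = sum(x) / n
    prob_b = sum(b) / n
    positives = [xi for xi, bi in zip(x, b) if bi == 1]
    mean_x_given_b = sum(positives) / len(positives)
    return prob_b * (mean_x_given_b - mean_x)


# Made-up values for a quantitative measure and a binary indicator
x_values = [150.0, 160.0, 170.0, 180.0, 190.0, 200.0]
b_values = [0, 0, 1, 0, 1, 1]

cov_direct = covariance_direct(x_values, b_values)
cov_identity = covariance_from_subgroup_mean(x_values, b_values)
```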
For example, assume that we wish to test the difference between groups in the mean level of LDL and the prevalence of hypertension. Assume a SD $\psi_a = 20$ in both groups and that the difference of interest is $\delta_a = 5$ that corresponds to a 0.25 SD difference. While it is not necessary to specify the actual mean values within each group to compute $\delta_a$, it is necessary to do so to compute the covariance. Within each group assume that the overall mean values are $\mu_{Ea} = 170$ and $\mu_{Ca} = 175$ (corresponding to $\delta_a = 5$), and a greater treatment effect among those who are hypertensive with mean values $\mu_{Ea(1)}$ [...]
Z-Based Test. Alternately, since the scale-based test is not invariant under transformations, it would be more appropriate to employ a combination of the Z-tests. In this case, [...] When there are equal sample sizes between groups with no missing data for either measure then [...] Thus, the Z-based test would provide greater power in this case.

Tests for Multiple Event-times
For right censored event time data, a member of the family of Aalen-Gill tests [16,17], also known as the $G^\rho$ family of tests of Harrington and Fleming [18], can be used to test the hypothesis of equal hazard functions, or survival functions, between two groups. This family includes the logrank test that is asymptotically fully efficient under a proportional hazards model and is equivalent to the score test of the unadjusted group effect in a Cox Proportional Hazards model. It also includes the Peto-Peto-Prentice modified Wilcoxon test that is optimal under a survival proportional odds model. Andersen, Borgan, Gill and Keiding [19] describe a generalization of the tests for K > 2 groups. These tests are equivalent to the family of weighted Mantel-Haenszel statistics described by Kalbfleisch and Prentice [20].
Wei and Lachin [2] describe a multivariate rank test for event times that is a generalization of the above families of tests to the case of multiple time-to-event outcomes. They also introduced the one-directional multivariate test described herein, what they termed the test of stochastic ordering, to assess whether the treatment group event times differed in a favorable direction for all of the outcomes. A SAS macro for these computations is available (see discussion). The computational details will not be provided herein.
Lakatos [21] presents a general approach to the evaluation of sample size and power for the Mantel-logrank test that allows for time varying hazard rates, proportional or non-proportional hazards, and other design features. When the hazard rates are assumed constant over time with a constant of proportionality, a simple exponential model applies in which case the methods of Rubinstein et al. [22] or Lachin and Foulkes [23] can be applied. Herein we describe the computation of sample size or power for the Wei-Lachin test for multiple event-time outcomes under the exponential model of Lachin and Foulkes that includes a generalization of the method described by Lachin [15] based on the difference in the exponential hazard rates. Freedman [24] showed that the latter expression can also be derived from the expected value of the logrank chi-square test value under a proportional hazards model. Lachin and Foulkes [23] also show that the power of the test based on the difference in the estimated hazards is virtually identical to that for a test based on the log hazard ratio.
We assume that there are two or more outcome events where no one outcome is a competing risk for the other outcomes, such as the time to development of diabetic retinopathy and time to developing diabetic nephropathy, neither of which is fatal. Let $X_{ijk} = 1$ denote that the kth subject had the jth event in the ith group at time $t_{ijk}$, and $X_{ijk} = 0$ denote right censoring at time $U_{ijk}$ that in turn is the minimum of the loss to follow-up time and the administrative censoring time for those who remain free of the jth outcome, i = E, C; j = a, b. Then the total number of subjects with an event (called events) ($D_{ij}$) and total time at risk ($T_{ij}$) for the ith group and the jth outcome are
$$D_{ij} = \sum_k X_{ijk}, \qquad T_{ij} = \sum_k\left[X_{ijk}t_{ijk} + (1 - X_{ijk})U_{ijk}\right].$$
Note that the $X_{ijk}$ are non-iid Bernoulli variables with event probabilities that are a function of the underlying hazard rates for the event and losses to follow-up and the period of exposure $U_{ijk}$.
Within each group, for each outcome assume a constant hazard rate $\lambda_{ij}$ that is consistently estimated as $\hat\lambda_{ij} = D_{ij}/T_{ij}$. Let $E(D_{ij})$ designate the expected number of events based on the assumed hazard rate $\lambda_{ij}$, sample size, periods of recruitment and follow-up, and losses to follow-up in that group. Asymptotically, $\hat\lambda_{ij} \sim N(\lambda_{ij}, \upsilon_{ij}^2)$ where $\upsilon_{ij}^2 = \lambda_{ij}^2/E(D_{ij})$. Let $\hat d_j = \hat\lambda_{Cj} - \hat\lambda_{Ej}$ and $\hat D = (\hat d_a \;\; \hat d_b)'$. A test based on $\hat D$ will have power approximately equal to that of the Wei-Lachin multivariate one-directional test using the Wei-Lachin bivariate Aalen-Gill logrank test under proportional hazards. Thus, we describe the power of the bivariate logrank test based on the test of the difference in exponential hazards. Then the scale-based test employs
$$\Sigma = \begin{bmatrix}\upsilon_{Ea}^2 + \upsilon_{Ca}^2 & \sigma_{ab}\\ \sigma_{ab} & \upsilon_{Eb}^2 + \upsilon_{Cb}^2\end{bmatrix}$$
that is consistently estimated using $\hat\lambda_{ij}$ and the observed $D_{ij}$, j = a, b. File S1 shows that the covariance is expressed as [...] where $D_{iab}$ is the number of subjects who experience both the A and B events and $E[D_{iabI}]$ is the expected number with both events under the assumption that the Bernoulli variables $X_{iak}$ and $X_{ibk}$ are independent. Each is consistently estimated from the observed numbers of events and total time at risk. The computational expression for $D_{iabI}$ is also presented in File S1.
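The event count, total time at risk and estimated constant hazard rate for one outcome in one group can be computed as in this sketch. The variance estimate shown, the squared hazard estimate divided by the number of events, is the usual large-sample form for an exponential hazard and is included here as an assumption rather than as an expression taken from the text. The function name and the follow-up records are hypothetical.

```python
def exponential_hazard(records):
    """Estimate a constant hazard from (event_indicator, time) pairs.

    For each subject, time is the event time if event_indicator is 1 and
    the censoring time otherwise. Returns (events, time_at_risk, hazard,
    variance_estimate).
    """
    events = sum(indicator for indicator, _ in records)
    time_at_risk = sum(time for _, time in records)
    hazard = events / time_at_risk
    variance = hazard ** 2 / events   # assumed large-sample variance form
    return events, time_at_risk, hazard, variance


# Made-up follow-up data in years: two events and two censored subjects
example_records = [(1, 2.0), (0, 5.0), (1, 1.0), (0, 4.0)]
n_events, total_time, hazard_hat, var_hat = exponential_hazard(example_records)
```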
The resulting test as in (5) then is based on the variance estimate $\hat\sigma_S^2$ that is solely a function of the numbers of individual and joint events, the corresponding event times and the corresponding times at risk. Accordingly, the power of the test is a function of the expected numbers of events and expected time at risk that in turn are a function of the design parameters and sample size. Lachin and Foulkes [23] provide the expression for the probabilities of events $\{\pi_{ij}\}$ for given hazard rates for events $\{\lambda_{ij}\}$ and losses to follow-up $\{\eta_{ij}\}$, recruitment period $R$ with recruitment shape parameter $\gamma$ and total follow-up $Q$, and sample size $n_{ij} = N\xi_{ij}$. Then the expected number of events is obtained as $E(D_{ij}) = N\xi_{ij}\pi_{ij}$ and likewise the expected period at risk as $E(T_{ij}) = N\xi_{ij}\tau_{ij}$. File S1 also provides expressions for $E(D_{iab})$, $E(D_{iabI})$ and $E(T_{ij})$. Then $\sigma_S^2$ is obtained by substituting these expected quantities, and power and sample size are then obtained from (20) and (22). However, to obtain an analytic solution to these equations, a specific model must be specified for the dependence of the event times with a given correlation, such as the Marshall and Olkin [25] bivariate exponential model. Hougaard [26] provides a review of such models. Alternately, a simulation model could be implemented using a given bivariate exponential distribution. Herein, a simpler approach is described using a shared frailty.
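Assume that the two event types share a common frailty with parameter $\lambda_{iF}$. Then in the simulation model, in the $i$th group, three independent random exponential times are generated as

$$t_1 \sim \text{exponential}(\lambda_{ia} - \lambda_{iF}), \qquad t_2 \sim \text{exponential}(\lambda_{ib} - \lambda_{iF}), \qquad t_3 \sim \text{exponential}(\lambda_{iF}),$$

and the correlated exponential event times are then obtained as

$$t_{ia} = \min(t_1, t_3) \sim \text{exponential}(\lambda_{ia}), \qquad t_{ib} = \min(t_2, t_3) \sim \text{exponential}(\lambda_{ib}),$$

from which the probability $p_{iab}$ of both events can be obtained.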
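The shared-frailty construction can be sketched in a few lines. The sketch assumes that the three component times are independent exponentials with rates $\lambda_{ia} - \lambda_{iF}$, $\lambda_{ib} - \lambda_{iF}$ and $\lambda_{iF}$, which is the choice that makes the two minima marginally exponential with rates $\lambda_{ia}$ and $\lambda_{ib}$. The function names are hypothetical, and `prob_both_events` ignores staggered entry and losses to follow-up, so it only indicates how $p_{iab}$ could be estimated by simulation.

```python
import random


def shared_frailty_times(lam_a, lam_b, lam_f, n, seed=0):
    """Simulate n pairs of correlated exponential event times (t_a, t_b)."""
    if not 0 < lam_f < min(lam_a, lam_b):
        raise ValueError("require 0 < lam_f < min(lam_a, lam_b)")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        t1 = rng.expovariate(lam_a - lam_f)  # component specific to event A
        t2 = rng.expovariate(lam_b - lam_f)  # component specific to event B
        t3 = rng.expovariate(lam_f)          # shared frailty component
        pairs.append((min(t1, t3), min(t2, t3)))
    return pairs


def prob_both_events(pairs, follow_up):
    """Fraction of subjects with both events within a fixed follow-up time."""
    both = sum(1 for t_a, t_b in pairs if t_a <= follow_up and t_b <= follow_up)
    return both / len(pairs)


pairs = shared_frailty_times(lam_a=0.2, lam_b=0.3, lam_f=0.1, n=20000)
mean_a = sum(t_a for t_a, _ in pairs) / len(pairs)
mean_b = sum(t_b for _, t_b in pairs) / len(pairs)
print(f"mean t_a = {mean_a:.2f} (target {1 / 0.2:.2f})")
print(f"mean t_b = {mean_b:.2f} (target {1 / 0.3:.2f})")
print(f"P(both events within 5 years) = {prob_both_events(pairs, 5.0):.3f}")
```

Because the shared component can be the minimum for both outcomes, a fraction of subjects have identical A and B event times, which is the source of the induced correlation.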
For example, consider a $Q = 5$ year study with linear (constant) recruitment over an $R = 3$ year interval, allowing for a loss-to-follow-up hazard rate of 0.05 per year and with equal-sized groups. Within the control group assume that the hazard rates are $\lambda_{Ca} = 0.2$/year and $\lambda_{Cb} = 0.3$/year and that the experimental therapy yields relative risks of $RR_a = 0.8$ and $RR_b = 2/3$, or hazard rates of $\lambda_{Ea} = 0.16$/year and $\lambda_{Eb} = 0.2$/year, so that $\delta_a = 0.04$ and $\delta_b = 0.10$. To allow for a correlation of the event times we assume shared frailties of $\lambda_{EF} = 0.08$ and $\lambda_{CF} = 0.1$. For a given sample size, the simulation model (herein with 10,000 replications) provides direct computation (within a small degree of error) of the expected quantities ($E(D_{ij})$, etc.) from which power is computed. By a simple search it was found that an $n$ of 197 per group provides a one-sided one-directional test with 90% power. A similar computation using (25) shows that an $n$ of 197 per group would provide power = 0.885 using the Z-based test, indicating that in this setting the Z-based test would have less power than the original scale-based test.
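The arithmetic behind these design values can be checked directly. The sketch below derives the experimental hazards and the hazard differences from the control hazards and relative risks, and lists the independent component rates that a shared-frailty simulation would use in each group, assuming component rates equal to each marginal hazard minus the frailty parameter. The variable names are illustrative.

```python
control = {"a": 0.2, "b": 0.3}          # control hazards per year
rel_risk = {"a": 0.8, "b": 2.0 / 3.0}   # experimental versus control
frailty = {"C": 0.1, "E": 0.08}         # shared frailty parameter per group

# Experimental hazards and the control-minus-experimental differences.
experimental = {j: control[j] * rel_risk[j] for j in control}
delta = {j: control[j] - experimental[j] for j in control}


def component_rates(lam_a, lam_b, lam_f):
    """Rates of the three independent exponential components (assumed form)."""
    if not 0 < lam_f < min(lam_a, lam_b):
        raise ValueError("frailty must lie below both marginal hazards")
    return lam_a - lam_f, lam_b - lam_f, lam_f


rates = {
    "C": component_rates(control["a"], control["b"], frailty["C"]),
    "E": component_rates(experimental["a"], experimental["b"], frailty["E"]),
}

for j in ("a", "b"):
    print(f"outcome {j}: lambda_E = {experimental[j]:.2f}, delta = {delta[j]:.2f}")
for group, r in rates.items():
    print(f"group {group}: component rates = {tuple(round(x, 2) for x in r)}")
```

The check that each frailty parameter is smaller than both marginal hazards in its group matters, since otherwise the component rates would not be positive.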

Generalizations
It is also possible to obtain a test based on the combination of group differences in hazard rates and differences in proportions or means. As in the preceding sections this requires the derivation of the covariance of the measures within each treatment group.
Alternately, a multivariate one-directional test can be obtained using multiple regression models as now described.

Model-Based Analysis of Multiple Outcomes
The preceding sections describe the application of the Wei-Lachin test to a combination of the group differences in means or proportions or hazard rates. In each case the covariance of the group differences, or of the corresponding Z-values, is described. The test statistic can then be computed using a consistent sample estimate of the variances and covariance(s), and the expression for power can be obtained using specified values for these parameters. In principle it is possible to construct a test for combinations of other types of outcomes, such as the difference in rates (counts) of events under a Poisson model, and to derive the equations to assess the power of the tests. However, it is more convenient to provide model-based generalizations of this approach.
From basic principles, Pipper, Ritz and Bisgaard [27] describe the joint distribution of parameter estimates from multiple models, not necessarily all of the same type. Consider two models, one for each of two outcomes, each with $K_j$ parameters and coefficient estimates $\hat{\theta}_j = (\hat{\theta}_{j1} \ldots \hat{\theta}_{jK_j})'$, $j = a, b$. Arbitrarily, assume that the first parameter estimate $\hat{\theta}_{j1}$ represents the difference between groups on some scale, with no difference represented by a value of zero, and that the remaining $K_j - 1$ estimates represent the intercept (if any) and other covariate effects. Let $U_{\ell j}(\hat{\theta}_j)$ denote the score vector for the $\ell$th subject and $I_j(\hat{\theta}_j)$ the model-based estimate of the expected information for the $j$th outcome. Also, let $U_j(\hat{\theta}_j)$ denote the $K_j \times N$ matrix where the $\ell$th column is the score vector $U_{\ell j}(\hat{\theta}_j)$. Then the generalization of the information sandwich robust estimate of the covariance matrix of the joint set of estimates $\hat{\theta} = (\hat{\theta}_a'\ \hat{\theta}_b')'$ is provided by [equation omitted]. The estimated variances of the group coefficients in the two models are then provided by the elements $\hat{\sigma}_a^2 = \hat{\Sigma}_R(\hat{\theta}_a)_{1,1}$ and $\hat{\sigma}_b^2 = \hat{\Sigma}_R(\hat{\theta}_b)_{1,1}$, and the covariance by $\hat{\sigma}_{ab} = \hat{\Sigma}_R(\hat{\theta}_a, \hat{\theta}_b)_{1,1}$. The scale-based test is then provided by (5) with $(\hat{\theta}_{a1}, \hat{\theta}_{b1})$ substituted for $(\hat{\delta}_a, \hat{\delta}_b)$. Alternately, Z-tests of the group effect in the two models are provided by $Z_j = \hat{\theta}_{j1}/\hat{\sigma}_j$, $j = a, b$, and the correlation of these tests by $\text{Cov}(Z_a, Z_b) = \text{Corr}(\hat{\theta}_{a1}, \hat{\theta}_{b1}) = \hat{\sigma}_{ab}/(\hat{\sigma}_a \hat{\sigma}_b)$. This provides the Z-based test as in (19).
Pipper et al. also describe the application of the joint models where data for a subject are missing for one of the component models (but not both). Under the assumption that data are missing completely at random, the score vector elements for that subject are set to zero in the corresponding score matrix $U$.
It would be difficult to evaluate the sample size and power of such a model-based test. However, simple computations such as those herein could be applied, e.g. the power of a test for a difference in means and proportions when the actual analysis will employ a linear regression model and a logistic model.
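For the simplest case of two quantitative outcomes, such a computation can be sketched as follows. This is a normal-approximation sample size for the one-sided test based on the sum of two mean differences, assuming equal group sizes, known standard deviations, and the same within-group correlation `rho` in both groups. It is offered as an illustration of the type of calculation and is not claimed to reproduce equations (20) and (22). The function name and its arguments are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_sum_test(delta_a, delta_b, sd_a, sd_b, rho, alpha=0.05, power=0.90):
    """Per-group n for a one-sided test of the sum of two mean differences."""
    if delta_a + delta_b <= 0:
        raise ValueError("the sum of the differences must be positive")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    # With n per group, Var(d_a + d_b) = var_unit / n.
    var_unit = 2.0 * (sd_a ** 2 + sd_b ** 2 + 2.0 * rho * sd_a * sd_b)
    n = var_unit * (z_alpha + z_beta) ** 2 / (delta_a + delta_b) ** 2
    return math.ceil(n)


# Unit variances, correlation 0.5, and a half standard deviation
# difference on each outcome (illustrative values).
print(n_per_group_sum_test(0.5, 0.5, 1.0, 1.0, rho=0.5))
```

A higher correlation between the outcomes increases the variance of the sum and therefore the required sample size, which is consistent with the observation that modest correlations favor the multivariate test.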
Pipper et al. originally provided an R package multmod to fit multiple models and to compute the covariances of the coefficients in the models. That has since been replaced by the R package multcomp.

Example: The Diabetes Prevention Program
The Diabetes Prevention Program compared the risk of onset of type 2 diabetes and deterioration of metabolic function among participants randomly assigned to an intensive lifestyle intervention (ILS) versus treatment with the glucose lowering drug metformin and versus a placebo control with no lifestyle intervention [28]. The study showed that intensive lifestyle provided a 58% reduction in diabetes risk versus placebo and 39% versus metformin, and that metformin produced a 31% reduction versus placebo. The study also evaluated the differences among treatments in the prevalence of the metabolic syndrome, a metabolic state that is linked not only with risk of onset of diabetes but also with the risk of developing cardiovascular disease. The metabolic syndrome is characterized by 3 or more of the following 5 criteria: abdominal obesity defined as a waist circumference >102 cm among men or >88 cm among women, serum triglycerides (a bad cholesterol) ≥150 mg/dL, HDL (a good cholesterol) <40 mg/dL among men or <50 mg/dL among women, systolic/diastolic blood pressure ≥130/85 mm Hg, and fasting glucose ≥110 mg/dL, the latter met by many of the study subjects [29]. Of the 3234 randomized, 1388 (43%) already met the metabolic syndrome criteria. Among the remainder who were evaluated at 3 years of follow-up (i.e., free of the syndrome on entry), 22% (363 of 1673) had the syndrome present [30]. Herein we compare the prevalence of the metabolic syndrome and its components at 3 years of follow-up among those in the lifestyle versus metformin treated groups.
The classification of the metabolic syndrome is a composite outcome, i.e. a single binary trait to designate that the criteria were met. An alternative would be to construct an analysis of the 5 binary traits using the one-directional multivariate test described herein.
For two of the traits (waist circumference and HDL) there are separate criteria for men and women, and for hypertension both systolic and diastolic blood pressure are employed, whereas for the other two traits there is a single cutpoint for the corresponding quantitative measure. Thus an alternate analysis would be to use these three composite binary traits in conjunction with an analysis of the other two quantitative variables (triglycerides and glucose).
Alternately, rather than use any cutpoints to construct derived binary variables, an analysis could compare the groups with respect to the six quantitative traits (including systolic and diastolic blood pressure) simultaneously. Table 1 presents a comparison of the lifestyle versus metformin groups for each of the binary outcomes and each of the corresponding quantitative outcomes. The overall prevalence of the metabolic syndrome using the composite binary outcome does not differ significantly between groups, although the prevalence is about 2% lower in the lifestyle group.
For all variables other than HDL, higher values are worse, so that a positive difference (metformin minus lifestyle) indicates a benefit for lifestyle. In order for the same to apply to HDL, the analysis employed the negative values of HDL.
All p-values are one-sided. Some of the one-sided p-values are >0.5, indicating a negative Z-value favoring metformin. However, most of these differences are close to zero. For no measure is there evidence that intensive lifestyle is worse than metformin, and all significant differences favor the lifestyle group. Thus, these data are consistent with the alternative hypothesis that lifestyle has a beneficial effect on some of the outcomes, and no adverse effect for any. Table 2 presents the correlations among the measurements. The modest to low correlations suggest that a multivariate test will provide greater power than individual tests, especially when the latter are adjusted for multiple tests. Table 3 then presents the Wei-Lachin scale-based and Z-based one-directional multivariate test Z-values and one-sided p-values for three different analyses of these data. As would be expected, the analysis of all six quantitative traits is more powerful or sensitive than the analyses involving binary traits, with p-values <0.001 using either the scale-based or Z-based tests. The analysis of the 5 binary indicator variables produces less significant results, and the scale-based test for these data proves to be more powerful (larger Z-value) than the Z-based test, although both are significant. An alternative would be to conduct an analysis of the three binary traits defined from multiple criteria (waist, HDL, hypertension) and the other two quantitative traits (triglycerides and glucose). This yields results intermediate to those of the analysis of all quantitative and all binary traits.
Regardless of which of these options might have been chosen as the basis for the analysis, all would have provided a statistically significant result whereas the analysis of the composite metabolic syndrome outcome failed to demonstrate a beneficial effect of lifestyle versus metformin (Table 1, p = 0.22).

Discussion
A number of multivariate one-directional or one-sided tests have been described. Virtually all were developed to apply to a multivariate test of the difference in means between two groups for a multivariate outcome, such as repeated measures. These are also described for the case of two measures with group differences $\hat{\delta}_a$ and $\hat{\delta}_b$ as described above.
For a test based on multivariate normal observations, such as $K$ repeated measures, Kudo [31] described the multivariate one-sided likelihood ratio test (LRT) of the $K$-variate generalization of the ordered hypotheses in (2) assuming that the covariance matrix $\Sigma$ is known, and Perlman [32] described the LRT when the estimated covariance matrix is employed. For the case of the two statistics herein, Perlman's LRT is based on the statistic [equation omitted] where "$\vee$" designates the maximum of the two quantities. Thus, if either $\hat{\delta}$ is negative the resulting quantity is zero. However, the distribution of $S_{LR}$ is computationally difficult and the test is not convenient for practical use.
Tang, Gnecco and Geller [33] proposed a computationally simpler approximation to the LRT. Their approximate or ALR test is not an approximation in the sense, say, of a series expansion, but rather is an approximation in the sense that the alternative hypothesis parameter space is an approximation of that of the LRT. Their statistic is of the form [equation omitted] where $\tilde{Z}_a$ and $\tilde{Z}_b$ are uncorrelated standardized Z-statistics obtained as linear transformations of the $\hat{\Delta}$ vector. Under the assumption that the covariance matrix is known, then $\tilde{Z} = (\tilde{Z}_a\ \tilde{Z}_b)' = A'\hat{\Delta}$ where $A$ is a square matrix such that $A'A = \Sigma^{-1}$ and $A'\Sigma A = I$, such as is obtained from a Cholesky decomposition. The distribution of this statistic is a simplified chi-bar-squared distribution [34], though still requiring some computation to obtain a p-value. However, when an estimate of the covariance matrix is employed to provide the $A$ transformation matrix, various authors have shown that the test can be severely liberal, i.e. has an inflated type I error probability. In this case, Tamhane and Logan [35] described an accurate approximation to the distribution of the resulting test using a mixture of F-distributions, which also requires some computation to determine levels of significance.
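The decorrelating transformation can be illustrated for two outcomes. The sketch below takes the lower-triangular Cholesky factor $L$ of a known $2 \times 2$ covariance matrix ($\Sigma = LL'$) and returns $L^{-1}\hat{\Delta}$, which is one choice of transformation that yields uncorrelated components with unit variance. Only this standardization step is shown, not the ALR statistic itself, and the function names are hypothetical.

```python
import math


def cholesky_2x2(s_aa, s_ab, s_bb):
    """Lower-triangular factor (l11, l21, l22) of a 2x2 covariance matrix."""
    if s_aa <= 0 or s_aa * s_bb - s_ab * s_ab <= 0:
        raise ValueError("covariance matrix must be positive definite")
    l11 = math.sqrt(s_aa)
    l21 = s_ab / l11
    l22 = math.sqrt(s_bb - l21 * l21)
    return l11, l21, l22


def whiten(d_a, d_b, s_aa, s_ab, s_bb):
    """Solve L z = d by forward substitution, giving z = L^{-1} d."""
    l11, l21, l22 = cholesky_2x2(s_aa, s_ab, s_bb)
    z_a = d_a / l11
    z_b = (d_b - l21 * z_a) / l22
    return z_a, z_b


# Illustrative differences and covariance matrix [[4, 2], [2, 3]].
z_a, z_b = whiten(3.0, 1.0, s_aa=4.0, s_ab=2.0, s_bb=3.0)
print(f"standardized components: {z_a:.3f}, {z_b:.3f}")
```

With a Cholesky factor the result depends on the order in which the outcomes are listed, since the first standardized component is simply the first difference divided by its standard error.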
However, this test has the unsavory feature that if either $\hat{\delta}$ value is negative, regardless how greatly so, the value is set to zero in the computation of the test statistic. Thus, for example, if $\tilde{Z}_a = -1000$ and $\tilde{Z}_b = 10$, then $S_{ALR} = 10$, and depending on the estimated covariance values, the test could reject $H_{0S}$ in favor of $H_{1S}$, even though it is clear that $H_{1S}$ does not apply. In a recent overview, Tamhane and Logan [36] have suggested that "If several endpoints show moderate negative differences or even if a few show very large negative differences, then these tests should not be used because the a priori assumption of positive treatment effects in all endpoints is questionable." However, to apply this recommendation in practice violates the principle that the test statistic for a study be specified a priori. In effect, the recommended practice could be viewed as a two-stage inference process: first determine if the differences are positive, and if so conduct the test. This would clearly inflate the type I error probability.
Other tests have been proposed that are based in part on Hotelling's $T^2$ statistic that is equivalent to the expression in (8) and is distributed as $T^2$ on $K$ df under the assumption of multivariate normality of the observations. Under this assumption, $T^2$ provides an optimal test of the null hypothesis against the global alternative presented in (7). Follmann [37] proposed a test that rejects $H_0$ when $T^2$ is significant and $(\hat{\delta}_a + \hat{\delta}_b) > 0$. This test also could lead to rejection of $H_0$ when either the true $\delta_a$ or $\delta_b$ is a large negative value and the other an even larger positive value. Bloch, Lai and Tubert-Bitter [38] describe another test procedure which requires that $T^2$ reach significance at level $\alpha$ two-sided and that both individual one-sided t-tests of an indifference hypothesis be significant at level $\alpha$. The indifference hypothesis is $H_{0I}$: $(0 \ge \delta_a > -\varepsilon)$ and $(0 \ge \delta_b > -\varepsilon)$ for some small positive value $\varepsilon$, and the alternative hypothesis is $H_{1S}$ as in (2) above so that the one-sided t-test is of the form [equation omitted]. This test was later criticized by Perlman and Wu [39], who proposed use of the one-sided LRT of Perlman [32] in lieu of $T^2$, among other improvements. The result of either test, however, depends on the specification of the value $\varepsilon$ and thus the test may not be uniformly acceptable.
Table 1. Differences between the DPP intensive lifestyle (ILS, n = 571) versus metformin (MET, n = 557) treated patients at three years of follow-up with respect to quantitative trait components of the metabolic syndrome, and binary indicators of abnormal levels, and the overall incidence of the metabolic syndrome among those free of the syndrome on entry.
Other tests have also been applied, although not specifically designed to test $H_0$ against the one-sided alternative $H_{1S}$ in (2). O'Brien [13] proposed his ordinary least squares (OLS) and weighted least squares (WLS) tests of $H_0$ versus the alternative hypothesis of a common difference $H_{1A}$: $\delta_a = \delta_b = \delta \ne 0$. Thus the alternative hypothesis consists of the line of equality other than the origin. The one-sided version of this test will also be sensitive to alternatives where $\delta_a$ and $\delta_b$ are of similar positive magnitude, but will not be optimal against the general alternative $H_{1S}$. Pocock, Geller and Tsiatis [40] describe the application of these tests to the analysis of multiple outcomes in clinical trials on different scales.
For a two-group comparison of a vector of repeated measures, under the usual normal errors assumptions O'Brien also suggested that his statistics were distributed as t. However, the exact small sample distribution with normal errors is not known and many authors have shown that the resulting t-statistics have an inflated type I error probability. For a vector of repeated measures in two groups, Läuter [41] shows that statistics that employ weighted averages, as in O'Brien's WLS test, are indeed distributed as t provided that the weights are functions of the empirical covariance matrix estimated from all groups combined rather than the pooled within-groups covariance matrix estimate as employed by O'Brien. He proposes a family of such weighted tests that includes the Wei-Lachin test as a trivial special case. Frick [42] also showed that O'Brien's OLS test is biased.
Thus, among the various tests that have been proposed that could be applied to the assessment of simultaneous differences between groups for multiple outcomes, the Wei-Lachin test has the advantages that it is simple to compute; can be applied to mixtures of outcomes on different scales (e.g. means and proportions); has a large sample normal distribution (or a t-distribution with normal errors); provides type I error probabilities close to the nominal levels with generally acceptable sample sizes; is directed towards the specific multivariate one-directional alternative of interest; is maximin efficient relative to the possible true but unknowable optimal test; and readily provides for the computation of sample size and power.
Rahlfs and Vester [43] describe applications of the Wei-Lachin test to the analysis of multiple outcomes using the multivariate Mann-Whitney difference analysis described initially by Thall and Lachin [11]. The authors are affiliated with idv Data Analysis and Study Planning, which also markets a program (TESTIMATE) that conducts such Wei-Lachin analyses. Pan [44] also recently presented a review of various procedures including the Wei-Lachin test (called the SUM test therein) and some of the above referenced one-directional procedures and showed by simulation that the Wei-Lachin test had good power when the outcomes tended to jointly show beneficial effects.
Programs for computations herein are available from www.bsc.gwu.edu. These include the coefficient vector $L$ for use in (10) when Frick's condition does not apply, the simulation event time model, and the Wei-Lachin multivariate rank test.

Ethical Statement
Neither animals nor human subjects were involved in this methodological research.

Supporting Information
File S1 (PDF)

Author Contributions
Conceived and designed the methodological research: JML. Analyzed the data: JML. Wrote the paper: JML.