Abstract
The Mann-Whitney-Wilcoxon test, often referred to as the Mann-Whitney U test or Wilcoxon rank-sum test, is a non-parametric statistical test used to compare two independent groups when the dependent variable is ordinal or continuous but not normally distributed. It’s particularly useful for small sample sizes or when the assumptions of parametric tests, such as the t-test, are violated, including cases where the data is skewed. This study focuses on the Mann-Whitney-Wilcoxon test for ordinal data, which frequently arises in biomedical research when the proportional odds assumption does not hold. Currently, there are no optimal two-stage randomized clinical trial designs utilizing the Mann-Whitney-Wilcoxon test. To address this research gap, our study proposes optimal two-stage designs based on the Mann-Whitney-Wilcoxon test. We demonstrate the application of these designs through illustrative examples and evaluate their operating characteristics.
Citation: Park Y (2025) Optimal two-stage group sequential designs based on Mann-Whitney-Wilcoxon test. PLoS ONE 20(2): e0318211. https://doi.org/10.1371/journal.pone.0318211
Editor: Asli Suner Karakulah, Ege University, Faculty of Medicine, TÜRKIYE
Received: August 19, 2024; Accepted: January 10, 2025; Published: February 20, 2025
Copyright: © 2025 Park. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: No data was used for the research described in the article.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Outcomes in clinical trials are often ordinal. In 2020, the World Health Organization (WHO) developed a nine-point clinical progression scale to track COVID-19 severity over time [1]. The scale ranges from 0 (no infection) to 8 (death), covering the full spectrum of clinical outcomes. This comprehensive scale has been widely used in COVID-19 trials and cohort studies [2–4]. Similarly, the modified Rankin Scale (mRS) is a measure of disability ranging from 0 (no symptoms) to 6 (death) and is a widely used outcome measure in stroke clinical trials [5–7]. For example, randomized clinical trials such as NCT05199194, NCT05046106, and NCT00059332 primarily measure the mRS to evaluate the effectiveness of the intervention.
In practice, a dichotomized approach to ordered categorical variables is often used for comparative tests. For example, dichotomized mRS (0–1/2–6, 0–2/3–6, and 0–4/5–6) is commonly used [8–10]. However, as Altman and Royston (2006) [11] pointed out, dichotomizing leads to a reduction in statistical power to detect relevant treatment effects [12, 13]. To address the issue of analyzing ordinal endpoints, proportional odds logistic regression has been recommended. The proportional odds assumption states that the relationship between the independent variables and the log-odds of the cumulative probabilities of the ordinal categories is the same across all levels of the dependent variable. Using the proportional odds model, Whitehead (1993) [14] proposes a statistical method to calculate the sample size based on the score test statistic for two-arm randomized clinical trials. The method of Whitehead (1993) [14] was improved by calculating higher-order moments of the test statistic using the Cornish-Fisher and Edgeworth series [15]. These proportional odds methods preserve the ordinal nature of outcome measures, but the proportional odds assumption can be violated in clinical trials. As alternatives that avoid the proportional odds assumption, the Mann-Whitney-Wilcoxon test, other linear rank tests, and the win odds test can be utilized. Shuster et al. (2002) [16] illustrates a minimax two-stage design built from the single-stage calculation and the variances of the test statistics at the first and second stages. Shuster et al. (2004) [17] and Nowak et al. (2022) [18] investigate the Mann-Whitney-Wilcoxon test in the context of group sequential trials using the asymptotic canonical joint distribution. Hilton and Mehta (1993) [19] and Rabbee et al. (2003) [20] propose methods of sample size calculation applicable to any linear rank test under a general stochastically ordered alternative hypothesis for single-stage clinical trials. Moreover, Gasparyan et al. (2021) [21] proposes a sample size calculation for ordinal variables based on the single-stage win odds test, which is used to analyze hierarchical composite endpoints.
Currently, no optimal two-stage randomized clinical trial designs utilize the Mann-Whitney-Wilcoxon test. Optimal two-stage designs based on the Mann-Whitney-Wilcoxon test can offer several advantages, particularly in terms of efficiency and flexibility [22–25]. These designs can reduce the total sample size needed to reach conclusive results, as unpromising trials can be stopped early, conserving resources and time. This flexibility not only protects participants from ineffective treatments but also minimizes patient exposure to potentially harmful interventions.
To fill this gap, we propose new methods to determine the sample size and monitoring rule for optimal two-stage randomized clinical trial designs based on the Mann-Whitney-Wilcoxon test. We provide two types of two-stage designs: the F design, which allows early stopping of the trial for futility, and the FS design, which allows early stopping for either futility or superiority of the experimental treatment. In addition, we provide three criteria to select the optimal designs that minimize the expected total sample size. These methodologies offer clinical practitioners user-friendly study design options, enabling them to easily identify appropriate designs tailored to their specific study objectives.
The rest of this paper is organized as follows. We first propose optimal two-stage designs using the Mann-Whitney-Wilcoxon test. We provide explicit formulas for calculating sample sizes and identifying monitoring rules based on the optimal design parameters. We illustrate the methods with two examples and compare the designs. Additionally, we apply the proposed methods to redesign a trial and provide concluding remarks.
Optimal two-stage designs using Mann-Whitney-Wilcoxon test
We consider two-arm clinical trials in which patients are enrolled and randomized with a 1:1 ratio to either a control C or an experimental treatment E. Let X be a treatment assignment having 1 or 2 for the patient receiving treatment C or E, respectively, and Y be the ordinal outcome with J ordered levels, i.e., y ∈ {1, 2, …, J}. Without loss of generality, we regard the smallest outcome Y = 1 as the best clinical outcome and the largest outcome Y = J as the worst clinical outcome. For i = 1, 2 and j = 1, 2, …, J, let p_ij be the probability that an ordinal outcome on treatment i falls in the jth level. Let F and G be the distribution functions of Y on treatments C and E, respectively. Suppose that we are interested in testing the null hypothesis H_0: F(y) = G(y) for all y versus the alternative hypothesis H_a: F(y) ≤ G(y) for all y, with strict inequality for at least one y.

The Mann-Whitney-Wilcoxon test is used to compare two independent samples based on the ranks of the observed ordinal outcome data. Specifically, let Y_1l and Y_2m denote two independent random variables distributed according to F and G, respectively. Since the smallest value indicates the best outcome, we can compare the outcomes between the two arms: for any l, m = 1, 2, …, the event {Y_2m < Y_1l} indicates that E wins while C loses; the event {Y_1l < Y_2m} indicates that C wins while E loses; and the event {Y_1l = Y_2m} indicates that C and E are equivalent. These pairwise comparisons construct the test statistic as follows. Based on the samples Y_11, …, Y_1n from F and Y_21, …, Y_2n from G, the Mann-Whitney-Wilcoxon test statistic is defined in [26] and [17] as the difference between the proportion of all pairwise comparisons in which E wins and the proportion in which C wins,

W = (1/n^2) Σ_l Σ_m { I(Y_2m < Y_1l) − I(Y_1l < Y_2m) }.   (1)

A large value of W reflects a high proportion of comparisons leading to a win for arm E, i.e., F(y) < G(y). In the context of clinical trials investigating the treatment efficacy of arms C and E, ordinal data measure the treatment efficacy, and the event that E wins corresponds to the superiority of the experimental drug E while the event that C wins corresponds to the futility of the experimental drug E. We use the Mann-Whitney-Wilcoxon test based on (1) to propose an optimal two-stage design for ordinal outcomes.
Before we present the optimal two-stage design for ordinal outcomes based on the Mann-Whitney-Wilcoxon test (1), we first review the Mann-Whitney-Wilcoxon test for the design. For i = 1, 2, j = 1, 2, …, J, and k = 1, 2, let m_ijk denote the observed frequency of the jth level at the kth analysis when treatment i is received. Then, m_+jk = m_1jk + m_2jk denotes the marginal frequency of the jth level at the kth analysis. Based on the accumulating data at the kth analysis, (1) becomes the statistic W_k in (2), which is the Mann-Whitney-Wilcoxon test statistic used at the kth analysis and can be computed from the frequencies m_ijk and m_+jk. The test statistic is standardized to Z_k = W_k / sd_0(W_k) in (3), where the denominator of (3), sd_0(W_k), is the estimate of the standard deviation of W_k under the null hypothesis [17]. For a large sample, Shuster et al. (2004) [17] provides an approximation to the variance of W_k expressed in terms of two quantities, Q and R, which are functions of the category probabilities p_ij; the explicit expressions are given in [17]. Let D denote the population quantity estimated by W, i.e., the probability that E wins minus the probability that C wins in a comparison of a random pair of patients. We notice that D_0, the value of D under the null hypothesis, is zero. The value of D under the alternative hypothesis is denoted by D_a. To distinguish the quantities under the null and alternative hypotheses, hereafter we will use the notations Q_0 and R_0 for Q and R, respectively, under the null hypothesis, and Q_a and R_a for Q and R, respectively, under the alternative hypothesis.
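To make the construction above concrete, the following sketch computes the win-difference form of W for two samples of ordinal outcomes and standardizes it. It is a minimal illustration only: the win-difference form is assumed as described above, the null standard deviation is estimated here by permutation rather than by the closed-form large-sample approximation of [17], and the function names are illustrative choices, not part of the proposed method.

```python
import numpy as np

def mww_win_difference(y_control, y_experimental):
    """Win-difference Mann-Whitney-Wilcoxon statistic:
    (proportion of pairs where E wins, i.e. has the smaller outcome)
    minus (proportion of pairs where C wins)."""
    yc = np.asarray(y_control)[:, None]        # control outcomes as a column
    ye = np.asarray(y_experimental)[None, :]   # experimental outcomes as a row
    e_wins = np.mean(ye < yc)   # E wins when its outcome is smaller (better)
    c_wins = np.mean(yc < ye)
    return e_wins - c_wins

def standardized_mww(y_control, y_experimental, n_perm=2000, seed=0):
    """Standardize W by a permutation estimate of its null standard deviation
    (an illustrative substitute for the closed-form estimate of [17])."""
    rng = np.random.default_rng(seed)
    w = mww_win_difference(y_control, y_experimental)
    pooled = np.concatenate([y_control, y_experimental])
    n_c = len(y_control)
    perm_stats = []
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        perm_stats.append(mww_win_difference(perm[:n_c], perm[n_c:]))
    sd0 = np.std(perm_stats, ddof=1)
    return w, w / sd0

# Example: ordinal outcomes on an arbitrary 3-level scale (1 = best)
rng = np.random.default_rng(1)
y_c = rng.choice([1, 2, 3], size=40, p=[1/3, 1/3, 1/3])
y_e = rng.choice([1, 2, 3], size=40, p=[1/2, 1/3, 1/6])
w, z = standardized_mww(y_c, y_e)
print(f"W = {w:.3f}, standardized Z = {z:.3f}")
```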
Sample size calculation for two-stage design using Mann-Whitney-Wilcoxon test
The two-stage design consists of an interim analysis planned when the total accrual reaches n_1 patients per group, i.e., 2n_1 patients in total, and a final analysis performed after the evaluation of the total planned number of n patients per group, i.e., 2n patients in total. At the interim analysis, we monitor the treatment efficacy based on the accumulating data and determine whether to continue or stop the trial.
Futility monitoring design (F design).
When the total accrual reaches 2n_1 patients, an interim analysis is performed to test for futility, such that the trial stops if the experimental treatment E appears more futile than the control C, i.e., if W_1 ≤ c_1 for a prespecified constant c_1. If W_1 > c_1, we continue to enroll n − n_1 more patients per group for the second stage, and the test proceeds to the final analysis. Based on all 2n patients enrolled in the trial, if W_2 > c for a prespecified constant c, the test terminates with the rejection of the null hypothesis.
Let γ_0 and γ_1 be given such that γ_0 = P_0(W_1 ≤ c_1) and γ_1 = P_a(W_1 > c_1) denote the probabilities of making a correct decision at the interim analysis under the null hypothesis and the alternative hypothesis, respectively. By the definition of γ_0, i.e., γ_0 = P_0(W_1 ≤ c_1), the large-sample normal approximation of W_1 under the null hypothesis yields the futility boundary c_1 in Eq (4), where z_a denotes the critical value of the standard normal distribution at a. With Eq (4), the definition of γ_1, i.e., γ_1 = P_a(W_1 > c_1), together with the alternative mean D_a and the corresponding variance determined by Q_a and R_a, gives the first-stage sample size n_1 in Eq (5). Assuming that the type I error rate is at most α and the expected power is at least 1 − β, we constrain the rejection probabilities at the final analysis, because the overall type I error rate P_0(W_1 > c_1, W_2 > c) is bounded above by P_0(W_2 > c) and the overall power P_a(W_1 > c_1, W_2 > c) is bounded below in terms of γ_1 and P_a(W_2 > c). Therefore, we obtain the final boundary c and the maximum sample size n per group in Eqs (6) and (7), where the required quantities D_a, Q_0, R_0, Q_a, and R_a are computed from the specified null and alternative category probabilities.
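Eqs (4)–(7) give the boundaries and sample sizes in closed form and are not reproduced here. As a hedged illustration of how a candidate F-design rule (n_1, n, c_1, c) behaves, the sketch below estimates its rejection probability and expected total sample size by Monte Carlo simulation, in the spirit of the simulation studies reported in the Examples. The numerical rule and the category probabilities in the usage example are placeholders, not values taken from the paper's tables.

```python
import numpy as np

def win_difference(yc, ye):
    """Proportion of pairwise comparisons won by E minus those won by C
    (a smaller ordinal level is the better outcome)."""
    yc = np.asarray(yc)[:, None]
    ye = np.asarray(ye)[None, :]
    return np.mean(ye < yc) - np.mean(yc < ye)

def simulate_f_design(p_control, p_exper, n1, n, c1, c, n_sims=10000, seed=0):
    """Estimate the rejection probability and the expected total sample size
    (both arms combined) of an F design: stop for futility at stage 1 if
    W1 <= c1; otherwise enroll up to n per group and reject H0 if W2 > c."""
    rng = np.random.default_rng(seed)
    levels = np.arange(1, len(p_control) + 1)
    rejections, total_used = 0, 0
    for _ in range(n_sims):
        yc = rng.choice(levels, size=n, p=p_control)
        ye = rng.choice(levels, size=n, p=p_exper)
        w1 = win_difference(yc[:n1], ye[:n1])
        if w1 <= c1:                       # early stop for futility
            total_used += 2 * n1
            continue
        total_used += 2 * n
        if win_difference(yc, ye) > c:     # final analysis on all data
            rejections += 1
    return rejections / n_sims, total_used / n_sims

# Placeholder rule and Example-1-style probabilities (illustrative only)
p_c = [1/3, 1/3, 1/3]
p_e = [1/2, 1/3, 1/6]
rule = dict(n1=30, n=90, c1=0.05, c=0.17)
type1, en0 = simulate_f_design(p_c, p_c, n_sims=2000, **rule)  # null case
power, ena = simulate_f_design(p_c, p_e, n_sims=2000, **rule)  # alternative case
print(f"type I ~ {type1:.3f}, EN0 ~ {en0:.1f}; power ~ {power:.3f}, ENa ~ {ena:.1f}")
```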
Futility and superiority monitoring design (FS design).
The interim analysis can be conducted to evaluate the efficacy of the intervention and to stop the trial if the experimental treatment is ineffective or is found to be significantly more effective than the control. Let c_1, d_1, and c be the prespecified thresholds for the analyses. Specifically, we stop the trial for futility if W_1 ≤ c_1, and we stop the trial for superiority if W_1 > d_1. Otherwise, i.e., if c_1 < W_1 ≤ d_1, we continue to enroll n − n_1 more patients per group for the second stage. At the final analysis based on all 2n enrolled patients, we reject the null hypothesis if W_2 > c.
Let γ_0 and γ_1 be given such that they denote the probabilities of making a correct decision at the interim analysis under the null hypothesis and the alternative hypothesis, respectively, i.e., γ_0 = P_0(W_1 ≤ c_1) and γ_1 = P_a(W_1 > c_1). In addition, let ε_0 = P_0(W_1 > d_1) and ε_a = P_a(W_1 > d_1) denote the probabilities of crossing the superiority boundary at the interim analysis under the null and alternative hypotheses, respectively. By the definitions of γ_0, γ_1, and the prespecified interim superiority stopping probability, the large-sample normal approximation of W_1 yields the interim boundaries c_1 and d_1 and the first-stage sample size n_1 in Eqs (8)–(10). We notice that the overall type I error rate is P_0(W_1 > d_1) + P_0(c_1 < W_1 ≤ d_1, W_2 > c), which is less than or equal to ε_0 + P_0(W_2 > c). Since we want the type I error rate to be at most α, we set P_0(W_2 > c) so that this upper bound equals α. In addition, we observe that the overall power, P_a(W_1 > d_1) + P_a(c_1 < W_1 ≤ d_1, W_2 > c), is bounded below in terms of γ_1, ε_a, and P_a(W_2 > c). Since we want the expected power to be at least 1 − β, we set P_a(W_2 > c) so that this lower bound equals 1 − β. Therefore, we obtain the final boundary c and the maximum sample size n per group in Eqs (11) and (12).
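For clarity, the FS decision rule can be written as a small function that encodes exactly the logic above: stop for futility if W_1 ≤ c_1, stop early for superiority if W_1 > d_1, and otherwise proceed to the final analysis and reject H_0 if W_2 > c. The boundary values used in the illustrative call are placeholders; in practice they would come from Eqs (8)–(12).

```python
import numpy as np

def win_difference(yc, ye):
    yc = np.asarray(yc)[:, None]
    ye = np.asarray(ye)[None, :]
    return np.mean(ye < yc) - np.mean(yc < ye)

def fs_design_decision(yc, ye, n1, c1, d1, c):
    """Apply the FS rule to one trial's data (per-group samples of size n).
    Returns the decision and the per-group sample size actually used."""
    w1 = win_difference(yc[:n1], ye[:n1])
    if w1 <= c1:
        return "stop at interim: futility", n1
    if w1 > d1:
        return "stop at interim: superiority", n1
    w2 = win_difference(yc, ye)            # final analysis on all n per group
    return ("reject H0" if w2 > c else "do not reject H0"), len(yc)

# Illustrative call with placeholder boundaries (not values from the paper)
rng = np.random.default_rng(2)
yc = rng.choice([1, 2, 3], size=80, p=[1/3, 1/3, 1/3])
ye = rng.choice([1, 2, 3], size=80, p=[1/2, 1/3, 1/6])
print(fs_design_decision(yc, ye, n1=30, c1=0.02, d1=0.35, c=0.15))
```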
Optimal choice of design parameters
Determination of the thresholds and sample sizes for a two-stage design requires the specification of design parameters: γ_0 and γ_1 for the F design, and γ_0, γ_1, and the interim superiority stopping probability for the FS design. We consider three criteria to specify the design parameters: (1) minimizing the expected total sample size under the null hypothesis, (2) minimizing the expected total sample size under the alternative hypothesis, and (3) minimizing the expected overall total sample size. The overall total sample size is defined as the total sample size averaged equally across the null and alternative hypotheses.
In two-stage designs with futility monitoring (i.e., the F design), if W_1 ≤ c_1, we stop the trial for futility, and the total sample size is 2n_1. Otherwise, i.e., if W_1 > c_1, the trial continues to enroll 2(n − n_1) patients for the second stage, and the total sample size is 2n. Therefore, the expected total sample size under the null hypothesis is EN0 = 2n_1 + 2(n − n_1) P_0(W_1 > c_1) = 2n_1 + 2(n − n_1)(1 − γ_0), the expected total sample size under the alternative hypothesis is ENa = 2n_1 + 2(n − n_1) P_a(W_1 > c_1) = 2n_1 + 2(n − n_1) γ_1, and the expected overall total sample size is (EN0 + ENa)/2. Of note, EN0 and ENa are functions of γ_0 and γ_1. We choose the optimal values of γ_0 and γ_1 according to the criteria over their admissible ranges, and a constraint on γ_0 and γ_1 is imposed to guarantee that the resulting design is well defined, in particular that n_1 < n.
Similarly, in two-stage designs with futility and superiority monitoring (i.e., the FS design), the expected total sample size under the null hypothesis is EN0 = 2n_1 + 2(n − n_1) P_0(c_1 < W_1 ≤ d_1), the expected total sample size under the alternative hypothesis is ENa = 2n_1 + 2(n − n_1) P_a(c_1 < W_1 ≤ d_1), and the expected overall total sample size is (EN0 + ENa)/2. The values of EN0, ENa, and the expected overall total sample size depend on γ_0, γ_1, and the interim superiority stopping probability, and the optimal values of these three design parameters are chosen according to the criteria over their admissible ranges. Any combination of the three design parameters that does not yield a valid design (e.g., one with c_1 ≥ d_1) is excluded.
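As a sketch of the selection step, the snippet below evaluates EN0, ENa, and their equally weighted average for a few candidate two-stage rules and picks the optimum under each criterion. The candidate rules and their continuation probabilities are hypothetical inputs for illustration; in the proposed method these quantities are computed from the design parameters through Eqs (4)–(12).

```python
# Minimal sketch of choosing among candidate two-stage decision rules.
# Each candidate lists per-group sizes (n1, n) and the probability of
# continuing to the second stage under H0 and Ha (hypothetical numbers).
candidates = [
    {"name": "A", "n1": 25, "n": 90, "cont0": 0.35, "cont_a": 0.90},
    {"name": "B", "n1": 40, "n": 85, "cont0": 0.50, "cont_a": 0.95},
    {"name": "C", "n1": 55, "n": 83, "cont0": 0.60, "cont_a": 0.97},
]

def expected_sizes(d):
    # Expected total sample size (both arms): 2*n1 always, plus 2*(n - n1)
    # whenever the trial continues to the second stage.
    en0 = 2 * d["n1"] + 2 * (d["n"] - d["n1"]) * d["cont0"]
    ena = 2 * d["n1"] + 2 * (d["n"] - d["n1"]) * d["cont_a"]
    return en0, ena, 0.5 * (en0 + ena)

criteria = [("Criterion 1 (min EN0)", 0),
            ("Criterion 2 (min ENa)", 1),
            ("Criterion 3 (min average)", 2)]
for label, idx in criteria:
    best = min(candidates, key=lambda d: expected_sizes(d)[idx])
    print(label, "->", best["name"], [round(x, 1) for x in expected_sizes(best)])
```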
We find that the minimax criterion is not applicable for determining the design parameters of the two-stage designs. In the F design, minimizing the maximum sample size n pushes one of the two design parameters to the boundary of its admissible range, where the minimum value of n is attained regardless of the value of the other parameter. Thus, the design parameters cannot be uniquely determined for the F design under the minimax criterion. Similarly, in the FS design, the minimum value of n is attained along a boundary of the admissible parameter region, but the remaining design parameter is not uniquely determined.
Decision rule for single-stage design
The decision rule for a single-stage design can be straightforwardly derived from that of a two-stage design. By setting the futility boundary c_1 below the smallest attainable value of W_1, it becomes impossible to stop the trial for futility at stage 1. Furthermore, by setting n_1 = n, the second-stage sample size is effectively eliminated, as the entire sample is enrolled in a single stage. This simplifies the design, removing the need for interim analysis or additional patient accrual after the initial enrollment. Specifically, the single-stage design employs the Mann-Whitney-Wilcoxon test statistic denoted by W, calculated using data from all 2n patients at the end of the trial. The null hypothesis is rejected if W > t, where t is a prespecified threshold.

The decision rule (t, n) of the single-stage design is determined to ensure that the design satisfies P_0(W > t) ≤ α and P_a(W > t) ≥ 1 − β, meeting the specified type I error rate and power requirements. Similarly to the derivation of Eqs (4)–(12), by using the large-sample properties of the Mann-Whitney-Wilcoxon test statistic [17], we obtain explicit expressions for t and n.
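The closed-form single-stage solution is not reproduced here. As a brute-force stand-in that illustrates the two requirements a single-stage rule (t, n) must satisfy, the sketch below calibrates t as the (1 − α) quantile of a simulated null distribution of W for a given n and increases n until the simulated power reaches 1 − β. This simulation-based calibration is an assumption of this sketch, not the explicit formula referenced above.

```python
import numpy as np

def win_difference(yc, ye):
    yc = np.asarray(yc)[:, None]
    ye = np.asarray(ye)[None, :]
    return np.mean(ye < yc) - np.mean(yc < ye)

def simulate_w(p_c, p_e, n, n_sims, rng):
    """Simulate the statistic W for two arms of size n per group."""
    levels = np.arange(1, len(p_c) + 1)
    return np.array([win_difference(rng.choice(levels, n, p=p_c),
                                    rng.choice(levels, n, p=p_e))
                     for _ in range(n_sims)])

def single_stage_rule(p_c, p_e, alpha=0.05, beta=0.2, n_sims=1000, seed=0):
    """Smallest n (per group) and threshold t such that the simulated size is
    about alpha and the simulated power is at least 1 - beta."""
    rng = np.random.default_rng(seed)
    for n in range(20, 301, 10):
        w_null = simulate_w(p_c, p_c, n, n_sims, rng)   # both arms from control
        t = np.quantile(w_null, 1 - alpha)              # reject H0 if W > t
        w_alt = simulate_w(p_c, p_e, n, n_sims, rng)
        if np.mean(w_alt > t) >= 1 - beta:
            return t, n
    return None

# Illustrative call with the Example 1 probabilities (placeholder usage)
print(single_stage_rule([1/3, 1/3, 1/3], [1/2, 1/3, 1/6]))
```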
Examples
Example 1.
We consider a clinical trial whose overall clinical outcome is ordinal with 3 categories and a descending order of desirability, e.g., 1 = clinical benefit without adverse effects (AEs), 2 = clinical benefit with some AEs, and 3 = survival without clinical benefit but with AEs. Suppose that the distribution of the desirability score for the standard treatment is uniform, i.e., the probabilities of each category for the standard treatment are 1/3, while investigators wish to have the probabilities 1/2, 1/3, and 1/6 for each category for the experimental treatment. Target error rates are set to be α = 0.05 and β = 0.2 for the test comparing the two treatments.
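Before specifying the designs, the targeted effect on the Mann-Whitney-Wilcoxon scale can be computed directly from these category probabilities. The short calculation below obtains the probability that a random patient on the experimental treatment has the better (smaller) outcome than a random patient on the standard treatment, the reverse probability, the tie probability, and their difference, i.e., the win-difference summary assumed in the earlier sketch.

```python
import numpy as np

# Example 1 category probabilities (level 1 = best outcome)
p_control = np.array([1/3, 1/3, 1/3])
p_exper   = np.array([1/2, 1/3, 1/6])

def win_probabilities(p_c, p_e):
    """P(E wins), P(C wins), and P(tie) for one random pair (Y_C, Y_E),
    where a win means having the smaller (better) ordinal level."""
    joint = np.outer(p_c, p_e)               # joint[j, k] = P(Y_C = j+1, Y_E = k+1)
    e_wins = np.sum(np.tril(joint, k=-1))    # entries with k < j, i.e. Y_E < Y_C
    c_wins = np.sum(np.triu(joint, k=1))     # entries with j < k, i.e. Y_C < Y_E
    tie = np.trace(joint)
    return e_wins, c_wins, tie

e_w, c_w, tie = win_probabilities(p_control, p_exper)
print(f"P(E wins) = {e_w:.3f}, P(C wins) = {c_w:.3f}, P(tie) = {tie:.3f}, "
      f"win difference D_a = {e_w - c_w:.3f}")
```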
First, we consider a two-stage design with futility monitoring. By applying Eqs (4)–(7), the optimal decision rule (c_1, c, n_1, n) is determined according to each criterion. Let EN0 denote the expected total sample size under the null hypothesis, and ENa denote the expected total sample size under the alternative hypothesis. As described, Criterion 1 is to minimize EN0, Criterion 2 is to minimize ENa, and Criterion 3 is to minimize (1/2)EN0 + (1/2)ENa. Second, we consider a two-stage design monitoring both futility and superiority. Similarly to the design with futility monitoring, by applying Eqs (8)–(12), the decision rule (c_1, d_1, c, n_1, n) is determined according to each criterion. Lastly, we include a single-stage design, referred to as the 1 design, to provide a basis for comparison. For consistency in notation, the decision rule (t, n) is represented in the two-stage format, with the first-stage futility boundary set below the smallest attainable value of W_1 and with n_1 = n. Table 1 summarizes the decision rules of the designs.
To investigate the operating characteristics of the proposed designs, we conduct a simulation study with 10000 replications. We report four performance metrics: the overall type I error rate, the power (i.e., one minus the overall type II error rate), and the expected sample sizes under the null and alternative hypotheses, denoted by EN0 and ENa, respectively. For all three criteria, all designs preserve the overall type I and II error rates at the target rates. The overall type I error rates are slightly smaller than the nominal level, which does not sacrifice power, i.e., the overall type II error rates are also smaller than the nominal level. Criterion 1 yields a smaller first-stage size for both the F design and the FS design compared to the other criteria, which leads to a smaller expected sample size under the null hypothesis. Specifically, both the F design and the FS design with Criterion 1 spend only 30% of the trial's information to make the decision earlier than the others. The FS design with Criterion 2 or Criterion 3 requires 83 patients per group, which is the smallest maximum sample size of the trial among all six two-stage decision rules. We also notice that the F design with Criterion 2 uses an extremely small threshold for futility monitoring, so that the trial is almost never stopped at the interim analysis. In that case, implementing a two-stage design becomes impractical and is therefore not recommended. While the 1 design requires a smaller maximum sample size compared to the two-stage designs (F and FS designs), we observe that the two-stage designs, except for the F design with Criterion 2, tend to use a smaller expected sample size under the null hypothesis. This is an advantage, as it reduces the number of patients enrolled when the treatment shows no effect, thus saving resources and minimizing patient exposure to potentially ineffective treatments. Moreover, most two-stage designs use a larger expected sample size under the alternative hypothesis, which is also advantageous: enrolling more patients when the treatment is effective not only enhances the precision of the estimated treatment effect but also maximizes the potential benefits for participants receiving the treatment.
Example 2.
We consider an example trial illustrated in [14], whose primary endpoint is the patients' response evaluating the progress after the treatment, which is categorized as very good (=1), good (=2), moderate (=3), or poor (=4). Let p_ij denote the probability that an ordinal outcome on treatment i falls in the jth level. Suppose that the category probabilities for the control group are specified as in [14], so that the control probability of a good or very good outcome is 0.7. Investigators assume a common odds ratio for the cumulative distributions and wish to achieve a probability of 0.85 of a good or very good outcome for the experimental treatment. This provides the reference improvement of 0.887 (the log odds ratio) for superiority of the experimental treatment and the corresponding response probabilities for the experimental treatment [14]. Table 2 describes the null and alternative probabilities that a patient's response falls in each category. We consider target error rates α = 0.05 and β = 0.1 for the test.
Table 3 summarizes the decision rules of F design, FS design, and 1 design. We investigate the operating characteristics of the proposed design through simulations with 10000 replications. We observe that all designs preserve the overall type I and II error rates at target levels. Similarly to Example 1, Criterion 1 yields a smaller size for the first stage and results in the minimum value of EN0 among the criteria. In all decision rules for two-stage designs in Table 3, the maximum sample size of the trial is almost the same.
The common odds ratio assumption for the cumulative distributions holds in this example. Using the proportional odds model, Whitehead (1993) [14] proposes a method for determining the sample size of the trial based on the score test. The method of Whitehead (1993) [14] requires a total sample size of 188 (i.e., 94 patients per group) to detect the targeted improvement of the experimental treatment with a type I error rate of 0.05 and 90% power. This is obtained based on a single-stage test. We notice that the difference in the required total sample size of the trial is very small, but the proposed two-stage designs use, on average, 52.65 fewer patients across the six decision rules compared to the single-stage design of Whitehead (1993) [14] when the experimental treatment is not effective. Moreover, the proposed single-stage design using the Mann-Whitney-Wilcoxon test requires 36 fewer patients in total compared to using the score test.
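The construction of the alternative in this example can be reproduced mechanically: a common odds ratio is applied to every cumulative probability of the control distribution so that the probability of a good or very good outcome moves from 0.7 to 0.85. The control probabilities used below are placeholders chosen only to be consistent with the stated 0.7; the actual values are those of [14] and Table 2.

```python
import numpy as np

def apply_common_odds_ratio(p_control, odds_ratio):
    """Shift every cumulative probability of the control distribution by a
    common odds ratio and return the implied category probabilities."""
    cum = np.cumsum(p_control)[:-1]                    # cumulative probs (last one is 1)
    odds = cum / (1 - cum)
    new_cum = np.append(odds_ratio * odds / (1 + odds_ratio * odds), 1.0)
    return np.diff(np.concatenate([[0.0], new_cum]))

# Placeholder control probabilities with P(good or very good) = 0.2 + 0.5 = 0.7
p_control = np.array([0.20, 0.50, 0.20, 0.10])
target = 0.85                                          # desired P(good or very good) on E
odds_ratio = (target / (1 - target)) / (0.7 / 0.3)     # common odds ratio, about 2.43
p_exper = apply_common_odds_ratio(p_control, odds_ratio)

print("odds ratio:", round(odds_ratio, 3),
      "log odds ratio:", round(np.log(odds_ratio), 3))   # about 0.887, the reference improvement
print("experimental probabilities:", np.round(p_exper, 3),
      "P(good or very good):", round(p_exper[:2].sum(), 3))
```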
Trial illustration
The trial NCT03183167 is a longitudinal cohort study for patients with intracerebral hemorrhage (ICH), which is a devastating sub-type of stroke with a high mortality rate. Hagen et al. (2019) [27] analyzes the study data collected from 2008 to 2015 in order to investigate whether systemic inflammatory response syndrome (SIRS) is predictive of a long-term functional outcome in patients with spontaneous ICH. Functional outcome is evaluated using the modified Rankin Scale (mRS) at 3 and 12 months. The mRS is a widely used measure for assessing functional disability and dependence in individuals who have experienced stroke or other neurological conditions. The ordinal form of mRS is more strongly related to the 5-year mortality rate than the dichotomous mRS, which splits the mRS into favorable or unfavorable outcomes (e.g., using 0-1/2-6 or 0-2/3-6 dichotomies) [13]. To compare patients with noninfectious SIRS to a control group of patients without systemic inflammatory response, Hagen et al. (2019) [27] use propensity score matching to obtain a well-balanced cohort of 104 patients per group [27]. The mRS at 3 months for patients with SIRS falls in categories 0 to 6 with probabilities 0.048, 0.077, 0.173, 0.125, 0.202, 0.125, 0.25, while the mRS at 3 months for patients in the control group falls in categories 0 to 6 with probabilities 0.058, 0.154, 0.25, 0.163, 0.154, 0.106, 0.115.
Using the information, we apply the proposed designs to determine the monitoring rule and sample size for the trial. Since investigators argue that SIRS is associated with poor mRS, the alternative hypothesis has the opposite inequality sign to our description, i.e., F(y) > G(y). Suppose that target error rates are α = 0.05 and β = 0.2. By applying Eqs (4)–(7) and (8)–(12), decision rules are obtained for the F design and the FS design. We also add the decision rule for the 1 design. The performance is evaluated based on 10000 replications through simulation. The results are summarized in Table 4. The F design using Criterion 1 determines the decision rule given in Table 4, which yields a power of 83% and is expected to use 105.17 patients when SIRS is not predictive of poorer long-term functional outcome. The FS design using Criterion 1 determines the decision rule given in Table 4, which yields a power of 83.8% and is expected to use 106.95 patients when SIRS is not predictive of poorer long-term functional outcome. The F and FS designs with Criterion 1 show that overall error rates are controlled at target levels and spend only 30% of the trial's information to make a decision earlier than the other criteria. Notably, the FS design using Criterion 2 or 3 requires a maximum sample size of less than 80 patients per group, which is smaller than the other decision rules for the two-stage designs. These decision rules would be preferable if investigators want to minimize the maximum sample size for the trial.
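Because the two mRS distributions are fully specified above, the pairwise-comparison summary underlying the Mann-Whitney-Wilcoxon test can be computed directly. The sketch below evaluates, for a random pair of patients, the probability that the patient without SIRS has the better (lower) mRS, the probability that the patient with SIRS does, and the tie probability; it is a descriptive calculation only and does not reproduce the decision rules in Table 4.

```python
import numpy as np

# mRS (0-6) probabilities at 3 months reported in [27]
p_sirs    = np.array([0.048, 0.077, 0.173, 0.125, 0.202, 0.125, 0.250])
p_control = np.array([0.058, 0.154, 0.250, 0.163, 0.154, 0.106, 0.115])

def pairwise_comparison(p_a, p_b):
    """For independent outcomes A ~ p_a and B ~ p_b on the same ordinal scale,
    return P(A < B), P(B < A), and P(A = B); a lower mRS is the better outcome."""
    joint = np.outer(p_a, p_b)               # joint[j, k] = P(A = level j, B = level k)
    a_lower = np.sum(np.triu(joint, k=1))    # A's level is below B's level
    b_lower = np.sum(np.tril(joint, k=-1))   # B's level is below A's level
    tie = np.trace(joint)
    return a_lower, b_lower, tie

ctrl_better, sirs_better, tie = pairwise_comparison(p_control, p_sirs)
print(f"P(no-SIRS patient better) = {ctrl_better:.3f}, "
      f"P(SIRS patient better) = {sirs_better:.3f}, P(tie) = {tie:.3f}, "
      f"win difference = {ctrl_better - sirs_better:.3f}")
```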
Discussion
We have proposed optimal two-stage clinical trial designs for ordinal data using the Mann-Whitney-Wilcoxon test. The proposed designs enable monitoring of treatment efficacy during the trial and support an early, efficient decision for futility or superiority based on interim data. The proposed designs work with three criteria depending on the objective of optimality. The optimal two-stage design with Criterion 1 minimizes the expected total sample size when the experimental treatment is not efficacious compared to the control. The optimal two-stage design with Criterion 2 minimizes the expected total sample size when the experimental treatment is efficacious compared to the control. The optimal two-stage design with Criterion 3 minimizes the expected overall total sample size. We provide explicit formulas for calculating sample sizes and identifying stopping boundaries for the two-stage designs. The operating characteristics of the proposed designs are investigated through simulations. The simulation results show that the overall type I and II error rates are well controlled at the target rates. The proposed designs do not require the proportional odds assumption for the ordinal data and work well whether or not that assumption holds. Thus, the proposed designs offer more options for clinical practitioners to consider when designing randomized clinical trials whose primary endpoint is an ordinal outcome.
In this study, we do not employ sample size adaptation. This choice aligns with our focus on a fixed sample size framework, which facilitates a more straightforward analysis and application of the proposed method. Optimizing the adaptation rule introduces additional complexities that are beyond the scope of this work. However, this is an area of interest for future research, where re-estimating the sample size at interim analysis could enhance the flexibility of clinical trial designs.
We have assumed that the primary ordinal data would be evaluated at fixed timelines and fully observed before the interim analysis. However, missing data in clinical trials is almost unavoidable due to factors such as patient dropout, loss to follow-up, or technical issues during data collection. For example, a COVID-19 patient discharged early may not have complete follow-up data for subsequent ordinal outcomes, such as disease progression. Additionally, clinical trials for immunotherapies, where responses can take weeks or months to manifest, and chronic conditions, where outcomes like organ failure or survival may occur long after the intervention, often encounter late-onset outcomes. These delayed events may remain unobserved by the end of the trial, resulting in incomplete ordinal outcome data. This presents opportunities to enhance the proposed method by incorporating advanced statistical approaches, such as multiple imputation or joint modeling, to address missing ordinal data effectively.
References
- 1. WHO Working Group on the Clinical Characterisation and Management of COVID-19 infection. A minimal common outcome measure set for COVID-19 clinical research. Lancet Infect Dis 2020;20(8):e192–7. pmid:32539990
- 2. Brown SM, Peltan ID, Webb B, Kumar N, Starr N, Grissom C, et al. Hydroxychloroquine versus Azithromycin for Hospitalized Patients with Suspected or Confirmed COVID-19 (HAHPS). Protocol for a pragmatic, open-label, active comparator trial. Ann Am Thorac Soc 2020;17(8):1008–15. pmid:32425051
- 3. Wang Y, Zhang D, Du G, Du R, Zhao J, Jin Y, et al. Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial. Lancet 2020;395(10236):1569–78. pmid:32423584
- 4. Ivashchenko AA, Dmitriev KA, Vostokova NV, Azarova VN, Blinow AA, Egorova AN, et al. AVIFAVIR for treatment of patients with moderate Coronavirus Disease 2019 (COVID-19): interim results of a phase II/III multicenter randomized clinical trial. Clin Infect Dis 2021;73(3):531–4. pmid:32770240
- 5. van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJ, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke 1988;19(5):604–7. pmid:3363593
- 6. de Haan R, Limburg M, Bossuyt P, van der Meulen J, Aaronson N. The clinical meaning of Rankin “handicap” grades after stroke. Stroke 1995;26(11):2027–30. pmid:7482643
- 7. Kwon S, Hartzema AG, Duncan PW, Min-Lai S. Disability measures in stroke: relationship among the Barthel index, the functional independence measure, and the modified rankin scale. Stroke 2004;35(4):918–23. pmid:14976324
- 8. Hacke W, Bluhmki E, Steiner T, Tatlisumak T, Mahagne MH, Sacchetti ML. Dichotomized efficacy end points and global end-point analysis applied to the ECASS intention-to-treat data set: post hoc analysis of ECASS I. Stroke 1998;29(10):2073–5.
- 9. O’Donnell MJ, Fang J, D’Uva C, Saposnik G, Gould L, McGrath E, et al. The PLAN score: a bedside prediction rule for death and severe disability following acute ischemic stroke. Arch Intern Med 2012;172(20):1548–56. pmid:23147454
- 10. Flint AC, Xiang B, Gupta R, Nogueira RG, Lutsep HL, Jovin TG, et al. THRIVE score predicts outcomes with a third-generation endovascular stroke treatment device in the TREVO-2 trial. Stroke 2013;44(12):3370–5. pmid:24072003
- 11. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ 2006;332(7549):1080. pmid:16675816
- 12. Dijkland SA, Voormolen DC, Venema E, Roozenbeek B, Polinder S, Haagsma JA, et al. Utility-weighted modified rankin scale as primary outcome in stroke trials: a simulation study. Stroke 2018;49(4):965–71. pmid:29535271
- 13. Ganesh A, Luengo-Fernandez R, Wharton RM, Rothwell PM, Oxford vascular study. Ordinal vs dichotomous analyses of modified rankin scale, 5-year outcome, and cost of stroke. Neurology 2018;91(21):e1951–60. pmid:30341155
- 14. Whitehead J. Sample size calculations for ordered categorical data. Stat Med 1993;12(24):2257–71. pmid:8134732
- 15. Kolassa JE. A comparison of size and power calculations for the Wilcoxon statistic for ordered categorical data. Stat Med 1995;14(14):1577–81. pmid:7481194
- 16. Shuster J, Link M, Camitta B, Pullen J, Behm F. Minimax two-stage-designs with applications to tissue banking case-control studies. Stat Med 2002;21(17):2479–93. pmid:12205694
- 17. Shuster JJ, Chang MN, Tian L. Design of group sequential clinical trials with ordinal categorical data based on the Mann–Whitney–Wilcoxon test. Sequent Anal 2004;23(3):413–26.
- 18. Nowak CP, Mütze T, Konietschke F. Group sequential methods for the Mann-Whitney parameter. Stat Methods Med Res 2022;31(10):2004–20. pmid:35698787
- 19. Hilton JF, Mehta CR. Power and sample size calculations for exact conditional tests with ordered categorical data. Biometrics. 1993;49(2):609–16. pmid:8369392
- 20. Rabbee N, Coull BA, Mehta C, Patel N, Senchaudhuri P. Power and sample size for ordered categorical data. Stat Methods Med Res 2003;12(1):73–84. pmid:12617509
- 21. Gasparyan SB, Kowalewski EK, Folkvaljon F, Bengtsson O, Buenconsejo J, Adler J, et al. Power and sample size calculation for the win odds test: application to an ordinal endpoint in COVID-19 trials. J Biopharm Stat 2021;31(6):765–87. pmid:34551682
- 22. Simon R. Optimal two-stage designs for phase II clinical trials. Control Clin Trials 1989;10(1):1–10. pmid:2702835
- 23. Jung S-H, Lee T, Kim K, George SL. Admissible two-stage designs for phase II cancer clinical trials. Stat Med 2004;23(4):561–9. pmid:14755389
- 24. Jennison C, Turnbull BW. Adaptive and nonadaptive group sequential tests. Biometrika 2006;93(1):1–21.
- 25. Jennison C, Turnbull BW. Efficient group sequential designs when there are several effect sizes under consideration. Stat Med 2006;25(6):917–32. pmid:16220524
- 26. Lehmann EL. Nonparametrics: statistical methods based on ranks. San Francisco, CA: Holden-Day; 1975.
- 27. Hagen M, Sembill JA, Sprügel MI, Gerner ST, Madžar D, Lücking H, et al. Systemic inflammatory response syndrome and long-term outcome after intracerebral hemorrhage. Neurol Neuroimmunol Neuroinflamm 2019;6(5):e588. pmid:31355322