Many-to-one comparisons after safety selection in multi-arm clinical trials

In phase II platform trials, ‘many-to-one’ comparisons are performed when K experimental treatments are compared with a common control to identify the most promising treatment(s) to be selected for Phase III trials. However, when sample sizes are limited, such as when the disease of interest is rare, only a single Phase II/III trial addressing both treatment selection and confirmatory efficacy testing may be feasible. In this paper, we suggest a two-step safety selection and testing procedure for such seamless trials. At the end of the study, treatments are first screened on the basis of safety, and those deemed to be sufficiently safe are then taken forwards for efficacy testing against a common control. All safety and efficacy evaluations are therefore performed at the end of the study, when for each patient all safety and efficacy data are available. If confirmatory conclusions are to be drawn from the trial, strict control of the family-wise error rate (FWER) is essential. However, to avoid unnecessary losses in power, no type I error rate should be “wasted” on comparisons which are no longer of interest because treatments have been dropped due to safety concerns. We investigate the impact on power and FWER control of multiplicity adjustments which correct efficacy tests only for the number of safe selected treatments instead of adjusting for all K null hypotheses the trial begins testing. We derive conditions under which strict control of the FWER can be achieved. Procedures using the estimated association between safety and efficacy outcomes are developed for the case when the correlation between endpoints is unknown. The operating characteristics of the proposed procedures are assessed via simulation.

Numerical example to illustrate the application of the two-step procedure in a clinical trial

The following hypothetical example shows how to analyse a trial that includes a safety selection step before efficacy testing, as proposed.
Let us assume that three experimental arms (e.g., different administration forms of the same substance) are to be compared with a common control arm using a Dunnett test (DT) in a trial for the treatment of attention-deficit/hyperactivity disorder (ADHD), with $n_i = n = 22$ patients per arm.
For each patient $j$ ($j \in \{1, \dots, 22\}$) in arm $i$ ($i \in \{0, 1, 2, 3\}$), a score from an ADHD rating scale is calculated at baseline (denoted by $x^{(0)}_{ij}$) and at the final visit (denoted by $x^{(1)}_{ij}$). The primary outcome variable is the difference in this score between the two time points, $x_{ij}$. The mean difference between baseline and final-visit measurements in group $i$ is denoted by $\bar{x}_i$. For consistency with our framework, the final-visit measurement is subtracted from the baseline value, i.e., $x_{ij} = x^{(0)}_{ij} - x^{(1)}_{ij}$, which implies that higher values of $\bar{x}_i$ are favourable for the patient.
As a continuous safety variable, the QT time is measured at each visit for each patient, because the pharmacological properties of the experimental substance suggest that its intake may increase the probability of long QT syndrome; higher values are considered less favourable. It is decided to include in the multiple testing procedure for efficacy only those treatments that do not have an adverse effect on QT time, i.e., increased values over time. Therefore, a group is forwarded to efficacy testing only if its mean change in QT time (final visit minus baseline, denoted by $\bar{y}_i$) does not exceed the threshold of $b_{y_i} = 30$ milliseconds.
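The safety selection rule described above can be sketched in a few lines. This is only an illustration: the function name `select_safe_arms` and the mean QT changes passed to it are hypothetical and are not the results of the example trial.

```python
def select_safe_arms(mean_qt_change, threshold_ms=30.0):
    """Return the experimental arms whose mean QT change (final visit minus
    baseline, in ms) does not exceed the safety threshold."""
    return sorted(arm for arm, change in mean_qt_change.items()
                  if change <= threshold_ms)


# Hypothetical mean QT changes for arms 1..3 (illustration values only,
# not taken from the paper).
hypothetical_qt_change = {1: 34.2, 2: 12.5, 3: 21.0}

safe_arms = select_safe_arms(hypothetical_qt_change)
print(safe_arms)
```

Only the arms returned by such a screen, together with the control, would enter the variance pooling and the efficacy tests.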
According to the definition of the DT, if all groups are selected, group $i \in \{1, 2, 3\}$ is considered to show a significant effect if the test statistic $D_i$ exceeds the critical boundary $b_{x_i} := t_{\nu=88-3-1,\,K=3,\,\Sigma,\,1-\alpha_{nom}}$, the quantile of the multivariate t-distribution used for the DT with $K$ dimensions, $\nu$ degrees of freedom, and correlation matrix $\Sigma$ (the correlation matrix of the $D_i$), which is implicitly defined by the many-to-one comparison problem. The test statistics $D_i$ are defined as $D_i = (\bar{x}_i - \bar{x}_0) / \sqrt{2V/n}$, where $V$ is the pooled variance of all selected groups and the control. If fewer than three groups are selected, only these groups (and the control) are pooled for variance estimation. Furthermore, the quantile of the multivariate t-distribution is then $b_{x_i} := t_{\nu=66-2-1,\,K=2,\,\Sigma,\,1-\alpha_{nom}}$ when two groups are selected and $b_{x_i} := t_{\nu=44-1-1,\,1-\alpha_{nom}}$ when one treatment group is selected. The latter is the conventional t-quantile for a single treatment-control comparison with 42 degrees of freedom.

Now let us assume that the following results are observed:

In the following, we apply the different two-step selection and testing strategies to these numerical data.

Case 2: Assumption of non-negative correlation (NA approach, see section 4.2)
Here the design assumption was a non-negative correlation between ADHD score and QT time across all groups. Therefore, it was planned to conduct the two-step procedure using the natural correction (NA). In this case, the selection of just two groups instead of three allows the critical boundaries for the DT to be relaxed: $b_{x_i} = 2.26$ ($i \in \{2, 3\}$). It is still not possible to declare a statistically significant effect of Group 3 compared with Group 0, since $D_3 = 1.19 < 2.26$, but for Group 2 we obtain a statistically significant result, since $D_2 = 2.31 > 2.26$.
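The DT boundaries quoted above are quantiles of the maximum of correlated t-statistics, and they can be approximated by simulation. The following is a minimal Monte Carlo sketch, not the computation used for the example. It assumes a one-sided nominal level of 0.025 (a value not stated in this section) and equal allocation, so that any two statistics have correlation 0.5; the function name and the simulation settings are illustrative choices.

```python
import math
import random


def dunnett_critical_value(k, df, alpha=0.025, n_sim=100_000, seed=1):
    """Monte Carlo estimate of the one-sided Dunnett boundary for k
    treatment-control comparisons with equal group sizes and df degrees
    of freedom. alpha = 0.025 is an assumption, not taken from the text."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        control = rng.gauss(0.0, 1.0)
        # Shared variance estimate: sqrt(chi-square(df) / df).
        scale = math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
        # Each contrast shares the control term, which gives correlation 0.5.
        z_max = max((rng.gauss(0.0, 1.0) - control) / math.sqrt(2.0)
                    for _ in range(k))
        maxima.append(z_max / scale)
    maxima.sort()
    # Empirical (1 - alpha) quantile of the maximum statistic.
    return maxima[int(math.ceil((1.0 - alpha) * n_sim)) - 1]


crit_two = dunnett_critical_value(k=2, df=63)  # two selected groups
crit_one = dunnett_critical_value(k=1, df=42)  # one selected group
print(round(crit_two, 2), round(crit_one, 2))
```

Under these assumptions the first estimate should land close to the boundary of 2.26 used in Case 2, and the second close to the ordinary t-quantile with 42 degrees of freedom. Exact values would normally come from a numerical multivariate t routine rather than from simulation.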

Case 3: Lower-boundary assumption (KC approach, see section 5.1)
Here we consider the same situation as above, but instead of a zero correlation assumption, a lower boundary of $-0.25$ is assumed for the correlation between $y$ and $x$ in all groups, according to data from previous studies. Designing a study with the KC procedure means using a lower significance level $\alpha_a < \alpha_{nom}$ for the quantiles of the multivariate t-distribution above. For $\rho = -0.25$, $\alpha_a$ can be calculated to be 0.0211. The corresponding boundary for the DT for two hypotheses is $b_{x_i} = 2.34$ ($i \in \{2, 3\}$). In this case, both experimental treatments would fail to demonstrate a statistically significant result with KC assuming a lower boundary of $-0.25$ for the correlation.
Case 4: Estimation of the correlation (PI approach, see section 5.2)
Again we consider the situation from above, but now the correlation between ADHD score and QT time is estimated under the assumption of equal correlations in all groups. In this example, the observed correlations are $\hat{\rho}_1 = -0.10$, $\hat{\rho}_2 = -0.07$, and $\hat{\rho}_3 = -0.17$. Using the Fisher transformation, the best estimate of the true correlation is $-0.11$. With the PI approach, the observed correlation is plugged in when adjusting the significance level, resulting in $\alpha_a = 0.0233$. The corresponding DT boundary for two hypotheses is $b_{x_i} = 2.29$. In this case, Group 2 yields a statistically significant result, since $D_2 = 2.31 > 2.29$.
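The pooled estimate of $-0.11$ can be checked with a short calculation. The sketch below averages the Fisher z-transformed correlations and transforms the mean back. Weighting each group by $n_i - 3$ is a common convention assumed here, not something stated in the text; with 22 patients in every arm the weights are equal, so the result is a plain average on the z-scale. The function name is illustrative.

```python
import math


def pooled_correlation(correlations, sample_sizes):
    """Pool group-wise correlations via the Fisher z-transformation.
    The weights (n - 3) are an assumed convention; they are equal when
    all groups have the same size."""
    weights = [n - 3 for n in sample_sizes]
    z_mean = sum(w * math.atanh(r)
                 for w, r in zip(weights, correlations)) / sum(weights)
    return math.tanh(z_mean)


# Observed correlations from Case 4, with 22 patients per arm.
rho_hat = pooled_correlation([-0.10, -0.07, -0.17], [22, 22, 22])
print(round(rho_hat, 2))  # -0.11
```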

For the methods CO, NA and KC used in Cases 1, 2 and 3, respectively, the critical boundaries $b_{x_i}$ can be calculated in the planning phase for each potential number of treatments finally selected. Table 2 gives a summary of these boundaries, depending on the number of selected treatments. In contrast, the critical boundaries of the PI approach (Case 4) depend not only on the actual number of selected treatment arms, but also on the observed correlations.
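To make the comparison of the approaches concrete, the decisions from Cases 2 to 4 can be reproduced from the figures quoted in the text: the statistics $D_2 = 2.31$ and $D_3 = 1.19$, and the two-group boundaries 2.26 (NA), 2.34 (KC) and 2.29 (PI). The sketch below only tabulates these comparisons; the variable and function names are illustrative.

```python
# Test statistics of the two selected groups, as quoted in Case 2.
statistics = {2: 2.31, 3: 1.19}

# Two-group DT boundaries quoted in Cases 2, 3 and 4.
boundaries = {"NA": 2.26, "KC": 2.34, "PI": 2.29}


def rejected_arms(stats, boundary):
    """Arms whose test statistic exceeds the critical boundary."""
    return sorted(arm for arm, d in stats.items() if d > boundary)


decisions = {method: rejected_arms(statistics, b)
             for method, b in boundaries.items()}
for method, arms in decisions.items():
    print(method, arms)
```

Group 2 is declared significant under NA and PI but not under the more conservative KC boundary, and Group 3 is not significant under any of the three.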