Optimal two-stage design of single arm Phase II clinical trials based on median event time test

The Phase II clinical trials aim to assess the therapeutic efficacy of a new drug. The therapeutic efficacy has been often quantified by response rate such as overall response rate or survival probability in the Phase II setting. However, there is a strong desire to use survival time, which is the gold standard endpoint for the confirmatory Phase III study, when investigators set the primary objective of the Phase II study and test hypotheses based on the median survivals. We propose a method for median event time test to provide the sample size calculation and decision rule of testing. The decision rule is simple and straightforward in that it compares the observed median event time to the identified threshold. Moreover, it is extended to optimal two-stage design for practice, which extends the idea of Simon’s optimal two-stage design for survival endpoint. We investigate the performance of the proposed methods through simulation studies. The proposed methods are applied to redesign a trial based on median event time for trial illustration, and practical strategies are given for application of proposed methods.


Introduction
The primary objective of Phase II clinical trials is to test whether the therapeutic intervention actually works in treating a disease or indication. The therapeutic efficacy has been often quantified by response rate in Phase II settings, for example, overall response rate defined as the proportion of subjects who achieve a confirmed complete response or partial response determined by RECIST 1.1 [1] or survival probability at a certain year defined as the proportion of patients alive (and without recurrence) at certain year after the start of treatment. In practice, Simon's two-stage minimax and optimal designs are widely used for binary endpoint [2][3][4][5][6][7]. Simon's designs allow early trial termination due to futility while Fleming's design allows early trial termination due to futility and superiority in Phase II trials [8,9]. Successful results in Phase II trials lead to proceed to confirmatory Phase III trials with more extensive development. Therefore, it is necessary to validate and use the surrogate endpoint for overall survival, which is the gold standard endpoint for Phase III study. When considering the survival a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 probability which enforces the survival time to the binary endpoint, we need much caution to analyze the results. Some designs address incomplete follow-up for some subjects by the time of the interim analysis to assess the survival probability [10][11][12][13]. Data augmentation can be used to address the timing of events for late-onset outcomes by imputing missing outcomes. [14,15].
Surprisingly, a recent FDA report shows 22 case studies investigating disagreement in the results between early and confirmatory phases [16]. Some cases showed unexpected failures in Phase III study from the promising results on clinical outcomes in the Phase II study (e.g., iniparib). Therefore, some investigators strongly desire to use time-to-event endpoint to evaluate the therapeutic efficacy of the drug for Phase II trials. They are interested in improvement of median survival compared to standard therapy. Finkelstein et al. [17], Sun et al. [18], Jung [19], Kwak and Jung [20], Wu et al. [21] propose designs with one-sample log-rank test which compares the survival distributions between prespecified null reference and the desired target. This idea comparing the survival distributions allows to conduct for hypothesis testing for the median survivals only when the survivals are assumed to follow exponential distribution. Wu [22] proposes a single-arm Phase II clinical trial design under a class of parametric cure models. Chu et al. [23] proposes the design for immunotherapy trials with random delayed treatment effect based on survival time.
In this paper, we propose new methods for a single arm Phase II study, which provide practical strategies for testing if there is an improvement of median survivals from a new drug compared to the standard therapy. The proposed median event time test uses the distribution of sample median to obtain the required sample size and threshold for the hypothesis testing at target type I and II errors. It provides the decision rule, which is very simple and straightforward, comparing the observed median survival with the identified threshold at the end of trial. This can be easily implemented with our shiny application and R codes built along with the method development. Moreover, this median event time test is extended in group sequential manner. It follows the idea of Simon's optimal two-stage design, in that the expected sample size is minimized under the null hypothesis, but considers survival endpoint to see the improvement based on the median event time test. This allows us to monitor futility of the drug based on median survival time and stop the trial early, which makes the design more efficient and practical.
The rest of the paper is organized as follows. In Methods Section, we provide a median event time test and propose an optimal two-stage design based on median event time test. In Results Section, simulation studies are presented, and a trial example is provided to illustrate the application of our methods. Lastly, we provide some comments in Discussion and concluding remarks in Conclusion.

Median event time test
Let Y 1 , . . ., Y n be random variables from an exponential distribution with mean μ. Then, the median ϕ is equivalent to μ log 2. Motivated by single arm Phase II studies whose primary endpoint is time-to-event, we formulate a hypothesis test based on median event time with H 0 : ϕ = ϕ 0 versus H a : ϕ = ϕ 1 for some values of ϕ 0 and ϕ 1 . In practice for intervention study, we use median time of standard drug or therapy and the expected median time from the intervention (for improvement) to specify the values of ϕ 0 and ϕ 1 , respectively. Let� n ðyÞ be a sample median event time obtained from a sample of size n. Then, for any λ � 0, the rejection region for the test is fy :� n ðyÞ > lg. For any α and β in unit interval (0, 1), this test is the level α test with power 1 − β such that Prf� n ðyÞ > ljH 0 g ¼ a and Prf� n ðyÞ � ljH a g ¼ b. Since the first conditional probability indicates the probability of rejecting the null hypothesis when the null hypothesis is true, the value of α indicates the type I error rate. Also, the second conditional probability indicates the probability of failure of rejecting null hypothesis when the null hypothesis is not true, and the value of β indicates the type II error rate. The distribution function of the sample median event time is derived in the following theoretical result.
Theorem 1 Let Y 1 , . . ., Y n be random variables from an exponential distribution with median ϕ. Then, the probability density function (pdf) of sample median of Y 1 , . . ., Y n is either g n ðmÞ ¼ n! log 2f1 À exp ðÀ m log 2=�Þg k exp fÀ mðk þ 1Þ log 2=�g=ðk!k!�Þ for n = 2k + 1 or The proof is in Appendix A. Theorem 1 provides a cumulative distribution function (cdf) of median event time given by G n ðlÞ ¼ R l 0 g n ðmÞ dm, where g n denotes the pdf of the sample median, for λ � 0. Since the cdf of sample median event time has no closed form, a numerical search over grid is required to identify sample size n and threshold λ such that the empirical errorsâðn; lÞ andbðn; lÞ are close to the nominal target values of type I and II error rate (i.e., α and β), respectively. The empirical errors are calculated aŝ aðn; lÞ ¼ Prð� n > ljH 0 Þ ¼ 1 À G n ðljH 0 Þ andbðn; lÞ ¼ Prð� n � ljH a Þ ¼ G n ðljH a Þ; where G n (�|H 0 ) and G n (�|H a ) denote the cdf of sample median of event time under null and alternative hypotheses, respectively. It implies that the identified sample size n justifies to achieve f1 Àbðn; lÞg � 100% power based on one-sided test with a significance level of aðn; lÞ to detect improvement of the sample median time of the experimental drug against the median time of standard drug. This test states that at the end of trial, i.e., based on sample of size n, the null hypothesis is rejected if� n > l and we argue that experimental drug increases in median event time from ϕ 0 to ϕ 1 . We call this median event time test.
Let's consider a clinical trial with hypothesis testing of median progression free survivals (PFS) ϕ 0 = 10 months versus ϕ 1 = 17 months. Suppose that maximum number of patients for this study is 100. We consider a grid search over the integer n between 1 and 100 and a real number λ between 10 and 17 with the increment 0.1. For the target error rates of α = 0.05 and β = 0.2, our numerical study results in n = 42 and λ = 14.1 months minimizing deviation of the empirical errors from the target rates, i.e., fâðn; lÞ À ag 2 þ fbðn; lÞ À bg  Table 1 summarizes the decision rule for testing median survival ϕ 0 versus ϕ 1 . In Table 1, scenarios with different value of ϕ 0 and ϕ 1 are considered. The choice of ϕ 0 and ϕ 1 is determined intentionally. To see the performance of method, we fixed ϕ 0 or ϕ 1 to vary ϕ 1 or ϕ 0 , respectively.
We have proposed median event time test for exponential survivals. The proposed method is extended for other parametric survival distributions. We provide the distribution function of the sample median event time for uniform and weibull survivals in Appendix B, which determines the decision rule.
This proposed median event time test is attractive with three reasons. First, this is newly proposed for an exact median event time test using the observed median event time. Second, author provides a software to calculate sample size and identify the threshold for the test, which is open for public use (https://yeonhee.shinyapps.io/METTshinyapp/). It does not require complicated statistical analysis for decision but implements easily. The simple and straightforwardly interpretable rule can save lots of time for drug development. Third, it has potential to extend for many applications. For example, the decision rule can be used to monitor futility in group sequential designs (for Phase II or III studies), and the threshold λ provides a good candidate (or range) for the Bayesian monitoring rule. Specifically, it provides foundation for optimal two-stage design based on median event time test which is described later.
Before we move to next section, we provide a remark on the median event time test. To develop the exact median event time test, we considered hypothesized values for median of survivals Y i = min(S i , U i ), where S i denote the time-to-event and U i denote the administrative censoring time for the ith patient. Our setting in this section uses Y i and does not require information for accrual rate and follow-up time. However, when we consider a hypothesis testing with median of survivals S i , we should care the censoring information. Assume that the time to arrival of the patient and survival time (i.e., S i ) follow some exponential distributions. Then, U i , which is the time to arrival of the last patient plus follow-up time minus time to arrival of the ith patient, follows another exponential distribution. We notice that minimum of    two exponential random variables (i.e., Y i ) follow some exponential distribution. Therefore, this setting for hypothesis testing with Y i looks reasonable with the exponential survival assumption for S i and the right censoring.

Optimal two-stage design based on median event time test
The proposed median event time is straightforward for clinicians to interpret and justify the sample size for the exact level of errors. It is critical, especially in rare disease trials, to obtain promising evidence with the target error rates and minimize the expected sample size. Moreover, from an ethical and practical viewpoint, it is desirable to stop the trial early if the therapeutic intervention is not effective. This motivates to propose a single arm Phase II trial with an interim planned when the total accrual reached n 1 patients. The final analysis will be performed after the follow-up of all planned number of n patients. A two-stage design using median event time test is proposed as follows. In the first stage (i.e., at interim), we determine go/no-go of the trial based on the observed median event time for those n 1 patients. When the observed median event time is less than or equal to the threshold λ, the trial is stopped for futility. Otherwise, the trial continues to enroll (n − n 1 ) patients in the second stage. At final analysis, we argue that the drug is sufficiently promising to evaluate against the standard therapy if the observed median event time based on all trial data is larger than the threshold. The expected sample size for the two-stage design above is EN = n 1 + (1 − PET)(n − n 1 ), where PET denotes the probability of early termination after the first stage. EN depends on n, n 1 and λ, and the optimal two-stage design using median event time test is proposed in that EN is minimized under the null hypothesis. In the following, we describe how to specify n, n 1 and λ for the optimal two-stage design. Let M be a maximum sample size for the study, e.g., this can be specified by clinicians or by sample size calculator for one-arm nonparametric statistics (https://stattools.crab.org/Calculators/oneNonParametricSurvival.htm). Let � 1 and � 2 denote the acceptable difference between the estimated type I and II error rates from the trial design and the target error rates. For each prespecified values of ϕ 0 , ϕ 1 , α and β, Step 1. For each n 1 between 1 and M − 1, search n and λ minimizing fâðn 1 ; n; lÞ À ag 2 þ fbðn 1 ; n; lÞ À bg 2 , wherê over n 1 < n � M and ϕ 0 � λ � ϕ 1 .
As seen in median event time test, we don't have closed forms for the marginal or joint distributions of sample median event times to calculate empirical probabilities in (1) and (2). The numerical search is used in the first step. The criteria given in the second step targets at minimizing the expected sample size, EN. This optimality criteria can be modified according to the study objective, e.g., the design minimizes expected total study length. Moreover, our design is flexible to use different thresholds, λ 1 for the interim analysis and λ 2 for the final analysis. Investigators fix a threshold λ 1 to stop early for futility and search threshold λ 2 satisfying the target error rates (or they can search both thresholds): aðn 1 ; n; l 1 ; l 2 Þ ¼ Prð� n 1 > l 1 ;� n > l 2 jH 0 Þ bðn 1 ; n; l 1 ; l 2 Þ ¼ Prð� n 1 � l 1 jH a Þ þ Prð� n 1 > l 1 ;� n � l 2 jH a Þ jâðn 1 ; n; l 1 ; l 2 Þ À aj < � 1 and jbðn 1 ; n; l 1 ; l 2 Þ À bj < � 2 : Although the proposed method does not assume specific survival distribution, the probability calculation requires to specify the survival distribution. We can borrow information from the previous research to specify the survival distribution (e.g., exponential, uniform, or weibull distribution). As an example to illustrate the optimal two-stage design based on median event time test, we assume that the survivals follow from an exponential distribution with median ϕ � and will be right-censored for subjects who have not yet met the criteria at the date of the last valid disease assessment. Setting with median survivals ϕ 0 = 10 and ϕ 1 = 17 for the standard therapy and a new therapy, respectively, and target error rates α = 0.05 and β = 0.2, we consider the maximum allowable sample size M = 50 (which is obtained from a sample size calculator for one-arm nonparametric statistics). Patients arrived according to a Poisson process with the accrual rate of 1.04 patients per month. We continue follow-up for 24 months after the last patient was enrolled. Empirical estimate of type I and II errors were obtained based on 1000 simulation trials. Then, our method with � 1 = 0.02 and � 2 = 0.025 attains the optimal results at n 1 = 32, n = 42 and λ = 15.2. It implies that our two-stage design accrue n 1 = 32 patient for the first stage. At interim, if the observed median event time based on these n 1 patients is less than or equal to λ = 15.2 months, the study will be early stopped for futility. Otherwise, additional n − n 1 = 10 patients will be accrued in stage 2, resulting in a total sample size of n = 42. At the end of trial, if the observed median event time based on all n = 42 patients is larger than λ = 15.2 months, we reject the null hypothesis and claim that the treatment is sufficiently promising. Different null and alternative median event times can be considered, and the results of the optimal two-stage design are summarized in Table 2.
The proposed optimal two-stage design finds, in the first stage, the total sample size n and threshold λ for the given interim size n 1 and target error rates (i.e., α and β). In other words, both n and λ are searched. In case where study has a certain planned total sample size, the proposed design is tailored to identify the threshold λ in the first stage for the given interim size n 1 , total sample size, and target error rates. In other words, only λ is searched, and n is prespecified. As an example, the value of M obtained from one-arm nonparametric statistics in Table 2 is used for the prespecified total sample size n in scenarios. The results are also summarized in Table 2.
Replacing M = 100 in Table 1 with the value obtained from one-arm nonparametric statistics, we obtained the same results. Tables 1 and 2 show that decision rule of two-stage design yields smaller expected sample size under the null than the one-stage design except the case with ϕ 0 = 8 and ϕ 1 = 17 requires 2 or 3 more patients to be enrolled. When ϕ 0 = 3 and ϕ 1 = 5, the total sample size n for two-stage design is smaller than one-stage design, and the expected sample size under the null is much smaller. More patients can avoid from treating the ineffective drug under the two-stage design.

Results
We investigated the performance of the proposed optimal two-stage design based on median event time test. First, back to the setting we examined for Table 2 with null median 10 months and alternative median 17 months, we are interested how the operating characteristics are changed with follow-up time and accrual rate. Given the specified rule from Table 2 (i.e., n 1 = 32, n = 42 and λ = 15.2), we considered follow-up times 6,9,12,15,18,21,24,27, 30, 36 months and accrual rates 0.5, 0.6, 0.7, 0.8, 0.9, 1.04, 1.1, 1.2, 1.3, 1.5 patients per month. The results are described in Fig 2. The proposed design is robust to the follow-up time but impacted by the accrual rate. As more patients are accrued and we have more available events (i.e., the observed median survival is less), it is more likely to stop earlier due to futility, which increases type II error rate but decrease type I error rate. Therefore, when the study is designed by using the optimal two-stage design based on median event time test, we need to have more reliable information for accrual rate, for example, the previous history for accrual rate and close collaboration with clinicians investigating the study can provide the right design.  Moreover, we examined the performance of the proposed design with non-exponential survivals. Table 3 provides the results when survivals are generated from either uniform or weibull distributions. We used uniform distribution with minimum value 0 and maximum value 2ϕ � to generate flat survivals and Weibull distribution with scale parameter ϕ � /(log 2) 1/2 and shape parameters 2 to obtain an increasing hazard. Compared with the results in Table 2 where survivals are assumed to follow exponential distribution, we can see the different decision rule (n 1 and λ) and required sample size (n) for nonexponential survival times.
We compared the proposed two-stage designs (called METT2E, METT2U, and METT2W for optimal two-stage designs assuming exponential, uniform, and weibull survivals, respectively) with three designs: (1) a restricted KJ design, called r-KJ [12], which tests for whole survival curves based on one-sample log-rank test statistics proposed by Kwak and Jung [20]; (2) a two-stage design minimizing expected sample size, called OES [11]; (3) a two-stage design minimizing expected total study length, called OETSL [11]. Both OES and OETSL use the normalized Z-statistic to test and determine decision rules at each stage. Because r-KJ, OES, and OETSL require the clinically meaningful time point, in our simulations we used 6 months. The null and alternative values of the 6 months survival probability were determined by the survival distribution. We assumed the follow-up time is 24 months and accrual rate is 1.04 patients per month, which is the same as the setting of Tables 2 and 3. Table 4 provides the comparison results of EN 0 and PET 0 for several hypothesis testings based on median survival times ϕ 0 and ϕ 1 . As seen in Tables 2 and 3, both METT2E and METT2U are likely to enroll more patients in the trial (i.e., n is close to M), compared to METT2W in most cases. The number of patients in the first stage (i.e., n 1 ) obtained from METT2U or METT2W is smaller than that of METT2E. Thus, METT2W yields smaller expected sample size than METT2E and METT2U. Since r-KJ does not restrict to certain survival distribution and both OES and OETSL assume the weibull survivals, the results of METT2W are comparable with r-KJ, OES, and OETSL. In most cases, METT2W uses smaller expected sample size and stops the trial early for futility when therapeutic intervention is not effective. METT2E and METT2U also yielded smaller expected sample size and larger probability to stop trial for futility under the null hypothesis compared to r-KJ.

Application: Trial illustration
We provide an application of the proposed methods with a trial NCT00780494. This is a Phase II, single arm, single-institution study of bevacizumab in combination with carboplatin and capecitabine for patients with unresectable or metastatisc gastroesophageal junction or gastric cancers. The study was started on February 2009 and completed on December 2017 to accrue enrollment of 35 participants. It enrolled two patients per month and follow up at time of study completion for 12 months. The study objective is to investigate the efficacy of the addition of bevacizumab to standard chemotherapy based on progression-free survival (PFS), which is defined as the duration of time from the start of treatment to time of disease progression or death. The median PFS is 5 months with the standard treatment therapy and the study team hypothesized the addition of bevacizumab to standard chemotherapy improve PFS by 90%, i.e., the median PFS is 9.5 months (https://clinicaltrials.gov/ProvidedDocs/94/ NCT00780494/Prot_SAP_000.pdf). We applied the proposed methods to redesign the trial based on median PFS. Target error rates are α = 0.05 and β = 0.2, and the maximum sample size is 35 obtained from one-arm nonparametric test. First, we considered a case where investigators want a trial without interim monitoring to collect the data with all 35 patients. The exact median event time test was applied, and provided the decision rule with threshold 7.5 months for a sample of size 29. This yielded 80.48% power. Second, we supposed that investigators want an interim for futility monitoring. Assuming the exponential survivals, we applied optimal two-stage design based on median event time test. We set � 1 = � 2 = 0.005 to get closer empirical error rates to the target rates, and it determined n 1 = 33, n = 34 and λ = 8 months withâ ¼ 0:047 andb ¼ 0:196. The rule implies that an additional patient will be enrolled after the first stage, and it is inappropriate from a practical point of view. Rather than considering two-stage design with a threshold, two-stage design with different thresholds λ 1 and λ 2 would be suggested to obtain the reasonable decision rule. Table 5 provides the summary results of the optimal two-stage design, which search n 1 , n, and λ 2 for a fixed threshold λ 1 for the first stage. When survivals are assumed to follow the exponential distribution and threshold for the first stage is 5.5 months, the optimal two-stage design based on median event time test determined n 1 = 17, n = 32, λ 2 = 8.5 months. Specifically, the interim analysis will be performed based on the first 17 enrolled patients by comparing the observed median PFS with 5.5 months. If the observed median PFS is smaller than or equal to 5.5 months, the study is stopped early for futility. Otherwise, additional 15 patients are enrolled to have the total sample size of 32. At the final analysis, the observed median PFS with 32 patients is compared with the threshold 8.5 months, and we claim the addition of bevacizumab to the standard therapy improves PFS if the observed median PFS is larger than the threshold. This trial decision rule yielded power of 80.2%. When the true median PFS is 5 months, the expected sample size for this trial (i.e., EN) is 17.77 and the probability of early stopping (i.e., PET) is 94%. The required sample size and decision rule are changed for the different choice of λ 1 .
In this trial illustration, PFS was assumed to follow exponential distribution, and the decision rule was identified for the trial under the assumption. We further investigated with the nonexponential assumption. As seen in Table 5, decision rules, especially interim monitoring rules, for exponential versus nonexponential survivals are different, which implies that survival distribution assumption matters to design the clinical trials. In practice, earlier phase trials or experts' knowledge would be able to provide information of survival distribution for Phase II study. We can borrow the information to identify the appropriate decision rule for the study.

Discussion
Most existing methods generally assume exponential survival distribution to develop statistical methods or design based on median survival time for convenience [24,25]. From our simulation studies, we found that operating characteristics of the design depend on the survival distribution, and the decision rule of median event time test is different according to the survival assumption. Type I or II error rate can be inflated when survival distribution is misspecified. It is critical for median event time test to specify survival distribution for robust clinical trial research. The specification of survival distributions can be determined by the relevant pilot study or historical trials.
We have proposed several designs for median event time test: single stage, two stage with single threshold, and two stage with two different thresholds. According to investigators' interest, study objective, and trial assumption, further simulation investigation may be required to explore more rules. Close collaboration with clinicians as well as statistical practice will guide the better and ethic design for the study.