Sample Size Reassessment and Hypothesis Testing in Adaptive Survival Trials

Mid-study design modifications are becoming increasingly accepted in confirmatory clinical trials, so long as appropriate methods are applied such that error rates are controlled. It is therefore unfortunate that the important case of time-to-event endpoints is not easily handled by the standard theory. We analyze current methods that allow design modifications to be based on the full interim data, i.e., not only the observed event times but also secondary endpoint and safety data from patients who are yet to have an event. We show that the final test statistic may ignore a substantial subset of the observed event times. An alternative test incorporating all event times is found, where a conservative assumption must be made in order to guarantee type I error control. We examine the power of this approach using the example of a clinical trial comparing two cancer therapies.


Introduction
There are often strong ethical and economic arguments for conducting interim analyses [1] of an ongoing clinical trial and for making changes to the design if warranted by the accumulating data. One may decide, for example, to increase the sample size on the basis of promising interim results. Or perhaps one might wish to drop a treatment from a multi-arm study on the basis of unsatisfactory safety data. Owing to the complexity of clinical drug development, it is not always possible to anticipate the need for such modifications, and therefore not all contingencies can be dealt with in the statistical design.
Unforeseen interim modifications complicate the frequentist statistical analysis of the trial considerably. Over recent decades many authors have investigated so-called "adaptive designs" in an effort to maintain the concept of type I error control [2][3][4][5][6]. While the theory behind these methods is now well understood if responses are observed immediately, subtle problems arise when responses are delayed, e.g., in survival trials. [7] proposed adaptive survival tests that are constructed using the independent increments property of logrank test statistics [8][9][10]. However, as pointed out by [11], these methods only work if interim decision making is based solely on the interim logrank test statistics and any secondary endpoint data from patients who have already had an event. In other words, investigators must remain blind to the data from patients who are censored at the interim analysis. [12] argue that decisions regarding interim design modifications should be as substantiated as possible, and propose a test procedure that allows investigators to use the full interim data. This methodology, similar to that of [13], does not require any assumptions regarding the joint distribution of survival times and short-term secondary endpoints, as do, e.g., the methods proposed by [14], [15,16] and [17].
In this article we analyze the proposals of [13] and [12] and show that they are both based on weighted inverse-normal test statistics [18], with the common disadvantage that the final test statistic may ignore a substantial subset of the observed survival times. This is a serious limitation, as disregarding part of the observed data is generally considered inappropriate even if statistical error probabilities are controlled; see, for example, the discussion on overrunning in group sequential trials [17]. We quantify the potential inflation of the type I error rate if all observed data were used in these approaches. By adjusting the critical boundaries for the least favourable scenario we derive an alternative testing procedure that allows both sample size reassessment and the use of all observed data.
The article is organized as follows. In Section 2 we review standard adaptive design theory and the recent methods of [13] and [12], and calculate the maximum type I error rate if the ignored data is naively reincorporated into the test statistic. In addition we construct a full-data guaranteed level-α test. In Section 3 we illustrate the procedures in a clinical trial example and discuss the efficiency of the considered testing procedures. We present our conclusions in Section 4. R code to reproduce our results is provided in a supplementary file (S1 File).

Adaptive Designs
Comprehensive accounts of adaptive design methodology can be found in [6,19]. For testing a null hypothesis, $H_0: \theta = 0$, against the one-sided alternative, $H_a: \theta > 0$, the two-stage adaptive test statistic is of the form $f_1(p_1) + f_2(p_2)$, where $p_1$ is the p-value based on first-stage data, $p_2$ is the p-value based on second-stage data, and $f_1$ and $f_2$ are prespecified monotonically decreasing functions. Consider the simplest case that no early rejection of the null hypothesis is possible at the end of the first stage. We will restrict attention to the weighted inverse-normal test statistic [18],
$$Z := w_1 \Phi^{-1}(1 - p_1) + w_2 \Phi^{-1}(1 - p_2), \tag{1}$$
where $\Phi$ denotes the standard normal distribution function and $w_1$ and $w_2$ are prespecified weights such that $w_1^2 + w_2^2 = 1$. If $Z > \Phi^{-1}(1 - \alpha)$, then $H_0$ may be rejected at level α. The assumptions required to make this a valid level-α test are as follows [20].
Assumption 1. Let $X_1^{\text{int}}$ denote the data available at the interim analysis, where $X_1^{\text{int}} \in \mathbb{R}^n$ with distribution function $G(x_1^{\text{int}}; \theta)$. In general, $X_1^{\text{int}}$ will contain information not only concerning the primary endpoint, but also measurements on secondary endpoints and safety data. It is assumed that the first-stage p-value function $p_1: \mathbb{R}^n \to [0, 1]$ satisfies
$$\int_{\mathbb{R}^n} 1\{p_1(x_1^{\text{int}}) \le u\}\, dG(x_1^{\text{int}}; 0) \le u \quad \text{for all } u \in [0, 1].$$

Assumption 2. At the interim analysis, a second-stage design δ is chosen. The second-stage design is allowed to depend on the unblinded first-stage data without prespecifying an adaptation rule. Denote the second-stage data by $Y$, where $Y \in \mathbb{R}^m$. It is assumed that the distribution function of $Y$, denoted by $F_{\delta, x_1^{\text{int}}}(y; \theta)$, is known for all possible second-stage designs, δ, and all first-stage outcomes, $x_1^{\text{int}}$.
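The weighted inverse-normal rule of Eq (1) can be illustrated with a minimal sketch. The function names `inverse_normal_statistic` and `reject_null` are hypothetical, chosen here for illustration, and the second weight is derived from the first under the constraint that the squared weights sum to one.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def inverse_normal_statistic(p1: float, p2: float, w1: float) -> float:
    """Weighted inverse-normal combination of two stage-wise p-values."""
    if not 0.0 < w1 < 1.0:
        raise ValueError("w1 must lie strictly between 0 and 1")
    w2 = sqrt(1.0 - w1 ** 2)  # enforces w1^2 + w2^2 = 1
    return w1 * STD_NORMAL.inv_cdf(1.0 - p1) + w2 * STD_NORMAL.inv_cdf(1.0 - p2)


def reject_null(p1: float, p2: float, w1: float, alpha: float = 0.025) -> bool:
    """Reject H0 when the combined statistic exceeds the level-alpha critical value."""
    return inverse_normal_statistic(p1, p2, w1) > STD_NORMAL.inv_cdf(1.0 - alpha)


if __name__ == "__main__":
    equal_weight = sqrt(0.5)  # illustrative choice: both stages weighted equally
    print(inverse_normal_statistic(0.04, 0.03, equal_weight))
    print(reject_null(0.04, 0.03, equal_weight))
```

The weights must be fixed before the second-stage data are seen; the sketch takes them as given and does not check how they were chosen.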
Immediate responses. The aforementioned assumptions are easy to justify when primary endpoint responses are observed more-or-less immediately. In this case X int 1 contains the responses of all patients recruited prior to the interim analysis. A second-stage design δ can subsequently be chosen with the responses from a new cohort of patients contributing to Y.
Delayed responses and the independent increments assumption. An interim analysis may take place whilst some patients have entered the study but have yet to provide a data point on the primary outcome measure. Most approaches to this problem [7,8,10] attempt to take advantage of the well-known independent increments structure of score statistics in group sequential designs [21]. As pictured in Fig 1, $X_1^{\text{int}}$ will generally include responses on short-term secondary endpoints and safety data from patients who are yet to provide a primary outcome measure, while $Y$ consists of some delayed responses from patients recruited prior to the interim analysis, mixed together with responses from a new cohort of patients.
Let $S(X_1^{\text{int}})$ and $I(X_1^{\text{int}})$ denote the score statistic and Fisher's information for θ, calculated from primary endpoint responses in $X_1^{\text{int}}$. Assuming suitable regularity conditions, the asymptotic null distribution of $S(X_1^{\text{int}})$ is Gaussian with mean zero and variance $I(X_1^{\text{int}})$ [22]. The independent increments assumption is that for all first-stage outcomes $x_1^{\text{int}}$ and second-stage designs δ, the null distribution of $Y$ is such that
$$S(x_1^{\text{int}}, Y) - S(x_1^{\text{int}}) \sim N\{0,\; I(x_1^{\text{int}}, Y) - I(x_1^{\text{int}})\}, \tag{2}$$
at least approximately, where $S(X_1^{\text{int}}, Y)$ and $I(X_1^{\text{int}}, Y)$ denote the score statistic and Fisher's information for θ, calculated from primary endpoint responses in $(X_1^{\text{int}}, Y)$. Unfortunately, Eq (2) is seldom realistic in an adaptive setting. [11] show that if the adaptive strategy at the interim analysis is dependent on short-term outcomes in $X_1^{\text{int}}$ that are correlated with primary endpoint outcomes in $Y$, i.e., from the same patient, then a naive appeal to the independent increments assumption can lead to very large type I error inflation.
Delayed responses with patient-wise separation. An alternative approach, which we call "patient-wise separation", redefines the first-stage p-value, $p_1: \mathbb{R}^p \to [0, 1]$, to be a function of $X_1$, where $X_1$ denotes all the data from patients recruited prior to the interim analysis at calendar time $T_{\text{int}}$, followed up until a pre-fixed maximum calendar time $T_{\max}$. In this case $p_1$ may not be observable at the time the second-stage design δ is chosen. This is not a problem, as long as no early rejection at the end of the first stage is foreseen. Any interim decisions, such as increasing the sample size, do not require knowledge of $p_1$. It is assumed that $Y$ consists of responses from a new cohort of patients, such that $x_1^{\text{int}}$ could be formally replaced with $x_1$ in the aforementioned adaptive design assumptions. We call this patient-wise separation because data from the same patient cannot contribute to both $p_1$ and $p_2$.
[23] and [24] apply this approach when a patient's primary outcome can be measured after a fixed period of follow-up, e.g., 4 months. However, one must take additional care with a time-to-event endpoint, as one is typically not prepared to wait for all first-stage patients (those recruited prior to $T_{\text{int}}$) to have an event. Rather, $p_1$ is defined as the p-value from a statistical test applied to the data from first-stage patients followed up until time $T_{\text{end}}$, for some $T_{\text{end}} \le T_{\max}$. In this case it is vital that $T_{\text{end}}$ be fixed at the start of the trial, either explicitly or implicitly [12,13]. Otherwise, if $T_{\text{end}}$ were to depend on the adaptive strategy at the interim analysis, this would impact the distribution of $p_1$ and could lead to type I error inflation.
The situation is represented pictorially in Fig 2. An unfortunate consequence of pre-fixing T end is that this will not, in all likelihood, correspond to the end of follow-up for second-stage patients. All events from first-stage patients that occur after T end make no contribution to the statistic Eq (1).

Adaptive Survival Studies
Consider a randomized clinical trial comparing survival times on an experimental treatment, E, with those on a control treatment, C. For simplicity, we will focus on the logrank statistic for testing the null hypothesis $H_0: \theta = 0$ against the one-sided alternative $H_a: \theta > 0$, where θ is the log hazard ratio, assuming proportional hazards. Similar arguments could be applied to the Cox model. Let $D_1(t)$ and $S_1(t)$ denote the number of uncensored events and the usual logrank score statistic, respectively, based on the data from first-stage patients (those recruited prior to the interim analysis) followed up until calendar time $t$, $t \in [0, T_{\max}]$. Under the null hypothesis, assuming equal allocation and a large number of events, the variance of $S_1(t)$ is approximately equal to $D_1(t)/4$ [25]. The first-stage p-value must be calculated at a prefixed time point $T_{\text{end}}$:
$$p_1 = 1 - \Phi\left[\frac{2 S_1(T_{\text{end}})}{\{D_1(T_{\text{end}})\}^{1/2}}\right]. \tag{3}$$
The number of events can be prefixed at $d_1$, say, with $T_{\text{end}}$ chosen implicitly as
$$T_{\text{end}} := \min\{t : D_1(t) = d_1\}. \tag{4}$$
Jenkins et al. method.
[13] describe a "patient-wise separation" adaptive survival trial, with test statistic Eq (1), first-stage p-value Eq (3) and $T_{\text{end}}$ defined as in Eq (4). While their focus is on subgroup selection, we will appropriate their method for the simpler situation of a single comparison, where at the interim analysis one has the possibility to adapt the preplanned number of events from second-stage patients, i.e., those patients recruited after $T_{\text{int}}$. The weights in Eq (1) are pre-fixed in proportion to the pre-planned number of events to be contributed from each stage, i.e., $w_1^2 = d_1/(d_1 + d_2)$, where $d_1 + d_2$ is the total originally required number of events. The second-stage p-value corresponds to a logrank test based on second-stage patients, i.e.,
$$p_2 = 1 - \Phi\left[\frac{2 S_2(T_2^*)}{\{D_2(T_2^*)\}^{1/2}}\right],$$
where $T_2^* := \min\{t : D_2(t) = d_2^*\}$ with $S_2(t)$ and $D_2(t)$ defined analogously to $S_1(t)$ and $D_1(t)$, and where $d_2^*$ is specified at the interim analysis.
Irle and Schäfer method. Instead of explicitly combining stage-wise p-values, [12] employ the closely related "conditional error" approach [3,4,26]. They begin by prespecifying a level-α test with decision function, φ, taking values in {0, 1} corresponding to non-rejection and rejection of $H_0$, respectively. For a survival trial, this entails specifying the sample size, duration of follow-up, test statistic, recruitment rate, etc. Then, at some not necessarily prespecified timepoint, $T_{\text{int}}$, an interim analysis is performed. The timing of the interim analysis induces a partition of the trial data, $(X_1, X_2)$, where $X_1$ and $X_2$ denote the data from patients recruited before and after $T_{\text{int}}$, respectively, followed up until time $T_{\max}$. For a standard log-rank test, the decision function is
$$\varphi(X_1, X_2) = 1\left\{\frac{2 S(T_{\text{end}})}{\{D(T_{\text{end}})\}^{1/2}} > \Phi^{-1}(1 - \alpha)\right\}, \tag{5}$$
where $D(T_{\text{end}})$ and $S(T_{\text{end}})$ denote the number of uncensored events and the usual logrank score statistic, respectively, based on data from all patients followed up until time $T_{\text{end}} := \min\{t : D(t) = d\}$ for some prespecified number of events $d$.
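The stage-wise logrank quantities described above can be combined as in the following minimal sketch. It assumes equal allocation, the large-sample variance approximation of one quarter of the number of events, and weights whose squares are proportional to the planned events per stage; all function names and the numbers in the usage example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def logrank_z(score: float, events: int) -> float:
    """Standardised logrank statistic, using Var(S) ~ D/4 under H0."""
    return 2.0 * score / sqrt(events)


def logrank_p_value(score: float, events: int) -> float:
    """One-sided p-value of the logrank test (large values favour E)."""
    return 1.0 - STD_NORMAL.cdf(logrank_z(score, events))


def planned_weights(d1_planned: int, d2_planned: int) -> tuple:
    """Pre-fixed weights with squares proportional to the planned events per stage."""
    total = d1_planned + d2_planned
    return sqrt(d1_planned / total), sqrt(d2_planned / total)


def patientwise_separation_test(s1, d1_obs, s2, d2_obs,
                                d1_planned, d2_planned, alpha=0.025):
    """Combine first- and second-stage logrank z-scores with pre-fixed weights.

    Since Phi^{-1}(1 - p_k) equals the stage-k z-score, the z-scores are
    combined directly instead of converting to p-values and back.
    """
    w1, w2 = planned_weights(d1_planned, d2_planned)
    z = w1 * logrank_z(s1, d1_obs) + w2 * logrank_z(s2, d2_obs)
    return z, z > STD_NORMAL.inv_cdf(1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical data: the second stage was extended from 100 to 144 events,
    # but the weights still reflect the original plan of 100 + 100 events.
    print(patientwise_separation_test(10.0, 100, 12.0, 144, 100, 100))
```

Note that the observed second-stage event count enters only through the second-stage z-score; the weights do not change when the number of events is adapted.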
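At the interim analysis, the general idea is to use the unblinded first-stage data $x_1^{\text{int}}$ to define a second-stage design, δ, without the need for a prespecified adaptation strategy. Again, the definition of δ includes factors such as sample size, follow-up period, recruitment rate, etc., in addition to a second-stage decision function $\psi: \mathbb{R}^m \to \{0, 1\}$ based on second-stage data $Y \in \mathbb{R}^m$. Ideally, one would like to choose ψ such that $E_{H_0}(\psi \mid X_1^{\text{int}} = x_1^{\text{int}}) = E_{H_0}(\varphi \mid X_1^{\text{int}} = x_1^{\text{int}})$, as this would ensure that
$$E_{H_0}(\psi) = E_{H_0}\{E_{H_0}(\psi \mid X_1^{\text{int}})\} = E_{H_0}\{E_{H_0}(\varphi \mid X_1^{\text{int}})\} = E_{H_0}(\varphi) \le \alpha, \tag{6}$$
i.e., the overall procedure controls the type I error rate at level α. Unfortunately, this approach is not directly applicable in a survival trial where $X_1^{\text{int}}$ contains short-term data from first-stage patients surviving beyond $T_{\text{int}}$. This is because it is impossible to calculate $E_{H_0}(\varphi \mid X_1^{\text{int}} = x_1^{\text{int}})$, owing to the unknown joint distribution of survival times and the secondary endpoints already observed at the interim analysis. One may, however, condition on $X_1$ rather than on $X_1^{\text{int}}$ and choose ψ such that $E_{H_0}(\psi \mid X_1 = x_1) = E_{H_0}(\varphi \mid X_1 = x_1)$, thus ensuring type I error control following the same argument as Eq (6). For example, it is possible to extend patient follow-up and use the second-stage decision function
$$\psi(X_1, Y) = 1\{S(T^*) - S_1(T^*) \ge b^*\}, \tag{7}$$
where $T^* := \min\{t : D(t) = d^*\}$, $d^* \ge d$ is chosen at the interim analysis, and $b^*$ is a cutoff value that must be determined. [12] show that, asymptotically,
$$E_{H_0}(\varphi \mid X_1 = x_1) = E_{H_0}\{\varphi \mid S_1(T_{\text{end}}) = s_1\} \quad \text{and} \quad E_{H_0}(\psi \mid X_1 = x_1) = E_{H_0}\{\psi \mid S_1(T^*) = s_1^*\}, \tag{8}$$
where $s_1$ and $s_1^*$ denote the observed values of $S_1(T_{\text{end}})$ and $S_1(T^*)$. In each case, calculation of the right-hand side is facilitated by the asymptotic result that, assuming equal allocation under the null hypothesis, for $t \in [0, T_{\max}]$,
$$\begin{pmatrix} S_1(t) \\ S(t) - S_1(t) \end{pmatrix} \sim N\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} D_1(t)/4 & 0 \\ 0 & \{D(t) - D_1(t)\}/4 \end{pmatrix} \right\}. \tag{9}$$
One remaining subtlety is that $E_{H_0}\{\psi \mid S_1(T^*) = s_1^*\}$ can only be calculated at calendar time $T^*$, where $T^* > T_{\text{int}}$. Determination of $b^*$ must therefore be postponed until this later time.
Using result Eq (8), it is straightforward to show that ψ = 1 if and only if $Z > \Phi^{-1}(1 - \alpha)$, where $Z$ is defined as in Eq (1) with $p_1$ defined as in Eq (3), the second-stage p-value function defined as
$$p_2 = 1 - \Phi\left[\frac{2\{S(T^*) - S_1(T^*)\}}{\{D(T^*) - D_1(T^*)\}^{1/2}}\right]$$
and the specific choice of weighting $w_1^2 = D_1(T_{\text{end}})/D(T_{\text{end}})$. Full details are provided in supplementary material (S2 File).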
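The equivalence just stated can be outlined in a few lines; the complete argument is in S2 File. This sketch assumes that, under the null hypothesis with equal allocation, $S_1(t)$ and $S(t) - S_1(t)$ are asymptotically independent with variances $D_1(t)/4$ and $\{D(t) - D_1(t)\}/4$, that ψ rejects when $S(T^*) - S_1(T^*)$ exceeds the cutoff $b^*$, and it introduces the shorthand $z_\alpha := \Phi^{-1}(1 - \alpha)$, with $s_1$ the observed value of $S_1(T_{\text{end}})$.

```latex
\begin{align*}
E_{H_0}\{\varphi \mid S_1(T_{\text{end}}) = s_1\}
  &= 1 - \Phi\!\left[\frac{z_\alpha \{D(T_{\text{end}})\}^{1/2} - 2 s_1}
                          {\{D(T_{\text{end}}) - D_1(T_{\text{end}})\}^{1/2}}\right], \\
E_{H_0}\{\psi \mid S_1(T^*) = s_1^*\}
  &= 1 - \Phi\!\left[\frac{2 b^*}{\{D(T^*) - D_1(T^*)\}^{1/2}}\right]. \\
\intertext{Equating the two conditional errors fixes $b^*$, so that $\psi = 1$ if and only if}
\frac{2\{S(T^*) - S_1(T^*)\}}{\{D(T^*) - D_1(T^*)\}^{1/2}}
  &\ge \frac{z_\alpha \{D(T_{\text{end}})\}^{1/2} - 2 s_1}
            {\{D(T_{\text{end}}) - D_1(T_{\text{end}})\}^{1/2}}. \\
\intertext{Multiplying through by $w_2 = [\{D(T_{\text{end}}) - D_1(T_{\text{end}})\}/D(T_{\text{end}})]^{1/2}$ and writing $w_1 = \{D_1(T_{\text{end}})/D(T_{\text{end}})\}^{1/2}$ gives}
w_1 \frac{2 s_1}{\{D_1(T_{\text{end}})\}^{1/2}} + w_2\, \Phi^{-1}(1 - p_2) &\ge z_\alpha .
\end{align*}
```

The left-hand side is the inverse-normal statistic with the first-stage p-value evaluated at the end of the originally planned follow-up, which is why later first-stage events do not enter the test.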
Remark 1. The Irle and Schäfer method uses the same test statistic as the Jenkins et al. method, with a clever way of implicitly defining the weights and the end of first-stage follow-up, $T_{\text{end}}$. It has two potential advantages. Firstly, the timing of the interim analysis need not be prespecified: in theory, one is permitted to monitor the accumulating data and at any moment decide that design changes are necessary. Secondly, if no changes to the design are necessary, i.e., the trial completes as planned at calendar time $T_{\text{end}}$, then the original test Eq (5) is performed. In this special case, no events are ignored in the final test statistic.
Remark 2. From first glance at Eq (7), it may appear that the events from first-stage patients, occurring after $T_{\text{end}}$, always make a contribution to the final test statistic. However, this data is still effectively ignored. We have shown in the online supplement that the procedure is equivalent to a p-value combination approach where $p_1$ depends only on data available at time $T_{\text{end}}$. In addition, the distribution of $p_2$ is asymptotically independent of the data from first-stage patients: note that $S(T^*) - S_1(T^*)$ and $S_2(T^*)$ are asymptotically equivalent [12]. The procedure therefore fits our description of a "patient-wise separation" design, and the picture is the same as in Fig 2. The first-stage patients have in effect been censored at $T_{\text{end}}$, despite having been followed up for longer. This fact has important implications for the choice of $d^*$. If one chooses $d^*$ based on conditional power arguments, one should be aware that the effective sample size has not increased by $d^* - d$. Rather, it has increased by $d^* - d - \{D_1(T^*) - D_1(T_{\text{end}})\}$, which could be very much smaller.
Remark 3. A potential disadvantage of the Irle and Schäfer method compared to the Jenkins et al. method is that one is not permitted to adapt any aspect of the recruitment process prior to time $T_{\text{end}}$. Contrary to what is claimed in [12], it is not valid to extend the recruitment period (or speed up recruitment as in the example they give) to reach an increased number of events $d^*$ within the originally planned trial duration. This is because $T_{\text{end}}$ is defined implicitly as $T_{\text{end}} := \min\{t : D(t) = d\}$ under the assumptions of the original design. Therefore $T_{\text{end}}$ is unobservable if the recruitment process is changed in response to the interim data. [27] discuss this issue further.
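The effective increase in events noted in Remark 2 is simple arithmetic, sketched below. The function name and the event counts in the usage example are hypothetical and serve only to show how much of a nominal extension can be lost.

```python
def effective_event_increase(d_planned: int, d_star: int,
                             d1_at_t_end: int, d1_at_t_star: int) -> int:
    """Events effectively added to the test when the target rises from d to d*.

    First-stage events occurring between T_end and T* are counted towards d*
    but are ignored by the test statistic, so they are subtracted.
    """
    if d_star < d_planned:
        raise ValueError("d* must be at least the originally planned d")
    if d1_at_t_star < d1_at_t_end:
        raise ValueError("first-stage events cannot decrease over time")
    ignored_first_stage_events = d1_at_t_star - d1_at_t_end
    return d_star - d_planned - ignored_first_stage_events


if __name__ == "__main__":
    # Hypothetical counts: nominal increase of 102 events, 21 of which are
    # late first-stage events that do not enter the final statistic.
    print(effective_event_increase(248, 350, 149, 170))
```

A conditional power calculation at the interim analysis would need to use this reduced number, which in turn requires a prediction of how many late first-stage events will occur.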

Hypothesis tests based on all available follow-up data
Suppose that the trial continues until calendar time $T^* \in (T_{\text{end}}, T_{\max})$. Data from first-stage patients (those recruited prior to $T_{\text{int}}$) accumulating between times $T_{\text{end}}$ and $T^*$ should be ignored. In this section we will investigate what happens, in a worst-case scenario, if this illegitimate data is naively incorporated into the adaptive test statistic Eq (1). Specifically, we find the maximum type I error associated with the test statistic
$$Z^* := w_1 \frac{2 S_1(T^*)}{\{D_1(T^*)\}^{1/2}} + w_2 \Phi^{-1}(1 - p_2). \tag{10}$$
Since $T^*$ depends on the interim data in a complicated way, the null distribution of Eq (10) is unknown. One can, however, consider properties of the stochastic process
$$Z(t) := w_1 \frac{2 S_1(t)}{\{D_1(t)\}^{1/2}} + w_2 \Phi^{-1}(1 - p_2), \quad t \in [T_{\text{end}}, T_{\max}].$$
In other words, we consider continuous monitoring of the logrank statistic based on first-stage patient data. The worst-case scenario assumption is that the responses on short-term secondary endpoints, available at the interim analysis, can be used to predict the exact calendar time the process $Z(t)$ reaches its maximum. In this case, one could attempt to engineer the second-stage design such that $T^*$ coincides with this timepoint, and the worst-case type I error rate is therefore
$$\max \alpha := P_{H_0}\left\{\max_{t \in [T_{\text{end}}, T_{\max}]} Z(t) > \Phi^{-1}(1 - \alpha)\right\}. \tag{11}$$
Although the worst-case scenario assumption is clearly unrealistic, Eq (11) serves as an upper bound on the type I error rate. It can be found approximately via standard Brownian motion results. Let $u := D_1(t)/D_1(T_{\max})$ denote the information time at calendar time $t$, and let $S_1(u)$ denote the logrank score statistic based on first-stage patients, followed up until information time $u$. It can be shown that $B(u) := S_1(u)/\{D_1(T_{\max})/4\}^{1/2}$ behaves asymptotically like a Brownian motion with drift $\xi := \theta \{D_1(T_{\max})/4\}^{1/2}$ [28]. We wish to calculate
$$P_{\theta = 0}\left\{\max_{u_1 \le u \le 1} \left[w_1 \frac{B(u)}{u^{1/2}} + w_2 \Phi^{-1}(1 - p_2)\right] > \Phi^{-1}(1 - \alpha)\right\} = \int_0^1 P_{\theta = 0}\left[B(u) > \frac{u^{1/2}}{w_1}\{\Phi^{-1}(1 - \alpha) - w_2 \Phi^{-1}(1 - x)\} \text{ for some } u \in [u_1, 1]\right] dx, \tag{12}$$
where $u_1 = D_1(T_{\text{end}})/D_1(T_{\max})$ and the integral averages over the null distribution of $p_2$, which is uniform on [0, 1]. While the integrand on the right-hand side is difficult to evaluate exactly, it can be found to any required degree of accuracy by replacing the square root stopping boundary with a piecewise linear boundary [29].
The two parameters that govern the size of Eq (11) are $w_1$ and $u_1$. Larger values of $w_1$ reflect an increased weighting of the first-stage data, which increases the potential inflation. In addition, a low value for $u_1$ increases the window of opportunity for stopping on a random high. Table 1 shows that for a nominal α = 0.025 level test, the worst-case type I error can be up to 15% when $u_1 = 0.1$ and $w_1 = 0.9$. As $u_1 \to 0$ the worst-case type I error rate tends to 1 for any value of $w_1 > 0$ [30].
A full-data guaranteed level-α test. If one is unprepared to give up the guarantee of type I error control, an alternative test can be found by increasing the cut-off value for $Z^*$ from $\Phi^{-1}(1 - \alpha)$ to a larger value $k^*$, chosen such that the worst-case type I error rate Eq (11) is equal to α.

Results
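As a rough plausibility check on the size of the worst-case inflation, the maximum over the Brownian motion path can be approximated by Monte Carlo on a discrete grid of information times. This is not the piecewise linear boundary method of [29]; it is a crude sketch in which the function name, grid, number of paths and seed are all assumptions, and because the maximum is taken over a finite grid the estimate will tend to understate the continuous-monitoring value.

```python
import random
from math import exp, log, sqrt
from statistics import NormalDist


def worst_case_type1_error(w1, u1, alpha=0.025, n_paths=10000, n_grid=50, seed=1):
    """Monte Carlo estimate of P(max_u Z(u) > critical value) under H0.

    Z(u) = w1 * B(u) / sqrt(u) + w2 * Z2, where B is a driftless Brownian
    motion observed on a log-spaced grid from u1 to 1 and Z2 ~ N(0, 1)
    plays the role of Phi^{-1}(1 - p2). n_grid=0 means no extension beyond u1.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - alpha)
    w2 = sqrt(1.0 - w1 ** 2)
    if n_grid == 0:
        grid = [u1]
    else:
        grid = [exp(log(u1) * (1.0 - k / n_grid)) for k in range(n_grid + 1)]
    # Pre-compute (sd of the increment, sqrt of the information time) per step.
    steps = [(sqrt(cur - prev), sqrt(cur)) for prev, cur in zip(grid, grid[1:])]
    root_u1 = sqrt(grid[0])
    rejections = 0
    for _ in range(n_paths):
        z2 = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, root_u1)       # B(u1) ~ N(0, u1)
        best = b / root_u1                # standardised first-stage statistic at u1
        for step_sd, root_u in steps:
            b += rng.gauss(0.0, step_sd)  # independent Brownian increment
            best = max(best, b / root_u)
        if w1 * best + w2 * z2 > crit:
            rejections += 1
    return rejections / n_paths


if __name__ == "__main__":
    print(worst_case_type1_error(w1=0.9, u1=0.1))
```

Setting `n_grid=0` removes the opportunity to choose the analysis time, in which case the estimate should be close to the nominal level.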

Clinical trial example
The upper bound on the type I error rate varies substantially across $w_1$ and $u_1$. To give an indication of what can be expected in practice, consider a simplified version of the trial described in [12]. A randomized trial is set up to compare chemotherapy (C) with a combination of radiotherapy and chemotherapy (E). The anticipated median survival time on C is 14 months. If E were to increase the median survival time to 20 months then this would be considered a clinically relevant improvement. Assuming exponential survival times, this gives anticipated hazard rates $\lambda_C = 0.050$ and $\lambda_E = 0.035$, and a target log hazard ratio of $\theta_R = -\log(\lambda_E/\lambda_C) \approx 0.36$. If the error rates for testing $H_0: \theta = 0$ against $H_a: \theta = \theta_R$ are α = 0.025 (one-sided) and β = 0.2, the required number of deaths (assuming equal allocation) is
$$d = \frac{4\{\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)\}^2}{\theta_R^2} \approx 248.$$
The relationship between the required number of events and the sample size depends on the recruitment pattern, and we will consider two scenarios. In our "slow recruitment" scenario, patients are recruited uniformly at a rate of 8 per month for a maximum of 60 months with an interim analysis performed at 23 months. In our "fast recruitment" scenario, patients are recruited uniformly at a rate of 50 per month for a maximum of 18 months with an interim analysis after 8 months. In both cases, the only adaptation we allow at the interim analysis is to increase the number of events. Recruitment must continue as planned but the follow-up period may be extended. The maximum duration of the trial is restricted to 100 months in the first case and 30 months in the second case. Fig 3 shows the expected number of events as a function of time for both scenarios assuming exponentially distributed survival times with hazards equal to the planned values.
The maximum type I error inflation, determined via $w_1$ and $u_1$, will depend on the observed number of events from first- and second-stage patients at calendar times $T_{\text{int}}$ and $T_{\text{end}}$. However, the expected pattern of events in Fig 3 provides some indication. In the slow recruitment scenario, we expect 179 events by the maximum trial duration from the patients recruited prior to the interim analysis (8 per month over 23 months). We also expect 149 of the first 248 events to come from these patients. These numbers would give $w_1 = (149/248)^{1/2}$, $u_1 = 149/179$ and, according to Eq (12), max α = 0.044. For the fast recruitment scenario the respective quantities are $w_1 = (169/248)^{1/2}$, $u_1 = 169/264$ and max α = 0.060.
On the efficiency of the full-data level-α test. Consider the full-data guaranteed level-α test defined above. Recall that this test has the advantage of allowing interim decision making to be based on all available data whilst using a final test statistic that takes account of all observed event times. Unfortunately, this advantage is likely to be outweighed by the loss in power resulting from the increased cut-off value, as can be seen in Fig 4. The difference between the noncentrality parameters of $Z(T^*)$ and $Z(T_{\text{end}})$ is plotted against the time extension $T^* - T_{\text{end}}$ for various choices of θ. In the slow recruitment scenario the increase in the noncentrality parameter is outweighed by the increase in the cut-off value, even when the log hazard ratio is as large as was expected in the planning phase. In the fast recruitment setting, it is possible for the increase in the noncentrality parameter to exceed the increase in the cut-off value when the trial is extended substantially. However, the trial would typically only need to be extended substantially if the true effect size were lower than planned. And in this case ($\theta \le 0.66\,\theta_R$) one can see that the increased cut-off value still dominates.
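The target effect size and the required number of deaths quoted above can be checked with Schoenfeld's approximation for equal allocation. The sketch below is illustrative only, with hypothetical function names; evaluated with unrounded inputs it gives roughly 247 events, in line with the 248 used in the example.

```python
from math import log
from statistics import NormalDist


def target_log_hazard_ratio(median_control: float, median_experimental: float) -> float:
    """Log hazard ratio (positive when E is better) under exponential survival."""
    hazard_control = log(2.0) / median_control
    hazard_experimental = log(2.0) / median_experimental
    return -log(hazard_experimental / hazard_control)


def required_events(log_hazard_ratio: float, alpha: float = 0.025,
                    beta: float = 0.2) -> float:
    """Schoenfeld's approximation to the number of deaths, equal allocation."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1.0 - alpha) + std_normal.inv_cdf(1.0 - beta)
    return 4.0 * z_sum ** 2 / log_hazard_ratio ** 2


if __name__ == "__main__":
    theta_r = target_log_hazard_ratio(14.0, 20.0)
    print(theta_r, required_events(theta_r))
```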
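The expected event counts behind these values of the weight and the information fraction can be reproduced approximately in closed form. The sketch below assumes uniform recruitment, equal allocation, exponential survival with the planned hazards of 0.050 and 0.035 per month, and no censoring other than the end of follow-up; the function names are hypothetical and this is not the code from S1 File.

```python
from math import exp, sqrt

HAZARDS = (0.050, 0.035)  # planned monthly hazards on C and E


def expected_events(t, rate, recruit_start, recruit_end):
    """Expected deaths by calendar time t among patients recruited uniformly
    at `rate` per month during [recruit_start, recruit_end], half per arm."""
    end = min(t, recruit_end)
    if end <= recruit_start:
        return 0.0
    total = 0.0
    for hazard in HAZARDS:
        # Integral over recruitment times r of the survival probability exp(-hazard * (t - r)).
        surviving = (exp(-hazard * (t - end)) - exp(-hazard * (t - recruit_start))) / hazard
        total += 0.5 * ((end - recruit_start) - surviving)
    return rate * total


def first_time_reaching(target, rate, recruit_end, t_max):
    """Earliest calendar time at which the expected total events reach `target` (bisection)."""
    if expected_events(t_max, rate, 0.0, recruit_end) < target:
        raise ValueError("target not reached within the maximum trial duration")
    lo, hi = 0.0, t_max
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if expected_events(mid, rate, 0.0, recruit_end) < target:
            lo = mid
        else:
            hi = mid
    return hi


def scenario_summary(rate, recruit_end, t_int, t_max, d=248):
    """Expected first-stage events at T_end and T_max, with the implied w1 and u1."""
    t_end = first_time_reaching(d, rate, recruit_end, t_max)
    d1_end = expected_events(t_end, rate, 0.0, t_int)
    d1_max = expected_events(t_max, rate, 0.0, t_int)
    return {"t_end": t_end, "d1_end": d1_end, "d1_max": d1_max,
            "w1": sqrt(d1_end / d), "u1": d1_end / d1_max}


if __name__ == "__main__":
    print(scenario_summary(rate=8, recruit_end=60, t_int=23, t_max=100))   # slow
    print(scenario_summary(rate=50, recruit_end=18, t_int=8, t_max=30))    # fast
```

These are expectations under the planned hazards; in an actual trial the weight and information fraction depend on the observed event counts.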
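The trade-off described above can be written out under the same large-sample approximation used for the Brownian motion result, namely that the logrank score from first-stage patients has mean approximately $\theta D_1(t)/4$ and variance $D_1(t)/4$. The following is a sketch under that assumption, treating the second-stage term as common to both statistics; $k$ denotes the increased cut-off and is notation introduced here.

```latex
\begin{align*}
E_\theta\!\left[\frac{2 S_1(t)}{\{D_1(t)\}^{1/2}}\right] &\approx \frac{\theta}{2}\{D_1(t)\}^{1/2}, \\
E_\theta\{Z(T^*)\} - E_\theta\{Z(T_{\text{end}})\}
  &\approx \frac{w_1 \theta}{2}\left[\{D_1(T^*)\}^{1/2} - \{D_1(T_{\text{end}})\}^{1/2}\right].
\end{align*}
```

The full-data test gains power only when this difference exceeds the increase in the cut-off, $k - \Phi^{-1}(1 - \alpha)$. Because the gain grows with the square root of the first-stage event count and is scaled by θ, it is small when the true effect is small, which is the situation in which an extension is most likely to be requested.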

Discussion
Unblinded sample-size recalculation has been criticized for its lack of efficiency relative to classical group sequential designs [31,32]. If the recalculation is made on the basis of an early estimate of treatment effect, the final sample size is likely to have high variability [33], and, in addition, the test decision is based on a non-sufficient statistic. [34] show how, for a given reestimation rule, a classical group sequential design can be found with an essentially identical power function but lower expected sample size.
In response to these arguments [35] emphasize that "the real benefit of the adaptive approach arises through the ability to invest sample resources into the trial in stages". An efficient group sequential trial, on the other hand, requires a large up-front sample size commitment and aggressive early stopping boundaries. From the point of view of the trial sponsor, the added flexibility may in some circumstances outweigh the loss of efficiency.
In this paper we have shown that when the primary endpoint is time-to-event, a fully unblinded sample-size recalculation (i.e., a decision based on all available efficacy and safety data) has additional drawbacks not considered in the aforementioned literature. Recently proposed methods [12,13] share the common disadvantage that some patients' event times are ignored in the final test statistic. This is usually deemed unacceptable by regulators. Furthermore, it is the long-term data of patients recruited prior to the interim analysis that is ignored, such that more emphasis is put on early events in the final decision making. This neglect becomes more serious, therefore, if the hazard rates differ substantially only at large survival times. Note, however, that a standard logrank test would already be inefficient in this scenario [36].
The relative benefit of the Irle and Schäfer method [12], in comparison with that of Jenkins et al. [13], is that the timing of the interim analysis need not be pre-specified and, in addition, the method is efficient if no design changes are necessary. On the other hand, the Irle and Schäfer method has the serious practical flaw that it is not permissible to change any aspect of the recruitment process in response to the interim data.
Confirmatory clinical trials with time-to-event endpoints appear to be one of the most important fields of application of adaptive methods [37]. It is therefore especially important that investigators considering an unblinded sample size re-estimation in this context are aware of the additional issues involved. We have shown that all considered procedures will require giving up an important statistical property, a situation summarized succinctly in Table 2.
The relevance of these issues is highlighted by the recently published VALOR trial in acute myeloid leukaemia [38]. Treatment effect estimates from phase II data suggested that 375 events might be sufficient to confirm efficacy. However, there is always uncertainty surrounding such an estimate. A smaller effect size, corresponding to upwards of 500 required events, would still be clinically meaningful, but funding such a trial was beyond the resources of the study sponsor. The solution was to initiate the trial with the smaller sample size but plan an interim analysis, whereby promising results would trigger additional investment. In this case, the interim decision rules were pre-specified and, upon observing a promising hazard ratio after 173 events, the total required number of events was increased to 562. The final analysis was based on a weighted combination of log-rank statistics, corresponding to method (A) in Table 2. It is important to emphasize that the validity of this approach relies on the second-stage sample size being a function of the interim hazard ratio. Had other information (e.g., disease progressions) played a part in the interim decision making, then the type I error rate could have been compromised as described in this paper.
While statistical theory can be developed to control the type I error rate given certain model assumptions, there is always the potential for "operational bias" to enter an adaptive trial. FDA draft guidance [39] emphasizes the need to shield investigators as much as possible from knowledge of the adaptive changes. The very knowledge that the sample size has been increased, implying a "promising" interim effect estimate, could lead to changes of behavior in terms of treating, managing, and evaluating study participants. As a minimum, the European Medicines Agency requires that the primary analysis "be stratified according to whether patients were randomized before or after the protocol amendment" [40]. Aside from the regulatory importance, it is also in the sponsor's interest to minimize operational bias when trial outcomes will influence significant investment decisions [41]. For a further discussion on the regulatory and logistical challenges sponsors may face we refer to [6,19].
We have focussed our attention on the type I error control and power of the various procedures. Estimation of the treatment effect size following an adaptive survival trial is also an important topic. Currently available methods can be found in [8], [42] and [43].
Supporting Information
S1 File. R code to reproduce results.