## Abstract

Mid-study design modifications are becoming increasingly accepted in confirmatory clinical trials, so long as appropriate methods are applied such that error rates are controlled. It is therefore unfortunate that the important case of time-to-event endpoints is not easily handled by the standard theory. We analyze current methods that allow design modifications to be based on the full interim data, i.e., not only the observed event times but also secondary endpoint and safety data from patients who are yet to have an event. We show that the final test statistic may ignore a substantial subset of the observed event times. An alternative test incorporating all event times is found, where a conservative assumption must be made in order to guarantee type I error control. We examine the power of this approach using the example of a clinical trial comparing two cancer therapies.

**Citation:** Magirr D, Jaki T, Koenig F, Posch M (2016) Sample Size Reassessment and Hypothesis Testing in Adaptive Survival Trials. PLoS ONE 11(2): e0146465. doi:10.1371/journal.pone.0146465

**Editor:** Robert K. Hills, Cardiff University, UNITED KINGDOM

**Received:** July 31, 2015; **Accepted:** December 17, 2015; **Published:** February 10, 2016

**Copyright:** © 2016 Magirr et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data are within the paper and code is available at http://dx.doi.org/10.6084/m9.figshare.1608320.

**Funding:** DM was funded by the Medical Research Council (MR/J004979/1) and the Austrian Science Fund (FWF P23167). TJ was supported by the National Institute for Health Research (NIHR-CDF-2010-03-32). FK was supported by European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 602552 (IDEAL). MP was supported by the EU FP7 HEALTH.2013.4.2-3 project Asterix (Grant Number 603160). The views expressed in this publication are those of the authors and should not be attributed to any of the funding institutions, or organisations with which the authors are affiliated. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

## 1 Introduction

There are often strong ethical and economic arguments for conducting interim analyses [1] of an ongoing clinical trial and for making changes to the design if warranted by the accumulating data. One may decide, for example, to increase the sample size on the basis of promising interim results. Or perhaps one might wish to drop a treatment from a multi-arm study on the basis of unsatisfactory safety data. Owing to the complexity of clinical drug development, it is not always possible to anticipate the need for such modifications, and therefore not all contingencies can be dealt with in the statistical design.

Unforeseen interim modifications complicate the frequentist statistical analysis of the trial considerably. Over recent decades many authors have investigated so-called “adaptive designs” in an effort to maintain the concept of type I error control [2–6]. While the theory behind these methods is now well understood if responses are observed immediately, subtle problems arise when responses are delayed, e.g., in survival trials.

[7] proposed adaptive survival tests that are constructed using the independent increments property of logrank test statistics [8–10]. However, as pointed out by [11], these methods only work if interim decision making is based solely on the interim logrank test statistics and any secondary endpoint data from patients who have already had an event. In other words, investigators must remain blind to the data from patients who are censored at the interim analysis. [12] argue that decisions regarding interim design modifications should be as substantiated as possible, and propose a test procedure that allows investigators to use the full interim data. This methodology, similar to that of [13], does not require any assumptions regarding the joint distribution of survival times and short-term secondary endpoints, as do, e.g., the methods proposed by [14], [15, 16] and [17].

In this article we analyze the proposals of [13] and [12] and show that they are both based on weighted inverse-normal test statistics [18], with the common disadvantage that the final test statistic may ignore a substantial subset of the observed survival times. This is a serious limitation, as disregarding part of the observed data is generally considered inappropriate even if statistical error probabilities are controlled; see, for example, the discussion on overrunning in group sequential trials [17]. We quantify the potential inflation of the type I error rate if all observed data were used in these approaches. By adjusting the critical boundaries for the least favourable scenario we derive an alternative testing procedure that allows both sample size reassessment and the use of all observed data.

The article is organized as follows. In Section 2 we review standard adaptive design theory and the recent methods of [13] and [12], and calculate the maximum type I error rate if the ignored data is naively reincorporated into the test statistic. In addition we construct a full-data guaranteed level-*α* test. In Section 3 we illustrate the procedures in a clinical trial example and discuss the efficiency of the considered testing procedures. We present our conclusions in Section 4. R code to reproduce our results is provided in a supplementary file (S1 File).

## 2 Methods

### 2.1 Adaptive Designs

Comprehensive accounts of adaptive design methodology can be found in [6, 19]. For testing a null hypothesis, *H*_{0}: *θ* = 0, against the one-sided alternative, *H*_{a}: *θ* > 0, the two-stage adaptive test statistic is of the form *f*_{1}(*p*_{1}) + *f*_{2}(*p*_{2}), where *p*_{1} is the p-value based on first-stage data, *p*_{2} is the p-value based on second-stage data, and *f*_{1} and *f*_{2} are prespecified monotonically decreasing functions. Consider the simplest case that no early rejection of the null hypothesis is possible at the end of the first stage. We will restrict attention to the weighted inverse-normal test statistic [18],
$$Z = w_{1}\,\Phi^{-1}(1 - p_{1}) + w_{2}\,\Phi^{-1}(1 - p_{2}), \tag{1}$$
where Φ denotes the standard normal distribution function and *w*_{1} and *w*_{2} are prespecified weights such that *w*_{1}^{2} + *w*_{2}^{2} = 1. If *Z* > Φ^{−1}(1 − *α*), then *H*_{0} may be rejected at level *α*. The assumptions required to make this a valid level-*α* test are as follows [20].

**Assumption 1**

Let *X*_{1}^{int} denote the data available at the interim analysis, where *X*_{1}^{int} ∈ ℝ^{n} with distribution function *G*(*x*_{1}^{int}; *θ*). In general, *X*_{1}^{int} will contain information not only concerning the primary endpoint, but also measurements on secondary endpoints and safety data. It is assumed that the first-stage p-value function *p*_{1}: ℝ^{n} → [0, 1] satisfies
$$\int_{\mathbb{R}^{n}} \mathbf{1}\{p_{1}(x_{1}^{int}) \le u\}\, dG(x_{1}^{int}; 0) \le u \quad \text{for all } u \in [0, 1].$$

**Assumption 2**

At the interim analysis, a second-stage design *δ* is chosen. The second-stage design is allowed to depend on the unblinded first-stage data without prespecifying an adaptation rule. Denote the second-stage data by *Y*, where *Y* ∈ ℝ^{m}. It is assumed that the distribution function of *Y*, denoted by *F*_{x_{1}^{int}, δ}(*y*; *θ*), is known for all possible second-stage designs, *δ*, and all first-stage outcomes, *x*_{1}^{int}.

**Assumption 3**

The second-stage p-value function *p*_{2}: ℝ^{m} → [0, 1] satisfies
$$\int_{\mathbb{R}^{m}} \mathbf{1}\{p_{2}(y) \le u\}\, dF_{x_{1}^{int},\delta}(y; 0) \le u \quad \text{for all } u \in [0, 1].$$
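
As a minimal sketch of how the weighted inverse-normal test is applied once two valid stage-wise p-values are available, the following computes Z = w1·Φ^{-1}(1 − p1) + w2·Φ^{-1}(1 − p2) with w1² + w2² = 1 and compares it with Φ^{-1}(1 − α). The function names `inverse_normal_z` and `reject_h0` are illustrative choices and do not come from the paper or its supplementary R code.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def inverse_normal_z(p1, p2, w1):
    """Weighted inverse-normal statistic; w2 follows from w1^2 + w2^2 = 1."""
    if not 0.0 < w1 < 1.0:
        raise ValueError("w1 must lie strictly between 0 and 1")
    w2 = math.sqrt(1.0 - w1 ** 2)
    # inv_cdf requires both p-values to lie strictly inside (0, 1)
    return w1 * STD_NORMAL.inv_cdf(1.0 - p1) + w2 * STD_NORMAL.inv_cdf(1.0 - p2)


def reject_h0(p1, p2, w1, alpha=0.025):
    """Reject H0 when Z exceeds the (1 - alpha) standard normal quantile."""
    return inverse_normal_z(p1, p2, w1) > STD_NORMAL.inv_cdf(1.0 - alpha)
```

The weight `w1` has to be fixed before the data are seen; the sketch does not check that, nor does it check the three assumptions on the p-values.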

#### Immediate responses.

The aforementioned assumptions are easy to justify when primary endpoint responses are observed more-or-less immediately. In this case the interim data *X*_{1}^{int} contains the responses of all patients recruited prior to the interim analysis. A second-stage design *δ* can subsequently be chosen with the responses from a new cohort of patients contributing to *Y*.

#### Delayed responses and the independent increments assumption.

An interim analysis may take place whilst some patients have entered the study but have yet to provide a data point on the primary outcome measure. Most approaches to this problem [7, 8, 10] attempt to take advantage of the well known independent increments structure of score statistics in group sequential designs [21]. As pictured in Fig 1, the interim data *X*_{1}^{int} will generally include responses on short-term secondary endpoints and safety data from patients who are yet to provide a primary outcome measure, while *Y* consists of some delayed responses from patients recruited prior to the interim analysis, mixed together with responses from a new cohort of patients.

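Let *S*(*x*_{1}^{int}) and *I*(*x*_{1}^{int}) denote the score statistic and Fisher’s information for *θ*, calculated from primary endpoint responses in the interim data *x*_{1}^{int}. Assuming suitable regularity conditions, the asymptotic null distribution of *S*(*X*_{1}^{int}) is Gaussian with mean zero and variance *I*(*X*_{1}^{int}) [22]. The independent increments assumption is that for all first-stage outcomes *x*_{1}^{int} and second-stage designs *δ*, the null distribution of *Y* is such that
$$S(x_{1}^{int}, Y) - S(x_{1}^{int}) \sim N\left\{0,\; I(x_{1}^{int}, Y) - I(x_{1}^{int})\right\}, \tag{2}$$
at least approximately, where *S*(*x*_{1}^{int}, *Y*) and *I*(*x*_{1}^{int}, *Y*) denote the score statistic and Fisher’s information for *θ*, calculated from primary endpoint responses in (*x*_{1}^{int}, *Y*).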

Unfortunately, Eq (2) is seldom realistic in an adaptive setting. [11] show that if the adaptive strategy at the interim analysis is dependent on short-term outcomes in the interim data *X*_{1}^{int} that are correlated with primary endpoint outcomes in *Y*, i.e., from the same patient, then a naive appeal to the independent increments assumption can lead to very large type I error inflation.

#### Delayed responses with patient-wise separation.

An alternative approach, which we call “patient-wise separation”, redefines the first-stage p-value, *p*_{1}: ℝ^{p} → [0, 1], to be a function of *X*_{1}, where *X*_{1} denotes all the data from patients recruited prior to the interim analysis at calendar time *T*^{int}, followed-up until a pre-fixed maximum calendar time *T*^{max}. In this case *p*_{1} may not be observable at the time the second-stage design *δ* is chosen. This is not a problem, as long as no early rejection at the end of the first stage is foreseen. Any interim decisions, such as increasing the sample size, do not require knowledge of *p*_{1}. It is assumed that *Y* consists of responses from a new cohort of patients, such that the interim data *x*_{1}^{int} could be formally replaced with *x*_{1} in the aforementioned adaptive design assumptions. We call this patient-wise separation because data from the same patient cannot contribute to both *p*_{1} and *p*_{2}.

[23] and [24] apply this approach when a patient’s primary outcome can be measured after a fixed period of follow-up, e.g., 4 months. However, one must take additional care with a time-to-event endpoint, as one is typically not prepared to wait for all first-stage patients—those patients recruited prior to *T*^{int}—to have an event. Rather, *p*_{1} is defined as the p-value from a statistical test applied to the data from first-stage patients followed up until time *T*^{end}, for some *T*^{end} ≤ *T*^{max}. In this case it is vital that *T*^{end} be fixed at the start of the trial, either explicitly or implicitly [12, 13]. Otherwise, if *T*^{end} were to depend on the adaptive strategy at the interim analysis, this would impact the distribution of *p*_{1} and could lead to type I error inflation.

The situation is represented pictorially in Fig 2. An unfortunate consequence of pre-fixing *T*^{end} is that this will not, in all likelihood, correspond to the end of follow-up for second-stage patients. All events from first-stage patients that occur after *T*^{end} make no contribution to the statistic Eq (1).

### 2.2 Adaptive Survival Studies

Consider a randomized clinical trial comparing survival times on an experimental treatment, *E*, with those on a control treatment, *C*. For simplicity, we will focus on the logrank statistic for testing the null hypothesis *H*_{0}: *θ* = 0 against the one-sided alternative *H*_{a}: *θ* > 0, where *θ* is the log hazard ratio, assuming proportional hazards. Similar arguments could be applied to the Cox model. Let *D*_{1}(*t*) and *S*_{1}(*t*) denote the number of uncensored events and the usual logrank score statistic, respectively, based on the data from first-stage patients—those patients recruited prior to the interim analysis—followed up until calendar time *t*, *t* ∈ [0, *T*^{max}]. Under the null hypothesis, assuming equal allocation and a large number of events, the variance of *S*_{1}(*t*) is approximately equal to *D*_{1}(*t*)/4 [25]. The first-stage p-value must be calculated at a prefixed time point *T*^{end}:
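$$p_{1} = 1 - \Phi\left[\frac{2\,S_{1}(T^{end})}{\sqrt{D_{1}(T^{end})}}\right]. \tag{3}$$
The number of events can be prefixed at *d*_{1}, say, with *T*^{end} chosen implicitly:
$$T^{end} := \min\{t : D_{1}(t) = d_{1}\}. \tag{4}$$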
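
The two ingredients of the first-stage p-value can be sketched in a few lines, assuming the logrank score statistic and the event count have already been computed elsewhere. The sketch uses the null variance approximation D/4 stated above, so the standardized statistic is S/√(D/4), and it takes *T*^{end} to be the calendar time of the *d*_{1}-th first-stage event. The helper names and the input list of event calendar times are hypothetical.

```python
import math
from statistics import NormalDist


def implicit_t_end(first_stage_event_times, d1):
    """Calendar time of the d1-th first-stage event, i.e. min{t : D_1(t) = d1}."""
    ordered = sorted(first_stage_event_times)
    if d1 < 1 or d1 > len(ordered):
        raise ValueError("fewer than d1 first-stage events have been observed")
    return ordered[d1 - 1]


def logrank_p_value(score, n_events):
    """One-sided p-value 1 - Phi(S / sqrt(D / 4)), using Var(S) ~ D / 4 under H0."""
    return 1.0 - NormalDist().cdf(score / math.sqrt(n_events / 4.0))
```

Because `implicit_t_end` depends only on first-stage event times and the prefixed `d1`, it cannot be influenced by the adaptation chosen at the interim analysis, which is the property the text requires.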

#### Jenkins et al. method.

[13] describe a “patient-wise separation” adaptive survival trial, with test statistic Eq (1), first-stage p-value Eq (3) and *T*^{end} defined as in Eq (4). While their focus is on subgroup selection, we will appropriate their method for the simpler situation of a single comparison, where at the interim analysis one has the possibility to adapt the pre-planned number of events from second-stage patients (i.e., those patients recruited after *T*^{int}). The weights in Eq (1) are pre-fixed in proportion to the pre-planned number of events to be contributed from each stage, i.e., *w*_{1}^{2} = *d*_{1}/(*d*_{1} + *d*_{2}), where *d*_{1} + *d*_{2} is the total originally required number of events. The second-stage p-value corresponds to a logrank test based on second-stage patients, i.e.,
$$p_{2} = 1 - \Phi\left[\frac{2\,S_{2}(T_{2}^{*})}{\sqrt{D_{2}(T_{2}^{*})}}\right],$$
where *T*_{2}^{*} := min{*t* : *D*_{2}(*t*) = *d*_{2}^{*}} with *S*_{2}(*t*) and *D*_{2}(*t*) defined analogously to *S*_{1}(*t*) and *D*_{1}(*t*), and where *d*_{2}^{*} is specified at the interim analysis.
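
A short sketch of the resulting combination may help to show which quantities are fixed in advance and which may be adapted. It assumes that each stage-wise statistic is standardized as 2S/√D (null variance D/4), so that Φ^{-1}(1 − p) is simply that standardized value. The weights use only the pre-planned event numbers, whereas the observed second-stage event count may differ from the plan. The function name and argument names are hypothetical.

```python
import math


def jenkins_z(s1, d1_obs, s2, d2_obs, d1_planned, d2_planned):
    """Inverse-normal combination of two stage-wise logrank statistics.

    Weights depend only on the pre-planned event numbers; d2_obs is the
    adapted number of second-stage events and does not enter the weights.
    """
    total = d1_planned + d2_planned
    w1 = math.sqrt(d1_planned / total)
    w2 = math.sqrt(d2_planned / total)
    z1 = 2.0 * s1 / math.sqrt(d1_obs)  # equals Phi^{-1}(1 - p1)
    z2 = 2.0 * s2 / math.sqrt(d2_obs)  # equals Phi^{-1}(1 - p2)
    return w1 * z1 + w2 * z2
```

Keeping the weights tied to the original plan is what preserves the null distribution of the statistic when the second-stage event number is changed at the interim analysis.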

#### Irle and Schäfer method.

Instead of explicitly combining stage-wise p-values, [12] employ the closely related “conditional error” approach [3, 4, 26].

They begin by prespecifying a level-*α* test with decision function, *φ*, taking values in {0, 1} corresponding to non-rejection and rejection of *H*_{0}, respectively. For a survival trial, this entails specifying the sample size, duration of follow-up, test statistic, recruitment rate, etc. Then, at some not necessarily prespecified timepoint, *T*^{int}, an interim analysis is performed. The timing of the interim analysis induces a partition of the trial data, (*X*_{1}, *X*_{2}), where *X*_{1} and *X*_{2} denote the data from patients recruited before and after *T*^{int}, respectively, followed up until time *T*^{max}. For a standard log-rank test, the decision function is
$$\varphi(X_{1}, X_{2}) = \mathbf{1}\left\{\frac{2\,S(T^{end})}{\sqrt{D(T^{end})}} > \Phi^{-1}(1 - \alpha)\right\}, \tag{5}$$
where *D*(*T*^{end}) and *S*(*T*^{end}) denote the number of uncensored events and the usual logrank score statistic, respectively, based on data from all patients followed up until time *T*^{end} := min{*t* : *D*(*t*) = *d*} for some prespecified number of events *d*.

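At the interim analysis, the general idea is to use the unblinded first-stage data to define a second-stage design, *δ*, without the need for a prespecified adaptation strategy. Again, the definition of *δ* includes factors such as sample size, follow-up period, recruitment rate, etc., in addition to a second-stage decision function *ψ*: ℝ^{m} → {0, 1} based on second-stage data *Y* ∈ ℝ^{m}. Ideally, one would like to choose *ψ* such that *E*_{H0}(*ψ* ∣ *X*_{1}^{int} = *x*_{1}^{int}) = *E*_{H0}(*φ* ∣ *X*_{1}^{int} = *x*_{1}^{int}), as this would ensure that
$$E_{H_0}(\psi) = E_{H_0}\{E_{H_0}(\psi \mid X_{1}^{int})\} = E_{H_0}\{E_{H_0}(\varphi \mid X_{1}^{int})\} = E_{H_0}(\varphi) \le \alpha, \tag{6}$$
i.e., the overall procedure controls the type I error rate at level *α*. Unfortunately, this approach is not directly applicable in a survival trial where the interim data *X*_{1}^{int} contains short-term data from first-stage patients surviving beyond *T*^{int}. This is because it is impossible to calculate *E*_{H0}(*φ* ∣ *X*_{1}^{int} = *x*_{1}^{int}) and *E*_{H0}(*ψ* ∣ *X*_{1}^{int} = *x*_{1}^{int}), owing to the unknown joint distribution of survival times and the secondary endpoints already observed at the interim analysis. One may, however, condition on *X*_{1} rather than on *X*_{1}^{int} and choose *ψ* such that *E*_{H0}(*ψ* ∣ *X*_{1} = *x*_{1}) = *E*_{H0}(*φ* ∣ *X*_{1} = *x*_{1}), thus ensuring type I error control following the same argument as Eq (6). For example, it is possible to extend patient follow-up and use the second-stage decision function
$$\psi(X_{1}, X_{2}) = \mathbf{1}\left\{\frac{2\,S(T^{*})}{\sqrt{D(T^{*})}} > b^{*}\right\}, \tag{7}$$
where *T** := min{*t* : *D*(*t*) = *d**}, *d** ≥ *d* is chosen at the interim analysis, and *b** is a cutoff value that must be determined. [12] show that, asymptotically,
$$E_{H_0}(\varphi \mid X_{1} = x_{1}) = 1 - \Phi\left[\frac{\sqrt{D(T^{end})}\,\Phi^{-1}(1 - \alpha) - 2\,S_{1}(T^{end})}{\sqrt{D(T^{end}) - D_{1}(T^{end})}}\right]$$
and
$$E_{H_0}(\psi \mid X_{1} = x_{1}) = 1 - \Phi\left[\frac{\sqrt{D(T^{*})}\,b^{*} - 2\,S_{1}(T^{*})}{\sqrt{D(T^{*}) - D_{1}(T^{*})}}\right].$$
In each case, calculation of the right-hand-side is facilitated by the asymptotic result that, assuming equal allocation under the null hypothesis, for *t* ∈ [0, *T*^{max}],
$$S(t) - S_{1}(t) \sim N\left(0,\; \{D(t) - D_{1}(t)\}/4\right). \tag{8}$$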

One remaining subtlety is that *E*_{H0}(*ψ* ∣ *X*_{1} = *x*_{1}) can only be calculated at calendar time *T**, where *T** > *T*^{int}. Determination of *b** must therefore be postponed until this later time.

Using result Eq (8), it is straightforward to show that *ψ* = 1 if and only if *Z* > Φ^{−1}(1 − *α*), where *Z* is defined as in Eq (1) with *p*_{1} defined as in Eq (3), the second-stage p-value function defined as
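$$p_{2} = 1 - \Phi\left[\frac{2\{S(T^{*}) - S_{1}(T^{*})\}}{\sqrt{D(T^{*}) - D_{1}(T^{*})}}\right], \tag{9}$$
and the specific choice of weighting *w*_{1}^{2} = *D*_{1}(*T*^{end})/*D*(*T*^{end}). Full details are provided in supplementary material (S2 File).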
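
The equivalence between the conditional error rule and the combination test can be checked numerically. The sketch below is not taken from the paper or its S2 File: it assumes that, under the null hypothesis and conditional on the first-stage data, S(t) − S_1(t) is approximately N(0, {D(t) − D_1(t)}/4), derives the conditional error and the cutoff *b** from that assumption, and assumes weights w1² = D_1(T^end)/d. All function and argument names are hypothetical.

```python
import math
from statistics import NormalDist

N01 = NormalDist()


def _critical_increment(s1_end, d1_end, d, alpha):
    """Standardized threshold that the second-stage increment must exceed."""
    z_alpha = N01.inv_cdf(1.0 - alpha)
    return (math.sqrt(d) * z_alpha - 2.0 * s1_end) / math.sqrt(d - d1_end)


def conditional_error(s1_end, d1_end, d, alpha=0.025):
    """E_H0(phi | X_1 = x_1) for the original logrank test with d events."""
    return 1.0 - N01.cdf(_critical_increment(s1_end, d1_end, d, alpha))


def cutoff_b_star(s1_end, d1_end, d, s1_star, d1_star, d_star, alpha=0.025):
    """Cutoff b* that matches the conditional error of the extended test."""
    c = _critical_increment(s1_end, d1_end, d, alpha)
    return (c * math.sqrt(d_star - d1_star) + 2.0 * s1_star) / math.sqrt(d_star)


def combination_z(s1_end, d1_end, d, s_star, s1_star, d1_star, d_star):
    """Inverse-normal statistic with w1^2 = D_1(T_end) / d (assumed weighting)."""
    w1 = math.sqrt(d1_end / d)
    w2 = math.sqrt(1.0 - d1_end / d)
    z1 = 2.0 * s1_end / math.sqrt(d1_end)
    z2 = 2.0 * (s_star - s1_star) / math.sqrt(d_star - d1_star)
    return w1 * z1 + w2 * z2
```

For any final score statistic `s_star`, comparing `2 * s_star / sqrt(d_star)` with `cutoff_b_star(...)` should give the same decision as comparing `combination_z(...)` with Φ^{-1}(1 − α). Note that `s1_star` and `d1_star` are only available at calendar time *T**, which is why *b** cannot be fixed earlier.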

**Remark 1.** The Irle and Schäfer method uses the same test statistic as the Jenkins et al. method, with a clever way of implicitly defining the weights and the end of first-stage follow-up, *T*^{end}. It has two potential advantages. Firstly, the timing of the interim analysis need not be prespecified—in theory, one is permitted to monitor the accumulating data and at any moment decide that design changes are necessary. Secondly, if no changes to the design are necessary, i.e., the trial completes as planned at calendar time *T*^{end}, then the original test Eq (5) is performed. In this special case, no events are ignored in the final test statistic.

**Remark 2.** From first glance at Eq (7), it may appear that the events from first-stage patients, occurring after *T*^{end}, always make a contribution to the final test statistic. However, this data is still effectively ignored. We have shown in the online supplement that the procedure is equivalent to a p-value combination approach where *p*_{1} depends only on data available at time *T*^{end}. In addition, the distribution of *p*_{2} is asymptotically independent of the data from first-stage patients: note that *S*(*T**) − *S*_{1}(*T**) and *S*_{2}(*T**) are asymptotically equivalent [12]. The procedure therefore fits our description of a “patient-wise separation” design, and the picture is the same as in Fig 2. The first-stage patients have in effect been censored at *T*^{end}, despite having been followed-up for longer. This fact has important implications for the choice of *d**. If one chooses *d** based on conditional power arguments, one should be aware that the effective sample size has not increased by *d** − *d*. Rather, it has increased by *d** − *d* − {*D*_{1}(*T**) − *D*_{1}(*T*^{end})}, which could be very much smaller.

**Remark 3.** A potential disadvantage of the Irle and Schäfer method compared to the Jenkins et al. method is that one is not permitted to adapt any aspect of the recruitment process prior to time *T*^{end}. Contrary to what is claimed in [12], it is not valid to extend the recruitment period (or speed up recruitment as in the example they give) to reach an increased number of events *d** within the originally planned trial duration. This is because *T*^{end} is defined implicitly as *T*^{end} := min{*t* : *D*(*t*) = *d*} under the assumptions of the original design. Therefore *T*^{end} is unobservable if the recruitment process is changed in response to the interim data. [27] discuss this issue further.

### 2.3 Hypothesis tests based on all available follow-up data

Suppose that the trial continues until calendar time *T** ∈ (*T*^{end}, *T*^{max}). Data from first-stage patients—those patients recruited prior to *T*^{int}—accumulating between times *T*^{end} and *T** should be ignored. In this section we will investigate what happens, in a worst case scenario, if this illegitimate data is naively incorporated into the adaptive test statistic Eq (1). Specifically, we find the maximum type I error associated with the test statistic
$$Z(T^{*}) = w_{1}\,\frac{2\,S_{1}(T^{*})}{\sqrt{D_{1}(T^{*})}} + w_{2}\,\Phi^{-1}(1 - p_{2}). \tag{10}$$

Since *T** depends on the interim data in a complicated way, the null distribution of Eq (10) is unknown. One can, however, consider properties of the stochastic process
$$Z(t) = w_{1}\,\frac{2\,S_{1}(t)}{\sqrt{D_{1}(t)}} + w_{2}\,\Phi^{-1}(1 - p_{2}), \qquad t \in [T^{end}, T^{max}].$$
In other words, we consider continuous monitoring of the logrank statistic based on first-stage patient data. The worst-case scenario assumption is that the responses on short-term secondary endpoints, available at the interim analysis, can be used to predict the exact calendar time the process *Z*(*t*) reaches its maximum. In this case, one could attempt to engineer the second stage design such that *T** coincides with this timepoint, and the worst-case type I error rate is therefore
$$\max \alpha = P_{H_0}\left\{\max_{t \in [T^{end},\, T^{max}]} Z(t) > \Phi^{-1}(1 - \alpha)\right\}. \tag{11}$$

Although the worst-case scenario assumption is clearly unrealistic, Eq (11) serves as an upper bound on the type I error rate. It can be found approximately via standard Brownian motion results. Let *u* := *D*_{1}(*t*)/*D*_{1}(*T*^{max}) denote the information time at calendar time *t*, and let *S*_{1}(*u*) denote the logrank score statistic based on first-stage patients, followed-up until information time *u*. It can be shown that *B*(*u*) := *S*_{1}(*u*)/{*D*_{1}(*T*^{max})/4}^{1/2} behaves asymptotically like a Brownian motion with drift *ξ* := *θ*{*D*_{1}(*T*^{max})/4}^{1/2} [28]. We wish to calculate
$$P_{\theta=0}\left\{\max_{u \in [u_{1}, 1]}\left[w_{1}\frac{B(u)}{\sqrt{u}} + w_{2}\,\Phi^{-1}(1 - p_{2})\right] > \Phi^{-1}(1 - \alpha)\right\} = \int_{0}^{1} P_{\theta=0}\left\{B(u) > \frac{\Phi^{-1}(1 - \alpha) - w_{2}\,\Phi^{-1}(1 - x)}{w_{1}}\,\sqrt{u} \ \text{ for some } u \in [u_{1}, 1]\right\} dx, \tag{12}$$
where *u*_{1} = *D*_{1}(*T*^{end})/*D*_{1}(*T*^{max}). While the integrand on the right-hand-side is difficult to evaluate exactly, it can be found to any required degree of accuracy by replacing the square root stopping boundary with a piecewise linear boundary [29].

The two parameters that govern the size of Eq (11) are *w*_{1} and *u*_{1}. Larger values of *w*_{1} reflect an increased weighting of the first-stage data, which increases the potential inflation. In addition, a low value for *u*_{1} increases the window of opportunity for stopping on a random high. Table 1 shows that for a nominal *α* = 0.025 level test, the worst-case type I error can be up to 15% when *u*_{1} = 0.1 and *w*_{1} = 0.9. As *u*_{1} → 0 the worst-case type I error rate tends to 1 for any value of *w*_{1} > 0 [30].

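**Table 1.** Worst-case type I error rate for combinations of *w*_{1} and *u*_{1} (table body not recovered). Nominal level *α* = 0.025 one-sided.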
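
The worst-case error rate can also be approximated by simulation rather than by the piecewise linear boundary method of [29]. The sketch below is such a swapped-in Monte Carlo approximation, under these assumptions: under the null hypothesis *B*(*u*) is a driftless Brownian motion on information time, the first-stage statistic at information time *u* is *B*(*u*)/√*u*, and the second-stage contribution is an independent standard normal variable. The process is monitored on a discrete grid, so the estimate will tend to sit slightly below the continuous-time value. The function name and all default settings are illustrative.

```python
import math
import random
from statistics import NormalDist


def worst_case_alpha(w1, u1, alpha=0.025, n_sims=4000, n_steps=200, seed=1):
    """Monte Carlo estimate of P(max over u in [u1, 1] of Z(u) > z_alpha) under H0."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    w2 = math.sqrt(1.0 - w1 ** 2)
    du = (1.0 - u1) / n_steps
    sd_step = math.sqrt(du)
    rejections = 0
    for _ in range(n_sims):
        second_stage = w2 * rng.gauss(0.0, 1.0)  # w2 * Phi^{-1}(1 - p2)
        b = rng.gauss(0.0, math.sqrt(u1))        # B(u1) ~ N(0, u1)
        u = u1
        best = w1 * b / math.sqrt(u)
        for _ in range(n_steps):
            b += rng.gauss(0.0, sd_step)         # independent Brownian increment
            u += du
            best = max(best, w1 * b / math.sqrt(u))
        if best + second_stage > z_alpha:
            rejections += 1
    return rejections / n_sims
```

Calling `worst_case_alpha(0.9, 0.1)` should give an estimate well above the nominal level, whereas `worst_case_alpha(0.9, 1.0)` leaves no monitoring window and should return roughly the nominal 0.025, up to simulation error.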

## 3 Results

### 3.1 Clinical trial example

The upper bound on the type I error rate varies substantially across *w*_{1} and *u*_{1}. To give an indication of what can be expected in practice, consider a simplified version of the trial described in [12]. A randomized trial is set up to compare chemotherapy (C) with a combination of radiotherapy and chemotherapy (E). The anticipated median survival time on C is 14 months. If E were to increase the median survival time to 20 months then this would be considered a clinically relevant improvement. Assuming exponential survival times, this gives anticipated hazard rates *λ*_{C} = 0.050 and *λ*_{E} = 0.035, and a target log hazard ratio of *θ*_{R} = −log(*λ*_{E}/*λ*_{C}) ≈ 0.36. If the error rates for testing *H*_{0} : *θ* = 0 against *H*_{a} : *θ* = *θ*_{R} are *α* = 0.025 (one-sided) and *β* = 0.2, the required number of deaths (assuming equal allocation) is
$$d = 4\left[\frac{\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)}{\theta_{R}}\right]^{2} \approx 248.$$
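
The planning arithmetic can be retraced with a short sketch. It assumes exponential survival, so that the hazard is log(2) divided by the median, and the usual event-number formula for a logrank test with equal allocation, d = 4[{Φ^{-1}(1 − α) + Φ^{-1}(1 − β)}/θ]². The function names are illustrative. Evaluated without rounding the inputs, the formula returns a value just under 247, slightly below the 248 events used in the text; how that figure was rounded is not stated.

```python
import math
from statistics import NormalDist


def exponential_hazard(median_survival):
    """Hazard rate of an exponential distribution with the given median."""
    return math.log(2.0) / median_survival


def required_events(theta, alpha=0.025, beta=0.2):
    """Events needed by a one-sided level-alpha logrank test with power 1 - beta."""
    quantile = NormalDist().inv_cdf
    return 4.0 * ((quantile(1.0 - alpha) + quantile(1.0 - beta)) / theta) ** 2


lambda_c = exponential_hazard(14.0)          # control arm, median 14 months
lambda_e = exponential_hazard(20.0)          # experimental arm, median 20 months
theta_r = -math.log(lambda_e / lambda_c)     # target log hazard ratio
events_needed = required_events(theta_r)
```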

The relationship between the required number of events and the sample size depends on the recruitment pattern, and we will consider two scenarios. In our “slow recruitment” scenario, patients are recruited uniformly at a rate of 8 per month for a maximum of 60 months with an interim analysis performed at 23 months. In our “fast recruitment” scenario, patients are recruited uniformly at a rate of 50 per month for a maximum of 18 months with an interim analysis after 8 months. In both cases, the only adaptation we allow at the interim analysis is to increase the number of events. Recruitment must continue as planned but the follow-up period may be extended. The maximum duration of the trial is restricted to 100 months in the first case and 30 months in the second case.

Fig 3 shows the expected number of events as a function of time for both scenarios assuming exponentially distributed survival times with hazards equal to the planned values.

**Fig 3.** Expected number of events as a function of time. Slow recruitment: 8 patients per month for a maximum of 60 months. Fast recruitment: 50 patients per month for a maximum of 18 months. Vertical lines are at *T*^{int}, *T*^{end} and *T*^{max}.
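
The expected event curves can be approximated in closed form. The sketch below assumes uniform recruitment, equal allocation, exponential survival with the rounded planning hazards 0.050 and 0.035 per month, and no censoring other than the end of follow-up. For patients entering uniformly at a given rate over an interval, the expected number of events by calendar time *t* is the rate times the integral of 1 − exp{−λ(*t* − *s*)} over entry times *s*. The function names and the bisection helper are illustrative and are not the authors' supplementary code.

```python
import math


def expected_events(t, rate, recruit_from, recruit_to, hazard):
    """Expected events by calendar time t among patients recruited uniformly
    at `rate` per month on [recruit_from, recruit_to], exponential hazard."""
    a, b = recruit_from, min(t, recruit_to)
    if b <= a:
        return 0.0
    # rate * integral over entry time s in [a, b] of 1 - exp(-hazard * (t - s))
    tail = (math.exp(-hazard * (t - b)) - math.exp(-hazard * (t - a))) / hazard
    return rate * ((b - a) - tail)


def expected_events_trial(t, rate, recruit_from, recruit_to, hazards=(0.050, 0.035)):
    """Both arms combined under equal allocation (half the recruits per arm)."""
    return sum(expected_events(t, rate / 2.0, recruit_from, recruit_to, h)
               for h in hazards)


def time_to_reach(target, rate, recruit_to, hazards=(0.050, 0.035), t_max=100.0):
    """Bisection for the calendar time at which expected total events hit target."""
    lo, hi = 0.0, t_max
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_events_trial(mid, rate, 0.0, recruit_to, hazards) < target:
            lo = mid
        else:
            hi = mid
    return hi
```

For the slow scenario, `time_to_reach(248, 8, 60)` gives an expected *T*^{end}, and evaluating `expected_events_trial` for patients recruited on [0, 23] at that time and at 100 months should come out close to the first-stage event counts (about 149 and 179) quoted in the paragraph that follows; the fast scenario can be handled the same way with rate 50, recruitment to 18 months and `t_max=30`.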

The maximum type I error inflation, determined via *w*_{1} and *u*_{1}, will depend on the observed number of events from first- and second-stage patients at calendar times *T*^{int} and *T*^{end}. However, the expected pattern of events in Fig 3 provides some indication. In the slow recruitment scenario, we expect to recruit 179 patients by the time of the interim analysis. We also expect 149 of the first 248 events to come from patients recruited prior to the interim analysis. These numbers would give *w*_{1} = (149/248)^{1/2}, *u*_{1} = 149/179 and, according to Eq (12), max *α* = 0.044. For the fast recruitment scenario the respective quantities are *w*_{1} = (169/248)^{1/2}, *u*_{1} = 169/264 and max *α* = 0.060.

#### On the efficiency of the full-data level-*α* test.

Consider the full-data guaranteed level-*α* test defined above. Recall that this test has the advantage of allowing interim decision making to be based on all available data whilst using a final test statistic that takes account of all observed event times. Unfortunately, this advantage is likely to be outweighed by the loss in power resulting from the increased cut-off value, as can be seen in Fig 4. The difference between the noncentrality parameters of *Z*(*T**) and *Z*(*T*^{end}) is plotted against the time extension *T** − *T*^{end} for various choices of *θ*. In the slow recruitment scenario the increase in the noncentrality parameter is outweighed by the increase in the cut-off value, even when the log-hazard ratio is as large as was expected in the planning phase. In the fast recruitment setting, it is possible for the increase in the noncentrality parameter to exceed the increase in the cut-off value when the trial is extended substantially. However, the trial would typically only need to be increased substantially if the true effect size were lower than planned. And in this case (*θ* ≤ 0.66*θ*_{R}) one can see that the increased cut-off value still dominates.

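**Fig 4.** Difference between the noncentrality parameters of *Z*(*T**) and *Z*(*T*^{end}) plotted against the time extension *T** − *T*^{end}. Horizontal lines are drawn at *k** − Φ^{−1}(0.975), where *k** denotes the cut-off value of the full-data guaranteed level-*α* test, and Φ denotes the standard normal distribution function.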

## 4 Discussion

Unblinded sample-size recalculation has been criticized for its lack of efficiency relative to classical group sequential designs [31, 32]. If the recalculation is made on the basis of an early estimate of treatment effect, the final sample size is likely to have high variability [33], and, in addition, the test decision is based on a non-sufficient statistic. [34] show how, for a given re-estimation rule, a classical group sequential design can be found with an essentially identical power function but lower expected sample size.

In response to these arguments [35] emphasize that “the real benefit of the adaptive approach arises through the ability to invest sample resources into the trial in stages”. An efficient group sequential trial, on the other hand, requires a large up-front sample size commitment and aggressive early stopping boundaries. From the point of view of the trial sponsor, the added flexibility may in some circumstances outweigh the loss of efficiency.

In this paper we have shown that when the primary endpoint is time-to-event, a fully unblinded sample-size recalculation—i.e., a decision based on all available efficacy and safety data—has additional drawbacks not considered in the aforementioned literature. Recently proposed methods [12, 13] share the common disadvantage that some patients’ event times are ignored in the final test statistic. This is usually deemed unacceptable by regulators. Furthermore, it is the long-term data of patients recruited prior to the interim analysis that is ignored, such that more emphasis is put on early events in the final decision making. This neglect becomes more serious, therefore, if the hazard rates differ substantially only at large survival times. Note, however, that a standard logrank test would already be inefficient in this scenario [36].

The relative benefit of the Irle and Schäfer method [12], in comparison with that of Jenkins et al. [13], is that the timing of the interim analysis need not be pre-specified and, in addition, the method is efficient if no design changes are necessary. On the other hand, the Irle and Schäfer method has the serious practical flaw that it is not permissible to change any aspect of the recruitment process in response to the interim data.

Confirmatory clinical trials with time-to-event endpoints appear to be one of the most important fields of application of adaptive methods [37]. It is therefore especially important that investigators considering an unblinded sample size re-estimation in this context are aware of the additional issues involved. We have shown that all considered procedures will require giving up an important statistical property—a situation summarized succinctly in Table 2.

**Table 2.** Methods considered: (A), data is combined assuming independent stage-wise increments; (B), patient-wise separation with pre-fixed end of first-stage follow-up; (C), naive patient-wise separation without pre-fixed end of first-stage follow-up; and (D), patient-wise separation using the full-data guaranteed level-*α* test.

The relevance of these issues is highlighted by the recently published VALOR trial in acute myeloid leukaemia [38]. Treatment effect estimates from phase II data suggested that 375 events might be sufficient to confirm efficacy. However, there is always uncertainty surrounding such an estimate. A smaller effect size—corresponding to upwards of 500 required events—would still be clinically meaningful, but funding such a trial was beyond the resources of the study sponsor. The solution was to initiate the trial with the smaller sample size but plan an interim analysis, whereby promising results would trigger additional investment. In this case, the interim decision rules were pre-specified and, upon observing a promising hazard ratio after 173 events, the total required number of events was increased to 562. The final analysis was based on a weighted combination of log-rank statistics, corresponding to method (A) in Table 2. It is important to emphasize that the validity of this approach relies on the second-stage sample size being a function of the interim hazard ratio. Had other information—e.g., disease progressions—played a part in the interim decision making, then the type I error rate could have been compromised as described in this paper.

While statistical theory can be developed to control the type I error rate given certain model assumptions, there is always the potential for “operational bias” to enter an adaptive trial. FDA draft guidance [39] emphasizes the need to shield investigators as much as possible from knowledge of the adaptive changes. The very knowledge that sample size has been increased—implying a “promising” interim effect estimate—could lead to changes of behavior in terms of treating, managing, and evaluating study participants. As a minimum, the European Medicines Agency requires that the primary analysis “be stratified according to whether patients were randomized before or after the protocol amendment” [40]. Aside from the regulatory importance, it is also in the sponsor’s interest to minimize operational bias when trial outcomes will influence significant investment decisions [41]. For a further discussion on the regulatory and logistical challenges sponsors may face we refer to [6, 19].

We have focussed our attention on the type I error control and power of the various procedures. Estimation of the treatment effect size following an adaptive survival trial is also an important topic. Currently available methods can be found in [8], [42] and [43].

## Supporting Information

### S2 File. The connection between conditional error and the combination test.

doi:10.1371/journal.pone.0146465.s002

(PDF)

## Acknowledgments

DM was funded by the Medical Research Council (MR/J004979/1) and the Austrian Science Fund (FWF P23167). TJ was supported by the National Institute for Health Research (NIHR-CDF-2010-03-32). FK was supported by European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 602552 (IDEAL). MP was supported by the EU FP7 HEALTH.2013.4.2-3 project Asterix (Grant Number 603160). The views expressed in this publication are those of the authors and should not be attributed to any of the funding institutions, or organisations with which the authors are affiliated.

## Author Contributions

Conceived and designed the experiments: DM TJ FK MP. Contributed reagents/materials/analysis tools: DM. Wrote the paper: DM TJ FK MP.

## References

- 1. Jiang Z, Wang L, Li C, Xia J, Jia H. A practical simulation method to calculate sample size of group sequential trials for time-to-event data under exponential and Weibull distribution. PLoS One. 2012;7(9):e44013. doi: 10.1371/journal.pone.0044013. pmid:22957040
- 2. Bauer P, Köhne K. Evaluation of experiments with adaptive interim analyses. Biometrics. 1994;50:1029–1041. Correction: *Biometrics* 1996;52:380. doi: 10.2307/2533441. pmid:7786985
- 3. Proschan MA, Hunsberger SA. Designed Extension of Studies Based on Conditional Power. Biometrics. 1995;51:1315–1324. Available from: http://www.jstor.org/stable/2533262. doi: 10.2307/2533262. pmid:8589224
- 4. Müller HH, Schäfer H. Adaptive Group Sequential Designs for Clinical Trials: Combining the Advantages of Adaptive and of Classical Group Sequential Approaches. Biometrics. 2001;57:886–891. doi: 10.1111/j.0006-341X.2001.00886.x. pmid:11550941
- 5. Hommel G. Adaptive modifications of hypotheses after an interim analysis. Biometrical Journal. 2001;43(5):581–589.
- 6. Bauer P, Bretz F, Dragalin V, König F, Wassmer G. Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls. Statistics in Medicine. 2015 (Early View).
- 7. Schäfer H, Müller HH. Modification of the sample size and the schedule of interim analyses in survival trials based on data inspections. Statistics in medicine. 2001;20(24):3741–3751. doi: 10.1002/sim.1136. pmid:11782030
- 8. Wassmer G. Planning and analyzing adaptive group sequential survival trials. Biometrical Journal. 2006;48(4):714–729. doi: 10.1002/bimj.200510190. pmid:16972724
- 9. Desseaux K, Porcher R. Flexible two-stage design with sample size reassessment for survival trials. Statistics in medicine. 2007;26(27):5002–5013. doi: 10.1002/sim.2966. pmid:17577242
- 10. Jahn-Eimermacher A, Ingel K. Adaptive trial design: A general methodology for censored time to event data. Contemporary clinical trials. 2009;30(2):171–177. doi: 10.1016/j.cct.2008.12.002. pmid:19130902
- 11. Bauer P, Posch M. Letter to the editor. Statistics in Medicine. 2004;23:1333–1334.
- 12. Irle S, Schäfer H. Interim design modifications in time-to-event studies. Journal of the American Statistical Association. 2012;107:341–348. doi: 10.1080/01621459.2011.644141.
- 13. Jenkins M, Stone A, Jennison C. An Adaptive Seamless Phase II/III Design for Oncology Trials with Subpopulation Selection Using Correlated Survival Endpoints. Pharmaceutical Statistics. 2011;10:347–356. doi: 10.1002/pst.472. pmid:22328327
- 14. Stallard N. A confirmatory seamless phase II/III clinical trial design incorporating short-term endpoint information. Statistics in medicine. 2010;29(9):959–971. pmid:20191605
- 15. Friede T, Parsons N, Stallard N, Todd S, Valdes Marquez E, Chataway J, et al. Designing a seamless phase II/III clinical trial using early outcomes for treatment selection: An application in multiple sclerosis. Statistics in medicine. 2011;30(13):1528–1540. pmid:21341301
- 16. Friede T, Parsons N, Stallard N. A conditional error function approach for subgroup selection in adaptive clinical trials. Statistics in Medicine. 2012;31(30):4309–4320. doi: 10.1002/sim.5541. pmid:22865774
- 17. Hampson LV, Jennison C. Group sequential tests for delayed responses (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013;75(1):3–54. doi: 10.1111/j.1467-9868.2012.01030.x.
- 18. Lehmacher W, Wassmer G. Adaptive Sample Size Calculations in Group Sequential Trials. Biometrics. 1999;55(4):1286–1290. Available from: http://www.jstor.org/stable/2533757. doi: 10.1111/j.0006-341X.1999.01286.x. pmid:11315085
- 19. Bretz F, Koenig F, Brannath W, Glimm E, Posch M. Adaptive Designs for Confirmatory Clinical Trials. Statistics in Medicine. 2009;28:1181–1217. doi: 10.1002/sim.3538. pmid:19206095
- 20. Brannath W, Gutjahr G, Bauer P. Probabilistic Foundation of Confirmatory Adaptive Designs. Journal of the American Statistical Association. 2012;107:824–832. doi: 10.1080/01621459.2012.682540.
- 21. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman and Hall; 2000.
- 22. Cox DR, Hinkley DV. Theoretical statistics. CRC Press; 1979.
- 23. Liu Q, Pledger GW. Phase 2 and 3 Combination Designs to Accelerate Drug Development. Journal of the American Statistical Association. 2005;100(470):493–502. Available from: http://amstat.tandfonline.com/doi/abs/10.1198/016214504000001790. doi: 10.1198/016214504000001790.
- 24. Schmidli H, Bretz F, Racine-Poon A. Bayesian predictive power for interim adaptation in seamless phase II/III trials where the endpoint is survival up to some specified timepoint. Statistics in medicine. 2007;26(27):4925–4938. doi: 10.1002/sim.2957. pmid:17590875
- 25. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Chichester: Wiley; 1997.
- 26. Posch M, Bauer P. Adaptive Two Stage Designs and the Conditional Error Function. Biometrical Journal. 1999;41:689–696.
- 27. Mehta C, Schäfer H, Daniel H, Irle S. Biomarker driven population enrichment for adaptive oncology trials with time to event endpoints. Statistics in Medicine. 2014;33(26):4515–4531. doi: 10.1002/sim.6272. pmid:25130879
- 28. Proschan MA, Lan KKG, Wittes JT. Statistical Monitoring of Clinical Trials. New York: Springer; 2006.
- 29. Wang L, Pötzelberger K. Boundary crossing probability for Brownian motion and general boundaries. Journal of Applied Probability. 1997;34:54–65. doi: 10.2307/3215174.
- 30. Proschan MA, Follmann DA, Waclawiw MA. Effects of Assumption Violations on Type I Error Rate in Group Sequential Monitoring. Biometrics. 1992;48:1131–1143. doi: 10.2307/2532704.
- 31. Tsiatis AA, Mehta C. On the Inefficiency of Adaptive Design for Monitoring Clinical Trials. Biometrika. 2003;90:367–378. doi: 10.1093/biomet/90.2.367.
- 32. Jennison C, Turnbull BW. Adaptive and Nonadaptive Group Sequential Tests. Biometrika. 2006;93:1–21. doi: 10.1093/biomet/93.1.1.
- 33. Bauer P, Koenig F. The reassessment of trial perspectives from interim data–a critical view. Statistics in Medicine. 2006;25(1):23–36. doi: 10.1002/sim.2180. pmid:16220517
- 34. Jennison C, Turnbull BW. Mid-course sample size modification in clinical trials based on the observed treatment effect. Statistics in Medicine. 2003;22(6):971–993. doi: 10.1002/sim.1457. pmid:12627413
- 35. Mehta CR, Pocock SJ. Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in medicine. 2011;30(28):3267–3284. doi: 10.1002/sim.4102. pmid:22105690
- 36. Lagakos S, Schoenfeld D. Properties of proportional-hazards score tests under misspecified regression models. Biometrics. 1984;40:1037–1048. doi: 10.2307/2531154. pmid:6534407
- 37. Elsäßer A, Regnstrom J, Vetter T, Koenig F, Hemmings RJ, Greco M, et al. Adaptive clinical trial designs for European marketing authorization: a survey of scientific advice letters from the European Medicines Agency. Trials. 2014;15(1):383. doi: 10.1186/1745-6215-15-383. pmid:25278265
- 38. Ravandi F, Ritchie EK, Sayar H, Lancet JE, Craig MD, Vey N, et al. Vosaroxin plus cytarabine versus placebo plus cytarabine in patients with first relapsed or refractory acute myeloid leukaemia (VALOR): a randomised, controlled, double-blind, multinational, phase 3 study. The Lancet Oncology. 2015;16(9):1025–1036. doi: 10.1016/S1470-2045(15)00201-6. pmid:26234174
- 39. Food and Drug Administration. Draft Guidance for Industry: Adaptive Design Clinical Trials for Drugs and Biologics. 2010. Available from: http://www.fda.gov/downloads/Drugs/Guidances/ucm201790.pdf.
- 40. European Medicines Agency Committee for Medicinal Products for Human Use. Reflection Paper on Methodological Issues in Confirmatory Clinical Trials Planned with an Adaptive Design. 2007. Doc. Ref. CHMP/EWP/2459/02. Available from: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003616.pdf.
- 41. Cuffe RL, Lawrence D, Stone A, Vandemeulebroecke M. When is a seamless study desirable? Case studies from different pharmaceutical sponsors. Pharmaceutical statistics. 2014;13(4):229–237. doi: 10.1002/pst.1622. pmid:24891148
- 42. Brannath W, Zuber E, Branson M, Bretz F, Gallo P, Posch M, et al. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Statistics in Medicine. 2009;28(10):1445–1463. doi: 10.1002/sim.3559. pmid:19266565
- 43. Carreras M, Gutjahr G, Brannath W. Adaptive seamless designs with interim treatment selection: a case study in oncology. Statistics in Medicine. 2015;34(8):1317–1333. doi: 10.1002/sim.6407. pmid:25640198