One-sample log-rank tests with consideration of reference curve sampling variability

The one-sample log-rank test is the method of choice for single-arm Phase II trials with a time-to-event endpoint. It allows the survival of the patients to be compared to a reference survival curve that typically represents the expected survival under standard of care. The classical one-sample log-rank test, however, assumes that the reference survival curve is deterministic. This ignores that the reference curve is commonly estimated from historic data and is thus prone to statistical error. Ignoring the sampling variability of the reference curve results in type I error rate inflation. For that reason, a new one-sample log-rank test is proposed that explicitly accounts for the statistical error made in the process of estimating the reference survival curve. The test statistic and its distributional properties are derived using martingale techniques in the large sample limit. In particular, a sample size formula is provided. Small sample properties regarding type I and type II error rate control are studied by simulation. A case study examines the influence of several design parameters of a single-arm trial on the inflation of the type I error rate when reference curve sampling variability is ignored.


Introduction
The one-sample log-rank test is the method of choice for single-arm Phase II trials with a time-to-event endpoint. It allows the survival of the patients to be compared to a prefixed reference survival curve that typically represents the expected survival under standard of care. First proposed by [1], its practical implementation, including sample size calculation, has been described by [2]. The one-sample log-rank test is often criticized in different directions. First, it has been reported repeatedly in the literature that the classical one-sample log-rank statistic tends to be conservative (see [3,4]). One reason for the test's inaccuracy is the dependence between the estimators of the mean and variance of the original one-sample log-rank statistic when sample size is small. Several attempts have been made in the literature to correct for this (see [3,4,5,6,7,8]). Amongst those, the proposal made by [6] is presently implemented in the commercial software PASS [9] for sample size calculation for the one-sample log-rank test. Another, more conceptual, point of criticism of the one-sample log-rank test relates to the process of selecting the reference survival curve. It is common practice to choose the reference survival curve in the light of historic data on the standard treatment. The choice of the reference survival curve is thus itself prone to statistical error which, however, is ignored in the classical one-sample log-rank statistic. As pointed out in [10], this is a general problem in clinical trials with historical controls. Accordingly, common one-sample log-rank tests rather assume that the reference survival curve is a priori known and deterministic, as in [2,3,4,5,6,7,8]. This ignores that the reference curve resulted from an estimation process and complicates interpretation of the test results. Moreover, historic data often suffer from not reflecting recent advances in diagnostics and/or concomitant therapy for standard of care.
To overcome these interpretative limitations we propose a new one-sample log-rank test that explicitly accounts for statistical error made in the process of estimating and fixing the reference survival curve. Principally, the new test applies to both historic and prospective comparisons of a new treatment to a standard in the framework of Phase II survival trials. In the latter case, the new test may also be interpreted as a two-sample test for survival distributions.
The paper is organized as follows. After settling notation and the testing problem, we describe the test statistic and its distributional properties. Additionally, we provide sample size calculation methods. Calculation of rejection regions and sample size are based on the approximate distribution of the new test statistic in the large sample limit. Therefore, the small sample properties of the new test regarding type I and type II error rate control are studied by simulation and compared to the classical one-sample log-rank test and the two-sample log-rank test. These simulations and a case study shed light on the inflation of the type I error rate that results from ignoring the sampling variability of the reference curve in the planning phase of a new single-arm trial. We conclude with a discussion of future research. Mathematical proofs are deferred to Appendix A.
General Aspects

Notation
We consider a survival trial with survival data from two treatment groups $A$ (control intervention, prospectively collected or historic data) and $B$ (experimental intervention, prospectively collected data). Let $\mathcal{N}_x$ denote the set of patients from group $x = A, B$, $n_x := |\mathcal{N}_x|$ the number of such patients, and $n := n_A + n_B$ the total number of patients. In particular, we denote by $\pi = n_B/n_A$ the treatment group allocation ratio. We denote by $T_{x,i}$ or $C_{x,i}$ the time from entry to event or censoring for patient $i$ from group $x = A, B$, respectively. Let $X_{x,i} := T_{x,i} \wedge C_{x,i}$ denote the minimum of both. As usual, we assume that the $T_{x,i}$ and $C_{x,i}$ are mutually independent (non-informative censoring). For each $s \geq 0$, we denote by $\mathcal{F}_s$ the $\sigma$-algebra of information available by study time $s$:
$$\mathcal{F}_s := \sigma\big( I(T_{x,i} \leq s),\ T_{x,i} \cdot I(T_{x,i} \leq s),\ I(C_{x,i} \leq s),\ C_{x,i} \cdot I(C_{x,i} \leq s);\ i = 1, \ldots, n_x,\ x = A, B \big). \tag{1}$$
Based on the observed data, we calculate the number of events from treatment group $x = A, B$ up to study time $s \geq 0$ as
$$N_x(s) := \sum_{i \in \mathcal{N}_x} N_{x,i}(s), \qquad N_{x,i}(s) := I(T_{x,i} \leq s,\ T_{x,i} \leq C_{x,i}), \tag{2}$$
and the number at risk $Y_x(s) := \sum_{i \in \mathcal{N}_x} I(T_{x,i} \wedge C_{x,i} \geq s)$ by study time $s \geq 0$ in treatment group $x = A, B$. Let $J_x(s) := I(Y_x(s) > 0)$ indicate whether there are still patients at risk in treatment group $x$ by study time $s$. As usual, we let $\lambda_x(s) := \lim_{\Delta \to 0} P(s \leq T_{x,i} < s + \Delta \mid T_{x,i} \geq s)/\Delta$ denote the hazard of a patient $i$ from treatment group $x = A, B$, and $\Lambda_x(s) := \int_0^s \lambda_x(u)\,du$ the corresponding cumulative hazard function for treatment group $x = A, B$, respectively. Finally, we denote by $f_{T_x}, F_{T_x}, S_{T_x}$ (or $f_{C_x}, F_{C_x}, S_{C_x}$) the density, distribution function and survival function of the time to event $T_{x,i}$ (or time to censoring $C_{x,i}$) in treatment group $x = A, B$. Notice that $\lambda_x, \Lambda_x, f_{T_x}, F_{T_x}, S_{T_x}$ and $f_{C_x}, F_{C_x}, S_{C_x}$ are assumed to coincide for all patients from the same treatment group $x = A, B$.
We will also need the Nelson-Aalen estimator of the cumulative hazard function $\Lambda_x(s)$ for group $x = A, B$,
$$\hat\Lambda_x(s) := \int_0^s \frac{J_x(u)}{Y_x(u)}\, dN_x(u), \tag{3}$$
and the corresponding estimator of the variance function,
$$\hat\sigma_x^2(s) := \int_0^s \frac{J_x(u)}{Y_x(u)^2}\, dN_x(u). \tag{4}$$
We consider $N_x$, $Y_x$, $J_x$, $\hat\Lambda_x$ and $\hat\sigma_x$ as stochastic processes in study time $s \geq 0$, adapted to the filtration $(\mathcal{F}_s)_{s \geq 0}$. Notice that we define $0/0 := 0$ whenever formal division of zero by zero occurs in a mathematical expression.
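As an illustration, the Nelson-Aalen estimate of the cumulative hazard and the corresponding variance estimate can be computed from right-censored data as follows. This is a minimal Python sketch; the function name and the data layout (one array of observed times, one array of event indicators) are our own choices, not the paper's.

```python
import numpy as np

def nelson_aalen(times, events, s):
    """Nelson-Aalen estimate of the cumulative hazard at study time s,
    together with the usual variance estimate sigma^2(s).

    times  : observed times X_i = min(T_i, C_i)
    events : 1 if X_i is an event, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    # distinct event times up to s
    ev_times = np.unique(times[(events == 1) & (times <= s)])
    lam_hat, var_hat = 0.0, 0.0
    for t in ev_times:
        d = np.sum((times == t) & (events == 1))   # events at time t
        y = np.sum(times >= t)                     # number at risk at t
        lam_hat += d / y       # increment of the cumulative hazard estimate
        var_hat += d / y**2    # increment of the variance estimate
    return lam_hat, var_hat
```

For instance, with observed times 1, 2, 3, 4 and the third observation censored, the estimate sums the increments 1/4, 1/3 and 1/1 over the three event times.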

The Testing Problem
We consider testing the null hypothesis that the survival function of patients from the experimental treatment group $B$ coincides with the reference curve that is given by the true survival function under standard of care, i.e.
$$H_0:\ \Lambda_B(s) = \Lambda_A(s) \quad \text{for all } s \in [0, s_{\max}]$$
for some maximum observation time $s_{\max} > 0$. Notice that $H_0$ deviates from the null hypothesis of the classical one-sample log-rank tests (see [1] or [2]), which assumes a known reference survival curve. Nevertheless, $H_0$ typically is the null hypothesis of actual interest also in a one-sample setting.

Motivation
Our starting point is the stochastic process
$$M_0(s) := n^{-1/2} \Big[ N_B(s) - \int_0^s Y_B(u)\, d\Lambda_A(u) \Big].$$
When $H_0$ holds true, $M_0$ is (known to be) a mean-zero $\mathcal{F}_s$-martingale. $M_0$ depends on data from the experimental treatment arm $B$ only, and is commonly used as a basis to construct one-sample log-rank tests (see [11]). Notice, however, the difficulty that $M_0$ depends on the true unknown cumulative hazard function $\Lambda_A$ under standard of care. In the context of the classical one-sample log-rank test it is common practice to estimate $\Lambda_A$ from historic data, and to identify the obtained estimate $\hat\Lambda_A$ with $\Lambda_A$, while treating $\hat\Lambda_A$ as a deterministic function. That is, the classical one-sample log-rank test effectively assesses the null hypothesis $\tilde H_0: \Lambda_B = \hat\Lambda_A$ using the test statistic
$$\tilde M_0(s) := n^{-1/2} \Big[ N_B(s) - \int_0^s Y_B(u)\, d\hat\Lambda_A(u) \Big]$$
while pretending that $\hat\Lambda_A$ is an a priori known deterministic reference function representing the expected survival under standard of care. This, however, may detract from the actually interesting null hypothesis $H_0: \Lambda_B = \Lambda_A$ when the random deviation of $\hat\Lambda_A$ from $\Lambda_A$ is large. To avoid these interpretive difficulties, we here propose to incorporate the process of reference curve estimation into the one-sample log-rank statistic: replacing $\Lambda_A$ with its Nelson-Aalen estimate $\hat\Lambda_A$ (see [12,13,14]) in the definition of $M_0$, while treating $\hat\Lambda_A$ as random, we obtain a new stochastic process
$$\hat M_0(s) := n^{-1/2} \Big[ N_B(s) - \int_0^s Y_B(u)\, d\hat\Lambda_A(u) \Big]$$
that (i) can be calculated from the data, and (ii) may be used as a test statistic for the original null hypothesis $H_0$, as we will see below. Notice that replacing $\Lambda_A$ with $\hat\Lambda_A$ in $M_0$ increases the variance of the stochastic process, since $\hat\Lambda_A$ contributes additional variability. Deriving the correct rejection regions thus requires separate consideration, which is not covered by the underlying one-sample test methodology. The resulting significance test may also be interpreted as a two-sample survival test, as the reference curve coincides with the true survival function under standard of care.
Our approach defines a general strategy to lift existing methodology for one-sample survival tests to a multi-sample setting in a variety of different design settings, as will be discussed further below.

Test Statistic and Significance Test
Consider the $\mathcal{F}_s$-adapted stochastic processes
$$\hat M_0(s) := n^{-1/2} \Big[ N_B(s) - \int_0^s Y_B(u)\, d\hat\Lambda_A(u) \Big], \qquad \hat\Sigma^2(s) := n^{-1} \Big[ N_B(s) + \int_0^s Y_B(u)^2\, d\hat\sigma_A^2(u) \Big],$$
with $N_B$, $\hat\Lambda_A$ and $\hat\sigma_A$ according to (2), (3) and (4). Assume that the null hypothesis $H_0$ holds true. Then by Theorem 1 (see Appendix A) the following applies: (i) $\hat M_0$ is a mean-zero $\mathcal{F}_s$-martingale with asymptotically independent increments, i.e. for any $0 < s_1 < s_2 < s_{\max}$, $\hat M_0(s_1)$ and $\hat M_0(s_2) - \hat M_0(s_1)$ are approximately independent when the sample size $n$ is sufficiently large, and (ii) for each fixed $s > 0$ we have $\hat M_0(s) \to N(0, \Sigma^2(s))$ in distribution as $n \to \infty$, where $\Sigma(s) := \operatorname{plim}_{n\to\infty} \hat\Sigma(s) = \lim_{n\to\infty} E[\hat\Sigma(s)]$ (see Appendix A, Lemma 1). In particular, the random variable
$$Z := \frac{\hat M_0(s_{\max})}{\hat\Sigma(s_{\max})} \tag{6}$$
is approximately standard normally distributed under the null hypothesis $H_0$. Notice that the parameters $n_A$ and $n_B$ cancel out in the definition of $Z$, so that $Z$ can be calculated from the observed data. Thus an approximate level $\alpha$ test of $H_0$ is defined by rejecting $H_0$ whenever $|Z| \geq \Phi^{-1}(1-\alpha/2)$, where $\alpha$ is the desired two-sided significance level and $\Phi$ is the standard normal distribution function.
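A minimal Python sketch of the resulting statistic follows. The variance is taken here as $N_B(s_{\max}) + \int_0^{s_{\max}} Y_B(u)^2\, d\hat\sigma_A^2(u)$, which is our reading of $\hat\Sigma^2$ up to the normalisation that cancels in the ratio; the paper's Appendix A gives the authoritative definition, so treat this as an illustration rather than a reference implementation.

```python
import numpy as np

def z_statistic(xa, da, xb, db):
    """One-sample log-rank statistic with reference-curve variability.

    xa, da : observed times / event indicators in the reference group A
    xb, db : observed times / event indicators in the experimental group B

    Sketch only: Z = (N_B - int Y_B dLambdaHat_A) /
                     sqrt(N_B + int Y_B^2 dsigmaHat_A^2).
    """
    xa, da = np.asarray(xa, float), np.asarray(da, int)
    xb, db = np.asarray(xb, float), np.asarray(db, int)
    num = np.sum(db)                      # N_B(s_max): events in group B
    var = float(num)                      # event-count part of the variance
    for t in np.unique(xa[da == 1]):      # event times of group A
        d = np.sum((xa == t) & (da == 1))
        y = np.sum(xa >= t)               # Y_A(t): A patients at risk
        yb = np.sum(xb >= t)              # Y_B(t): B patients at risk
        num -= yb * d / y                 # minus integral Y_B dLambdaHat_A
        var += yb**2 * d / y**2           # plus integral Y_B^2 dsigmaHat_A^2
    return num / np.sqrt(var)
```

Note that the sample sizes enter only through the counting processes, mirroring the remark that $n_A$ and $n_B$ cancel out in $Z$.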
To enable easy application of the proposed significance test in clinical practice, we provide R code (see [15]) that calculates the value of the test statistic $Z$ and the corresponding two-sided p-value for a given input data set. The R code as well as instructions on how to prepare the input data are given in S1 File.

Sample Size Calculation
Sample size is calculated under the proportional hazards planning alternative $K_0: \Lambda_B(s) = \omega_0 \cdot \Lambda_A(s)$ for some prefixed hazard ratio $0 < \omega_0 < 1$. By Theorem 2 (see Appendix A), the test statistic $Z$ from (6) is approximately normally distributed under the planning alternative $K_0$ with unit variance and mean $\log(\omega_0) \cdot \mu \cdot \sigma^{-1}$, where $\mu$ and $\sigma$ are deterministic quantities involving integrals of the form $\int_0^{s_{\max}} S_{T_A}(u) \cdot S_{C_A}(u)\, du$ and the treatment arm allocation ratio $\pi = n_B/n_A$ (see Theorem 2 in Appendix A for the explicit expressions).
Large negative values of the test statistic $Z$ support the validity of the planning alternative $K_0$. The power $1 - \beta := P_{K_0}(Z \leq \Phi^{-1}(\alpha/2))$ of the trial is thus approximately given by
$$1 - \beta \approx \Phi\big( -\Phi^{-1}(1 - \alpha/2) - \log(\omega_0) \cdot \mu \cdot \sigma^{-1} \big). \tag{8}$$
In practice, the following assumptions on accrual and censoring are commonly made when calculating the required sample size of a survival trial: • Patients enter the trial uniformly between year 0 and year $a$ with a prefixed constant accrual rate $r$, say, and are then followed up for a further $f \geq s_{\max}$ years until the time of the final analysis in year $a + f$.
• No loss to follow-up, i.e. censoring occurs only administratively at the calendar time $a + f$ of the final analysis: $C_{x,i} = a + f - Y_{x,i}$, where $Y_{x,i} \sim U(0, a)$ denotes the calendar time of entry.
These assumptions further simplify the above expressions for $\mu$ and $\sigma$. For a prefixed two-sided significance level $\alpha$, hazard ratio $0 < \omega_0 < 1$, treatment group allocation ratio $\pi = n_B/n_A$, overall accrual rate $r$, length of the follow-up period $f$, and control arm cumulative hazard function $\Lambda_A(s)$, it remains to choose the only remaining free parameter $a$ in (8) such that a desired power $1 - \beta$ is achieved. With the parameter $a$ calculated this way, the required number of patients $n$ to achieve a power of $1 - \beta$ under the planning alternative $K_0$ is $n = r \cdot a$. In S1 File we provide R code (see [15]) that calculates the required number of patients $n$ for settings in which the survival times in the reference group $A$ are Weibull distributed, $\Lambda_A(s) := -\log(S_1) \cdot s^\kappa$, with prefixed shape parameter $\kappa$ and prefixed 1-year survival rate $S_{T_A}(1) = S_1$.
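The search for the accrual duration $a$ can be sketched as follows. Since the exact expressions for $\mu$ and $\sigma$ are given in Appendix A, this illustrative Python sketch instead approximates the design by applying Schoenfeld's required event count to the expected number of events under the accrual and censoring assumptions above; it is not formula (8) itself, and all function names are our own.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def weibull_cumhaz(s, kappa, s1):
    # Lambda_A(s) = -log(S_1) * s**kappa, so that S_{T_A}(1) = S_1
    return -np.log(s1) * s**kappa

def expected_event_fraction(a, f, kappa, s1, omega):
    # P(event) for a patient entering uniformly on [0, a] and censored
    # administratively at calendar time a + f (no loss to follow-up),
    # with cumulative hazard omega * Lambda_A
    integrand = lambda c: 1.0 - np.exp(-omega * weibull_cumhaz(c, kappa, s1))
    return quad(integrand, f, a + f)[0] / a

def required_accrual(r, f, kappa, s1, omega0, alpha=0.05, beta=0.2, pi=1.0):
    # Solve for the accrual duration a such that the expected number of
    # events reaches Schoenfeld's required event count; then n = r * a
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)
    d_req = (1 + pi)**2 / pi * (za + zb)**2 / np.log(omega0)**2
    def gap(a):
        n = r * a
        n_a, n_b = n / (1 + pi), n * pi / (1 + pi)
        d = (n_a * expected_event_fraction(a, f, kappa, s1, 1.0)
             + n_b * expected_event_fraction(a, f, kappa, s1, omega0))
        return d - d_req
    return brentq(gap, 1e-3, 100.0)
```

For example, `required_accrual(100, 3, 1.0, 0.5, 0.5)` returns the accrual duration in years for $r = 100$, $f = 3$, exponential control survival with $S_1 = 0.5$ and $\omega_0 = 0.5$; the total sample size is then $n = r \cdot a$.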

Simulation Study I: Comparison with the Classical One-Sample Log-Rank Test

Design
In the application of the classical one-sample log-rank test from [1,2] it is common practice to estimate the standard arm cumulative hazard function $\Lambda_A$ from historic data, and to choose the obtained estimate $\hat\Lambda_A$ as the reference curve, while treating $\hat\Lambda_A$ as deterministic. This may lead to type I error rate inflation when the underlying null hypothesis to be tested is $H_0: \Lambda_B = \Lambda_A$, because the random deviation of $\hat\Lambda_A$ from $\Lambda_A$ is neglected and the variance of the involved test statistic is thus underestimated. The objective of simulation study I is to quantify the amount of type I error rate inflation in settings of clinical relevance: we study and compare the empirical type I error rates when (i) the classical one-sample log-rank test (without correction for sampling variability of the reference curve) and (ii) the new one-sample log-rank test (with correction for sampling variability of the reference curve) are used to test the null hypothesis $H_0: \Lambda_B = \Lambda_A$. In our simulations, patients were assumed to enter the trial uniformly between year 0 and year $a$ with an overall accrual rate of $r = 100$ per year. Accordingly, the calendar times of entry were generated according to a uniform distribution $Y_{x,i} \sim U(0, a)$ on $[0, a]$. After the end of the accrual period, patients were assumed to be followed up for a further $f = 3$ years, with no loss to follow-up. Accordingly, we set $C_{x,i} := a + f - Y_{x,i} \sim U(f, a + f)$. Survival times $T_{A,i}$ in the control intervention group $A$ were generated according to a Weibull distribution with $\Lambda_A(s) := -\log(S_1) \cdot s^\kappa$ with prefixed shape parameter $\kappa$ and 1-year survival rate $S_{T_A}(1) = S_1 = 0.5$. Survival times $T_{B,i}$ in the experimental intervention group $B$ were generated according to a Weibull distribution with $\Lambda_B(s) := \omega \cdot \Lambda_A(s)$, where $\omega$ is the true hazard ratio.
To perform the classical one-sample log-rank test, the standard arm data was used to calculate the Nelson-Aalen estimate $\hat\Lambda_A$ of $\Lambda_A$. The obtained estimate $\hat\Lambda_A$ was then treated as a deterministic function and used as the (prefixed, deterministic) reference cumulative hazard function in the classical one-sample log-rank statistic (see [1,2]). In contrast, the new test was performed as described above.
To study the impact of sample size and allocation ratio on the amount of type I error rate inflation, the total sample size $n = r \cdot a$ of the virtual data sets was chosen as $n = 100, 500, 1000$. For each of these total sample sizes we considered allocation ratios $\pi \in \{2, 1, 1/2, 1/4, 1/8, 1/16\}$. Scenarios with $\pi \leq 1/2$ are more likely to reflect common practice, as the size of the experimental cohort is typically smaller than the size of the historical control cohort. To study the impact of different shapes of the survival distribution, we considered different values for the Weibull shape parameter from the interval $[0.1, 5]$.
For each parameter constellation, we generated 10000 samples of size n to which we applied both the new test as well as the classical one-sample log-rank test. The desired two-sided significance level was 5%. Results are shown in Table 1 and discussed below.
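The data-generating mechanism of simulation study I can be sketched in Python as follows (illustrative only; function names are our own, and survival times are drawn by inversion from the Weibull cumulative hazard):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_arm(n, a, f, kappa, s1, omega):
    """One simulated arm of the virtual trial: uniform accrual on [0, a],
    administrative censoring at calendar time a + f (no loss to follow-up),
    and Weibull survival with cumulative hazard omega * (-log(s1)) * t**kappa."""
    entry = rng.uniform(0.0, a, size=n)            # calendar entry times Y_{x,i}
    cens = a + f - entry                           # censoring times C_{x,i}
    u = rng.uniform(size=n)
    # invert S(t) = exp(-omega * (-log(s1)) * t**kappa) at u
    t = (-np.log(u) / (omega * -np.log(s1)))**(1.0 / kappa)
    x = np.minimum(t, cens)                        # observed time X_{x,i}
    d = (t <= cens).astype(int)                    # event indicator
    return x, d

xa, da = simulate_arm(500, 2.0, 3.0, 1.0, 0.5, 1.0)   # control arm (omega = 1)
xb, db = simulate_arm(500, 2.0, 3.0, 1.0, 0.5, 0.7)   # experimental arm
```

With $\kappa = 1$ and $S_1 = 0.5$ the control arm is exponential with median survival of one year, so almost all control patients experience the event within the 3 to 5 years of follow-up.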

Results
The classical one-sample log-rank test does not account for sampling variability of the reference curve estimate. This leads to type I error rate inflation when the underlying null hypothesis to be tested is $H_0: \Lambda_B = \Lambda_A$. As expected, our simulations support that the amount of type I error rate inflation of the classical one-sample log-rank test is most pronounced when the allocation ratio $\pi$ is large. For any fixed allocation ratio, the inflation slightly decreases with increasing overall sample size but remains on a similar level. For ratios $\pi \geq 1$, the true type I error rate is more than three times higher than the desired one ($\sim 17\%$ instead of 5% for $\pi = 1$). For low allocation ratios such as 1/8 or 1/16, the actual type I error still exceeds the nominal level, but to an extent that might be acceptable for a Phase II trial ($\sim 6.5\%$ for $\pi = 1/16$ and $n = 1000$). With a view to the classical one-sample log-rank test, this supports that the choice of the reference curve should be based on a historic control that is at least 10 times larger than the new experimental trial cohort. Reassuringly, the new test that explicitly accounts for reference curve variability realizes an empirical type I error rate close to the desired 5% in almost all scenarios. Notice that the new test would hardly be applied in the scenario with $n = 100$ and $\pi = 1/16$, as this implies a trial with only $n_B = 100/17 \approx 6$ patients. So the entries for $n = 100$ and $\pi = 1/16$ have to be interpreted with care, but are shown for reasons of transparency and completeness. The simulations thus support that neglecting the reference curve variability relevantly compromises type I error rate control when testing the null hypothesis $H_0: \Lambda_B = \Lambda_A$. Notice that the classical one-sample log-rank test only realizes strict type I error rate control for testing the null hypothesis $\tilde H_0: \Lambda_B = \hat\Lambda_A$ which, however, detracts from the null hypothesis $H_0: \Lambda_B = \Lambda_A$ when the random deviation of $\hat\Lambda_A$ from $\Lambda_A$ is large.

Table 1. Empirical type I error rates $\alpha_{new}$ and $\alpha_{LR}$ for testing $H_0: \Lambda_B = \Lambda_A$ using the new test statistic $Z$ and the classical one-sample log-rank statistic, respectively, for Weibull distributed survival times with shape parameter $\kappa$ and 1-year survival rate $S_1 = 0.5$ in the control arm. Theoretical two-sided significance level: 5%. Underlying total sample size $n$ with allocation ratio $\pi$.
Simulation Study II: Comparison with the Two-Sample Log-Rank Test

Design
We proposed a significance test for the null hypothesis $H_0$ based on the approximate large sample distribution of the test statistic $Z$ introduced before. Despite being derived as a one-sample log-rank test with consideration of reference curve variability, the new test may also be interpreted as a two-sample survival test. This simulation therefore aims to study the performance of the new survival test for sample sizes of practical relevance, as compared to the classical two-sample log-rank test (see [16,17]). Asymptotically (i.e. for sufficiently large sample size) the classical two-sample log-rank test is known to be the optimal test under proportional hazards (PH) alternatives. It is thus of particular interest to compare the performance of both tests under PH alternatives. In our simulations, patients were assumed to enter the trial uniformly between year 0 and year $a$ with an overall accrual rate of $r = 100$ per year. Accordingly, the calendar times of entry were generated according to a uniform distribution $Y_{x,i} \sim U(0, a)$ on $[0, a]$. Patients were allocated equally to both treatment arms $A$ and $B$ (allocation ratio $\pi = 1$), corresponding to an annual accrual rate of 50 patients per group. After the end of the accrual period, patients were assumed to be followed up for a further $f = 3$ years, with no loss to follow-up. Accordingly, we set $C_{x,i} := a + f - Y_{x,i}$. Survival times $T_{A,i}$ in the control intervention group $A$ were generated according to a Weibull distribution with $\Lambda_A(s) := -\log(S_1) \cdot s^\kappa$ with prefixed shape parameter $\kappa$ and 1-year survival rate $S_{T_A}(1) = S_1$. To implement the PH condition, survival times $T_{B,i}$ in the experimental intervention group $B$ were generated according to a Weibull distribution with $\Lambda_B(s) := \omega \cdot \Lambda_A(s)$, where $\omega$ is the true hazard ratio. The true hazard ratio $\omega$ has to be distinguished from the expected hazard ratio $\omega_0$, which defines the planning alternative $K_0: \Lambda_B = \omega_0 \cdot \Lambda_A$ underlying the sample size calculation.
The classical two-sample log-rank test serves as the reference. The sample size $n$ of the virtual trials was thus calculated as follows: in a first step, we used Schoenfeld's formula from [18] to calculate the required number of events $d$ for the two-sample log-rank test to achieve a power of $1 - \beta$ under the planning alternative $K_0: \Lambda_B = \omega_0 \cdot \Lambda_A$ for the allocated two-sided significance level $\alpha$. In a second step, equating the expected number of events under the planning alternative $K_0$ to $d$ and solving for the indeterminate $a$ yields the required length of the accrual period. The required total sample size is $n := r \cdot a$ (i.e. $n/2$ per treatment group).
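The first step can be sketched as follows; this is the standard form of Schoenfeld's event-count formula, parametrised with the allocation ratio $\pi = n_B/n_A$ as in this paper.

```python
import numpy as np
from scipy.stats import norm

def schoenfeld_events(omega0, alpha=0.05, power=0.8, pi=1.0):
    """Required number of events d for the two-sample log-rank test
    (Schoenfeld's formula): hazard ratio omega0, two-sided level alpha,
    target power, allocation ratio pi = n_B / n_A."""
    za = norm.ppf(1 - alpha / 2)
    zb = norm.ppf(power)
    return (1 + pi)**2 / pi * (za + zb)**2 / np.log(omega0)**2
```

For a balanced design ($\pi = 1$) with $\omega_0 = 0.5$, two-sided $\alpha = 5\%$ and 80% power, this gives roughly 66 required events; weaker effects such as $\omega_0 = 0.8$ require an order of magnitude more.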
To cover scenarios of larger and smaller sample sizes, we let the expected hazard ratio $\omega_0$ range in the set $\{0.5, 0.67, 0.8\}$. To study the impact of different shapes of the survival distribution, we considered different values for the Weibull shape parameter from the interval $[0.1, 5]$. To study the impact of the event rate, we chose a reference arm 1-year survival rate $S_1$ of 0.5 (Table 2), 0.8 (see Appendix C) and 0.2 (see Appendix D).
For each parameter constellation, we generated 10000 samples of size $n$ to which we applied both the new test as well as the classical two-sample log-rank test. We finally also used formula (8) to calculate the sample size $n'$ such that our new test achieves a power of $1 - \beta$ under the planning alternative $K_0: \Lambda_B = \omega_0 \cdot \Lambda_A$ for an allocated two-sided significance level of 5%, and then repeated the above simulations based on a total sample size of $n'$ instead of $n$. Reported in Tables 1-3 are the empirical type I and type II error rates for each parameter constellation and test, based on a sample size of $n$ (Scenario 1) or $n'$ (Scenario 2).

Table 2. Empirical type I error rates ($\alpha_{new}$ and $\alpha_{LR}$) and powers ($1 - \beta_{new}$ and $1 - \beta_{LR}$) for the new test and for the classical two-sample log-rank test, respectively, under proportional hazards alternatives for Weibull distributed survival times with shape parameter $\kappa$ and 1-year survival rate $S_1 = 0.5$ in the control arm. Theoretical two-sided significance level: 5%. Underlying total sample size $n$ (or $n'$) in Scenario 1 (or Scenario 2), calculated to achieve a theoretical power of 80% under the planning alternative $K_0: \Lambda_B = \omega_0 \cdot \Lambda_A$ for the classical log-rank test using Schoenfeld's formula (or for the new test using formula (9)).

Results of the Main Setting (Table 2, Scenario 1)
Reassuringly, for large sample sizes ($\omega_0 = 0.8$), both tests preserve the desired significance level and achieve similar power levels close to the desired 80% for all shape parameter values $\kappa$. On closer inspection, one notices that both tests tend to be conservative for small values of $\kappa$, and slightly anti-conservative for larger values of $\kappa$, to an acceptable degree (empirical type I error ranging between 4.5% for $\kappa = 0.1$ and 5.3% for $\kappa = 5$ with $\omega_0 = 0.8$). For the classical two-sample log-rank test, this effect is overlaid by a general tendency towards anti-conservativeness when the sample size is small ($\omega_0 = 0.5$), resulting in an empirical type I error of up to 5.9% for the classical two-sample log-rank test when $\kappa = 5$ and $\omega_0 = 0.5$. For shape parameters $\kappa \leq 1$, both tests perform similarly well, with empirical type I error rates close to 5%. Interestingly, the new test even surpasses the classical two-sample log-rank test regarding power performance when $\kappa \leq 1$. This effect is most pronounced for exponentially distributed survival times ($\kappa = 1$), where the new test achieves a power of up to 83% as compared to 80% for the classical two-sample log-rank test. For shape parameters close to the exponential distribution ($\kappa \approx 1$), the new test is observed to show even better type I error rate control than the classical two-sample log-rank test when the sample size is small ($\omega_0 = 0.5$).
For the extreme scenario of large shape parameters $\kappa \geq 2$ in combination with a small sample size ($\omega_0 = 0.5$), however, the new test is observed to become quite conservative, with a profound loss in power (40% instead of 80% for $\kappa = 5$ and $\omega_0 = 0.5$). This is due to the fact that the new test requires estimation of the control arm cumulative hazard function, which seems to fail when the sample size of the control arm is small and, at the same time, early events are rare ($n_A = 34$ for $\kappa = 5$ and $\omega_0 = 0.5$). In contrast, the classical two-sample log-rank test maintains power also in these extreme scenarios, though with a tendency towards anti-conservativeness.
This behavior of both tests is consistently observed amongst scenarios with different event rates (see the tables in Appendices C and D).

Case Study
As seen in the preceding simulations, the type I error of the classical one-sample log-rank test always exceeds the nominal type I error level if the sampling variability of the reference curve is not taken into account. However, the magnitude of this excess depends on the data from the reference cohort as well as on the sample size of the new, experimental cohort. The only difference between the test statistic of the classical one-sample log-rank test and the new test is the denominator. Let $R := \hat\Sigma^2_{OSLR}(\infty)/\hat\Sigma^2(\infty)$ denote the ratio of the variance estimates without and with consideration of the sampling variability, so that $\sqrt{R}$ is the corresponding ratio of the standardisations. The expected level of a two-sided classical one-sample log-rank test with nominal level $\alpha$ neglecting the sampling variability is then approximately $2\big(1 - \Phi\big(\Phi^{-1}(1 - \alpha/2) \cdot E[\sqrt{R}]\big)\big)$. In turn, $E[\sqrt{R}]$ can be approximated via a first-order Taylor expansion, which yields a quantity we can estimate from given historical control data and the design parameters of a trial.
From the computations in [6] we obtain an expression for the variance used by the classical one-sample log-rank statistic. After another approximation and some computations (see Appendix B), we also obtain an expression for the additional variance component due to the reference curve estimation. Under the null hypothesis, these can be estimated by plugging in Kaplan-Meier estimates from the control group for $F_{T_B}$ resp. $S_{T_B}$. For a given historical control group, these formulas can then be used to compute the type I error inflation when the sampling variability is not taken into account. Of course, the treatment group allocation ratio $\pi$ is essential for the extent of this inflation.
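The resulting inflation can be sketched numerically: given (an estimate of) the ratio $\rho$ of the classical to the corrected standardisation, the actual level of a nominal-$\alpha$ two-sided test is approximately $2\big(1 - \Phi(\Phi^{-1}(1 - \alpha/2)\cdot\rho)\big)$. A minimal Python sketch follows (our own illustration, assuming a normally distributed statistic; $\rho$ plays the role of $E[\sqrt{R}]$):

```python
from scipy.stats import norm

def inflated_level(rho, alpha=0.05):
    """Actual two-sided level of a test that standardises with a standard
    deviation that is rho times the correct one (rho < 1 when the reference
    curve's sampling variability is ignored)."""
    z = norm.ppf(1 - alpha / 2)          # nominal critical value
    return 2.0 * (1.0 - norm.cdf(z * rho))
```

For $\rho = 1$ the nominal level is recovered, while already $\rho = 0.7$ roughly triples a nominal 5% level.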
We will now illustrate the influence of basic design parameters on the type I error inflation with a practical example. We employ the setting of the Mayo Clinic trial in primary biliary cirrhosis of the liver (PBC), a rare but fatal chronic disease whose cause is still unknown (see [19]). In this double-blinded randomized trial, the drug D-penicillamine (DPCA) was compared with a placebo. The study data is publicly available via the survival package in R [20,15]. In Fig 1, we also display the empirical distribution of the censoring variable $C$ in this cohort. As we will see below, this distribution plays a substantial role in our computations here. We now suppose that a new treatment becomes available and that the data from this trial shall be used to compare the survival under this treatment to the survival under treatment with DPCA. This shall be accomplished in a trial in which patients are recruited uniformly over an accrual period of length $a$ and followed up over an additional period of length $f$. The allocation ratio will again be denoted by $\pi$. If one cannot find a suitable parametric model to be fitted to the data, the Kaplan-Meier resp. Nelson-Aalen estimates (see Fig 1) are employed as reference curves for the one-sample log-rank test. Similarly to our first simulation study (see Simulation Study I: Comparison with the Classical One-Sample Log-Rank Test), we investigate the influence of the allocation ratio on the inflation of the type I error level in the first part of our study. We choose $\pi \in \{0.01, 0.02, 0.03, \ldots, 1\}$, $a = 2$ and $f \in \{2, 4, 6, 8\}$.
The results in terms of the actual type I error level of the one-sample log-rank test can be found on the left-hand side of Fig 2. For any fixed $f$, the actual type I error level increases nearly linearly over the range of allocation ratios considered here. So, as a rule of thumb, each additional trial patient raises the level by a fixed number of percentage points. This number, however, seems to depend on the length of the follow-up, where a longer follow-up period leads to a steeper increase. In the second part of this case study, we take a closer look at the role of the trial duration. As already seen in the first part, longer trials lead to a larger inflation of the error levels. To analyse this dependence, we now choose $\pi = 0.5$, $a \in \{2, 4, 6\}$ and $f \in \{0, 0.05, 0.1, \ldots, 6\}$. The results can be found on the right-hand side of Fig 2. As we can see, trials with a longer total duration ($a + f$) tend to lead to a higher type I error inflation. This effect is most substantial if the total trial duration is close to the longest observation in the reference data set, which amounts to about 12.5 years. In this case, the testing procedure needs to utilize parts of the Nelson-Aalen estimator which are affected by a high amount of variability because of the high number of censored observations. However, the inflated type I error does not behave completely monotonically w.r.t. either the accrual duration $a$ or the follow-up duration $f$. Even if the variance of the classical one-sample log-rank test and the additional variance which is due to the sampling variability (see Appendix A) both increase monotonically in $a$ and $f$, the ratio $R$ can increase if the increase of the former is steeper than the increase of the latter. Nevertheless, there is a clear tendency towards a larger inflation of the type I error if either $a$ or $f$ increases.

Discussion
Traditional one-sample log-rank tests compare the survival function of an experimental treatment to a prefixed reference survival curve, which typically represents the expected survival under standard of care. The choice of the reference survival curve is typically based on historic data on standard therapy and is thus prone to statistical error. Nevertheless, traditional one-sample log-rank tests do not account for the variance of the reference curve estimator. Here we propose and study a non-parametric one-sample log-rank test that explicitly accounts for the sampling variability of the reference curve.
The new test may also be interpreted as a two-sample test for survival distributions, while inheriting its interpretability from the underlying one-sample log-rank test. Admittedly, our simulations suggest that it may be advisable to compare the data of a historical control cohort with the new data of a single-arm Phase II trial via the two-sample log-rank test if one wants to account for sampling variability of the reference curve, or in the case of allocation ratios close to 1. Nevertheless, in Phase II settings with fast events (Weibull shape parameter $\kappa \leq 1$), our simulations reveal the potential of the new test to outperform the classical two-sample log-rank test even under PH alternatives. Ignoring the sampling variability leads to an inflation of the type I error rate. The extent of this inflation depends in particular on the size of the control cohort. A major objective of this work was to investigate how large this control cohort must be so that the type I error inflation remains within an acceptable range. In this regard, our simulations support that the classical one-sample log-rank test is adequate if the historical control cohort is at least about 10 times larger than the new cohort ($\pi \leq 1/10$) and the maximum follow-up in the new trial is reasonably small in view of the follow-up duration in the historic cohort (see Case Study). Conceptually, the proposed new test also sheds light on a general strategy for lifting existing methodology for single-arm survival trials to a randomized, multi-arm setting. This might be of interest for designing confirmatory survival trials with interim analyses. The performance of interim analyses in clinical trials is of ethical and economic interest. On the one hand, interim analyses enable faster decisions regarding rejection or acceptance of the underlying null hypothesis when the treatment effect is larger or smaller than initially expected.
Moreover, interim analyses offer the possibility of data-dependent modifications of the trial (e.g. sample size recalculation) in the light of new insights, thus increasing the prospects of the trial. Trial designs with interim analyses offering this kind of flexibility at full type I error rate control are commonly referred to as confirmatory adaptive designs [21,22]. Whereas the methodology for confirmatory adaptive designs is well understood for trials with short-term endpoints, as in [23,24], subtle problems arise for adaptive survival trials. With the standard methodology for group-sequential adaptive survival trials from [25], the degree of flexibility is highly limited. For example, in a survival trial with primary endpoint overall survival (OS), essentially only interim information on the survival status of the patients may be used for design modifications (e.g. sample size recalculation). Further interim information, e.g. on the progression status of the patients, must not be used for design modifications in these classical adaptive Phase III survival trials, because otherwise type I error rate inflation may occur (see [26]). This situation is clinically unsatisfactory. If a larger degree of flexibility is desired, the patient-wise separation approach initially proposed by [27] has to be chosen, which, however, either implies neglecting part of the observed survival data in the test statistic or requires worst-case adjustments resulting in a conservative design, as shown in [28]. To date, no satisfactory methodology for adaptive Phase III survival trials exists that offers greater flexibility while avoiding the problems involved with the patient-wise separation approach. Recently, however, such methodology was proposed for single-arm Phase II survival trials.
In [29], an adaptive one-sample log-rank test was suggested that allows the simultaneous use of several time-to-event endpoints for data-dependent design modifications, while avoiding the problems involved with the patient-wise separation approach. In the same way that the common one-sample log-rank test was lifted to a two-sample setting in this paper, we expect that the multivariate adaptive one-sample log-rank test proposed by [29] may be lifted to a two-sample setting, thus solving an outstanding problem in the theory of adaptive design methodology. Implementation of this idea, however, is beyond the scope of this paper and will be the subject of an upcoming paper. The objective of this paper is to provide methodology for accounting for the sampling variability of the reference curve in classical one-sample log-rank tests, and to demonstrate the feasibility of the underlying lifting procedure with regard to type I and type II error rate control.

Acknowledgments
The work of Moritz Fabian Danzer was funded by the German Science Foundation (Deutsche Forschungsgemeinschaft, DFG, grant number 413730122).

Appendix A: Proofs

Theorem 1. Let s_0 > 0 be given such that S_XA(s_0) = S_TA(s_0) S_CA(s_0) =: p_0 > 0 and assume that the null hypothesis H_0 holds. Then the following is true:

Proof. Under the contiguous alternatives, the difference between the mean-zero martingale M from (A.4) and M_0 is

As n → ∞, the first factor converges to −γ.

Proof. For this proof, we introduce the following abbreviations:

As n → ∞, by the law of large numbers it remains to prove that Θ →P lim_{n→∞} E[Θ] as n → ∞. The proof decomposes into several steps. By the triangle inequality we conclude that for any ε > 0:

(A.7)

By Lemma 6, lim_{n→∞} E[Θ] exists, i.e. the first summand on the right-hand side of equation (A.7) vanishes in the limit n → ∞. By conditioning on the outcomes in group B, the second summand can be rewritten as

where π = n_B/n_A is the prefixed treatment group allocation ratio. Notice that the third inequality holds because x_B is a fixed value and not a random variable. To finish the proof, we show that sup_{s′ ∈ [0,s_0]} Var[n_A · Ψ_MA(s′)] → 0 (Step I) and sup_{s′ ∈ [0,s_0]} Var[n_A · Ψ_ΛA(s′)] → 0 (Step II) as n_A → ∞.

Proof of Step I: For any s′ ∈ [0, s_0], we also have

For both summands we can apply the results from Lemma 4.2 (i) of [30], as the probability that Y_A(s) = 0 goes to zero uniformly on [0, s_0], to get

We can finally plug this estimate in to obtain

where the first inequality in the third row follows from Lemma 1 of [31]. These inequalities hold for all s′ ≤ s_0. We thus conclude that for any s′ ≤ s_0

which finishes the proof.

Lemma 4.

where S_Tx (S_Cx) is the survival function of the time to event T_x,i (time to censoring C_x,i) in treatment group x = A, B.
Proof. This can easily be shown with Helland's proposition [32], using the fact that Y_x(u)/n_x converges in probability to y_x(u) ≡ S_Tx(u) · S_Cx(u).

Lemma 5.
Let X_x,i := T_x,i ∧ C_x,i. Then, for any i ≠ j, the density f_Xx,i(u) and survival function S_Xx,i(u) of X_x,i, as well as the density f_{Xx,i ∧ Xx,j}(u) of X_x,i ∧ X_x,j, are

f_Xx,i(u) = f_Tx(u) S_Cx(u) + f_Cx(u) S_Tx(u),
S_Xx,i(u) = S_Tx(u) S_Cx(u),
f_{Xx,i ∧ Xx,j}(u) = 2 f_Xx,i(u) S_Xx,i(u),

where f_Tx and S_Tx (f_Cx and S_Cx) are density and survival function of the time to event T_x,i (time to censoring C_x,i) in treatment group x = A, B.
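Lemma 5 can be checked numerically in a simple special case. For independent exponential event and censoring times, the standard identities S_X = S_T · S_C and f_{X_i ∧ X_j} = 2 f_X S_X imply that X = T ∧ C is Exp(λ_T + λ_C) and the pairwise minimum X_i ∧ X_j is Exp(2(λ_T + λ_C)). A minimal Monte Carlo sketch of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
lam_T, lam_C = 1.0, 0.5   # exponential event and censoring hazard rates

n = 200_000
T = rng.exponential(1 / lam_T, (n, 2))   # two independent event times per row
C = rng.exponential(1 / lam_C, (n, 2))   # two independent censoring times per row
X = np.minimum(T, C)                     # i.i.d. copies X_i, X_j within each row

# S_X(u) = S_T(u) S_C(u): X is Exp(lam_T + lam_C)
emp_surv = (X[:, 0] > 1.0).mean()
print(emp_surv, np.exp(-(lam_T + lam_C)))  # both close to exp(-1.5)

# f_{X_i ∧ X_j}(u) = 2 f_X(u) S_X(u): pairwise minimum is Exp(2(lam_T + lam_C))
pair_min = X.min(axis=1)
print(pair_min.mean(), 1 / (2 * (lam_T + lam_C)))  # both close to 1/3
```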
Proof. Follows from elementary calculations with probability distributions, using the independence of T_x,i, T_x,j, C_x,i and C_x,j for i ≠ j.

Lemma 6. Let Σ̂ be the process defined in (A.1). Then, pointwise in s, the limit Σ(s) := lim_{n→∞} E[Σ̂(s)] exists. Under the contiguous alternatives Λ_B(·) = ω_n Λ_A(·) with ω_n = exp(−n^{−1/2} γ) for some γ ≥ 0, Σ(s) amounts to

where f_Tx, F_Tx, S_Tx (f_Cx, F_Cx, S_Cx) are density, distribution function and survival function of the time to event T_x,i (time to censoring C_x,i) in treatment group x = A, B, σ_A(·) is the function from (A.11), and π = n_B/n_A is the prefixed treatment arm allocation ratio.
Proof. Since each of the families of random variables

where the third equality holds by dominated convergence under application of the estimate from Lemma 1 in [30], and the convergence of the expectation holds by Lemma 4.2 in [31]. In particular, the second summand on the right-hand side of (A.14) vanishes as n → ∞. We are thus done by supplying the explicit value of the density function f_{XA,i ∧ XA,j} from (A.12). Notice that the last equality in (A.15) holds because lim_{n→∞} Λ_B = Λ_A under the contiguous alternatives.

Appendix B: Computation of the Expectation of Σ̂(∞)
In this section we derive the expectation of Σ̂(∞) as stated in Section 7. Firstly, E[Σ̂²(∞)] can be decomposed into

where the first summand is the same as the quantity in (10). For the second summand, we have

as n → ∞. The expectation on the right-hand side is given by

where we applied the product rule several times.
Appendix C: Empirical type I error rates in scenarios with high survival rates

Table 3: Empirical type I error rates (α_new and α_LR) and powers (1 − β_new and 1 − β_LR) for the new test and for the classical two-sample log-rank test, respectively, under proportional hazards alternatives for Weibull distributed survival times with shape parameter κ and 1-year survival rate S_1 = 0.8 in the control arm. Theoretical two-sided significance level: 5%. The underlying total sample size n (or n′) in Scenario 1 (or Scenario 2) is calculated to achieve a theoretical power of 80% under the planning alternative H_1: Λ_B = ω_0 · Λ_A for the classical log-rank test using Schoenfeld's formula (or for the new test using formula (9)).

Appendix D: Empirical type I error rates in scenarios with low survival rates

Table 4: Empirical type I error rates (α_new and α_LR) and powers (1 − β_new and 1 − β_LR) for the new test and for the classical two-sample log-rank test, respectively, under proportional hazards alternatives for Weibull distributed survival times with shape parameter κ and 1-year survival rate S_1 = 0.2 in the control arm. Theoretical two-sided significance level: 5%. The underlying total sample size n (or n′) in Scenario 1 (or Scenario 2) is calculated to achieve a theoretical power of 80% under the planning alternative H_1: Λ_B = ω_0 · Λ_A for the classical log-rank test using Schoenfeld's formula (or for the new test using formula (9)).
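The table captions refer to Schoenfeld's formula for sizing the classical two-sample log-rank test. Its standard form gives the required number of events; the sketch below is ours (default α, power and allocation are illustrative, and the exact variant used in the paper may differ), and the total sample size n then follows by dividing by the per-patient event probability implied by accrual and follow-up.

```python
from math import ceil, log
from statistics import NormalDist

def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.80, p=0.5):
    """Required total number of events by Schoenfeld's formula for the
    two-sample log-rank test:
        d = (z_{1-alpha/2} + z_{power})^2 / (p (1 - p) (log HR)^2),
    where p is the allocation proportion of one arm."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    d = (z(1 - alpha / 2) + z(power)) ** 2 / (p * (1 - p) * log(hazard_ratio) ** 2)
    return ceil(d)
```

For example, `schoenfeld_events(0.7)` yields 247 events for 80% power at a two-sided 5% level with 1:1 allocation; the formula is symmetric in HR and 1/HR.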