
Two-sample survival tests based on control arm summary statistics

  • Jannik Feld ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft

    Jannik.Feld@ukmuenster.de

    Affiliation Institute of Biostatistics and Clinical Research, University of Münster, Münster, Germany

  • Moritz Fabian Danzer,

    Roles Formal analysis, Investigation, Methodology, Validation

    Affiliation Institute of Biostatistics and Clinical Research, University of Münster, Münster, Germany

  • Andreas Faldum,

    Roles Project administration, Resources, Supervision

    Affiliation Institute of Biostatistics and Clinical Research, University of Münster, Münster, Germany

  • Anastasia Janina Hobbach,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Cardiology I, Coronary, Peripheral Vascular Disease and Heart Failure, University Hospital Münster, Münster, Germany

  • Rene Schmidt

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Validation, Writing – original draft

    Affiliation Institute of Biostatistics and Clinical Research, University of Münster, Münster, Germany

Abstract

The one-sample log-rank test is the preferred method for analysing the outcome of single-arm survival trials. It compares the survival distribution of the patients with a prefixed reference survival curve that usually represents the expected outcome under standard of care. However, classical one-sample log-rank tests assume that the reference curve is known, ignoring that it is frequently estimated from historical data and therefore subject to sampling error. Neglecting the variability of the reference curve can lead to an inflated type I error rate, as shown in a previous paper. Here, we propose a new survival test that accounts for the sampling error of the reference curve without requiring knowledge of the full underlying historical survival time data. The new test permits a valid historical comparison of patient survival times when only a historical survival curve, rather than the full historical data, is available. It thus applies in settings where the two-sample log-rank test is not applicable as the method of choice because historical individual patient survival times are unavailable. We develop sample size calculation formulas, give an example application and study the performance of the new test in a simulation study.

Introduction

The one-sample log-rank test [1] is the method of choice for single-arm Phase II trials with a time-to-event endpoint. It compares the survival of the patients to a prefixed reference survival curve that typically represents the expected survival under standard of care.

A central point of criticism of the one-sample log-rank test relates to the process of selecting the reference survival curve. It is common practice to choose the reference survival curve in the light of historic data on standard treatment. This implies that the choice of the reference survival curve is itself prone to statistical error, which, however, is ignored in the classical one-sample log-rank statistic. In classical log-rank tests the prefixed reference curve is instead treated as deterministic. Doing so results in an inflation of the type I error rate when the prefixed reference curve in fact resulted from an estimation process [2].

For this reason, it is standard practice to use the two-sample log-rank test to compare patient survival with a historical control group whenever historical survival data are available. However, situations may arise in which the full historical survival data on the individual patient level are not (or no longer) available to the researcher. This may occur for a variety of reasons, ranging from copyright to privacy, such as when data sharing is not covered by informed consent or individual patients have withdrawn their consent. In such a situation, generally only summary statistics, such as a survival distribution estimator based on the historical survival times together with basic data on sample size, event numbers, and length of follow-up, will be available in peer-reviewed scientific publications. Then, application of the two-sample log-rank test is not possible because the required complete historical survival data on the individual patient level are not available. Conversely, it is also not permissible to use the published survival estimator as a deterministic reference curve in a classical one-sample log-rank test, since this ignores the sampling error of the reference curve, leading to inflation of the type I error rate [2].

In this paper, we present a solution to this problem. We propose a new survival test that, under mild assumptions, allows the sampling error of the reference curve to be taken into account without knowledge of the full historical survival time data. This permits a valid historical comparison of patient survival times when only an estimator of the historical survival curve is available instead of the full historical survival data. Our proposed test can formally also be interpreted as a two-sample test for survival times and can thus be compared with the two-sample log-rank test in terms of test performance. However, the intended focus of application of our new test is the comparison of survival against a historical reference in the absence of the full historical survival time data.

The paper is organized as follows. After settling notation and the testing problem, we describe the test statistic and its distributional properties. We then provide an example application on real-world data. Additionally, we provide sample size calculation methodology. Calculation of rejection regions and sample sizes is based on the approximate distribution of the new test statistic in the large sample limit. Therefore, small sample properties of the new test regarding type I and type II error rate control are studied by simulation and compared to the two-sample log-rank test. We conclude with a discussion of future research.

General aspects

Notation

We consider a single-arm clinical trial with survival data from a treatment group B, say (experimental intervention, prospectively collected data). This survival data shall be tested against the survival under standard of care treatment A, say, represented by the reference survival curve SA. However, in practice the true survival curve under standard of care is unknown, and thus researchers may rely on an estimator ŜA of the reference survival curve taken, e.g., from a publication in a peer-reviewed journal. We denote by Λ̂A a corresponding estimator of the cumulative hazard, either taken from the literature or obtained through the equation Λ̂A = −log ŜA. These estimators stem from a group of historical patients who were treated under standard of care. However, the individual data on the patient level are usually not publicly available due to data protection and privacy reasons.

Let ℐB denote the set of patients from group B and nB the number of such patients. We denote by TB,i and CB,i the time from entry to event or censoring, respectively, for patient i ∈ ℐB. Let XB,i ≔ min(TB,i, CB,i) denote the minimum of both. As usual, we assume that the TB,i and CB,i are mutually independent (non-informative censoring).

Based on the observed data, we calculate the number of events from treatment group B up to study time s ≥ 0 as

NB(s) ≔ Σi∈ℐB I(XB,i ≤ s, TB,i ≤ CB,i). (1)

As usual, we let λB(s) ≔ limΔ↓0 P(s ≤ TB,i < s + Δ | TB,i ≥ s)/Δ denote the true hazard of a patient in group B and denote by ΛB(s) ≔ ∫0s λB(u) du the corresponding cumulative hazard function for s ≥ 0.

Finally, we denote by FB and SB ≔ 1 − FB the distribution function and survival function of the time to event TB,i, and by gB the density function of the censoring time CB,i, for i ∈ ℐB. We assume the tuples (CB,i, TB,i)i∈ℐB to be independent and identically distributed and CB,i and TB,i to be independent for each i ∈ ℐB. We denote by aB the recruitment period length and by fB the follow-up period length, resulting in a trial period length of aB + fB for the new intervention trial. In particular, we assume that patients enter the trial between calendar time 0 and aB and are then followed up until calendar time aB + fB, unless they are censored or experience an event earlier. Therefore, the intended follow-up of the first patient is aB + fB and that of the last patient is fB, respectively. Furthermore, we denote by rB ≔ nB/aB the recruitment rate of the new trial.

The above definitions apply analogously to group A. In particular, we denote by π = nB/nA the ratio of the group sizes. However, notice again that usually the individual data XA,i, TA,i and CA,i for i ∈ ℐA are unknown to the researcher, and only the survival curve estimator ŜA together with summary statistics about numbers at risk and sample size is available.

The testing procedure

We are interested in testing H0: ΛB(s) = ΛA(s) for all s ∈ [0, smax] for some prefixed time-horizon smax > 0, i.e. that the survival of the patients under the new experimental treatment coincides with survival under standard of care. The classical one-sample log-rank test could be used to test H0 if the reference cumulative hazard function ΛA were known. However, the practical user faces the problem that the true reference curve ΛA representing survival under standard of care is unknown. For this reason ΛA is often replaced within all test statistics by its Nelson-Aalen estimate Λ̂A derived from historic data. This means that the classical one-sample log-rank test of H0 is based on the statistic

ZOSLR ≔ (OB − ÊB)/√ÊB (2)

with observed number of events OB ≔ NB(smax) and estimator of the expected number of events under H0

ÊB ≔ Σi∈ℐB Λ̂A(XB,i ∧ smax), (3)

rejecting H0 whenever |ZOSLR| ≥ Φ−1(1 − α/2), where Φ−1 is the standard normal quantile function. This procedure would only be admissible if the variance of Λ̂A were negligible. In practice, however, it is typically not negligible [2]. Then, using the process OB − ÊB with the classical standardisation √ÊB leads to an under-estimation of variance and thus inflation of the type I error rate. The factor of under-estimation of variance may be approximated by 1 + π if recruitment and censoring mechanisms were equal in the new treatment group B and the historical group underlying the estimate Λ̂A. This motivates us to propose the following random variable as a new survival test statistic:

Zπ ≔ ZOSLR/√(1 + π). (4)

Then, an approximate level α test of H0 is defined by rejecting H0 whenever

|Zπ| ≥ Φ−1(1 − α/2), (5)

where α is the desired two-sided significance level. Notice again that we used the assumption of equal recruitment and censoring mechanisms in the new experimental and historic control cohorts. We will keep this assumption for the rest of the manuscript and will investigate in simulation studies how deviations from this assumption affect type I error rate control. Furthermore, notice that procedure (5) is equivalent to adjusting the nominal α-level of the classical one-sample log-rank test methodology to αnominal ≔ 2(1 − Φ(√(1 + π) ⋅ Φ−1(1 − α/2))), or analogously to using the critical boundary √(1 + π) ⋅ Φ−1(1 − α/2) for |ZOSLR|.

Notice that in theory one could use a corrected variance estimator for OB − ÊB (see Eq. (5) of [2]) instead of our approximation (1 + π) ⋅ ÊB. However, using the statistic Zπ yields some advantages. Firstly, calculation of Zπ can be executed with existing standard software, whereas the corrected variance estimator needs additional implementation effort. Secondly, computation of the corrected variance estimator requires the full dataset of the historical patients, whereas Zπ is based only on the sample size nA and the estimated reference curve Λ̂A [3] (or equivalently the Kaplan-Meier estimator [4]). This enables the usage of the Zπ-test (5) even in the absence of full historical survival time data.
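To illustrate, the corrected test takes only a few lines to implement. The following Python sketch (the paper's supplementary code is in R; function and argument names here are our own, hypothetical choices) computes the classical one-sample log-rank statistic from the observed data and an estimated reference cumulative hazard, then applies the √(1 + π) correction:

```python
import numpy as np
from scipy.stats import norm

def adjusted_oslr_test(x, event, cum_haz_ref, n_hist, alpha=0.05):
    """One-sample log-rank test with the sqrt(1 + pi) correction.

    x            : observed times (event or censoring) in the new cohort B
    event        : 1 if the observation is an event, 0 if censored
    cum_haz_ref  : callable, estimated reference cumulative hazard
                   (e.g. from a published Nelson-Aalen or -log Kaplan-Meier curve)
    n_hist       : sample size n_A of the historical cohort
    """
    x = np.asarray(x, dtype=float)
    event = np.asarray(event)
    o = event.sum()                          # observed number of events O_B
    e = np.sum([cum_haz_ref(t) for t in x])  # expected events E_B under H0
    z_oslr = (o - e) / np.sqrt(e)            # classical statistic
    pi = len(x) / n_hist                     # pi = n_B / n_A
    z_pi = z_oslr / np.sqrt(1.0 + pi)        # corrected statistic
    p_value = 2.0 * (1.0 - norm.cdf(abs(z_pi)))
    reject = abs(z_pi) >= norm.ppf(1.0 - alpha / 2.0)
    return z_pi, p_value, reject
```

Note that only the curve `cum_haz_ref` and the historical sample size `n_hist` enter the computation, mirroring the point that no individual historical data are needed.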

A real world example

Our hypothetical example stems from clinical cardiology and originates from the context of the 'obesity paradox'. This paradox refers to the observation that in certain cardiovascular diseases, overweight patients may exhibit a more favourable prognosis compared to non-obese patients. This phenomenon has been observed in several cardiovascular diseases, including heart failure, coronary heart disease and acute myocardial infarction. Despite the well-established link between obesity and an increased risk of cardiovascular disease, studies have found that obese individuals, particularly those with pre-existing cardiovascular disease, may have a lower mortality risk than non-obese patients. This observation has generated considerable interest and debate in the medical community. In this context it is interesting to test the null hypothesis H0 that the survival of obese patients suffering from acute myocardial infarction coincides with the survival of non-obese patients suffering from acute myocardial infarction at a significance level of α = 5%.

For researchers it could be beneficial to use reference survival curves published by large registries or obtained from health insurance data, as these could serve as a suitable option for representing the general population.

However, in our experience, when health insurance companies grant researchers access to their insurance data, they strictly prohibit the extraction of individual patient level data. Thus our method for making a comparison based upon summary statistics is needed. The survival curve in Fig 1 is obtained from data of the "AOK - Die Gesundheitskasse" (AOK), a collective of 11 local German health insurance funds covering one third of the German population.

Fig 1. Reference survival curve.

Kaplan Meier estimator of overall survival of patients hospitalized with a myocardial infarction in 2014 aged 30–60, without obesity and without a prior indication of chronic heart failure.

https://doi.org/10.1371/journal.pone.0305434.g001

It includes all patients hospitalized with a myocardial infarction in 2014, aged 30–60, without obesity and without a prior indication of chronic heart failure. For our hypothetical example, this survival curve takes on the role of the historically estimated reference curve for which access to the underlying patient level data XA,i, I(TA,i ≤ CA,i) for i ∈ ℐA may be restricted for the researchers.

We used the WebPlotDigitizer website (https://apps.automeris.io/wpd/) to extract the x-y data at 481 points from the obtained image, which we interpolated to reconstruct the reference survival function ŜA and finally calculate the reference cumulative hazard function Λ̂A = −log ŜA.
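This digitization step can be sketched as follows. The (time, survival) pairs below are hypothetical placeholders standing in for the 481 extracted points; the step interpolation and the −log transform are the only substantive operations:

```python
import numpy as np

# Hypothetical digitized (time, survival) pairs, as extracted with
# WebPlotDigitizer; the paper used 481 such points.
times = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
surv  = np.array([1.0, 0.93, 0.90, 0.88, 0.87])

def reference_cum_hazard(t):
    """Right-continuous step interpolation of the digitized survival curve
    (as for a Kaplan-Meier estimate), transformed to a cumulative hazard
    via Lambda = -log(S)."""
    idx = np.searchsorted(times, t, side="right") - 1
    return -np.log(surv[np.clip(idx, 0, len(surv) - 1)])
```

The resulting callable can then serve as the estimated reference cumulative hazard in the test statistic.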

We used this reference cumulative hazard function to test H0 against a sample of obese patients from the year 2014, aged 30–60, without a prior indication of chronic heart failure, representing the set of patients ℐB. The Kaplan-Meier estimators of both collectives are presented in Fig 2.

Fig 2. Overall survival of both collectives.

Kaplan-Meier estimators of overall survival of patients hospitalized with a myocardial infarction in 2014, aged 30–60, without a prior indication of chronic heart failure.

https://doi.org/10.1371/journal.pone.0305434.g002

If the classical one-sample log-rank test (2) is used without considering the sampling variability of the reference curve, it yields a p-value of pOSLR = 0.0320, suggesting a significant difference between the two groups. However, using our new technique (4) with π = 2632/10061, we obtain a p-value of pπ = 0.0562, which is not significant.

Another approach would be to use a reconstruction algorithm [5] to obtain reconstructed censored survival times on the individual patient level for the reference cohort and plug them into the classical two-sample log-rank test. This approach leads to a p-value of pTSR = 0.0449, yielding statistical significance once again. However, this procedure should only be used with great caution. On the one hand, reconstruction algorithms deliver solid results when used to reproduce Kaplan-Meier point estimators, since they are designed to reproduce these as accurately as possible. On the other hand, using the reconstructed data for estimating hazard ratios or plugging them into a two-sample log-rank test may lead to poor results, as small but systematic deviations that are of little relevance for point estimators may add up over the course of the tested time frame [5]. These deviations are not accounted for by the methodology of the two-sample log-rank test and may endanger correct type I error rate control. This effect is especially prominent when the number of extracted x-y data points from the Kaplan-Meier plot image is lower than the number of individual patient data points that need to be reconstructed. In this case the algorithm is under-saturated.

Finally, if researchers had both original data sets and applied a two-sample log-rank test, the resulting p-value would be pTS = 0.0503, indicating no significant difference. Thus, both the classical one-sample log-rank test and the two-sample log-rank test using reconstructed data resulted in anti-conservative p-values. In summary, our analysis agrees with the original two-sample log-rank test, which does not confirm the obesity paradox for the cohort of patients aged 30–60 suffering from acute myocardial infarction.

Sample size calculation

General formula

Testing H0 at the level α using Zπ is equivalent to testing H0 using the classical one-sample log-rank test at the adjusted level αnominal. Thus the sample size formula for the classical one-sample log-rank test from [6] immediately yields a sample size formula for the new test based on Zπ. In order to achieve a power of approximately 1 − β at an allocated significance level α and constant hazard ratio ω ≔ ΛB(s)/ΛA(s) for all s ≥ 0, the new test procedure (5) has to be performed based on a total number of events d obtained by plugging the adjusted significance level αnominal into the sample size formula of the classical one-sample log-rank test (see Eq. (4) of [6]). However, the existing methodology and software cannot be used directly, as the adjusted significance level αnominal itself depends on the sample size through π. This number of events d can be expected to be achieved if the sample size is chosen such that the expected number of observed events equals d.
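Since the adjustment only rescales the critical boundary, αnominal itself is straightforward to compute. A minimal Python sketch (function name is our own):

```python
from scipy.stats import norm

def alpha_nominal(alpha, pi):
    """Nominal level at which the classical one-sample log-rank test must be
    run so that the Z_pi procedure holds the two-sided level alpha: the tail
    mass beyond the inflated boundary sqrt(1 + pi) * z_{1 - alpha/2}."""
    z = norm.ppf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - norm.cdf((1.0 + pi) ** 0.5 * z))
```

For π = 0 (an infinitely large historical cohort) the adjusted level equals α, and it shrinks as the historical cohort gets smaller relative to the new cohort.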

Through administrative censoring at calendar time aB + fB, it is clear that the integrand is non-zero only on the interval [0, aB + fB], and thus the integral limits reduce accordingly. By fixing two of the three parameters aB, fB and rB by clinical consideration, using the identities nB = rBaB and π = (rBaB)/nA, and recognizing the functional dependency of the censoring distribution on the parameters aB and fB, one can numerically solve the above equation for the remaining free parameter among aB, fB and rB. We provide R code in the supplemental material to perform this sample size calculation.
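As an illustration of the numerical step, the sketch below solves for aB under simplifying assumptions that are ours, not the paper's: exponential survival in group B with rate lam, uniform accrual, administrative censoring only, and a required event number d computed beforehand (the iteration resolving the dependence of αnominal on π is omitted). The paper's supplement provides the full R implementation.

```python
import numpy as np
from scipy.optimize import brentq

def p_event(a_b, f_b, lam):
    """P(event observed) for exponential survival with rate lam, uniform
    accrual on [0, a_b] and administrative censoring at calendar time
    a_b + f_b (assumption: no other censoring)."""
    return 1.0 - (np.exp(-lam * f_b) - np.exp(-lam * (a_b + f_b))) / (lam * a_b)

def solve_accrual(d, r_b, f_b, lam, a_max=100.0):
    """Solve r_b * a_b * p_event(a_b) = d for the accrual length a_b,
    with recruitment rate r_b and follow-up length f_b fixed."""
    g = lambda a: r_b * a * p_event(a, f_b, lam) - d
    return brentq(g, 1e-6, a_max)  # root-finding on the expected-events equation
```

Solving for rB or fB instead works the same way, by treating the respective parameter as the root-finding variable.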

Closed formula for rare late events

The above methodology describes a general approach to sample size calculation. However, in the clinical setting it may be advantageous for the statistician to have a closed formula at hand. For this reason we develop a closed formula for a special case often met in practice.

We assume the following conditions:

  (i) No censoring except for administrative censoring at calendar time aB + fB.
  (ii) A plateau of the survival function at the edge of the observation window [0, smax], such that SB(s) = SB(smax) for all s ≥ smax.

Assumption (ii) may be approximately met when the survival curve stagnates for a long period of time. An example is the survival curve of patients after a complex surgery, which declines sharply shortly after the procedure but stagnates after a year. Also, in the context of cause-specific deaths such a plateau may occur naturally.

We now evaluate the power formula (6)

Under assumptions (i) and (ii), the probability of observing an event is equal to the probability of having an event at all, and thus the expectation on the left-hand side equals nBpB with pB ≔ FB(smax). We can obtain the required sample size nB by solving the above equation for nB. Furthermore, using the identity π = nB/nA, solving the equation above is equivalent to calculating the roots of the quadratic polynomial

Notice that since the intercept and linear coefficient are negative for 1 − β > 0.5, the polynomial has a positive root if and only if the quadratic coefficient is positive, i.e. (7)

This yields a lower bound for nA. When nA falls below this threshold, it is not possible to obtain a significant result with a power of at least 50% when sampling variability is taken into account. Notice that assumptions (i) and (ii) were introduced to easily approximate the probability of observing an event P(TB < CB). However, using the respective inequality, we can also obtain a lower bound for the more general sample size calculation: (8)

Furthermore, if inequality (7) holds, the distinct positive root of the polynomial can be calculated and thus the required sample size nB is given by

where

is the quadratic coefficient,

c1 ≔ −2 ⋅ ω3/2z1−α/2z1−β is the linear coefficient and

is the intercept.

Simulation study: Comparison with the two-sample log-rank test

Design

We proposed a significance test for the null hypothesis H0 based on the approximate large-sample distribution of the test statistic Zπ introduced above. Notice that only the sample size nA and the graph of the Nelson-Aalen estimate of group A are needed to compute the test statistic Zπ. Knowledge of the full survival data of group A is not required. However, in case of full knowledge of the data of group A, our test, which takes reference curve variability into account, may also be interpreted as a two-sample survival test. This simulation therefore aims to study the performance of the new survival test for sample sizes of practical relevance, compared to the classical two-sample log-rank test, which by design takes the variability of the historical data into account (see [7, 8]). Asymptotically (i.e. for sufficiently large sample sizes) the classical two-sample log-rank test is known to be the optimal two-sample test under proportional hazards (PH) alternatives. It is thus of particular interest to compare the performance of both tests under PH alternatives.

In clinical practice, the sample size calculation methodology is most likely to be used with the recruitment rate rB and follow-up duration fB fixed by clinical consideration, solving for the accrual period length aB and resulting in the sample size nB ≔ rBaB. However, since our approximation is based on the assumption that the censoring and recruitment mechanisms equal those of the historic cohort, we instead use the sample size equation to solve for rB with predefined accrual period length aB and follow-up duration fB (thus fixing the administrative censoring and recruitment mechanisms independently of the sample size), in order to gain clear insight into the performance of our test when the assumption of equal censoring and recruitment mechanisms is violated.

In our simulations, patients were assumed to enter the trial uniformly between year 0 and year aA = aB = 3 and to be followed up for fB ∈ {2, 3, 4} years in the intervention cohort and fA = 3 years in the historic cohort.

Accordingly, the calendar times of entry were generated according to a uniform distribution on [0, ax] for x ∈ {A, B}.

We considered scenarios without loss to follow-up as well as scenarios with additional independent exponentially distributed censoring, whose rate was varied between scenarios to analyse the robustness of the test when the censoring mechanisms are not equal.

Survival times TA,i in the control group A were generated according to a Weibull distribution, i.e. ΛA(s) ≔ −log(pA) ⋅ sκ, with prefixed shape parameter κ ∈ {0.1, 0.25, 0.5, 1.0, 1.5, 2.0, 5.0} and 1-year survival rate pA. To implement the PH condition, survival times TB,i in the experimental intervention group B were generated according to a Weibull distribution with ΛB(s) ≔ 2/3 ⋅ ΛA(s), fixing ω = 2/3 as the true hazard ratio. Additional simulation results with ω = 1/2 and ω = 4/5 are presented in the S1 Appendix.
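Such Weibull survival times can be drawn by inverse-transform sampling. In the sketch below, the 1-year survival rate p1 is a hypothetical input, since the value used in the paper's scenarios is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2024)

def weibull_survival_times(n, kappa, p1, hazard_ratio=1.0):
    """Draw n survival times with cumulative hazard
    Lambda(s) = hazard_ratio * (-log(p1)) * s**kappa,
    so that the baseline 1-year survival equals p1.
    Inverse transform: U = S(T) = exp(-Lambda(T))
    =>  T = (-log(U) / (hazard_ratio * -log(p1)))**(1/kappa)."""
    u = rng.uniform(size=n)
    return (-np.log(u) / (hazard_ratio * -np.log(p1))) ** (1.0 / kappa)
```

Setting hazard_ratio=2/3 then generates group B under the PH planning alternative.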

We let the number of patients in the historic group nA ∈ {200, 400, 800} vary between scenarios and used our sample size equation to determine the required number of patients nB in the intervention group for a targeted power of 1 − β = 80% for our new test under the planning alternative ΛB = 2/3 ⋅ ΛA at a two-sided significance level of α = 5%. From these parameter constellations we omitted any scenario with π > 2, since these might be unrealistic in practical use of the methodology.

For each parameter constellation, we generated 10000 samples, to which we applied both the new test and the classical two-sample log-rank test. Tables 1 and 2 report the empirical type I and type II error rates for each parameter constellation. To increase the readability of the tables, we omitted confidence intervals for our estimators; notice, however, that with 10000 samples, 95%-confidence intervals for the type I error rate estimators given in Tables 1 and 2 would have a width of about 0.004.
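To convey the structure of one such simulation, here is a compact, self-contained Python sketch of the type I error check under H0. The exponential survival with hazard 0.3 and administrative censoring at smax = 3 are our illustrative choices, not the paper's scenario parameters, and the Nelson-Aalen step function is approximated by linear interpolation for brevity:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def z_pi(x_b, d_b, cumhaz_hat, n_a):
    """Corrected one-sample log-rank statistic Z_pi."""
    o = d_b.sum()                # observed events in group B
    e = cumhaz_hat(x_b).sum()    # expected events under the estimated reference
    return (o - e) / np.sqrt(e) / np.sqrt(1.0 + len(x_b) / n_a)

n_a, n_b, smax, lam, reps = 400, 200, 3.0, 0.3, 2000
rejections = 0
for _ in range(reps):
    # historical cohort A, administratively censored at smax
    t_a = rng.exponential(1.0 / lam, n_a)
    x_a, d_a = np.minimum(t_a, smax), t_a <= smax
    # Nelson-Aalen estimate of the historical cumulative hazard
    order = np.argsort(x_a)
    xs, ds = x_a[order], d_a[order]
    increments = np.where(ds, 1.0 / (n_a - np.arange(n_a)), 0.0)
    na_cum = np.cumsum(increments)
    cumhaz = lambda t: np.interp(t, xs, na_cum, left=0.0)
    # new cohort B drawn from the SAME distribution, i.e. under H0
    t_b = rng.exponential(1.0 / lam, n_b)
    x_b, d_b = np.minimum(t_b, smax), t_b <= smax
    rejections += abs(z_pi(x_b, d_b, cumhaz, n_a)) >= norm.ppf(0.975)

empirical_alpha = rejections / reps  # should be close to the nominal 0.05
```

The full simulation code with the paper's Weibull scenarios and varying censoring mechanisms is provided in the supplementary R scripts.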

Table 1. Comparison of empirical type I and II errors of the new procedure and two-sample log-rank test with varying follow-up length.

https://doi.org/10.1371/journal.pone.0305434.t001

Table 2. Comparison of empirical type I and II errors of the new procedure and two-sample log-rank test with varying censoring mechanism.

https://doi.org/10.1371/journal.pone.0305434.t002

Results

Reassuringly, our new test statistic keeps the significance level of α = 5%. Even in scenarios where the equal-censoring assumption was moderately violated, the error rate inflation was negligible. The performance of the new test is comparable to the two-sample log-rank test, which is known to be asymptotically optimal under the proportional hazards assumption. In scenarios with early events (Weibull shape parameter ≤ 1.5) the performance was even slightly better. The sample size calculation algorithm, however, had slightly mixed results. Many scenarios achieved the targeted power of 80% quite accurately. However, in some extreme scenarios with particularly small sample sizes and extreme shape parameters of the reference survival curve, the sample size calculation yielded underpowered trials. For these combinations, we recommend validating the sample size by simulation and comparing it to that of a two-armed trial using the two-sample log-rank test.

Discussion

Traditional one-sample log-rank tests compare the survival under an experimental treatment to a prefixed reference survival curve, which typically represents the expected survival under standard of care. The choice of the reference survival curve is often based on historic data on standard therapy and is thus prone to statistical error. Nevertheless, traditional one-sample log-rank tests do not account for this variance of the reference curve estimator. This non-consideration of the sampling variability leads to an inflation of the type I error rate. The extent of this inflation depends in particular on the relative size of the control cohort compared to the intervention cohort [2].

Here we propose and study a non-parametric survival test that explicitly accounts for the sampling variability of the reference curve. Our proposed test does not require the availability of the complete data of the historical group. Instead, only the sample size and the graph of the Nelson-Aalen or Kaplan-Meier estimate are required to implement the test. Thus, in contrast to the two-sample log-rank test, our new test can also be calculated if access to the raw data of the historical group is restricted, e.g. due to data protection precautions. However, when the full historical data are accessible, the new test may also be interpreted as a two-sample test for survival distributions, while inheriting the interpretability of the underlying one-sample log-rank test. Admittedly, our simulations confirm that the two-sample log-rank test is the method of choice for comparing the data of a historical control cohort with the new data in single-arm Phase II trials if the sampling variability of the reference curve is pronounced and the full data of the historical cohort are available.

We provided sample size calculation methodology for our newly proposed test, which accounts for the added variance of the estimation process for the historical cohort. As expected, the calculated sample sizes are of similar size as those obtained through two-sample planning methodology such as the Schoenfeld formula [9].

Traditionally, one-sample methods are popular because they require a small sample size: by using a one-sample log-rank test, not only is the sample size of a control arm saved, but the reference curve is also considered deterministic, thereby diminishing the variability of the test statistic. This again substantially reduces the required sample size. However, if the reference curve itself is estimated from data, the assumption of a deterministic reference curve does not hold. Then the latter reduction of sample size comes at the price of an inflated type I error rate if not addressed by adequate statistical considerations [2]. To obtain a valid test procedure, the variability has to be accounted for, as our newly proposed survival test does.

In summary, we advise using our new test methodology if the reference curve is obtained through an estimation process and access to the full historic data is restricted. It is important to note that the usual use case of the two-sample log-rank test is a randomised controlled trial, which fundamentally differs from the case of restricted reference data, which is by design unrandomised and leaves no room for confounder adjustment techniques, since individual patient level data are not available. Therefore it is crucial to exercise extreme caution when using a historical control from a summarized Kaplan-Meier survival curve. Application of the new test may reveal significant differences between the compared survival curves, but it may be unclear whether these differences stem from treatment or confounding. For this reason, the methodology is suitable for phase II trials but not for phase III trials as a basis for drug approval or health technology assessment decision making. In addition, when choosing the reference curve, it must be ensured that the recruitment and censoring mechanisms of the historical cohort equal those of the new cohort in order to guarantee exact type I error rate control. Therefore, we advise using the classical two-sample log-rank test in case of individual patient level survival data availability, since it does not rely on the assumption of equal recruitment and censoring mechanisms. Nevertheless, the new test has a performance similar to the two-sample log-rank test, which is known to be optimal under proportional hazards. Furthermore, it should be noted that the new test methodology can be used with existing standard software, as the approach is equivalent to an adjustment of the nominal α-level in the standard one-sample log-rank test.

Supporting information

S1 Appendix. Mathematical details.

Mathematical statements and corresponding proofs.

https://doi.org/10.1371/journal.pone.0305434.s001

(PDF)

S1 Text. Sample size Pi.

R code. Supplementary R code for sample size calculation.

https://doi.org/10.1371/journal.pone.0305434.s002

(TXT)

S2 Text. Simulation function.

R code. Supplementary R code for simulating design scenarios.

https://doi.org/10.1371/journal.pone.0305434.s003

(TXT)

S3 Text. Cpp code.

R code. Supplementary R code for simulating design scenarios.

https://doi.org/10.1371/journal.pone.0305434.s004

(TXT)

S4 Text. Simulation different censoring.

R code. Supplementary R code for simulations for tables with differing censoring mechanisms.

https://doi.org/10.1371/journal.pone.0305434.s005

(TXT)

S5 Text. Simulation different followUp.

R code. Supplementary R code for simulations for tables with differing follow-up lengths.

https://doi.org/10.1371/journal.pone.0305434.s006

(TXT)

Acknowledgments

We would like to thank Prof. Dr. med. Holger Reinecke for revising our real world example. We would like to thank our cooperation partners at the AOK Research Institute (WIdO) for granting us permission to use their data within our real world example. This permission was granted within the framework of the GenderVasc project (Gender-specific real care situation of patients with arteriosclerotic cardiovascular diseases). We acknowledge support from the Open Access Publication Fund of the University of Muenster.

References

  1. Breslow NE. Analysis of Survival Data under the Proportional Hazards Model. Int Stat Rev. 1975; 43:45–58.
  2. Danzer MF, Feld J, Faldum A, Schmidt R. Reference curve sampling variability in one-sample log-rank tests. PLOS ONE. 2022. pmid:35862473
  3. Aalen OO, Borgan O, Gjessing HK. Survival and Event History Analysis. New York: Springer Science + Business Media; 2008.
  4. Kaplan EL, Meier P. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association. 1958; 53:457–481.
  5. Guyot P, Ades AE, Ouwens M, Welton NJ. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Med Res Methodol. 2012; 12:9. pmid:22297116
  6. Schmidt R, Kwiecien R, Faldum A, Berthold F, Hero B, Ligges S. Sample size calculation for the one-sample log-rank test. Stat Med. 2015; 34(6):1031–1040. pmid:25500942
  7. Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep. 1966; 50(3):163–170. pmid:5910392
  8. Peto R, Peto J. Asymptotically Efficient Rank Invariant Test Procedures. J R Stat Soc Ser A. 1972; 135(2):185–207.
  9. Schoenfeld D. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika. 1981; 68:316–319.